Benchmark Results
How each prediction strategy performs on the 317-case biotech catalyst benchmark
Benchmark Runs In Progress
The results below show the evaluation framework only; metrics are populated once the benchmarks are run. All strategies will be tested against 317 validated biotech catalyst cases with known stock price outcomes.
Scoring Methodology
Market-Cap-Adjusted Scoring
Each catalyst case is scored on the stock's price movement, weighted by a market-cap multiplier to account for liquidity and volatility differences: adjusted score = percent_change × market_cap_multiplier / 5.
This ensures that a 20% move in a mega-cap biotech ($100B) is weighted more heavily than the same percentage move in a micro-cap ($50M), reflecting the greater difficulty and significance of moving large-cap stocks.
Example Calculations:
Micro-cap: heavily discounted due to high volatility
Mid-cap: baseline weighting
Mega-cap: even small moves matter
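As a rough illustration, the adjusted-score formula given in the note under the 7-category scale (percent_change × market_cap_multiplier / 5) can be sketched as follows. Only the two endpoint multipliers, 0.25 for micro-cap and 1.5 for mega-cap, come from this page; the function and constant names are illustrative, and the multipliers for intermediate tiers are not listed here, so the caller supplies the multiplier directly.

```python
# Endpoint multipliers stated in the methodology note; names are illustrative.
MICRO_CAP_MULTIPLIER = 0.25
MEGA_CAP_MULTIPLIER = 1.5


def adjusted_score(percent_change: float, market_cap_multiplier: float) -> float:
    """Return percent_change * market_cap_multiplier / 5."""
    if not MICRO_CAP_MULTIPLIER <= market_cap_multiplier <= MEGA_CAP_MULTIPLIER:
        raise ValueError("multiplier outside the documented 0.25 to 1.5 range")
    return percent_change * market_cap_multiplier / 5


# The same 20% move scored at both ends of the multiplier range.
micro = adjusted_score(20.0, MICRO_CAP_MULTIPLIER)  # 20 * 0.25 / 5 = 1.0
mega = adjusted_score(20.0, MEGA_CAP_MULTIPLIER)    # 20 * 1.5 / 5 = 6.0
print(micro, mega)
```

Under these two multipliers, the mega-cap move scores six times higher than the identical micro-cap move.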
Categorical Evaluation Metrics
Numeric Evaluation Metrics
Strategy Leaderboard
Strategies evaluated on 317 validated catalyst cases
| Rank | Strategy | Type | Steps | Exact Match | Close Match | Direction | MAE | Correlation | Status |
|---|---|---|---|---|---|---|---|---|---|
| 1 | Direct Categorical | Direct Categorical | 1 | -- | -- | -- | -- | -- | Pending |
| 2 | Chain-of-Thought Categorical | Chain-of-Thought | 1 | -- | -- | -- | -- | -- | Pending |
| 3 | Multi-Step Agent (Categorical) | Agent (Categorical) | 5 | -- | -- | -- | -- | -- | Pending |
| 4 | Direct 0-10 Scores | Direct 0-10 Scores | 1 | -- | -- | -- | -- | -- | Pending |
| 5 | Multi-Step Agent (0-10 Scores) | Agent (0-10 Scores) | 5 | -- | -- | -- | -- | -- | Pending |
| 6 | Linear Regression (LLM Features) | Linear Regression | 1 | -- | -- | -- | -- | -- | Pending |
| 7 | Linear Regression + Search | Linear Regression + Search | 2 | -- | -- | -- | -- | -- | Pending |
Evaluation Framework
Dataset Overview
How Predictions Are Compared
1. Strategy makes prediction: Each strategy analyzes the press release and clinical trial data to predict stock impact
2. Actual outcome measured: Stock price change is calculated from market close before announcement to close on day of announcement
3. Market-cap adjustment: Price change is weighted by market cap multiplier to normalize for volatility
4. Metrics calculated: Exact match, close match, direction accuracy, MAE, and correlation metrics are computed
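The metrics named in step 4 could be computed along these lines. This is a minimal sketch, not the benchmark's own implementation. It assumes that categories are encoded as integers from -3 to +3, that a "close match" means a prediction within one category of the actual one, and that direction is the sign of the value. None of these definitions is spelled out on this page, and all function names are hypothetical.

```python
import math


def _sign(x: float) -> int:
    return (x > 0) - (x < 0)


def exact_match(predicted, actual) -> float:
    """Share of cases where the predicted category equals the actual one."""
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


def close_match(predicted, actual, tolerance: int = 1) -> float:
    """Share of cases within `tolerance` categories (assumed definition)."""
    return sum(abs(p - a) <= tolerance for p, a in zip(predicted, actual)) / len(actual)


def direction_accuracy(predicted, actual) -> float:
    """Share of cases where prediction and outcome have the same sign."""
    return sum(_sign(p) == _sign(a) for p, a in zip(predicted, actual)) / len(actual)


def mean_absolute_error(predicted, actual) -> float:
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pearson(x, y) -> float:
    """Pearson correlation; undefined (division by zero) for constant input."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


predicted = [2, -1, 0, 3]   # made-up category predictions
actual = [2, -2, 1, -3]     # made-up outcomes
print(exact_match(predicted, actual), close_match(predicted, actual))
print(direction_accuracy(predicted, actual), mean_absolute_error(predicted, actual))
```

Exact match, close match, and direction apply to the categorical strategies, while MAE and correlation apply to numeric predictions scored against the adjusted score.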
7-Category Impact Scale
The benchmark uses a standardized 7-category scale to classify catalyst outcomes. Categories are derived from the adjusted score, which weights raw percentage price changes by a market-cap multiplier to normalize for volatility across micro-cap and large-cap stocks.
Note: Categories are mapped from the adjusted score (not raw price change). The adjusted score = percent_change × market_cap_multiplier / 5, where the multiplier ranges from 0.25 (micro-cap) to 1.5 (mega-cap). Predictions from categorical strategies are compared to these bins, while numeric predictions are evaluated using MAE and correlation against the adjusted score.
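Mapping an adjusted score to one of the seven bins might look like the sketch below. The bin edges and the generic labels are placeholders invented for illustration, since the actual thresholds and category names are not listed on this page. Only the idea of binning the adjusted score rather than the raw price change comes from the note above.

```python
from bisect import bisect_right

# HYPOTHETICAL edges: six cut points produce seven bins. Replace them with
# the benchmark's real thresholds, which this page does not give.
EXAMPLE_EDGES = [-5.0, -2.0, -0.5, 0.5, 2.0, 5.0]
EXAMPLE_LABELS = [f"category_{i}" for i in range(1, 8)]  # generic labels


def categorize(score: float, edges=EXAMPLE_EDGES, labels=EXAMPLE_LABELS) -> str:
    """Return the label of the bin containing `score`.

    A score exactly equal to an edge falls into the higher bin.
    """
    if len(labels) != len(edges) + 1:
        raise ValueError("need exactly one more label than edges")
    return labels[bisect_right(edges, score)]


print(categorize(0.0))   # middle bin under the example edges
print(categorize(6.0))   # top bin under the example edges
```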
De-identification & Data Cleaning
Before AI models score cases, each press release is de-identified to prevent memorization of real stock outcomes. The results below validate that process.
Three-Stage De-identification Pipeline
Regex pre-processing
Strips wire service attributions, location datelines, URLs, email addresses, stock exchange references, and NCT trial IDs. Pre-redacts known trial names, institutions, cities, and conferences with placeholders.
LLM redaction (GPT-5)
Identifies and replaces identifying information with standardized placeholders. Generalizes unique identifiers (first-in-class mechanisms, unique trial names) to broader categories.
Regex post-processing
Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.
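A regex pre-processing pass of the kind described in the first stage could be sketched as follows. The patterns cover only a few of the listed targets (URLs, email addresses, NCT trial IDs, and exchange references). The placeholder tokens such as `[URL]` are hypothetical, because the pipeline's actual tokens are not shown here, and the exchange pattern assumes the common "(NASDAQ: XXXX)" style.

```python
import re

# (pattern, placeholder) pairs applied in order. Placeholder names are
# illustrative, not the pipeline's real tokens.
PATTERNS = [
    # URLs; note that \S+ also swallows punctuation attached to the URL.
    (re.compile(r"https?://\S+|www\.\S+"), "[URL]"),
    (re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"), "[EMAIL]"),
    # ClinicalTrials.gov identifiers: "NCT" followed by eight digits.
    (re.compile(r"\bNCT\d{8}\b"), "[TRIAL_ID]"),
    # Assumed exchange-reference style, e.g. "(NASDAQ: ABCD)".
    (re.compile(r"\((?:NASDAQ|NYSE)\s*:\s*[A-Z]{1,5}\)"), "[TICKER]"),
]


def pre_redact(text: str) -> str:
    """Replace each matched identifier with its placeholder."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


sample = "Results (NASDAQ: ABCD) for NCT01234567; see https://example.com or mail ir@example.com."
print(pre_redact(sample))
```

Running the regex stages before and after the LLM keeps the cheap, deterministic rules separate from the model-based redaction, so each can be checked on its own.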
Placeholder tokens
What Gets Preserved vs. Removed
What Gets Preserved
- Efficacy data (response rates, survival, p-values)
- Safety signals and adverse event profiles
- Trial design (endpoints, enrollment, arms)
- Mechanism of action (generalized)
- Regulatory context and decision rationale
What Gets Removed
- Company and drug names
- Ticker symbols and financial details
- Named investigators and executives
- Specific trial names (e.g., KEYNOTE-XXX)
- Institution and conference names
- Dates, locations, and URLs
Validation Results
Key finding: GPT-5 re-identified the ticker in 53% of de-identified cases (169/317) but could not recall the actual price impacts: only 4.8% of its estimates were within 2% of the actual move, with a Pearson correlation of 0.465. The pipeline aims to limit re-identification while preserving enough scientific content for models to reason about the catalyst.
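The two headline validation figures are simple proportions, sketched below. The re-identification rate uses the counts reported above. The tolerance check assumes that "within 2%" means an absolute difference of at most two percentage points, which is one plausible reading. The function names and the sample values in the second call are made up for illustration.

```python
def reidentification_rate(reidentified: int, total: int) -> float:
    """Share of de-identified cases whose ticker was recovered."""
    return reidentified / total


def within_tolerance_rate(recalled, actual, tolerance: float = 2.0) -> float:
    """Share of recalled price moves within `tolerance` percentage points
    of the actual move (assumed interpretation of "within 2%")."""
    hits = sum(abs(r - a) <= tolerance for r, a in zip(recalled, actual))
    return hits / len(actual)


# Counts reported in the key finding: 169 of 317 cases.
print(round(100 * reidentification_rate(169, 317), 1))

# Illustrative values only, not benchmark data.
print(within_tolerance_rate([1.0, 10.0, -3.0, 0.0], [2.5, 4.0, -3.5, 7.0]))
```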
Ready to Run Benchmarks?
Execute strategies against the 317-case benchmark to see which approach performs best at predicting biotech catalyst outcomes.