Benchmark Results
How each prediction strategy performs on the 655-case biotech catalyst benchmark
Scoring Methodology
Market-Cap-Adjusted Scoring
Each catalyst case is scored on the stock's price movement, weighted by a market-capitalization multiplier to account for differences in liquidity and volatility: adjusted score = percent_change × market_cap_multiplier / 5.
This ensures that a 20% move in a mega-cap biotech ($100B) is weighted more heavily than the same percentage move in a micro-cap ($50M), reflecting the greater difficulty and significance of moving large-cap stocks.
Example Calculations:
Micro-cap: heavily discounted due to high volatility
Mid-cap: baseline weighting
Mega-cap: even small moves matter
Categorical Evaluation Metrics
Numeric Evaluation Metrics
Evaluation Framework
Dataset Overview
How Predictions Are Compared
- 1. Strategy makes prediction: Each strategy analyzes the press release and clinical trial data to predict stock impact
- 2. Actual outcome measured: Stock price change is calculated from market close before announcement to close on day of announcement
- 3. Market-cap adjustment: Price change is weighted by market cap multiplier to normalize for volatility
- 4. Metrics calculated: Exact match, close match, direction accuracy, MAE, and correlation metrics are computed
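The metrics in the last step can be sketched as follows. This is a minimal illustration, not the benchmark's own code. It assumes the seven categories are encoded as integers from -3 to 3, that "close match" means within one category, and that "direction accuracy" means the predicted and actual categories share the same sign. None of these definitions is stated on this page, and all function names are hypothetical.

```python
from math import sqrt


def _sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def categorical_metrics(predicted, actual):
    """Exact match, close match, and direction accuracy as fractions.

    Assumption: categories are integers in -3..3, and "close" means
    the prediction is at most one category away from the outcome.
    """
    pairs = list(zip(predicted, actual))
    n = len(pairs)
    return {
        "exact_match": sum(p == a for p, a in pairs) / n,
        "close_match": sum(abs(p - a) <= 1 for p, a in pairs) / n,
        "direction_accuracy": sum(_sign(p) == _sign(a) for p, a in pairs) / n,
    }


def mean_absolute_error(predicted, actual):
    """Average absolute gap between predicted and actual adjusted scores."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Made-up toy values, purely to show the call pattern.
    print(categorical_metrics([3, 1, -1, 0], [3, 2, 1, 0]))
    print(mean_absolute_error([1.0, 2.0, 3.0], [1.5, 2.0, 2.5]))
    print(pearson([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Categorical strategies would be scored with the first function, numeric strategies with the latter two against the adjusted score.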
7-Category Impact Scale
The benchmark uses a standardized 7-category scale to classify catalyst outcomes. Categories are derived from the adjusted score, which weights raw percentage price changes by a market-cap multiplier to normalize for volatility across micro-cap and large-cap stocks.
Note: Categories are mapped from the adjusted score (not raw price change). The adjusted score = percent_change × market_cap_multiplier / 5, where the multiplier ranges from 0.25 (micro-cap) to 1.5 (mega-cap). Predictions from categorical strategies are compared to these bins, while numeric predictions are evaluated using MAE and correlation against the adjusted score.
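As a rough sketch of the formula in the note above, the adjusted score can be computed like this. The function names are hypothetical, and the multiplier is passed in directly because the market-cap tiers that determine it are not listed here; only the 0.25 to 1.5 range comes from the text.

```python
def percent_change(prev_close, announcement_close):
    """Percent move from the close before the announcement to the
    close on the announcement day."""
    return (announcement_close - prev_close) / prev_close * 100.0


def adjusted_score(pct_change, market_cap_multiplier):
    """adjusted score = percent_change * market_cap_multiplier / 5.

    The multiplier is expected to lie between 0.25 (micro-cap) and
    1.5 (mega-cap); how a given market cap maps to a multiplier is
    not specified here, so the caller supplies it.
    """
    if not 0.25 <= market_cap_multiplier <= 1.5:
        raise ValueError("multiplier must be between 0.25 and 1.5")
    return pct_change * market_cap_multiplier / 5.0


if __name__ == "__main__":
    move = percent_change(10.0, 12.0)   # a 20% move
    print(adjusted_score(move, 0.25))   # micro-cap end of the range
    print(adjusted_score(move, 1.5))    # mega-cap end of the range
```

With these inputs, the same 20% move yields an adjusted score of 1.0 at the micro-cap multiplier and 6.0 at the mega-cap multiplier.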
De-identification & Data Cleaning
Before AI models score cases, each press release is de-identified to prevent memorization of real stock outcomes. Here are the benchmark results validating that process.
Three-Stage De-identification Pipeline
Regex pre-processing
Strips wire service attributions, location datelines, URLs, email addresses, stock exchange references, NCT trial IDs. Pre-redacts known trial names, institutions, cities, and conferences with placeholders.
LLM redaction (GPT-5)
Identifies and replaces identifying information with standardized placeholders. Generalizes unique identifiers (first-in-class mechanisms, unique trial names) to broader categories.
Regex post-processing
Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.
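A regex pass of the kind described in the first and third stages might look like the sketch below. The patterns and placeholder strings are illustrative assumptions, not the pipeline's actual rules or tokens, and they cover only URLs, email addresses, exchange references, and NCT trial IDs.

```python
import re

# Hypothetical patterns and placeholder tokens, for illustration only.
# Order matters: URLs are replaced before the other patterns run.
PATTERNS = [
    (re.compile(r"https?://\S+"), "[URL]"),
    (re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+"), "[EMAIL]"),
    (re.compile(r"\((?:NASDAQ|NYSE)\s*:\s*[A-Z.]{1,6}\)"), "[TICKER]"),
    (re.compile(r"\bNCT\d{8}\b"), "[TRIAL_ID]"),
]


def regex_redact(text):
    """Replace each matched identifier with its placeholder token."""
    for pattern, placeholder in PATTERNS:
        text = pattern.sub(placeholder, text)
    return text


if __name__ == "__main__":
    sample = (
        "Acme Bio (NASDAQ: ACME) reported data from NCT01234567. "
        "Contact ir@acme.example or see https://acme.example/news"
    )
    print(regex_redact(sample))
```

Note that the company name in the sample survives this pass; in the pipeline described above, names like that are handled by the pre-redaction lists and the LLM stage rather than by generic patterns.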
Placeholder tokens
What Gets Preserved vs. Removed
What Gets Preserved
- Efficacy data (response rates, survival, p-values)
- Safety signals and adverse event profiles
- Trial design (endpoints, enrollment, arms)
- Mechanism of action (generalized)
- Regulatory context and decision rationale
What Gets Removed
- Company and drug names
- Ticker symbols and financial details
- Named investigators and executives
- Specific trial names (e.g., KEYNOTE-XXX)
- Institution and conference names
- Dates, locations, and URLs
Validation Results
Key finding: GPT-5 re-identified 53% of tickers from de-identified text (169/304 cases), but could not recall the actual price impacts: only 4.8% of its estimates fell within 2% of the actual move, with a Pearson correlation of 0.465. The pipeline aims to limit re-identification while preserving enough scientific content for models to reason about the catalyst.
Ready to Run Benchmarks?
Execute strategies against the 655-case benchmark to see which approach performs best at predicting biotech catalyst outcomes.