Open Benchmark · 655 Validated Cases · 212 Companies
Can AI predict biotech stock moves?
An open benchmark for evaluating how well LLMs reason about clinical trial results, FDA decisions, and biotech catalysts, then predict the stock price reaction.
Why biotech
The hardest domain for stock prediction
Biotech is unusually event-driven. FDA decisions, clinical trial readouts, safety updates, or even changes in trial design can move a stock 20–80% in a single day. Our dataset includes moves ranging from −78% to +70%.
Interpreting these catalysts requires biology expertise and clinical context. A press release that sounds “positive” can still lead to a selloff if the results don't meet the bar the market has set.
This makes biotech a uniquely challenging testbed for AI reasoning. Models must go beyond sentiment analysis and actually understand the science.
The setup
Real catalysts, de-identified
Each case gives the model the actual catalyst (the press release announcing trial results, an FDA decision, or a safety update) along with the trial design, prior data, and related literature. The model must interpret the news and predict the stock price reaction.
To prevent models from simply recalling memorized outcomes, press releases are de-identified. Company names, drugs, tickers, dates, and executives are replaced with placeholders, so the model has to reason from the science, not its training data.
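The three-stage shape of this de-identification step (regex pre-processing, LLM redaction, regex post-processing) can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual pipeline: the regex patterns, placeholder tokens, and the `stub_llm_redact` stand-in are all assumptions, and the company and drug names in the example are made up.

```python
import re
from typing import Callable

# Hypothetical patterns; the benchmark's real rules are not described here.
TICKER_RE = re.compile(r"\((?:NASDAQ|NYSE):\s*[A-Z]{1,5}\)")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
LEFTOVER_TICKER_RE = re.compile(r"\b(?:NASDAQ|NYSE):\s*[A-Z]{1,5}\b")


def regex_preprocess(text: str) -> str:
    """Stage 1: replace easily matched identifiers (tickers, dates)."""
    text = TICKER_RE.sub("([TICKER])", text)
    return DATE_RE.sub("[DATE]", text)


def stub_llm_redact(text: str, names: dict[str, str]) -> str:
    """Stage 2 stand-in: a real pipeline would call an LLM here.

    This stub only swaps known names for placeholders so the sketch runs offline.
    """
    for name, placeholder in names.items():
        text = text.replace(name, placeholder)
    return text


def regex_postprocess(text: str) -> str:
    """Stage 3: catch ticker mentions that survived the earlier stages."""
    return LEFTOVER_TICKER_RE.sub("[TICKER]", text)


def deidentify(text: str, redact: Callable[[str], str]) -> str:
    """Run the three stages in order."""
    return regex_postprocess(redact(regex_preprocess(text)))


# Made-up example input, not a real press release.
raw = (
    "Acme Bio (NASDAQ: ACME) announced on March 3, 2024 "
    "that ACM-101 met its primary endpoint."
)
names = {"Acme Bio": "[COMPANY]", "ACM-101": "[DRUG]"}
clean = deidentify(raw, lambda t: stub_llm_redact(t, names))
print(clean)
```

The redaction stage is passed in as a function so that the regex stages can be tested on their own and the LLM call can be swapped out.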
De-identified press release
Actual announcements from biotech companies. A three-stage pipeline (regex pre-processing, LLM redaction, regex post-processing) strips identifying information while preserving scientific content.
Linked clinical trial data
Structured data from ClinicalTrials.gov: endpoints, trial design, enrollment, arms, interventions, eligibility criteria, and site locations.
PubMed articles
Related research papers providing scientific context for the therapeutic area, mechanism of action, and prior results.
Market-cap adjusted scoring
Price impact is weighted by market cap. A 20% move in a $100B company is scored higher than the same move in a $50M micro-cap, reflecting true market significance.
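One way to picture the inputs described above is as a single record per case. The sketch below is an assumption about how such a record might be laid out. The class names, field names, and prompt layout are hypothetical and are not the benchmark's published schema.

```python
from dataclasses import dataclass, field


@dataclass
class TrialRecord:
    """Structured fields of the kind listed for ClinicalTrials.gov data."""
    endpoints: list[str]
    design: str
    enrollment: int
    arms: list[str]
    interventions: list[str]
    eligibility_criteria: str
    site_locations: list[str] = field(default_factory=list)


@dataclass
class CatalystCase:
    """One benchmark case (hypothetical layout)."""
    press_release: str           # de-identified text
    trial: TrialRecord
    pubmed_abstracts: list[str]
    market_cap_usd: float        # assumed to be used for scoring, not shown to the model

    def to_prompt(self) -> str:
        """Assemble the model-facing context from the case fields."""
        parts = [
            "PRESS RELEASE:\n" + self.press_release,
            "TRIAL DESIGN: " + self.trial.design,
            "ENDPOINTS: " + "; ".join(self.trial.endpoints),
            f"ENROLLMENT: {self.trial.enrollment}",
        ]
        parts += [
            f"LITERATURE {i}: {abstract}"
            for i, abstract in enumerate(self.pubmed_abstracts, start=1)
        ]
        return "\n\n".join(parts)


# Made-up example values for illustration only.
case = CatalystCase(
    press_release="[COMPANY] announced topline results for [DRUG].",
    trial=TrialRecord(
        endpoints=["Overall survival", "Progression-free survival"],
        design="Randomized, double-blind, placebo-controlled",
        enrollment=420,
        arms=["[DRUG]", "Placebo"],
        interventions=["[DRUG] 200 mg daily"],
        eligibility_criteria="Adults with previously treated disease",
    ),
    pubmed_abstracts=["Prior Phase 2 data for the same mechanism."],
    market_cap_usd=1.5e9,
)
prompt = case.to_prompt()
```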
The dataset
655 validated biotech catalysts
The benchmark spans Phase 1–3 readouts, FDA approvals and rejections, and topline results across 212 companies of all sizes, from micro-cap to mega-cap.
By catalyst type
By therapeutic area
We focused heavily on oncology because we found better generalization within one broad disease area than across a mix of unrelated indications. The dataset also spans companies of different sizes, since large-cap biotech tends to exhibit much lower volatility than small and mid-cap names.
By stock price impact
Market-cap adjusted scoring
Impact categories are adjusted for market capitalization. A 3% move in a mega-cap stock (>$200B) can be classified as “positive” while the same move in a small-cap would be “neutral.” This reflects the reality that large-cap biotech stocks move less on individual catalysts, but each percentage point represents billions in market value.
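A minimal sketch of cap-dependent impact categories is shown below. Only the mega-cap cutoff (>$200B) and the 3% example come from the text above. The band widths, the mid-tier cutoff, and the function names are hypothetical values chosen for illustration, not the benchmark's actual thresholds.

```python
def neutral_band_pct(market_cap_usd: float) -> float:
    """Half-width of the 'neutral' band, in percent (hypothetical values)."""
    if market_cap_usd > 200e9:   # mega-cap cutoff taken from the text
        return 2.0               # assumed band width
    if market_cap_usd > 10e9:    # assumed large-cap cutoff
        return 5.0               # assumed band width
    return 10.0                  # assumed band width for smaller companies


def classify_move(pct_move: float, market_cap_usd: float) -> str:
    """Map a one-day percentage move to an impact category."""
    band = neutral_band_pct(market_cap_usd)
    if pct_move >= band:
        return "positive"
    if pct_move <= -band:
        return "negative"
    return "neutral"


# The same 3% move lands in different categories depending on company size.
mega_cap_label = classify_move(3.0, 250e9)
small_cap_label = classify_move(3.0, 500e6)
```

Under these assumed bands, `mega_cap_label` is "positive" and `small_cap_label` is "neutral", which matches the example in the text.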
De-identified press releases
A three-stage cleaning pipeline (regex pre-processing, GPT-5 redaction, regex post-processing) reduces the risk that a model recalls real stock outcomes. GPT-5 still re-identified 53% of tickers but could not recall the corresponding price impacts.
655 linked clinical trials
Every case has a matched ClinicalTrials.gov entry with structured endpoints, trial design, patient populations, arms, interventions, and eligibility criteria.
Magnitude matters
Scoring measures both directional accuracy and magnitude error, not just up/down. It penalizes overconfident but wrong predictions and adjusts for market cap.
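The three ingredients named here (direction, magnitude, and a penalty for confident misses) could be computed along the following lines. This is a sketch under assumptions: the exact formulas, the `Prediction` type, and the penalty definition are hypothetical, and the market-cap adjustment is left out for brevity.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    predicted_pct: float  # model's predicted one-day move, in percent
    actual_pct: float     # realized one-day move, in percent


def sign(x: float) -> int:
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def directional_accuracy(preds: list[Prediction]) -> float:
    """Fraction of cases where predicted and actual moves share a sign."""
    hits = sum(sign(p.predicted_pct) == sign(p.actual_pct) for p in preds)
    return hits / len(preds)


def mean_abs_error(preds: list[Prediction]) -> float:
    """Average absolute gap between predicted and actual moves."""
    return sum(abs(p.predicted_pct - p.actual_pct) for p in preds) / len(preds)


def overconfidence_penalty(preds: list[Prediction]) -> float:
    """Assumed penalty: predicted magnitude on wrong-direction calls,
    averaged over all cases, so large confident misses cost the most."""
    wrong = [
        abs(p.predicted_pct)
        for p in preds
        if sign(p.predicted_pct) != sign(p.actual_pct)
    ]
    return sum(wrong) / len(preds)


# Made-up predictions for illustration only.
preds = [
    Prediction(predicted_pct=10.0, actual_pct=20.0),
    Prediction(predicted_pct=-5.0, actual_pct=-30.0),
    Prediction(predicted_pct=40.0, actual_pct=-10.0),
]
accuracy = directional_accuracy(preds)
mae = mean_abs_error(preds)
penalty = overconfidence_penalty(preds)
```

In this toy set the third prediction is both wrong in direction and large in magnitude, so it dominates the magnitude error and is the only contributor to the penalty.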
Leaderboard
Top performers
| # | Strategy | Model | Type | Close Match | Direction | Cases |
|---|---|---|---|---|---|---|
| 1 | Chain-of-Thought Categorical | GPT-5 | Chain-of-Thought | 70.9% | 38.9% | 285 |
| 2 | Direct 0-10 Scores | Claude Sonnet 4 | Direct 0-10 Scores | 65.5% | 39.3% | 290 |
| 3 | Direct Categorical | GPT-5 | Direct Categorical | 64.7% | 36.0% | 303 |
| 4 | Chain-of-Thought Categorical | Claude Sonnet 4 | Chain-of-Thought | 51.6% | 31.9% | 304 |
| 5 | Direct Categorical | Claude Sonnet 4 | Direct Categorical | 33.4% | 30.3% | 287 |
Test your own strategy
Submit your approach to be benchmarked against 655 validated catalyst cases. Compare head-to-head with Claude Sonnet 4, GPT-5, and multi-step agent pipelines.