Open Benchmark · 511 Validated Cases · 182 Companies

Can AI predict biotech stock moves?

An open benchmark for evaluating how well LLMs reason about clinical trial results, FDA decisions, and biotech catalysts, then predict the stock price reaction.

Why biotech

The hardest domain for stock prediction

Biotech is unusually event-driven. FDA decisions, clinical trial readouts, safety updates, or even changes in trial design can move a stock 20–80% in a single day. Our dataset includes moves ranging from −78% to +70%.

Interpreting these catalysts requires biology expertise and clinical context. A press release that sounds “positive” can still lead to a selloff if the results don't meet the bar the market has set.

This makes biotech a uniquely challenging testbed for AI reasoning. Models must go beyond sentiment analysis and actually understand the science.

Why “positive” results can mean a selloff

The effect size is weaker than expected
Results apply only to a narrow subgroup
Safety signals appear in the data
Endpoints don't meaningfully de-risk later phases
The readout doesn't materially change approval odds

The setup

Real catalysts, de-identified

Each case gives the model the actual catalyst (the press release announcing trial results, an FDA decision, or a safety update) along with the trial design, prior data, and related literature. The model must interpret the news and predict the stock price reaction.

To prevent models from simply recalling memorized outcomes, press releases are de-identified. Company names, drugs, tickers, dates, and executives are replaced with placeholders, so the model has to reason from the science, not its training data.
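As a rough illustration of how a three-stage de-identification pass (regex pre-processing, LLM redaction, regex post-processing) might be wired together, here is a minimal Python sketch. The regex patterns, placeholder tokens, function names, and the `fake_llm_redact` stand-in are all assumptions made for this example; the benchmark's actual rules and prompts are not shown on this page.

```python
import re
from typing import Callable, Dict

# Hypothetical patterns: identifiers with a predictable surface form.
TICKER_RE = re.compile(r"\((?:NASDAQ|NYSE)\s*:\s*[A-Z]{1,5}\)")
DATE_RE = re.compile(
    r"\b(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)[a-z]*\.?"
    r"\s+\d{1,2},\s+\d{4}\b"
)
URL_RE = re.compile(r"https?://\S+")


def regex_pre(text: str) -> str:
    """Stage 1: replace tickers, dates, and URLs before the LLM sees the text."""
    text = TICKER_RE.sub("([TICKER])", text)
    text = DATE_RE.sub("[DATE]", text)
    return URL_RE.sub("[URL]", text)


def regex_post(text: str, leftovers: Dict[str, str]) -> str:
    """Stage 3: catch known names that survived the LLM pass."""
    for name, placeholder in leftovers.items():
        text = re.sub(re.escape(name), placeholder, text, flags=re.IGNORECASE)
    return text


def deidentify(
    text: str,
    llm_redact: Callable[[str], str],
    leftovers: Dict[str, str],
) -> str:
    """Run the three stages in order: regex, LLM, regex."""
    return regex_post(llm_redact(regex_pre(text)), leftovers)


def fake_llm_redact(text: str) -> str:
    """Stand-in for stage 2. A real pass would call a model to redact names."""
    return text.replace("Acme Bio", "[COMPANY]")


raw = (
    "Acme Bio (NASDAQ: ACME) announced on March 3, 2024 "
    "that ACM-101 met its primary endpoint."
)
clean = deidentify(
    raw,
    fake_llm_redact,
    {"Acme Bio": "[COMPANY]", "ACM-101": "[DRUG]"},
)
print(clean)
```

The scientific content ("met its primary endpoint") passes through untouched, while the company, ticker, date, and drug name are replaced. Running the regex stage again after the LLM is a cheap safety net for names the model misses.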

How scoring works
01

De-identified press release

Actual announcements from biotech companies. A three-stage pipeline (regex pre-processing, LLM redaction, regex post-processing) strips identifying information while preserving scientific content.

02

Linked clinical trial data

Structured data from ClinicalTrials.gov: endpoints, trial design, enrollment, arms, interventions, eligibility criteria, and site locations.

03

PubMed articles

Related research papers providing scientific context for the therapeutic area, mechanism of action, and prior results.

04

Market-cap adjusted scoring

Price impact is weighted by market cap. A 20% move in a $100B company is scored higher than the same move in a $50M micro-cap, reflecting true market significance.
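One way to picture a single benchmark case is as a record bundling the three inputs with a held-out label. The sketch below is hypothetical: the class names, field names, and sample values are invented for illustration and do not describe the benchmark's real schema.

```python
from dataclasses import asdict, dataclass
from typing import Tuple


@dataclass(frozen=True)
class TrialRecord:
    """Subset of the structured ClinicalTrials.gov fields (hypothetical names)."""
    phase: str
    primary_endpoints: Tuple[str, ...]
    enrollment: int
    arms: Tuple[str, ...]


@dataclass(frozen=True)
class BenchmarkCase:
    """One catalyst: three model inputs plus the label used for scoring."""
    case_id: str
    press_release: str                  # de-identified text
    trial: TrialRecord                  # linked clinical trial data
    pubmed_abstracts: Tuple[str, ...]   # related literature
    impact_label: str                   # market-cap adjusted category, held out

    def model_inputs(self) -> dict:
        """Return only what the model is allowed to see (no label)."""
        return {
            "press_release": self.press_release,
            "trial": asdict(self.trial),
            "pubmed_abstracts": list(self.pubmed_abstracts),
        }


# Invented sample values, for illustration only.
case = BenchmarkCase(
    case_id="case-0001",
    press_release="[COMPANY] announced that [DRUG] met its primary endpoint.",
    trial=TrialRecord(
        phase="Phase 3",
        primary_endpoints=("Overall survival",),
        enrollment=400,
        arms=("[DRUG]", "Placebo"),
    ),
    pubmed_abstracts=("Background on the mechanism of action.",),
    impact_label="Neutral",
)
inputs = case.model_inputs()
```

Keeping the label out of `model_inputs()` makes it harder to leak the answer into a prompt by accident.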

The dataset

511 validated biotech catalysts

The benchmark spans Phase 1–3 readouts, FDA approvals and rejections, and topline results across 182 companies of all sizes, from micro-cap to mega-cap.

By catalyst type

FDA Approvals: 198
Phase 3 Data: 120
Phase 2 Data: 89
Topline Results: 84
FDA Rejections: 20

By therapeutic area

Oncology: 289 (57% of dataset)
Other areas (cardio, neuro, rare disease, etc.): 222

We focused heavily on oncology because we found that generalization was better within one broad disease area than across a mix of unrelated indications. The dataset also spans companies of different sizes, since large-cap biotech tends to be much less volatile than small- and mid-cap names.

By stock price impact

Market-cap adjusted
Very Positive: 36 (7.0%)
Positive: 61 (11.9%)
Neutral: 330 (64.6%)
Negative: 55 (10.8%)
Very Negative: 29 (5.7%)

Adjusted scoring

Impact categories are adjusted for market capitalization. A 3% move in a mega-cap stock (>$200B) can be classified as “positive” while the same move in a small-cap would be “neutral.” This reflects the reality that large-cap biotech stocks move less on individual catalysts, but each percentage point represents billions in market value.
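To make the idea concrete, here is a small Python sketch of market-cap adjusted categorization. Only the ">$200B" mega-cap boundary comes from the text above. Every percentage threshold, the $10B large-cap boundary, and the function names are assumptions chosen so that the 3% example works out, not the benchmark's actual cutoffs.

```python
from typing import Tuple


def thresholds_for(market_cap_usd: float) -> Tuple[float, float]:
    """Return (positive, very_positive) cutoffs in percent for a market cap.

    All cutoff values are hypothetical. Larger companies get smaller cutoffs
    because the same percentage move represents more market value.
    """
    if market_cap_usd > 200e9:      # mega-cap (boundary taken from the text)
        return 2.0, 5.0
    if market_cap_usd > 10e9:       # large-cap (assumed boundary)
        return 5.0, 12.0
    return 10.0, 25.0               # everything smaller


def classify(move_pct: float, market_cap_usd: float) -> str:
    """Map a one-day percentage move to one of five impact categories."""
    positive, very = thresholds_for(market_cap_usd)
    size = abs(move_pct)
    if size < positive:
        return "Neutral"
    strength = "Very " if size >= very else ""
    direction = "Positive" if move_pct > 0 else "Negative"
    return strength + direction


mega_cap_label = classify(3.0, 250e9)    # 3% move, $250B company
small_cap_label = classify(3.0, 500e6)   # same move, $500M company
```

Under these assumed cutoffs, the same 3% move is labeled "Positive" for the mega-cap and "Neutral" for the small-cap, which matches the behavior described above.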

De-identified press releases

A three-stage cleaning pipeline (regex pre-processing, GPT-5 redaction, regex post-processing) reduces the risk that a model re-identifies a case and recalls the real stock outcome. GPT-5 still re-identified 53% of tickers but could not recall the price impacts.

511 linked clinical trials

Every case has a matched ClinicalTrials.gov entry with structured endpoints, trial design, patient populations, arms, interventions, and eligibility criteria.

Magnitude matters

Scoring measures both directional accuracy and magnitude error, not just up/down. It penalizes overconfident but wrong predictions and adjusts for market cap.
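This page does not spell out the metric definitions, so the following Python sketch shows only one plausible reading. It assumes the five impact categories are ordered from −2 to +2, that "direction" means the predicted and actual signs agree, and that a "close match" is a prediction within one category of the actual outcome. The function and variable names are illustrative.

```python
from typing import Dict, List

CATEGORIES = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]
# Ordinal rank from -2 (Very Negative) to +2 (Very Positive).
RANK = {name: i - 2 for i, name in enumerate(CATEGORIES)}


def sign(x: int) -> int:
    """Return -1, 0, or 1."""
    return (x > 0) - (x < 0)


def score(preds: List[str], actuals: List[str]) -> Dict[str, float]:
    """Compute three assumed metrics over paired category labels."""
    if not actuals or len(preds) != len(actuals):
        raise ValueError("preds and actuals must be non-empty and equal length")
    pairs = [(RANK[p], RANK[a]) for p, a in zip(preds, actuals)]
    n = len(pairs)
    return {
        # Share of cases where the predicted sign equals the actual sign.
        "direction": sum(sign(p) == sign(a) for p, a in pairs) / n,
        # Share of cases where the prediction is at most one category off.
        "close_match": sum(abs(p - a) <= 1 for p, a in pairs) / n,
        # Mean category distance: a confident miss costs more than a mild one.
        "mean_abs_error": sum(abs(p - a) for p, a in pairs) / n,
    }


preds = ["Positive", "Neutral", "Very Negative", "Positive"]
actuals = ["Very Positive", "Neutral", "Positive", "Negative"]
result = score(preds, actuals)
```

In this toy example, predicting "Very Negative" for a "Positive" outcome contributes a distance of 3, while a mild miss contributes 1, which is how a magnitude-aware metric penalizes overconfident wrong calls.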

Leaderboard

Top performers

#  Strategy                      Model            Type                Close Match  Direction  Cases
1  Chain-of-Thought Categorical  GPT-5            Chain-of-Thought    70.9%        38.9%      285
2  Direct 0-10 Scores            Claude Sonnet 4  Direct 0-10 Scores  65.5%        39.3%      290
3  Direct Categorical            GPT-5            Direct Categorical  64.7%        36.0%      303
4  Chain-of-Thought Categorical  Claude Sonnet 4  Chain-of-Thought    51.6%        31.9%      304
5  Direct Categorical            Claude Sonnet 4  Direct Categorical  33.4%        30.3%      287

Test your own strategy

Submit your approach to be benchmarked against 511 validated catalyst cases. Compare head-to-head with Claude Sonnet 4, GPT-5, and multi-step agent pipelines.