Open Benchmark · 655 Validated Cases · 212 Companies

Can AI predict biotech stock moves?

An open benchmark for evaluating how well LLMs reason about clinical trial results, FDA decisions, and other biotech catalysts, and then predict the stock price reaction.

Why biotech

The hardest domain for stock prediction

Biotech is unusually event-driven. FDA decisions, clinical trial readouts, safety updates, or even changes in trial design can move a stock 20–80% in a single day. Our dataset includes moves ranging from −78% to +70%.

Interpreting these catalysts requires biology expertise and clinical context. A press release that sounds “positive” can still lead to a selloff if the results don't meet the bar the market has set.

This makes biotech a uniquely challenging testbed for AI reasoning. Models must go beyond sentiment analysis and actually understand the science.

Why “positive” results can mean a selloff

The effect size is weaker than expected
Results apply only to a narrow subgroup
Safety signals appear in the data
Endpoints don't meaningfully de-risk later phases
The readout doesn't materially change approval odds

The setup

Real catalysts, de-identified

Each case gives the model the actual catalyst (the press release announcing trial results, an FDA decision, or a safety update) along with the trial design, prior data, and related literature. The model must interpret the news and predict the stock price reaction.

To prevent models from simply recalling memorized outcomes, press releases are de-identified. Company names, drugs, tickers, dates, and executives are replaced with placeholders, so the model has to reason from the science, not its training data.
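As a rough illustration of placeholder substitution, the sketch below chains a regex pre-pass, a redaction step, and a regex clean-up pass. It is not the benchmark's actual pipeline: the patterns, the placeholder tokens such as `[TICKER]`, and the `llm_redact` stand-in (a plain dictionary lookup in place of a real LLM call) are all assumptions made for the example.

```python
import re

# Hypothetical patterns and placeholder tokens; the real ones are not published here.
TICKER_RE = re.compile(r"\((?:NASDAQ|NYSE)\s*:\s*[A-Z]{1,5}\)")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|"
    r"September|October|November|December)\s+\d{1,2},\s+\d{4}\b"
)
URL_RE = re.compile(r"https?://\S+")


def regex_pre(text: str) -> str:
    """Stage 1: replace mechanically recognizable identifiers (tickers, dates)."""
    text = TICKER_RE.sub("([TICKER])", text)
    return DATE_RE.sub("[DATE]", text)


def llm_redact(text: str, entities: dict[str, str]) -> str:
    """Stage 2 stand-in: a real pipeline would ask an LLM to find names.

    Here the entity-to-placeholder map is supplied by the caller instead.
    """
    for name, placeholder in entities.items():
        text = text.replace(name, placeholder)
    return text


def regex_post(text: str) -> str:
    """Stage 3: catch leftovers such as URLs that still reveal the company."""
    return URL_RE.sub("[URL]", text)


def deidentify(text: str, entities: dict[str, str]) -> str:
    """Run the three stages in order."""
    return regex_post(llm_redact(regex_pre(text), entities))


raw = (
    "Acme Bio (NASDAQ: ACME) announced on March 3, 2024 that ACM-101 "
    "met its primary endpoint. See https://acme.example/pr"
)
clean = deidentify(raw, {"Acme Bio": "[COMPANY]", "ACM-101": "[DRUG]"})
print(clean)
```

The scientific content ("met its primary endpoint") passes through untouched, while every token that points back to a specific company is swapped out.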

How scoring works
01 De-identified press release

Actual announcements from biotech companies. A three-stage pipeline (regex pre-processing, LLM redaction, regex post-processing) strips identifying information while preserving scientific content.

02 Linked clinical trial data

Structured data from ClinicalTrials.gov: endpoints, trial design, enrollment, arms, interventions, eligibility criteria, and site locations.

03 PubMed articles

Related research papers providing scientific context for the therapeutic area, mechanism of action, and prior results.

04 Market-cap adjusted scoring

Price impact is weighted by market cap. A 20% move in a $100B company is scored higher than the same move in a $50M micro-cap, because it represents a far larger change in absolute market value.
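One way to picture the four components above is as a single record per case, with the outcome label held apart from what the model sees. The sketch below is a hypothetical layout, not the benchmark's schema: the class names, field names, and the `build_prompt` helper are assumptions, and whether market cap is shown to the model is not stated in the text.

```python
from dataclasses import dataclass, field


@dataclass
class TrialRecord:
    """Structured ClinicalTrials.gov fields named in the text (field names assumed)."""
    design: str
    enrollment: int
    endpoints: list[str]
    arms: list[str]
    interventions: list[str]
    eligibility: str
    site_locations: list[str] = field(default_factory=list)


@dataclass
class CatalystCase:
    press_release: str                  # de-identified announcement
    trial: TrialRecord                  # linked trial entry
    pubmed_abstracts: list[str] = field(default_factory=list)
    market_cap_usd: float = 0.0         # assumption: used by the scorer only
    impact_label: str = ""              # held-out outcome, never put in the prompt


def build_prompt(case: CatalystCase) -> str:
    """Assemble model input from the three evidence sources, omitting the label."""
    sections = [
        "PRESS RELEASE\n" + case.press_release,
        "TRIAL\n"
        + f"Design: {case.trial.design}\n"
        + f"Enrollment: {case.trial.enrollment}\n"
        + "Endpoints: " + "; ".join(case.trial.endpoints) + "\n"
        + "Arms: " + "; ".join(case.trial.arms) + "\n"
        + "Interventions: " + "; ".join(case.trial.interventions) + "\n"
        + "Eligibility: " + case.trial.eligibility,
    ]
    if case.pubmed_abstracts:
        sections.append("LITERATURE\n" + "\n".join(case.pubmed_abstracts))
    return "\n\n".join(sections)


trial = TrialRecord(
    design="randomized, double-blind",
    enrollment=300,
    endpoints=["overall survival"],
    arms=["drug", "placebo"],
    interventions=["[DRUG] 10 mg"],
    eligibility="adults with advanced disease",
)
case = CatalystCase(
    press_release="[COMPANY] reported topline data for [DRUG].",
    trial=trial,
    pubmed_abstracts=["Prior phase 2 abstract."],
    market_cap_usd=5e8,
    impact_label="very_negative",
)
prompt = build_prompt(case)
print(prompt)
```

Keeping the label on the record but out of `build_prompt` makes it harder to leak the answer into the model input by accident.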

The dataset

655 validated biotech catalysts

The benchmark spans Phase 1–3 readouts, FDA approvals and rejections, and topline results across 212 companies of all sizes, from micro-cap to mega-cap.

By catalyst type

FDA Approvals: 250
Phase 3 Data: 135
Topline Results: 123
Phase 2 Data: 121
FDA Rejections: 26

By therapeutic area

Oncology: 393 (60% of dataset)
Other areas (cardio, neuro, rare disease, etc.): 262

We focused heavily on oncology because we found better generalization within one broad disease area than across a mix of unrelated indications. The dataset also spans companies of different sizes, since large-cap biotech tends to be much less volatile than small- and mid-cap names.

By stock price impact

Market-cap adjusted
Very Positive: 41 (6.3%)
Positive: 82 (12.5%)
Neutral: 429 (65.5%)
Negative: 69 (10.5%)
Very Negative: 34 (5.2%)

Adjusted scoring

Impact categories are adjusted for market capitalization. A 3% move in a mega-cap stock (>$200B) can be classified as “positive” while the same move in a small-cap would be “neutral.” This reflects the reality that large-cap biotech stocks move less on individual catalysts, but each percentage point represents billions in market value.
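The idea can be sketched as a lookup of category thresholds by market-cap tier. Only the >$200B mega-cap cutoff comes from the text. Every other tier boundary and every percentage threshold below is an illustrative assumption, chosen so that a 3% move is "positive" for a mega-cap and "neutral" for a small-cap.

```python
# (exclusive lower bound on market cap in USD, "positive" cutoff %, "very positive" cutoff %)
# Assumption: all values except the 200e9 mega-cap bound are made up for this sketch.
TIERS = [
    (200e9, 2.0, 5.0),    # mega-cap
    (10e9, 5.0, 12.0),    # large-cap
    (2e9, 8.0, 20.0),     # mid-cap
    (0.0, 10.0, 30.0),    # small- and micro-cap
]


def thresholds_for(market_cap_usd: float) -> tuple[float, float]:
    """Return the (positive, very_positive) cutoffs for the first tier the cap exceeds."""
    for lower_bound, positive, very_positive in TIERS:
        if market_cap_usd > lower_bound:
            return positive, very_positive
    return TIERS[-1][1], TIERS[-1][2]


def classify_move(pct_move: float, market_cap_usd: float) -> str:
    """Map a one-day percentage move to a five-way label, symmetric around zero."""
    positive, very_positive = thresholds_for(market_cap_usd)
    if pct_move >= very_positive:
        return "very_positive"
    if pct_move >= positive:
        return "positive"
    if pct_move <= -very_positive:
        return "very_negative"
    if pct_move <= -positive:
        return "negative"
    return "neutral"


print(classify_move(3.0, 250e9))   # mega-cap
print(classify_move(3.0, 500e6))   # small-cap
```

With these assumed cutoffs the same 3% move lands in different categories depending on company size, which is the behavior the paragraph above describes.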

De-identified press releases

A three-stage cleaning pipeline (regex pre-processing, GPT-5 redaction, regex post-processing) reduces the risk that a model re-identifies the company and recalls the real stock outcome. GPT-5 still re-identified 53% of tickers but could not recall the price impacts.

655 linked clinical trials

Every case has a matched ClinicalTrials.gov entry with structured endpoints, trial design, patient populations, arms, interventions, and eligibility criteria.

Magnitude matters

Scoring measures both directional accuracy and magnitude error, not just up/down. It penalizes overconfident but wrong predictions and adjusts for market cap.
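To make directional accuracy and magnitude error concrete, here is a minimal sketch over the five impact categories. The exact formulas are not given in the text, so everything here is an assumption: "close match" is taken to mean within one category of the truth, "direction" means the predicted and actual signs agree, and a wrong-way call (opposite signs) has its error doubled as one possible way to penalize confident misses.

```python
from dataclasses import dataclass

# Ordinal encoding of the five impact categories (assumed for this sketch).
SCALE = {
    "very_negative": -2,
    "negative": -1,
    "neutral": 0,
    "positive": 1,
    "very_positive": 2,
}


def sign(x: int) -> int:
    return (x > 0) - (x < 0)


@dataclass
class Report:
    close_match: float           # share of predictions within one category
    direction: float             # share of predictions with the correct sign
    mean_abs_error: float        # average category distance
    mean_penalized_error: float  # same, with wrong-way calls counted double


def evaluate(preds: list[str], actuals: list[str]) -> Report:
    if not preds or len(preds) != len(actuals):
        raise ValueError("preds and actuals must be non-empty and equal in length")
    close = hits = 0
    abs_err = penalized = 0.0
    for pred_label, actual_label in zip(preds, actuals):
        p, a = SCALE[pred_label], SCALE[actual_label]
        dist = abs(p - a)
        close += dist <= 1
        hits += sign(p) == sign(a)
        abs_err += dist
        # Opposite signs means the model called the wrong direction.
        penalized += dist * 2 if sign(p) * sign(a) < 0 else dist
    n = len(preds)
    return Report(close / n, hits / n, abs_err / n, penalized / n)


report = evaluate(
    ["very_positive", "neutral", "negative", "positive"],
    ["very_negative", "neutral", "very_negative", "neutral"],
)
print(report)
```

In this toy run the first prediction is a confident wrong-way call, so it dominates the penalized error even though three of the four predictions are close matches.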

Leaderboard

Top performers

# | Strategy | Model | Type | Close Match | Direction | Cases
1 | Chain-of-Thought Categorical | GPT-5 | Chain-of-Thought | 70.9% | 38.9% | 285
2 | Direct 0-10 Scores | Claude Sonnet 4 | Direct 0-10 Scores | 65.5% | 39.3% | 290
3 | Direct Categorical | GPT-5 | Direct Categorical | 64.7% | 36% | 303
4 | Chain-of-Thought Categorical | Claude Sonnet 4 | Chain-of-Thought | 51.6% | 31.9% | 304
5 | Direct Categorical | Claude Sonnet 4 | Direct Categorical | 33.4% | 30.3% | 287

Test your own strategy

Submit your approach to be benchmarked against 655 validated catalyst cases. Compare head-to-head with Claude Sonnet 4, GPT-5, and multi-step agent pipelines.