Open Benchmark · 511 Validated Cases · 182 Companies

Can AI predict biotech stock moves?

An open benchmark for evaluating how well LLMs reason about clinical trial results, FDA decisions, and biotech catalysts, then predict the stock price reaction.


Why biotech

The hardest domain for stock prediction

Biotech is unusually event-driven. FDA decisions, clinical trial readouts, safety updates, or even changes in trial design can move a stock 20–80% in a single day. Our dataset includes moves ranging from −78% to +70%.

Interpreting these catalysts requires biology expertise and clinical context. A press release that sounds “positive” can still lead to a selloff if the results don't meet the bar the market has set.

This makes biotech a uniquely challenging testbed for AI reasoning. Models must go beyond sentiment analysis and actually understand the science.

Why “positive” results can mean a selloff

The effect size is weaker than expected

Results apply only to a narrow subgroup

Safety signals appear in the data

Endpoints don't meaningfully de-risk later phases

The readout doesn't materially change approval odds

Example

Sarepta's EXONDYS 51 · FDA approval · Rare disease · +69.9%

Aprea's eprenetapopt · Phase 3 negative · MDS · −78.1%

The setup

Real catalysts, de-identified

Each case gives the model the actual catalyst (the press release announcing trial results, an FDA decision, or a safety update) along with the trial design, prior data, and related literature. The model must interpret the news and predict the stock price reaction.

To prevent models from simply recalling memorized outcomes, press releases are de-identified. Company names, drugs, tickers, dates, and executives are replaced with placeholders, so the model has to reason from the science, not its training data.
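The placeholder substitution can be sketched as a three-stage pass (regex pre-processing, LLM redaction, regex post-processing). This is a minimal illustration, not the benchmark's actual code: the regex patterns, the placeholder strings, and the `stub_llm_redact` function (a stand-in for a real LLM call) are all assumptions, as are the made-up company and drug names.

```python
import re
from typing import Mapping

# Hypothetical patterns; the benchmark's real rules are not described on this page.
TICKER_RE = re.compile(r"\((?:NASDAQ|NYSE):\s*[A-Z]{1,5}\)")
DATE_RE = re.compile(
    r"\b(?:January|February|March|April|May|June|July|August|September|"
    r"October|November|December)\s+\d{1,2},\s+\d{4}\b"
)


def regex_pre(text: str) -> str:
    """Stage 1: replace easily matched identifiers (tickers, dates)."""
    text = TICKER_RE.sub("[TICKER]", text)
    return DATE_RE.sub("[DATE]", text)


def stub_llm_redact(text: str, names: Mapping[str, str]) -> str:
    """Stage 2 stand-in: a real pipeline would ask an LLM to find names."""
    for name, placeholder in names.items():
        text = text.replace(name, placeholder)
    return text


def regex_post(text: str, names: Mapping[str, str]) -> str:
    """Stage 3: case-insensitive sweep for identifiers that survived."""
    for name, placeholder in names.items():
        text = re.sub(re.escape(name), lambda _m: placeholder, text,
                      flags=re.IGNORECASE)
    return text


def deidentify(text: str, names: Mapping[str, str]) -> str:
    return regex_post(stub_llm_redact(regex_pre(text), names), names)


raw = ("Acme Bio (NASDAQ: ACME) announced on March 3, 2021 that ACMEstat "
       "met its primary endpoint; acme bio will file with the FDA.")
names = {"Acme Bio": "[COMPANY]", "ACMEstat": "[DRUG]"}
result = deidentify(raw, names)
print(result)
```

The scientific content ("met its primary endpoint") passes through untouched, while the lowercase "acme bio" that the second stage misses is caught by the final sweep.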

How scoring works

De-identified press release

Actual announcements from biotech companies. A three-stage pipeline (regex pre-processing, LLM redaction, regex post-processing) strips identifying information while preserving scientific content.

Linked clinical trial data

Structured data from ClinicalTrials.gov: endpoints, trial design, enrollment, arms, interventions, eligibility criteria, and site locations.

PubMed articles

Related research papers providing scientific context for the therapeutic area, mechanism of action, and prior results.

Market-cap adjusted scoring

Price impact is weighted by market cap. A 20% move in a $100B company is scored higher than the same move in a $50M micro-cap, reflecting true market significance.
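One way to see why the same percentage move carries different weight is to convert it to dollars of market value. The sketch below is plain arithmetic on the two companies named above; treating absolute dollar change as the measure of significance is an assumption, and `dollar_move` is a hypothetical helper, not the benchmark's scoring function.

```python
def dollar_move(market_cap_usd: int, move_pct: int) -> int:
    """Absolute change in market value, in whole dollars, for a percentage move."""
    return market_cap_usd * move_pct // 100


large = dollar_move(100_000_000_000, 20)  # 20% move in a $100B company
micro = dollar_move(50_000_000, 20)       # 20% move in a $50M micro-cap
print(f"${large:,} vs ${micro:,} ({large // micro}x)")
```

Under this assumption the large-cap move shifts $20 billion of value against $10 million for the micro-cap, a 2,000-fold difference for the same 20% headline number.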

The dataset

511 validated biotech catalysts

The benchmark spans Phase 1–3 readouts, FDA approvals and rejections, and topline results across 182 companies of all sizes, from micro-cap to mega-cap.

By catalyst type

Catalyst type	Cases
FDA Approvals	198
Phase 3 Data	120
Phase 2 Data
Topline Results
FDA Rejections
By therapeutic area

Therapeutic area	Cases	Share
Oncology	289	57% of dataset
Other areas (cardio, neuro, rare disease, etc.)	222	43% of dataset

We focused heavily on oncology because we found better generalization within one broad disease area than across a mix of unrelated indications. The dataset also spans companies of different sizes, since large-cap biotech tends to be far less volatile than small- and mid-cap names.

By stock price impact

Market-cap adjusted

Impact category	Cases	Share
Very Positive	36	7.0%
Positive	61	11.9%
Neutral	330	64.6%
Negative	55	10.8%
Very Negative	29	5.7%

Adjusted scoring

Impact categories are adjusted for market capitalization. A 3% move in a mega-cap stock (>$200B) can be classified as “positive” while the same move in a small-cap would be “neutral.” This reflects the reality that large-cap biotech stocks move less on individual catalysts, but each percentage point represents billions in market value.
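A cap-dependent classification like this can be sketched with tiered thresholds. Only the >$200B mega-cap boundary and the five category names come from this page; every threshold value, the other tier boundary, and the `classify` helper itself are illustrative assumptions chosen so that the 3% example above behaves as described.

```python
LABELS = ["Very Negative", "Negative", "Neutral", "Positive", "Very Positive"]

# (cap must exceed this, threshold % for Positive/Negative, threshold % for Very ...)
# Hypothetical numbers: only the $200B mega-cap boundary appears in the text.
TIERS = [
    (200e9, 2.0, 5.0),    # mega-cap
    (10e9, 5.0, 12.0),    # large-cap (assumed boundary)
    (0.0, 10.0, 30.0),    # everything smaller (assumed)
]


def thresholds_for(market_cap_usd: float) -> tuple:
    """Return the (moderate, extreme) thresholds for a company's size tier."""
    for min_cap, moderate, extreme in TIERS:
        if market_cap_usd > min_cap:
            return moderate, extreme
    return TIERS[-1][1], TIERS[-1][2]


def classify(move_pct: float, market_cap_usd: float) -> str:
    """Map a one-day percentage move to an impact category."""
    moderate, extreme = thresholds_for(market_cap_usd)
    magnitude = abs(move_pct)
    step = 2 if magnitude >= extreme else 1 if magnitude >= moderate else 0
    index = 2 + step if move_pct > 0 else 2 - step
    return LABELS[index]


print(classify(3.0, 250e9))   # mega-cap
print(classify(3.0, 500e6))   # small-cap
```

With these assumed thresholds, the same 3% move lands in "Positive" for the $250B company and "Neutral" for the $500M one.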

De-identified press releases

A three-stage cleaning pipeline (regex pre-processing, GPT-5 redaction, regex post-processing) reduces the risk that a model recognizes the company and recalls the real stock outcome. GPT-5 still re-identified 53% of tickers but could not recall the price impacts.

511 linked clinical trials

Every case has a matched ClinicalTrials.gov entry with structured endpoints, trial design, patient populations, arms, interventions, and eligibility criteria.

Magnitude matters

Scoring measures directional accuracy and magnitude error, not just up/down. It penalizes overconfident but wrong predictions and adjusts for market cap.
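To make the difference between direction and magnitude concrete, here is a minimal sketch of scoring categorical predictions. The page does not define its metrics, so everything here is an assumption: the ordinal mapping, treating "direction" as a sign match, treating a "close match" as being within one category, and the `score` helper.

```python
# Assumed ordinal encoding of the five impact categories.
ORDINAL = {
    "Very Negative": -2, "Negative": -1, "Neutral": 0,
    "Positive": 1, "Very Positive": 2,
}


def sign(x: int) -> int:
    return (x > 0) - (x < 0)


def score(predicted: list, actual: list) -> dict:
    """Compare predicted and actual categories under the assumed definitions."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("need two non-empty lists of equal length")
    pairs = [(ORDINAL[p], ORDINAL[a]) for p, a in zip(predicted, actual)]
    n = len(pairs)
    return {
        # Same sign (up, flat, or down) as the real move.
        "direction": sum(sign(p) == sign(a) for p, a in pairs) / n,
        # Off by at most one category.
        "close_match": sum(abs(p - a) <= 1 for p, a in pairs) / n,
        # Average category distance; confident misses cost the most.
        "mean_abs_error": sum(abs(p - a) for p, a in pairs) / n,
    }


predicted = ["Positive", "Neutral", "Very Positive", "Negative"]
actual = ["Very Positive", "Neutral", "Very Negative", "Neutral"]
result = score(predicted, actual)
print(result)
```

In this toy example the third case, a "Very Positive" call on a "Very Negative" outcome, contributes a distance of 4 on its own, which is how an ordinal error term penalizes overconfident wrong predictions more heavily than a cautious miss.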

Leaderboard

Top performers


#	Strategy	Model	Type	Close Match	Direction	Cases
1	Chain-of-Thought Categorical	GPT-5	Chain-of-Thought	70.9%	38.9%	285
2	Direct 0-10 Scores	Claude Sonnet 4	Direct 0-10 Scores	65.5%	39.3%	290
3	Direct Categorical	GPT-5	Direct Categorical	64.7%	36.0%	303
4	Chain-of-Thought Categorical	Claude Sonnet 4	Chain-of-Thought	51.6%	31.9%	304
5	Direct Categorical	Claude Sonnet 4	Direct Categorical	33.4%	30.3%	287

Test your own strategy

Submit your approach to be benchmarked against 511 validated catalyst cases. Compare head-to-head with Claude Sonnet 4, GPT-5, and multi-step agent pipelines.

