Benchmark Results

How each prediction strategy performs on the 317-case biotech catalyst benchmark

Benchmark Runs In Progress

No results have been recorded yet. The sections below describe the evaluation framework; the leaderboard will populate once benchmarks are run. All strategies will be tested against 317 validated biotech catalyst cases with known stock price outcomes.

Scoring Methodology

Market-Cap-Adjusted Scoring

Each catalyst case is scored based on the stock's price movement, weighted by market capitalization to account for liquidity and volatility differences:

score = clamp(pct_change × multiplier / 5, -10, 10)

multiplier = clamp(1 + 0.5 × log₁₀(market_cap / $1B), 0.25, 3.0)

This ensures that a 20% move in a mega-cap biotech ($100B) is weighted more heavily than the same percentage move in a micro-cap ($50M), reflecting the greater difficulty and significance of moving large-cap stocks.
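The two formulas above can be sketched in Python as follows. This is a minimal illustration, not the benchmark's actual code: the function names are hypothetical, `pct_change` is assumed to be expressed in percent (20 means +20%), and market cap is assumed to be in US dollars. The $10B case is an added illustration.

```python
import math


def market_cap_multiplier(market_cap_usd: float) -> float:
    """clamp(1 + 0.5 * log10(market_cap / $1B), 0.25, 3.0)"""
    raw = 1.0 + 0.5 * math.log10(market_cap_usd / 1e9)
    return max(0.25, min(3.0, raw))


def adjusted_score(pct_change: float, market_cap_usd: float) -> float:
    """clamp(pct_change * multiplier / 5, -10, 10)"""
    raw = pct_change * market_cap_multiplier(market_cap_usd) / 5.0
    return max(-10.0, min(10.0, raw))


print(adjusted_score(20.0, 1e9))    # 4.0   ($1B, +20%: multiplier 1.0)
print(adjusted_score(10.0, 10e9))   # 3.0   ($10B, +10%: multiplier 1.5)
print(adjusted_score(-80.0, 10e9))  # -10.0 (raw -24 is clamped to -10)
```

The outer clamp means that very large moves saturate at ±10, so the score cannot distinguish between, say, an 80% and a 90% collapse in a $10B company.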

Example Calculations:

Market Cap	Multiplier	% Change	Calculation	Final Score	Note
$50M	0.35	+20%	20 × 0.35 / 5 = 1.4	1.4	Micro-cap: heavily discounted due to high volatility
$1B	1.0	+20%	20 × 1.0 / 5 = 4.0	4.0	Mid-cap: baseline weighting
$100B	2.0	+5%	5 × 2.0 / 5 = 2.0	2.0	Mega-cap: even small moves matter

Categorical Evaluation Metrics

Exact Match:

Prediction matches the actual category exactly (e.g., predicted "positive" and actual was "positive")

Close Match:

Prediction is within 1 category (e.g., predicted "positive" and actual was "very_positive" or "slightly_positive")

Direction Correct:

Prediction has the correct sign (positive/neutral/negative)
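The three categorical metrics can be sketched as below. Several details are assumptions rather than facts from the benchmark: the snake_case category names extend the "very_positive" / "slightly_positive" style used above, a close match is taken to include exact matches (distance of at most one category), and direction is the side of neutral on which a category falls.

```python
# Ordered from most negative to most positive (names are assumed).
CATEGORIES = [
    "very_negative", "negative", "slightly_negative", "neutral",
    "slightly_positive", "positive", "very_positive",
]
RANK = {name: i for i, name in enumerate(CATEGORIES)}
NEUTRAL = RANK["neutral"]


def direction(category: str) -> int:
    """Return -1, 0, or +1 for negative, neutral, or positive categories."""
    r = RANK[category]
    return (r > NEUTRAL) - (r < NEUTRAL)


def categorical_metrics(predicted, actual):
    """Fraction of cases that are exact, within one category, and same-signed."""
    pairs = list(zip(predicted, actual))
    n = len(pairs)
    exact = sum(p == a for p, a in pairs)
    close = sum(abs(RANK[p] - RANK[a]) <= 1 for p, a in pairs)
    same_dir = sum(direction(p) == direction(a) for p, a in pairs)
    return {
        "exact_match": exact / n,
        "close_match": close / n,
        "direction": same_dir / n,
    }


preds = ["positive", "positive", "neutral", "negative"]
actuals = ["positive", "very_positive", "slightly_negative", "positive"]
print(categorical_metrics(preds, actuals))
# {'exact_match': 0.25, 'close_match': 0.75, 'direction': 0.5}
```

Note that under these assumptions a prediction of "neutral" against an actual "slightly_negative" is a close match but not direction-correct.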

Numeric Evaluation Metrics

MAE:

Mean Absolute Error between predicted and actual scores (lower is better)

Pearson Correlation:

Linear correlation between predictions and actual outcomes (higher is better, range -1 to +1)
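Both numeric metrics are standard; a dependency-free sketch is shown below. The function names are illustrative, and returning NaN when either series is constant is a design choice made here, not something the benchmark specifies.

```python
import math


def mean_absolute_error(predicted, actual):
    """Average of |predicted - actual| over all cases."""
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(actual)


def pearson(predicted, actual):
    """Pearson correlation coefficient; NaN if either series has zero variance."""
    n = len(actual)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        return float("nan")
    return cov / math.sqrt(var_p * var_a)


predicted = [1.0, 2.0, 3.0]
actual = [2.0, 4.0, 6.0]
print(mean_absolute_error(predicted, actual))  # 2.0
print(pearson(predicted, actual))              # 1.0
```

The toy data shows why both metrics are reported: the predictions are perfectly correlated with the outcomes yet off by 2.0 on average, so correlation alone would hide a systematic scale error.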

Strategy Leaderboard

Strategies evaluated on 317 validated catalyst cases

Rank	Strategy	Type	Steps	Exact Match	Close Match	Direction	MAE	Correlation	Status
1	Direct Categorical	Direct Categorical	1	--	--	--	--	--	Pending
2	Chain-of-Thought Categorical	Chain-of-Thought	1	--	--	--	--	--	Pending
3	Multi-Step Agent (Categorical)	Agent (Categorical)	5	--	--	--	--	--	Pending
4	Direct 0-10 Scores	Direct 0-10 Scores	1	--	--	--	--	--	Pending
5	Multi-Step Agent (0-10 Scores)	Agent (0-10 Scores)	5	--	--	--	--	--	Pending
6	Linear Regression (LLM Features)	Linear Regression	1	--	--	--	--	--	Pending
7	Linear Regression + Search	Linear Regression + Search	2	--	--	--	--	--	Pending

Evaluation Framework

Dataset Overview

317 Validated Catalyst Cases: De-identified events with known outcomes
168 Oncology Cases: Cancer drug trials and FDA decisions
149 Non-Oncology Cases: Other therapeutic areas

How Predictions Are Compared

1. Strategy makes prediction: Each strategy analyzes the press release and clinical trial data to predict stock impact
2. Actual outcome measured: Stock price change is calculated from market close before announcement to close on day of announcement
3. Market-cap adjustment: Price change is weighted by market cap multiplier to normalize for volatility
4. Metrics calculated: Exact match, close match, direction accuracy, MAE, and correlation metrics are computed
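Step 2 above, the close-to-close price change, can be sketched as follows. The function and parameter names are hypothetical, and the prices are made-up values for illustration.

```python
def close_to_close_pct_change(prev_close: float, announcement_close: float) -> float:
    """Percent change from the close before the announcement to the close on the day of it."""
    if prev_close <= 0:
        raise ValueError("prev_close must be positive")
    return (announcement_close - prev_close) * 100.0 / prev_close


print(close_to_close_pct_change(10.0, 12.0))  # 20.0
print(close_to_close_pct_change(10.0, 7.5))   # -25.0
```

The result is in percent, which is the unit the market-cap adjustment in step 3 takes as input.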

7-Category Impact Scale

The benchmark uses a standardized 7-category scale to classify catalyst outcomes. Categories are derived from the adjusted score, which weights raw percentage price changes by a market-cap multiplier to normalize for volatility across micro-cap and large-cap stocks.

Category	Adjusted score
VERY NEGATIVE	< -3
NEGATIVE	-3 to -1
SLIGHTLY NEGATIVE	-1 to -0.4
NEUTRAL	-0.4 to +0.4
SLIGHTLY POSITIVE	+0.4 to +1
POSITIVE	+1 to +3
VERY POSITIVE	> +3

Note: Categories are mapped from the adjusted score (not raw price change). The adjusted score = clamp(percent_change × market_cap_multiplier / 5, -10, 10), where the multiplier ranges from 0.25 (micro-cap) to 3.0 (mega-cap). Predictions from categorical strategies are compared to these bins, while numeric predictions are evaluated using MAE and correlation against the adjusted score.
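Mapping an adjusted score to one of the seven bins might look like the sketch below. The source does not say which bin a score exactly on a boundary belongs to, so the tie-breaking here is an assumption: scores of exactly ±3 and ±1 fall in the less extreme bin, and ±0.4 counts as neutral. The snake_case labels are also assumed.

```python
def categorize(adjusted_score: float) -> str:
    """Map an adjusted score to a 7-category label (boundary handling is assumed)."""
    if adjusted_score < -3:
        return "very_negative"
    if adjusted_score < -1:
        return "negative"
    if adjusted_score < -0.4:
        return "slightly_negative"
    if adjusted_score <= 0.4:
        return "neutral"
    if adjusted_score <= 1:
        return "slightly_positive"
    if adjusted_score <= 3:
        return "positive"
    return "very_positive"


for score in (-5.0, -0.7, 0.0, 0.7, 3.0, 4.0):
    print(score, categorize(score))
```

Whatever convention is chosen, it should be applied identically to predicted and actual scores so that boundary cases do not create spurious mismatches.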

De-identification & Data Cleaning

Before AI models score cases, each press release is de-identified so that models cannot simply recall a memorized stock outcome for a recognizable event. This section describes the pipeline and the results of validating it.

Three-Stage De-identification Pipeline

Regex pre-processing

Strips wire service attributions, location datelines, URLs, email addresses, stock exchange references, NCT trial IDs. Pre-redacts known trial names, institutions, cities, and conferences with placeholders.

LLM redaction (GPT-5)

Identifies and replaces identifying information with standardized placeholders. Generalizes unique identifiers (first-in-class mechanisms, unique trial names) to broader categories.

Regex post-processing

Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.

Placeholder tokens

[COMPANY] [DRUG] [TICKER] [DATE] [EXECUTIVE] [TRIAL_NAME] [LOCATION] [INSTITUTION] [TRIAL_ID] [CONFERENCE] [FINANCIAL_DETAIL]
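A small part of the regex pre-processing stage could look like the sketch below. The patterns are illustrative assumptions, not the pipeline's actual rules: URLs and email addresses are dropped, and NCT trial IDs (the letters NCT followed by eight digits) are replaced with the [TRIAL_ID] placeholder. The real stage also handles datelines, wire attributions, exchange references, and known names, which are not shown.

```python
import re

# Illustrative patterns only; a production pipeline would need broader coverage.
URL_RE = re.compile(r"https?://\S+|www\.\S+")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
NCT_RE = re.compile(r"\bNCT\d{8}\b")


def regex_preprocess(text: str) -> str:
    """Strip URLs and emails, replace NCT IDs with a placeholder, tidy whitespace."""
    text = EMAIL_RE.sub("", text)
    text = URL_RE.sub("", text)
    text = NCT_RE.sub("[TRIAL_ID]", text)
    return re.sub(r"[ \t]{2,}", " ", text).strip()


sample = (
    "Results from NCT01234567 are posted at https://example.com/trial "
    "and questions go to ir@example.com today."
)
print(regex_preprocess(sample))
# Results from [TRIAL_ID] are posted at and questions go to today.
```

As the output shows, plain stripping can leave awkward sentence fragments, which is one reason a later LLM pass and a second regex pass are useful.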

What Gets Preserved vs. Removed

What Gets Preserved

Efficacy data (response rates, survival, p-values)
Safety signals and adverse event profiles
Trial design (endpoints, enrollment, arms)
Mechanism of action (generalized)
Regulatory context and decision rationale

What Gets Removed

Company and drug names
Ticker symbols and financial details
Named investigators and executives
Specific trial names (e.g., KEYNOTE-XXX)
Institution and conference names
Dates, locations, and URLs

Validation Results

53%

Tickers re-identified by GPT-5

4.8%

Price recall within 2%

Key finding: GPT-5 re-identified the ticker in 53% of de-identified cases (169/317), but rarely recalled the actual price impact with any precision: only 4.8% of its price estimates fell within 2% of the actual move (Pearson 0.465). De-identification therefore reduces, but does not eliminate, the risk of re-identification, while preserving enough scientific content for models to reason about the catalyst.
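The two headline validation numbers are simple proportions, sketched below. The function names and toy data are hypothetical, tickers such as "AAA" are placeholders, and "within 2%" is interpreted here as an absolute difference of at most 2 percentage points, which is an assumption about the original metric.

```python
def reidentification_rate(guessed_tickers, true_tickers):
    """Share of cases where the guessed ticker equals the true one (case-insensitive)."""
    hits = sum(
        g is not None and g.upper() == t.upper()
        for g, t in zip(guessed_tickers, true_tickers)
    )
    return hits / len(true_tickers)


def share_within_tolerance(recalled_moves, actual_moves, tol_points=2.0):
    """Share of recalled price moves within tol_points percentage points of actual."""
    close = sum(abs(r - a) <= tol_points for r, a in zip(recalled_moves, actual_moves))
    return close / len(actual_moves)


# Toy data with placeholder tickers, not benchmark cases.
print(reidentification_rate(["AAA", None, "ccc", "XXX"], ["AAA", "BBB", "CCC", "DDD"]))  # 0.5
print(share_within_tolerance([10.0, -5.0, 30.0, 0.0], [11.5, -20.0, 12.0, 1.0]))  # 0.5

# The reported re-identification rate: 169 of 317 cases.
print(round(169 / 317 * 100, 1))  # 53.3
```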
