Benchmark Results

How each prediction strategy performs on the 317-case biotech catalyst benchmark

Benchmark Runs In Progress

Results below show the evaluation framework. Run benchmarks to populate real results. All strategies will be tested against 317 validated biotech catalyst cases with known stock price outcomes.

Scoring Methodology

Market-Cap-Adjusted Scoring

Each catalyst case is scored based on the stock's price movement, weighted by market capitalization to account for liquidity and volatility differences:

score = clamp(pct_change × multiplier / 5, -10, 10)
multiplier = clamp(1 + 0.5 × log₁₀(market_cap / $1B), 0.25, 3.0)

This ensures that a 20% move in a mega-cap biotech ($100B) is weighted more heavily than the same percentage move in a micro-cap ($50M), reflecting the greater difficulty and significance of moving large-cap stocks.
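The two formulas above can be sketched in Python. This is a minimal sketch, assuming market caps in dollars and percent changes expressed as plain numbers (e.g., 20 for +20%); the function names are illustrative, not the benchmark's actual API:

```python
import math

def market_cap_multiplier(market_cap: float) -> float:
    """Volatility/liquidity weight, clamped to [0.25, 3.0]; market_cap in dollars."""
    raw = 1 + 0.5 * math.log10(market_cap / 1e9)
    return max(0.25, min(3.0, raw))

def adjusted_score(pct_change: float, market_cap: float) -> float:
    """Market-cap-adjusted score, clamped to [-10, 10]; pct_change as e.g. 20 for +20%."""
    raw = pct_change * market_cap_multiplier(market_cap) / 5
    return max(-10.0, min(10.0, raw))
```

For a $1B company moving +20%, the multiplier is 1.0 and the adjusted score is 4.0, matching the mid-cap baseline below.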

Example Calculations:

Market Cap | Multiplier | % Change | Calculation | Final Score | Interpretation
$50M | 0.35 | +20% | 20 × 0.35 / 5 ≈ 1.4 | 1.4 | Micro-cap: heavily discounted due to high volatility
$1B | 1.0 | +20% | 20 × 1.0 / 5 = 4.0 | 4.0 | Mid-cap: baseline weighting
$100B | 2.0 | +5% | 5 × 2.0 / 5 = 2.0 | 2.0 | Mega-cap: even small moves matter

(Multipliers follow the formula above: $50M → 1 + 0.5 × log₁₀(0.05) ≈ 0.35; $1B → 1.0; $100B → 1 + 0.5 × log₁₀(100) = 2.0.)

Categorical Evaluation Metrics

Exact Match: Prediction matches the actual category exactly (e.g., predicted "positive" and actual was "positive")
Close Match: Prediction is within one category (e.g., predicted "positive" and actual was "very_positive" or "slightly_positive")
Direction Correct: Prediction has the correct sign (positive, neutral, or negative)

Numeric Evaluation Metrics

MAE: Mean Absolute Error between predicted and actual scores (lower is better)
Pearson Correlation: Linear correlation between predictions and actual outcomes (higher is better; range -1 to +1)

Strategy Leaderboard

Strategies evaluated on 317 validated catalyst cases

Rank | Strategy | Type | Steps | Exact Match | Close Match | Direction | MAE | Correlation | Status
1 | Direct Categorical | Direct Categorical | 1 | -- | -- | -- | -- | -- | Pending
2 | Chain-of-Thought Categorical | Chain-of-Thought | 1 | -- | -- | -- | -- | -- | Pending
3 | Multi-Step Agent (Categorical) | Agent (Categorical) | 5 | -- | -- | -- | -- | -- | Pending
4 | Direct 0-10 Scores | Direct 0-10 Scores | 1 | -- | -- | -- | -- | -- | Pending
5 | Multi-Step Agent (0-10 Scores) | Agent (0-10 Scores) | 5 | -- | -- | -- | -- | -- | Pending
6 | Linear Regression (LLM Features) | Linear Regression | 1 | -- | -- | -- | -- | -- | Pending
7 | Linear Regression + Search | Linear Regression + Search | 2 | -- | -- | -- | -- | -- | Pending

Evaluation Framework

Dataset Overview

317
Validated Catalyst Cases
De-identified events with known outcomes
168
Oncology Cases
Cancer drug trials and FDA decisions
149
Non-Oncology Cases
Other therapeutic areas

How Predictions Are Compared

  1. Strategy makes prediction: Each strategy analyzes the press release and clinical trial data to predict stock impact
  2. Actual outcome measured: Stock price change is calculated from the market close before the announcement to the close on the day of the announcement
  3. Market-cap adjustment: The price change is weighted by the market-cap multiplier to normalize for volatility
  4. Metrics calculated: Exact match, close match, direction accuracy, MAE, and correlation are computed
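Steps 2 through 4 can be sketched for a single case as follows; this is a minimal sketch, and the function name, arguments, and return shape are illustrative rather than the benchmark's actual code:

```python
import math

def evaluate_case(predicted_score, close_before, close_on_day, market_cap):
    """One benchmark case: returns the actual adjusted score and the
    per-case absolute error that later feeds into MAE."""
    # Step 2: actual outcome, close before announcement -> close on announcement day
    pct_change = (close_on_day - close_before) / close_before * 100
    # Step 3: market-cap adjustment with the clamped multiplier
    multiplier = max(0.25, min(3.0, 1 + 0.5 * math.log10(market_cap / 1e9)))
    actual_score = max(-10.0, min(10.0, pct_change * multiplier / 5))
    # Step 4: absolute error for this case, later averaged into MAE
    return actual_score, abs(predicted_score - actual_score)
```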

7-Category Impact Scale

The benchmark uses a standardized 7-category scale to classify catalyst outcomes. Categories are derived from the adjusted score, which weights raw percentage price changes by a market-cap multiplier to normalize for volatility across micro-cap and large-cap stocks.

VERY NEGATIVE: adjusted score < -3
NEGATIVE: adjusted score -3 to -1
SLIGHTLY NEGATIVE: adjusted score -1 to -0.4
NEUTRAL: adjusted score -0.4 to +0.4
SLIGHTLY POSITIVE: adjusted score +0.4 to +1
POSITIVE: adjusted score +1 to +3
VERY POSITIVE: adjusted score > +3

Note: Categories are mapped from the adjusted score (not the raw price change). The adjusted score = percent_change × market_cap_multiplier / 5, where the multiplier is clamped between 0.25 (micro-cap floor) and 3.0 (mega-cap ceiling). Predictions from categorical strategies are compared against these bins, while numeric predictions are evaluated with MAE and correlation against the adjusted score.
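The binning can be sketched as a simple threshold chain. How an exact bin edge (e.g., a score of exactly -1) is assigned is an assumption here, since the scale states only "< -3" and "> +3" explicitly; this sketch assigns edges to the bin nearer neutral:

```python
def score_to_category(score: float) -> str:
    """Map an adjusted score onto the 7-category impact scale.
    Edge assignment (exact boundaries go to the bin nearer neutral) is an
    assumption; the source scale leaves interior boundaries ambiguous."""
    if score < -3:
        return "very_negative"
    if score < -1:
        return "negative"
    if score < -0.4:
        return "slightly_negative"
    if score <= 0.4:
        return "neutral"
    if score <= 1:
        return "slightly_positive"
    if score <= 3:
        return "positive"
    return "very_positive"
```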

De-identification & Data Cleaning

Before AI models score cases, each press release is de-identified to prevent memorization of real stock outcomes. Here are the benchmark results validating that process.

Three-Stage De-identification Pipeline

01

Regex pre-processing

Strips wire service attributions, location datelines, URLs, email addresses, stock exchange references, and NCT trial IDs. Pre-redacts known trial names, institutions, cities, and conferences with placeholders.

02

LLM redaction (GPT-5)

Identifies and replaces identifying information with standardized placeholders. Generalizes unique identifiers (first-in-class mechanisms, unique trial names) to broader categories.

03

Regex post-processing

Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.

Placeholder tokens

[COMPANY] [DRUG] [TICKER] [DATE] [EXECUTIVE] [TRIAL_NAME] [LOCATION] [INSTITUTION] [TRIAL_ID] [CONFERENCE] [FINANCIAL_DETAIL]
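Stage 1 of the pipeline might look like the following. The patterns shown are an illustrative subset under stated assumptions (e.g., a simplified exchange-reference format); the real pipeline's rule list is broader and includes datelines, wire attributions, and known trial names:

```python
import re

# Patterns whose matches are removed outright (illustrative subset)
STRIP_PATTERNS = [
    r"https?://\S+",                  # URLs
    r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b",  # email addresses
]

# Patterns replaced with standardized placeholder tokens
REDACT_PATTERNS = [
    (r"NCT\d{8}", "[TRIAL_ID]"),                                  # ClinicalTrials.gov IDs
    (r"\((?:NASDAQ|NYSE|OTC)[:\s]*[A-Z]{1,5}\)", "([TICKER])"),   # exchange references
]

def regex_preprocess(text: str) -> str:
    """Stage-1 regex pass: strip debris, then pre-redact known identifier shapes."""
    for pattern in STRIP_PATTERNS:
        text = re.sub(pattern, "", text)
    for pattern, placeholder in REDACT_PATTERNS:
        text = re.sub(pattern, placeholder, text)
    return text
```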

What Gets Preserved vs. Removed

What Gets Preserved

  • Efficacy data (response rates, survival, p-values)
  • Safety signals and adverse event profiles
  • Trial design (endpoints, enrollment, arms)
  • Mechanism of action (generalized)
  • Regulatory context and decision rationale

What Gets Removed

  • Company and drug names
  • Ticker symbols and financial details
  • Named investigators and executives
  • Specific trial names (e.g., KEYNOTE-XXX)
  • Institution and conference names
  • Dates, locations, and URLs

Validation Results

53%
Tickers re-identified by GPT-5
4.8%
Price recall within 2%

Key finding: GPT-5 re-identified 53% of tickers from de-identified text (169/317 cases), but could not recall actual price impacts (only 4.8% within 2% of actual, Pearson 0.465). We do our best to mitigate re-identification while preserving enough scientific content for models to reason about the catalyst.

Ready to Run Benchmarks?

Execute strategies against the 317-case benchmark to see which approach performs best at predicting biotech catalyst outcomes.
