Benchmark Results

How each prediction strategy performs on the 655-case biotech catalyst benchmark

Scoring Methodology

Market-Cap-Adjusted Scoring

Each catalyst case is scored based on the stock's price movement, weighted by market capitalization to account for liquidity and volatility differences:

score = clamp(pct_change × multiplier / 5, -10, 10)
multiplier = clamp(1 + 0.5 × log₁₀(market_cap / $1B), 0.25, 3.0)

This ensures that a 20% move in a mega-cap biotech ($100B) is weighted more heavily than the same percentage move in a micro-cap ($50M), reflecting the greater difficulty and significance of moving large-cap stocks.
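The two formulas above can be sketched directly in Python. This is a minimal illustration, not the benchmark's actual implementation: the function names (`clamp`, `market_cap_multiplier`, `adjusted_score`) are hypothetical, and the sketch assumes market cap is supplied in US dollars and the price change in percent.

```python
import math


def clamp(value: float, low: float, high: float) -> float:
    """Restrict value to the closed interval [low, high]."""
    return max(low, min(high, value))


def market_cap_multiplier(market_cap_usd: float) -> float:
    """multiplier = clamp(1 + 0.5 * log10(market_cap / $1B), 0.25, 3.0)"""
    if market_cap_usd <= 0:
        raise ValueError("market cap must be positive")
    return clamp(1 + 0.5 * math.log10(market_cap_usd / 1e9), 0.25, 3.0)


def adjusted_score(pct_change: float, market_cap_usd: float) -> float:
    """score = clamp(pct_change * multiplier / 5, -10, 10)"""
    raw = pct_change * market_cap_multiplier(market_cap_usd) / 5
    return clamp(raw, -10, 10)
```

At the $1B baseline the logarithm is zero, so the multiplier is 1.0 and `adjusted_score(20, 1e9)` evaluates to 4.0.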

Example Calculations:

Market Cap: $50M
Multiplier: 1 + 0.5 × log₁₀(0.05) ≈ 0.35
% Change: +20%
Calculation: 20 × 0.35 / 5 ≈ 1.4
Final Score: 1.4

Micro-cap: heavily discounted due to high volatility

Market Cap: $1B
Multiplier: 1.0
% Change: +20%
Calculation: 20 × 1.0 / 5 = 4.0
Final Score: 4.0

Mid-cap: baseline weighting

Market Cap: $100B
Multiplier: 1 + 0.5 × log₁₀(100) = 2.0
% Change: +5%
Calculation: 5 × 2.0 / 5 = 2.0
Final Score: 2.0

Mega-cap: even small moves matter

Categorical Evaluation Metrics

Exact Match: Prediction matches the actual category exactly (e.g., predicted "positive" and actual was "positive")
Close Match: Prediction is within 1 category (e.g., predicted "positive" and actual was "very_positive" or "slightly_positive")
Direction Correct: Prediction has the correct sign (positive/neutral/negative)
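A rough sketch of how these three metrics could be computed over ordered category labels. The snake_case label names follow the examples above; the helper names are hypothetical. The sketch also assumes that an exact match counts as a close match, and that "slightly", plain, and "very" categories on the same side of neutral share a sign.

```python
# Ordered from most negative to most positive.
CATEGORIES = [
    "very_negative", "negative", "slightly_negative", "neutral",
    "slightly_positive", "positive", "very_positive",
]
RANK = {name: index for index, name in enumerate(CATEGORIES)}
NEUTRAL_RANK = RANK["neutral"]


def exact_match(predicted: str, actual: str) -> bool:
    return predicted == actual


def close_match(predicted: str, actual: str) -> bool:
    # Assumption: "within 1 category" includes the exact category itself.
    return abs(RANK[predicted] - RANK[actual]) <= 1


def direction(category: str) -> int:
    """Return -1, 0, or +1 for negative, neutral, or positive categories."""
    offset = RANK[category] - NEUTRAL_RANK
    return (offset > 0) - (offset < 0)


def direction_correct(predicted: str, actual: str) -> bool:
    return direction(predicted) == direction(actual)


def categorical_metrics(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Fraction of (predicted, actual) pairs satisfying each metric."""
    if not pairs:
        raise ValueError("need at least one (predicted, actual) pair")
    total = len(pairs)
    return {
        "exact_match": sum(exact_match(p, a) for p, a in pairs) / total,
        "close_match": sum(close_match(p, a) for p, a in pairs) / total,
        "direction_correct": sum(direction_correct(p, a) for p, a in pairs) / total,
    }
```

Under these assumptions, a "slightly_positive" prediction against a "very_positive" outcome is direction-correct but neither an exact nor a close match.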

Numeric Evaluation Metrics

MAE: Mean Absolute Error between predicted and actual scores (lower is better)
Pearson Correlation: Linear correlation between predictions and actual outcomes (higher is better, range -1 to +1)
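Both numeric metrics are standard and can be written in a few lines of plain Python. The sketch below uses hypothetical function names and assumes predictions and actual adjusted scores arrive as equal-length lists of floats.

```python
import math


def mean_absolute_error(predicted: list[float], actual: list[float]) -> float:
    """Average of |predicted - actual| over all cases."""
    if not predicted or len(predicted) != len(actual):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)


def pearson_correlation(predicted: list[float], actual: list[float]) -> float:
    """Covariance divided by the product of the standard deviations."""
    if len(predicted) < 2 or len(predicted) != len(actual):
        raise ValueError("inputs must have equal length of at least 2")
    n = len(predicted)
    mean_p = sum(predicted) / n
    mean_a = sum(actual) / n
    cov = sum((p - mean_p) * (a - mean_a) for p, a in zip(predicted, actual))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_a = sum((a - mean_a) ** 2 for a in actual)
    if var_p == 0 or var_a == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_a)
```

Note that a strategy predicting the same score for every case has zero variance, so its correlation is undefined rather than zero; the sketch raises an error in that situation.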

Evaluation Framework

Dataset Overview

655 Validated Catalyst Cases: De-identified events with known outcomes
393 Oncology Cases: Cancer drug trials and FDA decisions
262 Non-Oncology Cases: Other therapeutic areas

How Predictions Are Compared

  1. Strategy makes prediction: Each strategy analyzes the press release and clinical trial data to predict stock impact
  2. Actual outcome measured: Stock price change is calculated from market close before announcement to close on day of announcement
  3. Market-cap adjustment: Price change is weighted by market cap multiplier to normalize for volatility
  4. Metrics calculated: Exact match, close match, direction accuracy, MAE, and correlation metrics are computed
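The close-to-close measurement in step 2 amounts to a simple percentage change. A minimal sketch, where the function and parameter names are hypothetical and both prices are assumed to be in the same currency:

```python
def pct_change(prev_close: float, announcement_close: float) -> float:
    """Percent move from the close before the announcement to the
    close on the announcement day."""
    if prev_close <= 0:
        raise ValueError("previous close must be positive")
    return (announcement_close - prev_close) / prev_close * 100
```

For example, a stock that closed at 10.00 before the announcement and at 12.00 on the announcement day has moved about +20%, which is the value that then feeds into the market-cap adjustment.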

7-Category Impact Scale

The benchmark uses a standardized 7-category scale to classify catalyst outcomes. Categories are derived from the adjusted score, which weights raw percentage price changes by a market-cap multiplier to normalize for volatility across micro-cap and large-cap stocks.

VERY NEGATIVE: adjusted score < -3
NEGATIVE: adjusted score -3 to -1
SLIGHTLY NEGATIVE: adjusted score -1 to -0.4
NEUTRAL: adjusted score -0.4 to +0.4
SLIGHTLY POSITIVE: adjusted score +0.4 to +1
POSITIVE: adjusted score +1 to +3
VERY POSITIVE: adjusted score > +3

Note: Categories are mapped from the adjusted score (not raw price change). The adjusted score = percent_change × market_cap_multiplier / 5, clamped to the range -10 to +10, where the multiplier is clamped between 0.25 (micro-cap) and 3.0 (mega-cap). Predictions from categorical strategies are compared to these bins, while numeric predictions are evaluated using MAE and correlation against the adjusted score.
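The binning can be sketched as a small lookup on the magnitude and sign of the adjusted score. The scale above does not say which bin owns a score that lands exactly on a boundary, so this sketch assumes boundary values fall into the milder category (for example, exactly +3 maps to "positive"), which agrees with the strict "< -3" and "> +3" limits for the extreme bins. The function name is hypothetical.

```python
def score_to_category(score: float) -> str:
    """Map an adjusted score to one of the seven impact categories."""
    magnitude = abs(score)
    if magnitude <= 0.4:
        return "neutral"
    if magnitude <= 1:
        strength = "slightly_"
    elif magnitude <= 3:
        strength = ""
    else:
        strength = "very_"
    side = "positive" if score > 0 else "negative"
    return strength + side
```

Because the scale is symmetric around zero, working on the absolute value keeps the positive and negative thresholds in one place.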

De-identification & Data Cleaning

Before AI models score cases, each press release is de-identified to prevent memorization of real stock outcomes. Here are the benchmark results validating that process.

Three-Stage De-identification Pipeline

01. Regex pre-processing

Strips wire service attributions, location datelines, URLs, email addresses, stock exchange references, and NCT trial IDs. Pre-redacts known trial names, institutions, cities, and conferences with placeholders.

02. LLM redaction (GPT-5)

Identifies and replaces identifying information with standardized placeholders. Generalizes unique identifiers (first-in-class mechanisms, unique trial names) to broader categories.

03. Regex post-processing

Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.

Placeholder tokens

[COMPANY] [DRUG] [TICKER] [DATE] [EXECUTIVE] [TRIAL_NAME] [LOCATION] [INSTITUTION] [TRIAL_ID] [CONFERENCE] [FINANCIAL_DETAIL]
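To make the first stage concrete, here is a small sketch of regex pre-processing covering three of the listed items: URLs, email addresses, and NCT trial IDs. The patterns and the function name are illustrative assumptions, not the pipeline's actual rules. In particular, replacing NCT IDs with the [TRIAL_ID] placeholder rather than deleting them is a choice made for this example.

```python
import re

# Illustrative patterns only; the real pipeline's rules are not published here.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b")
URL_RE = re.compile(r"https?://\S+|www\.\S+")
NCT_RE = re.compile(r"\bNCT\d{8}\b")  # ClinicalTrials.gov IDs: "NCT" + 8 digits


def regex_preprocess(text: str) -> str:
    """Remove emails and URLs, and replace NCT IDs with a placeholder."""
    text = EMAIL_RE.sub("", text)  # emails first, before URL matching
    text = URL_RE.sub("", text)
    text = NCT_RE.sub("[TRIAL_ID]", text)
    # Collapse runs of spaces left behind by the removals.
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

A later stage would still be needed for identifiers that have no fixed textual pattern, such as company or drug names, which is the role the LLM redaction step plays.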

What Gets Preserved vs. Removed

What Gets Preserved

  • Efficacy data (response rates, survival, p-values)
  • Safety signals and adverse event profiles
  • Trial design (endpoints, enrollment, arms)
  • Mechanism of action (generalized)
  • Regulatory context and decision rationale

What Gets Removed

  • Company and drug names
  • Ticker symbols and financial details
  • Named investigators and executives
  • Specific trial names (e.g., KEYNOTE-XXX)
  • Institution and conference names
  • Dates, locations, and URLs

Validation Results

53%: Tickers re-identified by GPT-5
4.8%: Price recall within 2%

Key finding: GPT-5 re-identified 53% of tickers from de-identified text (169/304 cases), but could not recall actual price impacts (only 4.8% within 2% of actual, Pearson 0.465). We do our best to mitigate re-identification while preserving enough scientific content for models to reason about the catalyst.
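The two headline validation figures are simple hit rates. The sketch below shows one way they might be computed; the function names are hypothetical, and it assumes that "within 2%" means an absolute difference of at most 2 percentage points between the recalled and actual price change, which the text does not spell out.

```python
from typing import Optional


def reidentification_rate(guessed: list[Optional[str]], true_tickers: list[str]) -> float:
    """Fraction of cases where the guessed ticker equals the true ticker."""
    if not true_tickers or len(guessed) != len(true_tickers):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(
        g is not None and g.strip().upper() == t.strip().upper()
        for g, t in zip(guessed, true_tickers)
    )
    return hits / len(true_tickers)


def price_recall_rate(recalled_pct: list[float], actual_pct: list[float],
                      tolerance: float = 2.0) -> float:
    """Fraction of cases where the recalled move is within `tolerance`
    percentage points of the actual move (an assumed interpretation)."""
    if not actual_pct or len(recalled_pct) != len(actual_pct):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(abs(r - a) <= tolerance for r, a in zip(recalled_pct, actual_pct))
    return hits / len(actual_pct)
```

A high re-identification rate combined with a low price recall rate is the pattern described above: the model can often name the company but cannot reproduce the price move it is being asked to predict.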
