BioTradingArena

Scoring & Metrics

How predictions are scored and what metrics are reported.

Adjusted Score

Each catalyst's ground truth is based on an adjusted score, not the raw stock price change. The adjusted score weights the percentage move by the company's market capitalization to normalize for volatility differences across micro-cap and large-cap stocks.

adjusted_score = clamp(percent_change × multiplier / 5, -10, 10)

multiplier = clamp(1 + 0.5 × log₁₀(market_cap / $1B), 0.25, 3.0)
Market Cap         | Multiplier | Effect
$50M (micro-cap)   | 0.25 (min) | Heavily discounted — high volatility is common
$1B (mid-cap)      | 1.0        | Baseline weighting
$10B (large-cap)   | 1.5        | Amplified — large-cap moves are more significant
$100B (mega-cap)   | 3.0 (max)  | Even small moves carry weight

Example: A +20% move in a $50M micro-cap is heavily discounted, producing an adjusted score of 20 × 0.25 / 5 = 1.0, while a +5% move in a $100B mega-cap is amplified to 5 × 3.0 / 5 = 3.0. The smaller mega-cap move yields the higher adjusted score.
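
To make the arithmetic concrete, here is a minimal Python sketch of the formulas as stated above. The function names are illustrative, not part of the benchmark API:

import math

def clamp(x, lo, hi):
    return max(lo, min(hi, x))

def cap_multiplier(market_cap_usd):
    # 1 + 0.5 * log10(market_cap / $1B), clamped to [0.25, 3.0]
    return clamp(1 + 0.5 * math.log10(market_cap_usd / 1e9), 0.25, 3.0)

def adjusted_score(percent_change, market_cap_usd):
    # Market-cap-weighted percent move, clamped to [-10, 10]
    return clamp(percent_change * cap_multiplier(market_cap_usd) / 5, -10, 10)

print(adjusted_score(5.0, 10e9))  # $10B large-cap: 5 * 1.5 / 5 = 1.5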

7-Category Impact Scale

Impact categories are derived from the adjusted score:

Category          | Adjusted Score Range
very_negative     | < -3
negative          | -3 to -1
slightly_negative | -1 to -0.4
neutral           | -0.4 to +0.4
slightly_positive | +0.4 to +1
positive          | +1 to +3
very_positive     | > +3

These categories form an ordinal scale from 0 (very_negative) to 6 (very_positive), used for categorical evaluation metrics.
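
A sketch of the category mapping in Python. The table above does not state which side of each boundary is inclusive, so the comparisons below are an assumption:

ORDER = ["very_negative", "negative", "slightly_negative", "neutral",
         "slightly_positive", "positive", "very_positive"]

def impact_category(adjusted):
    # Thresholds from the table above; boundary inclusivity is assumed
    if adjusted < -3:   return "very_negative"
    if adjusted < -1:   return "negative"
    if adjusted < -0.4: return "slightly_negative"
    if adjusted <= 0.4: return "neutral"
    if adjusted <= 1:   return "slightly_positive"
    if adjusted <= 3:   return "positive"
    return "very_positive"

# ORDER.index(impact_category(x)) gives the ordinal value 0-6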

Metrics

Every submission is evaluated on three accuracy metrics (exact match, directional, and close match), plus MAE when a numeric predicted_score is provided:

Exact Match Accuracy

The percentage of predictions that exactly match the ground truth category.

exact_match_accuracy = (exact_matches / total_predictions) * 100

Directional Accuracy

The percentage of predictions that get the direction correct:

  • Positive: slightly_positive, positive, very_positive
  • Negative: slightly_negative, negative, very_negative
  • Neutral: neutral

A prediction is directionally correct if both the predicted and actual impact fall in the same direction group.
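
A sketch of the direction check, assuming the grouping above:

DIRECTION = {
    "slightly_positive": "positive", "positive": "positive", "very_positive": "positive",
    "slightly_negative": "negative", "negative": "negative", "very_negative": "negative",
    "neutral": "neutral",
}

def direction_correct(predicted_impact, actual_impact):
    # Correct when both categories fall in the same direction group
    return DIRECTION[predicted_impact] == DIRECTION[actual_impact]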

Close Match Accuracy

The percentage of predictions that are within one category of the ground truth on the ordinal scale:

very_negative (0) → negative (1) → slightly_negative (2) → neutral (3) → slightly_positive (4) → positive (5) → very_positive (6)

A prediction is a "close match" if |predicted_order - actual_order| <= 1.
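
Expressed in Python, using the ordinal positions from the scale above:

ORDINAL = {"very_negative": 0, "negative": 1, "slightly_negative": 2, "neutral": 3,
           "slightly_positive": 4, "positive": 5, "very_positive": 6}

def close_match(predicted_impact, actual_impact):
    # Within one step on the 0-6 ordinal scale
    return abs(ORDINAL[predicted_impact] - ORDINAL[actual_impact]) <= 1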

Mean Absolute Error (MAE)

If you submit predicted_score (a numeric % change prediction) alongside your categorical prediction, MAE measures the average absolute difference between your predicted score and the actual adjusted score.

MAE = mean(|predicted_score - actual_adjusted_score|)
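
A one-function sketch, assuming MAE is computed only over cases where a numeric predicted_score was submitted:

def mean_absolute_error(pairs):
    # pairs: (predicted_score, actual_adjusted_score) tuples
    return sum(abs(p - a) for p, a in pairs) / len(pairs)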

Confusion Matrix

The verify endpoint returns a direction confusion matrix showing how your predictions distribute across actual directions:

{
  "direction_confusion_matrix": {
    "positive": { "positive": 45, "neutral": 5, "negative": 2 },
    "neutral":  { "positive": 8,  "neutral": 12, "negative": 6 },
    "negative": { "positive": 3,  "neutral": 7,  "negative": 35 }
  }
}

Rows = actual direction, columns = predicted direction.
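
For reference, here is one way to tally such a matrix in Python; the pair ordering (actual first) matches the row/column convention above:

from collections import defaultdict

def direction_confusion_matrix(pairs):
    # pairs: (actual_direction, predicted_direction) tuples
    matrix = defaultdict(lambda: defaultdict(int))
    for actual, predicted in pairs:
        matrix[actual][predicted] += 1
    return {actual: dict(row) for actual, row in matrix.items()}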

De-identification Pipeline

Before AI models score cases, each press release goes through a three-stage de-identification pipeline to prevent models from simply recalling memorized stock outcomes.

Pipeline Stages

  1. Regex pre-processing — Strips wire service attributions, location datelines, URLs, email addresses, and stock exchange references. Pre-redacts known trial names (150+), institution names (60+), city names (100+), and conference names (30+) using comprehensive regex lists (a sketch of this stage follows the list).

  2. LLM redaction (GPT-5) — Identifies and replaces all remaining identifying information with standardized placeholder tokens. Generalizes unique identifiers (first-in-class mechanisms, unique MOAs) to broader categories.

  3. Regex post-processing — Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.
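
A toy sketch of the kind of regex redaction used in stages 1 and 3. The patterns here are illustrative only; the production pipeline uses far larger lists:

import re

# Hypothetical patterns; real coverage is much broader
PATTERNS = [
    (re.compile(r"\(?(?:NASDAQ|NYSE|OTC):\s*[A-Z]{1,5}\)?"), "[TICKER]"),
    (re.compile(r"\bNCT\d{8}\b"), "[TRIAL_ID]"),
    (re.compile(r"https?://\S+"), ""),
]

def regex_redact(text):
    for pattern, token in PATTERNS:
        text = pattern.sub(token, text)
    return text

print(regex_redact("Acme (NASDAQ: ACME) reported NCT01234567 results."))
# -> "Acme [TICKER] reported [TRIAL_ID] results."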

Placeholder Tokens

Token              | What It Replaces
[COMPANY]          | Company/sponsor names
[DRUG]             | Drug names, brand names
[TICKER]           | Stock ticker symbols
[DATE]             | Specific dates
[EXECUTIVE]        | Named executives, investigators
[TRIAL_NAME]       | Clinical trial names (e.g., KEYNOTE-XXX)
[LOCATION]         | Cities, states, countries
[INSTITUTION]      | Hospitals, universities, research centers
[TRIAL_ID]         | NCT identifiers
[CONFERENCE]       | Medical conferences (ASCO, AACR, etc.)
[FINANCIAL_DETAIL] | Revenue, pricing, financial projections

What Is Preserved

The pipeline preserves scientific content needed for reasoning: efficacy data (response rates, survival, p-values), safety signals, trial design details, mechanism of action (generalized), and regulatory context.

Validation Results

We evaluated de-identification quality by asking GPT-5 to re-identify companies from de-identified text:

  • 53% ticker re-identification (169/317 cases) — GPT-5 could guess the company from scientific context
  • 4.8% price recall within 2% — Models could not recover actual stock price impacts
  • MAE 15.6% / Pearson 0.465 — Price predictions showed no meaningful recall of actual outcomes

We do our best to mitigate re-identification while preserving enough scientific content for models to reason about each catalyst.

Leaderboard Ranking

Submissions are ranked by exact match accuracy. You must submit at least 10 predictions to appear on the leaderboard.

Verify Before Submitting

Use the verify endpoint to test your predictions without saving them:

POST /api/benchmark/verify
{
  "predictions": [
    {
      "case_id": "onc_0001",
      "predicted_impact": "positive",
      "confidence": 0.85
    }
  ]
}
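
A minimal Python client for this endpoint; the host shown is a placeholder, and any required auth headers are omitted:

import requests

# Replace the host with the real API base URL
resp = requests.post(
    "https://api.example.com/api/benchmark/verify",
    json={"predictions": [
        {"case_id": "onc_0001", "predicted_impact": "positive", "confidence": 0.85},
    ]},
    timeout=30,
)
resp.raise_for_status()
print(resp.json()["metrics"])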

The response includes per-case results so you can debug individual predictions:

{
  "metrics": {
    "cases_evaluated": 168,
    "exact_match_accuracy": 28.5,
    "directional_accuracy": 62.3,
    "close_accuracy": 55.0,
    "avg_confidence": 0.72
  },
  "results": [
    {
      "case_id": "onc_0001",
      "predicted_impact": "positive",
      "actual_impact": "positive",
      "percent_change": 12.5,
      "exact_match": true,
      "close_match": true,
      "direction_correct": true
    }
  ]
}