Scoring & Metrics
How predictions are scored and what metrics are reported.
Adjusted Score
Each catalyst's ground truth is based on an adjusted score, not the raw stock price change. The adjusted score weights the percentage move by the company's market capitalization to normalize for volatility differences across micro-cap and large-cap stocks.
adjusted_score = clamp(percent_change × multiplier / 5, -10, 10)
multiplier = clamp(1 + 0.5 × log₁₀(market_cap / $1B), 0.25, 3.0)

| Market Cap | Multiplier | Effect |
|---|---|---|
| $50M (micro-cap) | 0.25 (min) | Heavily discounted — high volatility is common |
| $1B (mid-cap) | 1.0 | Baseline weighting |
| $10B (large-cap) | 1.5 | Amplified — large-cap moves are more significant |
| $100B (mega-cap) | 3.0 (max) | Even small moves carry weight |
Example: A +20% move in a $50M micro-cap produces an adjusted score of 20 × 0.25 / 5 = 1.0, while a +5% move in a $100B mega-cap produces 5 × 3.0 / 5 = 3.0; the smaller percentage move scores higher because large-cap moves carry more weight.
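For concreteness, here is a small Python sketch of the two formulas above; the function names are ours, not part of the benchmark API.

```python
import math

def market_cap_multiplier(market_cap_usd: float) -> float:
    """Market-cap weight, clamped to [0.25, 3.0]."""
    raw = 1 + 0.5 * math.log10(market_cap_usd / 1e9)
    return max(0.25, min(3.0, raw))

def adjusted_score(percent_change: float, market_cap_usd: float) -> float:
    """Market-cap-weighted score, clamped to [-10, 10]."""
    raw = percent_change * market_cap_multiplier(market_cap_usd) / 5
    return max(-10.0, min(10.0, raw))

# A +10% move in a $10B large-cap: multiplier 1.5, adjusted score 3.0
print(adjusted_score(10, 10e9))
```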
7-Category Impact Scale
Impact categories are derived from the adjusted score:
| Category | Adjusted Score Range |
|---|---|
| very_negative | < -3 |
| negative | -3 to -1 |
| slightly_negative | -1 to -0.4 |
| neutral | -0.4 to +0.4 |
| slightly_positive | +0.4 to +1 |
| positive | +1 to +3 |
| very_positive | > +3 |
These categories form an ordinal scale from 0 (very_negative) to 6 (very_positive), used for categorical evaluation metrics.
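A minimal Python sketch of this mapping follows. Note that the table does not specify which category a score sitting exactly on a boundary belongs to, so the comparisons below are an assumption.

```python
CATEGORIES = ["very_negative", "negative", "slightly_negative", "neutral",
              "slightly_positive", "positive", "very_positive"]

def impact_category(adjusted: float) -> str:
    """Map an adjusted score to one of the 7 categories.
    Boundary handling (exact cutoff values) is an assumption."""
    if adjusted < -3:    return "very_negative"
    if adjusted < -1:    return "negative"
    if adjusted < -0.4:  return "slightly_negative"
    if adjusted <= 0.4:  return "neutral"
    if adjusted <= 1:    return "slightly_positive"
    if adjusted <= 3:    return "positive"
    return "very_positive"

def category_order(category: str) -> int:
    """Ordinal index 0..6 used by the categorical metrics."""
    return CATEGORIES.index(category)
```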
Metrics
Every submission is evaluated on three categorical accuracy metrics, plus an optional MAE and a direction confusion matrix:
Exact Match Accuracy
The percentage of predictions that exactly match the ground truth category.
exact_match_accuracy = (exact_matches / total_predictions) * 100
Directional Accuracy
The percentage of predictions that get the direction correct:
- Positive: slightly_positive, positive, very_positive
- Negative: slightly_negative, negative, very_negative
- Neutral: neutral
A prediction is directionally correct if both the predicted and actual impact fall in the same direction group.
Close Match Accuracy
The percentage of predictions that are within one category of the ground truth on the ordinal scale:
very_negative (0) → negative (1) → slightly_negative (2) → neutral (3) → slightly_positive (4) → positive (5) → very_positive (6)
A prediction is a "close match" if |predicted_order - actual_order| <= 1.
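Taken together, the three categorical metrics can be reproduced locally with a short sketch like the one below; the ordering and direction groups follow the definitions in this section, and the helper names are ours.

```python
ORDER = {c: i for i, c in enumerate([
    "very_negative", "negative", "slightly_negative", "neutral",
    "slightly_positive", "positive", "very_positive"])}

def direction(category: str) -> str:
    """Collapse a 7-way category into positive / neutral / negative."""
    if category == "neutral":
        return "neutral"
    return "negative" if "negative" in category else "positive"

def accuracy_metrics(pairs):
    """pairs: list of (predicted_category, actual_category) tuples."""
    n = len(pairs)
    exact = sum(p == a for p, a in pairs)
    directional = sum(direction(p) == direction(a) for p, a in pairs)
    close = sum(abs(ORDER[p] - ORDER[a]) <= 1 for p, a in pairs)
    return {
        "exact_match_accuracy": 100 * exact / n,
        "directional_accuracy": 100 * directional / n,
        "close_accuracy": 100 * close / n,
    }
```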
Mean Absolute Error (MAE)
If you submit predicted_score (a numeric % change prediction) alongside your categorical prediction, MAE measures the average absolute difference between your predicted score and the actual adjusted score.
MAE = mean(|predicted_score - actual_adjusted_score|)
Confusion Matrix
The verify endpoint returns a direction confusion matrix showing how your predictions distribute across actual directions:
{
"direction_confusion_matrix": {
"positive": { "positive": 45, "neutral": 5, "negative": 2 },
"neutral": { "positive": 8, "neutral": 12, "negative": 6 },
"negative": { "positive": 3, "neutral": 7, "negative": 35 }
}
}
Rows = actual direction, columns = predicted direction.
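For local debugging, MAE and the confusion matrix can be tallied the same way; the sketch below mirrors the nested-dict shape of the JSON response but is our own illustration, not the server's implementation.

```python
from collections import defaultdict

def mae(pairs):
    """pairs: (predicted_score, actual_adjusted_score) for cases with a numeric prediction."""
    return sum(abs(p - a) for p, a in pairs) / len(pairs)

def direction_confusion_matrix(pairs):
    """pairs: (predicted_direction, actual_direction); rows are actual, columns are predicted."""
    matrix = defaultdict(lambda: defaultdict(int))
    for predicted, actual in pairs:
        matrix[actual][predicted] += 1
    return {actual: dict(cols) for actual, cols in matrix.items()}
```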
De-identification Pipeline
Before AI models score cases, each press release goes through a three-stage de-identification pipeline to prevent models from simply recalling memorized stock outcomes.
Pipeline Stages
1. Regex pre-processing — Strips wire service attributions, location datelines, URLs, email addresses, and stock exchange references. Pre-redacts known trial names (150+), institution names (60+), city names (100+), and conference names (30+) using comprehensive regex lists.
2. LLM redaction (GPT-5) — Identifies and replaces all remaining identifying information with standardized placeholder tokens. Generalizes unique identifiers (first-in-class mechanisms, unique MOAs) to broader categories.
3. Regex post-processing — Catches any remaining identifiers missed by the LLM: trial names, city names, institution names, companion diagnostics, and stock exchange references.
Placeholder Tokens
| Token | What It Replaces |
|---|---|
| [COMPANY] | Company/sponsor names |
| [DRUG] | Drug names, brand names |
| [TICKER] | Stock ticker symbols |
| [DATE] | Specific dates |
| [EXECUTIVE] | Named executives, investigators |
| [TRIAL_NAME] | Clinical trial names (e.g., KEYNOTE-XXX) |
| [LOCATION] | Cities, states, countries |
| [INSTITUTION] | Hospitals, universities, research centers |
| [TRIAL_ID] | NCT identifiers |
| [CONFERENCE] | Medical conferences (ASCO, AACR, etc.) |
| [FINANCIAL_DETAIL] | Revenue, pricing, financial projections |
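To illustrate how the regex stages (1 and 3) substitute these tokens, a toy redaction pass might look like the following. The patterns are illustrative stand-ins; the real pipeline uses much larger curated lists.

```python
import re

# Illustrative patterns only; the actual pipeline uses far more extensive lists.
REDACTION_PATTERNS = [
    (re.compile(r"\bNCT\d{8}\b"), "[TRIAL_ID]"),                  # NCT identifiers
    (re.compile(r"\b(NASDAQ|NYSE):\s*[A-Z]{1,5}\b"), "[TICKER]"), # exchange:ticker references
    (re.compile(r"\bKEYNOTE-\d+\b"), "[TRIAL_NAME]"),             # example known trial name
    (re.compile(r"\b(ASCO|AACR|ESMO|ASH)\b"), "[CONFERENCE]"),    # example conference names
    (re.compile(r"https?://\S+"), ""),                            # strip URLs
]

def regex_redact(text: str) -> str:
    """Apply each substitution in turn and return the redacted text."""
    for pattern, replacement in REDACTION_PATTERNS:
        text = pattern.sub(replacement, text)
    return text
```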
What Is Preserved
The pipeline preserves scientific content needed for reasoning: efficacy data (response rates, survival, p-values), safety signals, trial design details, mechanism of action (generalized), and regulatory context.
Validation Results
We evaluated de-identification quality by asking GPT-5 to re-identify companies from de-identified text:
- 53% ticker re-identification (169/317 cases) — GPT-5 could guess the company from scientific context
- 4.8% price recall within 2% — Models could not recover actual stock price impacts
- MAE 15.6% / Pearson 0.465 — Price predictions showed no meaningful recall of actual outcomes
We do our best to mitigate re-identification while preserving enough scientific content for models to reason about each catalyst.
Leaderboard Ranking
Submissions are ranked by exact match accuracy. You must submit at least 10 predictions to appear on the leaderboard.
Verify Before Submitting
Use the verify endpoint to test your predictions without saving them:
POST /api/benchmark/verify

{
"predictions": [
{
"case_id": "onc_0001",
"predicted_impact": "positive",
"confidence": 0.85
}
]
}
The response includes per-case results so you can debug individual predictions:
{
"metrics": {
"cases_evaluated": 168,
"exact_match_accuracy": 28.5,
"directional_accuracy": 62.3,
"close_accuracy": 55.0,
"avg_confidence": 0.72
},
"results": [
{
"case_id": "onc_0001",
"predicted_impact": "positive",
"actual_impact": "positive",
"percent_change": 12.5,
"exact_match": true,
"close_match": true,
"direction_correct": true
}
]
}
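For reference, a minimal client sketch using Python's requests library; the base URL is a placeholder, and any authentication your setup requires is omitted.

```python
import requests

BASE_URL = "https://example.com"  # placeholder: substitute the benchmark's actual host

payload = {
    "predictions": [
        {"case_id": "onc_0001", "predicted_impact": "positive", "confidence": 0.85},
    ]
}

# Verify without saving, then inspect aggregate metrics and per-case results.
response = requests.post(f"{BASE_URL}/api/benchmark/verify", json=payload, timeout=30)
response.raise_for_status()
report = response.json()

print(report["metrics"]["exact_match_accuracy"])
for case in report["results"]:
    if not case["direction_correct"]:
        print("wrong direction:", case["case_id"])
```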