BioTradingArena

Submitting Results

How to submit your benchmark results to the leaderboard.

Submission Flow

  1. Run your strategy on all benchmark cases
  2. Use /api/benchmark/verify to check your scores
  3. When satisfied, submit via /api/benchmark/submit
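
Step 2 above can be sketched as a dry run against /api/benchmark/verify. The request body here is an assumption: it mirrors the /api/benchmark/submit payload shown below, since this page does not document the verify schema separately.

```python
BASE_URL = "https://biotradingarena.com"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def build_payload(predictions):
    """Assemble the JSON body shared by verify and submit.

    Assumption: /api/benchmark/verify accepts the same `predictions`
    list as /api/benchmark/submit; its exact schema is not shown here.
    """
    return {"predictions": list(predictions)}

payload = build_payload([
    {"case_id": "onc_0001", "predicted_impact": "positive"},
])

# With a real API key (requires the `requests` package):
# resp = requests.post(f"{BASE_URL}/api/benchmark/verify",
#                      headers=HEADERS, json=payload)
# print(resp.json())
```

Keeping payload construction in a helper lets you reuse it unchanged when you move from verify to submit.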

Submission Format

import requests

BASE_URL = "https://biotradingarena.com"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

# Submit your predictions
resp = requests.post(
    f"{BASE_URL}/api/benchmark/submit",
    headers=headers,
    json={
        "strategy_name": "My Strategy v1",
        "description": "LLM-based catalyst impact classifier using press releases and trial data",
        "model": "gpt-4o",
        "predictions": [
            {
                "case_id": "onc_0001",
                "predicted_impact": "positive",
                "predicted_score": 12.0,  # optional numeric prediction
                "confidence": 0.85,       # optional confidence
            },
            # ... more predictions
        ],
    },
)

resp.raise_for_status()  # surface auth or validation errors before parsing
result = resp.json()
print(f"Submission ID: {result['submission_id']}")
print(f"Exact Match: {result['metrics']['exact_match_accuracy']}%")
print(f"Directional: {result['metrics']['directional_accuracy']}%")

Prediction Types

You can submit two types of predictions per case:

Categorical (predicted_impact)

One of 7 impact categories:

  • very_negative
  • negative
  • slightly_negative
  • neutral
  • slightly_positive
  • positive
  • very_positive
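
If your strategy produces a numeric forecast first, you may want to derive the categorical label from it. The cutoffs below are purely illustrative; the benchmark does not publish the thresholds that separate these categories.

```python
# Hypothetical cutoffs (upper bound in percent, exclusive) -> label.
# These are NOT the benchmark's actual bins; tune to your own strategy.
CATEGORY_BINS = [
    (-15.0, "very_negative"),
    (-5.0, "negative"),
    (-1.0, "slightly_negative"),
    (1.0, "neutral"),
    (5.0, "slightly_positive"),
    (15.0, "positive"),
]

def to_category(pct_change: float) -> str:
    """Map a predicted percent move to one of the 7 impact labels."""
    for upper, label in CATEGORY_BINS:
        if pct_change < upper:
            return label
    return "very_positive"
```

For example, `to_category(12.0)` yields `"positive"` under these hypothetical bins.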

Numeric (predicted_score)

A numeric percentage change prediction (e.g., 12.5 for +12.5%). This is scored separately using Mean Absolute Error (MAE).

You can submit both for the same case.
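
MAE over your numeric predictions is the mean of the absolute errors, which you can compute locally as a sanity check before submitting (the leaderboard computes it server-side against actuals you don't see in advance):

```python
def mean_absolute_error(predicted, actual):
    """MAE: mean of |predicted - actual| over paired numeric scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Errors are |12.5 - 10.0| = 2.5 and |-3.0 - (-5.0)| = 2.0, so MAE = 2.25
mae = mean_absolute_error([12.5, -3.0], [10.0, -5.0])
```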

Leaderboard Requirements

  • Submit predictions for at least 10 cases to appear on the leaderboard
  • Submissions are ranked by exact match accuracy
  • Each submission is recorded separately — you can submit multiple times with different strategies
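
A cheap client-side check before submitting, based on the 10-case minimum above (the helper name is ours, not part of the API):

```python
MIN_CASES_FOR_LEADERBOARD = 10  # per the leaderboard requirements

def ready_for_leaderboard(predictions: list) -> bool:
    """Return True if the submission meets the minimum case count."""
    return len(predictions) >= MIN_CASES_FOR_LEADERBOARD

ok = ready_for_leaderboard(
    [{"case_id": f"onc_{i:04d}", "predicted_impact": "neutral"}
     for i in range(1, 13)]
)
```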

Tips

  • Start by verifying with a small subset to debug your pipeline
  • Use the confidence field to track which predictions your model is most/least certain about
  • The reasoning field (in verify) helps debug individual predictions
  • Submit the full oncology benchmark (168 cases) for the most meaningful comparison