BioTradingArena

Submitting Results

How to submit your benchmark results to the leaderboard.

Submission Flow

  1. Run your strategy on all benchmark cases
  2. Use /api/benchmark/verify to check your scores
  3. When satisfied, submit via /api/benchmark/submit
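
Step 2 above can be sketched as a dry run against /api/benchmark/verify. The request body here is an assumption: it mirrors the /api/benchmark/submit payload shown below, since this page does not document the verify schema separately.

```python
BASE_URL = "https://biotradingarena.com"
HEADERS = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

def build_payload(predictions):
    """Assemble the JSON body shared by verify and submit.

    Assumption: /api/benchmark/verify accepts the same `predictions`
    list as /api/benchmark/submit; its exact schema is not shown here.
    """
    return {"predictions": list(predictions)}

payload = build_payload([
    {"case_id": "onc_0001", "predicted_impact": "positive"},
])

# With a real API key (requires the `requests` package):
# resp = requests.post(f"{BASE_URL}/api/benchmark/verify",
#                      headers=HEADERS, json=payload)
# print(resp.json())
```

Keeping payload construction in a helper lets you reuse it unchanged when you move from verify to submit.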

Submission Format

import requests

BASE_URL = "https://biotradingarena.com"
headers = {
    "Authorization": "Bearer YOUR_API_KEY",
    "Content-Type": "application/json",
}

# Submit your predictions
resp = requests.post(
    f"{BASE_URL}/api/benchmark/submit",
    headers=headers,
    json={
        "strategy_name": "My Strategy v1",
        "description": "LLM-based catalyst impact classifier using press releases and trial data",
        "model": "gpt-4o",
        "predictions": [
            {
                "case_id": "onc_0001",
                "predicted_impact": "positive",
                "predicted_score": 12.0,  # optional numeric prediction
                "confidence": 0.85,       # optional confidence
            },
            # ... more predictions
        ],
    },
)

resp.raise_for_status()  # surface auth or validation errors before parsing
result = resp.json()
print(f"Submission ID: {result['submission_id']}")
print(f"Exact Match: {result['metrics']['exact_match_accuracy']}%")
print(f"Directional: {result['metrics']['directional_accuracy']}%")

Prediction Types

You can submit two types of predictions per case:

Categorical (predicted_impact)

One of 7 impact categories:

  • very_negative
  • negative
  • slightly_negative
  • neutral
  • slightly_positive
  • positive
  • very_positive
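
If your strategy produces a numeric forecast first, you may want to derive the categorical label from it. The cutoffs below are purely illustrative; the benchmark does not publish the thresholds that separate these categories.

```python
# Hypothetical cutoffs (upper bound in percent, exclusive) -> label.
# These are NOT the benchmark's actual bins; tune to your own strategy.
CATEGORY_BINS = [
    (-15.0, "very_negative"),
    (-5.0, "negative"),
    (-1.0, "slightly_negative"),
    (1.0, "neutral"),
    (5.0, "slightly_positive"),
    (15.0, "positive"),
]

def to_category(pct_change: float) -> str:
    """Map a predicted percent move to one of the 7 impact labels."""
    for upper, label in CATEGORY_BINS:
        if pct_change < upper:
            return label
    return "very_positive"
```

For example, `to_category(12.0)` yields `"positive"` under these hypothetical bins.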

Numeric (predicted_score)

A numeric percentage change prediction (e.g., 12.5 for +12.5%). This is scored separately using Mean Absolute Error (MAE).

You can submit both for the same case.
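
MAE over your numeric predictions is the mean of the absolute errors, which you can compute locally as a sanity check before submitting (the leaderboard computes it server-side against actuals you don't see in advance):

```python
def mean_absolute_error(predicted, actual):
    """MAE: mean of |predicted - actual| over paired numeric scores."""
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("need two equal-length, non-empty sequences")
    return sum(abs(p - a) for p, a in zip(predicted, actual)) / len(predicted)

# Errors are |12.5 - 10.0| = 2.5 and |-3.0 - (-5.0)| = 2.0, so MAE = 2.25
mae = mean_absolute_error([12.5, -3.0], [10.0, -5.0])
```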

Leaderboard Requirements

  • Submit predictions for at least 10 cases to appear on the leaderboard
  • Submissions are ranked by exact match accuracy
  • Each submission is recorded separately — you can submit multiple times with different strategies
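
A cheap client-side check before submitting, based on the 10-case minimum above (the helper name is ours, not part of the API):

```python
MIN_CASES_FOR_LEADERBOARD = 10  # per the leaderboard requirements

def ready_for_leaderboard(predictions: list) -> bool:
    """Return True if the submission meets the minimum case count."""
    return len(predictions) >= MIN_CASES_FOR_LEADERBOARD

ok = ready_for_leaderboard(
    [{"case_id": f"onc_{i:04d}", "predicted_impact": "neutral"}
     for i in range(1, 13)]
)
```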

Tips

  • Start by verifying with a small subset to debug your pipeline
  • Use the confidence field to track which predictions your model is most/least certain about
  • The reasoning field (in verify) helps debug individual predictions
  • Submit the full oncology benchmark (168 cases) for the most meaningful comparison