New Evaluation Run
Leaderboard
| Status | Label / Eval ID | Model | Dataset | Questions | Accuracy | Date |
|---|---|---|---|---|---|---|
GAIA (General AI Assistants) evaluates real-world AI assistant capability. Every question requires a chain of concrete actions — searching the web, reading documents, parsing data — to produce a single verifiable short answer.
Paper: arxiv.org/abs/2311.12983 · Dataset: huggingface.co/datasets/gaia-benchmark/GAIA
Submit submission.jsonl to the HF leaderboard for an official score.
The model must output FINAL ANSWER: <value>; the value is then normalised and compared to ground truth. No LLM-as-judge.
| Rule | Example |
|---|---|
| Numbers normalised | 42,000 = 42000 = 42000.0 |
| Case-insensitive | Paris = paris |
| Comma lists order-insensitive | a, b, c = c, a, b |
| Known failure: 17000 vs 17 | Model added units; enforced in default prompt |
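The scoring rules in the table can be sketched roughly as follows. This is a minimal illustration built only from the rules listed here, not the official GAIA scorer; the function names (`extract_final_answer`, `normalise_number`, `is_correct`) are hypothetical, and treating commas as thousands separators only in groups of three digits is an assumption.

```python
import re
from typing import Optional

# Assumption: the answer is whatever follows the last "FINAL ANSWER:" on its line.
FINAL_RE = re.compile(r"FINAL ANSWER:\s*(.+)", re.IGNORECASE)
# Plain numbers, or numbers with comma thousands separators (e.g. 42,000).
NUMBER_RE = re.compile(r"^-?(\d{1,3}(,\d{3})+|\d+)(\.\d+)?$")


def extract_final_answer(model_output: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' marker, or None."""
    matches = FINAL_RE.findall(model_output)
    return matches[-1].strip() if matches else None


def normalise_number(text: str) -> Optional[float]:
    """Parse '42,000' or '42000.0' to 42000.0; return None if not numeric."""
    cleaned = text.strip()
    if not NUMBER_RE.match(cleaned):
        return None
    return float(cleaned.replace(",", ""))


def normalise_text(text: str) -> str:
    """Lower-case and collapse whitespace."""
    return " ".join(text.lower().split())


def is_correct(prediction: str, truth: str) -> bool:
    """Compare a predicted value to ground truth using the table's rules."""
    truth_num = normalise_number(truth)
    if truth_num is not None:
        # Numbers normalised: 42,000 = 42000 = 42000.0
        return normalise_number(prediction) == truth_num
    if "," in truth:
        # Comma lists are order-insensitive: a, b, c = c, a, b
        pred_items = sorted(normalise_text(p) for p in prediction.split(","))
        truth_items = sorted(normalise_text(t) for t in truth.split(","))
        return pred_items == truth_items
    # Case-insensitive: Paris = paris
    return normalise_text(prediction) == normalise_text(truth)
```

With this sketch, `is_correct("17", "17000")` is false, which is the known failure in the last table row: there is no LLM judge to forgive an answer that was scaled by a unit.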
An 8-question sample is too small to be conclusive. Main gaps versus top agents: code execution for arithmetic-heavy Level 2 and Level 3 questions, and YouTube/video questions.
The known failure (17000 vs 17) is handled by "digits only, no separators" in the default GAIA prompt.
bench_traces are raw SFT/DPO fine-tuning material: every correct answer plus its tool chain is a training example.
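As a rough sketch of how correct traces could become SFT examples: the trace schema below (`question`, `steps` with `tool`/`input`/`output`, `answer`, `correct`) and the function name `traces_to_sft` are hypothetical, since the actual bench_traces layout is not described here. Only the SFT case is shown; DPO would additionally need a rejected completion for each prompt.

```python
import json
from typing import Iterable


def traces_to_sft(traces: Iterable[dict]) -> list[dict]:
    """Keep only correct traces and flatten each into a prompt/completion pair.

    Assumed (hypothetical) trace shape:
      {"question": str, "steps": [{"tool": str, "input": ..., "output": str}],
       "answer": str, "correct": bool}
    """
    examples = []
    for trace in traces:
        if not trace.get("correct"):
            continue  # incorrect answers are not used as SFT targets
        tool_chain = "\n".join(
            f"{step['tool']}({json.dumps(step['input'])}) -> {step['output']}"
            for step in trace["steps"]
        )
        completion = f"{tool_chain}\nFINAL ANSWER: {trace['answer']}"
        examples.append({"prompt": trace["question"], "completion": completion})
    return examples
```

Each returned dict can be written as one line of a JSONL training file with `json.dumps`.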