Evaluation suite & results tracker

GAIA Benchmark — Overview

What is GAIA?

GAIA (General AI Assistants) evaluates real-world AI assistant capability. Every question requires a chain of concrete actions — searching the web, reading documents, parsing data — to produce a single verifiable short answer.

Paper: arxiv.org/abs/2311.12983  ·  Dataset: huggingface.co/datasets/gaia-benchmark/GAIA

Splits

Validation (165 questions)
Ground-truth answers are public. Use for local scoring, model comparisons, and prompt tuning. Populates the leaderboard in this dashboard.
Test (301 questions)
Answers hidden. Submit submission.jsonl to the HF leaderboard for an official score.
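A minimal sketch of producing submission.jsonl, assuming the leaderboard expects one JSON object per line with task_id and model_answer fields. Those field names are an assumption here (check the dataset card for the current schema), and write_submission is a hypothetical helper, not part of this dashboard.

```python
import json
from pathlib import Path


def write_submission(answers: dict[str, str], path: str = "submission.jsonl") -> int:
    """Write one JSON object per line and return the number of lines written.

    The "task_id" / "model_answer" field names are assumed, not confirmed
    by this dashboard.
    """
    lines = [
        json.dumps({"task_id": task_id, "model_answer": answer}, ensure_ascii=False)
        for task_id, answer in answers.items()
    ]
    # One record per line, each terminated by a newline (JSON Lines).
    Path(path).write_text("".join(line + "\n" for line in lines), encoding="utf-8")
    return len(lines)
```

Calling write_submission({"some-task-id": "42"}) would write a single line to submission.jsonl in the working directory.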

Difficulty Levels

L1
53 questions. Single-hop web search or document read. 1–2 tool calls typically enough. Expect 70–90% with a capable agent. Good for fast iteration.
L2
86 questions. Multi-hop reasoning: cross-reference sources, extract from PDFs/spreadsheets/images, chain multiple searches. Where agents diverge most.
L3
26 questions. Expert planning, code execution, or specialised knowledge over many steps. Aspirational for most current agents.

Attachment types (38 of 165 validation questions have a file)

📄 PDF / DOCX / PPTX
Text extracted server-side, injected into context
📊 XLSX / CSV
Tab-separated rows (up to 20k rows)
🖼 Images
Vision model — Gemini 3 Flash or better required
🎵 MP3 / WAV
Whisper transcription (needs OPENAI_API_KEY)
🐍 .py / .json / .txt
Uploaded as text/plain, full content in context
🗜 ZIP
Extracted, each member uploaded individually
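As an illustration of the XLSX / CSV row above, here is one way a CSV attachment could be flattened into tab-separated rows with a row cap. This is a sketch only: the server-side code is not shown on this page, csv_to_tsv is a hypothetical name, and "20k" is read as 20,000.

```python
import csv
import io

MAX_ROWS = 20_000  # assumption: "20k rows" read as 20,000


def csv_to_tsv(csv_text: str, max_rows: int = MAX_ROWS) -> str:
    """Convert CSV text to tab-separated lines, keeping at most max_rows rows."""
    reader = csv.reader(io.StringIO(csv_text))
    out = []
    for i, row in enumerate(reader):
        if i >= max_rows:
            break  # drop everything past the cap
        # Replace embedded tabs and newlines so each record stays on one line.
        cells = (cell.replace("\t", " ").replace("\n", " ") for cell in row)
        out.append("\t".join(cells))
    return "\n".join(out)
```

Whether the real pipeline counts the header row toward the cap is unknown; this sketch does.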

Scoring — quasi-exact match

The model must output FINAL ANSWER: <value> — then the value is normalised and compared to ground truth. No LLM-as-judge.

Rule                            Example
Numbers normalised              42,000 = 42000 = 42000.0
Case-insensitive                Paris = paris
Comma lists order-insensitive   a, b, c = c, a, b
Known failure: 17000 vs 17      Model added units; enforced in default prompt
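The rules above can be sketched as a small scorer. This is an illustration of the listed rules, not the dashboard's actual implementation; the function names are hypothetical, and the order of checks (number first, then comma list, then plain string) is an assumption.

```python
import re
from typing import Optional

FINAL_RE = re.compile(r"FINAL ANSWER:\s*(.+)")


def extract_final_answer(output: str) -> Optional[str]:
    """Return the text after the last 'FINAL ANSWER:' marker, or None."""
    matches = FINAL_RE.findall(output)
    return matches[-1].strip() if matches else None


def as_number(text: str) -> Optional[float]:
    """Parse '42,000' or '42000.0' as a float; None if the text is not numeric."""
    try:
        return float(text.replace(",", "").strip())
    except ValueError:
        return None


def normalise_item(text: str) -> str:
    """Canonical form of one value: numeric repr, or lowercased text."""
    number = as_number(text)
    if number is not None:
        return repr(number)
    return " ".join(text.lower().split())


def is_match(predicted: str, truth: str) -> bool:
    """Quasi-exact match: numbers, then comma lists, then plain strings."""
    truth_number = as_number(truth)
    if truth_number is not None:
        return as_number(predicted) == truth_number
    if "," in truth:
        def items(s: str) -> list:
            return sorted(normalise_item(part) for part in s.split(","))
        return items(predicted) == items(truth)  # order-insensitive
    return normalise_item(predicted) == normalise_item(truth)
```

With this sketch, is_match("42,000", "42000.0") and is_match("c, a, b", "a, b, c") both hold, while is_match("17000", "17") does not. One limitation of the check order: a ground truth such as "1,2,3" written without spaces would be read as the number 123 rather than as a list.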

Iris vs leaderboard

87.5%
Gemini 3.5 Flash / 3.1 Pro / DeepSeek V4 Flash (Iris, 8q sample)
62.5%
Gemini 3 Flash (Iris, 8q sample)
~74%
Top open agents (GPT-4o + tools)
~92%
OpenAI o3 (leaderboard #1)

The 8-question sample is too small to be conclusive. The main gaps versus top agents are code execution for arithmetic-heavy L2/L3 questions and YouTube/video questions.
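To make the sample-size caveat concrete: on 8 questions, 87.5% is 7 correct and 62.5% is 5 correct, a difference of two questions. The sketch below computes a 95% Wilson score interval for each; the choice of the Wilson interval is an assumption made here for illustration, not something the dashboard reports.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


if __name__ == "__main__":
    for correct in (7, 5):
        lo, hi = wilson_interval(correct, 8)
        print(f"{correct}/8: {lo:.1%} to {hi:.1%}")
```

The two intervals come out at roughly 53% to 98% and 31% to 86%. They overlap heavily, so the sample cannot separate the two groups of models.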

Quick tips

  • Run L1 only first (53 questions, ~30 min) to baseline a new model or prompt.
  • The known number-normalisation failure (17000 vs 17) is handled by the "digits only, no separators" instruction in the default GAIA prompt.
  • Keep max tool rounds ≥ 15 — reduce only if you need faster/cheaper runs at the cost of accuracy.
  • Enable pubmed_search in Agent Config for scientific literature questions.
  • CLI runs always default to DeepSeek V4 Flash. Use Agent Config or the eval suite UI for model experiments.
  • Reasoning traces in bench_traces are raw SFT/DPO fine-tuning material — every correct answer + tool chain is a training example.