Iris Benchmarks

New Evaluation Run

Dataset

Agent

Model (overrides agent default)

Level filter

Attachments

Limit (0 = all)

Concurrency

Rate limit (req/s, 0 = unlimited)

Label (optional)

Leaderboard

Status	Label / Eval ID	Model	Dataset	Questions	Accuracy	Date

By level

By category

GAIA Benchmark — Overview

What is GAIA?

GAIA (General AI Assistants) evaluates real-world AI assistant capability. Every question requires a chain of concrete actions — searching the web, reading documents, parsing data — to produce a single verifiable short answer.

Paper: arxiv.org/abs/2311.12983 · Dataset: huggingface.co/datasets/gaia-benchmark/GAIA

Splits

Validation (165 questions)

Ground-truth answers are public. Use for local scoring, model comparisons, and prompt tuning. Populates the leaderboard in this dashboard.

Test (301 questions)

Answers are hidden. Submit submission.jsonl to the Hugging Face leaderboard for an official score.
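A minimal sketch of writing submission.jsonl, one JSON object per line. The field names `task_id` and `model_answer` are an assumption about what the leaderboard expects and should be checked against its submission instructions. The `write_submission` helper is hypothetical and not part of Iris.

```python
import json


def write_submission(answers, path="submission.jsonl"):
    """Write one JSON object per line, one line per answered question.

    `answers` maps a question's task id to the model's final answer string.
    The keys "task_id" and "model_answer" are assumed, not confirmed here.
    """
    with open(path, "w", encoding="utf-8") as fh:
        for task_id, model_answer in answers.items():
            record = {"task_id": task_id, "model_answer": model_answer}
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Usage: `write_submission({"some-task-id": "42"})` produces a file with a single line.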

Difficulty Levels

Level 1: 53 questions. Single-hop web search or document read; 1–2 tool calls are typically enough. Expect 70–90% with a capable agent. Good for fast iteration.

Level 2: 86 questions. Multi-hop reasoning: cross-reference sources, extract from PDFs/spreadsheets/images, chain multiple searches. This is where agents diverge most.

Level 3: 26 questions. Expert planning, code execution, or specialised knowledge over many steps. Aspirational for most current agents.

Attachment types (38 of 165 validation questions have a file)

Type	Handling
📄 PDF / DOCX / PPTX	Text extracted server-side, injected into context
📊 XLSX / CSV	Tab-separated rows (up to 20k rows)
🖼 Images	Vision model (Gemini 3 Flash or better required)
🎵 MP3 / WAV	Whisper transcription (needs OPENAI_API_KEY)
🐍 .py / .json / .txt	Uploaded as text/plain, full content in context
🗜 ZIP	Extracted, each member uploaded individually
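As an illustration of the spreadsheet handling described above, here is a rough sketch of turning CSV text into capped tab-separated rows. It is not the Iris implementation: `csv_to_tsv` is a hypothetical helper, and whether the header row counts toward the 20k cap is an assumption.

```python
import csv
import io

MAX_ROWS = 20_000  # cap quoted above; counting the header toward it is assumed


def csv_to_tsv(csv_text, max_rows=MAX_ROWS):
    """Convert CSV text to tab-separated lines, keeping at most max_rows rows."""
    reader = csv.reader(io.StringIO(csv_text))
    lines = []
    for index, row in enumerate(reader):
        if index >= max_rows:
            break
        # Flatten embedded tabs and newlines so each row stays on one line.
        cells = [cell.replace("\t", " ").replace("\n", " ") for cell in row]
        lines.append("\t".join(cells))
    return "\n".join(lines)
```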

Scoring — quasi-exact match

The model must output FINAL ANSWER: <value>. The value is then normalised and compared to the ground truth. No LLM-as-judge is involved.

Rule	Example
Numbers normalised	`42,000` = `42000` = `42000.0`
Case-insensitive	`Paris` = `paris`
Comma lists order-insensitive	`a, b, c` = `c, a, b`
Known failure: `17000` vs `17`	Model added units — enforced in default prompt
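The rules in the table can be sketched roughly as follows. This is a simplified illustration of the three rules as listed here, not the official GAIA scorer or the Iris scoring code. All function names are hypothetical, and the answer type is inferred from the ground truth, which is an assumption.

```python
import re

FINAL_RE = re.compile(r"FINAL ANSWER:\s*(.+)", re.IGNORECASE)


def extract_final_answer(text):
    """Return the text after the last 'FINAL ANSWER:' marker, or None."""
    matches = FINAL_RE.findall(text)
    return matches[-1].strip() if matches else None


def to_number(value):
    """Parse a number, ignoring thousands separators; None if not numeric."""
    try:
        return float(value.strip().replace(",", ""))
    except ValueError:
        return None


def quasi_exact_match(model_output, truth):
    """Compare the model's final answer with the ground truth."""
    answer = extract_final_answer(model_output)
    if answer is None:
        return False
    truth_number = to_number(truth)
    if truth_number is not None:
        # Numeric truth: 42,000 == 42000 == 42000.0
        return to_number(answer) == truth_number
    if "," in truth:
        # Comma list: compare items case-insensitively, ignoring order.
        def items(s):
            return sorted(part.strip().lower() for part in s.split(","))
        return items(answer) == items(truth)
    # Plain string: case-insensitive comparison.
    return answer.strip().lower() == truth.strip().lower()
```

For example, `quasi_exact_match("FINAL ANSWER: c, a, b", "a, b, c")` returns `True`, while an answer of `17000` against a ground truth of `17` returns `False`, which is the known failure in the last table row. One limitation of the sketch: a ground truth such as `1,2,3` with no spaces would be read as the number 123.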

Iris vs leaderboard

Accuracy	System
87.5%	Gemini 3.5 Flash / 3.1 Pro / DeepSeek V4 Flash (Iris, 8q sample)
62.5%	Gemini 3 Flash (Iris, 8q sample)
~74%	Top open agents (GPT-4o + tools)
~92%	OpenAI o3 (leaderboard #1)

8-question sample is too small to be conclusive. Main gaps vs top agents: code execution for arithmetic-heavy L2/L3, and YouTube/video questions.
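To illustrate why 8 questions is too few, here is a sketch that computes a 95% Wilson score interval for the two Iris results, taking 87.5% as 7 of 8 correct and 62.5% as 5 of 8. The choice of the Wilson interval is ours for illustration and is not something the dashboard computes.

```python
import math


def wilson_interval(correct, total, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


for correct in (7, 5):
    low, high = wilson_interval(correct, 8)
    print(f"{correct}/8 correct: {low:.1%} to {high:.1%}")
```

The two intervals come out at roughly 53% to 98% and 31% to 86%. They overlap heavily, so the 8-question sample cannot separate the two groups of models.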

Quick tips

Run L1 only first (~53 questions, ~30 min) to baseline a new model or prompt.
The number normalisation failure (17000 vs 17) is handled by "digits only, no separators" in the default GAIA prompt.
Keep max tool rounds ≥ 15 — reduce only if you need faster/cheaper runs at the cost of accuracy.
Enable pubmed_search in Agent Config for scientific literature questions.
CLI runs always default to DeepSeek V4 Flash. Use Agent Config or the eval suite UI for model experiments.
Reasoning traces in bench_traces are raw SFT/DPO fine-tuning material — every correct answer + tool chain is a training example.