Executive model board

Benchmarks

A current snapshot of the key models, what each is best for, and the public benchmark scores that matter most when you are deciding where to spend time and budget.

Last updated 24 May 2026

TL;DR

What to pick right now

GPT-5.5 now leads the quality index, but Gemini 3.1 Pro costs less and wins on reasoning. Claude Opus 4.7 remains best for high-stakes analysis. Open-weight models (Kimi K2.6, DeepSeek V4) are frontier-competitive for sensitive, cost-constrained work.

As of May 2026, GPT-5.5 (launched April 23) leads the Artificial Analysis Intelligence Index at 60, with Gemini 3.1 Pro and Claude Opus 4.7 close behind. Gemini 3.1 Pro leads on mathematical reasoning (HLE 44.4%) and GPQA Diamond (94.3%). Claude Opus 4.7 posts the best coding result (SWE-bench Pro 64.3%). Open-weight models such as Kimi K2.6 have narrowed the gap sharply, with Kimi K2.6 matching Gemini 3.1 Pro on SWE-bench Pro.

Best overall quality

GPT-5.5

Launched April 23, 2026. Now the leader on the Artificial Analysis Intelligence Index (score: 60). Reliable across reasoning, coding, and writing. The safest pick among the newest premium models for organizations standardizing on OpenAI.

Best mathematical reasoning

Gemini 3.1 Pro

Leads the reasoning benchmarks cited here: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, HLE (mathematical) 44.4%. The strongest all-round quality per dollar in the premium tier at $1/M input.

Best for engineering

Claude Opus 4.7

SWE-bench Pro 64.3%, 5.7 points ahead of GPT-5.5. The only model whose engineering speed and output quality justify a $5/M input price. Essential for teams building high-quality software.

Best value for frontier work

Kimi K2.6

Open-weight, 1T context window, SWE-bench Pro 58.6% (matches Gemini 3.1 Pro level, within 6 points of Claude Opus 4.7). Frontier-competitive quality without vendor lock-in. Best for teams comfortable with open ecosystems.

Premium model comparison

A current comparison of premium AI models across coding, research, reasoning, and pricing benchmarks.

Eval (prices in USD per 1M tokens)	OpenAI: GPT-5.5 ($2.00 in / $8.00 out)	Google: Gemini 3.1 Pro ($1.00 in / $10.00 out)	Anthropic: Claude Opus 4.7 ($5.00 in / $25.00 out)	DeepSeek: DeepSeek V4 ($0.14 in / $0.28 out)
GPQA Diamond: graduate-level science reasoning, a frontier differentiator for research-grade work	~92%	94.3%	94.2%	~91%
SWE-bench Pro: production-grade software engineering benchmark based on real GitHub issue resolution	~58.6%	~71.8%	64.3%	~56%
HLE (Mathematics): Humanity's Last Exam, graduate-level reasoning across all domains	42.0%	44.4%	~41%	~40%
Context window: maximum tokens in a single request (larger windows handle full documents and knowledge bases)	1M tokens	1M+ tokens	200K tokens	256K tokens

GPT-5.5 pricing: OpenAI pricing (May 2026)

Gemini 3.1 Pro pricing: Gemini API pricing

Claude Opus 4.7 pricing: Anthropic pricing

DeepSeek V4 pricing: DeepSeek API pricing (May 2026)
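To turn the per-token prices in the table into a budget figure, a minimal sketch follows. The prices are the table's list prices in USD per 1M tokens. The workload size (500M input and 100M output tokens per month) and the helper name `workload_cost` are hypothetical, chosen only for illustration.

```python
# List prices in USD per 1M tokens (input, output), copied from the comparison table.
PRICES = {
    "GPT-5.5": (2.00, 8.00),
    "Gemini 3.1 Pro": (1.00, 10.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4": (0.14, 0.28),
}

# Hypothetical monthly workload, assumed for illustration only.
INPUT_TOKENS = 500_000_000
OUTPUT_TOKENS = 100_000_000


def workload_cost(input_tokens, output_tokens, price_in, price_out):
    """Return the USD cost of a workload given per-1M-token prices."""
    return (input_tokens / 1_000_000) * price_in + (output_tokens / 1_000_000) * price_out


costs = {
    model: workload_cost(INPUT_TOKENS, OUTPUT_TOKENS, price_in, price_out)
    for model, (price_in, price_out) in PRICES.items()
}

# Print the models from cheapest to most expensive for this workload.
for model, cost in sorted(costs.items(), key=lambda item: item[1]):
    print(f"{model:16s} ${cost:,.2f}")
```

Under this assumed mix the monthly totals come to roughly $98 (DeepSeek V4), $1,500 (Gemini 3.1 Pro), $1,800 (GPT-5.5), and $5,000 (Claude Opus 4.7). The Gemini and GPT-5.5 order is sensitive to the mix: because GPT-5.5 has the lower output price ($8 vs $10), it becomes the cheaper of the two once input tokens fall below twice the output tokens.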

Current picks

Model notes

Use this section for the buying lens: what each model is actually good for, the headline numbers worth remembering, and the standard API pricing to keep in mind.

GPT-5.5

Launched April 23, 2026. Now leads the Artificial Analysis Intelligence Index (score: 60). Solid all-rounder for organizations standardizing on OpenAI. The newest, most actively optimized model.

Input: $2.00 per 1M tokens

Output: $8.00 per 1M tokens

Artificial Analysis Index: 60 (Leader)

Context window: 1M tokens

Released: April 23, 2026

Sources: GPT-5.5 launch — OpenAI, Artificial Analysis Intelligence Index

Gemini 3.1 Pro

Leads the reasoning benchmarks cited here: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, mathematical reasoning (HLE) 44.4%. At $1/M input, it costs one-fifth of Claude Opus 4.7's input price. Best benchmark-per-dollar in the premium tier.

Input: $1.00 per 1M tokens

Output: $10.00 per 1M tokens

GPQA Diamond: 94.3%

HLE (Math): 44.4%

Context window: 1M+ tokens

Sources: Best AI Models in May 2026 — felloai.com, Gemini 3.1 Pro — Pluralsight
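The benchmark-per-dollar claim can be checked with simple division. The sketch below uses the GPQA Diamond scores and input prices from the comparison table for the three proprietary models. Taking GPQA Diamond alone as the numerator and input price alone as the denominator is an assumption made here for illustration, not a metric the board defines, and GPT-5.5's score is approximate (~92%) in the table.

```python
# (GPQA Diamond score in %, input price in USD per 1M tokens) from the comparison table.
# GPT-5.5's score is listed as approximate (~92%).
PREMIUM_MODELS = {
    "GPT-5.5": (92.0, 2.00),
    "Gemini 3.1 Pro": (94.3, 1.00),
    "Claude Opus 4.7": (94.2, 5.00),
}


def points_per_dollar(score, price_in):
    """Benchmark points per USD of input price (an illustrative ratio only)."""
    return score / price_in


ratios = {
    model: points_per_dollar(score, price_in)
    for model, (score, price_in) in PREMIUM_MODELS.items()
}
best = max(ratios, key=ratios.get)

for model, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(f"{model:16s} {ratio:6.2f} GPQA points per input dollar")
print("Best ratio:", best)
```

On these inputs Gemini 3.1 Pro has the highest ratio of the three. A ratio built on output price, or on a different benchmark, could rank the models differently.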

Claude Opus 4.7

SWE-bench Pro 64.3%: the unmatched coding leader, 5.7 points ahead of GPT-5.5. Essential for teams building or maintaining quality codebases. Worth the $5/M input price when engineering productivity or code quality is the constraint.

Input: $5.00 per 1M tokens

Output: $25.00 per 1M tokens

SWE-bench Pro: 64.3%

GPQA Diamond: 94.2%

Context window: 200K tokens

Sources: Claude Opus 4.7 — CloudZero, Claude pricing — Anthropic

DeepSeek V4

Frontier-competitive scores at $0.14/M input, roughly 7x to 36x cheaper on input price than the proprietary models above. The emerging disruptor for cost-sensitive workloads that do not involve sensitive data. Data residency and vendor risk are the real questions, not capability.

Input: $0.14 per 1M tokens

Output: $0.28 per 1M tokens

Cost vs Claude: ~36x cheaper (input price)

Context window: 256K tokens

Frontier-level reasoning: ~91% GPQA Diamond

Sources: DeepSeek V4 — OpenRouter, DeepSeek API pricing (May 2026)
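The price multiples quoted for DeepSeek V4 are plain ratios of list prices. A small sketch, using the input and output prices from the comparison table, shows how the multiple depends on which model and which price (input or output) is compared. The function name `price_multiple` is a hypothetical helper introduced for this example.

```python
# List prices in USD per 1M tokens (input, output), copied from the comparison table.
DEEPSEEK_V4 = (0.14, 0.28)
PROPRIETARY = {
    "GPT-5.5": (2.00, 8.00),
    "Gemini 3.1 Pro": (1.00, 10.00),
    "Claude Opus 4.7": (5.00, 25.00),
}


def price_multiple(other_price, deepseek_price):
    """How many times more expensive the other model is than DeepSeek V4."""
    return other_price / deepseek_price


multiples = {
    model: (
        price_multiple(price_in, DEEPSEEK_V4[0]),   # input-price multiple
        price_multiple(price_out, DEEPSEEK_V4[1]),  # output-price multiple
    )
    for model, (price_in, price_out) in PROPRIETARY.items()
}

for model, (in_mult, out_mult) in multiples.items():
    print(f"{model:16s} input {in_mult:5.1f}x   output {out_mult:5.1f}x")
```

On these prices the input multiple runs from about 7x (Gemini 3.1 Pro) to about 36x (Claude Opus 4.7), and the output multiple from about 29x (GPT-5.5) to about 89x (Claude Opus 4.7), so a single headline multiple understates the spread for output-heavy workloads.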

Kimi K2.6

Open-weight, 1 trillion token context (largest in the market), SWE-bench Pro 58.6% — within 6 points of Claude Opus 4.7. No per-token vendor lock-in. Best for organizations wanting infrastructure control and document-scale processing.

Input: $0.10 per 1M tokens

Output: $0.30 per 1M tokens

Context window: 1 trillion tokens

SWE-bench Pro: 58.6%

Architecture: Open-weight, agentic-first

Sources: Best AI Models of May 2026 — buildfastwithai.com, Kimi K2.6 — Moonshot AI

Reading the board

Benchmark explainers

Scores are useful as signals, but leaders still need to map them back to workflow fit, governance, review load, and integration cost.

leadership strategy

What AI benchmarks tell executives, and what they do not

Benchmarks show direction, not whether a product should be bought or rolled out.

5 min read · 5/20/2026

risk governance

How to read benchmark comparisons before buying an AI product

Use benchmark charts to frame diligence questions, not to shortcut buying decisions.

6 min read · 5/18/2026
