Executive model board

Benchmarks

A current snapshot of the key models, what each is best for, and the public benchmark scores that matter most when you are deciding where to spend time and budget.

Last updated 24 May 2026

TL;DR

What to pick right now

GPT-5.5 now leads the quality index, but Gemini 3.1 Pro costs less on input and wins on reasoning. Claude Opus 4.7 remains best for high-stakes analysis. Open-weight models (Kimi K2.6, DeepSeek V4) are frontier-competitive for sensitive, cost-constrained work.

As of May 2026, GPT-5.5 (launched April 23) leads the Artificial Analysis Intelligence Index at 60, with Gemini 3.1 Pro and Claude Opus 4.7 close behind. Gemini 3.1 Pro leads on mathematical reasoning (HLE 44.4%) and GPQA Diamond (94.3%). Claude Opus 4.7 maintains the best coding benchmark (SWE-bench Pro 64.3%). Open-weight models like Kimi K2.6 have narrowed the gap dramatically, with Kimi matching GPT-5.5 on SWE-bench Pro at 58.6%.

Best overall quality

GPT-5.5

Launched April 23, 2026. Now the leader on the Artificial Analysis Intelligence Index (score: 60). Reliable across reasoning, coding, and writing. The safest choice among the newest premium models for organizations standardizing on OpenAI.

Best mathematical reasoning

Gemini 3.1 Pro

Leads the reasoning benchmarks cited on this board: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, HLE (mathematical) 44.4%. The strongest all-round quality per dollar in the premium tier at $1/M input.

Best for engineering

Claude Opus 4.7

SWE-bench Pro 64.3%, 5.7 points ahead of GPT-5.5. The one model here whose engineering speed and output quality justify a $5/M input price. Essential for teams building high-quality software.

Best value for frontier work

Kimi K2.6

Open-weight, 1T-token context window, SWE-bench Pro 58.6% (matches GPT-5.5's score, within 6 points of Claude Opus 4.7). Frontier-competitive quality without vendor lock-in. Best for teams comfortable with open ecosystems.

Shared public evals

Premium model comparison

Comparison source: Best AI Models in May 2026 — buildfastwithai.com

A current comparison of premium AI models across coding, research, reasoning, and pricing benchmarks.
| Eval (prices per 1M tokens) | OpenAI GPT-5.5 ($2.00 in / $8.00 out) | Google Gemini 3.1 Pro ($1.00 in / $10.00 out) | Anthropic Claude Opus 4.7 ($5.00 in / $25.00 out) | DeepSeek V4 ($0.14 in / $0.28 out) |
|---|---|---|---|---|
| GPQA Diamond: graduate-level science reasoning, a frontier differentiator for research-grade work | ~92% | 94.3% | 94.2% | ~91% |
| SWE-bench Pro: production-grade software engineering benchmark, real GitHub issue resolution | ~58.6% | ~71.8% | 64.3% | ~56% |
| HLE (Mathematics): Humanity's Last Exam, graduate-level reasoning across all domains | 42.0% | 44.4% | ~41% | ~40% |
| Context window: maximum tokens in a single request (larger = handle full documents/knowledge bases) | 1M tokens | 1M+ tokens | 200K tokens | 256K tokens |

GPT-5.5 pricing: OpenAI pricing (May 2026)

Gemini 3.1 Pro pricing: Gemini API pricing

Claude Opus 4.7 pricing: Anthropic pricing

DeepSeek V4 pricing: DeepSeek API pricing (May 2026)
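
To turn the per-million-token prices in the comparison above into a budget figure, a rough estimate can be sketched as below. The prices are copied from this board; the function name `workload_cost` and the monthly token volumes are hypothetical, chosen only for illustration, and real bills may differ (caching, batch discounts, and tiered pricing are ignored).

```python
# List prices in USD per 1M tokens (input, output), copied from the comparison above.
PRICES = {
    "GPT-5.5": (2.00, 8.00),
    "Gemini 3.1 Pro": (1.00, 10.00),
    "Claude Opus 4.7": (5.00, 25.00),
    "DeepSeek V4": (0.14, 0.28),
}


def workload_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a workload at list prices."""
    price_in, price_out = PRICES[model]
    return (input_tokens * price_in + output_tokens * price_out) / 1_000_000


if __name__ == "__main__":
    # Hypothetical monthly volume: 500M input tokens and 100M output tokens.
    for name in PRICES:
        cost = workload_cost(name, 500_000_000, 100_000_000)
        print(f"{name}: ${cost:,.2f}")
```

At that assumed mix, Gemini 3.1 Pro comes to $1,500 and GPT-5.5 to $1,800. Because Gemini's output price is higher, the ordering flips once input volume drops below twice the output volume, so the token mix matters as much as the headline input price.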

Current picks

Model notes

Use this section for the buying lens: what each model is actually good for, the headline numbers worth remembering, and the standard API pricing to keep in mind.

OpenAI

Latest frontier model, leads quality index

GPT-5.5

Launched April 23, 2026. Now leads the Artificial Analysis Intelligence Index (score: 60). Solid all-rounder for organizations standardizing on OpenAI. The newest, most actively optimized model.

Input: $2.00 per 1M tokens
Output: $8.00 per 1M tokens
Artificial Analysis Index: 60 (Leader)
Context window: 1M tokens
Released: April 23, 2026

Sources: GPT-5.5 launch — OpenAI, Artificial Analysis Intelligence Index

Google

Best reasoning, best value for premium work

Gemini 3.1 Pro

Leads the reasoning benchmarks cited on this board: GPQA Diamond 94.3%, ARC-AGI-2 77.1%, mathematical reasoning (HLE) 44.4%. One-fifth of Claude Opus's input price at $1/M, and 40% of its output price at $10/M. Best benchmark-per-dollar in premium tier.

Input: $1.00 per 1M tokens
Output: $10.00 per 1M tokens
GPQA Diamond: 94.3%
HLE (Math): 44.4%
Context window: 1M+ tokens

Sources: Best AI Models in May 2026 — felloai.com, Gemini 3.1 Pro — Pluralsight
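
One crude way to check a "benchmark-per-dollar" claim is to divide a score by a list price. The sketch below does that for GPQA Diamond using the figures quoted on this board. The metric and the names `points_per_dollar` and `rank_by_value` are assumptions made for illustration, and scores shown with "~" on the board are treated as exact.

```python
# GPQA Diamond scores (%) and list prices (USD per 1M tokens) as quoted on this board.
# Scores marked "~" on the board are treated as exact here, which is an approximation.
MODELS = {
    "GPT-5.5": {"gpqa": 92.0, "price_in": 2.00, "price_out": 8.00},
    "Gemini 3.1 Pro": {"gpqa": 94.3, "price_in": 1.00, "price_out": 10.00},
    "Claude Opus 4.7": {"gpqa": 94.2, "price_in": 5.00, "price_out": 25.00},
}


def points_per_dollar(model: str, price_key: str) -> float:
    """GPQA Diamond points per dollar of the chosen list price (a crude value metric)."""
    entry = MODELS[model]
    return entry["gpqa"] / entry[price_key]


def rank_by_value(price_key: str) -> list[str]:
    """Model names sorted from best to worst points-per-dollar."""
    return sorted(MODELS, key=lambda m: points_per_dollar(m, price_key), reverse=True)


if __name__ == "__main__":
    print("By input price: ", rank_by_value("price_in"))
    print("By output price:", rank_by_value("price_out"))
```

On these figures Gemini 3.1 Pro ranks first per input dollar, while GPT-5.5 ranks first per output dollar, so the value ranking depends on how output-heavy the workload is.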

Anthropic

Best for production engineering, deep analysis

Claude Opus 4.7

SWE-bench Pro 64.3% — the unmatched coding leader, 5.7 points ahead of GPT-5.5. Essential for teams building or maintaining quality codebases. Worth the $5/M cost when engineering productivity or code quality is the constraint.

Input: $5.00 per 1M tokens
Output: $25.00 per 1M tokens
SWE-bench Pro: 64.3%
GPQA Diamond: 94.2%
Context window: 200K tokens

Sources: Claude Opus 4.7 — CloudZero, Claude pricing — Anthropic

DeepSeek

Frontier quality at minimal cost, 256K context

DeepSeek V4

Frontier-competitive scores at $0.14/M input, roughly 7–36x cheaper on input than the proprietary models listed here. The emerging disruptor for cost-sensitive workloads that do not involve sensitive data. Data residency and vendor risk are the real questions, not capability.

Input: $0.14 per 1M tokens
Output: $0.28 per 1M tokens
Cost vs Claude: ~36x cheaper (input)
Context window: 256K tokens
Frontier-level reasoning: ~91% GPQA Diamond

Sources: DeepSeek V4 — Open Router, DeepSeek API pricing (May 2026)
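
The price multiples for DeepSeek V4 can be recomputed directly from the list prices on this board. The following is a minimal sketch under that assumption; the helper name `price_multiple` is hypothetical, and the calculation ignores discounts, caching, and any self-hosting costs.

```python
# List prices in USD per 1M tokens, as quoted on this board.
DEEPSEEK = {"input": 0.14, "output": 0.28}
PROPRIETARY = {
    "GPT-5.5": {"input": 2.00, "output": 8.00},
    "Gemini 3.1 Pro": {"input": 1.00, "output": 10.00},
    "Claude Opus 4.7": {"input": 5.00, "output": 25.00},
}


def price_multiple(model: str, kind: str) -> float:
    """How many times more `model` costs than DeepSeek V4 for `kind` tokens."""
    return PROPRIETARY[model][kind] / DEEPSEEK[kind]


if __name__ == "__main__":
    for name in PROPRIETARY:
        print(
            f"{name}: {price_multiple(name, 'input'):.1f}x input, "
            f"{price_multiple(name, 'output'):.1f}x output"
        )
```

On these figures the input-price multiple runs from about 7x (Gemini 3.1 Pro) through about 14x (GPT-5.5) to about 36x (Claude Opus 4.7). The output-price multiple against Claude Opus 4.7 is larger still, at roughly 89x.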

Moonshot AI

1 trillion token context, open-weight frontier quality

Kimi K2.6

Open-weight, 1 trillion token context (largest in the market), SWE-bench Pro 58.6% — within 6 points of Claude Opus 4.7. No per-token vendor lock-in. Best for organizations wanting infrastructure control and document-scale processing.

Input: $0.10 per 1M tokens
Output: $0.30 per 1M tokens
Context window: 1 trillion tokens
SWE-bench Pro: 58.6%
Architecture: Open-weight, agentic-first

Sources: Best AI Models of May 2026 — buildfastwithai.com, Kimi K2.6 — Moonshot AI

Reading the board

Benchmark explainers

Scores are useful as signals, but leaders still need to map them back to workflow fit, governance, review load, and integration cost.
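
One way to put benchmark scores alongside the other factors named above is a simple weighted scorecard. The sketch below is purely illustrative: the weights, the 0-10 rating scale, and the two unnamed candidates are hypothetical placeholders, not measurements of any model on this board.

```python
# Hypothetical weights for one team's priorities; they should sum to 1.0.
WEIGHTS = {
    "benchmark": 0.30,
    "workflow_fit": 0.30,
    "governance": 0.15,
    "review_load": 0.15,
    "integration_cost": 0.10,
}


def weighted_score(ratings: dict[str, float], weights: dict[str, float] = WEIGHTS) -> float:
    """Combine 0-10 ratings (higher is better) into a single weighted score."""
    missing = set(weights) - set(ratings)
    if missing:
        raise ValueError(f"missing ratings for: {sorted(missing)}")
    return sum(weights[name] * ratings[name] for name in weights)


# Placeholder ratings for two unnamed candidates. For review_load and
# integration_cost, a higher rating means a lighter burden.
candidate_a = {"benchmark": 9, "workflow_fit": 5, "governance": 6,
               "review_load": 4, "integration_cost": 4}
candidate_b = {"benchmark": 7, "workflow_fit": 8, "governance": 8,
               "review_load": 7, "integration_cost": 7}

if __name__ == "__main__":
    print(f"A: {weighted_score(candidate_a):.2f}")
    print(f"B: {weighted_score(candidate_b):.2f}")
```

With these placeholder inputs, candidate B scores 7.45 against 6.10 for candidate A despite the lower benchmark rating, which illustrates how a benchmark leader can still lose on overall fit.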