Claude Opus 4.6 Takes #1 in Text, xAI’s Grok Crashes the Image Editing Top 10 — AI Rankings for March 4, 2026

Today’s Shakeup

Claude Opus 4.6 has officially overtaken its own thinking variant to claim the #1 spot in text generation, while xAI quietly dropped two new image editing models into the top 10. If you’re building with text or image editing APIs, today’s rankings have direct implications for your stack.

What Changed Today

  • 🏆 New #1 in Text Generation: claude-opus-4-6 overtook claude-opus-4-6-thinking for the top spot
  • 🖼️ New in Image Editing Top 10: grok-imagine-image-pro entered at #5 (ELO: 1321)
  • 🖼️ New in Image Editing Top 10: grok-imagine-image entered at #6 (ELO: 1317)

Text Generation: Claude Opus 4.6 Is Now Undisputed #1

Claude Opus 4.6 from Anthropic now sits alone at the top of the text generation leaderboard with an ELO of 1504, edging past its thinking-mode sibling, claude-opus-4-6-thinking, which shares the same ELO score but falls to #2 on head-to-head performance and vote distribution. This is a notable signal: the standard (non-thinking) variant now outperforms the extended-reasoning mode in the arena's blind human evaluations, suggesting that for general-purpose text tasks the additional thinking overhead may not be buying you better outputs.

For developers, the practical call here is straightforward. Use claude-opus-4-6 as your default for agents, complex generation, and high-quality text tasks at $5/$25 per million tokens (input/output). You’ll get the top-ranked model without paying the latency and token cost penalty of extended thinking. Reserve the thinking variant for tasks where you’ve verified it actually improves your specific use case — chain-of-thought math, multi-step planning, or complex code reasoning. Google’s Gemini 3.1 Pro Preview remains extremely competitive at #3 (ELO: 1500) and comes in cheaper at $2/$12 per MTok if budget is a factor.
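To make the pricing trade-off concrete, here is a minimal cost-estimation sketch in Python. The per-MTok prices come from the figures above; the traffic numbers in the example are hypothetical and exist only to illustrate the calculation:

```python
# Per-million-token prices (input, output) in USD, as quoted above.
PRICES = {
    "claude-opus-4-6": (5.00, 25.00),
    "gemini-3.1-pro-preview": (2.00, 12.00),
}

def monthly_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate USD cost for a month of traffic on the given model."""
    in_price, out_price = PRICES[model]
    return (input_tokens / 1e6) * in_price + (output_tokens / 1e6) * out_price

# Hypothetical workload: 200M input tokens, 50M output tokens per month.
opus = monthly_cost("claude-opus-4-6", 200_000_000, 50_000_000)
gemini = monthly_cost("gemini-3.1-pro-preview", 200_000_000, 50_000_000)
print(f"claude-opus-4-6: ${opus:,.2f}")          # $2,250.00
print(f"gemini-3.1-pro-preview: ${gemini:,.2f}")  # $1,000.00
```

At this (made-up) volume the Gemini option is less than half the cost, which is the kind of gap that matters when quality at the top of the board is this close.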

Image Editing: xAI Enters the Arena with Grok Imagine

xAI has landed two new models in the image editing top 10: grok-imagine-image-pro at #5 (ELO: 1321) and grok-imagine-image at #6 (ELO: 1317). Both carry a February 7, 2026 release date but are only now accumulating enough arena votes to rank, with roughly 11K and 7K votes respectively. They have slotted in above established players like ByteDance's Seedream 4.5 and Tencent's Hunyuan Image 3.0, which makes for a strong debut.

That said, the gap to the leaders is significant. OpenAI’s ChatGPT Image (gpt-image-1) still dominates at ELO 1413, and Google’s Gemini 3 Pro Image variants hold firm at #2 and #3. The Grok models are worth watching — especially if xAI prices them aggressively or they continue climbing as more votes come in — but they’re not yet a reason to switch from gpt-image-1 or gemini-3-pro-image-preview for production image editing workloads. Keep an eye on xAI’s API docs for pricing and availability details as they mature.

Current Leaders at a Glance

| Category | #1 Model | Provider | API Model ID | Score (ELO) |
| --- | --- | --- | --- | --- |
| Text Generation | Claude Opus 4.6 | Anthropic | claude-opus-4-6 | 1504 |
| Image Editing | ChatGPT Image (High Fidelity) | OpenAI | gpt-image-1 | 1413 |

So What?

If you’re currently routing text generation calls through claude-opus-4-6-thinking as your default, today’s data gives you a concrete reason to benchmark the standard claude-opus-4-6 — you may get equal or better output quality with lower latency and fewer tokens burned. For image editing pipelines, the xAI Grok Imagine models are a new entrant worth adding to your evaluation set, but with vote counts still relatively low and a ~90 ELO gap to the leader, they’re not ready to unseat gpt-image-1 or the Gemini image models in production. The broader trend here is clear: the text generation leaderboard is incredibly tight at the top (4 ELO points separating #1 from #3 across three providers), which means your choice increasingly comes down to pricing, latency, and ecosystem fit rather than raw quality. Build your evals accordingly.
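To put those ELO gaps in concrete terms, the standard Elo expected-score formula converts a rating difference into a head-to-head win probability. A short sketch in Python, using the ratings reported above:

```python
def elo_win_prob(rating_a: float, rating_b: float) -> float:
    """Expected score of A vs B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

# The 4-point gap at the top of the text leaderboard is near a coin flip:
print(round(elo_win_prob(1504, 1500), 3))  # ~0.506

# The ~90-point gap in image editing (1413 vs 1321) is far more decisive:
print(round(elo_win_prob(1413, 1321), 3))  # ~0.629
```

In other words, the text leaderboard's #1 would be expected to win barely more than half of blind comparisons against #3, while gpt-image-1 would be expected to beat grok-imagine-image-pro roughly 63% of the time. That is why pricing and latency, not raw quality, should drive the text decision, while the image gap is still large enough to matter.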
