What Is Claude Opus 4.6? Anthropic's February 2026 Flagship Explained
Claude Opus 4.6 is the largest and most capable model that Anthropic has ever shipped. It launched on February 11, 2026, alongside its smaller sibling Claude Sonnet 4.6, and it now sits at the very top of the Claude lineup — the model Anthropic itself reaches for when a problem is genuinely hard. If you've used Claude 4 Opus or Claude 3.7 Sonnet over the past year, Opus 4.6 is what you get when Anthropic compounds eighteen months of post-training research, scales the underlying network, extends the context to a full million tokens, and bakes extended thinking in as a default rather than an opt-in.
The naming, as always with Anthropic, is more meaningful than it looks. The "4.6" places this release on the same architectural family tree as Claude 4 (May 2025) and Claude 4.5 (October 2025), which means it's not the start of a new generation — it's the refined, polished, production-ready version of the work that started with Claude 4. Don't let the modest version bump fool you, though: the step from 4.5 to 4.6 delivers gains larger than a typical incremental release — roughly 11 points on SWE-bench Verified, a doubling of usable context length (from 500K to 1M tokens), and an extended thinking budget that now stretches to 200,000 reasoning tokens before any answer is produced.
The "Opus" tier matters too. Anthropic has consistently used three weights of model in each generation — Haiku at the bottom (fast, cheap, embedded), Sonnet in the middle (the workhorse most people actually use), and Opus at the top (the heaviest network, reserved for the hardest tasks). Opus 4.6 is the only model in the 4.6 family with the full parameter count, the full extended thinking budget, and the deepest training on agentic tool use. Sonnet 4.6 covers most general use, but when developers reach for "the best Claude," they reach for Opus 4.6.
What makes this release feel different from the half-step bumps Anthropic usually ships is that Opus 4.6 is the first model the company has positioned as a true general-purpose flagship — competitive with GPT-5.4 on reasoning, with Gemini 3 Pro on multimodal tasks, and ahead of both on long-horizon coding work. Anthropic spent most of 2024 and 2025 conceding the "biggest model" crown and competing on reliability, alignment, and developer experience instead. With 4.6, they're openly claiming the top spot, and the early benchmark numbers back it up.
Opus 4.6 vs Sonnet 4.6: How Anthropic Split the 4.6 Family
The 4.6 release came as a paired drop — Opus 4.6 and Sonnet 4.6 on the same day, with Haiku 4.6 following two weeks later. The split between the two flagship-class models is more intentional than in previous generations, and understanding it is the difference between paying for capability you need and paying for capability you'll never use.
Sonnet 4.6: The Workhorse, Not the Lightweight
Sonnet 4.6 is not a "small" model. It's a frontier-class model in its own right — roughly comparable to Claude 4 Opus from May 2025 on most benchmarks, but at a fraction of the cost and latency. For 90% of production workloads — chat assistants, content generation, document analysis, single-file code edits, customer support automation — Sonnet 4.6 is the correct choice. It runs about 2.5x faster than Opus 4.6, costs roughly 5x less per token, and is available with the same 1M context window and the same extended thinking mode.
Anthropic explicitly designed Sonnet 4.6 to be the default. The Claude.ai consumer app uses Sonnet 4.6 for the free tier and standard Pro tier, and most API customers route the bulk of their traffic to Sonnet 4.6 unless a specific request escalates to Opus.
Opus 4.6: When the Problem Is Genuinely Hard
Opus 4.6 earns its premium on a narrow but important set of tasks. Anthropic's own internal routing — used inside Claude Code and the Claude desktop app — escalates to Opus 4.6 when the request involves: multi-file refactoring across more than ten files, mathematical proofs longer than a few steps, agentic loops with more than fifteen tool calls, scientific literature synthesis across dozens of papers, or any task where the user explicitly requests "deep" or "thorough" reasoning. On these tasks, the gap between Sonnet 4.6 and Opus 4.6 is not subtle — it's the difference between a confident-but-wrong answer and a correct one.
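The escalation criteria above can be sketched as a simple routing function. This is a hypothetical illustration of the logic described in this section — not Anthropic's actual internal router — and the field names and thresholds are assumptions chosen to mirror the criteria listed above.

```python
def pick_model(task):
    """Route a request to Opus 4.6 or Sonnet 4.6.

    A hypothetical sketch of the escalation criteria described above,
    not Anthropic's actual router. `task` is a dict describing the request.
    """
    escalate = (
        task.get("files_touched", 0) > 10             # multi-file refactor
        or task.get("expected_tool_calls", 0) > 15    # long agentic loop
        or task.get("proof_steps", 0) > 3             # multi-step math proof
        or task.get("papers_to_synthesize", 0) >= 24  # literature synthesis
        or task.get("user_requested_deep", False)     # explicit "deep"/"thorough"
    )
    return "claude-opus-4-6" if escalate else "claude-sonnet-4-6"

# A routine edit stays on Sonnet; a 20-file refactor escalates to Opus.
print(pick_model({"files_touched": 2}))   # claude-sonnet-4-6
print(pick_model({"files_touched": 20}))  # claude-opus-4-6
```

The useful property of a rule like this is that the cheap model stays the default and the expensive one has to be earned — exactly the routing posture Anthropic recommends.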
The clearest illustration is SWE-bench Verified. Sonnet 4.6 scores 71% — already an excellent number. Opus 4.6 scores 79.6%. That 8.6-point gap represents the long tail of hard bugs that require holding more context in working memory than Sonnet can comfortably manage. If you're writing a Slack bot, you'll never see the difference. If you're refactoring a 50,000-line TypeScript monorepo, you will.
Pricing Reflects the Split
The pricing gap between the two models is sharp on purpose:
- Sonnet 4.6: $3 per million input tokens, $15 per million output tokens (unchanged from Sonnet 4.5).
- Opus 4.6: $15 per million input tokens, $75 per million output tokens (a slight discount from Opus 4 at $20/$80, reflecting efficiency gains in the new architecture).
That 5x premium is real money at scale, which is why Anthropic invested so heavily in making Sonnet 4.6 capable enough that Opus 4.6 stays a specialist tool. For a full breakdown of how this maps to consumer plans, see our Claude pricing comparison for 2026.
What's New in 4.6: Six Improvements That Actually Change Workflows
It's tempting to summarize a model release as "smarter on benchmarks" and move on. But the practical changes in Claude 4.6 — the things that change what you can build and how you build it — are concrete enough to enumerate. There are six that matter.
1. The 1 Million Token Context Window
Claude 4.6 doubles the usable context from Claude 4.5's 500,000 tokens to a full 1,000,000 tokens — roughly 750,000 words, or about 2,500 pages of dense technical content. This is the same headline number that Gemini 1.5 Pro shipped back in early 2024, but Anthropic took an extra two years to get there because they refused to ship long context until "needle in a haystack" recall stayed above 99% across the full window. The result is a 1M context that actually behaves like 1M tokens — not the 1M that degrades to 200K of practical recall like some competing models.
What this unlocks: dropping an entire codebase into a single prompt, ingesting a year of meeting transcripts for a single retro, loading an entire legal case file (briefs, exhibits, depositions) and asking cross-document questions, or feeding a complete book and asking for a chapter-by-chapter critique. Tasks that previously required RAG, chunking, or summarization pipelines can now be done with a single API call.
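Before dropping a whole corpus into one prompt, it's worth estimating whether it fits. The sketch below uses the common ~4-characters-per-token heuristic as a stand-in for a real tokenizer, so treat the result as a rough estimate, not a guarantee; the 1M window size comes from this section, and the output reserve is an arbitrary assumption.

```python
# Rough check of whether a document set fits Claude 4.6's 1M-token window.
# The 4-chars-per-token ratio is a heuristic for English prose and code,
# not a real tokenizer — use it only for ballpark planning.

CONTEXT_WINDOW = 1_000_000
CHARS_PER_TOKEN = 4

def estimate_tokens(texts):
    return sum(len(t) for t in texts) // CHARS_PER_TOKEN

def fits_in_context(texts, reserve_for_output=16_000):
    """True if the texts, plus headroom for the reply, fit in one prompt."""
    return estimate_tokens(texts) + reserve_for_output <= CONTEXT_WINDOW

docs = ["x" * 1_200_000, "y" * 800_000]  # ~2M chars, roughly 500K tokens
print(fits_in_context(docs))  # True — comfortably inside the 1M window
```

When the check fails, that's the signal to fall back to the chunking or RAG pipelines the paragraph above describes.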
2. Extended Thinking Budget Up to 200K Tokens
Extended thinking — the chain-of-thought reasoning mode that Anthropic introduced with Claude 3.7 Sonnet (we covered it in depth in our Sonnet 3.7 review) — has been expanded substantially. The maximum thinking budget in Claude 4.6 is now 200,000 tokens, up from 64K in the 4.5 family. In practice, Opus 4.6 with a 200K thinking budget can solve research-mathematics problems that previously required dedicated reasoning models, and it can plan agentic workflows that span several hours of execution before producing a final answer.
The thinking traces themselves are also cleaner. Anthropic trained Claude 4.6 to compress its internal reasoning more efficiently — fewer "wait, let me reconsider" loops, fewer redundant restatements, and more explicit branching when the model considers multiple approaches.
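In API terms, a request that uses the larger budget might look like the sketch below. This assumes the `thinking` parameter keeps the shape it has had in earlier Claude releases, and the 200K ceiling and model ID are taken from this article — verify both against current Anthropic documentation before relying on them.

```python
# Sketch of a Messages API request body with a large extended thinking
# budget. The parameter shape is an assumption based on prior Claude
# releases; the 200K cap and model ID come from this article.

MAX_THINKING_BUDGET = 200_000  # Claude 4.6's ceiling, per this section

def build_request(prompt, thinking_budget=64_000, max_tokens=8_000):
    budget = min(thinking_budget, MAX_THINKING_BUDGET)  # clamp to the ceiling
    return {
        "model": "claude-opus-4-6-20260211",
        "max_tokens": max_tokens,
        "thinking": {"type": "enabled", "budget_tokens": budget},
        "messages": [{"role": "user", "content": prompt}],
    }

req = build_request("Prove the statement...", thinking_budget=250_000)
print(req["thinking"]["budget_tokens"])  # 200000 — clamped to the maximum
```

Clamping client-side keeps an over-eager caller from requesting a budget the model won't honor.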
3. Native Tool Use and Agentic Loops
Claude 4 introduced robust tool use; Claude 4.6 makes it native to the model's planning behavior. The model now treats tool calls as first-class actions in its reasoning, which means it plans tool sequences before executing them rather than deciding what tool to call one step at a time. For developers building agents, this dramatically reduces wasted tool calls and the kind of "lost in the loop" failures where an agent forgets why it was doing something halfway through.
Specifically, Anthropic claims a 40% reduction in tool-call errors and a 60% reduction in agentic loop length on TAU-bench compared to Claude 4.5 — meaning the same task gets accomplished in fewer steps, with fewer mistakes.
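The "plan first, then execute" pattern described above can be sketched as a loop skeleton. The planner and tools here are stand-in stubs, not Anthropic APIs — the point is the shape: one planning step that emits a whole tool sequence, then execution with a hard cap on total calls.

```python
# Minimal sketch of a plan-then-execute agent loop. `plan_fn` and the
# tools are illustrative stubs, not real APIs.

MAX_TOOL_CALLS = 15

def run_agent(plan_fn, tools, task):
    plan = plan_fn(task)  # the model plans the whole tool sequence up front
    results = []
    for name, args in plan[:MAX_TOOL_CALLS]:  # cap the loop to avoid runaways
        results.append(tools[name](**args))
    return results

# Stub planner and tools for illustration.
def plan_fn(task):
    return [("search", {"query": task}), ("summarize", {"n": 3})]

tools = {
    "search": lambda query: f"results for {query}",
    "summarize": lambda n: f"top {n} summarized",
}

print(run_agent(plan_fn, tools, "flaky test"))
```

Compared with deciding one tool call at a time, committing to a plan up front is what makes the "fewer steps, fewer mistakes" claim measurable: the loop length is bounded by the plan, not by the model's patience.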
4. Improved Multimodal Understanding
Claude 4.6 doesn't add new modalities (no audio output, no image generation), but its existing image understanding has improved noticeably. Chart parsing, diagram interpretation, and document OCR are all more reliable. The model now correctly handles handwritten notes, low-resolution screenshots, and complex layouts (multi-column PDFs, tables that span pages) where Claude 4.5 frequently stumbled.
5. Reduced Refusals on Legitimate Requests
This one is less glamorous but practically important. Anthropic spent significant post-training effort reducing what they call "false refusals" — cases where the model declines a request that's actually benign. Claude 4.6 is noticeably more willing to engage with security research, medical questions, legal hypotheticals, and edgy creative writing without retreating behind boilerplate warnings. The hard safety guarantees are unchanged; the over-cautious hedging is mostly gone.
6. Better Memory of Earlier Context in Long Conversations
In previous Claude versions, long conversations could degrade as the model's attention drifted toward more recent turns. Claude 4.6 holds onto earlier context far more reliably, which matters for multi-day projects where you build up a lot of shared history with the model. The improvement is most visible in coding sessions that span hundreds of turns — the model still remembers the architectural decisions you made an hour ago.
Benchmarks and Real-World Coding: Where the Numbers Actually Matter
Benchmarks are imperfect, but they're the only consistent way to compare models across labs. Here's where Claude Opus 4.6 stands at launch, with the caveat that all of these numbers come from official Anthropic publications and independent reproductions through April 2026.
SWE-bench Verified: The Headline Coding Number
SWE-bench Verified tests a model's ability to fix real GitHub issues from open-source projects — the closest thing the field has to a real-world software engineering benchmark. Opus 4.6 scored 79.6%, which is a substantial jump from Opus 4.5's 68.4% and the highest score from any model at launch.
| Model | SWE-bench Verified | Released |
|---|---|---|
| Claude Opus 4.6 | 79.6% | Feb 2026 |
| GPT-5.4 | 76.1% | Jan 2026 |
| Gemini 3 Pro | 72.8% | Dec 2025 |
| Claude Sonnet 4.6 | 71.0% | Feb 2026 |
| Claude Opus 4.5 | 68.4% | Oct 2025 |
| DeepSeek V4 | 67.2% | Mar 2026 |
An eleven-point jump in the four months since Opus 4.5 is not normal. To put it in perspective, the entire field went from sub-30% on SWE-bench Verified in early 2024 to nearly 80% by early 2026 — the kind of compounding curve that makes long-term planning hard.
Reasoning and Math
With extended thinking enabled, Opus 4.6 posts strong numbers across the reasoning suite:
- GPQA Diamond (graduate-level science questions): 84.7%, marginally ahead of GPT-5.4's 83.2%.
- AIME 2026 (competition mathematics): 91.3% with extended thinking — first-place tier among general-purpose models.
- MATH: 96.4% with extended thinking, near the practical ceiling for this benchmark.
- MMLU-Pro: 87.1%, a small lead over the rest of the frontier.
Agentic and Tool Use
This is where Claude 4.6 separates itself most clearly from the competition. On TAU-bench (a multi-turn benchmark involving customer service tasks with tool use), Opus 4.6 scored 78.4% — over ten points ahead of GPT-5.4. On SWE-Lancer (a benchmark of real freelance software tasks priced in dollars), Opus 4.6 successfully completed tasks worth $620K out of a possible $1M, easily the top result of any model tested.
Real-World Coding: The Anecdotal Evidence Lines Up
Benchmarks aside, the developer feedback in the first eight weeks since launch has been consistently positive on three points: Opus 4.6 holds large codebases in working memory better than any prior model, it makes fewer "confidently wrong" architectural suggestions, and it recovers from its own mistakes gracefully when shown a failing test or an error message. For complex refactors, the gap between Opus 4.6 and the next-best model is genuinely felt by experienced engineers — not just measured on benchmarks.
Several large engineering teams have publicly reported using Opus 4.6 inside Claude Code as their primary development assistant, with measurable productivity gains on the order of 30–40% for complex refactoring work compared to their previous Claude 4.5-based workflows.
The Claude 'Mythos' Rumor: What Project Glasswing Actually Leaked
No discussion of Claude Opus 4.6 is complete without addressing the rumor that has consumed AI Twitter since mid-March 2026. We're going to be careful here, because the line between confirmed fact and credible speculation matters for a story like this.
What We Know
In early March 2026, an internal Anthropic project codename — Project Glasswing — surfaced in screenshots posted to several AI research forums. Glasswing is, according to multiple independent sources who claim to have been inside it, an internal early-access program for what Anthropic engineers refer to as "the next thing after Opus 4.6." The leaked documents (which Anthropic has neither confirmed nor denied) suggest the existence of an unreleased model internally codenamed Claude Mythos.
The most repeated claim from the leak is the parameter count: 10 trillion parameters. If accurate, that would make Mythos roughly an order of magnitude larger than the largest publicly disclosed Claude model and one of the largest dense (non-mixture-of-experts) language models ever trained. The leaked documents reportedly describe Mythos as "the model Opus 4.6 was distilled from" — implying that Opus 4.6 is a distilled, deployment-friendly version of a much larger teacher network.
Why the Distillation Story Is Plausible
The distillation framing makes more sense than the alternative. Training a 10T-parameter dense model and then deploying it directly to API customers would be uneconomic at the prices Anthropic charges for Opus 4.6 — the inference costs alone would force much higher pricing than what's shipped. Distilling a giant teacher down to a smaller, faster student model is a well-established research technique, and it would explain how Opus 4.6's capability jump is so much larger than what scaling the previous architecture would predict.
Anthropic has historically trained larger experimental models that never ship publicly — they've alluded to this in public papers, and former employees have hinted at internal research models that exist purely as teachers for the production lineup. Mythos, if it exists, would fit that pattern.
Why You Should Be Skeptical
That said, "10 trillion parameters" is exactly the kind of round, dramatic number that gets fabricated for attention. The original screenshots have not been independently verified, no current Anthropic employee has confirmed the model's existence on the record, and the source forums where Glasswing first appeared have a mixed track record on prior leaks. Anthropic's public response — when asked at a press event in late March — was a polite "no comment," which is neither confirmation nor denial.
What seems most likely, based on the pattern of past Anthropic communications and the technical plausibility of distillation, is that some version of the leak is true: there is a larger internal model, it does serve as a teacher for production releases, and the architecture is dense rather than mixture-of-experts. Whether it's specifically 10 trillion parameters and specifically called "Mythos" — those details deserve a wait-and-see attitude until Anthropic confirms or refutes them on the record.
What It Would Mean If True
If the Mythos story is essentially accurate, it tells us two things about Anthropic's roadmap. First, the company has been investing heavily in the largest possible base models — abandoning the idea (popular in 2023–2024) that mixture-of-experts and clever architecture would substitute for raw scale. Second, the next generation of Claude (likely "Claude 5" later in 2026) will probably be a direct deployment of something closer to the teacher model itself, with all the capability and cost implications that follow.
API Pricing, Access, and Where to Use Opus 4.6
Pricing and availability are often glossed over in model reviews, but they determine whether you can actually use a model in production. Here's the practical information for Claude Opus 4.6 as of April 2026.
API Pricing
Claude Opus 4.6 is priced at $15 per million input tokens and $75 per million output tokens. That's slightly cheaper than Claude Opus 4 ($20/$80) but significantly more expensive than Sonnet 4.6 ($3/$15) — about 5x more for input and 5x more for output. Cached inputs (using Anthropic's prompt caching feature) drop to roughly $1.50 per million tokens for the cached portion, which makes long-context applications dramatically more affordable on repeat queries.
For agentic workflows, the output pricing is what dominates the bill — agents tend to be input-heavy initially but become output-heavy as they execute. A typical Claude Code session lasting an hour with Opus 4.6 will run somewhere between $2 and $8 in API costs depending on how much code the model writes and how often extended thinking is engaged. That's expensive enough to make routing decisions matter, which is why Anthropic recommends using Sonnet 4.6 by default and escalating to Opus only when the task warrants it.
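A back-of-envelope cost model makes the session estimate above concrete. It uses the prices quoted in this section ($15/M input, $75/M output, roughly $1.50/M for cached input); the token counts in the example are illustrative assumptions, and you should check Anthropic's current price sheet before budgeting against these numbers.

```python
# Rough cost model for an Opus 4.6 session, using the prices quoted in
# this article. For budgeting sketches only — verify against Anthropic's
# current price sheet.

PRICES = {"input": 15.0, "output": 75.0, "cached_input": 1.5}  # $ per M tokens

def session_cost(fresh_in, cached_in, out):
    return (
        fresh_in * PRICES["input"]
        + cached_in * PRICES["cached_input"]
        + out * PRICES["output"]
    ) / 1_000_000

# A hypothetical hour-long agentic session: 200K fresh input tokens,
# 800K cache hits on the repeated context, 40K of generated output.
cost = session_cost(200_000, 800_000, 40_000)
print(f"${cost:.2f}")  # $7.20 — inside the $2–$8 range cited above
```

Note how the cached portion of the input is nearly free relative to the output: in this example the 40K of output costs as much as the 200K of fresh input, which is why output-heavy agent loops dominate the bill.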
Where Opus 4.6 Is Available
- Anthropic API: Direct access at console.anthropic.com, with the model ID claude-opus-4-6-20260211.
- Amazon Bedrock: Available in US, EU, and APAC regions starting February 18, 2026. Pricing matches the direct Anthropic API.
- Google Cloud Vertex AI: Available since February 25, 2026, with the same pricing.
- Claude.ai (consumer): Pro and Max subscribers can select Opus 4.6 from the model picker. Free tier users are limited to Sonnet 4.6.
- Claude Code: Anthropic's official CLI tool for developers uses Opus 4.6 automatically for the hardest planning steps and Sonnet 4.6 for routine edits.
Rate Limits and Tier Access
Opus 4.6 has tighter rate limits than Sonnet 4.6, as you'd expect. New API customers start with 50K tokens per minute on Opus 4.6, scaling up through usage tiers to 400K tokens per minute on Tier 4. For high-volume production workloads, Anthropic recommends contacting sales for custom rate limits — particularly if your use case involves agentic loops that need to burst above the standard limits.
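A simple client-side throttle illustrates how to stay under a tokens-per-minute cap like the 50K starting tier mentioned above. The limit values come from this article, and the sliding-window approach is just one straightforward strategy — production clients typically also honor the rate-limit headers the API returns.

```python
# Minimal client-side tokens-per-minute (TPM) throttle, sketched around
# the 50K-TPM starting tier described above. One simple strategy among
# many; real clients should also respect the API's rate-limit headers.

import time

class TpmThrottle:
    def __init__(self, tpm_limit=50_000, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock  # injectable for testing
        self.events = []    # (timestamp, tokens) pairs within the last 60s

    def try_spend(self, tokens):
        """Record the spend and return True, or False if it would exceed TPM."""
        now = self.clock()
        self.events = [(t, n) for t, n in self.events if now - t < 60]
        used = sum(n for _, n in self.events)
        if used + tokens > self.tpm_limit:
            return False  # caller should back off and retry later
        self.events.append((now, tokens))
        return True

throttle = TpmThrottle(tpm_limit=50_000)
print(throttle.try_spend(30_000))  # True
print(throttle.try_spend(30_000))  # False — would exceed 50K in the window
```

For agentic loops that burst above the cap, a `False` here is the cue to sleep until the oldest window entry expires — or, as Anthropic suggests, to negotiate a custom limit.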
Consumer Access via Claude.ai
If you're not a developer, the easiest way to try Opus 4.6 is through Claude.ai's Pro plan ($20/month) or Max plan ($200/month). The Pro plan gives you a generous but capped number of Opus 4.6 messages per day, while the Max plan effectively removes the cap for all but the heaviest power users. The full pricing matrix — including the new Max tiers introduced with the 4.6 launch — is covered in our complete Claude pricing comparison for 2026.
Claude Opus 4.6 vs GPT-5.4, Gemini 3 Pro, and DeepSeek V4
The frontier in April 2026 has four serious contenders, and each one is genuinely the best at something. Here's how Claude Opus 4.6 fits into that picture — and where it doesn't lead.
vs GPT-5.4 (OpenAI)
OpenAI's GPT-5.4 launched in January 2026 and is the model Opus 4.6 was clearly built to beat. The two are remarkably close on most benchmarks — GPT-5.4 wins on a handful of pure-knowledge tasks (MMLU, certain language understanding evals), Opus 4.6 wins on most coding and agentic benchmarks. The clearest separation is in tool use: Opus 4.6's TAU-bench lead is substantial, and developers building agentic systems consistently report fewer hallucinated tool calls and cleaner reasoning traces.
GPT-5.4's biggest advantage is the broader OpenAI ecosystem — DALL-E 4 image generation, Sora 2 video, advanced voice mode, and Operator (OpenAI's agentic browser tool). If you need a single model that does everything, GPT-5.4 is probably the better pick. If you need the best raw text and code model, Opus 4.6 has the edge.
vs Gemini 3 Pro (Google DeepMind)
Gemini 3 Pro launched in December 2025 and is the strongest multimodal model on the market. It handles audio (input and output), video understanding, and image generation natively — areas where Claude 4.6 has nothing comparable. For multimodal applications — analyzing video clips, transcribing and reasoning about audio, generating images alongside text — Gemini 3 Pro is the obvious choice.
For text and code, Opus 4.6 leads. Gemini 3 Pro's coding scores trail by a meaningful margin (about seven points on SWE-bench Verified), and its tool use is less reliable in agentic settings. The pricing is comparable — Gemini 3 Pro is slightly cheaper on input tokens and slightly more expensive on output.
vs DeepSeek V4
DeepSeek V4 launched in March 2026 and is the most interesting open-weight competitor. It's a mixture-of-experts model with about 800B total parameters and 100B active per token, and it scores within shouting distance of the closed frontier models on most benchmarks — at roughly one-tenth the API cost. For cost-sensitive workloads, DeepSeek V4 is genuinely competitive with Sonnet 4.6 and meaningfully cheaper.
Where it falls behind is on the long tail of hard tasks. Opus 4.6's lead on SWE-bench Verified, TAU-bench, and graduate-level reasoning is real. If you care about the absolute frontier, Opus 4.6 wins. If you care about cost-per-capability and you can tolerate slightly lower reliability, DeepSeek V4 is hard to beat.
The Practical Verdict on Comparisons
- Pick Opus 4.6 when you're building agentic systems, doing serious software engineering, working with extremely long documents, or need the most reliable reasoning available.
- Pick GPT-5.4 when you need a single ecosystem with image, video, and voice.
- Pick Gemini 3 Pro when multimodal is your primary use case.
- Pick DeepSeek V4 when cost matters more than the last few points of capability.
Real-World Use Cases and the Final Verdict on Opus 4.6
After two months of hands-on use across multiple workflows — coding, research, agentic systems, writing — here's where Claude Opus 4.6 actually delivers and where it doesn't.
Software Engineering: Where Opus 4.6 Shines
This is the model's strongest suit by a significant margin. For complex refactoring, architectural decisions, debugging across multiple files, and any task that requires holding a large codebase in working memory, Opus 4.6 is now the default choice for serious engineering work. The combination of the 1M context window (you can drop an entire mid-sized codebase into a single prompt), the extended thinking mode (which catches its own mistakes before producing code), and the improved tool use (cleaner agentic loops in Claude Code) makes it noticeably better than Claude 4.5 was for the same tasks.
Specific wins observed in real work: untangling a 15-file React refactor in a single session, finding a subtle race condition in a Go service by analyzing the entire concurrency model in one prompt, and writing a working Postgres migration that touched seven tables and three foreign keys without a single iteration. These are tasks that previously required either careful prompt engineering or multiple back-and-forth turns.
Research and Long-Document Analysis
The 1M context window is genuinely transformative for research workflows. Loading a hundred academic papers and asking cross-document questions, ingesting an entire annual report and asking forensic accounting questions, or feeding a complete legal case file and asking for inconsistency detection — all of these are now single-prompt operations rather than multi-stage RAG pipelines. The model's recall across the full window holds up well, and extended thinking helps it synthesize across sources rather than just retrieving from one.
Agentic Workflows and Tool Use
If you're building autonomous agents — long-running loops where the model plans, calls tools, evaluates results, and continues — Opus 4.6 is clearly the best model available right now. The reduction in wasted tool calls, the cleaner planning behavior, and the improved recovery from errors all add up to agents that finish more tasks successfully. For Claude Code users specifically, Opus 4.6 makes the experience meaningfully more reliable on hard problems.
Writing, Research Support, and Everyday Use
Honestly? For most everyday tasks, Sonnet 4.6 is the right choice and Opus 4.6 is overkill. Drafting an email, summarizing a meeting, brainstorming ideas, writing marketing copy, answering general knowledge questions — these are tasks where you won't see a meaningful difference between the two models, but you'll pay 5x more for Opus. Save Opus 4.6 for the moments that genuinely require it.
Where Opus 4.6 Falls Short
Three honest weaknesses: there's still no native image generation (you'll need DALL-E or Midjourney for that), no audio output (you'll need Gemini or GPT for voice), and no video understanding (Gemini is the only frontier model that handles video well). If your application requires any of these, Opus 4.6 is incomplete on its own and needs to be paired with something else.
The Final Verdict
Claude Opus 4.6 is the best general-purpose text-and-code model available in April 2026 — measurably ahead of GPT-5.4 on coding and agentic tasks, ahead of Gemini 3 Pro on reasoning, and worth its premium over Sonnet 4.6 specifically for the hardest work. It's not a universal winner — Gemini 3 Pro is better for multimodal, GPT-5.4 has a broader ecosystem, DeepSeek V4 is cheaper — but for serious engineering, agentic systems, and research workflows, Opus 4.6 is what we'd reach for.
The Mythos rumor, if it turns out to be true, suggests that 2026 will end with even larger Anthropic models in the picture. For now, Opus 4.6 is what's actually shipping, and it's very good. If you've been on the fence about upgrading from Claude 4.5 or migrating from another provider for serious technical work, the answer in April 2026 is yes — this is the release that earns its flagship status.