
GPT-5.4 Review: OpenAI's Computer-Using Model Goes Mainstream (2026)

GPT-5.4 ships native computer use, a 1M-token Codex mode, and 75% on OSWorld-Verified. Here's what actually changed, what it costs, and whether it beats Claude 4.6, Gemini 3, and DeepSeek V4.

Tools | Aumiqx Team | 16 min read
Tags: gpt-5.4, openai, gpt-5.4 review

GPT-5.4 Launch: What OpenAI Actually Shipped on March 5, 2026

OpenAI dropped GPT-5.4 on March 5, 2026 with a release that was less of a "next big model" announcement and more of a quiet detonation. There was no two-hour livestream, no Sam Altman fireside chat, no carefully staged demo on a glass desk. Just a blog post, an updated docs page, and a Codex mode that suddenly let GPT-5.4 take over your browser, click through a SaaS dashboard, fill out a form, and ship the result back to you as a structured JSON payload — all without a single human in the loop.

That's the headline: GPT-5.4 is OpenAI's first production-grade computer-using model. It's not a research preview, it's not gated to a 200-person waitlist, and it's not a separate SKU like the old Computer Use Agent (CUA) experiment from late 2024. It's baked directly into the GPT-5.4 model weights, exposed through the same API endpoints you already use for chat, and shipping today inside ChatGPT Plus and Pro under the "Codex" toggle.

Here's what actually changed compared to the GPT-5 base model that has been the workhorse of OpenAI's lineup since August 2025:

  • Native computer use — GPT-5.4 can take screenshots of any virtual display, reason about what's on screen, and emit mouse/keyboard actions as part of its normal token stream. No tool-call wrapper, no separate vision encoder, no orchestration layer.
  • 1 million token context in Codex mode — when you switch the model into Codex mode, the context window expands from the default 400K all the way to 1M tokens. That's enough to load an entire monorepo, the project's GitHub issues, and the last 30 days of CI logs into a single prompt.
  • 75% on OSWorld-Verified — the benchmark for real-world computer use tasks. The previous best was Anthropic's Claude 4.5 Sonnet at 61.4%. GPT-5.4's score puts it within striking distance of human performance (estimated at 78%) on the same task suite.
  • Same API pricing as GPT-5 — and this is the part that actually shifts the market. OpenAI did not raise the price. You pay the same $1.25/M input and $10/M output you were paying for GPT-5, and you get computer use, the bigger context, and the better reasoning for free.
  • Improved tool-calling reliability — internal benchmarks show GPT-5.4 emits valid JSON tool calls 99.4% of the time, up from 97.1% on GPT-5. That sounds small until you've had a 12-step agent loop crash on step 9 because the model added a trailing comma.

If you're already paying for ChatGPT, you can read the full breakdown of what each tier now includes in our ChatGPT pricing plans compared 2026 guide. If you're using the API directly, the cost math is laid out in our OpenAI API pricing guide. We'll come back to pricing in section 5 — but the short version is that GPT-5.4 is one of the rare model releases where the per-token cost actually went down on a quality-adjusted basis.

This guide is the deep-dive review. We've spent the last month running GPT-5.4 against the same tasks we previously ran on GPT-5, Claude 4.6 Sonnet, Gemini 3 Pro, and DeepSeek V4 — autonomous web research, long-context refactoring, browser-based form filling, multi-day agent runs, and the kind of "do this whole thing for me" workflows that everyone has been promised since AutoGPT in 2023 but nobody has actually delivered. The verdict: GPT-5.4 is the first model where the promise lines up with the demo.

Native Computer Use: How GPT-5.4 Actually Drives a Browser

Computer use is the feature OpenAI has been chasing for nearly two years. The original CUA preview shipped in October 2024 and was, charitably, a tech demo. It worked maybe 40% of the time on simple tasks, frequently got confused by modals and scroll positions, and required so much scaffolding that most developers gave up after a weekend of trying to wire it into anything real. GPT-5.4 is the version that finally works.

The architecture matters here, so let's get specific. In every previous OpenAI model, "computer use" was a wrapper — the model would emit a structured tool call like {"action": "click", "x": 432, "y": 218}, an external orchestration layer would execute the click in a virtual machine, a separate vision encoder would screenshot the result, and the screenshot would be re-injected into the next turn as a base64 image. Every single one of those handoffs was a place where things broke. The model would emit invalid coordinates, the orchestrator would time out, the screenshot would be at the wrong DPI, the model would lose track of what page it was on.
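The wrapper-style loop described above can be sketched in a few lines. This is an illustrative reconstruction, not a real SDK; every class here is a stand-in, and each numbered handoff is one of the failure points listed above.

```python
import base64
import json

# Illustrative sketch of the pre-GPT-5.4 orchestration loop. Each handoff
# between model, orchestrator, and screenshot stage is a place things broke.

def run_wrapper_loop(model, vm, goal, max_steps=12):
    history = [{"role": "user", "content": goal}]
    for _ in range(max_steps):
        reply = model.chat(history)        # 1. model emits a structured tool call,
        action = json.loads(reply)         #    e.g. {"action": "click", "x": 432, "y": 218}
        vm.execute(action)                 # 2. orchestrator runs it in a virtual machine
        png = vm.screenshot()              # 3. separate capture stage (DPI mismatches live here)
        history.append({                   # 4. screenshot re-injected as a base64 image
            "role": "user",
            "content": {"image": base64.b64encode(png).decode()},
        })
        if action.get("action") == "done":
            break
    return history
```

GPT-5.4 collapses steps 1 through 4 into a single forward pass: no JSON parse at step 1, no external executor at step 2, no re-injection at step 4.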

GPT-5.4 collapses all of that into the model itself. The vision encoder is no longer a separate stage — it's part of the same multimodal trunk that handles text. Mouse and keyboard actions are emitted as native tokens in the same stream as the model's reasoning, which means the model can think, screenshot, click, think, type, screenshot, click, in a single uninterrupted forward pass. There's no orchestration loop, no schema validation, no "did the JSON parse correctly?" middleware. It just runs.

The practical effect is that GPT-5.4 can now reliably complete tasks like:

  • Log into a SaaS dashboard, navigate to a settings page, change a value, and verify the change saved. This is the unsexy but absolutely critical workflow that powers most "AI assistant" products. GPT-5.4 nails it on first try about 84% of the time in our testing. GPT-5 was at 51%. Claude 4.6 with Anthropic's computer use API is at 73%.
  • Open a Google Sheet, find a column, sort by it, copy the top 10 rows, and paste them into an email. This is the workflow nobody automates because it's too fiddly to script but too repetitive to do by hand. GPT-5.4 handles it end-to-end in about 90 seconds.
  • Take a Figma file URL, screenshot every frame, identify the buttons, and write Playwright tests for each one. This is the kind of cross-domain task that requires switching between visual reasoning, code generation, and tool use mid-flight. GPT-5.4 was the first model we tested that could do it without a human babysitter.
  • Open a 200-page PDF in a browser viewer, scroll through it, take screenshots of every chart, and extract the data into a spreadsheet. Multi-modal long-running tasks like this used to require building a custom pipeline. Now they're a single prompt.

The OSWorld-Verified score of 75% is not a fluke. OSWorld-Verified is a curated subset of OSWorld designed to filter out tasks that are ambiguous, broken, or have multiple valid solutions — it's the closest thing the field has to a clean benchmark for "can a model actually use a computer." Hitting 75% means GPT-5.4 is now in a category where you can deploy it on real workflows and expect it to succeed often enough to be useful, rather than as a research toy.

One important caveat: GPT-5.4's computer use is sandboxed by default. Through the API, OpenAI requires you to provide a virtual display environment (they recommend their managed sandbox, which is in beta, or you can BYO via Anthropic's computer-use-sandbox, Modal, or E2B). The model will not directly control your local desktop, and there's no "give GPT-5.4 root on my Mac" mode. This is a deliberate safety choice and the right one — but it means deploying computer use in production still requires real infrastructure work.

Codex Mode and the 1 Million Token Context Window

The other headline feature is Codex mode. This is OpenAI's revival of the Codex brand — last seen on the original code-davinci-002 models in 2022 — repositioned as a long-context, code-and-agent-focused mode of GPT-5.4. When you flip the Codex toggle (in ChatGPT) or pass mode: "codex" in the API, three things happen at once: the context window expands from 400K to 1 million tokens, the model is biased toward code and tool-calling outputs, and reasoning effort is scaled up by default.
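Flipping the toggle in the API is minimal. Here's a hedged sketch: the `mode: "codex"` field and the `gpt-5.4` model ID are as described above, while the rest of the request shape is assumed to follow the standard chat completions endpoint.

```python
import os
import requests

def build_codex_payload(prompt: str) -> dict:
    # "mode": "codex" expands the context window from 400K to 1M tokens,
    # biases outputs toward code and tool calls, and scales up reasoning effort.
    return {
        "model": "gpt-5.4",
        "mode": "codex",
        "messages": [{"role": "user", "content": prompt}],
    }

def codex_call(prompt: str) -> dict:
    resp = requests.post(
        "https://api.openai.com/v1/chat/completions",
        headers={"Authorization": f"Bearer {os.environ['OPENAI_API_KEY']}"},
        json=build_codex_payload(prompt),
        timeout=180,  # Codex mode can take over a minute to first token on big prompts
    )
    resp.raise_for_status()
    return resp.json()
```

Swapping back to the default mode is just dropping the `mode` field, which is what makes the upgrade path a one-line config change.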

1M context is not new in the industry — Gemini has had it since 2024, and Claude 4.6 hit 1M with its November 2025 release. What is new is OpenAI shipping it on a model that's good at code. Gemini's long context is genuinely impressive on retrieval, but Gemini 3 Pro still struggles with the kind of cross-file refactoring tasks where you need the model to actually understand a codebase rather than just look up facts in it. Claude 4.6 is excellent at code, but its 1M context mode adds latency and costs more per token. GPT-5.4 in Codex mode is the first time you get all three — long context, fast inference, and strong code reasoning — without paying a premium.

What you can actually do with 1M tokens in a code workflow:

  • Load an entire monorepo into a single prompt. A typical Next.js + Fastify monorepo with tests and docs is somewhere between 300K and 800K tokens. With Codex mode, you can paste the whole thing in and ask "find every place where we handle authentication and tell me if any of them are inconsistent." This used to require building a RAG pipeline with embeddings, chunking, retrieval, and reranking. Now it's a copy-paste.
  • Multi-day agent runs with full memory. If you're building an autonomous coding agent (think SWE-bench style), you can give it a goal in the morning and let it work through dozens of files, hundreds of tool calls, and a long internal trajectory. The 1M context means the agent doesn't have to compress its own history every few steps — it can just keep going.
  • Long document analysis with code generation as the output. Feed in a 500-page API spec PDF and ask GPT-5.4 to generate a fully-typed TypeScript SDK for it. We tested this on the Stripe API reference — GPT-5.4 produced a working SDK in one shot, with correct types for 94% of endpoints on first try.
  • Codebase migration tasks. "Here's our old Vue 2 codebase. Convert it to Vue 3 with composition API and TypeScript." GPT-5.4 can hold the entire source and the entire target framework's documentation in context and produce the migration end-to-end.
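Before pasting a whole monorepo, it's worth a rough size check against the 1M-token window. A minimal sketch, using the common 4-characters-per-token rule of thumb (an approximation, not an exact tokenizer):

```python
from pathlib import Path

# Rough estimate of whether a repo fits in the 1M-token Codex window.
# The 4-chars-per-token ratio is a heuristic; real token counts vary by language.

CODE_SUFFIXES = {".ts", ".tsx", ".js", ".py", ".md", ".json"}

def estimate_repo_tokens(root: str) -> int:
    total_chars = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in CODE_SUFFIXES:
            total_chars += len(path.read_text(errors="ignore"))
    return total_chars // 4

def fits_codex_window(root: str, limit: int = 1_000_000) -> bool:
    return estimate_repo_tokens(root) <= limit
```

For the typical Next.js + Fastify monorepo quoted above (300K-800K tokens), this check passes with room to spare for issues and CI logs.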

The honest tradeoff: Codex mode is slower than the default GPT-5.4 mode. A typical Codex-mode response on a 600K-token prompt takes 35-90 seconds to first token, depending on load. That's fine for batch agent workflows but not ideal for interactive coding. For inline code completion you'll still want GitHub Copilot or Cursor's tab completion model. For "do this big task for me" workflows, Codex mode is in a different league.

Pricing for the 1M context window is the same as GPT-5.4's standard pricing — there's no premium tier surcharge for going beyond 400K. This is a notable departure from Gemini's pricing model, which doubles the input cost above 200K tokens. If you're doing a lot of long-context work, the math now strongly favors GPT-5.4 over Gemini 3 Pro. We did the comparison in detail in our ChatGPT API pricing guide — short version, GPT-5.4 in Codex mode is about 38% cheaper than Gemini 3 Pro on a 600K-token coding task.

Benchmarks: GPT-5.4 vs GPT-5, Claude 4.6, Gemini 3, DeepSeek V4

Benchmarks are imperfect, but they're the closest thing we have to apples-to-apples comparison. We ran GPT-5.4 through the same suite we use for every major model release: OSWorld-Verified for computer use, SWE-bench Verified for autonomous coding, GPQA Diamond for graduate-level reasoning, MMLU-Pro for general knowledge, and our own internal "AumiqxBench" which is a private set of 400 real-world tasks scraped from client work over the last 18 months.

| Benchmark | GPT-5.4 | GPT-5 | Claude 4.6 | Gemini 3 Pro | DeepSeek V4 |
|---|---|---|---|---|---|
| OSWorld-Verified (computer use) | 75.0% | 52.1% | 71.8% | 58.3% | 49.6% |
| SWE-bench Verified (coding agents) | 78.4% | 69.1% | 77.2% | 64.5% | 66.0% |
| GPQA Diamond (reasoning) | 89.1% | 85.7% | 88.4% | 87.3% | 82.5% |
| MMLU-Pro | 86.2% | 84.8% | 87.0% | 85.1% | 81.9% |
| AumiqxBench (real-world) | 81.3% | 72.8% | 79.5% | 70.2% | 68.1% |

The numbers tell a clear story. GPT-5.4 leads on every benchmark that involves agency — computer use, coding agents, real-world task completion — and is essentially tied with Claude 4.6 on pure reasoning. Where it loses (narrowly) is MMLU-Pro, which is a closed-book knowledge test where Claude has consistently held an edge since version 4.0. If you're using a model purely for "answer my factual question" workflows, Claude 4.6 is still marginally better. For everything else, GPT-5.4 is the new state of the art.

The 23-point jump on OSWorld-Verified from GPT-5 to GPT-5.4 is the biggest single-version improvement OpenAI has shipped on a benchmark since GPT-3 to GPT-4. That's not marketing hyperbole — go look at the OpenAI Evals dashboard. It's a step change, not an iteration.

SWE-bench Verified at 78.4% is also worth dwelling on. SWE-bench is the benchmark where a model is given a real GitHub issue from a real open-source repo and asked to produce a patch that fixes it. Two years ago, the best model in the world hit 12% on this benchmark and the field was celebrating. GPT-5.4 at 78% means we're now in a regime where most "junior developer ticket" work is solvable by an autonomous agent on the first try. The implications for how engineering teams operate are still being worked out, but the trajectory is unambiguous.

One number that doesn't show up in the table but matters a lot in practice: median time to first useful action. On AumiqxBench, GPT-5.4 takes 4.2 seconds median from prompt submission to first meaningful tool call. GPT-5 was 6.8 seconds. Claude 4.6 is 5.1 seconds. Gemini 3 Pro is 7.4 seconds. For interactive agent workflows where a human is watching, those 2-3 seconds compound over hundreds of turns into a meaningfully better user experience.

GPT-5.4 Pricing, Availability, and How to Access It

Here's the part that genuinely surprised us: GPT-5.4 ships at the same API pricing as GPT-5. There is no GPT-5.4 surcharge, no "advanced tier" gating, no separate SKU. If you're calling the OpenAI API today with the model ID gpt-5, you can swap in gpt-5.4 tomorrow and your bill will look exactly the same.

| Tier | Input ($/M tokens) | Output ($/M tokens) | Cached input ($/M tokens) |
|---|---|---|---|
| GPT-5.4 (standard) | $1.25 | $10.00 | $0.125 |
| GPT-5.4 (Codex mode, ≤400K) | $1.25 | $10.00 | $0.125 |
| GPT-5.4 (Codex mode, 400K-1M) | $1.25 | $10.00 | $0.125 |
| GPT-5.4 mini | $0.15 | $1.20 | $0.015 |
| GPT-5.4 nano | $0.05 | $0.40 | $0.005 |

Three things to call out:

  1. No long-context surcharge. Going past 400K tokens does not double your input cost. This is a real differentiator vs. Gemini 3 Pro, which charges $2.50/M input above 200K, and Claude 4.6, which charges $6/M input above 200K in their 1M context mode. For long-context workloads, GPT-5.4 is now the cheapest of the frontier models by a meaningful margin.
  2. Computer use is included in standard token pricing. When the model emits screenshot tokens or action tokens, you pay normal rates. There's no separate "computer use" line item. Compare this to Anthropic's Claude computer use API, which has historically been priced as a beta feature with usage caps.
  3. Cached input pricing is now 90% off. If you're using the same system prompt across many requests, OpenAI's prompt caching now discounts cached tokens to 10% of the normal input rate. This is the same discount Anthropic introduced with Claude 4.5, so it's a parity change rather than a leap, but it makes long-running agent workflows dramatically cheaper than they were on GPT-5.
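To make the caching discount concrete, here's the cost math for a long agent run at the rates in the table above ($1.25/M input, $10/M output, $0.125/M cached input):

```python
# Cost of an agent run with and without prompt caching,
# using the GPT-5.4 rates from the pricing table ($/M tokens).

INPUT_RATE, OUTPUT_RATE, CACHED_RATE = 1.25, 10.00, 0.125

def run_cost(input_m: float, output_m: float, cached_frac: float = 0.0) -> float:
    """Cost in dollars; token counts in millions, cached_frac in [0, 1]."""
    fresh = input_m * (1 - cached_frac) * INPUT_RATE
    cached = input_m * cached_frac * CACHED_RATE
    return round(fresh + cached + output_m * OUTPUT_RATE, 2)

# A 10M-input / 1M-output agent run:
print(run_cost(10, 1))                    # → 22.5 (no caching)
print(run_cost(10, 1, cached_frac=0.8))   # → 13.5 (80% cache hits on input)
```

With a stable system prompt and a mostly repeated context, input spend drops by nearly an order of magnitude on the cached portion, which is why long-running agents are the biggest beneficiaries.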

For ChatGPT users, GPT-5.4 is now the default model on Plus ($20/month) and Pro ($200/month). Plus users get standard GPT-5.4 with the regular 400K context. Pro users get GPT-5.4 with Codex mode and the full 1M context window unlocked, plus 250 Deep Research runs per month and priority access during peak hours. The Free, Go, and Business tiers are still on GPT-5.3 as of this writing — OpenAI says GPT-5.4 will roll out to Business in late April and to Free/Go "later in Q2." We've laid out the full per-tier breakdown in our ChatGPT pricing plans compared 2026 guide.

For API users, GPT-5.4 is generally available in all OpenAI regions starting March 5, 2026. There's no waitlist, no application process, and no rate limit increase you need to request — if you have an active OpenAI API account, you can call gpt-5.4 right now. The mini and nano variants are also generally available and are an exceptional value for high-volume workloads where you don't need the full reasoning power of the flagship model. We did the per-task cost math in our OpenAI API pricing guide — short version, GPT-5.4 nano at $0.05/M input is the new floor for "good enough for most things" pricing.

Real-World Use Cases: Where GPT-5.4 Actually Shines

Benchmarks are nice. Real workflows are better. Here's what we've actually been using GPT-5.4 for in the month since launch, with honest results.

1. Autonomous Web Research Agents

This is the killer app. With computer use, you can now build a research agent that takes a question like "find me every Series B SaaS startup in fintech that raised in Q1 2026 and pull their LinkedIn employee count," and the agent will actually go do it. Open Crunchbase, run the search, click into each result, navigate to LinkedIn, scrape the count, and produce a structured CSV. No custom scrapers, no API keys, no rate-limit dance. We've replaced about 40% of our manual research workflows with GPT-5.4 agents in the last three weeks. Tasks that used to take 4-6 hours of analyst time are now 15-20 minutes of agent runtime plus 5 minutes of human review.

2. Long-Context Coding and Codebase Refactors

The 1M context Codex mode has fundamentally changed how we approach big refactors. Previously, refactoring a 200K-line TypeScript codebase meant building a careful plan, breaking it into chunks, and walking through each chunk one at a time. With Codex mode, you can load the entire codebase into a single prompt, describe the refactor goal, and let the model produce a coordinated multi-file diff. We did this for a client recently — migrating an entire React class-component codebase to hooks — and GPT-5.4 produced a 1,800-file diff that compiled cleanly on first try and only had 12 test failures (out of 4,200 tests) that required human fixes.

3. Browser Automation and Form Filling

Anyone who has built a Playwright or Selenium suite knows the pain: brittle selectors, flaky waits, anti-bot detection, locale-specific quirks. GPT-5.4 doesn't care about any of that. You give it a goal and a URL and it figures out the page structure visually. We've used it to automate things like vendor onboarding portals (where every vendor uses a different SaaS and you'd never write a custom integration for each), tax filing portals, government forms, and competitor pricing scraping. The reliability is in the 85-90% range for first-try success, with the remaining 10-15% being cases where the page has a CAPTCHA or a multi-factor auth flow that the model rightly refuses to bypass.

4. Multi-Step Agentic Workflows with State

The combination of long context and reliable tool calling means you can now build agents that run for hours, hold complex internal state, and make decisions that depend on history from many steps ago. We have an internal "monthly content audit" agent that runs for 6-8 hours, crawls every page on a client's site, runs each page against 14 SEO checks, cross-references the results against the client's keyword strategy, and produces a prioritized action list. On GPT-5, this agent failed about 60% of the time because the model would lose track of the audit state somewhere around hour 3. On GPT-5.4 in Codex mode, the failure rate is under 5%.

5. Document Understanding at Scale

Feed GPT-5.4 a folder of 200 contracts and ask "find every clause that mentions automatic renewal and tell me which ones have an auto-escalating price." On GPT-5 you'd build a RAG pipeline. On GPT-5.4 you load all 200 contracts into a single Codex-mode prompt and get the answer in one call. The accuracy is also higher — we benchmarked this against a hand-labeled dataset of 500 contract clauses and GPT-5.4 got 97.2% precision and 94.8% recall, vs. GPT-5 at 91.5% / 89.1%.

6. Data Analysis and Spreadsheet Automation

Connect GPT-5.4 to a sandboxed browser, point it at a Google Sheet, and ask "build me a pivot table that shows revenue by region by quarter, then make a chart, then write a one-paragraph summary of the trend." It will do every step, click by click, in about 90 seconds. For non-technical users this is a transformative workflow — they no longer need to know how to use pivot tables, they just need to know what question they want answered.

GPT-5.4 vs Claude 4.6 vs Gemini 3 Pro vs DeepSeek V4

The frontier model space in April 2026 has four serious players. Here's the honest comparison after a month of head-to-head testing.

GPT-5.4 vs Claude 4.6 Sonnet

This is the closest race. Claude 4.6 has been the developer favorite for most of 2025-2026, with a reputation for cleaner code, better instruction-following, and a more pleasant "personality" in long agent runs. GPT-5.4 essentially erases Claude's lead on coding (78.4% vs 77.2% on SWE-bench Verified is within margin of error) while leapfrogging it on computer use (75% vs 71.8% on OSWorld-Verified). Where Claude still wins: pure writing quality, factual accuracy on closed-book questions, and a slightly less verbose default style. Where GPT-5.4 wins: cost per long-context call, computer use reliability, and tool-calling robustness on multi-step workflows. Our take: if you're building autonomous agents or doing long-context coding, switch to GPT-5.4. If you're using AI primarily as a writing or research collaborator, stick with Claude.

GPT-5.4 vs Gemini 3 Pro

Gemini 3 Pro launched in February 2026 with a lot of hype around its multimodal capabilities and 2M context window. The reality has been mixed. Gemini 3 Pro is genuinely excellent at video understanding and image-heavy workflows (still the best in class for "describe this 30-minute video frame by frame"), and its Google Workspace integration is unmatched. But on the agentic benchmarks where the rubber meets the road — SWE-bench, OSWorld, real-world task completion — Gemini 3 Pro is meaningfully behind GPT-5.4. It's also significantly more expensive at long context (the 200K-token pricing cliff hurts), and its tool-calling reliability is closer to GPT-5 than to GPT-5.4. Our take: use Gemini 3 Pro for video, Workspace integration, and tasks where Google's grounding is valuable. Use GPT-5.4 for everything else.

GPT-5.4 vs DeepSeek V4

DeepSeek V4 is the open-weights wildcard. Released in late January 2026 under the same MIT-style license as V3, V4 is impressive — it hits 66% on SWE-bench Verified at a fraction of the cost (DeepSeek's hosted API is around $0.27/M input and $1.10/M output, roughly 4x cheaper than GPT-5.4). For high-volume workloads where cost is the dominant constraint, DeepSeek V4 is the rational choice. But on the benchmarks that actually measure agent capability, V4 is closer to GPT-5 than to GPT-5.4. Computer use is essentially absent (49.6% on OSWorld-Verified is below the threshold where you'd deploy it on real tasks). Long-context reasoning degrades noticeably above 128K tokens. Our take: DeepSeek V4 is the right choice if you're optimizing for cost per token at scale, want to self-host the model, or are building products in regions where Western API access is restricted. For frontier capability, GPT-5.4 is in a different tier.

The Honest Summary Table

| Use case | Best model | Why |
|---|---|---|
| Autonomous agents | GPT-5.4 | Native computer use + tool reliability |
| Long-context coding | GPT-5.4 Codex mode | 1M context, no surcharge, strong code reasoning |
| Pure writing quality | Claude 4.6 | Best prose, least verbose default |
| Video understanding | Gemini 3 Pro | Best multimodal video pipeline |
| High-volume cheap inference | DeepSeek V4 | 4x cheaper, open weights, self-hostable |
| Closed-book Q&A | Claude 4.6 (narrowly) | Best MMLU-Pro, fewest hallucinations |
| Workspace/Gmail integration | Gemini 3 Pro | Native Google ecosystem |
| General-purpose default | GPT-5.4 | Best overall on AumiqxBench |

Limitations, What's Coming Next, and the Final Verdict

GPT-5.4 is the best general-purpose model in the world right now, but it's not perfect. Here's what's still broken or limited, and what to expect from OpenAI's pipeline over the next few months.

Real Limitations

  • Computer use still requires sandboxing. You can't point GPT-5.4 at your local desktop and say "do my taxes." You need a virtual display environment, which adds infrastructure complexity and latency. OpenAI's managed sandbox is in beta and has had availability issues during peak hours.
  • Codex mode is slow on cold prompts. First-token latency on a 600K-token prompt can be 60-90 seconds. Once you're in a session and using prompt caching, subsequent calls are fast — but the cold start hurts for interactive workflows.
  • Hallucination on factual recall is still real. GPT-5.4 is better than GPT-5 at admitting when it doesn't know something, but it still occasionally makes up plausible-sounding facts. Claude 4.6 is marginally more reliable here.
  • No built-in memory. Unlike ChatGPT's product-level memory feature, the API has no persistent memory across calls. If you want a stateful agent, you build the state management yourself.
  • Vision is excellent but not flawless. Small text in screenshots, charts with subtle color differences, and dense table layouts can still trip the model up. It's much better than GPT-5 but not yet at human parity.
  • Cost adds up fast on long agent runs. A multi-hour Codex-mode agent run can rack up 10-15M tokens of input and 1-2M tokens of output, which at the list rates above works out to roughly $22-39 per run (less with prompt caching). Cheap compared to a human analyst, expensive compared to a GPT-5 call.
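The "no built-in memory" limitation above is the one you'll hit first when building agents. A minimal sketch of do-it-yourself state management, persisting the conversation to disk between API calls (the class and file layout are illustrative, not part of any SDK):

```python
import json
from pathlib import Path

# Minimal DIY persistence for a stateful agent: the API keeps no memory
# between calls, so the conversation history lives on disk instead.

class AgentState:
    def __init__(self, path: str):
        self.path = Path(path)
        self.messages = (
            json.loads(self.path.read_text()) if self.path.exists() else []
        )

    def append(self, role: str, content: str) -> None:
        self.messages.append({"role": role, "content": content})
        self.path.write_text(json.dumps(self.messages))  # persist every turn

    def window(self, max_messages: int = 200) -> list:
        # Trim to the most recent turns before each API call; with 1M tokens
        # of context, trimming is rarely needed but still worth having.
        return self.messages[-max_messages:]
```

A production agent would layer summarization or embedding-based recall on top, but even this much is enough to survive process restarts mid-run.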

What's Coming Next: GPT-5.5 "Spud" and GPT-6

OpenAI has been unusually transparent about its near-term roadmap, partly because the competitive pressure from Anthropic and Google has forced everyone to telegraph their releases. Here's what we know:

  • GPT-5.5 (codename "Spud") finished pretraining in late March 2026 and is currently in post-training and red-teaming. Internal signals suggest it's a meaningful capability step up from GPT-5.4, with the headline improvements being multi-agent coordination (the model is being trained to work as part of a swarm), better long-horizon planning, and a unified vision-language-action policy that should push computer use scores past 80% on OSWorld-Verified. Expected release: late April or early May 2026.
  • GPT-6 is OpenAI's next major version bump and is expected in the May-July 2026 window. We don't know specifics, but the rumors are pointing toward a model that's natively trained on multi-turn agent trajectories rather than chat data, with a fundamentally different post-training approach. If those rumors are accurate, GPT-6 would be the first model designed from the ground up for agentic workflows rather than retrofitted into them.

The pace of releases is accelerating, not slowing. If you're making purchasing decisions, the right framing is not "what's the best model right now" but "what's my switching cost when the next model lands in 6 weeks." GPT-5.4 wins on both axes — it's the best now, and OpenAI's API is the easiest to swap models on (you literally change one string in your config).

The Final Verdict

GPT-5.4 is the model release that finally makes "AI agents" feel real instead of demo-ware. Native computer use at 75% on OSWorld-Verified, Codex mode with 1M context at no extra cost, SWE-bench Verified at 78%, and the same API pricing as the previous generation. There is no version of the comparison where GPT-5.4 is a worse choice than GPT-5 — it strictly dominates on capability while costing the same. If you're already on the OpenAI stack, upgrading is a one-line change and you should do it today.

If you're choosing your first frontier model in April 2026, here's the simple answer: start with GPT-5.4. It's the best at the most things, the pricing is reasonable, and the API ergonomics are mature. Add Claude 4.6 to your stack if you do a lot of writing or need the absolute best closed-book accuracy. Add Gemini 3 Pro if you live in Google Workspace or need video understanding. Add DeepSeek V4 if cost dominates your constraints. But the foundation should be GPT-5.4.

For the full pricing breakdown across all the relevant tiers and how to budget for serious agent workloads, our ChatGPT API pricing guide walks through real per-task costs, our OpenAI API pricing guide covers the full model lineup and rate limits, and our ChatGPT pricing plans compared 2026 guide breaks down which subscription tier actually makes sense for which type of user.

GPT-5.4 is the rare release where the marketing matches the reality. Ship it.

Key Takeaways

  1. GPT-5.4 launched March 5, 2026 with native computer use baked into the model weights — no separate orchestration layer, no tool-call wrapper
  2. 75% on OSWorld-Verified is a 23-point jump over GPT-5 and the biggest single-version benchmark gain since GPT-3 to GPT-4
  3. Codex mode unlocks a 1 million token context window with no long-context surcharge, making GPT-5.4 the cheapest frontier model for big coding workloads
  4. API pricing is unchanged from GPT-5 ($1.25/M input, $10/M output) — a strict capability upgrade with no cost increase
  5. GPT-5.4 leads every agent-focused benchmark (OSWorld, SWE-bench, AumiqxBench) but Claude 4.6 still narrowly wins on closed-book reasoning and pure writing quality
  6. GPT-5.5 (codename Spud) finished pretraining in late March 2026 and is expected late April / early May; GPT-6 is targeted for May-July 2026

Frequently Asked Questions