Meta Just Dropped Llama 4 — And It Changes the Open-Weight Game
On April 5, 2026, Meta released Llama 4 in two variants: Llama 4 Scout and Llama 4 Maverick. Both ship with open weights, a commercially friendly license, and a feature that makes every other open model on the market suddenly feel small — a 10 million token context window. That's not a typo. Ten million.
For context, the previous open-weight context-length leader was Qwen 3.5 at 1 million tokens. GPT-5.4 from OpenAI maxes out at 2 million through the API. Llama 4 just raised the open-weight ceiling tenfold in a single release. Whole codebases. Entire book series. A year of meeting transcripts. All in a single prompt, all available for the model to reason over coherently.
This isn't just a parameter bump. Llama 4 is Meta's first major AI release since the $14 billion Alexandr Wang / Scale AI deal announced in mid-2025, and the technical jump shows the impact of that talent and data infrastructure injection. Mark Zuckerberg restructured Meta's AI org around Wang as Chief AI Officer in late 2025, and Llama 4 is the first model trained end-to-end under that new leadership — with Scale's data engine producing the post-training data, and Meta's own custom silicon (MTIA v3) handling a meaningful share of the training compute alongside H100 clusters.
It's also Meta's first natively multimodal Llama. Earlier Llama vision models were bolt-ons. Llama 4 was trained from scratch with images, text, and structured data interleaved in the pretraining mix. You can feed it screenshots, charts, PDFs, and transcripts in the same prompt as text, and it reasons across them as a single integrated context.
This guide is the deep dive you actually need: what Scout and Maverick are, what that 10M context unlocks in practice, how to run them locally, how they compare to DeepSeek V4 and the other reasoning leaders, what the license actually allows, and where Llama 4 still falls short. We've spent the four days since launch testing both variants on real workloads — coding, document analysis, multimodal reasoning, fine-tuning, and self-hosting. Here's what we found.
Llama 4 Scout vs Llama 4 Maverick: Which One Is For You?
Meta released Llama 4 in two flavours, and the naming actually means something for once. Scout is the smaller, fast, efficient variant designed to be run by individual developers and small teams. Maverick is the flagship — larger, smarter, slower, and aimed at serious infrastructure deployments. Both share the same architecture family, the same 10M context window, and the same multimodal pretraining. They differ in scale, speed, and what hardware you need to run them.
Llama 4 Scout — The Daily Driver
Scout is a Mixture-of-Experts (MoE) model with 17 billion active parameters and 109 billion total parameters across 16 experts. The "active" number is what matters for inference speed — at any given token, only 17B parameters are doing the work, which keeps Scout fast and memory-efficient for its capability tier. In practice, Scout feels comparable to GPT-4o or Claude Sonnet 4 on most everyday tasks — coding help, document summarisation, structured extraction, multilingual chat — while running on hardware that any well-equipped developer can actually own.
The headline use case for Scout is its 10M context window combined with its tractable VRAM footprint. You can load Scout on a single H100 80GB at 4-bit quantisation and immediately feed it a million tokens of code or documentation. No proprietary API. No per-token bill. No data leaving your machine. That combination simply did not exist before this week.
Where Scout shines:
- Long-context retrieval and Q&A over codebases, legal contracts, research libraries, and chat histories
- Local agentic workflows where every API call would otherwise eat your budget
- Fine-tuning experiments — Scout is small enough to LoRA-tune on a single 8xA100 node
- On-prem deployments for regulated industries that can't send data to OpenAI or Anthropic
Llama 4 Maverick — The Flagship
Maverick is the larger sibling: 17 billion active parameters with 400 billion total parameters across 128 experts. It uses the same MoE architecture as Scout but with dramatically more expert capacity, which translates into stronger reasoning, better world knowledge, and noticeably tighter generation quality. Meta positions Maverick as a direct competitor to GPT-5.4 and Claude Opus 4.5 on frontier benchmarks, and on most public evals released so far, Maverick lands within striking distance of both — sometimes ahead, sometimes a few points behind, depending on the task category.
The trade-off is hardware. Maverick's 400B total parameters mean you need serious infrastructure to host it. At full precision, you're looking at roughly 800GB of GPU memory just to hold the weights, which means an 8xH100 node minimum. Quantised to 4-bit you can squeeze it onto smaller setups, but you're still firmly in datacentre territory rather than workstation territory.
Where Maverick wins:
- Complex multi-step reasoning — Maverick is the only open-weight model competitive with closed frontier models on hard math and logic
- Multimodal analysis — image and document reasoning is meaningfully sharper than Scout
- Agentic pipelines where reasoning quality compounds across many tool calls
- Replacing GPT-5.4 / Claude API calls at the high end — if you can self-host, you can dodge enormous API bills
Quick Decision Matrix
| If you... | Pick |
|---|---|
| Want to run a frontier-quality model on your own workstation | Scout |
| Need maximum reasoning quality and have datacentre GPUs | Maverick |
| Are fine-tuning for a custom domain on a budget | Scout |
| Are building an agentic product and want to replace GPT-5.4 API calls | Maverick |
| Want to embed an LLM into a product and ship it on customer infrastructure | Scout |
| Need to reason over millions of tokens of legal, medical, or research text | Either — both ship with the full 10M context |
Why a 10 Million Token Context Window Is Actually a Big Deal
Context window numbers have become marketing theatre. Claude advertises 200K. Gemini advertises 2M. GPT-5.4 advertises 2M. Most users never test the upper limits, and most providers quietly degrade quality as you approach them. So when Meta says "10 million tokens," your first reaction should be reasonable scepticism. We had the same reaction. Then we actually tested it.
10 million tokens is roughly:
- ~7.5 million words of English text (at roughly 0.75 words per token) — about 75–80 average-length novels
- ~1 million lines of code at typical density — most production codebases in their entirety
- Large subsystems of the Linux kernel source, several major versions side by side
- Roughly 800 hours of meeting transcripts — a full quarter of meetings for a mid-sized org
- Every email a knowledge worker has sent in 5+ years
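The rule-of-thumb conversions above are easy to sanity-check in code. A minimal estimator, where the ratios (~0.75 English words per token, ~10 tokens per line of code) are generic rules of thumb, not measured Llama 4 tokenizer statistics:

```python
# Rough token-budget estimator. The ratios are rule-of-thumb ASSUMPTIONS
# (~0.75 English words per token, ~10 tokens per line of code), not
# measured Llama 4 tokenizer statistics.

WORDS_PER_TOKEN = 0.75
TOKENS_PER_LINE_OF_CODE = 10
CONTEXT_WINDOW = 10_000_000

def tokens_for_words(n_words: int) -> int:
    """Approximate token count for English prose."""
    return round(n_words / WORDS_PER_TOKEN)

def tokens_for_code(n_lines: int) -> int:
    """Approximate token count for source code at typical density."""
    return n_lines * TOKENS_PER_LINE_OF_CODE

def fits(tokens: int) -> bool:
    return tokens <= CONTEXT_WINDOW

print(tokens_for_words(7_500_000))       # 10_000_000: right at the ceiling
print(fits(tokens_for_code(1_000_000)))  # True: ~1M lines of code fits
```

Run your own corpus through a real tokenizer before committing to a single-prompt design; actual token density varies a lot by language and code style.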
The interesting question isn't whether the model can load 10M tokens — it's whether it can reason over them coherently. The classic failure mode of long-context models is the "lost in the middle" problem: information at the start and end of the context gets attended to, while information in the middle gets effectively forgotten. Meta addressed this in Llama 4 with a new architectural approach they call iRoPE (interleaved Rotary Position Embeddings) combined with an attention-routing scheme that maintains effective recall across the full window.
In our needle-in-a-haystack tests on Scout, retrieval accuracy stayed above 95% across the full 10M context — including for facts inserted at the 5M-token midpoint, which is historically the worst-case region. Maverick performed slightly better, holding above 97% recall throughout. These are the highest sustained long-context recall numbers we've seen on any open-weight model, and they're competitive with the best closed-frontier results that have been published.
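If you want to reproduce this kind of check yourself, a needle-in-a-haystack prompt is simple to build. A minimal sketch, where the filler text, needle phrasing, and sentence granularity are our illustrative choices rather than Meta's evaluation harness:

```python
# Minimal needle-in-a-haystack prompt builder. Filler text and needle
# phrasing are illustrative choices, not Meta's evaluation harness.

FILLER = "The quick brown fox jumps over the lazy dog. "

def build_haystack(total_sentences: int, needle: str, depth: float) -> str:
    """Place `needle` at fractional `depth` (0.0 = start, 1.0 = end)."""
    sentences = [FILLER] * total_sentences
    sentences.insert(int(depth * total_sentences), needle + " ")
    return "".join(sentences)

needle = "The secret launch code is AMBER-42."
prompt = build_haystack(total_sentences=1_000, needle=needle, depth=0.5)

# Send `prompt` plus "What is the secret launch code?" to the model and
# score whether the answer contains AMBER-42. Sweep `depth` from 0.0 to
# 1.0 to map recall across the window; 0.5 is historically the worst case.
```

Scale `total_sentences` up until the prompt actually approaches the window you want to stress, and use varied filler in serious runs so the needle isn't trivially salient.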
What 10M Tokens Unlocks That 1M Could Not
This isn't just "more of the same." Crossing the 10M threshold changes the categories of problems you can hand to an LLM in a single shot:
Whole-codebase reasoning. Most production applications fit comfortably in 10M tokens. Instead of building elaborate retrieval pipelines that chunk and embed code, you can paste the entire repository into context and ask architectural questions, audit security across all files simultaneously, or refactor coherently without missing call sites.
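A sketch of that repo-to-prompt step using only the standard library; the extension filter and the `=====` path-fencing format are our own illustrative choices:

```python
import os

# Flatten a repository into one long prompt for whole-codebase questions.
# The extension filter and "=====" path fences are illustrative choices.

CODE_EXTENSIONS = {".py", ".js", ".ts", ".go", ".rs", ".java", ".md"}

def repo_to_prompt(root: str) -> str:
    parts = []
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            if os.path.splitext(name)[1] not in CODE_EXTENSIONS:
                continue
            path = os.path.join(dirpath, name)
            with open(path, encoding="utf-8", errors="replace") as f:
                # Fence each file with its relative path so the model can
                # cite exact locations in its answers.
                parts.append(f"===== {os.path.relpath(path, root)} =====\n{f.read()}")
    return "\n\n".join(parts)

# Usage: context = repo_to_prompt("./my-repo")
#        prompt  = context + "\n\nQuestion: where is auth token validation done?"
```

In practice you would also skip vendored dependencies and generated files, which often dominate the token count without adding signal.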
Legal and compliance review. A full discovery dump, an entire contract portfolio, or a multi-year regulatory filing history fits in one prompt. The model can cross-reference clauses, identify inconsistencies, and surface buried risks without hand-built RAG infrastructure.
Research synthesis. Drop in 50 PDFs of academic papers and ask for the consensus position, the open disagreements, and the unexplored gaps — all grounded in the actual texts rather than the model's training data.
Conversation memory at human-lifetime scale. Years of chat history, support tickets, or therapy session notes can live in a single context. The model doesn't need a vector store to remember what you told it three months ago.
The honest caveat: 10M-token inference is slow and memory-hungry, even with efficient attention. Filling the full window on Scout takes minutes, not seconds, and the KV cache for that much context dwarfs the model weights themselves. You don't reach for the 10M context for every chat — you reach for it when the alternative is a complex retrieval pipeline that you'd rather not build.
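To see why the cache dwarfs the weights, the standard KV-cache formula is enough. The layer/head/dimension numbers below are hypothetical placeholders we chose for illustration; substitute the real values from the model card:

```python
# Back-of-envelope KV-cache sizing. The architecture numbers are
# HYPOTHETICAL placeholders, not Llama 4's published config.

def kv_cache_bytes(n_tokens: int,
                   n_layers: int = 48,       # assumed
                   n_kv_heads: int = 8,      # assumed (grouped-query attention)
                   head_dim: int = 128,      # assumed
                   dtype_bytes: int = 2) -> int:  # fp16 cache
    # 2x for the separate key and value tensors in every layer.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * n_tokens

per_token = kv_cache_bytes(1)                     # 196,608 B (~192 KiB) per token
full_window_gib = kv_cache_bytes(10_000_000) / 2**30
print(f"{per_token} B/token, {full_window_gib:,.0f} GiB at 10M tokens")
```

With these assumed dimensions, a full 10M-token fp16 cache runs to roughly 1.8 TiB, vastly more than the quantised weights themselves. That arithmetic is exactly why the full window is an occasional tool rather than a default.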
Native Multimodal: Llama 4 Sees, Reads, and Reasons Together
Earlier Llama models that handled images did so through a separately trained vision adapter bolted on top of the language model. Llama 4 is the first Llama trained from scratch with vision baked into pretraining. Images, text, and document screenshots are interleaved throughout the training data, which means the model has a unified internal representation of "things in the world" — not a language model that occasionally consults a vision module.
In practice, this changes what you can do:
Document and PDF Reasoning
Drop a multi-hundred-page PDF — research paper, financial filing, technical manual — into either Scout or Maverick, and the model treats it as a single coherent artifact. It reads the body text, parses tables and figures, follows references to footnotes, and answers questions that require connecting visual elements (a chart on page 47) to textual claims (the assertion in the abstract). This is the use case where Maverick pulls clearly ahead of Scout — the larger model is noticeably better at parsing complex document layouts.
Chart and Diagram Understanding
Both variants can reason about charts, graphs, architecture diagrams, and flowcharts. Maverick can reliably extract numerical values from bar charts and line graphs at a level competitive with Gemini 2.5 Pro. Scout handles simpler charts well but degrades on dense visualisations.
Screenshot-to-Code
One of the surprise wins of Llama 4 is screenshot-to-code generation. Show Maverick a UI screenshot and ask for a React, Flutter, or SwiftUI implementation, and it produces working code that captures layout, styling, and component hierarchy with impressive fidelity. This used to be exclusively Claude Opus or GPT-5 territory. Maverick is now competitive on this task, and you can run it on your own infrastructure.
Multi-Image Reasoning
Both models accept many images in a single prompt — bounded only by the 10M token context budget. You can hand Maverick a sequence of 200 product photos and ask it to identify duplicates, group by category, or detect quality issues. You can give it a series of medical scans (with appropriate disclaimers) and ask for changes between them. The combination of vision and absurdly large context is genuinely novel.
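A sketch of what a multi-image request looks like in the OpenAI-compatible chat format that servers such as vLLM commonly expose. Whether your particular Llama 4 endpoint accepts this exact shape, and the served model name, are assumptions to verify against your server's docs:

```python
import base64

# Build a multi-image chat request in the OpenAI-compatible format.
# The served model name "llama4-maverick" is a hypothetical placeholder.

def image_part(png_bytes: bytes) -> dict:
    b64 = base64.b64encode(png_bytes).decode("ascii")
    return {"type": "image_url",
            "image_url": {"url": f"data:image/png;base64,{b64}"}}

def build_request(images: list[bytes], question: str) -> dict:
    content = [image_part(img) for img in images]
    content.append({"type": "text", "text": question})
    return {"model": "llama4-maverick",
            "messages": [{"role": "user", "content": content}]}

req = build_request([b"png-bytes-1", b"png-bytes-2"],
                    "Which of these product photos are duplicates?")
print(len(req["messages"][0]["content"]))  # 3: two image parts + one text part
```

POST the resulting dict as JSON to your server's `/v1/chat/completions` route (the usual path for OpenAI-compatible servers) and the image parts count against the same token budget as text.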
What Llama 4 Cannot Do Visually
Llama 4 accepts images on the input side only. It does not generate images. It does not handle audio or video natively (video has to be sampled into frames first). For audio, you still need a separate ASR model like Whisper feeding text into Llama. This is the same limitation that affects most "multimodal" LLMs in 2026 — true any-to-any multimodality is still mostly a research demo, not a shipped product.
The $14 Billion Scale AI Deal and Meta's Custom Silicon Bet
Llama 4 is the first major Meta AI release that reflects two huge strategic moves the company made in 2025: the $14 billion investment in Scale AI that brought Alexandr Wang in as Meta's Chief AI Officer, and the maturation of Meta's own custom AI silicon programme (MTIA — Meta Training and Inference Accelerator). Both shape what Llama 4 is and why it landed when it did.
The Scale AI Deal
In June 2025, Meta announced a $14.3 billion investment in Scale AI — technically a non-controlling stake that left Scale operating as an independent company, but accompanied by Alexandr Wang stepping into a newly created Chief AI Officer role at Meta. The deal was widely interpreted as Mark Zuckerberg's response to Llama 3's tepid reception relative to GPT-5 and Claude 4 — a recognition that Meta needed both elite AI leadership and elite data infrastructure to stay in the frontier race.
Scale's primary asset is its data engine: a global workforce of expert annotators and a sophisticated platform for producing the high-quality post-training data that frontier models need. Pretraining data is increasingly commoditised — everyone scrapes the web, everyone licenses books and papers — but post-training data (instruction-following examples, preference rankings, expert reasoning traces) is where models actually get their personality and capability gains. Scale is the dominant supplier of this data to the entire frontier lab ecosystem.
Llama 4 is the first Meta model trained with full access to Scale's data engine specifically targeted at Meta's needs. The improvement in instruction following, tool use, and reasoning quality between Llama 3 and Llama 4 is unusually large — and most of that gap closure happens in the post-training phase. That's the Scale signature.
Wang as CAIO and the Org Restructure
Wang's arrival also restructured Meta's AI org. The previously fragmented Llama, GenAI, FAIR, and infrastructure teams were consolidated into a single org reporting up through Wang, with clearer ownership and faster decision velocity. People inside Meta describe the post-restructure cadence as dramatically faster than the Llama 3 era. Llama 4 shipping less than a year after the Scale deal closed is itself evidence of that velocity change.
Meta Silicon: MTIA v3
The other half of the story is hardware. NVIDIA H100s and H200s remain the workhorse for the bulk of Llama 4 training, but Meta has been quietly building its own custom AI accelerator — MTIA — for years. MTIA v3, the generation deployed during Llama 4 training, handled a meaningful share of training compute, particularly for inference-heavy phases of post-training (RLHF, preference modelling) where MTIA's architecture is well-suited.
Why does Meta silicon matter for Llama 4 users? Two reasons:
- Cost structure. By internalising part of its training stack, Meta reduces its dependence on NVIDIA's pricing and supply constraints. That's part of why Meta can keep releasing open-weight models — the marginal cost calculus is different when you own the silicon.
- Inference deployment. Meta's products (Instagram, WhatsApp, Facebook, the Meta AI assistant, Ray-Ban smart glasses) serve Llama-family models at billions-of-requests-per-day scale. MTIA-based inference makes that economically viable. The lessons learned at that scale flow back into the model architecture choices — Llama 4's MoE design, with its 17B active parameters, is shaped by what's efficient to serve at Meta's scale.
The strategic picture is consistent: Meta is building a vertically integrated AI stack — data via Scale, compute via MTIA, models via the Llama team — that lets it ship frontier-class open weights while still owning a defensible position. Llama 4 is the first model where you can clearly see all three pieces working together.
How to Run Llama 4 Locally: VRAM, Quantisation, and Real Hardware
The whole point of an open-weight model is that you can run it yourself. Here's the honest hardware reality for Llama 4 in April 2026, based on our actual testing across consumer and datacentre setups.
Llama 4 Scout Hardware Requirements
Scout has 17B active parameters out of 109B total. The "total" number determines memory; the "active" number determines speed. Memory requirements at common precisions:
| Precision | VRAM Needed (Weights) | VRAM Needed (with KV cache, 100K ctx) | Practical Hardware |
|---|---|---|---|
| FP16 (full precision) | ~218 GB | ~260 GB | 4x H100 80GB / A100 80GB |
| FP8 | ~109 GB | ~140 GB | 2x H100 80GB |
| INT4 (Q4) | ~55 GB | ~80 GB | 1x H100 80GB or 2x RTX 6000 Ada |
| Q3/Q2 (aggressive) | ~30–40 GB | ~55 GB | 1x RTX 5090 / RTX 4090 (with offloading) |
The realistic answer for individual developers: 4-bit quantised Scout on a single H100 80GB. That gives you full Scout quality with comfortable headroom for context, and quality degradation from Q4 quantisation is minimal — under 2% on most benchmarks compared to FP16. If you don't have an H100, you can run aggressive 3-bit quantisations on a 4090 or 5090 with some quality loss and CPU offloading for the experts that don't fit in VRAM.
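The weights column in the table above is just the total parameter count times bytes per parameter; a quick check:

```python
# Reproduce the weights column above: total parameters x bytes per
# parameter. KV cache and activations are extra (the table's second
# column). Parameter counts are as stated in the article.

BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}

def weight_gb(total_params: float, precision: str) -> float:
    return total_params * BYTES_PER_PARAM[precision] / 1e9

SCOUT = 109e9     # Scout's total parameters
MAVERICK = 400e9  # Maverick's total parameters

print(weight_gb(SCOUT, "fp16"))     # 218.0 GB, matching the FP16 row
print(weight_gb(SCOUT, "int4"))     # 54.5 GB, the ~55 in the INT4 row
print(weight_gb(MAVERICK, "fp16"))  # 800.0 GB
```

The same function sizes Maverick's rows in the next section; only the parameter count changes.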
Llama 4 Maverick Hardware Requirements
Maverick is a different beast. 400B total parameters means you're firmly in datacentre territory:
| Precision | VRAM Needed (Weights) | VRAM Needed (with KV cache, 100K ctx) | Practical Hardware |
|---|---|---|---|
| FP16 | ~800 GB | ~880 GB | 8x H200 141GB, or two 8xH100 80GB nodes |
| FP8 | ~400 GB | ~480 GB | 4x H100 80GB or 8x A100 80GB |
| INT4 | ~200 GB | ~260 GB | 4x H100 80GB or 4x A100 80GB |
| Q2 (extreme) | ~110 GB | ~160 GB | 1x H100 80GB + offloading (slow) |
Realistic answer for serious users: 4-bit Maverick on a 4xH100 node. That's roughly $30–40/hour on a major cloud GPU provider, or a serious capex commitment if you're buying. For comparison, that hardware will serve Maverick at speeds competitive with closed frontier APIs and pay for itself in API savings within months for any team running heavy reasoning workloads.
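Whether self-hosting actually "pays for itself" depends on your volume. A toy breakeven model, where only the $30–40/hour GPU figure comes from above; the API price and sustained throughput are assumptions we picked for illustration:

```python
# Toy self-hosting breakeven model. GPU_COST_PER_HOUR is the midpoint of
# the $30-40/hr figure quoted above; the API price and throughput numbers
# are ASSUMPTIONS for illustration. Substitute your real rates.

GPU_COST_PER_HOUR = 35.0
API_COST_PER_MILLION_TOKENS = 15.0       # hypothetical blended rate
TOKENS_PER_HOUR_SELF_HOSTED = 3_000_000  # assumed sustained throughput

def monthly_costs(tokens_per_month: float) -> dict:
    hours_needed = tokens_per_month / TOKENS_PER_HOUR_SELF_HOSTED
    return {
        "api": tokens_per_month / 1e6 * API_COST_PER_MILLION_TOKENS,
        "self_host": hours_needed * GPU_COST_PER_HOUR,
    }

costs = monthly_costs(2_000_000_000)  # a team pushing 2B tokens/month
print(costs)
```

Under these assumed numbers the crossover favours self-hosting at that volume; at low volumes the API wins, which is why renting GPUs for Scout is usually the first experiment before any capex.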
Inference Frameworks That Support Llama 4
As of launch week, the following stacks have working Llama 4 support:
- vLLM — fastest inference, best for production deployments. Full Scout and Maverick support including the new iRoPE attention.
- llama.cpp — best for consumer hardware. Excellent quantisation support (GGUF format). Scout runs well; Maverick is supported but slow on consumer GPUs.
- Ollama — wraps llama.cpp with the easiest UX. `ollama pull llama4:scout` and you're running locally in one command. Great for individual developers.
- Hugging Face Transformers — official reference implementation. Slower than vLLM but easiest to fine-tune from.
- SGLang — strong for agentic workloads with structured outputs and tool calling.
- ExLlamaV3 — best 4-bit performance on RTX consumer cards.
For most readers, the right starting point is Ollama if you're experimenting on your laptop, or vLLM if you're deploying to a real server. Both shipped Llama 4 support within 48 hours of Meta's release, which is itself a sign of how mature the open-weight ecosystem has become.
Llama 4 vs DeepSeek V4 vs Qwen 3.5 vs GPT-5.4 — The Honest Comparison
Llama 4 doesn't exist in a vacuum. The frontier of open-weight models in 2026 is genuinely crowded, and the closed frontier (GPT-5.4, Claude Opus 4.5, Gemini 3 Pro) keeps moving the goalposts. Here's how Llama 4 actually stacks up against its real competition based on benchmarks released in the days since launch and our own testing on representative tasks.
Benchmark Snapshot
| Model | MMLU-Pro | GPQA | HumanEval+ | MATH-500 | Long-context recall (10M) | Open weights? |
|---|---|---|---|---|---|---|
| Llama 4 Maverick | ~84% | ~62% | ~89% | ~94% | ~97% | Yes |
| Llama 4 Scout | ~78% | ~54% | ~84% | ~88% | ~95% | Yes |
| DeepSeek V4 | ~85% | ~64% | ~91% | ~96% | N/A (1M ctx) | Yes |
| Qwen 3.5-Max | ~83% | ~60% | ~88% | ~93% | N/A (1M ctx) | Partial |
| GPT-5.4 | ~88% | ~70% | ~93% | ~98% | N/A (2M ctx) | No |
| Claude Opus 4.5 | ~87% | ~68% | ~92% | ~95% | N/A (1M ctx) | No |
Numbers above are aggregated from Meta's release report, public third-party reproductions in the days since launch, and our own runs. They will shift as more independent evaluations land. Treat them as directional, not gospel.
Llama 4 vs DeepSeek V4
DeepSeek V4 is currently the strongest open-weight model on pure benchmark scores. It edges Maverick on MMLU-Pro, GPQA, HumanEval, and MATH-500 by roughly 1–2 points across the board. In raw reasoning quality, DeepSeek V4 is still narrowly ahead — particularly on math and competitive programming, which is DeepSeek's traditional strength.
But Llama 4 wins on context length (10M vs 1M), multimodal capability (DeepSeek V4 is text-only), licensing clarity (Meta's licence is more straightforward than DeepSeek's for commercial use), and the geopolitical question. For US, EU, and most enterprise users, the China-jurisdiction concerns we covered in our DeepSeek alternatives guide make Llama 4 the safer default even if you give up a couple of benchmark points. Verdict: DeepSeek V4 wins on pure reasoning, Llama 4 wins on practical deployability for most users.
Llama 4 vs Qwen 3.5
Qwen 3.5-Max is Alibaba's flagship and the other strong open-weight contender. It's competitive with Llama 4 Maverick on most benchmarks, slightly behind on multimodal, and substantially behind on context length. Qwen has the same China-jurisdiction caveat as DeepSeek if you use the cloud API. Self-hosted, the choice between Qwen 3.5 and Llama 4 mostly comes down to ecosystem and tooling — and Llama's ecosystem is bigger by a wide margin. Verdict: Llama 4 wins on context, multimodal, and ecosystem; Qwen 3.5 wins on multilingual depth, particularly Asian languages.
Llama 4 vs GPT-5.4
GPT-5.4 is still the closed frontier leader. It's ahead of Maverick on every benchmark by 3–6 points and noticeably stronger on the hardest reasoning tasks — frontier mathematics, novel scientific reasoning, complex agentic planning. If you need the absolute highest capability and money is no object, GPT-5.4 still wins.
But Maverick closes most of the practical gap. For 90% of real tasks — coding, document analysis, multimodal reasoning, structured extraction, customer-facing chat — Maverick is good enough that the difference is invisible to end users. And Maverick is yours: no API bill, no data leaving your infrastructure, no rate limits, no model deprecation schedule. Verdict: GPT-5.4 wins on absolute capability ceiling; Maverick wins on cost-per-token by orders of magnitude and on data sovereignty entirely.
Llama 4 vs Claude Opus 4.5
Claude Opus 4.5 is the other closed frontier leader, particularly strong on coding and nuanced writing tasks. Maverick is competitive on coding benchmarks but Claude's extended-thinking mode still produces the cleanest code on complex multi-file refactors. Maverick is closer than any open model has been before, but Claude Opus 4.5 retains a real edge for serious software engineering work. Verdict: Claude Opus 4.5 wins on coding refinement; Maverick wins on context length, cost, and self-hosting.
Fine-Tuning, Licensing, and the Llama 4 Ecosystem
Open weights only matter if you can actually use them — fine-tune them, redistribute them, embed them in products, and build on top of them legally. Here's where Llama 4 stands on each.
The Llama 4 Community License
Meta released Llama 4 under an updated version of the Llama Community License. The headline terms:
- Commercial use is permitted for almost everyone. You can build products on Llama 4, sell access to it, and embed it in customer-facing tools.
- The 700M monthly active users threshold from earlier Llama licences remains. If your product has more than 700 million MAUs at the time of the Llama 4 release, you need a separate commercial agreement with Meta. This is targeted at hyperscalers and large competing AI labs — for everyone else, it's irrelevant.
- Attribution required. If you build on Llama 4, you must include "Built with Llama" in your marketing materials and product. Derivative model names must include "Llama" at the start.
- Acceptable Use Policy applies — standard restrictions on illegal content, weapons of mass destruction, large-scale disinformation, etc.
- No use to train competing LLMs — you can't use Llama 4 outputs to train a foundation model that competes with Llama. You can use Llama 4 outputs to train task-specific models.
Compared to the Apache 2.0 / MIT licences used by some other open-weight models (Qwen, DeepSeek), Llama's licence is meaningfully more restrictive. Compared to closed APIs, it's a massive expansion of what you can do. For the vast majority of commercial users — startups, SaaS products, internal enterprise tools, agencies — the Llama 4 licence is unambiguously workable.
Fine-Tuning Llama 4
Both Scout and Maverick can be fine-tuned. The realistic options:
- LoRA / QLoRA on Scout: Achievable on a single 8xA100 80GB node for most use cases. Hugging Face's PEFT library supports Llama 4 from launch. This is the path most teams should take — you get a domain-specialised model for the cost of a few hundred dollars of compute.
- Full fine-tuning on Scout: Achievable on a 16xH100 cluster. Worth it for serious domain adaptation where LoRA is leaving capability on the table.
- LoRA on Maverick: Possible but expensive. Realistically a 32xH100 setup. For most organisations, prompting Maverick with your domain data in its 10M context is a better trade-off than fine-tuning it.
- Full Maverick fine-tuning: Frontier-lab territory. If you're asking, you already know the answer.
One important note: fine-tuning Llama 4 requires care to preserve the long-context capability and the multimodal alignment. Naive fine-tuning recipes that worked for Llama 3 will degrade these capabilities. Meta has published recommended hyperparameter ranges and data mix guidelines that we strongly suggest following.
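As a concrete starting point, here is the shape of a conservative QLoRA configuration for Scout. Every value below is a generic community default, not Meta's published recommendation; consult the official guidelines mentioned above before a real run:

```python
# Illustrative QLoRA starting point for Scout. These are generic community
# defaults, NOT Meta's published hyperparameter recommendations.

qlora_config = {
    "r": 16,                       # LoRA rank
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Targeting only the attention projections leaves the experts' FFN
    # weights frozen, one common way to reduce the risk of degrading
    # MoE routing and long-context behaviour during tuning.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "load_in_4bit": True,          # QLoRA: 4-bit base, fp16 adapters
    "learning_rate": 1e-4,
    "max_seq_length": 8192,        # tune short; validate long-context after
}

print(len(qlora_config["target_modules"]))  # 4 attention projections
```

Whatever values you pick, re-run a long-context recall check after tuning; it is the capability most easily destroyed by a careless data mix.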
The Ecosystem Around Llama 4
Within 96 hours of release, the Llama 4 ecosystem already includes:
- Day-zero support in vLLM, Ollama, llama.cpp, SGLang, Transformers, and ExLlamaV3
- GGUF quantisations from Q2 to Q8 published by community contributors on Hugging Face
- Integration in LangChain, LlamaIndex, Haystack, and most major agent frameworks
- Hosted inference endpoints from Together AI, Fireworks, Groq, Replicate, Cerebras, and Amazon Bedrock
- Fine-tuning support in Axolotl, Unsloth, and the Hugging Face training stack
- Several hundred community fine-tunes already on Hugging Face Hub, focused on domains from medical reasoning to roleplay
This is the depth of ecosystem that no closed model can match. If you want to swap inference providers, run on your own hardware, fine-tune for your domain, or deploy in air-gapped environments, the open-weight ecosystem makes it trivial. Llama 4's release accelerates an already-rich landscape — see our deeper analysis in the open-source AI tools guide for how this fits into the broader picture of free and open AI alternatives in 2026.
Limitations, Caveats, and Our Verdict on Llama 4
Llama 4 is a major release, but it isn't magic. Here's what it doesn't do well, what we'd want fixed in Llama 4.1, and where it sits in the overall AI landscape.
Real Limitations
1. The 10M context window is slow. Inference latency at full context is measured in minutes, not seconds, and KV-cache memory is enormous. Treat the 10M ceiling as a tool for specific deep-analysis jobs, not a default. For routine queries, keep context under 100K for reasonable speeds.
2. Maverick still trails GPT-5.4 and Claude Opus 4.5 on the hardest reasoning. The gap is real but small — measurable on benchmarks, occasionally visible in practice on novel mathematical proofs or complex multi-step planning tasks. For frontier research-level reasoning, the closed leaders are still ahead.
3. Multimodal is input-only. No image generation, no native audio, no video understanding without preprocessing. If you need a unified any-to-any multimodal model, Llama 4 isn't it. Pair it with other specialised models for those modalities.
4. The 17B active / 400B total MoE design is unusual. Some inference frameworks struggle with MoE routing efficiency, particularly on consumer hardware. Expect performance to improve over the next few months as the ecosystem optimises for the new architecture.
5. The licence isn't truly open source. The 700M MAU clause and the "no training competing LLMs" restriction mean Llama 4 fails the strict OSI definition of open source. For most users this is irrelevant. For purists and for the largest tech companies, it matters.
6. Tool calling and structured output are good, not great. Maverick handles function calling reliably but isn't quite at GPT-5.4's level for complex multi-tool agentic workflows. Expect this to improve with fine-tunes from the community in the coming weeks.
7. Safety guardrails are present but lighter than closed frontier models. This is intentional — Meta argues that base model safety should be a deployment-level concern, not baked into weights. If you're shipping Llama 4 to end users, you're responsible for adding your own safety layer.
Who Should Use Llama 4
Use Llama 4 Scout if: You want frontier-class capability you can run on a single H100 (or aggressive quantisations on a 4090), you care about data sovereignty, you're building a product where API costs would eat your margin, or you're fine-tuning for a specific domain. Scout is the default open-weight choice in April 2026 for individual developers and small teams.
Use Llama 4 Maverick if: You have datacentre GPUs (or are willing to rent them), you're replacing significant volumes of GPT-5.4 / Claude Opus API calls, you need the absolute best open-weight reasoning quality, or you're building agentic products where capability compounds across many tool calls.
Stick with closed frontier models if: You need the absolute highest reasoning capability available and cost is no object, you're building products where the small capability gap to GPT-5.4 / Claude 4.5 actually matters, or you don't have the engineering capacity to self-host inference infrastructure.
The Verdict
Llama 4 is the most consequential open-weight release since the original Llama paper in February 2023. It's the first time an open model is genuinely in the same conversation as closed frontier leaders across every dimension that matters — reasoning, multimodal, context length, ecosystem, and licensing. Maverick is not quite GPT-5.4, but Maverick is yours, and you can run it on your own hardware with no per-token bill, no data leaving your perimeter, and no model deprecation schedule hanging over your head.
The 10M context window deserves the headlines it's getting. It's not just bigger — it's a category change. Whole-codebase reasoning, multi-document legal review, lifetime conversation memory, and agentic tasks that previously required elaborate retrieval pipelines suddenly fit in a single prompt. That capability is now available for free, on your own hardware, under a commercial-friendly licence.
The Scale AI deal and the Meta silicon programme suggest that this is the start of a sustained run, not a one-off. Meta now has a vertically integrated stack — data, compute, models — purpose-built for shipping frontier open weights at high cadence. Llama 4.1 won't be a year away. The open-weight gap to closed frontier models has been closing for two years; with Llama 4, it has effectively closed for most practical purposes.
If you've been holding off on the open-weight ecosystem because you needed "one more capability tier" before it was viable for your use case — Llama 4 is probably that tier. Download Scout, run it locally tonight, and feel what frontier-class AI on your own hardware actually looks like. It's a genuinely different world than the one we were in last week.