Gemma 4 review

Google DeepMind's fourth-generation open-weight model family — five sizes from 2B to 31B, Apache 2.0 licensed, with the 12B Unified variant accepting text, image, audio, and video in a single encoder-free architecture.

Maker: Google DeepMind
Launched: Apr 2, 2026 (family); Jun 3, 2026 (12B Unified)
Pricing: open-source

Our verdict

Gemma 4 is the most significant open-weight release from Google to date, and arguably from anyone in Q2 2026. The Apache 2.0 license alone changes what you can build commercially. The 12B Unified's encoder-free any-to-any architecture is the technically interesting bet: skipping the separate encoder towers means the model can process mixed image, audio, and video inputs in a single forward pass without the overhead of coordinating separate encoders. Whether that translates to better practical multimodal reasoning is still being benchmarked by the community, but the structural approach is the right one. The 31B Dense supplies the numbers worth quoting: #3 on LMArena among open-weight models at launch, 89.2% on AIME 2026, and 84.3% on GPQA Diamond, results competitive with models several times its size. If you're shipping an AI product and not running on open weights yet, Gemma 4 is the reason to start now.

First look — our read from the docs and sources below; not yet hands-on tested.

Google has a pattern with open-weight releases: announce capabilities that sound competitive, ship weights under a restrictive license, add friction at every layer. Gemma 4 breaks that pattern. Apache 2.0 across all five model sizes means the commercial deployment question is actually closed — you don't need a usage agreement, a special approval, or a revenue ceiling to use it in a product.

That licensing shift is the first thing worth noting. The second is the architecture.

The encoder-free 12B Unified — what "any-to-any" actually means

Most multimodal models run modality-specific encoders in parallel before feeding the language backbone. A vision encoder processes your image. An audio encoder processes your audio clip. The LLM sees pre-processed feature representations from each. Gemma 4's 12B Unified skips those encoder towers entirely.

Instead, raw image patches and raw audio waveforms are projected directly into the LLM's embedding space via lightweight linear layers, then processed by the same transformer that handles text. The result: you can mix text, images, audio clips, and video frames in a single prompt without routing through separate pipelines.
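
To make the idea concrete, here is a toy sketch of projecting raw patches and raw audio frames into one shared embedding sequence. It is not Gemma 4's implementation: the sizes (`EMBED_DIM`, `PATCH`, `FRAME`), the random weights, and every function name are illustrative assumptions, and a real model learns the projection weights during training.

```python
import random

EMBED_DIM = 8   # toy embedding width (assumption; real models are far wider)
PATCH = 4       # toy patch edge length in pixels (assumption)
FRAME = 16      # toy audio frame length in samples (assumption)


def patchify(image, patch):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height - patch + 1, patch):
        for left in range(0, width - patch + 1, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches


def frame_audio(waveform, frame):
    """Split a raw waveform into non-overlapping fixed-length frames."""
    return [waveform[i:i + frame]
            for i in range(0, len(waveform) - frame + 1, frame)]


def make_linear(in_dim, out_dim, seed):
    """Random weights standing in for a learned linear projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(vec, weights):
    """Apply one linear layer: out[j] = sum_i weights[j][i] * vec[i]."""
    return [sum(w * x for w, x in zip(row, vec)) for row in weights]


# Toy inputs: an 8x8 image, a 64-sample waveform, and 3 placeholder text embeddings.
image = [[(r * 8 + c) / 63.0 for c in range(8)] for r in range(8)]
waveform = [((i % 16) - 8) / 8.0 for i in range(64)]
text_embeddings = [[0.0] * EMBED_DIM for _ in range(3)]

w_image = make_linear(PATCH * PATCH, EMBED_DIM, seed=1)
w_audio = make_linear(FRAME, EMBED_DIM, seed=2)

image_embeddings = [project(p, w_image) for p in patchify(image, PATCH)]
audio_embeddings = [project(f, w_audio) for f in frame_audio(waveform, FRAME)]

# One mixed sequence: the same transformer would attend over all positions.
sequence = text_embeddings + image_embeddings + audio_embeddings
print(len(sequence), "positions of width", len(sequence[0]))
```

The point of the sketch is the last step: once every modality is a vector of the same width, there is only one sequence and one model to run.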

The architectural bet here is that the language backbone can learn richer cross-modal representations than you get from specialized encoders patched together. Whether that pays off in practice is still being measured by the community: the MMMU Pro vision score (69.1% on the 12B) is solid but trails closed multimodal models like GPT-4o or Gemini 2.5 Flash. The audio results are at an even earlier stage. But the structural approach is sound, and Google has the training compute to refine it over iterations.

Five sizes for every hardware tier

The Gemma 4 family spans a larger practical range than any prior generation:

  • E2B (2.3B active params) and E4B (4.5B active params): Edge-optimized, 128K context, designed for mobile and on-device inference. Both include audio support.
  • 12B Unified: The any-to-any centerpiece. 256K context, encoder-free, full modality support including video. It had about 1,210 HuggingFace likes at launch, a sign of early developer interest.
  • 26B MoE: Mixture-of-Experts, with 3.8B active parameters per token drawn from 25.2B total. The efficiency pick for production: per-token compute is close to that of a ~4B model, while capability sits closer to the top of the range. All 25.2B parameters still have to be held in memory. Trade-off: no audio support.
  • 31B Dense: The benchmark flagship. MMLU Pro 85.2%, AIME 2026 89.2%, GPQA Diamond 84.3%, LMArena text score 1452 — #3 among all open-weight models at launch. 256K context.
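
One way to read the lineup is as a filter over requirements. The sketch below encodes only the context-window and audio/video facts stated in this review (audio on E2B, E4B, and 12B Unified; video on 12B Unified only); the `Variant` class and `candidates` helper are hypothetical names invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Variant:
    name: str
    context_k: int   # context window in thousands of tokens
    audio: bool
    video: bool


# Values copied from this review's size list and capability table.
FAMILY = [
    Variant("E2B", 128, audio=True, video=False),
    Variant("E4B", 128, audio=True, video=False),
    Variant("12B Unified", 256, audio=True, video=True),
    Variant("26B MoE", 256, audio=False, video=False),
    Variant("31B Dense", 256, audio=False, video=False),
]


def candidates(min_context_k=0, need_audio=False, need_video=False):
    """Return the names of variants that satisfy every stated requirement."""
    return [
        v.name for v in FAMILY
        if v.context_k >= min_context_k
        and (v.audio or not need_audio)
        and (v.video or not need_video)
    ]


print(candidates(need_audio=True))
print(candidates(min_context_k=256))
```

If you need audio and a 256K context together, the filter leaves only the 12B Unified, which matches how the family is positioned.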

What the benchmarks actually say

The 31B Dense instruction-tuned model is the one Google leads with, and the numbers are legitimate: 89.2% on AIME 2026 without tools is a math reasoning result that competes with models from labs that don't publish weights. The GPQA Diamond score (84.3%) measures graduate-level science reasoning, which is not a soft benchmark.

The 12B Unified is where the picture is more nuanced. MMMU Pro at 69.1% on visual tasks and MMLU Pro at 77.2% are strong for the size, but vision specialists like Qwen2.5-VL compete in that range with dedicated encoder architectures. The encoder-free approach trades some visual ceiling for multimodal flexibility and architectural simplicity.

For most builders, the comparison to run is: 12B Unified vs your current vision model on your actual data. The benchmark gap may not reflect your specific task distribution.
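
A minimal sketch of that comparison, assuming each model can be wrapped as a function from an input to an answer string. The two stub functions stand in for real model calls, and the example labels are made up purely so the script runs; swap in your own data and inference wrappers.

```python
from typing import Callable, Dict, List, Tuple

Example = Tuple[str, str]  # (input identifier or prompt, expected answer)


def accuracy(model: Callable[[str], str], examples: List[Example]) -> float:
    """Fraction of examples answered correctly, ignoring case and whitespace."""
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(
        1 for prompt, expected in examples
        if model(prompt).strip().lower() == expected.strip().lower()
    )
    return correct / len(examples)


def compare(models: Dict[str, Callable[[str], str]],
            examples: List[Example]) -> Dict[str, float]:
    """Score every model on the same examples."""
    return {name: accuracy(fn, examples) for name, fn in models.items()}


# Placeholder data and stub models (hypothetical, for illustration only).
examples = [("img1", "cat"), ("img2", "dog"), ("img3", "bird"), ("img4", "cat")]


def stub_current_model(prompt: str) -> str:
    return "cat"  # always answers "cat"


def stub_candidate_model(prompt: str) -> str:
    answers = {"img1": "cat", "img2": "dog", "img3": "fish", "img4": "Cat "}
    return answers[prompt]


scores = compare(
    {"current": stub_current_model, "candidate": stub_candidate_model},
    examples,
)
print(scores)
```

Exact-match scoring is the simplest possible metric; for free-form answers you would likely need a task-specific scorer instead.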

Access and local inference

No gating. Pull weights directly from HuggingFace (google/gemma-4-12B-it, google/gemma-4-31B-it) under Apache 2.0. Ollama, LM Studio, vLLM, SGLang, and llama.cpp all support the family. Quantized variants are available for running on consumer hardware — the 26B MoE especially is practical on a single 24GB GPU when quantized.
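
The 24GB claim is easy to sanity-check with back-of-envelope arithmetic: weight storage is roughly total parameters times bits per weight divided by 8. The sketch below uses the 25.2B total from this review; the 31.0B figure for the Dense model and the 2 GB `overhead_gb` allowance for KV cache and activations are assumptions, and real usage varies with context length and runtime.

```python
def weight_memory_gb(total_params_billion: float, bits_per_weight: int) -> float:
    """Approximate weight storage in GB (10^9 bytes). Excludes KV cache and activations."""
    return total_params_billion * bits_per_weight / 8


def fits(total_params_billion: float, bits_per_weight: int,
         gpu_gb: float, overhead_gb: float = 2.0) -> bool:
    """Rough check: weights plus an assumed fixed overhead must fit in GPU memory."""
    return weight_memory_gb(total_params_billion, bits_per_weight) + overhead_gb <= gpu_gb


# 25.2B is the MoE total from the review; 31.0B for the Dense model is assumed.
for name, params in (("26B MoE", 25.2), ("31B Dense", 31.0)):
    for bits in (16, 8, 4):
        gb = weight_memory_gb(params, bits)
        print(f"{name} at {bits}-bit: ~{gb:.1f} GB of weights, "
              f"fits in 24 GB: {fits(params, bits, 24)}")
```

Note that the MoE's 3.8B active parameters reduce compute per token, not memory: all experts have to be resident, which is why the total parameter count drives this estimate.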

Google AI Studio offers free hosted access for experimentation without needing to manage weights yourself.

The reasoning and tool-use layer

Both thinking mode (configurable chain-of-thought) and native function calling are trained into the instruction-tuned variants — not retrofitted. The system prompt templates in the model cards are the starting point. For agentic workflows, this matters: you're not prompt-engineering around a model that was trained for conversation; you're using one that was explicitly trained for tool-use patterns.
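
On the application side, function calling still needs a dispatcher that maps the model's requested call to real code. The sketch below assumes the tool call arrives as JSON with `name` and `arguments` fields; that wire format is an assumption for illustration, and the actual format is whatever the model card's template specifies. `get_weather` is a hypothetical stub.

```python
import json
from typing import Callable, Dict


def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would call a weather service."""
    return f"Sunny in {city}"


# Registry of tools the application is willing to execute.
TOOLS: Dict[str, Callable[..., str]] = {"get_weather": get_weather}


def dispatch(tool_call_json: str) -> str:
    """Parse an assumed {"name": ..., "arguments": {...}} call and run the tool."""
    call = json.loads(tool_call_json)
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))


result = dispatch('{"name": "get_weather", "arguments": {"city": "Oslo"}}')
print(result)
```

Keeping an explicit registry means the model can only trigger functions you have opted in, which matters once tool calls have side effects.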

Bottom line

Gemma 4 is the open-weight family to benchmark against in 2026. Apache 2.0 removes the commercial deployment question entirely. The 31B Dense is competitive with frontier closed models on reasoning benchmarks. The 12B Unified's encoder-free any-to-any architecture is the technically ambitious bet — early on benchmarks, but the right structural bet for where multimodal inference is going. If you're still defaulting to closed APIs because open-weight quality wasn't there, Gemma 4 is the model family that closes that gap.

Specs & key facts

Family sizes: E2B (2.3B), E4B (4.5B), 12B Unified, 26B MoE (3.8B active), 31B Dense
Context window: 128K tokens (E2B/E4B) · 256K tokens (12B, 26B MoE, 31B)
Input modalities (12B Unified): Text + image + audio + video (encoder-free, any-to-any)
Architecture (12B): Dense, encoder-free; raw image patches and audio waveforms projected directly into LLM embedding space
Architecture (26B): Mixture-of-Experts, 3.8B active params, 25.2B total
License: Apache 2.0 (all sizes)
MMLU Pro (31B-it): 85.2%
AIME 2026 (31B-it, no tools): 89.2%
GPQA Diamond (31B-it): 84.3%
LMArena text score (31B-it): 1452, #3 among all open models at launch
Languages: 140+ supported
HuggingFace likes (12B-it): 1,210 at launch (2026-06-29)

Capabilities

Text generation and reasoning: Yes (all sizes)
Image understanding (vision): Yes (all sizes)
Audio understanding: Yes (E2B, E4B, 12B Unified)
Video understanding: Yes (12B Unified)
Any-to-any multimodal input: Yes (12B Unified, encoder-free)
Native function calling / tool use: Yes (trained for agentic workflows)
Thinking / reasoning mode: Yes, configurable chain-of-thought
Open weights (self-host): Yes, Apache 2.0, all five sizes
Image generation: No (text output only; DiffusionGemma is a separate variant)
Per-Layer Embeddings: Yes (new architecture feature)

How to use it

  1. Download any size from HuggingFace (huggingface.co/google/gemma-4-12B-it for the any-to-any 12B, or google/gemma-4-31B-it for the top performer). No gating: Apache 2.0, immediate access.
  2. For local inference, Ollama is the lowest-friction path: `ollama pull gemma4` (check Ollama's library for exact model tags). LM Studio also supports Gemma 4 with a GUI.
  3. For multimodal tasks (image + audio + text together), use the 12B Unified model, the only size with the encoder-free any-to-any architecture. Pass image and audio inputs in a single prompt.
  4. To enable reasoning mode, use the `-it` instruction-tuned variants and follow the system prompt template in the model card. Thinking tokens are configurable.
  5. For hosted inference, check Google AI Studio (aistudio.google.com), which offers free access for experimentation, or third-party hosted providers (Groq, Together AI, Fireworks), which may charge for throughput.
  6. The 26B MoE is the efficiency pick: 3.8B active params at inference time but the capability of a larger model. Its per-token compute is close to a ~4B model's, but all 25.2B parameters must fit in memory, so plan for a single 24GB GPU with quantized weights rather than ~4B-class hardware.
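
Once a model is pulled, a local Ollama server can be queried over its REST API. This is a minimal sketch assuming Ollama is running on its default port and that the tag is `gemma4` as in step 2; check Ollama's library for the exact tag. The helper names `build_payload` and `generate` are my own.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_payload(prompt: str, model: str = "gemma4") -> dict:
    """Request body for a single non-streaming completion."""
    return {"model": model, "prompt": prompt, "stream": False}


def generate(prompt: str, model: str = "gemma4",
             url: str = OLLAMA_URL, timeout: float = 120.0) -> str:
    """Send the prompt to a running Ollama server and return the generated text."""
    data = json.dumps(build_payload(prompt, model)).encode("utf-8")
    request = urllib.request.Request(
        url, data=data, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.loads(response.read().decode("utf-8"))
    return body["response"]


# Building the payload needs no server, so it is safe to inspect offline.
print(json.dumps(build_payload("Explain mixture-of-experts in one sentence.")))
```

With the server running and the model pulled, `print(generate("Explain mixture-of-experts in one sentence."))` would return the completion; without a server the call raises a connection error.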

Pricing

Open weights (Apache 2.0)

Free

All five model sizes are free to download and use commercially. No API key or subscription required — pull weights from HuggingFace, Kaggle, or Ollama.

Hosted inference

Free / varies

HuggingFace Inference API, Google AI Studio, and community Spaces offer free-tier hosted access. Third-party providers (Groq, Together, Fireworks) may charge for throughput.

Apache 2.0 — download and run commercially at zero cost. Hosting costs depend on your infrastructure. Google AI Studio provides free access for experimentation. Verified 2026-06-29.

Pros & cons

Pros

  • Apache 2.0 license — genuinely unrestricted commercial use, self-hosting, fine-tuning. No custom Gemma terms.
  • Five sizes covering every hardware tier from edge (E2B at 2.3B) to production server (31B Dense).
  • 31B Dense benchmarks (#3 LMArena open, AIME 89.2%) compete with frontier closed models.
  • 12B Unified's encoder-free architecture handles text + image + audio + video in one model — no separate encoder pipelines to manage.
  • 256K context window on the three largest sizes — matches or exceeds most closed alternatives.
  • Native tool use and reasoning mode trained in — not bolted on.

Cons

  • 12B Unified any-to-any benchmarks (MMMU Pro 69.1%) lag GPT-4o and Gemini 2.5 Flash on vision tasks at this size — the architectural bet isn't fully cashed in yet.
  • Audio understanding benchmarks (against MERALION-10B) are mixed — the encoder-free audio path is novel but unproven at scale.
  • Encoder-free architecture is less mature in the inference tooling ecosystem — some frameworks may not fully support the 12B's mixed modality input yet.
  • No fine-tuning infrastructure from Google — you need Unsloth, TRL, or Axolotl to PEFT/LoRA fine-tune.
  • DiffusionGemma (the image generation variant) is a separate model — Gemma 4 itself only outputs text.

Sources

  1. Gemma 4 family launched April 2, 2026; 12B Unified (any-to-any) announced June 3, 2026. https://deepmind.google/models/gemma/gemma-4/ (verified 2026-06-29)
  2. Apache 2.0 license: open-weight, unrestricted commercial use. https://huggingface.co/google/gemma-4-12B-it (verified 2026-06-29)
  3. google/gemma-4-12B-it: 1.21k HuggingFace likes, 2.51M downloads/month at launch. https://huggingface.co/google/gemma-4-12B-it (verified 2026-06-29)
  4. 31B Dense instruction-tuned benchmarks: MMLU Pro 85.2%, AIME 2026 89.2%, GPQA Diamond 84.3%. https://ai.google.dev/gemma/docs/core/model_card_4 (verified 2026-06-29)
  5. 12B Unified benchmarks: MMLU Pro 77.2%, GPQA Diamond 78.8%, MMMU Pro (Vision) 69.1%. https://huggingface.co/google/gemma-4-12B-it (verified 2026-06-29)
  6. HuggingFace blog covering the Gemma 4 family architecture and benchmarks. https://huggingface.co/blog/gemma4 (verified 2026-06-29)
