Google has a pattern with open-weight releases: announce capabilities that sound competitive, ship weights under a restrictive license, add friction at every layer. Gemma 4 breaks that pattern. Apache 2.0 across all five model sizes means the commercial deployment question is actually closed — you don't need a usage agreement, a special approval, or a revenue ceiling to use it in a product.
That licensing shift is the first thing worth noting. The second is the architecture.
The encoder-free 12B Unified — what "any-to-any" actually means
Most multimodal models run modality-specific encoders in parallel before feeding the language backbone. A vision encoder processes your image. An audio encoder processes your audio clip. The LLM sees pre-processed feature representations from each. Gemma 4's 12B Unified skips those encoder towers entirely.
Instead, raw image patches and raw audio waveforms are projected directly into the LLM's embedding space via lightweight linear layers, then processed by the same transformer that handles text. The result: you can mix text, images, audio clips, and video frames in a single prompt without routing through separate pipelines.
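As a rough illustration of the idea (not Gemma 4's actual implementation), the sketch below cuts a toy grayscale image into patches, pushes each patch through one linear layer, and appends the results to a sequence of text embeddings. The patch size, embedding width, random weights, and helper names are all hypothetical, and the image dimensions are assumed to be divisible by the patch size.

```python
import random

PATCH = 2    # hypothetical patch side length, in pixels
D_MODEL = 4  # hypothetical embedding width shared by all modalities

def patchify(image, patch=PATCH):
    """Split an H x W grayscale image (list of rows) into flattened patches."""
    height, width = len(image), len(image[0])
    patches = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            patches.append([image[top + dy][left + dx]
                            for dy in range(patch) for dx in range(patch)])
    return patches

def linear(vec, weight, bias):
    """Compute W x + b for one vector; weight has shape [d_out][d_in]."""
    return [sum(w * x for w, x in zip(row, vec)) + b
            for row, b in zip(weight, bias)]

rng = random.Random(0)

# One lightweight projection stands in for the whole "vision encoder".
W_img = [[rng.uniform(-0.1, 0.1) for _ in range(PATCH * PATCH)]
         for _ in range(D_MODEL)]
b_img = [0.0] * D_MODEL

# Stand-in text token embeddings (a real model looks these up in a table).
text_tokens = [[rng.uniform(-1, 1) for _ in range(D_MODEL)] for _ in range(3)]

# A 4x4 toy image yields four 2x2 patches, so four image "tokens".
image = [[(r * 4 + c) / 15 for c in range(4)] for r in range(4)]
image_tokens = [linear(p, W_img, b_img) for p in patchify(image)]

# Text and image tokens share one sequence for the same transformer.
sequence = text_tokens + image_tokens
print(len(sequence), "tokens of width", len(sequence[0]))
```

Audio would follow the same pattern under this sketch, with fixed-length waveform frames taking the place of image patches.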
The architectural bet here is that the language backbone can learn richer cross-modal representations than you get from specialized encoders patched together. Whether that pays off in practice is still being measured by the community: the MMMU Pro vision benchmark (69.1% on the 12B) is solid but trails closed multimodal models like GPT-4o or Gemini 2.5 Flash. The audio results are at an even earlier stage. But the structural approach is sound, and Google has the training compute to refine it over iterations.
Five sizes for every hardware tier
The Gemma 4 family spans a larger practical range than any prior generation:
- E2B (2.3B active params) and E4B (4.5B active params): Edge-optimized, 128K context, designed for mobile and on-device inference. Both include audio support.
- 12B Unified: The any-to-any centerpiece. 256K context, encoder-free, full modality support including video. It drew roughly 1,200 HuggingFace likes at launch, a sign of how much developer attention the novel architecture attracted.
- 26B MoE: Mixture-of-Experts, with 3.8B parameters active per token out of 25.2B total. The efficiency pick for production: per-token compute comparable to a ~4B model, capability closer to the top of the range. Trade-offs: all 25.2B parameters still have to sit in memory, and there is no audio support.
- 31B Dense: The benchmark flagship. MMLU Pro 85.2%, AIME 2026 89.2%, GPQA Diamond 84.3%, LMArena text score 1452 — #3 among all open-weight models at launch. 256K context.
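The 26B MoE's efficiency claim comes down to simple arithmetic. The sketch below uses the parameter counts quoted above plus a common rule of thumb (about 2 FLOPs per active parameter per generated token, ignoring attention over the context) to compare per-token compute against the 31B Dense. The rule of thumb and the nominal 31e9 figure are assumptions for illustration, not published measurements.

```python
ACTIVE_PARAMS_MOE = 3.8e9   # active parameters per token (from the list above)
TOTAL_PARAMS_MOE = 25.2e9   # total parameters (from the list above)
PARAMS_DENSE = 31e9         # nominal size implied by the "31B" name (assumption)

def flops_per_token(active_params: float) -> float:
    """Rule-of-thumb forward-pass cost: ~2 FLOPs per active parameter."""
    return 2 * active_params

# Share of the MoE's weights that participate in any single token.
active_fraction = ACTIVE_PARAMS_MOE / TOTAL_PARAMS_MOE

# How much more per-token compute the dense flagship needs.
dense_to_moe = flops_per_token(PARAMS_DENSE) / flops_per_token(ACTIVE_PARAMS_MOE)

print(f"MoE active fraction: {active_fraction:.1%}")
print(f"Dense vs MoE compute per token: {dense_to_moe:.1f}x")
```

This only covers compute. Memory is a separate budget, because every expert has to be resident even though only a few run per token.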
What the benchmarks actually say
The 31B Dense instruction-tuned model is the one Google leads with, and the numbers are legitimate: 89.2% on AIME 2026 without tools is a math reasoning result that competes with models from labs that don't publish weights. The GPQA Diamond score (84.3%) measures graduate-level science reasoning, which is not a soft benchmark.
The 12B Unified is where the picture is more nuanced. MMMU Pro at 69.1% on visual tasks and MMLU Pro at 77.2% are strong for the size, but vision specialists like Qwen2.5-VL compete in that range with dedicated encoder architectures. The encoder-free approach trades some visual ceiling for multimodal flexibility and architectural simplicity.
For most builders, the comparison to run is: 12B Unified vs your current vision model on your actual data. The benchmark gap may not reflect your specific task distribution.
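A minimal sketch of such a comparison, assuming each model can be wrapped as a function from an input reference (an image path, say) to a text answer. The stub models, file names, and exact-match scoring are hypothetical placeholders. In practice you would swap in real inference calls and a metric that suits your task.

```python
from typing import Callable, Iterable

Example = tuple[str, str]  # (input reference, expected answer)

def accuracy(model: Callable[[str], str], examples: Iterable[Example]) -> float:
    """Case-insensitive exact-match accuracy of one model on labeled examples."""
    examples = list(examples)
    if not examples:
        raise ValueError("need at least one example")
    correct = sum(model(x).strip().lower() == y.strip().lower()
                  for x, y in examples)
    return correct / len(examples)

def compare(candidate: Callable[[str], str],
            baseline: Callable[[str], str],
            examples: Iterable[Example]) -> dict:
    """Score both models on the same examples and report the difference."""
    examples = list(examples)
    cand = accuracy(candidate, examples)
    base = accuracy(baseline, examples)
    return {"candidate": cand, "baseline": base, "delta": cand - base}

# Stub models standing in for real inference calls (hypothetical answers).
def stub_candidate(path: str) -> str:
    answers = {"a.png": "cat", "b.png": "dog", "c.png": "bird", "d.png": "shark"}
    return answers[path]

def stub_baseline(path: str) -> str:
    answers = {"a.png": "cat", "b.png": "cat", "c.png": "bird", "d.png": "shark"}
    return answers[path]

labeled = [("a.png", "cat"), ("b.png", "dog"), ("c.png", "bird"), ("d.png", "fish")]
result = compare(stub_candidate, stub_baseline, labeled)
print(result)
```

Scoring both models on the identical example list is the point of the design: it keeps the comparison about the models, not about differences in sampling.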
Access and local inference
No gating. Pull weights directly from HuggingFace (google/gemma-4-12B-it, google/gemma-4-31B-it) under Apache 2.0. Ollama, LM Studio, vLLM, SGLang, and llama.cpp all support the family. Quantized variants are available for running on consumer hardware — the 26B MoE especially is practical on a single 24GB GPU when quantized.
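The single-GPU claim can be sanity-checked with a back-of-the-envelope estimate. The sketch below computes weight memory from total parameter count and bits per weight, then checks it against a GPU budget. The 4 GB headroom for KV cache and activations is an assumed placeholder, and real quantized formats carry some overhead beyond the nominal bit width.

```python
def weight_memory_gb(total_params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in decimal gigabytes."""
    return total_params * bits_per_weight / 8 / 1e9

def fits(total_params: float, bits_per_weight: float,
         gpu_gb: float, headroom_gb: float = 4.0) -> bool:
    """True if weights plus an assumed runtime headroom fit in GPU memory."""
    return weight_memory_gb(total_params, bits_per_weight) + headroom_gb <= gpu_gb

# For an MoE, use TOTAL parameters: every expert must be resident in memory.
MOE_TOTAL_PARAMS = 25.2e9

for bits in (16, 8, 4):
    gb = weight_memory_gb(MOE_TOTAL_PARAMS, bits)
    print(f"{bits:>2}-bit: {gb:5.1f} GB weights, "
          f"fits in 24 GB: {fits(MOE_TOTAL_PARAMS, bits, gpu_gb=24)}")
```

Under these assumptions only the 4-bit variant leaves room on a 24GB card, which matches the "when quantized" qualifier above.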
Google AI Studio offers free hosted access for experimentation without needing to manage weights yourself.
The reasoning and tool-use layer
Both thinking mode (configurable chain-of-thought) and native function calling are trained into the instruction-tuned variants — not retrofitted. The system prompt templates in the model cards are the starting point. For agentic workflows, this matters: you're not prompt-engineering around a model that was trained for conversation; you're using one that was explicitly trained for tool-use patterns.
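On the application side, function calling usually reduces to a small dispatch loop: parse the model's structured tool call, run the matching function, and hand back the result. The sketch below assumes a JSON wire format with "name" and "arguments" keys. That format and the `get_weather` tool are hypothetical. The actual call syntax is defined by the templates in the model cards, so the parsing step would need to be adapted.

```python
import json

def get_weather(city: str) -> str:
    """Hypothetical tool; a real one would query a weather service."""
    return f"Sunny in {city}"

# Registry mapping tool names the model may emit to Python callables.
TOOLS = {"get_weather": get_weather}

def dispatch(model_output: str) -> str:
    """Run a tool if the output is a JSON tool call, else pass the text through."""
    try:
        call = json.loads(model_output)
    except json.JSONDecodeError:
        return model_output  # ordinary text answer, no tool requested
    if not isinstance(call, dict) or "name" not in call:
        return model_output  # valid JSON, but not shaped like a tool call
    name = call["name"]
    if name not in TOOLS:
        raise KeyError(f"unknown tool: {name}")
    return TOOLS[name](**call.get("arguments", {}))

print(dispatch('{"name": "get_weather", "arguments": {"city": "Paris"}}'))
print(dispatch("No tool needed for this one."))
```

Rejecting unknown tool names explicitly, instead of ignoring them, makes malformed calls visible early in an agent loop.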
Bottom line
Gemma 4 is the open-weight family to benchmark against in 2026. Apache 2.0 removes the commercial deployment question entirely. The 31B Dense is competitive with frontier closed models on reasoning benchmarks. The 12B Unified's encoder-free any-to-any architecture is the technically ambitious bet: its benchmark results are still early, but it is the right structural bet for where multimodal inference is going. If you're still defaulting to closed APIs because open-weight quality wasn't there, Gemma 4 is the model family that closes that gap.