Google DeepMind has released Gemma 4, the fourth generation of its open model family. The release spans four model sizes — Effective 2B (E2B), Effective 4B (E4B), 26B Mixture of Experts (MoE), and 31B Dense — and ships under an Apache 2.0 license. According to the announcement, the 31B model ranks third among open models on the Arena AI text leaderboard and the 26B MoE ranks sixth. The post states that Gemma 4 “outcompetes models 20x its size” at those positions.

Since the first Gemma generation, developers have downloaded Gemma over 400 million times, producing more than 100,000 variants. The post describes Gemma 4 as built from the same research and technology as Gemini 3, positioning it as the open counterpart to the proprietary Gemini family.

Model sizes and what each is optimized for

The four sizes serve distinct deployment contexts. The 26B MoE prioritizes latency: it activates only 3.8 billion of its total parameters during inference, delivering high tokens-per-second throughput while keeping resource use in check. The 31B Dense maximizes raw output quality and provides a strong foundation for fine-tuning. Both larger models fit in unquantized bfloat16 form on a single 80GB NVIDIA H100 GPU; quantized versions run on consumer GPUs for local IDEs, coding assistants, and agentic workflows.
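A quick back-of-the-envelope check makes the single-H100 claim concrete. The sketch below counts bfloat16 weights only (2 bytes per parameter) and ignores activation and KV-cache overhead, which a real deployment must also budget for:

```python
def weight_memory_gb(params_billion: float, bytes_per_param: int = 2) -> float:
    """Approximate weight memory in GB for a model stored in bfloat16 (2 bytes/param)."""
    return params_billion * 1e9 * bytes_per_param / 1e9

# 31B Dense in bf16: 62 GB of weights -- fits on one 80 GB H100 with headroom.
print(weight_memory_gb(31))            # 62.0
# 26B MoE must store all experts (52 GB of weights) ...
print(weight_memory_gb(26))            # 52.0
# ... but only 3.8B parameters are active per token, which is what drives latency.
print(round(weight_memory_gb(3.8), 1)) # 7.6
```

The MoE numbers illustrate the trade-off the post describes: total weights set the memory bill, while active parameters set the per-token compute.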

The E2B and E4B models are engineered for edge and mobile deployment. The post describes them as built “from the ground up for maximum compute and memory efficiency,” activating an effective 2 or 4 billion parameter footprint during inference to preserve RAM and battery life. According to the announcement, these models run completely offline with near-zero latency on hardware including phones, Raspberry Pi, and NVIDIA Jetson Orin Nano. DeepMind worked with the Google Pixel team, Qualcomm Technologies, and MediaTek on these models. Android developers can prototype agentic flows using E2B and E4B in the AICore Developer Preview, which the post says is forward-compatible with Gemini Nano 4.

Capabilities across the family

The post lists a set of capabilities shared across all four sizes. Native function-calling, structured JSON output, and native system instructions are present throughout, enabling autonomous agents to interact with external tools and APIs. All models natively process video and images at variable resolutions, with the announcement highlighting OCR and chart understanding as specific strengths. The E2B and E4B models add native audio input for speech recognition and understanding.
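The announcement doesn't specify Gemma 4's function-calling wire format, so the sketch below uses a generic JSON-schema-style tool declaration and a hand-written model reply to illustrate the round trip an agent framework performs around any such model: declare a tool, parse the model's structured JSON call, and dispatch it locally. The tool name and reply format here are illustrative assumptions, not Gemma's actual schema:

```python
import json

# Hypothetical tool declaration in a generic JSON-schema style; the real wire
# format depends on the serving stack (e.g. a Transformers chat template).
get_weather_tool = {
    "name": "get_weather",
    "description": "Return current weather for a city.",
    "parameters": {
        "type": "object",
        "properties": {"city": {"type": "string"}},
        "required": ["city"],
    },
}

def get_weather(city: str) -> str:
    # Stand-in for a real external API call.
    return f"Sunny in {city}"

# A structured JSON function call, hand-written here in place of model output.
model_reply = '{"tool": "get_weather", "arguments": {"city": "Zurich"}}'

call = json.loads(model_reply)
if call["tool"] == get_weather_tool["name"]:
    result = get_weather(**call["arguments"])
    print(result)  # Sunny in Zurich
```

Structured JSON output is what makes this dispatch step reliable: the agent can `json.loads` the reply and route it without brittle text parsing.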

Context window lengths vary by tier. Edge models support 128K context; larger models support up to 256K, which the post says allows passing “repositories or long documents in a single prompt.” All models were natively trained on more than 140 languages.
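Before passing a whole repository in one prompt, it is worth estimating whether it fits in the 256K window. The post names no tokenizer, so this sketch uses the rough heuristic of ~4 characters per token for English text and code; a real check should count tokens with the model's own tokenizer:

```python
# Rough feasibility check: will these files fit in a 256K-token context?
CONTEXT_TOKENS = 256_000
CHARS_PER_TOKEN = 4  # rough average for English text and code

def estimated_tokens(text: str) -> int:
    return len(text) // CHARS_PER_TOKEN

def fits_in_context(files: dict[str, str], budget: int = CONTEXT_TOKENS) -> bool:
    # Concatenate files with path headers, as one might when prompting over a repo.
    prompt = "\n".join(f"# {path}\n{body}" for path, body in files.items())
    return estimated_tokens(prompt) <= budget

repo = {"main.py": "print('hi')\n" * 200, "README.md": "docs " * 500}
print(fits_in_context(repo))              # True
print(fits_in_context(repo, budget=500))  # False
```

The same check with a 128K budget tells you whether the prompt could instead target the edge-tier models.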

For code generation, the post describes Gemma 4 as producing “high-quality offline code, turning your workstation into a local-first AI code assistant” — a framing aimed squarely at developers evaluating local inference setups.

Apache 2.0 and what it means for deployment

The license change is the announcement’s most commercially significant detail. Gemma 4 ships under Apache 2.0, which the post characterizes as providing “complete developer flexibility and digital sovereignty” — meaning unrestricted use, modification, and distribution without royalty obligations. Previous Gemma models used a custom license with use-case restrictions; Apache 2.0 removes those barriers.

The post frames this as a response to developer feedback: “You gave us feedback, and we listened.” For enterprises, sovereign governments, or research institutions that require full control over models and infrastructure, Apache 2.0 removes a key procurement hurdle.

Fine-tuning results and ecosystem support

The post cites two fine-tuning examples. INSAIT used Gemma to create BgGPT, described as a pioneering Bulgarian-first language model. Yale University worked with DeepMind on Cell2Sentence-Scale, focused on discovering new pathways for cancer therapy. Both are presented as demonstrations of what targeted fine-tuning can produce on top of the base models.

The ecosystem support list is extensive. Day-one integrations include Hugging Face (Transformers, TRL, Transformers.js, Candle), LiteRT-LM, vLLM, llama.cpp, MLX, Ollama, NVIDIA NIM and NeMo, LM Studio, Unsloth, SGLang, Keras, and others. Model weights are available on Hugging Face, Kaggle, and Ollama. Google Cloud deployment paths include Vertex AI, Cloud Run, GKE, and TPU-accelerated serving.

The combination of competitive benchmark rankings, a permissive license, broad hardware support from mobile to H100, and deep ecosystem integrations makes Gemma 4 a materially different offering from its predecessors. The Apache 2.0 decision in particular removes the main friction point for organizations that needed open weights but couldn’t accept usage restrictions.