Google just dropped Gemma 4... (WOAH)
TL;DR
Gemma 4’s big story is efficiency, not sheer scale — Matthew Berman highlights that Google’s new 31B dense and 26B MoE “thinking” models hit Elo scores near massive rivals like Qwen 3.5 while staying small enough to run locally on normal high-end consumer hardware.
Google is pushing a real edge-compute vision — Berman argues open models are getting “smaller, better, faster,” which makes hybrid AI more practical: frontier hosted models for the hardest tasks, and local models for most day-to-day work.
The tiny Gemma variants are built for actual devices, not just demos — the E2B and E4B “effective” models use per-layer embeddings to shrink inference footprint and are designed for offline use on phones, Raspberry Pi, and Nvidia Jetson hardware, with native audio on the small versions.
Gemma 4 is clearly aimed at agents — Google added native function calling, structured JSON output, and system instructions, and Berman points to Steve Vibe’s Tool Call 15 results, where Gemma 4 31B posted a perfect score on the tool-calling benchmark.
The tradeoff is context window size — Berman’s main disappointment is that the edge models top out at 128K context and the larger ones at 256K, which he says feels limited compared with where bigger frontier models are heading.
Google made it easy to adopt commercially — Gemma 4 ships under Apache 2.0 and is already available across Hugging Face, llama.cpp, MLX, Ollama, Nvidia NIMs, LM Studio, and Unsloth, so Berman’s takeaway is simple: download it, fine-tune it, and start building.
The Breakdown
Google drops Gemma 4, and Berman is genuinely fired up
Matthew Berman opens with real enthusiasm, giving Google “huge props” for consistently pushing open-weight models when not every major lab is doing that. His framing is immediate: Gemma 4 matters because it brings advanced reasoning and agentic workflows into models that are actually small enough for people to run.
Why the size-to-performance curve is the whole point
He walks through the Elo chart with a simple lens: you want models as far up and left as possible. The standout is Gemma 4 31B dense and the 26B MoE version with 4B active parameters, which he says perform near monsters like Qwen 3.5’s 397B total / 17B active setup, but at a fraction of the size. His reaction is basically: this is what makes local AI real, because most people can run 31B, while almost nobody has the hardware for the giant alternatives.
“Effective” parameters and the tiny edge models
Berman pauses to explain something he had to look up himself: the E in E2B means “effective.” Google uses per-layer embeddings so the model can have large embedding tables for fast lookups without the full inference cost, which helps the 2B and 4B variants fit on-device. That matters because these are aimed at offline deployment where RAM and battery life actually matter.
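The per-layer-embedding idea Berman describes can be sketched with a toy example. This is an illustrative simplification, not Google’s actual implementation: every shape and scale here is made up. The point is that each layer keeps its own small per-token lookup table whose rows can be fetched on demand, so the full tables never need to sit in accelerator memory alongside the main weights.

```python
import numpy as np

rng = np.random.default_rng(0)

VOCAB, D_MODEL, D_PLE, N_LAYERS = 1000, 64, 8, 4

# Standard input embedding table (lives in fast memory).
tok_embed = rng.standard_normal((VOCAB, D_MODEL))

# Per-layer embedding tables: one small table per layer. In a real
# on-device deployment these could sit in slower storage (CPU RAM or
# flash), with only the rows for the current tokens fetched each step.
ple_tables = [rng.standard_normal((VOCAB, D_PLE)) for _ in range(N_LAYERS)]

# Hypothetical projection from the small PLE vector up to model width.
ple_proj = [rng.standard_normal((D_PLE, D_MODEL)) * 0.01 for _ in range(N_LAYERS)]

def forward(token_ids):
    h = tok_embed[token_ids]                    # (seq, d_model)
    for table, proj in zip(ple_tables, ple_proj):
        rows = table[token_ids]                 # fetch only the rows we need
        h = h + rows @ proj                     # inject per-layer token info
        # ... a real model would run attention + MLP here ...
    return h

out = forward(np.array([1, 2, 3]))
print(out.shape)  # (3, 64)
```

The memory win in this framing: the large lookup tables are read row-by-row per token, so only `seq × D_PLE` values per layer need to move into fast memory, rather than the whole `VOCAB × D_PLE` table.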
Arena rankings show Gemma punching above its weight
Google says the 31B model is the number three open model on the Arena AI text leaderboard, and Berman pulls up the ranking to underline the point. Sitting behind huge models like GLM-5 and Kimi K2.5, Gemma 4 31B looks unusually competitive for something he keeps reminding viewers is “a small model.”
The product pitch detour: Recraft for image generation
Midway through, he breaks for the sponsor, Recraft, and the tone stays pretty personal. He says Recraft V4 stood out not just for realism but for “taste,” control, typography, multilingual text, and polished-looking branding or UI outputs, with a separate Vector model for SVG graphics.
Back to Gemma: reasoning, agents, multimodality, and coding
Once back, he runs through the practical feature list: multi-step reasoning, math, instruction-following, native function calling, structured JSON output, and system instructions. He clearly sees Gemma as an agent model first, joking “Yes, OpenClaw, you know, I will be testing it,” while also noting it supports images, video, OCR, chart understanding, and native audio input on the smallest variants.
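In practice, “native function calling with structured JSON output” means the model emits JSON that your runtime validates before executing anything. A minimal sketch of that receiving side follows; the tool name, its argument schema, and the model reply string are all invented for illustration, not taken from Gemma’s actual output format.

```python
import json

# Hypothetical tool registry: tool name -> required argument names.
TOOLS = {"get_weather": {"city"}}

def parse_tool_call(model_output: str):
    """Validate a model's JSON tool call before dispatching it."""
    call = json.loads(model_output)          # raises on malformed JSON
    name, args = call["name"], call["arguments"]
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    missing = TOOLS[name] - args.keys()
    if missing:
        raise ValueError(f"missing arguments: {missing}")
    return name, args

# Stand-in for what a model with native function calling might emit.
reply = '{"name": "get_weather", "arguments": {"city": "Berlin"}}'
name, args = parse_tool_call(reply)
print(name, args)  # get_weather {'city': 'Berlin'}
```

Benchmarks like Tool Call 15 effectively measure how reliably a model clears this kind of validation gate: well-formed JSON, a known tool name, and complete arguments, every time.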
His one real complaint: the context window
The one place his excitement dips is context length. He says 128K for the edge models is fine, but only 256K for the larger models is disappointing — he wanted Google to go further there.
Built for offline use, easy to download, and already benchmarking well
He closes on deployment and benchmarks. Google worked with Pixel, Qualcomm, and MediaTek, and the small models are meant to run offline with near-zero latency on phones, Raspberry Pi, and Jetson Orin Nano. Gemma 4 ships under Apache 2.0 and is available across Hugging Face, vLLM, llama.cpp, MLX, Ollama, Nvidia NIMs, and more. The numbers back the hype: a 1452 Arena score, 85.2 on multilingual MMLU, 89% on AIME 2025, 80% on LiveCodeBench, and a perfect Tool Call 15 result for Gemma 4 31B. That is why Berman ends with a simple challenge: go download it and build something.
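To make the “download it and build” step concrete, here is the shape of a local chat request to an Ollama-served model. The endpoint and payload fields follow Ollama’s standard `/api/chat` format; the model tag is a placeholder, since the video doesn’t specify the exact Gemma 4 tag in the Ollama library.

```python
import json

# Placeholder tag -- check `ollama list` or the Ollama model library
# for the real Gemma 4 tag once you have pulled it.
MODEL_TAG = "gemma4:31b"

payload = {
    "model": MODEL_TAG,
    "messages": [
        {"role": "system", "content": "You are a concise assistant."},
        {"role": "user", "content": "Summarize per-layer embeddings in one line."},
    ],
    "stream": False,
}

# With a local Ollama server running, you would POST this payload:
#   import requests
#   r = requests.post("http://localhost:11434/api/chat", json=payload)
#   print(r.json()["message"]["content"])
print(json.dumps(payload)[:40])
```

The same `messages` structure carries the system instructions Berman calls out as a new Gemma 4 feature, so prompts written this way port directly between the hosted and local halves of the hybrid setup he describes.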