What is DiffusionGemma?

DiffusionGemma is an experimental open AI model from Google that generates text using diffusion instead of the usual word-by-word method. It writes a whole block of 256 tokens at once and refines it over a few passes, which makes it up to 4x faster on a dedicated GPU. It is released under the Apache 2.0 license, so you can download the weights and run it yourself.

How is DiffusionGemma different from a normal AI model like ChatGPT?

A normal model writes one word at a time, left to right, like a typewriter. DiffusionGemma drafts a full block of text at once and then cleans it up over several passes, like a printing press stamping a whole page. That makes it much faster on a single machine, but Google says the output quality is lower than its standard Gemma 4 model, so it is built for speed-critical local work, not maximum-quality production output.

How fast is DiffusionGemma?

Google reports up to 4x faster text generation on dedicated GPUs: over 1,000 tokens per second on an NVIDIA H100 and over 700 tokens per second on a consumer NVIDIA GeForce RTX 5090. The speed is strongest for a single user on one machine. In high-traffic cloud serving, the advantage shrinks and standard models can be cheaper.

Can I run DiffusionGemma on my own computer?

Yes, if you have a high-end consumer GPU. It is a 26 billion parameter model that only uses 3.8 billion at a time, and when quantized it fits in 18GB of memory, which is within the capacity of top consumer cards like the RTX 5090 or 4090. Google notes that Apple Silicon Macs may not see the same speed boost, because their memory setup works differently from a dedicated GPU.

Should my business use DiffusionGemma in production?

Not for anything where quality matters most. Google is clear that DiffusionGemma trades some output quality for speed and recommends standard Gemma 4 for production work. Treat DiffusionGemma as a tool for fast local prototypes, interactive editing, and experiments, and keep a stronger model for the final output your customers see.

Jun 15, 2026 · 6 min read

DiffusionGemma: the AI model that writes in blocks, not one word at a time

Google's DiffusionGemma writes text in blocks instead of word by word, up to 4x faster on a GPU. What that speed means for your business, and where it doesn't help yet.

Abdulkader Safi, Senior Software Engineer


Most AI models write the way you would type a text message. One word, then the next, then the next, left to right, never looking back. It works, but on your own machine it is slow, because the computer spends most of its time waiting for the next word before it can start the one after.

Google just shipped a model that throws that out. On 10 June 2026 they released DiffusionGemma, an experimental open model that writes a whole block of text at once and then cleans it up over a few passes. The headline number is up to 4x faster text generation on a GPU. The more interesting part is what that speed opens up, and where it quietly falls apart.

Here is my read as someone who builds with these models for clients, plus the honest catch you need to know before you get excited.

What "writes in blocks" actually means

Picture the difference like this. A normal model is a typewriter: one key at a time, in order. DiffusionGemma is a printing press: it stamps the whole page in one go, then goes back and fixes the smudges.

In Google's words, it starts with a "canvas" of 256 random placeholder tokens (a token is roughly three-quarters of a word). Then it makes a few passes. Each pass locks in the words it is confident about and uses them as clues to fix the rest. After a handful of rounds, the random mess has converged into a clean paragraph.
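To make that loop concrete, here is a toy sketch of the lock-in-and-refine idea. It is not Google's implementation: `toy_denoiser` is a hypothetical stand-in that invents tokens and confidence scores at random, the pass count of 8 is my assumption, and the "lock an even share per pass" schedule is just one simple choice. Only the 256-token canvas comes from Google's description.

```python
import random

BLOCK_SIZE = 256      # canvas length quoted by Google
PLACEHOLDER = None    # stands in for a random placeholder token


def toy_denoiser(canvas, rng):
    """Hypothetical stand-in for the model.

    Proposes a (token, confidence) pair for every slot that is still open.
    A real model would use the already-locked tokens as context; this toy
    version just makes the numbers up.
    """
    return {
        i: (f"tok{i}", rng.random())
        for i, tok in enumerate(canvas)
        if tok is PLACEHOLDER
    }


def generate_block(passes=8, seed=0):
    """Fill a whole block over a fixed number of passes (assumed: 8)."""
    rng = random.Random(seed)
    canvas = [PLACEHOLDER] * BLOCK_SIZE
    for step in range(passes):
        proposals = toy_denoiser(canvas, rng)
        if not proposals:
            break
        # Lock in an even share of the open slots, most confident first.
        remaining_passes = passes - step
        quota = -(-len(proposals) // remaining_passes)  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:quota]:
            canvas[i] = proposals[i][0]
    return canvas


block = generate_block()
print(len(block), "tokens, all locked:", all(t is not PLACEHOLDER for t in block))
```

The point to notice is the shape of the loop: every pass looks at the entire canvas, and the number of passes is small and fixed, rather than one pass per word.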

If that sounds familiar, it is the same trick AI image generators use. They start with visual static and refine it into a picture. DiffusionGemma does it with words instead of pixels.

This is not a brand-new idea. Researchers have tried diffusion for text for years. What Google did was make it work at a useful size, by changing how the model uses your hardware rather than how clever it is.

The speed is real, and it is local

The numbers Google published: over 1,000 tokens per second on an NVIDIA H100 (the data-centre card), and over 700 tokens per second on an RTX 5090 (a high-end gaming card you can actually buy). That is roughly four times faster than generating word by word.
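A quick back-of-the-envelope check helps put those figures in human terms. The sketch below takes the published rates as given, uses the 256-token block size and the rough three-quarters-of-a-word rule, and treats "4x" as an exact ratio, which it is not (Google says "up to"). Treat the outputs as rough orders of magnitude, not measurements.

```python
BLOCK_TOKENS = 256        # block size from Google's description
WORDS_PER_TOKEN = 0.75    # rough rule of thumb, not an exact figure

# Published lower bounds, in tokens per second.
reported = {"NVIDIA H100": 1000, "RTX 5090": 700}


def block_seconds(tokens_per_second, block_tokens=BLOCK_TOKENS):
    """Time to produce one full block at a given rate."""
    return block_tokens / tokens_per_second


def implied_baseline(tokens_per_second, speedup=4.0):
    """Word-by-word rate implied if the speedup were exactly `speedup`."""
    return tokens_per_second / speedup


for card, tps in reported.items():
    print(
        f"{card}: {block_seconds(tps):.2f} s per block, "
        f"about {tps * WORDS_PER_TOKEN:.0f} words/s, "
        f"implied baseline {implied_baseline(tps):.0f} tokens/s"
    )
```

On these assumptions a full block lands in roughly a quarter to a third of a second, which is the range where a tool starts to feel instant.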

The reason it is faster on your own machine is the whole point. When AI runs in the cloud, a server batches thousands of users together so the hardware is always busy. Run that same word-by-word model on one machine for one person and the GPU sits mostly idle, waiting. DiffusionGemma hands the GPU a big chunk of work at once, so it actually gets used.
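One way to see the "big chunk of work" point is to count how many times the model has to run one after another. The sketch below is a simplified counting model, not a benchmark: the 8 passes per block is my assumption, and it ignores the fact that each diffusion pass does far more work than a single word-by-word step, which is why the real-world gain is closer to 4x than to the ratio of pass counts.

```python
import math


def sequential_passes_autoregressive(num_tokens):
    """Word-by-word generation: one model run per token, strictly in order."""
    return num_tokens


def sequential_passes_block_diffusion(num_tokens, block_size=256, passes_per_block=8):
    """Block generation: a fixed number of runs per block (assumed: 8)."""
    blocks = math.ceil(num_tokens / block_size)
    return blocks * passes_per_block


tokens = 1024
print("word by word:", sequential_passes_autoregressive(tokens), "sequential runs")
print("in blocks:   ", sequential_passes_block_diffusion(tokens), "sequential runs")
```

Fewer, heavier runs are what keep a single-user GPU busy instead of waiting.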

So this is a model built for one person on one machine, doing something interactive and fast. Think live editing, rapid prototyping, a coding assistant that infills the middle of a function while you watch. Google specifically calls out in-line editing and code infilling as the sweet spot, because the model can see the whole block while it writes, so it can do things like close complex markdown formatting cleanly or render code in near real time.

If you have followed our writing on running AI yourself, this fits the same thread as Gemma 4 running on a normal laptop. The direction is clear: more of this runs on your own hardware, not someone else's server.

What this means for your business

Strip away the research talk and there are three practical reasons to care.

Speed changes what feels usable. A tool that responds instantly gets used. A tool with a two-second lag on every action gets abandoned. For anything interactive (an internal editor, a code helper, a drafting tool your team uses all day), a 4x speed jump is the difference between "nice demo" and "we actually use this."

It runs on hardware you can own. A 26 billion parameter model that only activates 3.8 billion at a time, quantized to fit in 18GB of memory, runs on a top consumer GPU. No per-request cloud bill that grows with use. No sending your data to an outside API. For a business in a regulated field, or one that simply does not want its work leaving the building, that matters. We make the same case in zero-trust architecture for non-technical owners: keep the sensitive stuff on machines you control.
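If you want to sanity-check the memory claim, the arithmetic is simple. The sketch below estimates weight storage only, counts a gigabyte as one billion bytes, and assumes every parameter is stored at the same precision. Google's actual quantization scheme is not something I have details on, and real usage adds overhead for activations and context on top of the weights.

```python
PARAMS = 26e9  # total parameters; all of them must sit in memory


def weight_gigabytes(params, bits_per_weight):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def bits_per_weight_for_budget(params, gigabytes):
    """Average bits per weight that would fill a given memory budget."""
    return gigabytes * 1e9 * 8 / params


for bits in (16, 8, 4):
    print(f"{bits}-bit weights: {weight_gigabytes(PARAMS, bits):.0f} GB")

print(f"18 GB budget: about {bits_per_weight_for_budget(PARAMS, 18):.1f} bits per weight")
```

At full 16-bit precision the weights alone would be far beyond any consumer card, which is why quantization is what makes local use possible. Note that the 3.8 billion "active" figure reduces the compute per token, not the memory needed to hold the model.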

It is free to build on. DiffusionGemma ships under the Apache 2.0 license, so you can download the weights and put it inside your own tools with no licensing fee. That is the same open footing as the rest of the Gemma family.

One caveat on the hardware, and Google flags it themselves: Apple Silicon Macs may not get the same speedup. The trick relies on a dedicated GPU being compute-bound. Mac chips share memory differently and tend to be limited by memory bandwidth instead, so the gain there is muted. If your team is all on MacBooks, test before you assume.

The catch nobody should skip

Here is the part the headline leaves out. DiffusionGemma's output quality is lower than standard Gemma 4. Google says so plainly: it prioritises speed, so the writing quality drops. It is labelled experimental, and the official advice is to use standard Gemma 4 for anything that demands maximum quality.

So this is a fast tool, not a smarter one. The right mental model is a quick first draft. Use it where speed and interactivity beat polish: prototypes, live editing, experiments, internal tools where a human is in the loop fixing anything off. For the final output a customer reads or a contract depends on, reach for a stronger model.

There is one genuinely new ability worth flagging. Because DiffusionGemma sees the whole block at once, it can handle tasks where an earlier word depends on a later one, which typewriter-style models struggle with because they commit to each word before the rest exists. Google's example: a fine-tuned version solving Sudoku, where every square depends on the others. That two-way attention also helps with code infilling and structured formats. It is a different shape of tool, good at some things normal models are bad at.

This is the pattern with most AI releases now. The announcement is one number; the useful judgment is knowing where it fits. We took the same line on Claude Fable 5 and what a cheaper model means for clients and on the wider shift in 13 AI terms every founder should understand in 2026. New model, same question: what is it actually good for, and what is it not?

Where DiffusionGemma fits, and where it doesn't

A short version you can act on.

Good fit: interactive local tools, fast prototyping, code infilling and in-line editing, structured tasks where words depend on each other, and any case where keeping data on your own hardware matters. Run it on a high-end NVIDIA GPU, quantized, and the speed is real.

Bad fit: high-traffic cloud serving (where standard models batch better and cost less), Apple Silicon as your main hardware, and any final output where quality has to be the top priority. For that last one, Google's own answer is standard Gemma 4.

If you want help working out whether something like this belongs in your stack, that is the kind of call we make with clients all the time. We wrote about how to think about it in custom software development, when you actually need it. The honest answer is sometimes yes, sometimes no, and a model being fast is rarely the deciding factor on its own.

What to do now

If you are curious, the weights are free on Hugging Face and you can run them with tools like vLLM, MLX, or Hugging Face Transformers. Start with a prototype, not a production tool, and compare the output against standard Gemma 4 so you can feel the quality trade for yourself.
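When you run that comparison, measure speed the same way for both models. Here is a minimal timing harness, written so it does not depend on any particular library: `generate` is whatever function you wrap around your runtime of choice, and `fake_generate` below is a hypothetical placeholder so the sketch runs on its own. Nothing here is a real vLLM, MLX, or Transformers call.

```python
import time


def measure_tokens_per_second(generate, prompt, runs=3):
    """Return the best observed rate over several runs.

    `generate` is any callable that takes a prompt string and returns a
    list of tokens. Wiring it to a real model is left to you.
    """
    best = 0.0
    for _ in range(runs):
        start = time.perf_counter()
        tokens = generate(prompt)
        elapsed = time.perf_counter() - start
        if elapsed > 0:
            best = max(best, len(tokens) / elapsed)
    return best


def fake_generate(prompt):
    """Hypothetical placeholder model: sleeps briefly, then echoes words."""
    time.sleep(0.01)
    return prompt.split() * 10


rate = measure_tokens_per_second(fake_generate, "draft a short product update")
print(f"{rate:.0f} tokens per second (placeholder model)")
```

Use the same prompts and the same hardware for both models, and read the outputs side by side as well as timing them. Speed is only half of the trade.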

If you would rather skip the experiment and just know whether faster local AI is worth building into your business, that is a conversation worth having. Talk to us. We will tell you straight when it fits and when it is a distraction.

Source: Google, "DiffusionGemma: 4x faster text generation" and the DiffusionGemma model overview, both 10 June 2026.
