Gemma 4 12B is Google’s new mid-sized multimodal model. It is designed to run efficiently on consumer laptops while delivering reasoning performance close to that of the larger 26B MoE model. It uses a unified, encoder-free architecture that processes vision and audio inputs directly through the LLM backbone rather than through separate encoders, which keeps memory use and latency low and allows native audio input. The model is released under the Apache 2.0 license and is supported across a broad range of developer tools, so it can be run locally, fine-tuned, or deployed in production.
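
To put the laptop claim in rough numbers, the sketch below estimates how much memory the weights alone would occupy at a few common precisions. It assumes that "12B" means about 12 billion parameters and that the listed precisions are available for this model; neither is stated in the text above. The function name `weight_memory_gib` is illustrative, and the estimate ignores the KV cache, activations, and runtime overhead.

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Approximate memory needed to hold the weights alone, in GiB."""
    bytes_total = num_params * bits_per_param / 8
    return bytes_total / 2**30


# Assumption: "12B" is read as roughly 12 billion parameters.
NUM_PARAMS = 12e9

# Assumption: these precisions are illustrative, not a list of official releases.
for label, bits in [("fp32", 32), ("bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{label:>5}: {weight_memory_gib(NUM_PARAMS, bits):5.1f} GiB")
```

Under these assumptions the weights come to about 22 GiB at 16-bit precision and under 6 GiB at 4-bit, which suggests that a quantized build is what would fit comfortably in typical laptop memory.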