✦Field note · 04 June 2026

When the weights finally fit.

Two model drops in the same window. Bonsai Image 4B fits an image diffuser on an iPhone. Gemma 4 12B fits a unified multimodal model on a 16 GB laptop. Last month the hardware arrived. This month, the models did.

✦What dropped

Two labs. Same bet.

     ╭───────╮
     │ ◆ ◆ ◆ │
     │  ◆ ◆  │
     │   ◆   │
     ╰───────╯

Prism ML

Bonsai Image 4B

Text-to-image diffusion, 4 B parameters, compressed to 0.93 GB (1-bit) or 1.21 GB (ternary). First image model in its class to run directly on an iPhone. Ternary variant retains 95% of FLUX.2 Klein 4B accuracy. Apache 2.0.

9.4 s · 512×512 on iPhone 17 Pro Max
6.0 s · 512×512 on Mac M4 Pro (5.6× faster than full-precision FLUX at 33.6 s)
Apple Silicon + CUDA

     ┌───────┐
     │ ▓ ▓ ▓ │
     │ ▓ ◉ ▓ │
     │ ▓ ▓ ▓ │
     └───────┘

Google

Gemma 4 12B

Unified multimodal model. Text, vision, and audio flow directly into the same transformer, with no separate encoders. Runs on a 16 GB laptop. Approaches the 26 B MoE variant on benchmarks. Apache 2.0.

Encoder-free architecture
Multi-Token Prediction drafters for lower latency
Available via Hugging Face, Kaggle, Ollama, LM Studio
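
How a drafter can lower latency, as a toy. This is a minimal sketch, assuming the Multi-Token Prediction drafters are used in the usual draft-and-verify (speculative decoding) style; the note above does not spell out the mechanism. `target_next`, `draft_next`, and `speculative_decode` are hypothetical stand-ins, not Gemma APIs.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    """Toy 'large model' (hypothetical): next token is the context sum mod 7."""
    return sum(ctx) % 7


def draft_next(ctx: List[Token]) -> Token:
    """Toy 'drafter' (hypothetical): agrees with the target except when
    the context length is a multiple of 4."""
    guess = sum(ctx) % 7
    return guess if len(ctx) % 4 else (guess + 1) % 7


def speculative_decode(
    prompt: List[Token], n_new: int, k: int, target: NextFn, draft: NextFn
) -> Tuple[List[Token], int]:
    """Greedy draft-and-verify. Returns (tokens, number of verify passes)."""
    out = list(prompt)
    verify_passes = 0
    while len(out) - len(prompt) < n_new:
        # 1. The cheap drafter proposes k tokens in a row.
        proposal: List[Token] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position. A real model would
        #    do this in one batched forward pass, so we count it as one.
        verify_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token, drop the rest.
                accepted.append(expected)
                break
        out.extend(accepted)
    return out[: len(prompt) + n_new], verify_passes


if __name__ == "__main__":
    tokens, passes = speculative_decode([1, 2, 3], 12, 4, target_next, draft_next)
    print(tokens, passes)
```

The output matches plain greedy decoding with the target alone, because every kept token is one the target would have chosen. The saving is in how many times the large model has to run.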

✦What fits where


   model                         size on disk        runs on
   ──────────────────────────────────────────────────────────────
   FLUX.2 Klein 4B  (fp16)       7.75 GB             data-center GPU
   Bonsai Image 4B  (ternary)    1.21 GB             iPhone 17 Pro Max
   Bonsai Image 4B  (1-bit)      0.93 GB             iPhone 17 Pro Max
   Gemma 4 12B                   ~12 GB              laptop, 16 GB unified memory
   Gemma 4 26B (MoE)             ~24 GB              laptop, 32 GB unified memory
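
The Bonsai rows invite some arithmetic: how many bits per weight do those file sizes actually imply? A rough sketch, assuming decimal gigabytes and that the fp16 checkpoint is essentially all 16-bit weights. Both are assumptions, not figures from Prism.

```python
GB = 1e9          # assumption: sizes are decimal gigabytes
FP16_BITS = 16

# Sizes on disk, taken from the table above.
sizes_gb = {"fp16": 7.75, "ternary": 1.21, "1-bit": 0.93}

# Parameter count implied by the fp16 checkpoint (about 3.9 billion).
params = sizes_gb["fp16"] * GB * 8 / FP16_BITS


def effective_bits(size_gb: float) -> float:
    """Average bits stored per parameter for a checkpoint of this size."""
    return size_gb * GB * 8 / params


def compression_ratio(size_gb: float) -> float:
    """How many times smaller than the fp16 checkpoint."""
    return sizes_gb["fp16"] / size_gb


if __name__ == "__main__":
    for name in ("ternary", "1-bit"):
        size = sizes_gb[name]
        print(f"{name}: {effective_bits(size):.2f} bits/weight, "
              f"{compression_ratio(size):.1f}x smaller than fp16")
```

Under those assumptions the ternary file works out to about 2.5 bits per weight and the 1-bit file to about 1.9, both above their nominal widths. That suggests not every tensor is stored at the headline precision; the note does not say which ones.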

✦Gemma’s bet

Throw out the encoders.

Traditional multimodal stacks bolt a vision encoder, an audio encoder, and a text tokenizer onto a language model. Gemma 4 removes the bolts. Vision and audio project directly into the token space. One backbone, three modalities, one tensor stream. Less to compile, less to ship, less to break on-device.


   text   ───────╮
                  ╲
   vision ────────┼──→  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  ──→  output
                  ╱     unified LLM core
   audio  ───────╯       (no encoders)

           one model. three modalities. one tensor stream.
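
What "project directly into the token space" could look like, in miniature. This is a sketch under stated assumptions: one linear projection per modality, made-up dimensions, random weights. The note does not describe Gemma 4's actual input layers, and every name here is hypothetical.

```python
import random
from typing import List

Vector = List[float]
Matrix = List[Vector]

D_MODEL = 8      # hypothetical width of the shared token space
VOCAB = 16       # hypothetical text vocabulary size
PATCH_DIM = 12   # hypothetical flattened image-patch size
FRAME_DIM = 5    # hypothetical audio-frame feature size


def make_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Random weights standing in for learned parameters."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def project(x: Vector, w: Matrix) -> Vector:
    """Linear map x @ w: raw features in, one token-space vector out."""
    return [sum(xi * w[i][j] for i, xi in enumerate(x)) for j in range(len(w[0]))]


rng = random.Random(0)
embed_table = make_matrix(VOCAB, D_MODEL, rng)   # text: embedding lookup
w_vision = make_matrix(PATCH_DIM, D_MODEL, rng)  # vision: one projection
w_audio = make_matrix(FRAME_DIM, D_MODEL, rng)   # audio: one projection


def build_stream(token_ids: List[int], patches: List[Vector],
                 frames: List[Vector]) -> List[Vector]:
    """Return one sequence of D_MODEL-wide vectors for a single backbone."""
    text = [embed_table[t] for t in token_ids]
    vision = [project(p, w_vision) for p in patches]
    audio = [project(f, w_audio) for f in frames]
    return vision + audio + text


if __name__ == "__main__":
    patches = [[rng.random() for _ in range(PATCH_DIM)] for _ in range(4)]
    frames = [[rng.random() for _ in range(FRAME_DIM)] for _ in range(3)]
    stream = build_stream([1, 5, 9], patches, frames)
    print(len(stream), len(stream[0]))
```

Everything downstream sees one list of same-width vectors, which is the "one tensor stream" in the diagram.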

✦Bonsai’s bet

Quantize until it fits in your pocket.

The base model is FLUX.2 Klein 4B — respectable, but 7.75 GB and datacenter-bound. Prism takes it down to 1.21 GB ternary (95% accuracy retained) and 0.93 GB at 1-bit (88% retained). That’s the difference between “impressive demo” and “ships in an app store binary.”


   prompt ─→ Bonsai 4B ─→ 512 × 512 image

   ┌──────────────────────────────────────────────┐
   │ iPhone 17 Pro Max   ▰▰▰▰              9.4 s  │
   │ Mac M4 Pro          ▰▰▰               6.0 s  │
   │ full-precision FLUX ▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰  33.6 s  │
   └──────────────────────────────────────────────┘
              ternary retains 95% of FLUX accuracy
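
What ternary and 1-bit weights mean in practice, on six toy numbers. This is a generic sketch: absmean scaling for ternary, sign-plus-scale for 1-bit. Prism's actual recipe is not described in this note, so treat both functions as illustrative assumptions.

```python
from typing import List, Tuple


def _mean_abs(weights: List[float]) -> float:
    return sum(abs(w) for w in weights) / len(weights)


def quantize_ternary(weights: List[float]) -> Tuple[List[int], float]:
    """Map each weight to -1, 0, or +1 plus one shared scale (absmean)."""
    scale = _mean_abs(weights)
    if scale == 0:
        return [0] * len(weights), 0.0
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def quantize_1bit(weights: List[float]) -> Tuple[List[int], float]:
    """Keep only the sign of each weight plus one shared scale."""
    scale = _mean_abs(weights)
    codes = [1 if w >= 0 else -1 for w in weights]
    return codes, scale


def dequantize(codes: List[int], scale: float) -> List[float]:
    """Rebuild approximate weights from codes and the shared scale."""
    return [c * scale for c in codes]


def mean_abs_error(a: List[float], b: List[float]) -> float:
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


if __name__ == "__main__":
    w = [0.9, -0.05, 0.4, -1.2, 0.02, -0.6]
    t_codes, t_scale = quantize_ternary(w)
    b_codes, b_scale = quantize_1bit(w)
    print(t_codes, mean_abs_error(w, dequantize(t_codes, t_scale)))
    print(b_codes, mean_abs_error(w, dequantize(b_codes, b_scale)))
```

On this toy vector the ternary codes reconstruct the weights more closely than the 1-bit codes, because small weights can snap to zero instead of being forced to a full-magnitude sign. That is the same direction as the 95% versus 88% retention figures, though the toy says nothing about their size.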

✦Why both, why now

The hardware met the weights halfway.

Spark gave us 128 GB of headroom on a laptop. Bonsai and Gemma answer the question that headroom raises: what do you actually run in it? The pattern is the same in both announcements: smaller, denser, multimodal-by-default, permissively licensed, designed to ship.

✦What you can build with this stack

App-store apps that do not phone home.

Image generation in a private journal
Bonsai ships inside the app binary. Prompts and outputs never leave the phone.
Voice-and-screen assistants
Gemma 4 reads what’s on screen, listens, and answers, all in one model.
Document workflows that respect documents
Multimodal vision means scanned PDFs become structured data without a cloud round-trip.
Distillation, on-device
Apache 2.0 means you can fine-tune, redistribute, and ship variants — the cloud isn’t the only training venue anymore.

✦How we’re reading this

The on-device stack is now three layers deep.

Silicon at the bottom (Spark, Apple Silicon). Weights in the middle (Bonsai, Gemma, Llama, Qwen, MiniMax). Apps on top (Masi, Maya, Drik). Each layer is moving independently, and each just shipped a meaningful step.

The stack is no longer theoretical.

✦What’s still hard

Not all of it is solved.

Bonsai needs an iPhone 17 Pro Max — last-gen phones aren’t there yet.
Gemma 4 12B needs 16 GB of unified memory; lots of laptops in the wild have 8 GB.
Quality gap to frontier is still real for hard reasoning + niche domains.
On-device multimodal tooling (eval, observability, prompt mgmt) is still scattered.
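
The memory limits above reduce to a quick fit check. A back-of-envelope sketch, assuming weights may take at most 75% of device memory. That ratio is our assumption, chosen because it matches the 12-in-16 and 24-in-32 pairings in the size table; real headroom depends on the OS, context length, and runtime. Sizes are the approximate ones from this note.

```python
from typing import List, Optional

HEADROOM = 0.75  # assumption: share of memory the weights may occupy


def fits(weights_gb: float, memory_gb: float, headroom: float = HEADROOM) -> bool:
    """True if the weights fit within the allowed share of device memory."""
    return weights_gb <= memory_gb * headroom


def smallest_fit(weights_gb: float, options_gb: List[int]) -> Optional[int]:
    """Smallest memory size from options_gb that can hold the weights."""
    for memory_gb in sorted(options_gb):
        if fits(weights_gb, memory_gb):
            return memory_gb
    return None


# Approximate sizes on disk from the table in this note.
models = {
    "Bonsai Image 4B (ternary)": 1.21,
    "Gemma 4 12B": 12.0,
    "Gemma 4 26B (MoE)": 24.0,
}
laptops_gb = [8, 16, 32]

if __name__ == "__main__":
    for name, size in models.items():
        print(f"{name}: needs a {smallest_fit(size, laptops_gb)} GB machine")
```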

✦ ✦ ✦

The silicon got bigger.
The weights got smaller.
They just met.

✦Further reading