atman

Field note · 04 June 2026

When the weights finally fit.

Two model drops in the same window. Bonsai Image 4B fits an image diffuser on an iPhone. Gemma 4 12B fits a unified multimodal model on a 16 GB laptop. Last month the hardware arrived. This month, the models did.

What dropped

Two labs. Same bet.

Prism ML

Bonsai Image 4B

Text-to-image diffusion, 4 B parameters, compressed to 0.93 GB (1-bit) or 1.21 GB (ternary). First image model in its class to run directly on an iPhone. Ternary variant retains 95% of FLUX.2 Klein 4B accuracy. Apache 2.0.

  • 9.4 s · 512×512 on iPhone 17 Pro Max
  • 6.0 s · 512×512 on Mac M4 Pro (5.6× faster than full-precision FLUX at 33.6 s)
  • Apple Silicon + CUDA

Google

Gemma 4 12B

Unified multimodal model. Text, vision, audio flow directly into the same transformer, with no separate encoders. Runs on a 16 GB laptop. Approaches the 26 B MoE variant on benchmarks. Apache 2.0.

  • Encoder-free architecture
  • Multi-Token Prediction drafters for lower latency
  • Available via Hugging Face, Kaggle, Ollama, LM Studio

What fits where


   model                         size on disk        runs on
   ──────────────────────────────────────────────────────────────
   FLUX.2 Klein 4B  (fp16)       7.75 GB             data-center GPU
   Bonsai Image 4B  (ternary)    1.21 GB             iPhone 17 Pro Max
   Bonsai Image 4B  (1-bit)      0.93 GB             iPhone 17 Pro Max
   Gemma 4 12B                   ~12 GB              16 GB-VRAM laptop
   Gemma 4 26B (MoE)             ~24 GB              32 GB-VRAM laptop
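
The sizes in the table can be turned into compression ratios and a rough fit check. This is a minimal sketch that uses only the figures quoted above; the `fits` helper and its 25% headroom default are illustrative assumptions, not numbers from either lab.

```python
# Sizes on disk in GB, copied from the table above.
SIZES_GB = {
    "FLUX.2 Klein 4B (fp16)": 7.75,
    "Bonsai Image 4B (ternary)": 1.21,
    "Bonsai Image 4B (1-bit)": 0.93,
}


def compression_ratio(base_gb: float, small_gb: float) -> float:
    """How many times smaller the compressed model is than the base."""
    return base_gb / small_gb


def fits(model_gb: float, memory_gb: float, headroom: float = 0.25) -> bool:
    """Rough check: do the weights fit while leaving `headroom` of memory free?

    The 25% default is an assumption covering the OS, activations, and caches.
    """
    return model_gb <= memory_gb * (1.0 - headroom)


base = SIZES_GB["FLUX.2 Klein 4B (fp16)"]
for name, size in SIZES_GB.items():
    print(f"{name}: {compression_ratio(base, size):.1f}x smaller than fp16")
```

On these numbers the ternary build is about 6.4 times smaller than the fp16 base and the 1-bit build about 8.3 times smaller. Real memory use also depends on activations and runtime overhead, which this check only approximates.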

Gemma’s bet

Throw out the encoders.

Traditional multimodal stacks bolt a vision encoder, an audio encoder, and a text tokenizer onto a language model. Gemma 4 removes the bolts. Vision and audio project directly into the token space. One backbone, three modalities, one tensor stream. Less to compile, less to ship, less to break on-device.


   text   ───────╮
                  ╲
   vision ────────┼──→  ▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒▒  ──→  output
                  ╱     unified LLM core
   audio  ───────╯       (no encoders)

           one model. three modalities. one tensor stream.
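
To make "project directly into the token space" concrete, here is a toy sketch of the idea. It is not Gemma's code or architecture: the dimensions, the random projection matrices, and every name in it are hypothetical stand-ins for learned components.

```python
import random

D_MODEL = 8  # toy embedding width; an assumption for illustration only


def make_projection(in_dim, out_dim, seed):
    """Random linear map standing in for a learned projection."""
    rng = random.Random(seed)
    return [[rng.uniform(-1, 1) for _ in range(out_dim)] for _ in range(in_dim)]


def project(vectors, weights):
    """Multiply each input vector by the in_dim x out_dim projection matrix."""
    return [
        [sum(x * w for x, w in zip(vec, col)) for col in zip(*weights)]
        for vec in vectors
    ]


text_tokens = [[0.1] * D_MODEL for _ in range(3)]  # 3 already-embedded text tokens
image_patches = [[0.5] * 12 for _ in range(4)]     # 4 raw patches, 12 values each
audio_frames = [[0.2] * 6 for _ in range(5)]       # 5 raw frames, 6 values each

vision_proj = make_projection(12, D_MODEL, seed=0)
audio_proj = make_projection(6, D_MODEL, seed=1)

# One sequence of same-width vectors: the single stream the backbone would see.
stream = (
    text_tokens
    + project(image_patches, vision_proj)
    + project(audio_frames, audio_proj)
)
print(len(stream), len(stream[0]))
```

The point of the sketch is the shape of the data: after a single linear map per modality, everything is a vector of the same width in one sequence, with no separate encoder network in between.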

Bonsai’s bet

Quantize until it fits in your pocket.

The base model is FLUX.2 Klein 4B — respectable, but 7.75 GB and datacenter-bound. Prism takes it down to 1.21 GB ternary (95% accuracy retained) and 0.93 GB at 1-bit (88% retained). That’s the difference between “impressive demo” and “ships in an app store binary.”


   prompt ─→ Bonsai 4B ─→ 512 × 512 image

   ┌──────────────────────────────────────────────┐
   │ iPhone 17 Pro Max   ▰▰▰▰▰▰▰▰▰▰▰▰      9.4 s  │
   │ Mac M4 Pro          ▰▰▰▰▰▰▰▰          6.0 s  │
   │ full-precision FLUX ▰▰▰▰▰▰▰▰▰▰▰▰▰▰▰  33.6 s  │
   └──────────────────────────────────────────────┘
              ternary retains 95% of FLUX accuracy
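
For readers new to ternary weights, the sketch below shows one common way to get them: absmean rounding, the scheme popularized by BitNet b1.58. The note above does not say which method Prism uses, so treat this as an illustration of the general technique, not a description of Bonsai.

```python
def ternary_quantize(weights):
    """Map floats to {-1, 0, +1} plus one shared scale (absmean rounding)."""
    # Scale is the mean absolute weight; fall back to 1.0 if all weights are zero.
    scale = sum(abs(w) for w in weights) / len(weights) or 1.0
    # Divide by the scale, round to the nearest integer, clamp to [-1, 1].
    codes = [max(-1, min(1, round(w / scale))) for w in weights]
    return codes, scale


def dequantize(codes, scale):
    """Recover approximate float weights from ternary codes."""
    return [c * scale for c in codes]


weights = [0.9, -0.05, 0.4, -1.2, 0.0, 0.6]
codes, scale = ternary_quantize(weights)
print(codes)                     # every entry is -1, 0, or +1
print(dequantize(codes, scale))  # coarse approximation of the originals
```

Each weight now carries one of three values, which is where the storage saving comes from. The price is the approximation error visible in the dequantized list.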

Why both, why now

The hardware met the weights halfway.

Spark gave us 128 GB of headroom on a laptop. Bonsai and Gemma answer the question that headroom raises: what do you actually run in it? The pattern is the same in both announcements: smaller, denser, multimodal-by-default, permissively licensed, designed to ship.

What you can build with this stack

App-store apps that do not phone home.

How we’re reading this

The on-device stack is now three layers deep.

Silicon at the bottom (Spark, Apple Silicon). Weights in the middle (Bonsai, Gemma, Llama, Qwen, MiniMax). Apps on top (Masi, Maya, Drik). Each layer is moving independently, and each just shipped a meaningful step.

The stack is no longer theoretical.

What’s still hard

Not all of it is solved.

✦ ✦ ✦

The silicon got bigger.
The weights got smaller.
They just met.