FTerViT: A Fully Ternary Vision Transformer That Runs on a $10 Microcontroller
1 CSEM, Neuchâtel·2 ETH Zürich
TL;DR
A Vision Transformer stores its weights as 32-bit floats. Ternary quantization shrinks each weight to one of three values — {-1, 0, +1} — packed in 2 bits, for a theoretical 16× compression. The catch: every existing ternary ViT only ternarizes the transformer encoder, and leaves the patch embedding, LayerNorm, and classifier head in INT8 or FP32 because those layers are notoriously fragile under quantization.
That exception is fine at 8 bits. At 2 bits it becomes the bottleneck. In DeiT-Tiny, those “leftover” components are under 4% of the parameters but 38% of the bytes on disk.
FTerViT ternarizes all of it — every weight matrix and every normalization parameter. It reaches 82.43% ImageNet-1K top-1 at 6.09 MB (~15× compression, only −2.42 pp vs. FP32), beats prior ternary ViTs by up to 8 pp, and is the first ternary ViT to run on a microcontroller — a dual-core Xtensa LX7 inside a $10 ESP32-S3.
Interactive — hover any bar. Where the bytes go in DeiT-Tiny as precision drops: the red “FP32 leftover” (patch embedding, LayerNorm, head, biases) balloons to 39% of a partially-ternary model, but full ternarization shrinks it to 10% — taking the model from 22.9 MB to 1.6 MB.
Why the first and last layers are the hard part
Quantizing the first and last layers of a network has been known to be risky since long before ViTs existed. When TerViT’s authors tried to fully ternarize the patch embedding and classifier, accuracy fell 22.4 percentage points.
We measured why, using two standard layer-importance estimators (Taylor first-order and a Hessian-trace approximation). The result is counterintuitive:
| Component | Share of params | Share of importance (Taylor-FO) |
|---|---|---|
| LayerNorm | < 0.2% | 34–39% |
| Patch embedding | 1–3% | most important single layer in DeiT-S |
| Classifier head | 2–3% | < 1.3% |
LayerNorm holds a fraction of a percent of the weights yet carries a third of the model’s sensitivity. The patch embedding is the single most important layer. So you cannot just brute-force ternarize them — you need primitives that survive it.
The method, in three pieces
FTerViT replaces every weight-carrying module with a ternary equivalent:
- TernaryBitLinear — for the fully-connected layers (BitNet b1.58 style: weights →
{-1, 0, +1}via absmean scaling, activations → 8-bit). Absmean is used instead of max-scaling because dividing by max|W| pushes most weights into the zero bin. - TernaryBitConv2d — for the patch embedding, with per-channel scaling so heterogeneous filter magnitudes don’t collapse.
- TernaryLayerNorm — ternarizes the learnable affine parameters γ and β. The per-token mean/variance stays FP32 (negligible cost; ternarizing it would destroy positional information).
Then a two-phase knowledge-distillation recipe, with the FP32 model as a frozen teacher of the same architecture:
- Phase 1 (training): distill at learning rate 1e-4 with cosine decay until validation accuracy saturates (~250 epochs).
- Phase 2 (recovery): restart at a low 1e-5 for just 10 epochs.
The loss is pure forward-KL — no cross-entropy term, because CE conflicts with KL in low-bit networks. The Phase-2 restart is the surprise: it lifts top-1 by 3–4 pp in 10 epochs, and starting it from epoch 250 beats running Phase 1 all the way to epoch 400. The extra 150 epochs of slow training buy essentially nothing.
Interactive — hover for a synced readout across all curves; click a legend chip to toggle one. Phase 1 (top) saturates near 78% and never closes the gap to the FP32 teacher (83.08%). Phase 2 (bottom) restarts five Phase-1 checkpoints at a low LR: the epoch-250 run reaches 79.64% in 10 epochs, edging out the epoch-400 run.
Results
On ImageNet-1K, FTerViT is the best ternary DeiT-S reported — and it does it while being fully ternary, not partially.
Interactive — hover any marker for model, bit-width, size, and accuracy. Stars are FTerViT; the same numbers are listed below.
| Method | W/A bits | Model | Size | Top-1 |
|---|---|---|---|---|
| TerViT | 2 / 8 | DeiT-S | 6.0 MB | 74.2% |
| ViT-1.58b | 2 / 8 | ViT-L | 57 MB | 74.25% |
| Q-ViT | 2 / 2 | DeiT-S | 6.0 MB | 72.1% |
| FTerViT (ours) | 2 / 8 | DeiT-S | 5.81 MB | 77.47% |
| FTerViT (ours) | 2 / 8 | DeiT-III-S²²⁴ | 5.81 MB | 79.64% |
| FTerViT (ours) | 2 / 8 | DeiT-III-S³⁴⁴ | 6.09 MB | 82.43% |
A few details worth noting: 37% of the ternary weights end up at exactly zero, which is free sparsity for inference. The ternarized model preserves what the FP32 teacher “looks at” — attention rollout maps land on the same semantic regions, and patch-embedding features stay at 0.88 cosine similarity to FP32. The gap to FP32 shrinks as the model grows: 9.2 pp for DeiT-Tiny, but only 2.4 pp for DeiT-III-S³⁴⁴.
Running it on a $10 board
This is the part I find most fun. We deployed the 224×224 model on an ESP32-S3-EYE — a dual-core Xtensa LX7 at 240 MHz with 8 MB PSRAM, a 2 MP camera, and a small LCD, for about $10. The original 88.3 MB FP32 model cannot even load on this device; the 5.81 MB ternary model fits in flash at 79% partition use.
The same $10 ESP32-S3-EYE board: (left) running FTerViT entirely on-device — no cloud, no network — and calling the cat at 100%; (right) the MCU boot screen with the CSEM and ETH Zürich logos.
The inference engine is standalone pure C with no external dependencies: weights are bit-packed 4-per-byte, kernels do integer multiply-accumulate with fused QKV projections. Beyond just fitting, ternary is genuinely cheaper to run. Measured on a Nordic PPK2 power probe:
Interactive — tap Latency / Power / Energy to switch metric; hover a bar for the delta vs. FP32. On a DeiT-Tiny FC1 layer, packed ternary is 1.71× faster than FP32 (804 ms vs. 1376 ms) and uses 55% less above-idle energy (59 mJ vs. 130 mJ). Smaller weights mean less memory traffic, which cuts both latency and active power.
Most prior MCU-scale ViTs get there by redesigning the architecture with neural architecture search plus INT8. FTerViT takes the orthogonal route: keep a strong off-the-shelf backbone (DeiT-III-S) and compress it 15× — landing 5.1 pp above MCUFormer on ImageNet.
Takeaways
- The components everyone assumed were too fragile to ternarize — patch embedding, LayerNorm, classifier head — can be ternarized, if you distill from a same-architecture teacher.
- A short low-LR recovery phase beats grinding out more training epochs.
- A standard ViT, compressed this way, fits and runs on commodity microcontroller hardware — which is exactly the direction I think edge AI is heading.
The pretrained models and full code are public: GitHub · Hugging Face. This work is supported by the Swiss National Science Foundation and the SwissChips initiative (ETH Zürich, EPFL, CSEM).