JouleRoute-LM — George Hu

Chapter One

The Problem.

A transformer classifier runs every input through its full stack of layers — an easy example and a hard one cost exactly the same. But the easy ones are often already decided halfway up. The depth spent past that point is wasted compute, and on a laptop that is wasted battery.

This project rebuilds a prior simulation ("JouleRoute v1") on real model weights. The earlier version modelled the idea; this one attaches working early-exit heads to a real BERT-base encoder and asks the only question that matters on hardware: does stopping early save measurable energy, and what does it cost in accuracy?

"An easy sentence should not cost the same as a hard one — and the only way to know what it really costs is to measure the joules."

The lineage is explicit. BranchyNet (2016) introduced early-exit branches gated by prediction entropy; DeeBERT (2020) applied the idea to BERT: freeze the backbone, train a small off-ramp after each layer, and exit the first time one is confident. I follow DeeBERT closely, then add the v1 continuity hook: a simulated battery budget that loosens the exit threshold as it depletes, so a draining device exits earlier.

Scope is deliberately narrow and honest: encoder classification only (BERT-base, 12 layers, SST-2 sentiment). Decoder / generative early-exit is a real research frontier, not this experiment; why it is hard is covered under future work, after the limitations.

[Figure: the 12-layer stack in four bands. L1–3: early heads, easy sentences leave here. L4–7: entropy threshold, most exits land mid-stack. L8–10. L11–12: full depth, only the hard sentences reach it.]

Chapter Two

The Approach.

Freeze the backbone, train only the off-ramps, then route by confidence. Three setups isolate what early-exit — and the battery budget — actually contribute, each measured against the full-depth baseline.

Setup 01

Full-Depth Baseline

All 12 layers

The frozen, SST-2-fine-tuned BERT-base run end to end — every sentence through all 12 layers. What a standard deployment does, and the honest reference point: 92.43% validation accuracy, logged before any routing existed.

Setup 02

Entropy Early-Exit

Confidence-only

A linear head after each layer (DeeBERT-style, backbone frozen). At each layer I take the softmax entropy of its head; the first time it falls below a threshold τ, the sentence exits with that prediction. Confident inputs leave early; hard ones ride to the top.
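The exit rule described above can be sketched in a few lines. This is a minimal, dependency-free illustration rather than the project's code: the function names are hypothetical, entropy is assumed to be in nats (natural log), and the logits are made-up toy values for a 4-layer stack.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; terms with p == 0 contribute 0."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def route(per_layer_logits, tau):
    """Return (exit_layer, predicted_class) for one sentence.

    per_layer_logits[k] holds the logits of the head after layer k + 1.
    The sentence exits at the first head whose entropy is below tau;
    if no head qualifies, the last head's prediction is used.
    """
    if not per_layer_logits:
        raise ValueError("need at least one head")
    for layer, logits in enumerate(per_layer_logits, start=1):
        probs = softmax(logits)
        if entropy(probs) < tau:
            break
    predicted = max(range(len(probs)), key=probs.__getitem__)
    return layer, predicted

# Hypothetical logits for a 4-layer toy stack (not measured values).
easy = [[0.2, 0.1], [3.0, -3.0], [4.0, -4.0], [5.0, -5.0]]
hard = [[0.1, 0.0], [0.3, 0.0], [0.2, 0.4], [0.5, 1.5]]
print(route(easy, tau=0.20))  # exits at the second head
print(route(hard, tau=0.20))  # never confident enough, rides to the top
```

Under the natural-log assumption, a two-class prediction has a maximum entropy of ln 2 ≈ 0.693, so a threshold close to that value lets almost every sentence leave at the first head.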

Setup 03
Entropy + Battery Budget
JouleRoute — the v1 hook
The continuity link to v1: a simulated battery budget that loosens the entropy threshold as it drains, so a depleting device exits earlier and spends less — trading a little accuracy for energy on purpose. Clearly labelled simulated; it is a control knob, never a measured joule.
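The page does not give the schedule that maps battery level to threshold, so the sketch below assumes a simple linear ramp purely for illustration. `effective_tau`, `max_tau`, and the ramp itself are hypothetical; only the direction (lower battery, looser τ, earlier exits) comes from the description above.

```python
import math

LN2 = math.log(2)  # maximum entropy (nats) of a two-class prediction

def effective_tau(base_tau, battery, max_tau=LN2):
    """Loosen the entropy threshold as the simulated battery drains.

    battery is a fraction in [0, 1]. At 1.0 the threshold equals base_tau;
    at 0.0 it reaches max_tau. The linear ramp is an illustrative
    assumption, not the schedule used in the project.
    """
    if not 0.0 <= battery <= 1.0:
        raise ValueError("battery must be in [0, 1]")
    return base_tau + (max_tau - base_tau) * (1.0 - battery)

for battery in (1.0, 0.5, 0.1):
    print(f"battery={battery:.0%}  tau={effective_tau(0.20, battery):.3f}")
```

The returned value would simply replace τ in the entropy check, which keeps the battery budget a control knob on routing and nothing more.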

Measurement Design

Everything runs on an M5 MacBook Air via the MPS backend. The backbone is frozen, so I cache its per-layer features once and train the 12 linear heads on top: cheap, reproducible, seeded. The depth gradient is the whole premise: head accuracy climbs from ~71% at layer 1 to ~92% by layer 10.

Energy is the part that has to be real. Every joule comes from a raw powermetrics capture written to a timestamped log — never estimated from layer counts. I measure energy-per-example at each fixed depth with identical full batches (constant GPU saturation), then weight by each threshold's real exit distribution. Three measurement methods were tried; two were discarded for confounds before the numbers were trusted.
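The weighting step can be written as an expectation over exit depths. The sketch below is a hedged illustration: `expected_energy` is a hypothetical helper, and both the per-depth energies and the exit distribution are placeholder numbers, not values from the powermetrics logs.

```python
def expected_energy(exit_fraction, energy_at_depth):
    """Energy per example when a share exit_fraction[d] of sentences
    stops at depth d and a sentence stopping there costs energy_at_depth[d].
    """
    if abs(sum(exit_fraction.values()) - 1.0) > 1e-9:
        raise ValueError("exit fractions must sum to 1")
    return sum(share * energy_at_depth[depth]
               for depth, share in exit_fraction.items())

# Hypothetical per-depth energies in joules per example (placeholders).
energy_at_depth = {4: 0.40, 8: 0.70, 12: 1.00}
# Hypothetical exit distribution for one threshold (placeholder).
exit_fraction = {4: 0.30, 8: 0.50, 12: 0.20}

e_tau = expected_energy(exit_fraction, energy_at_depth)
saved = 1.0 - e_tau / energy_at_depth[12]
print(f"{e_tau:.3f} J/example, {saved:.0%} saved vs full depth")
```

Measuring each fixed depth separately and weighting afterwards keeps batch size and GPU saturation constant across depths, which is the property the paragraph above relies on.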

Statistics run over 5 random seeds — the only randomness is head training; the backbone never moves. A paired permutation test compares early-exit against full-depth on the same sentences; a bootstrap puts a 95% confidence interval on the accuracy change, layers saved, and energy saved.

Energy — measured

Real powermetrics logs

Whole-SoC power sampled by Apple's powermetrics, integrated over time, idle floor subtracted. No joule is reported without a raw log behind it.
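Integration and idle subtraction might look like the sketch below. It assumes the log has already been parsed into (time in seconds, watts) pairs; parsing the raw powermetrics output is deliberately left out, and the samples, idle floor, and batch size here are made-up values.

```python
def net_energy_joules(samples, idle_watts):
    """Integrate (time_s, watts) samples with the trapezoidal rule and
    subtract the idle floor over the same time window."""
    if len(samples) < 2:
        raise ValueError("need at least two samples")
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        total += 0.5 * (p0 + p1) * (t1 - t0)
    duration = samples[-1][0] - samples[0][0]
    return total - idle_watts * duration

# Hypothetical capture: four power samples half a second apart.
samples = [(0.0, 6.0), (0.5, 8.0), (1.0, 8.0), (1.5, 6.0)]
joules = net_energy_joules(samples, idle_watts=2.0)
per_example = joules / 64  # hypothetical: 64 examples ran in this window
print(f"{joules:.2f} J net, {per_example:.4f} J/example")
```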

Significance

Paired Permutation Test

Non-parametric, per-example sign-flip on the same 872 validation sentences — the right test for comparing two classifiers on shared inputs.
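A sign-flip test of this kind can be sketched with the standard library alone. This is an illustrative Monte Carlo version with a hypothetical function name; the project itself cites NumPy and SciPy, and its exact implementation is not shown on this page.

```python
import random

def paired_permutation_p(correct_a, correct_b, n_perm=10_000, seed=0):
    """Two-sided paired sign-flip test on per-example correctness (0/1).

    Under the null the two systems are exchangeable on each sentence, so
    each per-example difference keeps or flips its sign with probability 1/2.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("both systems must be scored on the same sentences")
    diffs = [a - b for a, b in zip(correct_a, correct_b)]
    observed = abs(sum(diffs))
    nonzero = [d for d in diffs if d != 0]  # ties never change the sum
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_perm):
        flipped = sum(d if rng.random() < 0.5 else -d for d in nonzero)
        if abs(flipped) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one keeps p strictly positive

# Toy inputs (not project data): identical systems give p = 1.
print(paired_permutation_p([1, 0, 1, 1], [1, 0, 1, 1], n_perm=1000))
```

Only sentences where the two classifiers disagree carry information, which is why the loop flips signs over the non-zero differences and ignores ties.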

Uncertainty

Bootstrap 95% CI

10,000 resamples (nested over seeds and examples) on Δaccuracy, layers saved, and energy saved — every effect size carries an interval, not just a point.
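One plausible reading of "nested over seeds and examples" is sketched below: each resample draws seeds with replacement, then draws sentences with replacement within each drawn seed. That resampling scheme, the percentile interval, and the function name are assumptions for illustration, not a description of the project's code.

```python
import random

def nested_bootstrap_ci(delta_by_seed, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile CI for the mean of per-example deltas.

    delta_by_seed[s][i] is the per-example difference (early-exit minus
    full depth) for seed s on sentence i.
    """
    rng = random.Random(seed)
    n_seeds = len(delta_by_seed)
    means = []
    for _ in range(n_boot):
        seed_means = []
        for _ in range(n_seeds):
            deltas = rng.choice(delta_by_seed)             # resample a seed
            resample = rng.choices(deltas, k=len(deltas))  # resample sentences
            seed_means.append(sum(resample) / len(resample))
        means.append(sum(seed_means) / n_seeds)
    means.sort()
    lo = means[int((alpha / 2) * n_boot)]
    hi = means[max(0, int((1 - alpha / 2) * n_boot) - 1)]
    return lo, hi

# Toy deltas for two seeds (not project data).
lo, hi = nested_bootstrap_ci([[1, -1, 0, 0], [0, 0, -1, 1]], n_boot=500)
print(lo, hi)
```

The same function applies unchanged to layers saved or energy saved, since each is a mean of per-example quantities.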

Reproducibility

Frozen backbone · 5 seeds

Only the heads train; the backbone is fixed. Results are stable across seeds (accuracy spread ≈ 0.2 pp), so the finding is not a lucky initialisation.

Chapter Three

The Curve.

This is the whole project in one plot — accuracy against the compute and the real energy it costs. It is modest on purpose. A clean, dramatic number here would be a red flag; this is the honest shape.

[Figure: Real measurements, M5 MacBook Air. Left: accuracy vs average layers computed. Right: accuracy vs energy per example, from powermetrics logs (mean ± std of 3 captures). Dashed line: the 92.43% full-depth baseline. Each point is an entropy threshold τ.]

Headline · τ = 0.20

−37%

fewer layers and ~37% less measured energy, at 98.8% of baseline accuracy

98.8%

Baseline accuracy held

n.s.

Accuracy change

At the sweet-spot threshold (τ = 0.20) the router computes 37% fewer layers and draws ~37% less measured energy per example, while holding 98.8% of the full-depth baseline accuracy (0.913 vs 0.924). The energy is real — it traces to powermetrics logs, not a layer-count estimate.

Across 5 seeds the accuracy change is not statistically significant: paired-permutation p-values stay ≥ 0.30 and the bootstrap 95% CI on Δaccuracy is [−1.15, +0.69] percentage points — it straddles zero. In plain terms: early-exit saved about a third of the compute and energy with no accuracy loss I can actually measure. That is the win.

Notice that the energy saved never exceeds the layers saved: at τ = 0.20 both round to ~37%, and at looser thresholds energy trails (−63% energy for ~65% fewer layers at τ = 0.50). Embeddings, the layers you still run, and fixed overhead don't vanish, so energy savings grow more slowly than depth savings. If energy had dropped faster than compute, the measurement would be lying.
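A simple cost model makes the bound explicit. Assume, purely for illustration, that energy per example is affine in depth: a fixed cost E_0 ≥ 0 (embeddings, per-inference overhead) plus a per-layer cost e > 0. With L = 12 layers and an average of d̄ layers computed, the fraction of energy saved is the fraction of layers saved scaled by a factor that is at most one. The symbols E_0, e, and d̄ are assumptions introduced here, not fitted quantities.

```latex
E(d) = E_0 + e\,d, \qquad
\frac{E(L) - E(\bar d)}{E(L)}
  = \frac{e\,(L - \bar d)}{E_0 + e\,L}
  = \underbrace{\frac{e\,L}{E_0 + e\,L}}_{\le\, 1}
    \Bigl(1 - \frac{\bar d}{L}\Bigr)
  \;\le\; 1 - \frac{\bar d}{L}
```

Equality holds only when E_0 = 0, so any real fixed cost pushes energy savings below layer savings.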

Push τ harder and the trade-off appears as it should: by τ = 0.50 you save ~65% of layers but accuracy falls to ~86%; exit at the very first layer and accuracy collapses to ~71%. The knee of the curve — not the extremes — is the operating point.

Threshold τ	Accuracy	Avg layers (of 12)	Energy vs full depth
0.05	0.919	9.32	−22%
0.10	0.916	8.63	−28%
0.20	0.913	7.55	−37%
0.35	0.892	5.93	−50%
0.50	0.860	4.25	−63%
0.69	0.713	1.00	−85%

SST-2 validation (872 sentences) · accuracy vs full-depth baseline 0.9243 · energy/example from powermetrics, weighted by the real exit distribution · saved vs 12-layer full depth
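The derived figures quoted in the text follow directly from the table. The short check below recomputes layers saved and accuracy retained from the rows above; the energy column is entered as a positive fraction saved, and the helper name is hypothetical.

```python
FULL_DEPTH = 12
BASELINE_ACC = 0.9243

# (tau, accuracy, avg_layers, energy_saved) copied from the table above.
rows = [
    (0.05, 0.919, 9.32, 0.22),
    (0.10, 0.916, 8.63, 0.28),
    (0.20, 0.913, 7.55, 0.37),
    (0.35, 0.892, 5.93, 0.50),
    (0.50, 0.860, 4.25, 0.63),
    (0.69, 0.713, 1.00, 0.85),
]

def derived(row):
    """Return (tau, layers_saved, energy_saved, accuracy_retained)."""
    tau, acc, avg_layers, energy_saved = row
    layers_saved = 1.0 - avg_layers / FULL_DEPTH
    retained = acc / BASELINE_ACC
    return tau, layers_saved, energy_saved, retained

for tau, layers_saved, energy_saved, retained in map(derived, rows):
    print(f"tau={tau:.2f}  layers saved={layers_saved:.1%}  "
          f"energy saved={energy_saved:.0%}  accuracy retained={retained:.1%}")
```

In every row the energy saved is at or below the layers saved, and the gap widens as exits move toward the first layer.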

Chapter Four

Technical Stack.

Real weights, real hardware, auditable libraries — and nothing between the model and the power meter.

Model + framework

PyTorch · MPS

BERT-base (12 layers) fine-tuned on SST-2, run in PyTorch on the Apple-Silicon MPS backend. Backbone frozen; only the exit heads train.

Data + weights

HuggingFace

HuggingFace Transformers + Datasets for the SST-2 backbone and the GLUE sentiment data. Standard checkpoint, standard split — nothing bespoke to flatter the result.

Statistics

NumPy · SciPy

NumPy and SciPy for the paired permutation test, the nested bootstrap CIs, and the seed-level variance — all on the real measurements.

Visualisation

Matplotlib

Matplotlib for the accuracy-vs-layers and accuracy-vs-energy curve — the single deliverable plot of the project.

Energy — the point

macOS powermetrics

Apple's power sampler, run as root into raw timestamped logs. The integrity rule of the project: no energy number exists without a log behind it.

Hardware

M5 MacBook Air

24 GB, fan-less — a real, power-constrained edge device. The whole reason to measure joules instead of simulating them.

Interpretation

What This Means.

Three things worth stating plainly — including the ways the result is smaller, and more honest, than a headline would prefer.

Finding 01

A modest curve is the success

Cutting ~37% of compute and energy for no measurable accuracy loss is exactly the believable result early-exit should give on a real encoder. It is not dramatic — and that is the point. I set out to measure the honest shape of the trade-off, not to manufacture a big number.

Finding 02

Energy follows compute — sub-linearly

Layers saved and energy saved track each other, with energy never running ahead. The fixed costs — embeddings, the layers still computed, per-inference overhead — are real, so you cannot save energy faster than you save depth. A curve that claimed otherwise would be a measurement artifact, not a result.

Finding 03

The method had to earn trust

The first two energy-measurement methods produced confounded numbers — a bs=1 dispatch artifact, then a batch-packing confound — and I caught and discarded both before trusting the third. Statistics run over 5 seeds with paired tests and bootstrap intervals. Integrity of the measurement was the entire reason to redo v1 on hardware.

Source code · Full methodology

View on GitHub

Honest Limits

Limitations & future work.

What this experiment does not show, stated up front — and where it would have to go next.

Limit 01

Energy is whole-SoC, not BERT alone

powermetrics reports package power for the whole chip, not the model in isolation. I subtract a measured idle floor to get an attributable figure, but it is a believable proxy — not a clean-room measurement of BERT's joules.

Limit 02

One task, one model, validation split

BERT-base on SST-2 sentiment only, evaluated on the validation set because GLUE's test labels are hidden. The curve is honest for this setting; it is not a claim about other tasks, larger models, or held-out test data.

Limit 03

Encoder classification only

Early-exit here applies to a bidirectional encoder producing one label per sentence, the regime where a per-layer off-ramp is clean. Generative decoding is a different and harder problem, which is where the future work below picks up.

Future work

Generative early-exit & the missing-KV problem

Extending this to a decoder LLM hits the missing-KV problem: if an early token exits at layer 5, the deeper layers never compute its key/value entries — yet a later token attending backward needs exactly those missing KV vectors. Resolving that (recompute, propagate, or approximate the skipped states) is an open research frontier, firmly out of scope here, and the natural next step.
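The gap can be made concrete with a toy bookkeeping sketch. It models no attention at all: it only lists which (layer, position) cache entries a later full-depth token would look for and not find. The function name and the two-token decode are hypothetical.

```python
def missing_kv(exit_layer_by_position, num_layers):
    """List the (layer, position) cache entries that were never computed
    because the token at that position exited before reaching that layer."""
    gaps = []
    for position, exit_layer in enumerate(exit_layer_by_position):
        for layer in range(exit_layer + 1, num_layers + 1):
            gaps.append((layer, position))
    return gaps

# Hypothetical decode: token 0 ran all 12 layers, token 1 exited at layer 5.
gaps = missing_kv([12, 5], num_layers=12)
print(gaps)  # layers 6..12 at position 1
```

Any token decoded after position 1 that runs past layer 5 attends to exactly these entries, which is why the skipped states must be recomputed, propagated, or approximated.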

Try the Mechanism

Route it yourself.

An illustrative model of the routing decision on a 12-layer stack — not the benchmark. Set a sentence's difficulty, the entropy threshold τ, and the simulated battery, and watch where it exits versus full depth. The measured numbers are in the curve above.

[Interactive widget. Controls: sentence difficulty (easy and clear → hard / ambiguous); battery budget (simulated; lower battery loosens τ, so exits come earlier); entropy threshold τ (higher τ = exit sooner, less confident). Outputs: full depth (12 / 12 layers, full-depth energy) against JouleRoute-LM early-exit (layers computed out of 12), with energy saved vs static, more queries per budget, task-success-per-joule gain, and system mode.]

Illustrative model of the entropy + battery routing decision — not a measurement. The real, measured result (BERT-base / SST-2, M5): at τ=0.20, ~37% fewer layers and ~37% less energy at 98.8% of baseline accuracy, with no statistically significant accuracy change. See the curve above.