Local large language models, which run on your own hardware rather than on a remote server, normally execute the full depth of their transformer stack no matter how simple the query. A one-sentence factual lookup runs through the same 32 layers as a complex multi-step reasoning task.
That is computationally wasteful. On battery-powered or edge devices — laptops, embedded systems, on-device assistants — it burns power you don't have. The compute budget is finite. Every wasted joule on a trivial query is a joule that could have served a genuinely complex one.
"A question about today's weather should not cost the same as a question about quantum entanglement."
The question that started this project: what if the inference system could route each request through fewer transformer layers based on two variables — how complex the task appears to be, and how much energy budget remains? Could a lightweight adaptive controller deliver meaningful efficiency gains without sacrificing task completion?
This is not a hypothetical. As local LLM deployment grows — Llama on MacBooks, Mistral on Raspberry Pis, language models on phones — power efficiency becomes a first-class constraint. This research tests whether cross-layer routing is a tractable solution.
Three policy variants (a static baseline, a routing-only policy, and the Unified Adaptive policy), one simulator, nine seeds per condition. The experiment is designed to separate the contribution of energy-awareness from that of task-complexity routing alone, and then to measure both against the static baseline.
The experiment runs in a Python simulator built for deterministic reproducibility. The RNG is seeded, so rerunning any of the 9 seeds reproduces exactly the same sequence of events for that seed, which makes the results independently verifiable. Randomness is a tool, not a variable.
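A minimal sketch of that seeding pattern, using only the standard library. The seed values 0 to 8 and the helper name `sample_complexities` are assumptions made for illustration; the simulator's actual structure is not shown in this write-up.

```python
import random

SEEDS = range(9)  # nine seeds per condition; the values 0..8 are an assumption


def sample_complexities(seed: int, n_queries: int = 5) -> list[float]:
    """Draw a reproducible sequence of query-complexity scores in [0, 1)."""
    rng = random.Random(seed)  # private generator, no shared global state
    return [rng.random() for _ in range(n_queries)]


# Re-running with the same seed reproduces the sequence exactly.
first_run = {seed: sample_complexities(seed) for seed in SEEDS}
second_run = {seed: sample_complexities(seed) for seed in SEEDS}
assert first_run == second_run

# Different seeds give different sequences, which is the only source
# of run-to-run variation in a design like this.
assert first_run[0] != first_run[1]
```

Giving each run its own `random.Random` instance, rather than calling `random.seed` globally, keeps one condition's draws from shifting another's.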
Three workload types stress-test the policies under different task distributions. Chat workloads contain a high proportion of simple factual queries — the highest early-exit opportunity. RAG (retrieval-augmented generation) workloads require integration of retrieved context, demanding deeper layer processing. Agent traces involve multi-step planning and tool orchestration — inherently complex, with limited routing headroom.
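One way to picture the three workloads is as different distributions over per-query complexity. The sketch below is illustrative only: the Beta shape parameters, the 0.4 exit threshold, and the function names are all hypothetical, chosen so that chat skews simple and agent traces skew complex.

```python
import random

# Hypothetical Beta(a, b) shapes for per-query complexity in [0, 1].
# These are NOT the study's parameters.
WORKLOAD_SHAPES = {
    "chat": (2.0, 5.0),   # mass near 0: mostly simple factual queries
    "rag": (4.0, 4.0),    # centered: retrieved context needs deeper processing
    "agent": (5.0, 2.0),  # mass near 1: multi-step planning and tool use
}


def generate_workload(kind: str, n_queries: int, seed: int) -> list[float]:
    """Sample a reproducible list of complexity scores for one workload type."""
    rng = random.Random(seed)
    a, b = WORKLOAD_SHAPES[kind]
    return [rng.betavariate(a, b) for _ in range(n_queries)]


def early_exit_share(complexities: list[float], threshold: float = 0.4) -> float:
    """Fraction of queries simple enough to leave the stack early."""
    return sum(c < threshold for c in complexities) / len(complexities)


for kind in WORKLOAD_SHAPES:
    share = early_exit_share(generate_workload(kind, 2000, seed=0))
    print(f"{kind:>5}: {share:.0%} of queries fall below the exit threshold")
```

Under these assumed shapes, the share of early-exit candidates falls from chat to RAG to agent, which is the ordering of routing headroom the text describes.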
Statistical significance is determined by paired permutation tests, which are non-parametric and make no normality assumption. This is deliberate: with n = 9, there are too few observations to check the normality assumption behind a standard t-test. Bootstrap confidence intervals (10,000 resamples) provide effect-size estimates with explicit uncertainty bounds.
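As a rough illustration of the bootstrap step, here is a percentile bootstrap for the mean paired difference. The percentile method is an assumption (the write-up does not say which bootstrap variant was used), and the example differences are placeholders, not the study's data.

```python
import random
import statistics


def bootstrap_ci(diffs, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean paired difference."""
    rng = random.Random(seed)
    n = len(diffs)
    # Resample the per-seed differences with replacement and record each mean.
    means = sorted(
        statistics.fmean(rng.choices(diffs, k=n)) for _ in range(n_resamples)
    )
    lower = means[round((alpha / 2) * n_resamples)]
    upper = means[round((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Placeholder per-seed differences (adaptive minus static), NOT real results.
example_diffs = [0.12, 0.15, 0.14, 0.16, 0.13, 0.15, 0.14, 0.17, 0.13]
low, high = bootstrap_ci(example_diffs)
print(f"95% CI for the mean difference: [{low:.3f}, {high:.3f}]")
```

Resampling the paired differences, rather than each condition separately, preserves the per-seed pairing that the rest of the analysis relies on.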
The Unified Adaptive policy outperforms both the static baseline and the routing-only policy. The effect on chat workloads is striking: a Cohen's d of 7.49 is not a borderline finding.
On chat workloads, the Unified Adaptive policy achieved +14.43% task-success-per-joule over the static baseline. The effect size — Cohen's d = 7.49 — is very large by any standard rubric. Statistical significance: p = 0.005, confirmed by paired permutation test across 9 seeds.
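For readers who want to see how figures of this kind are typically derived, the sketch below computes a percentage gain and a paired Cohen's d from per-seed efficiency values. The paired form of d (mean difference divided by the standard deviation of the differences) is an assumption, since the write-up does not state which variant was used, and the function names are hypothetical.

```python
import statistics


def task_success_per_joule(successes: int, joules: float) -> float:
    """Efficiency metric: completed tasks per joule of energy spent."""
    return successes / joules


def percent_gain(policy: list[float], baseline: list[float]) -> float:
    """Relative change in mean efficiency versus the baseline, in percent."""
    base = statistics.fmean(baseline)
    return 100.0 * (statistics.fmean(policy) - base) / base


def cohens_d_paired(policy: list[float], baseline: list[float]) -> float:
    """Paired Cohen's d: mean of per-seed differences over their sample std dev."""
    diffs = [p - b for p, b in zip(policy, baseline)]
    return statistics.fmean(diffs) / statistics.stdev(diffs)


# Placeholder per-seed efficiencies, NOT the study's measurements.
static = [1.00, 1.02, 0.98, 1.01, 0.99]
adaptive = [1.14, 1.17, 1.12, 1.16, 1.13]
print(f"gain: {percent_gain(adaptive, static):+.2f}%")
print(f"paired d: {cohens_d_paired(adaptive, static):.2f}")
```

In a seeded, paired design the between-seed variation cancels out of the differences, so their standard deviation can be small and the paired d correspondingly large.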
The routing-only policy showed improvement too, but smaller — consistent with the hypothesis that energy-budget awareness provides additional signal that complexity classification alone cannot capture. When the battery is depleting, the controller tightens exit thresholds proactively rather than waiting for queries to register as simple.
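The behaviour described above can be sketched as a small controller that maps two inputs, estimated complexity and remaining energy, to a layer count. Everything below except the 32-layer depth is hypothetical: the class name, the `min_layers` floor, the `energy_pressure` knob, and the linear scaling are illustrative choices, not the project's actual policy.

```python
from dataclasses import dataclass

TOTAL_LAYERS = 32  # stack depth mentioned in the text


@dataclass(frozen=True)
class AdaptiveController:
    min_layers: int = 8           # hypothetical floor to protect output quality
    energy_pressure: float = 0.5  # hypothetical: how strongly low energy trims depth

    def layers_for(self, complexity: float, energy_remaining: float) -> int:
        """Pick a layer count from complexity and remaining energy, both in [0, 1]."""
        complexity = min(max(complexity, 0.0), 1.0)
        energy_remaining = min(max(energy_remaining, 0.0), 1.0)
        # Complexity sets the depth a query would get on a full budget.
        wanted = self.min_layers + complexity * (TOTAL_LAYERS - self.min_layers)
        # A draining budget scales that depth down for every query,
        # not only for the ones already classified as simple.
        scale = 1.0 - self.energy_pressure * (1.0 - energy_remaining)
        return max(self.min_layers, min(TOTAL_LAYERS, round(wanted * scale)))


unified = AdaptiveController()
routing_only = AdaptiveController(energy_pressure=0.0)  # ignores the energy budget

for energy in (1.0, 0.5, 0.1):
    print(energy, unified.layers_for(0.7, energy), routing_only.layers_for(0.7, energy))
```

Setting `energy_pressure` to zero gives a routing-only analogue in this sketch, and a static baseline would simply return `TOTAL_LAYERS` for every query.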
RAG and agent workloads showed smaller gains due to their inherently higher complexity. Fewer queries fall below the early-exit threshold, so the routing mechanism has less headroom to exploit. This is the expected result — not a failure mode, but a structural limit of the workload distribution.
The paired permutation tests confirm the result is not an artifact of seed selection. By randomly swapping the condition labels within each seed's pair of runs and recomputing the test statistic 10,000 times, we verify that the observed difference exceeds what chance variation could produce.
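A common way to implement a paired permutation test is the sign-flip form: swapping the two condition labels within a pair is equivalent to negating that pair's difference. The sketch below assumes that form, a two-sided test on the mean difference, and an add-one correction; none of these details are confirmed by the write-up.

```python
import random
import statistics


def paired_permutation_test(policy, baseline, n_permutations=10_000, seed=0):
    """Two-sided Monte Carlo sign-flip test on the mean paired difference."""
    rng = random.Random(seed)
    diffs = [p - b for p, b in zip(policy, baseline)]
    observed = abs(statistics.fmean(diffs))
    hits = 0
    for _ in range(n_permutations):
        # Swapping labels within a pair negates that pair's difference.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(statistics.fmean(flipped)) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)


# Placeholder per-seed efficiencies for 9 seeds, NOT the study's measurements.
static = [1.00, 1.02, 0.98, 1.01, 0.99, 1.03, 0.97, 1.00, 1.01]
adaptive = [1.14, 1.17, 1.12, 1.16, 1.13, 1.18, 1.11, 1.15, 1.16]
print(f"p = {paired_permutation_test(adaptive, static):.4f}")
```

With 9 pairs there are only 2^9 = 512 distinct sign patterns, so the smallest possible two-sided p-value is 2/512, about 0.0039. Enumerating all 512 patterns exactly is a cheap alternative to random sampling at this sample size.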
| Workload | Efficiency Gain | Cohen's d | p-value |
|---|---|---|---|
| Chat | +14.43% | 7.49 | 0.005 |
| RAG | +6.2% | 3.1 | 0.02 |
| Agent | +3.8% | 1.8 | 0.04 |
Entire simulation written in Python 3.11. No black-box frameworks — every component of the routing logic, statistical testing, and result visualization is custom or from auditable scientific libraries.
Three findings worth stating plainly — and the reasoning behind each.