Independent Research · 2026
Adaptive Cross-Layer Inference Control · Local LLM Energy Efficiency

JouleRoute-LM

Local LLMs waste power running the full model depth on every query. This research tests whether adaptive layer routing can fix that.
+14.43% efficiency gain (chat) · p = 0.005 statistical significance · 9 random seeds tested · 2,000+ simulations run

The Problem.

Local large language models — models that run on your own hardware rather than a server — always execute the full depth of their transformer stack, no matter how simple the query. A one-sentence factual lookup runs through the same 32 layers as a complex multi-step reasoning task.

That is computationally wasteful. On battery-powered or edge devices — laptops, embedded systems, on-device assistants — it burns power you don't have. The compute budget is finite. Every wasted joule on a trivial query is a joule that could have served a genuinely complex one.

"A question about today's weather should not cost the same as a question about quantum entanglement."

The question that started this project: what if the inference system could route each request through fewer transformer layers based on two variables — how complex the task appears to be, and how much energy budget remains? Could a lightweight adaptive controller deliver meaningful efficiency gains without sacrificing task completion?

This is not a hypothetical. As local LLM deployment grows — Llama on MacBooks, Mistral on Raspberry Pis, language models on phones — power efficiency becomes a first-class constraint. This research tests whether cross-layer routing is a tractable solution.

Figure: exit points in the 32-layer stack.
Layers 1–4: early exit for simple queries
Layers 5–12: JouleRoute adaptive exit
Layers 13–24
Layers 25–32: the static baseline always runs to here
Chapter Two

The Approach.

Three policy variants, one simulator, nine seeds per condition. The experiment is designed to isolate the contribution of energy-awareness from task-complexity routing alone — and then measure both against a static baseline.

Policy 01
Static Baseline
No routing
Always runs full model depth, with no routing and no adaptation. This is what every standard local LLM deployment does today: every query, regardless of complexity, traverses all transformer layers. It sets the task-success ceiling and the maximum energy cost per query, so it is the reference point for both other policies.
Policy 02
Routing-Only
Complexity-aware
Routes based on task complexity classification alone, ignoring energy state. An early-exit controller scores each query and exits the stack when complexity falls below a threshold. Smarter than static — but blind to the power budget.
Policy 03
Unified Adaptive
JouleRoute — Complexity + Energy
Routes based on both task complexity and remaining energy budget. The controller adjusts exit depth dynamically: simpler tasks exit earlier, and as the energy budget depletes, exit thresholds tighten further. Both signals inform every decision.
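
The three policies can be sketched as functions that map a query and the remaining energy budget to an exit depth. This is a minimal illustration, not the project's controller: the `Query` type, the linear complexity-to-depth mapping, and the `min_layers` and `max_cut` parameters are all assumptions made for the example. Only the 32-layer depth comes from the text.

```python
from dataclasses import dataclass

TOTAL_LAYERS = 32  # stack depth quoted in the text; everything else is illustrative


@dataclass(frozen=True)
class Query:
    complexity: float  # hypothetical score in [0, 1]; 0 = trivial, 1 = hardest


def static_depth(query: Query, energy_left: float) -> int:
    """Policy 01: always run the full stack."""
    return TOTAL_LAYERS


def routing_only_depth(query: Query, energy_left: float, min_layers: int = 4) -> int:
    """Policy 02: exit depth grows with complexity; energy state is ignored."""
    c = min(1.0, max(0.0, query.complexity))  # clamp the score to [0, 1]
    return min_layers + round((TOTAL_LAYERS - min_layers) * c)


def unified_depth(query: Query, energy_left: float, min_layers: int = 4,
                  max_cut: float = 0.5) -> int:
    """Policy 03: start from the complexity-based depth, then cut it as energy drains."""
    base = routing_only_depth(query, energy_left, min_layers)
    e = min(1.0, max(0.0, energy_left))  # fraction of budget remaining
    # Full budget: no extra cut. Empty budget: depth is scaled by (1 - max_cut).
    scale = 1.0 - max_cut * (1.0 - e)
    return max(min_layers, round(base * scale))
```

With these assumed defaults, a maximally complex query runs all 32 layers on a full budget and 16 layers on an empty one, while the unified policy never exits later than the routing-only policy for the same query.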

The experiment runs in a Python simulator with deterministic reproducibility. Each of the 9 seeds initializes its own RNG, so rerunning a given seed reproduces exactly the same sequence of draws, which makes the results independently verifiable. Randomness is a tool, not a variable.
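
As a rough illustration of per-seed determinism, the sketch below builds a fresh NumPy generator for each seed. The function name `sample_complexities`, the seed values 0 to 8, and the Beta(2, 5) distribution are placeholders chosen for the example, not details taken from the project.

```python
import numpy as np

SEEDS = range(9)  # nine seeds, as in the text; the actual seed values are an assumption


def sample_complexities(seed: int, n_queries: int = 1000) -> np.ndarray:
    """Draw a hypothetical stream of query complexity scores for one seed."""
    rng = np.random.default_rng(seed)  # fresh, isolated generator per seed
    return rng.beta(2.0, 5.0, size=n_queries)  # Beta(2, 5) is an arbitrary placeholder


first = sample_complexities(seed=3)
second = sample_complexities(seed=3)
assert np.array_equal(first, second)  # same seed reproduces the same sequence
assert not np.array_equal(first, sample_complexities(seed=4))  # a different seed differs
```

Creating the generator inside the function, rather than sharing a global one, keeps each seed's sequence independent of how many draws other parts of the simulation happen to make.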

Three workload types stress-test the policies under different task distributions. Chat workloads contain a high proportion of simple factual queries — the highest early-exit opportunity. RAG (retrieval-augmented generation) workloads require integration of retrieved context, demanding deeper layer processing. Agent traces involve multi-step planning and tool orchestration — inherently complex, with limited routing headroom.

Statistical significance is determined by paired permutation tests, which are non-parametric and make no normality assumption. This is deliberate: with n = 9 there are too few samples to check the normality assumption behind a standard t-test. Bootstrap confidence intervals (10,000 resamples) put uncertainty bounds on the estimated effect.
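
A paired permutation test can be sketched as a sign-flip test on the per-seed differences. The version below is one common formulation, assuming the mean paired difference as the test statistic and Monte Carlo resampling. The function name and defaults are illustrative, and the project's own implementation may differ.

```python
import numpy as np


def paired_permutation_test(a, b, n_resamples=10_000, seed=0):
    """Two-sided paired permutation (sign-flip) test on the mean difference."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    observed = abs(diffs.mean())
    rng = np.random.default_rng(seed)
    # Under the null, swapping the two policies within a pair flips that pair's sign.
    signs = rng.choice([-1.0, 1.0], size=(n_resamples, diffs.size))
    null_stats = np.abs((signs * diffs).mean(axis=1))
    # The add-one correction keeps the Monte Carlo p-value strictly positive.
    return (np.sum(null_stats >= observed) + 1) / (n_resamples + 1)
```

With nine pairs there are only 2^9 = 512 distinct sign patterns, so an exact two-sided p-value cannot fall below 2/512 ≈ 0.0039. SciPy's `scipy.stats.permutation_test` with `permutation_type='samples'` covers the same paired design.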

Primary Metric
Task-Success-per-Joule
Composite of task completion rate and energy consumed. Rewards policies that complete tasks efficiently — not just fast, not just frugal.
Significance Test
Paired Permutation Test
Non-parametric. Compares each seed pair across policies. Avoids normality assumption — appropriate for n=9 samples.
Effect Size
Bootstrap CIs · Cohen's d
10,000 resamples per condition. Cohen's d quantifies effect magnitude independent of sample size.
Reproducibility
Seeded RNG · 9 Seeds
Deterministic across runs. 9 seeds provide sufficient variance estimation without overfitting to a single random trajectory.
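
The metric and effect-size tools above can be sketched as follows. Several choices here are assumptions, since the text does not spell them out: the metric is taken as completed tasks divided by joules spent, Cohen's d is computed in its paired form (mean difference over the standard deviation of the differences), and the confidence interval uses the percentile bootstrap.

```python
import numpy as np


def success_per_joule(successes: int, joules: float) -> float:
    """Assumed form of the primary metric: completed tasks per joule spent."""
    return successes / joules


def cohens_d_paired(a, b) -> float:
    """Paired Cohen's d: mean difference over the SD of the differences."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    return float(diffs.mean() / diffs.std(ddof=1))


def bootstrap_ci(a, b, n_resamples=10_000, level=0.95, seed=0):
    """Percentile bootstrap CI for the mean paired difference."""
    diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample the paired differences with replacement, one row per resample.
    idx = rng.integers(0, diffs.size, size=(n_resamples, diffs.size))
    means = diffs[idx].mean(axis=1)
    tail = (1.0 - level) / 2.0
    lo, hi = np.quantile(means, [tail, 1.0 - tail])
    return float(lo), float(hi)
```

Because the paired form divides by the spread of the per-seed differences, a small but very consistent gain can produce a very large d. Other variants of Cohen's d, such as the pooled-SD form, would give different numbers for the same data.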

The Results.

The Unified Adaptive policy outperforms both baselines. The effect on chat workloads is large enough to be striking — a Cohen's d of 7.49 is not a borderline finding.

Key finding (chat workload): +14.43% task-success-per-joule over static baseline · Cohen's d = 7.49 · p = 0.005

On chat workloads, the Unified Adaptive policy achieved +14.43% task-success-per-joule over the static baseline. The effect size — Cohen's d = 7.49 — is very large by any standard rubric. Statistical significance: p = 0.005, confirmed by paired permutation test across 9 seeds.

The routing-only policy showed improvement too, but smaller — consistent with the hypothesis that energy-budget awareness provides additional signal that complexity classification alone cannot capture. When the battery is depleting, the controller tightens exit thresholds proactively rather than waiting for queries to register as simple.

RAG and agent workloads showed smaller gains due to their inherently higher complexity. Fewer queries fall below the early-exit threshold, so the routing mechanism has less headroom to exploit. This is the expected result — not a failure mode, but a structural limit of the workload distribution.

The paired permutation tests indicate that the result is unlikely to be an artifact of seed selection. By randomly swapping the two policy labels within each seed pair and recomputing the test statistic 10,000 times, we verify that the observed difference exceeds what chance variation across seeds would produce.

Workload   Efficiency Gain   Cohen's d   p-value
Chat       +14.43%           7.49        0.005
RAG        +6.2%             3.1         0.02
Agent      +3.8%             1.8         0.04
All results: Unified Adaptive vs. Static Baseline · Paired permutation test · Bootstrap CI 10,000 resamples
Chapter Four

Technical Stack.

The entire simulation is written in Python 3.11. There are no black-box frameworks: every component of the routing logic, statistical testing, and result visualization is either custom or built on auditable scientific libraries.

Core Language
Python 3.11
Simulation core. Deterministic seeded execution. Entire experiment pipeline in one reproducible environment.
Numerics
NumPy
Statistical sampling, array operations, and seeded random number generation across all 9 experimental seeds.
Statistics
SciPy
Hypothesis testing infrastructure. Permutation test implementation and distribution utilities for effect size estimation.
Visualization
Matplotlib
Result visualization — efficiency curves, distribution plots, and policy comparison charts for analysis and reporting.
Simulation Engine
Custom Layer Simulator
Seeded, deterministic transformer layer simulator. Models per-layer energy cost, complexity scoring, and early-exit logic.
Inference
Bootstrap Resampling
10,000-iteration bootstrap for confidence interval estimation. Provides robust uncertainty quantification without parametric assumptions.
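
To make the simulator's role concrete, here is a toy version of per-layer energy accounting with early exit. Everything in it is hypothetical: the flat `joules_per_layer` cost, the rule that the depth a query needs grows linearly with its complexity, and the exponential success penalty controlled by `sharpness` are stand-ins, not the project's actual model.

```python
import numpy as np

TOTAL_LAYERS = 32  # stack depth quoted in the text


def run_query(complexity, exit_layer, rng, joules_per_layer=1.0, sharpness=8.0):
    """Simulate one query and return (succeeded, joules_spent).

    Hypothetical model: the depth a query needs grows with its complexity,
    and the chance of success decays when it exits before that depth.
    """
    needed = 1 + complexity * (TOTAL_LAYERS - 1)
    shortfall = max(0.0, needed - exit_layer) / TOTAL_LAYERS
    p_success = float(np.exp(-sharpness * shortfall))
    succeeded = bool(rng.random() < p_success)
    return succeeded, exit_layer * joules_per_layer


def simulate(exit_layers, complexities, seed):
    """Run a batch of queries and return task-success-per-joule for it."""
    rng = np.random.default_rng(seed)  # one generator per seeded run
    successes, joules = 0, 0.0
    for c, depth in zip(complexities, exit_layers):
        ok, spent = run_query(c, depth, rng)
        successes += ok
        joules += spent
    return successes / joules
```

In this toy model a query that runs the full stack always succeeds, so the static baseline spends the most energy per query while never failing for lack of depth. An early-exit policy trades a small success risk for lower energy.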
Interpretation

What This Means.

Three findings worth stating plainly — and the reasoning behind each.

Finding 01
Energy-aware routing is tractable
The controller adds negligible overhead while delivering measurable efficiency gains on realistic workloads. The routing decision itself — scoring complexity and checking energy state — costs far less than a single transformer layer. The gains are real, and the mechanism is lightweight enough to deploy without degrading inference latency on simple queries.
Finding 02
Simple queries are the low-hanging fruit
Chat workloads benefit most because they contain the highest proportion of queries that don't require full model depth. Factual lookups, greetings, single-turn Q&A — these can exit early without meaningful accuracy loss. The implication: workload characterization matters. Deploying JouleRoute on an agent-heavy use case will yield smaller gains than deploying it on a conversational assistant.
Finding 03
Statistical methodology matters
Using paired permutation tests (non-parametric) and bootstrap CIs rather than standard t-tests avoids the normality assumption that would be inappropriate for a 9-sample comparison. This is not academic caution — it's the difference between a result that holds up and one that doesn't. Small-n experiments demand conservative testing. The methodology here was chosen before the data was collected.
+14.43% peak efficiency gain (chat) · d = 7.49 (Cohen's d, very large effect) · 10k× bootstrap resamples per condition
Source code · Full methodology
View on GitHub
Try the Simulator

Route it yourself.

Adjust task complexity and energy budget. Watch which transformer layers JouleRoute activates versus the static baseline — and see the efficiency delta update live.

Controls:
Task Complexity: from factual lookup (low, chat) to multi-step agent reasoning
Thermal Budget: battery and thermal headroom available
KV-Cache Fill: prefix reuse from shared context
Model Cascade: route to a 3B model when the thermal budget is exhausted

Readouts:
Static Baseline: all 32 layers always · 32 / 32 layers · 100% energy per query
JouleRoute-LM: adaptive exit with KV reuse · layers active out of 32 · energy saved vs static · more queries per budget · task-success-per-joule gain · system mode (for example, ADAPTIVE)
Research finding (chat workloads, n=9 seeds): +14.43% task-success-per-joule vs static baseline · Cohen's d = 7.49 · p = 0.005 · paired permutation test · bootstrap CI 10,000× · KV-cache + cascade not yet included in published results
