Adaptive Compute: Agents That Think Harder Only When It Helps
Today we shipped adaptive compute and finished closing the last four gaps between our agent runtime and the CoALA paper: dynamic search breadth and depth, reasoning-scored memory, and learning as a choice.
Most agent loops spend the same amount of thought on every step. They propose a fixed K candidate actions, run for a fixed N steps, and commit the same compute whether the next move is obvious or genuinely hard. The paper we model our runtime on calls this out directly: "most LLM reasoning methods fix a search budget." It's wasteful on the easy steps and underpowered on the hard ones.
Today we shipped adaptive compute, a runtime that spends more where it helps and less where it doesn't, and in doing so finished closing the last four open items in our running validation of the cognitive core against CoALA, the Cognitive Architectures for Language Agents paper. This is a ship log for that work: two commits today covering three gaps, with the fourth shipped last week, all green.
If you haven't read it yet, From Chatbot to Cognitive Architecture is the backstory — why we hold our agent runtime to a published blueprint and grade ourselves against it. This post is what happens when you take the remaining red marks on that scorecard and close them one by one.
The scorecard had four reds
We keep a COALA_VALIDATION.md doc that maps our runtime to the paper section by section, cites the implementing file:line, and is honest about the gaps. Four items were still open — and tellingly, they're exactly the things the paper itself flags as understudied, risky, or frontier:
- No procedural learning: an agent that can propose edits to its own persona/skills.
- Learning isn't a proposable action: memory writes happened on a fixed schedule, never as a deliberate choice.
- importance was stubbed: the reasoning-based signal in our recall ranking was inert.
- Fixed search budget: no metareasoning, no adaptive compute.
Gap #1 shipped last week (see Self-Improving Agents That Can't Go Rogue). Today's two commits closed #2, #3, and #4. Here's each.
Gap #3 — reasoning-scored importance
CoALA cites Generative Agents (Park et al., §4.3) for a specific recall function: rank memories by recency (rule-based), importance (reasoning-based), and relevance (embedding-based). Our recall already computed exactly that blend:
score = α·similarity + β·recency + γ·importance   // α = 0.7, β = 0.2, γ = 0.1
The problem: importance was hardcoded to 0.5 for every row. So the γ·importance term was a constant — it contributed nothing to ranking. We had the function the paper names, but one of its three signals was dead.
The fix threads a real importance score through the write path. The post-call/turn extractor already makes one LLM call to reflect on a transcript and pull out durable notes; we now ask it to rate each new semantic note 1–10 (noteImportance) in that same call — no extra round-trip. That score flows MemoryExtractorService → MemoryReconciler.remember → MemoryService.write → the row's importance property. Recall ranking now weights a live signal: a note the model judged a 9 ("the customer's contract renews next month and they're a flight risk") outranks a passing 3 ("mentioned they like cricket"), all else equal.
It's a small diff with an outsized effect — the difference between a recall function that looks like the paper's and one that behaves like it.
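To make the effect of a live importance signal concrete, here is a minimal sketch of the blended recall score, written in Python purely for illustration. The weights come from the formula above; the `MemoryRow` shape, the `normalized_importance` helper, and the linear mapping of the 1–10 rating onto [0, 1] are assumptions, not the runtime's actual code.

```python
from dataclasses import dataclass

# Weights from the post's blend: similarity / recency / importance.
ALPHA, BETA, GAMMA = 0.7, 0.2, 0.1


@dataclass
class MemoryRow:
    """Hypothetical row shape; field names are illustrative only."""
    text: str
    similarity: float     # embedding-based relevance, assumed in [0, 1]
    recency: float        # rule-based recency, assumed in [0, 1]
    note_importance: int  # 1-10 rating produced by the extractor's LLM call


def normalized_importance(rating: int) -> float:
    """Map a 1-10 rating onto [0, 1]. The linear mapping is an assumption."""
    clamped = max(1, min(10, rating))
    return (clamped - 1) / 9


def recall_score(row: MemoryRow) -> float:
    """Blend the three signals with the weights given in the post."""
    return (ALPHA * row.similarity
            + BETA * row.recency
            + GAMMA * normalized_importance(row.note_importance))


def rank(rows: list[MemoryRow]) -> list[MemoryRow]:
    """Return rows ordered from highest to lowest recall score."""
    return sorted(rows, key=recall_score, reverse=True)
```

With similarity and recency held equal, a note rated 9 sorts ahead of a note rated 3. When every row carried the same hardcoded importance, that third term added the same constant to every score and could not change the order.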
Gap #2 — learning as a choice, not a schedule
Our autonomous loop always wrote one episodic row per step — a faithful trace of what happened. But that's a fixed schedule. The paper's §7 ideal is stronger: "Learning could be proposed as a possible action during regular decision-making." In other words, an agent should be able to decide to commit something to long-term memory, the same way it decides to call a tool.
So we added a remember action. A new SelfMemoryToolset exposes it as a first-class LEARNING action in the autonomous action space, wired into AgentRuntime.runAutonomous alongside the other tools the planner can choose from. When the agent concludes something worth keeping — a durable fact about the world or about itself — it can propose remember as its next move, and the write goes through the same MemoryReconciler as everything else (so it's deduplicated and importance-scored, not a blind append).
This matters because it makes learning visible to the planner. Writing to memory is no longer something the harness does on the agent's behalf every step; it's an option the agent weighs against acting on the world. That's the paper's "learning on par with grounding," and it's the seam that later lets a planner trade off "act now" vs. "record this for next time."
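A rough sketch of what "remember as a proposable action" can look like, again in illustrative Python. The `Reconciler` class is a hypothetical stand-in for MemoryReconciler, the `search_docs` tool is a placeholder, and exact-text deduplication is a deliberate simplification; the post does not describe how the real reconciler matches duplicates.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Reconciler:
    """Hypothetical stand-in for MemoryReconciler: dedupes before writing."""
    rows: dict[str, int] = field(default_factory=dict)  # normalized text -> importance

    def remember(self, text: str, importance: int) -> bool:
        """Store a note; return True if a new row was written."""
        key = " ".join(text.lower().split())
        if key in self.rows:
            # Duplicate: keep the higher importance instead of appending again.
            self.rows[key] = max(self.rows[key], importance)
            return False
        self.rows[key] = importance
        return True


def build_action_space(reconciler: Reconciler) -> dict[str, Callable[..., object]]:
    """Expose `remember` next to ordinary tools so a planner can select it."""
    return {
        "search_docs": lambda query: f"results for {query}",  # placeholder grounding tool
        "remember": reconciler.remember,                      # learning action
    }


def execute(actions: dict[str, Callable[..., object]], name: str, **kwargs) -> object:
    """Dispatch the action the planner selected."""
    return actions[name](**kwargs)
```

The point of the shape is that `remember` sits in the same dictionary as every other tool, so whatever selects the next action can weigh it against acting on the world.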
Gap #4 — adaptive compute, the headline
This is the big one, and it lands on three axes. The goal: stop spending a flat budget regardless of difficulty.
Dynamic breadth — widen the search only when unsure
Our AutonomousDriver plans by propose → evaluate → select: one LLM call proposes candidate actions, a second scores each 0.0–1.0 for advancing the objective, and we take the argmax. Previously it always proposed K = 3.
Now K is dynamic. The driver proposes K_base candidates, looks at the top score, and only widens — up to K_max — when the best candidate scores below a confidence threshold. A confident step (top score ≥ decision-confidence) commits immediately on three candidates. An ambiguous step earns more proposals, bounded to a few widen rounds with a distinct-candidate merge so we don't re-score duplicates.
propose K_base candidates → score
while top_score < decision-confidence and rounds < MAX_WIDEN_ROUNDS:
    propose more (toward K_max), merge distinct, re-score
select argmax
Easy decisions stay cheap; hard, ambiguous decisions get a wider net — which is precisely where extra candidates pay off.
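The widening loop can be sketched as runnable Python under a few assumptions: `propose` and `score` are hypothetical callables standing in for the two LLM calls, `MAX_WIDEN_ROUNDS = 2` is a guess at "a few", and this version scores only newly seen candidates rather than re-scoring the whole set.

```python
from typing import Callable

MAX_WIDEN_ROUNDS = 2  # assumed value; the post only says "a few" rounds


def decide(
    propose: Callable[[int], list[str]],
    score: Callable[[str], float],
    k_base: int = 3,
    k_max: int = 6,
    confidence: float = 0.7,
) -> tuple[str, float, int]:
    """Return (chosen action, its score, number of candidates scored)."""
    scored: dict[str, float] = {}

    def add(candidates: list[str]) -> None:
        for candidate in candidates:
            if len(scored) >= k_max:
                break
            if candidate not in scored:             # distinct-candidate merge
                scored[candidate] = score(candidate)  # score new candidates only

    add(propose(k_base))
    if not scored:
        raise ValueError("propose() returned no candidates")

    rounds = 0
    # Widen only while the best candidate is below the confidence threshold.
    while (max(scored.values()) < confidence
           and rounds < MAX_WIDEN_ROUNDS
           and len(scored) < k_max):
        add(propose(k_max - len(scored)))
        rounds += 1

    best = max(scored, key=scored.get)
    return best, scored[best], len(scored)
```

A confident first round returns after scoring `k_base` candidates and never calls `propose` again. An ambiguous round keeps asking for more, but duplicates are dropped before scoring and the total never exceeds `k_max`.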
Dynamic depth — extend the budget while productive
Step count adapts too, in both directions:
- Early-stop when stuck. If the agent repeats itself or keeps failing the same action (stuck-threshold consecutive no-progress steps), the loop ends instead of burning the rest of its budget thrashing.
- Escalate when productive. Conversely, if the agent is making progress and approaches its step budget, AgentRuntime extends the budget by the base amount each time, capped at max-steps-cap, over at most max-escalations escalations. A run that's clearly converging on a hard task isn't cut off at an arbitrary line; e.g. a base-2 budget escalates 2 → 4 → 6 while it keeps making headway.
Together, early-stop and escalation make the budget breathe: shrink on a dead end, grow on a productive thread.
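Both directions fit in one small loop. The sketch below is illustrative Python, not the AgentRuntime implementation: `step` is a hypothetical callable that reports whether a step made progress and whether the task is done, "approaches its step budget" is read as "has just used the last budgeted step", and "making progress" is read as "the most recent step was productive".

```python
from typing import Callable


def run_autonomous(
    step: Callable[[int], tuple[bool, bool]],
    base_steps: int,
    stuck_threshold: int = 2,
    max_steps_cap: int = 16,
    max_escalations: int = 2,
) -> tuple[str, int, int]:
    """Drive `step` until done, stuck, or out of budget.

    `step(i)` returns (made_progress, done).
    Returns (outcome, steps_used, final_budget).
    """
    budget = min(base_steps, max_steps_cap)
    escalations = 0
    no_progress = 0
    steps = 0
    while steps < budget:
        made_progress, done = step(steps)
        steps += 1
        if done:
            return "done", steps, budget
        # Early-stop: count consecutive steps without progress.
        no_progress = 0 if made_progress else no_progress + 1
        if no_progress >= stuck_threshold:
            return "stuck", steps, budget
        # Escalate: budget is used up, the run is still productive,
        # and both the escalation count and the hard cap allow it.
        if (steps == budget and made_progress
                and escalations < max_escalations
                and budget < max_steps_cap):
            budget = min(budget + base_steps, max_steps_cap)
            escalations += 1
    return "budget_exhausted", steps, budget
```

Under these assumptions a base budget of 2 with steady progress grows 2 → 4 → 6 and then stops, matching the example above, while a base budget of 8 escalates once to 16 and is then held by the cap. Two unproductive steps in a row end the run regardless of how much budget remains.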
The knobs
Everything is config, with sensible defaults:
| Property | Default | Effect |
|---|---|---|
| matrix.runtime.decision-candidates | 3 | K_base candidates per DECIDE step |
| matrix.runtime.decision-candidates-max | 6 | K_max: widen up to this when unsure |
| matrix.runtime.decision-confidence | 0.7 | top score below this triggers a wider search |
| matrix.runtime.stuck-threshold | 2 | consecutive no-progress steps → early-stop |
| matrix.runtime.max-steps-cap | 16 | hard ceiling on the escalated step budget |
| matrix.runtime.max-escalations | 2 | budget escalations allowed while productive |
What this is — and what it isn't
Be precise about the claim: this is heuristic metareasoning. Threshold rules — a confidence cutoff, a stuck counter, a productivity check — decide how much compute to spend. That's a real improvement over a fixed budget, and it closes the gap the paper names. But it isn't the frontier's end state.
The genuinely open problem, which the paper frames as future work, is learned metareasoning: a model that predicts how much compute a given step actually needs, rather than firing on hand-tuned thresholds. That's where this goes next. We're honest about the line in the validation doc: all four shipped, only learned metareasoning remains frontier.
Why grade yourself against a paper at all
It would be easy to ship these as four unrelated tweaks. Holding them to a blueprint is what turns "we added some heuristics" into "we closed the metareasoning gap, the reasoning-scored-importance gap, and the learning-as-action gap — here's the section, here's the file:line, here are the 27 green tests." A scorecard you can be embarrassed by is the thing that gets the last 10% done.
The verdict line in our doc now reads: the items the paper flags as understudied, risky, or frontier "have since been built (all four), each with the paper's safety caveats designed in." That's a good day.
Takeaway: an agent runtime shouldn't spend a flat budget on a variable world. Score memories by reasoning, let the planner choose to learn, and make the search breadth and depth adapt to how hard each step actually is — cheap when it's easy, deep when it's not.
Want to see the cognitive core in action? Create a workspace and run an autonomous task, or read One Decision Cycle for Interactive and Autonomous Agents for how the same loop powers a live phone call and a background job.