Cognitive Core & Autonomy·June 15, 2026·10 min read

Self-Improving Agents That Can't Go Rogue: Propose, Approve, Apply

Letting an agent edit its own behaviour is powerful and dangerous. Matrix lets it propose changes to its procedural memory — a human approves before anything changes.

By Matrix Team

An agent that can rewrite its own instructions is the dream and the nightmare in the same sentence. The dream: an agent notices it keeps fumbling the same edge case, learns the fix, and gets better without you babysitting a prompt file. The nightmare: that same write path is the single most dangerous capability you can hand a language model. Give it a tool that edits its own persona and one confused — or jailbroken — turn can quietly subvert everything you intended it to do.

The CoALA paper (Cognitive Architectures for Language Agents) is blunt about exactly this. Writing to procedural memory — the agent's own "code" — is, in their words:

"significantly riskier than writing to episodic or semantic memory, as it can easily introduce bugs or allow an agent to subvert its designers' intentions." (§4.1)

Most "self-improving agent" demos skip that sentence. Matrix builds around it. Self-improving agents here can propose changes to their own behaviour — and a human approves before anything changes. The safety posture isn't bolted on after the fact; it's the shape of the system. This post is about that shape.

The principle: the agent never holds the pen

Here is the design decision everything else hangs on, stated as principle #1 in the spec:

No agent-callable mutation of code. The only agent-facing tool writes a PROPOSED record; there is no tool that edits an Agent/Skill entity. The apply path is an operator-only endpoint. This is a structural guarantee, not an RBAC policy — even a fully compromised agent cannot self-edit.

Read that twice. The point isn't "we have a permission that blocks self-edits." Permissions can be misconfigured, escalated, or bypassed by a clever prompt. The point is that the mutation tool does not exist in the agent's tool surface at all. The most an agent can do — autonomous, interactive, jailbroken, or perfectly behaved — is leave a note that says "I'd like to change this." Applying that note is a separate, operator-only code path the model can never reach.

That's structural safety versus policy safety. RBAC is still there as defense-in-depth, but it's the second lock, not the first. The first lock is that there's no door.

The generic engine: propose → approve → apply

Rather than build a one-off "self-edit" feature, Matrix added a domain-agnostic primitive: ChangeRequestService. It does exactly three things, for any entity in the graph:

Propose a mutation — UPDATE, CREATE, or DELETE — against a target entity.
Route a HUMAN approval task.
Apply it via EntityManager, but only when an operator approves.

Because everything in Matrix is an entity (agents, skills, tools, knowledge — all EntityNode rows), the proposal is an entity too. A ChangeRequest row records what it wants to do:

| Field | Meaning |
| --- | --- |
| `op` | `UPDATE` / `CREATE` / `DELETE` |
| `targetType` / `targetId` | the entity to mutate |
| `patchJson` | the proposed field values |
| `beforeJson` | snapshot of the touched fields: the diff baseline and the apply-time stale-guard |
| `status` | `PROPOSED` → `APPLIED` / `REJECTED` / `SUPERSEDED` |
| `proposerKind` / `proposerId` | `AGENT` or `OPERATOR`, from the tenant context |
| `summary` / `rationale` | what shows up in the review queue |
| `supersedes` / `supersededAt` | links an applied update to the version it replaced, so it's revertible |

The payoff of going generic: every future "agent proposes an entity change" feature inherits the same approval pipeline, audit trail, and diff UI for free. The engine doesn't care whether it's editing a persona or a tool description. It moves an entity into a proposed state and waits for a human.
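Here is a minimal Python sketch of the record described in the table above. The Python names are hypothetical, because the post describes the fields but not Matrix's actual classes. The transition rule is an assumption read literally from the `status` row: only a `PROPOSED` request can move, and `APPLIED`, `REJECTED` and `SUPERSEDED` are terminal. The `diff()` method shows how `beforeJson` and `patchJson` together give a reviewer a per-field baseline.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Any, Dict, Tuple


class Op(Enum):
    UPDATE = "UPDATE"
    CREATE = "CREATE"
    DELETE = "DELETE"


class Status(Enum):
    PROPOSED = "PROPOSED"
    APPLIED = "APPLIED"
    REJECTED = "REJECTED"
    SUPERSEDED = "SUPERSEDED"


# Assumed lifecycle: only PROPOSED can move; the other three are terminal.
ALLOWED_TRANSITIONS = {
    Status.PROPOSED: {Status.APPLIED, Status.REJECTED, Status.SUPERSEDED},
}


@dataclass
class ChangeRequest:
    op: Op
    target_type: str
    target_id: str
    patch: Dict[str, Any]    # patchJson: proposed field values
    before: Dict[str, Any]   # beforeJson: snapshot of the touched fields
    proposer_kind: str       # "AGENT" or "OPERATOR"
    proposer_id: str
    summary: str
    rationale: str
    status: Status = Status.PROPOSED

    def transition(self, new_status: Status) -> None:
        if new_status not in ALLOWED_TRANSITIONS.get(self.status, set()):
            raise ValueError(f"illegal transition {self.status.value} -> {new_status.value}")
        self.status = new_status

    def diff(self) -> Dict[str, Tuple[Any, Any]]:
        """Per-field (before, after) pairs, the raw material for a review diff."""
        return {name: (self.before.get(name), value) for name, value in self.patch.items()}
```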

Two producers ride the engine

Today, two scoped tools sit on top of ChangeRequestService. Both are classified as LEARNING actions in the cognitive core's typed action space, so they render under "Learning" in the action-space prompt block.
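A short rendering function shows what this grouping could look like. The post names only `LEARNING`. The other `ActionType` members below borrow CoALA's grounding, retrieval and reasoning categories, which is an assumption. The function name and the output format are also hypothetical.

```python
from collections import defaultdict
from enum import Enum
from typing import Dict, List


class ActionType(Enum):
    # Only LEARNING is named in the post; the rest follow CoALA's categories.
    GROUNDING = "Grounding"
    RETRIEVAL = "Retrieval"
    REASONING = "Reasoning"
    LEARNING = "Learning"


def render_action_space(tools: Dict[str, ActionType]) -> str:
    """Group tool names under a heading per action type, in enum order."""
    groups: Dict[ActionType, List[str]] = defaultdict(list)
    for name, action_type in tools.items():
        groups[action_type].append(name)
    lines: List[str] = []
    for action_type in ActionType:
        if groups[action_type]:
            lines.append(f"{action_type.value}:")
            lines.extend(f"- {name}" for name in sorted(groups[action_type]))
    return "\n".join(lines)
```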

`propose_procedural_edit` — persona and skill text

This is the headline tool: an agent proposing a change to its own procedural memory — its persona (Agent.systemPrompt) or an attached skill's augmentor (Skill.systemPromptBlock). The payload is deliberately structured to keep the blast radius small:

```jsonc
// tool: propose_procedural_edit  (ActionType.LEARNING)
{
  "target":       "self",            // "self" → this agent's systemPrompt; or "skill"
  "skillKey":     "string",          // required iff target == "skill"; must be attached to this agent
  "mode":         "append_section",  // default; or "replace" (flagged louder in the diff)
  "section":      "Learned rules",   // append_section: heading for the new block
  "proposedText": "When a caller asks for a refund older than 90 days, cite policy clause 4.2 before declining.",
  "summary":      "Add refund-window rule for >90 day requests",
  "rationale":    "Got this wrong twice this week; callers expected a reason, not a flat no."
}
```

Note the defaults. Append is preferred over replace — the common case ("I learned X, add a rule") is a bounded append_section, not a full-prompt rewrite. A replace is allowed, but it's flagged more loudly in the review diff because its blast radius is the whole prompt. The handler validates before it writes anything: the new text must be non-blank, must fit under a max-character cap (default 12000), and is rejected if it trips a secret-pattern denylist (api_key, BEGIN ... PRIVATE KEY, and similar). An agent can't smuggle a credential into its own persona.
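The pre-write checks can be sketched as a small validator. This illustrates the idea under assumptions and is not Matrix's handler. The name `validate_proposed_text` is hypothetical. The 12000-character cap is the default quoted above. The two regular expressions stand in for a real secret denylist, which the post describes only as `api_key`, private-key headers, and similar patterns.

```python
import re
from typing import List

MAX_CHARS = 12000  # the default cap described above

# Hypothetical stand-ins for a real secret-pattern denylist.
SECRET_PATTERNS = [
    re.compile(r"api[_-]?key", re.IGNORECASE),
    re.compile(r"-----BEGIN [A-Z ]*PRIVATE KEY-----"),
]


def validate_proposed_text(text: str, mode: str = "append_section") -> List[str]:
    """Return every problem found; an empty list means the proposal may be written."""
    problems: List[str] = []
    text = text or ""
    if mode not in ("append_section", "replace"):
        problems.append(f"unknown mode: {mode!r}")
    if not text.strip():
        problems.append("proposedText must be non-blank")
    if len(text) > MAX_CHARS:
        problems.append(f"proposedText exceeds {MAX_CHARS} characters")
    for pattern in SECRET_PATTERNS:
        if pattern.search(text):
            problems.append(f"proposedText matches secret pattern {pattern.pattern!r}")
    return problems
```

The sketch collects every problem instead of stopping at the first one. That way a single tool error can tell the model everything it needs to fix.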

Scoping is enforced at write time: target: "self" resolves to the running agent; target: "skill" only resolves if that skill is currently attached to this agent, in this org. Cross-agent edits are rejected outright.
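Write-time scoping can be built so that a cross-agent edit cannot even be requested. In this sketch, the names `Agent`, `Skill` and `resolve_target` are hypothetical, and in-memory objects stand in for entity lookups. The only agent the resolver ever sees is the running one. A skill resolves only if it is attached to that agent and belongs to the same org.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass
class Skill:
    key: str
    org_id: str
    system_prompt_block: str


@dataclass
class Agent:
    id: str
    org_id: str
    system_prompt: str
    attached_skills: Dict[str, Skill] = field(default_factory=dict)


def resolve_target(running_agent: Agent, target: str,
                   skill_key: Optional[str] = None) -> Tuple[str, str, str]:
    """Return (targetType, targetId, field) for a propose_procedural_edit call.

    There is deliberately no parameter for another agent's id: the caller can
    only ever name "self" or one of its own attached skills.
    """
    if target == "self":
        if skill_key is not None:
            raise ValueError("skillKey is only valid when target == 'skill'")
        return ("Agent", running_agent.id, "systemPrompt")
    if target == "skill":
        if not skill_key:
            raise ValueError("skillKey is required when target == 'skill'")
        skill = running_agent.attached_skills.get(skill_key)
        if skill is None or skill.org_id != running_agent.org_id:
            raise PermissionError(f"skill {skill_key!r} is not attached to this agent in this org")
        return ("Skill", skill.key, "systemPromptBlock")
    raise ValueError(f"unknown target: {target!r}")
```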

`propose_tool_change` — a tool's description, nothing more

The second producer lets an agent suggest improving an attached tool's description or name — the model-facing text that helps it call the tool correctly. Critically, it can never touch wiring or secrets: not the URL, not headers, not auth, not the parameter schema. Just the natural-language fields a model reads. And because it rides the same engine, a tool change request renders in the exact same review drawer as a persona edit — zero new frontend.
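The same idea applies to `propose_tool_change`. One way to express "descriptive text only" is an allowlist rather than a denylist. With an allowlist, wiring fields such as the URL, headers, auth, and parameter schema are rejected by default, and so is any field added later. The function name is hypothetical. The sketch assumes the model-facing fields are called `name` and `description`.

```python
from typing import Dict

# Model-facing text only; field names assumed for illustration.
TOOL_PROPOSABLE_FIELDS = frozenset({"name", "description"})


def validate_tool_patch(patch: Dict[str, object]) -> Dict[str, str]:
    """Accept a propose_tool_change patch only if it touches allowlisted text fields."""
    if not patch:
        raise ValueError("patch must change at least one field")
    forbidden = sorted(set(patch) - TOOL_PROPOSABLE_FIELDS)
    if forbidden:
        raise PermissionError("tool proposals may not touch: " + ", ".join(forbidden))
    for name, value in patch.items():
        if not isinstance(value, str) or not value.strip():
            raise ValueError(f"{name} must be non-blank text")
    return {name: str(value) for name, value in patch.items()}
```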

Review where work already lives

There's no bespoke admin screen for any of this. When a proposal is created, ChangeRequestService dispatches a HUMAN approval task through the existing Task framework. It lands in the Tasks tab — the same inbox operators already use — as a change_request_approval step.

The operator clicks Review, which opens a per-field before → after diff drawer: current text on one side, proposed text on the other, added lines highlighted. They approve or reject in one call:

```http
POST /api/orgs/{slug}/tasks/runs/{runId}/complete
{ "result": { "decision": "approve" } }
# or
{ "result": { "decision": "reject", "reason": "Too broad: scope to the EU region only." } }
```

On approve, a generic HumanStepResolver applies the mutation under the operator's principal — never the proposer's — inside the completion transaction. On reject, the request is closed with the reason. Either way, the model's role ended the moment it left the note.
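The resolver's contract fits in a few lines. This is a hypothetical Python stand-in for `HumanStepResolver`. It assumes the request body shown above and an injected `apply_change`/`close_request` pair, and it does not model the completion transaction. The point it shows is that the acting principal is always the operator who completed the task, never the agent that made the proposal.

```python
from dataclasses import dataclass
from typing import Any, Callable, Dict


@dataclass(frozen=True)
class Principal:
    kind: str  # "AGENT" or "OPERATOR"
    id: str


def resolve_change_request(body: Dict[str, Any], actor: Principal,
                           apply_change: Callable[..., Any],
                           close_request: Callable[..., Any]) -> Any:
    """Hypothetical resolver for a completed change_request_approval step."""
    if actor.kind != "OPERATOR":
        raise PermissionError("only an operator may resolve a change request")
    result = body.get("result") or {}
    decision = result.get("decision")
    if decision == "approve":
        # The mutation runs under the operator's principal, never the proposer's.
        return apply_change(actor=actor)
    if decision == "reject":
        return close_request(actor=actor, reason=result.get("reason", ""))
    raise ValueError(f"unknown decision: {decision!r}")
```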

The safety details that matter

Structural safety is the foundation, but the apply path is hardened too:

Supersede, never delete. Applying an UPDATE writes the new value and links the prior version via supersedes. Every applied change is reversible to its exact prior text. Nothing is destroyed; history is a forward-only chain.
An optimistic stale-guard. At apply time, every field in the proposal's beforeJson snapshot must still hold its snapshotted value on the live target. If someone edited the persona in the meantime, the apply throws, the transaction rolls back, and the run stays waiting for a human. No silent clobbering of a change you made after the agent proposed.
Everything is audited. Every approve and reject writes an AuditEvent — who, when, what, and the diff. The decision is as traceable as the proposal.
Append-preferred, denylisted, size-capped. Covered above — small blast radius by default, secrets blocked, length bounded.
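The stale-guard and the supersede chain can be combined in one small class. This sketch rests on stated assumptions. Versions live in an in-memory list inside the hypothetical `VersionedEntity`. The guard compares each `beforeJson` field against the live value before writing anything, so a stale proposal fails with no partial write. A revert is itself a new forward version, not a deletion.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List, Optional


class StaleProposalError(Exception):
    """The live entity changed after the proposal's beforeJson snapshot was taken."""


@dataclass
class Version:
    number: int
    fields: Dict[str, Any]
    supersedes: Optional[int]  # the version this one replaced


@dataclass
class VersionedEntity:
    versions: List[Version] = field(default_factory=list)

    @property
    def live(self) -> Dict[str, Any]:
        return self.versions[-1].fields

    def _append(self, fields: Dict[str, Any]) -> Version:
        prior = self.versions[-1]
        new = Version(prior.number + 1, fields, supersedes=prior.number)
        self.versions.append(new)  # forward-only: earlier versions are never removed
        return new

    def apply_update(self, before: Dict[str, Any], patch: Dict[str, Any]) -> Version:
        # Optimistic stale-guard: check every snapshotted field before writing anything.
        for name, snapshot in before.items():
            if self.live.get(name) != snapshot:
                raise StaleProposalError(f"{name} changed since the proposal was made")
        return self._append({**self.live, **patch})

    def revert_to(self, number: int) -> Version:
        """Restore an earlier version's exact fields as a new forward version."""
        for version in self.versions:
            if version.number == number:
                return self._append(dict(version.fields))
        raise KeyError(number)
```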

On top of those, the platform adds quality guardrails so the review queue stays sane: a per-proposer cooldown, an open-proposal cap, and an embedding-based near-duplicate guard so an agent can't spam ten rewordings of the same idea.
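The three guardrails can be sketched as one check that runs before a proposal is recorded. Several details are assumptions. The thresholds are hypothetical. Embeddings are passed in as plain float vectors, and producing them is out of scope. Near-duplicate detection compares against the same proposer's open proposals using cosine similarity. Removing a proposal from the open set once it is approved or rejected is omitted for brevity.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


@dataclass
class ProposalGuard:
    cooldown_s: float = 600.0     # hypothetical defaults
    max_open: int = 5
    dup_threshold: float = 0.92
    last_at: Dict[str, float] = field(default_factory=dict)
    open_embeddings: Dict[str, List[List[float]]] = field(default_factory=dict)

    def check(self, proposer: str, embedding: List[float], now: float) -> Optional[str]:
        """Return a refusal reason, or None if the proposal may be recorded."""
        last = self.last_at.get(proposer)
        if last is not None and now - last < self.cooldown_s:
            return "cooldown"
        open_ = self.open_embeddings.get(proposer, [])
        if len(open_) >= self.max_open:
            return "open-proposal cap reached"
        if any(cosine(embedding, other) >= self.dup_threshold for other in open_):
            return "near-duplicate of an open proposal"
        return None

    def record(self, proposer: str, embedding: List[float], now: float) -> None:
        self.last_at[proposer] = now
        self.open_embeddings.setdefault(proposer, []).append(embedding)
```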

Double-gated, and it ships dark

None of this is on by default. Self-improvement is double-gated, and both gates default to off:

Agent.selfImproveEnabled — a per-agent opt-in. Off unless you deliberately turn it on for a specific agent.
matrix.runtime.procedural-learning.enabled — a platform kill-switch for propose_procedural_edit.
matrix.runtime.tool-proposals.enabled — the same, for propose_tool_change.

If either the per-agent flag or the relevant platform flag is off, the tool simply isn't in the agent's surface. You opt into self-improvement explicitly, one agent at a time, with a global override you can flip off in one move. The feature exists in the codebase, fully wired, shipping dark — exactly the posture CoALA argues for in §6: take "the minimal action space necessary," and don't hand an agent a risky capability it doesn't need.
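The gating rule is a conjunction: a tool appears only when the agent has opted in and that tool's platform flag is on. Here is a minimal sketch with a hypothetical `tool_surface` function. The flag and tool names come from the list above. A missing flag is treated as off, to match the defaults.

```python
from typing import Dict, Iterable, List

# Each self-improvement tool and the platform kill-switch that guards it.
PLATFORM_FLAG_FOR_TOOL = {
    "propose_procedural_edit": "matrix.runtime.procedural-learning.enabled",
    "propose_tool_change": "matrix.runtime.tool-proposals.enabled",
}


def tool_surface(base_tools: Iterable[str], self_improve_enabled: bool,
                 platform_flags: Dict[str, bool]) -> List[str]:
    """Tools the model is shown this turn; gated tools are omitted, not disabled."""
    tools = list(base_tools)
    if not self_improve_enabled:              # per-agent opt-in, default off
        return tools
    for tool, flag in PLATFORM_FLAG_FOR_TOOL.items():
        if platform_flags.get(flag, False):   # a missing flag counts as off
            tools.append(tool)
    return tools
```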

For deployments running strict access control (Organization.accessEnabled=true), the spec recommends a belt-and-braces agent grant denying write to AGENT/SKILL types too. But that's defense-in-depth. The primary control is that the agent has no write tool to begin with.

Live on the next turn — no redeploy

Here's the part that makes it feel like real learning rather than a config workflow. Matrix recomposes every agent's prompt per turn from the live Agent/Skill entity (see CoALA in production for how the cognitive core assembles working memory each cycle). So an approved edit takes effect on the agent's very next turn — automatically. No redeploy, no restart, no cache to bust. The operator approves mid-conversation and the agent's instructions are updated before it speaks again.
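Per-turn recomposition is what makes an approval take effect immediately. This minimal sketch assumes plain dictionaries stand in for the live Agent and Skill entities, and `compose_system_prompt` is a hypothetical name. The prompt is rebuilt from the store on every turn and nothing is cached, so the first call after an approved edit already sees it.

```python
from typing import Dict, List


def compose_system_prompt(agents: Dict[str, dict], skills: Dict[str, dict],
                          agent_id: str) -> str:
    """Rebuild the system prompt from live entities; called once per turn."""
    agent = agents[agent_id]
    parts: List[str] = [agent["systemPrompt"]]
    for key in agent["skills"]:
        parts.append(skills[key]["systemPromptBlock"])
    return "\n\n".join(parts)
```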

How it fits the autonomous loop

Self-improvement isn't bolted onto the runtime — it's a first-class move in the same decision cycle that drives autonomous agents. When an autonomous agent plans a step, its planner can propose any tool in its valid action space, and propose_procedural_edit is just another candidate. So a long-running autonomous task can decide, mid-run, "I keep hitting this; I should propose a durable rule," call the tool, and log that step into its run history.

But the run never applies anything. The proposal queues for human review like any other. The agent's autonomy ends at the proposal; the human's authority begins there. A good autonomous prompt rider, from the spec: "If you discover a durable improvement to how you should behave, call propose_procedural_edit rather than just acting on it this once." Learn the lesson, propose the change, let a human decide if it sticks.

The takeaway

CoALA names procedural-memory writes as the riskiest thing an agent can do, and most implementations either avoid it entirely or hand it over with no guardrails. Matrix takes the third path: an agent can genuinely learn and improve its own behaviour, but the only thing it can do with that learning is ask. The structural guarantee — no self-edit tool exists — means the safety holds even if the model is compromised. Supersede-never-delete makes it reversible. Audit makes it accountable. Per-turn recomposition makes it instant. And it all ships dark, so you opt in deliberately, one agent at a time.

Self-improving agents don't have to be a leap of faith. They can be propose, approve, apply.

See the bigger picture: procedural memory is one of the four kinds of agent memory, and the decision cycle behind all of this is laid out in CoALA in production.

Want to try it? Spin up a workspace, create an agent in /admin/agents, and turn on selfImproveEnabled for one of them (the matching platform flag must be on as well). Watch the proposals land in your Tasks tab, then decide, every time, what your agent becomes next.
