Production

RCA auto-remediation agent that acts only inside a reversible envelope

An agent that reads root-cause-analysis tickets and runs scoped terminal commands to resolve known failure modes, autonomous only inside a known, safe, reversible remediation envelope.

30% increase in data uptime, measured in production on a humanoid robotics data program

The problem

On a humanoid robotics data program, working with node-cluster and egocentric-tracking data, a recurring set of failure modes kept knocking the pipeline offline. The fixes were well understood, but they waited on a human to read the root-cause-analysis ticket, recognize the failure, and run the same handful of commands. That wait was the cost: known problems with known fixes were still incurring full human-response latency, and uptime suffered for issues nobody actually needed to think about.

Approach

The whole design is the envelope. The interesting engineering here is not making an agent run terminal commands; it is deciding exactly what it may fix on its own and what still needs a human. The agent is allowed to act autonomously, but only inside a bounded set of known failures with known, reversible fixes. Anything outside that envelope is not its call.

Autonomous, but only on known failure modes. The agent reads an RCA ticket, matches it to a known failure, and runs the scoped fix. The leash: an unrecognized failure is out of scope, full stop, and goes to a human.
Reversible only. Every command in the envelope is one that can be undone. The leash: if a fix is not safely reversible, it is not in the envelope and the agent cannot run it.
Scoped commands, not a free shell. The agent runs a defined set of remediation commands, not arbitrary terminal access. The leash: the boundary is the set of known fixes, and the agent halts at it rather than improvising.

What was built

An agent that watches root-cause-analysis tickets, classifies each against a known set of failure modes, and runs the scoped, reversible remediation for the ones it recognizes. The remediation envelope, the set of known failures paired with their known, reversible fixes, is the core artifact; it is what makes the autonomy safe. Anything the agent does not recognize is escalated to a person instead of guessed at.

Guardrails

What the agent is structurally not allowed to do. This is the through-line: capability, then leash.

NoIt cannot act on an unknown failure mode. Anything outside the envelope is escalated to a human, never guessed.
NoIt cannot run an irreversible command. If a fix cannot be safely undone, it is not in the envelope.
NoIt does not get a free terminal. It runs a defined set of scoped remediations, not arbitrary commands.

Stack and tools

Claude Code Python Scoped commands Bounded autonomy Reversible envelope

My role

I designed and shipped this agent into production as part of my work leading data collection on the program.

Links and verification

This is professional production work on a humanoid robotics data program. It is de-identified and has no public repo, so there is nothing to link here. The 30% increase in data uptime is a real measured outcome, presented as fact.