First-pass labeling agent

The problem

On a humanoid robotics data program, working with node-cluster and egocentric-tracking data, the labeling pipeline started every item from raw, unsorted input. Labelers spent a large share of their time just deciding what a thing was before they could grade it, and that sorting cost scaled linearly with volume. The bottleneck was not labeler skill. Every item arrived with zero structure, so expensive human judgment was being spent on cheap triage instead of the calls that actually needed a person.

Approach

The design decision was to put an agent in front of the queue to do the cheap sorting, and to keep the human on the part that carries the judgment. The agent never grades and never finalizes; it proposes a first-pass category, and the labeler opens a pre-sorted queue and confirms or overrides every item.

Triage, do not decide. The agent buckets incoming data into a first-pass category. The leash: it proposes only, and a person confirms every call before anything is final.
Pre-sort the queue, do not shrink the review. Labelers open a triaged queue instead of raw input, so attention goes to the hard items. The leash: every item still passes a human, so the review surface never gets quietly cut.
Make overrides cheap and visible. A wrong first-pass guess costs one correction, not a re-sort from scratch. The leash: the human stays the source of truth, and the agent's guess is always a suggestion the person can reject.

What was built

A first-pass classification agent wired into the existing labeling workflow, plus the triaged queue it feeds. The agent reads incoming unlabeled items, assigns a first-pass category, and hands a pre-sorted queue to the labelers. The human-confirm step is part of the workflow, not bolted on after: nothing the agent proposes becomes a label until a person accepts it.
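The propose-then-confirm split can be sketched in a few lines. This is a minimal illustration, not the production code: the names `Proposal`, `Label`, and `review`, and the item IDs and reviewer name, are hypothetical, and only the four category names come from the sample worksheet on this page. The idea it shows is that a final label can only be constructed through a step that names a human reviewer.

```python
from dataclasses import dataclass
from typing import Optional

# Category names taken from the sample worksheet; everything else is hypothetical.
CATEGORIES = ("Accept", "Re-segment", "Discard", "Needs review")


@dataclass(frozen=True)
class Proposal:
    """The agent's first-pass suggestion. It is never a label on its own."""
    item_id: str
    category: str
    confidence: float


@dataclass(frozen=True)
class Label:
    """A final label. It is only ever built from a human decision."""
    item_id: str
    category: str
    reviewer: str
    overridden: bool


def review(proposal: Proposal, reviewer: str, override: Optional[str] = None) -> Label:
    """Turn a proposal into a label. Requires a named human reviewer."""
    if not reviewer:
        raise ValueError("a label needs a human reviewer")
    # An override replaces the agent's guess; otherwise the guess is confirmed.
    category = override if override is not None else proposal.category
    if category not in CATEGORIES:
        raise ValueError(f"unknown category: {category}")
    return Label(
        item_id=proposal.item_id,
        category=category,
        reviewer=reviewer,
        overridden=category != proposal.category,
    )


# Usage with made-up items: one confirmation, one single-step override.
confirmed = review(Proposal("item-001", "Accept", 0.94), reviewer="labeler_1")
corrected = review(
    Proposal("item-002", "Discard", 0.58),
    reviewer="labeler_1",
    override="Needs review",
)
```

In this sketch an override is one call with one extra argument, which mirrors the goal that a wrong first-pass guess costs a single correction rather than a re-sort.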

What a labeling pass looks like

The shape of the work, made concrete. Below is a small representative worksheet: the agent proposes a first-pass category and a confidence, and a person confirms or overrides every row. Two are overridden on purpose, because the human owning the call is the whole point.

Representative pass (anonymized sample)

Representative and anonymized. This illustrates the workflow and output shape, not real program data. Item names, categories, and confidence values are made up to show how a pass reads. They are not a leaked export and are not a business metric. The only measured result is the 25% increase in data grading output stated above.

Generic placeholder items. Categories shown are Accept, Re-segment, Discard, and Needs review. Confidence is the agent's own first-pass score, not an accuracy claim.
Item	First-pass category	Confidence	Human verdict
Clip 0481	Accept	0.94	Confirmed
Segment A-12	Re-segment	0.71	Confirmed
Clip 0517	Discard	0.58	Overridden
Segment B-07	Accept	0.88	Confirmed
Clip 0533	Needs review	0.49	Confirmed
Segment C-21	Re-segment	0.63	Overridden
Clip 0560	Accept	0.91	Confirmed

See it on synthetic data

The pattern, runnable. Below is a tiny synthetic queue of generic support-style messages, the kind of unsorted input a labeler would otherwise hand-sort one by one. Press Run first pass to watch a transparent classifier bucket them, sort them by confidence, and score itself against the hidden ground truth. The point is not the model but the routing: the agent does the first pass, and the human still owns every ambiguous call.

First-pass triage (synthetic data)

Fifty-four made-up messages sorted into four generic buckets by a simple keyword heuristic. Items it is confident about go to a bulk-confirm lane; the uncertain ones are routed to a human for hand review.

Illustrative demo on synthetic data. The numbers below are measured live on this sample, not a client or employer result. The classifier is a plain keyword-and-margin heuristic shown for transparency, not a production model, and this demo reports its own separately computed figures.
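A keyword-and-margin heuristic of the kind described here might look like the sketch below. It is an assumption about the general shape, not the demo's actual code: the bucket names, the keyword lists, and the margin formula (top score minus runner-up, divided by top score) are all hypothetical choices made for illustration.

```python
import re

# Hypothetical buckets and keywords; the demo's real lists are not shown on this page.
KEYWORDS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "account": {"password", "login", "email", "username"},
    "shipping": {"delivery", "package", "tracking", "shipped"},
    "bug": {"error", "crash", "broken", "freezes"},
}


def classify(message: str) -> tuple[str, float]:
    """Return (bucket, confidence) for one message.

    Score = number of distinct keywords from each bucket found in the message.
    Confidence = margin between the best and second-best score, as a share
    of the best score. A clean win gives 1.0; a tie gives 0.0.
    """
    words = set(re.findall(r"[a-z]+", message.lower()))
    scores = {bucket: len(words & kws) for bucket, kws in KEYWORDS.items()}
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, runner_up) = ranked[0], ranked[1]
    if top == 0:
        # No keyword matched: the bucket is an arbitrary guess, and a
        # confidence of 0.0 guarantees the item is routed to a human.
        return best, 0.0
    return best, (top - runner_up) / top


clear = classify("Please refund my payment")          # only billing keywords match
mixed = classify("login error after password reset")  # account 2, bug 1
empty = classify("hello there")                       # nothing matches
```

The margin makes the uncertainty visible: a message that hits two buckets scores lower than one that hits a single bucket, which is what lets the queue be sorted by confidence.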

Confidence to bulk-confirm: 70%

Raw queue

Everything unlabeled. A human would hand-sort all of it.

Triaged queue

Auto-bucketed, grouped by category, sorted by confidence.

Drag the threshold to see the honest tradeoff: a lower bar bulk-confirms more items but lets more first-pass errors through, while a higher bar sends more work to the human and raises the bulk lane's accuracy. Triage moves where the human looks; it does not remove the human.
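The tradeoff can be made concrete with a small calculation. The sketch below assumes each scored item is a `(confidence, was_correct)` pair. The function name `lane_stats` and the eight sample pairs are made up for illustration and are not the demo's data.

```python
from typing import List, Optional, Tuple


def lane_stats(
    results: List[Tuple[float, bool]], threshold: float
) -> Tuple[float, Optional[float]]:
    """Return (bulk_share, bulk_accuracy) at a given confidence threshold.

    bulk_share    = fraction of items at or above the threshold
    bulk_accuracy = fraction of those items whose first-pass guess was right,
                    or None when the bulk lane is empty
    """
    bulk = [correct for confidence, correct in results if confidence >= threshold]
    share = len(bulk) / len(results)
    accuracy = sum(bulk) / len(bulk) if bulk else None
    return share, accuracy


# Made-up (confidence, first-pass guess was correct) pairs.
sample = [
    (0.95, True), (0.90, True), (0.80, True), (0.75, False),
    (0.60, True), (0.55, False), (0.40, False), (0.30, True),
]

low_bar = lane_stats(sample, 0.50)   # more items bulk-confirmed, more errors let through
mid_bar = lane_stats(sample, 0.70)
high_bar = lane_stats(sample, 0.85)  # fewer items bulk-confirmed, cleaner bulk lane
```

On this made-up sample the bulk share falls from 0.75 to 0.5 to 0.25 as the threshold rises, while bulk-lane accuracy climbs from about 0.67 to 0.75 to 1.0. Everything below the threshold goes to hand review.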

Project the savings at your scale

The demo above measures a review-reduction lift on synthetic data. This is the next honest step: a transparent model that PROJECTS what a lift like that is worth, using numbers you set. Change any input and the math updates live. Nothing here is a measured or booked result; it is arithmetic on your own assumptions.

Projected savings (projection)

Projection. Based on the assumptions you set and the review reduction measured in the demo above. Not a guarantee or a measured client result.

Items reviewed per day: How many items a human would otherwise open one by one.

Minutes per manual review: Average time to hand-review a single item.

Reviewer cost per hour (USD): An assumption you set, not a price. Fully loaded hourly cost.

Review-reduction lift (%): 89%. Share of items the human no longer opens one by one. The default tracks the demo above, which measures 89% on an easy synthetic sample; real-world lift is usually lower, so set a conservative value for your own estimate. Drag the demo slider to drive this, or override it here.

Reviewer hours saved per week, at five working days

Reviewer hours saved per month, at ~21.7 working days

Reviewer hours saved per year, at 260 working days

Projected money saved per year: yearly hours saved × cost per hour

The formula

Every output is computed live from the four inputs above. No result is stored in the page.

Minutes saved per day = items per day × lift × minutes per review
Hours saved per day = minutes saved per day ÷ 60
Hours saved per week = hours saved per day × 5; per month = hours saved per day × 21.7; per year = hours saved per day × 260
Money saved per year = yearly hours saved × cost per hour

Working-day assumptions: 5 days per week, 260 days per year, and 260 ÷ 12 ≈ 21.7 days per month. The lift default traces to the demo above, where it is measured live on synthetic data. These are planning assumptions, not measured outcomes.
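The same arithmetic, written as a function. This is a direct transcription of the formula and working-day assumptions above, not the page's own script: the function name `project_savings`, the dictionary keys, and the example inputs are hypothetical, and the lift is passed as a fraction (0.5 for 50%) rather than a percentage.

```python
# Working-day assumptions stated in the text.
DAYS_PER_WEEK = 5
DAYS_PER_MONTH = 21.7   # about 260 / 12
DAYS_PER_YEAR = 260


def project_savings(
    items_per_day: float,
    minutes_per_review: float,
    cost_per_hour: float,
    lift: float,
) -> dict:
    """Project reviewer time and cost saved from the four inputs.

    lift is a fraction between 0 and 1: the share of items the human
    no longer opens one by one.
    """
    if not 0.0 <= lift <= 1.0:
        raise ValueError("lift must be a fraction between 0 and 1")
    minutes_per_day = items_per_day * lift * minutes_per_review
    hours_per_day = minutes_per_day / 60
    hours_per_year = hours_per_day * DAYS_PER_YEAR
    return {
        "hours_per_week": hours_per_day * DAYS_PER_WEEK,
        "hours_per_month": hours_per_day * DAYS_PER_MONTH,
        "hours_per_year": hours_per_year,
        "money_per_year": hours_per_year * cost_per_hour,
    }


# Hypothetical inputs: 600 items a day, 1 minute each, $40 per hour, 50% lift.
example = project_savings(600, 1, 40, 0.5)
```

With these made-up inputs, 300 minutes a day are saved, which is 5 hours a day: 25 hours a week, 108.5 hours a month, 1,300 hours a year, and $52,000 a year at the assumed hourly cost. As with the model itself, this does not net out the cost of running the agent.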

This model is deliberately simple and transparent. It values only the hand-review time the triage step removes; it does not claim quality gains, and it does not net out the cost of running the agent. Treat the lift as an assumption to pressure-test, not a promise.

Guardrails

What the agent is structurally not allowed to do. This is the through-line: capability, then leash.

No: It cannot finalize a label. Every call is a proposal that a human confirms or overrides.
No: It cannot remove the human review step. The triaged queue speeds the work; it does not replace the person.
No: It cannot silently change the dataset. The agent writes a first-pass suggestion, never a graded result.

Stack and tools

Claude Code, Python, Classification, Human-in-the-loop, Triage queue

My role

I designed and shipped this agent into production as part of my work leading data collection on the program.

Links and verification

This is professional production work on a humanoid robotics data program. It is de-identified and has no public repo, so there is nothing to link here. The 25% increase in data grading output is a real measured outcome, presented as fact.

First-pass labeling agent that triages the queue so humans grade faster