First-pass labeling agent that triages the queue so humans grade faster
An agent that auto-buckets incoming unlabeled data into first-pass categories, so human labelers open a triaged queue instead of raw input, and a person confirms every call.
The problem
On a humanoid robotics data program, working with node-cluster and egocentric-tracking data, the labeling pipeline started every item from raw, unsorted input. Labelers spent a large share of their time just deciding what an item was before they could grade it, and that sorting cost scaled linearly with volume. The bottleneck was not labeler skill. Every item arrived with zero structure, so expensive human judgment was being spent on cheap triage instead of on the calls that actually needed a person.
Approach
The design decision was to put an agent in front of the queue to do the cheap sorting, and to keep the human on the part that carries the judgment. The agent never grades and never finalizes; it proposes a first-pass category, and the labeler opens a pre-sorted queue and confirms or overrides every item.
- Triage, do not decide. The agent buckets incoming data into a first-pass category. The leash: it proposes only, and a person confirms every call before anything is final.
- Pre-sort the queue, do not shrink the review. Labelers open a triaged queue instead of raw input, so attention goes to the hard items. The leash: every item still passes a human, so the review surface never gets quietly cut.
- Make overrides cheap and visible. A wrong first-pass guess costs one correction, not a re-sort from scratch. The leash: the human stays the source of truth, and the agent's guess is always a suggestion the person can reject.
What was built
A first-pass classification agent wired into the existing labeling workflow, plus the triaged queue it feeds. The agent reads incoming unlabeled items, assigns a first-pass category, and hands a pre-sorted queue to the labelers. The human-confirm step is part of the workflow, not bolted on after: nothing the agent proposes becomes a label until a person accepts it.
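A minimal sketch of that propose-then-confirm contract. The names here (`Proposal`, `Label`, `confirm`, `override`) are hypothetical and are not taken from the production system; the sketch only illustrates the structural idea that a final label cannot be built without a named human reviewer.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proposal:
    """A first-pass suggestion from the agent. It is never a label by itself."""
    item_id: str
    category: str
    confidence: float


@dataclass(frozen=True)
class Label:
    """A final label. It always records the human who accepted or changed it."""
    item_id: str
    category: str
    reviewer: str
    overridden: bool


def _require_reviewer(reviewer: str) -> None:
    # Hypothetical guard: no reviewer, no label.
    if not reviewer:
        raise ValueError("a human reviewer is required to finalize a label")


def confirm(proposal: Proposal, reviewer: str) -> Label:
    """The human accepts the agent's suggested category as-is."""
    _require_reviewer(reviewer)
    return Label(proposal.item_id, proposal.category, reviewer, overridden=False)


def override(proposal: Proposal, reviewer: str, category: str) -> Label:
    """The human rejects the suggestion and supplies the category instead."""
    _require_reviewer(reviewer)
    return Label(proposal.item_id, category, reviewer, overridden=True)
```

In this sketch an override costs one call with the corrected category, and the `overridden` flag keeps the correction visible afterwards.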
What a labeling pass looks like
The shape of the work, made concrete. Below is a small representative worksheet: the agent proposes a first-pass category and a confidence, and a person confirms or overrides every row. Two are overridden on purpose, because the human owning the call is the whole point.
Representative pass (anonymized sample)
Representative and anonymized. This illustrates the workflow and output shape, not real program data. Item names, categories, and confidence values are made up to show how a pass reads. They are not a leaked export and are not a business metric. The only measured result is the 25% increase in data grading output stated under Links and verification.
| Item | First-pass category | Confidence | Human verdict |
|---|---|---|---|
| Clip 0481 | Accept | 0.94 | Confirmed |
| Segment A-12 | Re-segment | 0.71 | Confirmed |
| Clip 0517 | Discard | 0.58 | Overridden |
| Segment B-07 | Accept | 0.88 | Confirmed |
| Clip 0533 | Needs review | 0.49 | Confirmed |
| Segment C-21 | Re-segment | 0.63 | Overridden |
| Clip 0560 | Accept | 0.91 | Confirmed |
See it on synthetic data
The pattern, runnable. Below is a tiny synthetic queue of generic support-style messages, the kind of unsorted input a labeler would otherwise hand-sort one by one. Press "Run first pass" to watch a transparent classifier bucket them, sort them by confidence, and score itself against the hidden ground truth. The point is not the model; it is the routing. The agent does the first pass, and the human still owns every ambiguous call.
First-pass triage (synthetic data)
Fifty-four made-up messages sorted into four generic buckets by a simple keyword heuristic. Items it is confident about go to a bulk-confirm lane; the uncertain ones are routed to a human for hand review.
Illustrative demo on synthetic data. The numbers below are measured live on this sample, not a client or employer result. The classifier is a plain keyword-and-margin heuristic shown for transparency, not a production model, and this demo reports its own separately computed figures.
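A minimal sketch of what a keyword-and-margin heuristic can look like. The bucket names, keyword lists, and the `classify` and `first_pass` functions are assumptions for illustration, not the demo's actual code. Each bucket is scored by keyword hits, and the confidence is the margin between the top bucket and the runner-up, divided by the total number of hits.

```python
import re
from typing import List, Optional, Tuple

# Hypothetical buckets and keywords; the demo's real lists are not given in the text.
KEYWORDS = {
    "billing": {"invoice", "refund", "charge", "payment"},
    "account": {"password", "login", "username", "locked"},
    "shipping": {"delivery", "package", "tracking", "shipped"},
    "technical": {"error", "crash", "bug", "install"},
}


def classify(message: str) -> Tuple[Optional[str], float]:
    """Return (bucket, confidence); bucket is None when no keyword matches."""
    words = re.findall(r"[a-z]+", message.lower())
    scores = {b: sum(w in kws for w in words) for b, kws in KEYWORDS.items()}
    total = sum(scores.values())
    if total == 0:
        return None, 0.0  # nothing to go on: leave the call to a human
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best, top), (_, second) = ranked[0], ranked[1]
    # Margin: how far the winner is ahead of the runner-up, as a share of all hits.
    return best, (top - second) / total


def first_pass(messages: List[str]) -> List[Tuple[str, Optional[str], float]]:
    """Classify every message and sort the result by confidence, highest first."""
    rows = [(m, *classify(m)) for m in messages]
    return sorted(rows, key=lambda row: row[2], reverse=True)
```

A tie between two buckets gives a margin of zero, so a message that mentions both a refund and a login error lands at the bottom of the sorted queue, where a person looks at it.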
Raw queue: everything unlabeled. A human would hand-sort all of it.
Triaged queue: auto-bucketed, grouped by category, sorted by confidence.
Drag the threshold to see the honest tradeoff: a lower bar bulk-confirms more items but lets more first-pass errors through, while a higher bar sends more work to the human and raises the bulk lane's accuracy. Triage moves where the human looks; it does not remove the human.
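The tradeoff can be sketched as a function of the threshold. This is an illustrative model, assuming each item carries a predicted bucket, a confidence, and a hidden true bucket used only for scoring. The `Item` type, the `route` function, and the sample rows are hypothetical and are not the demo's data.

```python
from typing import Dict, List, NamedTuple, Optional


class Item(NamedTuple):
    predicted: str
    confidence: float
    truth: str  # hidden ground truth, used only to score the bulk lane


def route(items: List[Item], threshold: float) -> Dict[str, Optional[float]]:
    """Split items into a bulk-confirm lane and a hand-review lane."""
    bulk = [it for it in items if it.confidence >= threshold]
    hand = [it for it in items if it.confidence < threshold]
    correct = sum(it.predicted == it.truth for it in bulk)
    return {
        "bulk_count": len(bulk),
        "hand_review_count": len(hand),
        "bulk_errors": len(bulk) - correct,
        # Accuracy is undefined when the bulk lane is empty.
        "bulk_accuracy": correct / len(bulk) if bulk else None,
    }


# Made-up sample rows, for illustration only.
SAMPLE = [
    Item("billing", 0.95, "billing"),
    Item("account", 0.80, "account"),
    Item("shipping", 0.60, "technical"),
    Item("technical", 0.55, "technical"),
    Item("billing", 0.30, "account"),
]

low_bar = route(SAMPLE, 0.5)   # more items in the bulk lane
high_bar = route(SAMPLE, 0.7)  # more items sent to hand review
```

On these sample rows the lower bar puts four items in the bulk lane with one error among them, while the higher bar puts two items there with none. Both lanes still end at a person; the threshold only changes how much attention each item gets.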
Project the savings at your scale
The demo above measures a review-reduction lift on synthetic data. This is the next honest step: a transparent model that PROJECTS what a lift like that is worth, using numbers you set. Change any input and the math updates live. Nothing here is a measured or booked result; it is arithmetic on your own assumptions.
Projected savings (projection)
Projection. Based on the assumptions you set and the review reduction measured in the demo above. Not a guarantee or a measured client result.
Every output is computed live from the four inputs above. No result is stored in the page.
- Minutes saved per day = items per day × lift × minutes per review
- Hours saved per day = minutes saved per day ÷ 60
- Hours saved per week = hours saved per day × 5. Per month = hours saved per day × 21.7. Per year = hours saved per day × 260.
- Money saved per year = yearly hours saved × cost per hour
Working-day assumptions: 5 days per week, 260 days per year, and 260 ÷ 12 ≈ 21.7 days per month. The lift default traces to the demo above, where it is measured live on synthetic data. These are planning assumptions, not measured outcomes.
This model is deliberately simple and transparent. It values only the hand-review time the triage step removes; it does not claim quality gains, and it does not net out the cost of running the agent. Treat the lift as an assumption to pressure-test, not a promise.
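The formulas above can be written out as a short function. This is a sketch under stated assumptions: the function name `project_savings` and the dictionary keys are hypothetical, and the lift is read here as a fraction between 0 and 1 (the share of items that no longer need hand review), which is an interpretation of the formula rather than something the page states.

```python
from typing import Dict

# Working-day assumptions taken from the text.
WORK_DAYS_PER_WEEK = 5
WORK_DAYS_PER_MONTH = 21.7  # 260 / 12, rounded to one decimal
WORK_DAYS_PER_YEAR = 260


def project_savings(
    items_per_day: float,
    lift: float,
    minutes_per_review: float,
    cost_per_hour: float,
) -> Dict[str, float]:
    """Project hand-review time and cost saved from the four inputs."""
    if not 0.0 <= lift <= 1.0:
        # Assumption: lift is a fraction of items, not a percentage.
        raise ValueError("lift is expected as a fraction between 0 and 1")
    minutes_per_day = items_per_day * lift * minutes_per_review
    hours_per_day = minutes_per_day / 60
    hours_per_year = hours_per_day * WORK_DAYS_PER_YEAR
    return {
        "minutes_per_day": minutes_per_day,
        "hours_per_day": hours_per_day,
        "hours_per_week": hours_per_day * WORK_DAYS_PER_WEEK,
        "hours_per_month": hours_per_day * WORK_DAYS_PER_MONTH,
        "hours_per_year": hours_per_year,
        "money_per_year": hours_per_year * cost_per_hour,
    }
```

As in the model itself, the sketch values only the hand-review time removed and does not subtract the cost of running the agent.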
Guardrails
What the agent is structurally not allowed to do. This is the through-line: capability, then leash.
- No: it cannot finalize a label. Every call is a proposal that a human confirms or overrides.
- No: it cannot remove the human review step. The triaged queue speeds the work; it does not replace the person.
- No: it cannot silently change the dataset. The agent writes a first-pass suggestion, never a graded result.
Stack and tools
My role
I designed and shipped this agent into production as part of my work leading data collection on the program.
Links and verification
This is professional production work on a humanoid robotics data program. It is de-identified and has no public repo, so there is nothing to link here. The 25% increase in data grading output is a real measured outcome, presented as fact.