Runbook — Annotator Calibration & Qualification Session
Purpose: turn raw labelers into trained annotators (defined in validation-plan P1) and produce the documented evidence — “two annotators, qualified at κ ≥ 0.8 on a held-out clip” — that makes the inter-rater κ defensible.
Who’s in the room: the surgeon/PI (defines truth), the annotators (≥2, e.g. RA / med student / resident), and whoever runs CVAT. ~half a day total, can be split across two sessions.
One core principle: the surgeon sets the standard once; everything else measures distance from it. Annotators are never the source of truth — they’re trained toward the surgeon’s reference and gated on hitting it.
Before the session (prep)
- CVAT is up; each annotator + the surgeon has a login.
- Label set loaded from detection-ontology (7 instrument classes + events: swap, scope clean, instrument change, pause, transition).
- Annotation protocol ready — annotation-protocol (v1.0, drafted from the locked definitions) + confirm its §0 scheme decisions with the surgeon at session start.
- Three clips selected from different cases (so no clip is reused across roles):
  - Demo clip (~5 min): surgeon narrates this one live.
  - Calibration clip (~10 min): everyone labels, then compare openly.
  - Qualification clip (~10 min): labeled independently, blind, scored. Annotators must not see this one beforehand.
- Reference labels created by the surgeon on the calibration + qualification clips (this is “truth”). Do this privately in advance.
- Agreement-scoring ready (the data:statistical-analysis skill computes κ / IoU / ICC from CVAT exports).
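The clip-selection rule in the prep list (one clip per role, each from a different case) can be checked before the session. This is a minimal sketch: the `Clip` structure, the `check_clip_plan` helper, and the `case-A`/`case-B`/`case-C` identifiers are hypothetical and not part of CVAT or the project tooling.

```python
from dataclasses import dataclass

REQUIRED_ROLES = ("demo", "calibration", "qualification")


@dataclass(frozen=True)
class Clip:
    role: str       # one of REQUIRED_ROLES
    case_id: str    # hypothetical identifier of the source case
    minutes: float  # approximate clip length


def check_clip_plan(clips):
    """Return a list of problems; an empty list means the plan is consistent."""
    problems = []
    roles = [c.role for c in clips]
    for required in REQUIRED_ROLES:
        found = roles.count(required)
        if found != 1:
            problems.append(f"expected exactly one {required} clip, found {found}")
    case_ids = [c.case_id for c in clips]
    if len(set(case_ids)) != len(case_ids):
        problems.append("two clips come from the same case")
    return problems


# Placeholder plan using the approximate lengths from the prep list.
plan = [
    Clip("demo", "case-A", 5),
    Clip("calibration", "case-B", 10),
    Clip("qualification", "case-C", 10),
]
print(check_clip_plan(plan))
```

An empty list means the three roles are covered by three different cases. The check does not verify that annotators have never seen the qualification clip; that still has to be controlled through CVAT job assignment.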
Part 1 — Orientation (~45 min)
Goal: shared mental model.
- Surgeon walks the demo clip live (~20 min) — pauses on each instrument and event, names it, explains why it’s that and not the look-alike (nav suction vs non-nav suction vs nav probe — your hardest boundary, per Swap Count Artifact Analysis). Calls out what is not an event (a 2-frame flicker is not a swap; a hand-occlusion is not a pause).
- Protocol read-through (~15 min) — go through the written definitions together; annotators ask edge-case questions.
- CVAT mechanics (~10 min) — how to draw/edit boxes, use the timeline, save, submit a job. Practice on a throwaway clip.
Output: everyone has seen truth demonstrated and knows the tool. No scoring yet.
Part 2 — Calibration round (~90 min) open, iterative
Goal: converge the annotators onto the surgeon’s standard, and surface protocol gaps.
- Independent labeling — each annotator labels the calibration clip alone (~30–40 min). No conferring.
- Score vs reference — export, compute κ (class), IoU (boxes), ICC (event counts), boundary agreement (±1 s). (~10 min)
- Disagreement review — the important part (~30 min). Put the annotators’ labels next to the surgeon’s reference and walk every disagreement:
  - Annotator error → coaching (“that’s a nav probe, here’s the tell”).
  - Genuine ambiguity → the protocol is underspecified → write a new rule and update detection-ontology. (This is why calibration doubles as an ontology stress-test.)
- Re-label if needed — if agreement is poor (κ < ~0.6) or many rules changed, run a second short calibration clip and re-score. Iterate until annotators and reference are converging.
Output: a sharpened protocol + annotators who’ve seen their own mistakes corrected. Calibration is open-book — disagreement is expected and useful here. Record what rules changed.
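For reference, three of the scores from the "Score vs reference" step can be sketched in plain Python. This is not the data:statistical-analysis skill itself. It assumes the CVAT export has already been parsed into per-frame class labels, `(x1, y1, x2, y2)` boxes, and event timestamps in seconds, and the greedy one-to-one event matching is an assumption of this sketch. ICC is left out because the runbook does not say which ICC form is used.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of class labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of the two raters' marginal frequencies per class.
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical class; treated as perfect here
    return (observed - expected) / (1.0 - expected)


def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def boundary_agreement(annotator_times, reference_times, tolerance_s=1.0):
    """Fraction of reference events matched by an annotator event within the tolerance."""
    if not reference_times:
        return 1.0
    unmatched = sorted(annotator_times)
    matched = 0
    for ref in sorted(reference_times):
        candidates = [t for t in unmatched if abs(t - ref) <= tolerance_s]
        if candidates:
            # Greedy choice: consume the closest annotator event so it cannot match twice.
            unmatched.remove(min(candidates, key=lambda t: abs(t - ref)))
            matched += 1
    return matched / len(reference_times)


# Hypothetical label strings; the real class names come from detection-ontology.
annotator = ["nav_suction", "nav_suction", "nav_probe", "nav_probe"]
reference = ["nav_suction", "nav_probe", "nav_probe", "nav_probe"]
print(cohen_kappa(annotator, reference))
print(box_iou((0, 0, 10, 10), (5, 0, 15, 10)))
print(boundary_agreement([10.4, 31.0, 58.0], [10.0, 30.2, 45.0]))
```

In the toy example the raters agree on 3 of 4 frames with chance agreement 0.5, so κ = (0.75 − 0.5) / (1 − 0.5) = 0.5, which would fall below the ~0.6 re-label trigger above.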
Part 3 — Qualification round (~45 min) blind, scored, pass/fail
Goal: prove each annotator can hit the standard independently, on data they haven’t seen.
- Blind independent labeling of the qualification clip — no conferring, no peeking at the reference, protocol document allowed. (~25 min)
- Score each annotator vs the surgeon’s reference. (~10 min)
- Apply the gate:
| Result | Threshold (proposed) | Action |
|---|---|---|
| Pass | κ ≥ 0.80 (class) and mean IoU ≥ 0.7 and event-count ICC ≥ 0.75 | Annotator is qualified → may label the real study subset |
| Borderline | 0.6 ≤ κ < 0.80 | Targeted coaching on their specific error pattern + re-qualify on a fresh clip |
| Fail | κ < 0.6 | Re-train (repeat calibration); if persistent, reassign — not everyone annotates well |
Output: a documented pass/fail per annotator. Only passed annotators’ labels enter the P1 study.
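The gate can be written as a small function so it is applied identically to every annotator. This is a sketch using the proposed thresholds from the table. The table does not say what happens when κ ≥ 0.80 but IoU or ICC misses its threshold; treating that case as borderline is an assumption of this sketch and should be confirmed with the PI. The function name `qualification_result` is hypothetical.

```python
def qualification_result(kappa, mean_iou, icc,
                         kappa_pass=0.80, kappa_fail=0.6,
                         iou_pass=0.7, icc_pass=0.75):
    """Map one annotator's scores against the reference to pass / borderline / fail."""
    if kappa < kappa_fail:
        return "fail"
    if kappa >= kappa_pass and mean_iou >= iou_pass and icc >= icc_pass:
        return "pass"
    # Assumption: any other outcome, including a high kappa with a missed
    # IoU or ICC threshold, is handled as borderline (coach and re-qualify).
    return "borderline"


print(qualification_result(0.85, 0.75, 0.80))  # pass
print(qualification_result(0.70, 0.90, 0.90))  # borderline
print(qualification_result(0.50, 0.90, 0.90))  # fail
```

Keeping the thresholds as keyword arguments makes it easy to record exactly which values were used when the qualification is documented.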
After the session (documentation — don’t skip)
Record, in the project, the facts a reviewer will want:
- Who annotated (role rather than name if anonymizing, e.g. RA, PGY-3 resident).
- Training performed — orientation + calibration rounds, dates.
- Qualification scores per annotator (the κ / IoU / ICC they passed at) and the threshold used.
- Protocol version used for the live study + the changes calibration produced.
- Update validation-plan P1 status with the qualification numbers.
The deliverable sentence this produces: “Two annotators (RA, PGY-3) were trained on the protocol and qualified independently at κ ≥ 0.80 against surgeon-defined reference labels on a held-out clip before annotating the validation subset.”
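One way to capture the facts listed above is a small structured record per annotator, serialized to JSON and stored with the project. The `QualificationRecord` structure and its field names are hypothetical, and every value below is a placeholder, not a real result.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class QualificationRecord:
    annotator_role: str    # role rather than name if anonymizing
    training_dates: list   # orientation and calibration dates
    kappa: float           # class agreement vs surgeon reference
    mean_iou: float        # box agreement vs surgeon reference
    icc: float             # event-count agreement vs surgeon reference
    thresholds: dict       # the gate values that were applied
    protocol_version: str  # protocol version used for the live study
    protocol_changes: list = field(default_factory=list)


# Placeholder values only; replace with the actual session outcome.
record = QualificationRecord(
    annotator_role="RA",
    training_dates=["YYYY-MM-DD"],
    kappa=0.0,
    mean_iou=0.0,
    icc=0.0,
    thresholds={"kappa": 0.80, "mean_iou": 0.7, "icc": 0.75},
    protocol_version="v1.0",
    protocol_changes=["(rules added or changed during calibration)"],
)

record_json = json.dumps(asdict(record), indent=2, ensure_ascii=False)
print(record_json)
```

Storing the thresholds next to the scores means the record stays interpretable even if the proposed gate values change later.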
Timing summary
| Part | Time | Mode |
|---|---|---|
| Prep | (advance) | surgeon builds reference |
| 1. Orientation | ~45 min | together |
| 2. Calibration | ~90 min | label alone → review together |
| 3. Qualification | ~45 min | blind, scored |
| Docs | ~15 min | — |
Total active session ≈ 3¼ hours (the parts above sum to 195 min). Split Part 1 from Parts 2–3 across two days if scheduling is tight.
Links
- validation-plan — P1, where “trained annotator” is defined and where these numbers land
- detection-ontology — the definitions calibration tests and sharpens
- cvat-self-host-runbook — the tool the session runs in
- Swap Count Artifact Analysis — why the swap/flicker and suction-class boundaries need the most calibration attention