Runbook — Annotator Calibration & Qualification Session

Purpose: turn raw labelers into trained annotators (defined in validation-plan P1) and produce the documented evidence — “two annotators, qualified at κ ≥ 0.8 on a held-out clip” — that makes the inter-rater κ defensible.

Who’s in the room: the surgeon/PI (defines truth), the annotators (≥2, e.g. RA / med student / resident), and whoever runs CVAT. Allow about half a day in total; it can be split across two sessions.

One core principle: the surgeon sets the standard once; everything else measures distance from it. Annotators are never the source of truth — they’re trained toward the surgeon’s reference and gated on hitting it.


Before the session (prep)

  • CVAT is up; each annotator + the surgeon has a login.
  • Label set loaded from detection-ontology (7 instrument classes + events: swap, scope clean, instrument change, pause, transition).
  • Annotation protocol ready: annotation-protocol (v1.0, drafted from the locked definitions). Confirm its §0 scheme decisions with the surgeon at session start.
  • Three clips selected from different cases (so no clip is reused across roles):
    • Demo clip (~5 min) — surgeon narrates this one live.
    • Calibration clip (~10 min) — everyone labels, then compare openly.
    • Qualification clip (~10 min) — labeled independently, blind, scored. Annotators must not see this one beforehand.
  • Reference labels created by the surgeon on the calibration + qualification clips (this is “truth”). Do this privately in advance.
  • Agreement-scoring ready (the data:statistical-analysis skill computes κ / IoU / ICC from CVAT exports).

Part 1 — Orientation (~45 min)

Goal: shared mental model.

  1. Surgeon walks the demo clip live (~20 min) — pauses on each instrument and event, names it, explains why it’s that and not the look-alike (nav suction vs non-nav suction vs nav probe — your hardest boundary, per Swap Count Artifact Analysis). Calls out what is not an event (a 2-frame flicker is not a swap; a hand-occlusion is not a pause).
  2. Protocol read-through (~15 min) — go through the written definitions together; annotators ask edge-case questions.
  3. CVAT mechanics (~10 min) — how to draw/edit boxes, use the timeline, save, submit a job. Practice on a throwaway clip.

Output: everyone has seen truth demonstrated and knows the tool. No scoring yet.


Part 2 — Calibration round (~90 min) open, iterative

Goal: converge the annotators onto the surgeon’s standard, and surface protocol gaps.

  1. Independent labeling — each annotator labels the calibration clip alone (~30–40 min). No conferring.
  2. Score vs reference — export, compute κ (class), IoU (boxes), ICC (event counts), boundary agreement (±1 s). (~10 min)
  3. Disagreement review — the important part (~30 min). Put the annotators’ labels next to the surgeon’s reference and walk every disagreement:
    • Annotator error → coaching (“that’s a nav probe, here’s the tell”).
    • Genuine ambiguity → the protocol is underspecified → write a new rule and update detection-ontology. (This is why calibration doubles as an ontology stress-test.)
  4. Re-label if needed — if agreement is poor (κ < ~0.6) or many rules changed, run a second short calibration clip and re-score. Iterate until annotators and reference are converging.
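
For orientation, here is a minimal sketch of what the class, box, and boundary scores in step 2 measure. It is not the data:statistical-analysis skill itself. It assumes labels have already been pulled out of the CVAT exports into plain Python lists, that boxes are (x1, y1, x2, y2) tuples, and that event boundaries are timestamps in seconds. The function names are hypothetical, and the greedy one-to-one matching used for boundary agreement is one reasonable reading of "±1 s", not a definition taken from the protocol.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of class labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same class by both raters.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's own class frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used the same single class throughout
    return (observed - expected) / (1.0 - expected)


def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned (x1, y1, x2, y2) boxes."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def boundary_agreement(ref_times, ann_times, tol_s=1.0):
    """Fraction of reference event times matched one-to-one within tol_s seconds.

    Assumption: greedy nearest-match in time order; the protocol may define
    boundary agreement differently.
    """
    if not ref_times:
        return 1.0
    unmatched = sorted(ann_times)
    hits = 0
    for t in sorted(ref_times):
        candidates = [u for u in unmatched if abs(u - t) <= tol_s]
        if candidates:
            unmatched.remove(min(candidates, key=lambda u: abs(u - t)))
            hits += 1
    return hits / len(ref_times)


# Illustrative values only, not study data.
reference = ["nav_suction", "nav_suction", "nav_probe", "nav_probe"]
annotator = ["nav_suction", "nav_probe", "nav_probe", "nav_probe"]
kappa = cohens_kappa(reference, annotator)
iou = box_iou((0, 0, 10, 10), (5, 0, 15, 10))
boundary = boundary_agreement([10.0, 20.0, 30.0], [10.4, 21.5, 29.2])
```

In the illustrative data the annotator agrees on 3 of 4 items, but κ is lower than 0.75 because half of that agreement is expected by chance. This is why the gate is stated in κ rather than raw percent agreement.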

Output: a sharpened protocol + annotators who’ve seen their own mistakes corrected. Calibration is open-book — disagreement is expected and useful here. Record what rules changed.


Part 3 — Qualification round (~45 min) blind, scored, pass/fail

Goal: prove each annotator can hit the standard independently, on data they haven’t seen.

  1. Blind independent labeling of the qualification clip — no conferring, no peeking at the reference, protocol document allowed. (~25 min)
  2. Score each annotator vs the surgeon’s reference. (~10 min)
  3. Apply the gate:
Result     | Threshold (proposed)                                              | Action
-----------|-------------------------------------------------------------------|-------
Pass       | κ ≥ 0.80 (class) and mean IoU ≥ 0.7 and event-count ICC ≥ 0.75    | Annotator is qualified → may label the real study subset
Borderline | 0.6 ≤ κ < 0.80                                                    | Targeted coaching on their specific error pattern + re-qualify on a fresh clip
Fail       | κ < 0.6                                                           | Re-train (repeat calibration); if persistent, reassign (not everyone annotates well)
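
The gate above can be written down as a small function so every annotator is scored by the same rule. This is a sketch using the proposed thresholds; the function name is hypothetical. The table does not say what happens when κ clears 0.80 but mean IoU or ICC falls short. The sketch treats that case as borderline, which is an assumption to confirm with the surgeon/PI.

```python
def qualification_result(kappa, mean_iou, icc):
    """Map one annotator's qualification scores to pass / borderline / fail."""
    if kappa < 0.6:
        return "fail"
    if kappa >= 0.80 and mean_iou >= 0.7 and icc >= 0.75:
        return "pass"
    # Covers 0.6 <= kappa < 0.80, and (by assumption, not stated in the
    # table) kappa >= 0.80 with mean IoU or ICC below its threshold.
    return "borderline"


# Illustrative scores only, not study data.
results = {
    "annotator_1": qualification_result(kappa=0.85, mean_iou=0.75, icc=0.80),
    "annotator_2": qualification_result(kappa=0.70, mean_iou=0.90, icc=0.90),
    "annotator_3": qualification_result(kappa=0.50, mean_iou=0.90, icc=0.90),
}
```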

Output: a documented pass/fail per annotator. Only passed annotators’ labels enter the P1 study.


After the session (documentation — don’t skip)

Record, in the project, the facts a reviewer will want:

The deliverable sentence this produces: “Two annotators (RA, PGY-3) were trained on the protocol and qualified independently at κ ≥ 0.80 against surgeon-defined reference labels on a held-out clip before annotating the validation subset.”


Timing summary

Part             | Time      | Mode
-----------------|-----------|-----
Prep             | (advance) | surgeon builds reference
1. Orientation   | ~45 min   | together
2. Calibration   | ~90 min   | label alone → review together
3. Qualification | ~45 min   | blind, scored
Docs             | ~15 min   |

Total active session ≈ 3 hours. Split Part 1 from Parts 2–3 across two days if scheduling is tight.