Runbook — Annotator Calibration & Qualification Session

Purpose: turn raw labelers into trained annotators (defined in validation-plan P1) and produce the documented evidence — “two annotators, qualified at κ ≥ 0.8 on a held-out clip” — that makes the inter-rater κ defensible.

Who’s in the room: the surgeon/PI (defines truth), the annotators (≥2, e.g. RA / med student / resident), and whoever runs CVAT. ~half a day total, can be split across two sessions.

One core principle: the surgeon sets the standard once; everything else measures distance from it. Annotators are never the source of truth — they’re trained toward the surgeon’s reference and gated on hitting it.

Before the session (prep)

CVAT is up; each annotator + the surgeon has a login.
Label set loaded from detection-ontology (7 instrument classes + events: swap, scope clean, instrument change, pause, transition).
Annotation protocol ready: annotation-protocol (v1.0, drafted from the locked definitions). Confirm its §0 scheme decisions with the surgeon at session start.
Three clips selected from different cases (so no clip is reused across roles):
- Demo clip (~5 min) — surgeon narrates this one live.
- Calibration clip (~10 min) — everyone labels, then compare openly.
- Qualification clip (~10 min) — labeled independently, blind, scored. Annotators must not see this one beforehand.
Reference labels created by the surgeon on the calibration + qualification clips (this is “truth”). Do this privately in advance.
Agreement-scoring ready (the data:statistical-analysis skill computes κ / IoU / ICC from CVAT exports).

Part 1 — Orientation (~45 min)

Goal: shared mental model.

Surgeon walks the demo clip live (~20 min) — pauses on each instrument and event, names it, explains why it’s that and not the look-alike (nav suction vs non-nav suction vs nav probe — your hardest boundary, per Swap Count Artifact Analysis). Calls out what is not an event (a 2-frame flicker is not a swap; a hand-occlusion is not a pause).
Protocol read-through (~15 min) — go through the written definitions together; annotators ask edge-case questions.
CVAT mechanics (~10 min) — how to draw/edit boxes, use the timeline, save, submit a job. Practice on a throwaway clip.

Output: everyone has seen truth demonstrated and knows the tool. No scoring yet.

Part 2 — Calibration round (~90 min) open, iterative

Goal: converge the annotators onto the surgeon’s standard, and surface protocol gaps.

Independent labeling — each annotator labels the calibration clip alone (~30–40 min). No conferring.
Score vs reference — export, compute κ (class), IoU (boxes), ICC (event counts), boundary agreement (±1 s). (~10 min)
Disagreement review — the important part (~30 min). Put the annotators’ labels next to the surgeon’s reference and walk every disagreement:
- Annotator error → coaching (“that’s a nav probe, here’s the tell”).
- Genuine ambiguity → the protocol is underspecified → write a new rule and update detection-ontology. (This is why calibration doubles as an ontology stress-test.)
Re-label if needed — if agreement is poor (κ < ~0.6) or many rules changed, run a second short calibration clip and re-score. Iterate until annotators and reference are converging.

Output: a sharpened protocol + annotators who’ve seen their own mistakes corrected. Calibration is open-book — disagreement is expected and useful here. Record what rules changed.
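
For orientation, the three simpler scores in the "Score vs reference" step can be sketched in plain Python. This is a minimal illustration, not the data:statistical-analysis skill itself: it assumes the CVAT export has already been reduced to aligned per-frame class labels, matched box pairs in (x1, y1, x2, y2) form, and event timestamps in seconds. The function names and that input layout are hypothetical. ICC is left out because it depends on which ICC form the validation plan specifies.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two equal-length sequences of class labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label sequences of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters chose the same class.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal class frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical class throughout
    return (observed - expected) / (1.0 - expected)


def box_iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def boundary_agreement(ref_times, ann_times, tol=1.0):
    """Fraction of reference event times matched one-to-one within tol seconds."""
    unmatched = sorted(ann_times)
    hits = 0
    for t in sorted(ref_times):
        # Greedy match: take the earliest unused annotator time within tolerance.
        match = next((u for u in unmatched if abs(u - t) <= tol), None)
        if match is not None:
            unmatched.remove(match)
            hits += 1
    return hits / len(ref_times) if ref_times else 1.0
```

Each function takes the surgeon's reference and one annotator's labels, so the scores are always distance from the reference rather than annotator-to-annotator agreement.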

Part 3 — Qualification round (~45 min) blind, scored, pass/fail

Goal: prove each annotator can hit the standard independently, on data they haven’t seen.

Blind independent labeling of the qualification clip — no conferring, no peeking at the reference, protocol document allowed. (~25 min)
Score each annotator vs the surgeon’s reference. (~10 min)
Apply the gate:

Result	Threshold (proposed)	Action
Pass	κ ≥ 0.80 (class) and mean IoU ≥ 0.7 and event-count ICC ≥ 0.75	Annotator is qualified → may label the real study subset
Borderline	0.6 ≤ κ < 0.80	Targeted coaching on their specific error pattern + re-qualify on a fresh clip
Fail	κ < 0.6	Re-train (repeat calibration); if persistent, reassign — not everyone annotates well

Output: a documented pass/fail per annotator. Only passed annotators’ labels enter the P1 study.
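
The gate above could be applied mechanically along these lines. This is a sketch using the proposed thresholds from the table; the function name is hypothetical. One case is an assumption, because the table does not cover it: an annotator with κ ≥ 0.80 whose mean IoU or event-count ICC falls below threshold is treated here as borderline.

```python
def qualification_result(kappa, mean_iou, event_icc):
    """Map one annotator's qualification scores to pass / borderline / fail."""
    if kappa < 0.6:
        return "fail"
    if kappa >= 0.80 and mean_iou >= 0.7 and event_icc >= 0.75:
        return "pass"
    # Covers 0.6 <= kappa < 0.80, and (by assumption, not stated in the table)
    # kappa >= 0.80 with IoU or ICC below threshold.
    return "borderline"
```

For example, `qualification_result(0.85, 0.74, 0.80)` returns `"pass"`, while `qualification_result(0.72, 0.80, 0.90)` returns `"borderline"`.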

After the session (documentation — don’t skip)

Record, in the project, the facts a reviewer will want:

Who annotated (role, not name, if anonymizing: e.g. RA, PGY-3 resident).
Training performed — orientation + calibration rounds, dates.
Qualification scores per annotator (the κ / IoU / ICC they passed at) and the threshold used.
Protocol version used for the live study + the changes calibration produced.
Update validation-plan P1 status with the qualification numbers.
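
One way to keep these facts together is a small structured record per annotator, saved alongside the project notes. The sketch below is only a suggestion: the class name, field names, and JSON layout are hypothetical, and the scores and date in the example are placeholders, not results.

```python
import json
from dataclasses import dataclass, field, asdict


@dataclass
class QualificationRecord:
    annotator_role: str      # role rather than name when anonymizing
    training_dates: list     # dates of the orientation + calibration rounds
    kappa: float             # class agreement vs surgeon reference
    mean_iou: float          # box agreement vs surgeon reference
    event_icc: float         # event-count agreement vs surgeon reference
    thresholds: dict         # the gate actually applied
    protocol_version: str    # protocol version used for the live study
    rule_changes: list = field(default_factory=list)  # rules calibration produced


def to_json(record):
    """Serialize a record to readable JSON, keeping non-ASCII such as κ intact."""
    return json.dumps(asdict(record), indent=2, ensure_ascii=False)


# Placeholder values for illustration only; replace with the session's real numbers.
example = QualificationRecord(
    annotator_role="PGY-3 resident",
    training_dates=["YYYY-MM-DD"],
    kappa=0.0,
    mean_iou=0.0,
    event_icc=0.0,
    thresholds={"kappa": 0.80, "mean_iou": 0.7, "event_icc": 0.75},
    protocol_version="v1.0",
)
```

Calling `print(to_json(example))` yields one JSON object per annotator, which can be pasted into the validation-plan P1 status update.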

The deliverable sentence this produces: “Two annotators (RA, PGY-3) were trained on the protocol and qualified independently at κ ≥ 0.80 against surgeon-defined reference labels on a held-out clip before annotating the validation subset.”

Timing summary

Part	Time	Mode
Prep	(advance)	surgeon builds reference
1. Orientation	~45 min	together
2. Calibration	~90 min	label alone → review together
3. Qualification	~45 min	blind, scored
Docs	~15 min	—

Total active session ≈ 3 hours. Split Part 1 from Parts 2–3 across two days if scheduling is tight.
