Glossary — Plain-Language Definitions
For team members new to computer vision, surgery, or statistics. Every piece of jargon used across this project is explained here in plain English. If a doc uses a term you don’t know, look it up here.
New to the project entirely? Read Start-here first, then keep this open as a reference.
The project in one breath
We record sinus surgery on a camera. A program watches the video and labels which surgical instrument is being used, moment to moment. From those labels we compute how efficiently the surgery went (time per instrument, wasted time, tool switches). Validation = proving those numbers are actually correct before anyone trusts them.
Computer vision / machine-learning terms
- Model — a program that learned to do a task from examples (here: spot instruments in video frames). Ours is a “YOLO” model.
- YOLO — “You Only Look Once,” a popular fast object-detection model family. It draws a box around each instrument in a frame and says what it is.
- Detection — finding where something is in an image (the box) and what it is (the label).
- Bounding box — the rectangle drawn around an instrument. “Box all visible instruments” = draw a rectangle on each one.
- Class — a category the model can output. Our classes are the instrument types (forceps, microdebrider, the suctions, etc.).
- Frame — one still image from the video. Video is ~30 frames per second, so a 2-hour case is ~216,000 frames.
- fps (frames per second) — how many frames are recorded (or labeled) for each second of video. We annotate at ~5 fps (every ~6th frame of the ~30 fps recording) to save labor.
- Inference — running the trained model on new video to get predictions.
- Training — teaching the model by showing it labeled examples.
- Confidence / score — how sure the model is about a detection (0–1). Low confidence is where it “flickers” between guesses.
- Domain shift — when new data looks different from training data (new camera, new surgeon, new lighting) and the model gets worse. A big theme for the multi-surgeon expansion.
- Temporal smoothing — cleaning up the frame-by-frame labels over time so brief glitches (1–2 wrong frames) don’t count as real events. Our rule: ignore anything shorter than 1 second.
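To make the temporal smoothing rule concrete, here is a minimal sketch, not the project's actual pipeline. It assumes the model's output has been reduced to one label per annotated frame at ~5 fps, so "shorter than 1 second" means fewer than 5 frames. The function name `smooth_labels` and the example labels are made up for illustration.

```python
from itertools import groupby


def smooth_labels(frame_labels, fps=5, min_seconds=1.0):
    """Drop runs of identical labels shorter than min_seconds.

    Returns a list of (label, start_frame, end_frame_exclusive) tuples.
    Neighbouring runs that end up with the same label are merged.
    """
    min_frames = int(round(fps * min_seconds))

    # Step 1: run-length encode the per-frame labels.
    runs = []
    start = 0
    for label, group in groupby(frame_labels):
        length = len(list(group))
        runs.append((label, start, start + length))
        start += length

    # Step 2: skip glitches and merge same-label neighbours.
    kept = []
    for label, run_start, run_end in runs:
        if run_end - run_start < min_frames:
            continue  # shorter than the minimum: treated as flicker
        if kept and kept[-1][0] == label:
            kept[-1] = (label, kept[-1][1], run_end)
        else:
            kept.append((label, run_start, run_end))
    return kept


# Hypothetical example: a 2-frame "suction" flicker inside a forceps stretch.
labels = ["forceps"] * 10 + ["suction"] * 2 + ["forceps"] * 8 + ["microdebrider"] * 6
smoothed = smooth_labels(labels)
print(smoothed)  # [('forceps', 0, 20), ('microdebrider', 20, 26)]
```

In this example the raw labels contain three label changes, but only one survives smoothing. That is the same effect that inflates raw swap counts.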
Accuracy / evaluation terms
- Ground truth — the correct answer, decided by humans. The model is graded against this.
- Annotation / labeling — humans marking the ground truth (drawing boxes, marking events) on video.
- Precision — of everything the model said was a forceps, what fraction really were? (Punishes false alarms.) Ours: 92.4%.
- Recall — of all the forceps that were actually there, what fraction did the model find? (Punishes misses.) Ours: 85.3%.
- F1 — a single score blending precision and recall (their harmonic mean). Ours: ~88.7%.
- IoU (Intersection over Union) — how much two boxes overlap, 0–1. Used to decide if a predicted box “matches” a ground-truth box (we count a match at IoU ≥ 0.5).
- mAP (mean Average Precision) — the standard overall score for object detectors; averages precision across recall levels and classes. “mAP@0.5” = computed at the 0.5 IoU match threshold.
- Confusion matrix — a grid showing what got mistaken for what (e.g. how often nav-suction was called non-nav-suction). Our key diagnostic.
- Support — how many real examples of a class there were. A 0.6 score on 12 examples is much noisier than on 1,200 — always read it next to support.
- Held-out / test set — data the model did not train on, used for honest grading. Grading on data it trained on is cheating (inflated scores).
- Leave-one-case-out (LOOCV) — with only 16 cases, we train 16 times, each time hiding one case and testing on it. Every case gets tested on a model that never saw it. The honest way to grade at small sample size.
- Case-level vs frame-level split — frames from one surgery look almost identical, so a case must be entirely in train or entirely in test — never split its frames across both, or scores inflate.
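Three of the ideas above (IoU, F1, and leave-one-case-out splitting) are easier to see in code. This is an illustrative sketch only. The box format `(x1, y1, x2, y2)`, the function names, and the case IDs are assumptions, not the project's real evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    inter_w = max(0.0, min(box_a[2], box_b[2]) - max(box_a[0], box_b[0]))
    inter_h = max(0.0, min(box_a[3], box_b[3]) - max(box_a[1], box_b[1]))
    inter = inter_w * inter_h
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def leave_one_case_out(case_ids):
    """Yield (train_cases, test_case); a case is never split across both."""
    for held_out in case_ids:
        yield [c for c in case_ids if c != held_out], held_out


# Two boxes that overlap by half their width: IoU is 1/3, below the 0.5 match bar.
overlap = iou((0, 0, 10, 10), (5, 0, 15, 10))
print(round(overlap, 3))  # 0.333

# The precision and recall quoted above give the quoted F1.
print(round(f1_score(0.924, 0.853), 3))  # 0.887

# Hypothetical case IDs for 16 cases: 16 folds, each training on 15 cases.
cases = [f"case_{i:02d}" for i in range(1, 17)]
folds = list(leave_one_case_out(cases))
print(len(folds), len(folds[0][0]))  # 16 15
```

The split is done on case IDs rather than frames, which is what keeps near-identical frames from the same surgery out of both train and test.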
Statistics / validity terms
- Inter-rater reliability — do two human annotators, working independently, produce the same labels? If not, the “ground truth” is unreliable and nothing built on it can be trusted.
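- κ (kappa) / Cohen’s kappa — a score for how much two annotators agree, correcting for chance agreement. 1 is perfect agreement, 0 is what chance alone would produce, and negative values mean agreement worse than chance. We require κ ≥ 0.80 (“substantial to near-perfect”) before trusting labels.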
- ICC (Intraclass Correlation) — agreement score for numbers (e.g. do two annotators count the same number of swaps per case?).
- Construct validity — does a metric actually measure what it claims? E.g. does our “efficiency index” really track efficiency, or just case length? Proven by checking it behaves the way real efficiency should (separates experts from trainees, etc.).
- Convergent validity — the metric correlates with an accepted external measure it should correlate with.
- Discriminant / known-groups validity — the metric separates groups it should separate (e.g. simple vs. complex cases).
- Pre-registration — writing down what you expect to find before running the analysis, so a positive result can’t be dismissed as cherry-picking.
- Measurement ceiling — you can’t measure the model more precisely than your humans agree. If humans agree at 0.6, a perfect model also scores ~0.6, because the label noise hides the truth. This is why inter-rater agreement comes first.
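For readers who want to see how kappa corrects for chance, here is a small sketch of Cohen's kappa for two annotators who labeled the same items. The function name and the example labels are invented for illustration. In practice a statistics library would normally be used instead.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items in order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: probability both pick the same label at random,
    # given how often each annotator used each label.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    expected = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # undefined here, and returning 1.0 is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Hypothetical labels for 10 frames; the annotators disagree on one frame.
annotator_a = ["forceps"] * 4 + ["suction"] * 4 + ["none"] * 2
annotator_b = ["forceps"] * 3 + ["suction"] * 5 + ["none"] * 2
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.844
```

Raw agreement here is 90%, but kappa is about 0.84 because some of that agreement would happen by chance. It still clears the 0.80 bar.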
Surgical / domain terms
- FESS — Functional Endoscopic Sinus Surgery. The sinus operation we record.
- Endoscope — the camera-on-a-rod the surgeon puts inside the nose to see. (Different from our recording camera, which watches the room/hands.)
- Rhinology / rhinologist — the ENT subspecialty (and surgeon) focused on the nose and sinuses.
- The instruments — forceps (grabbing/cutting, large), microdebrider (powered shaver), suctions (nav suction, non-nav suction — thin tubes that remove blood/debris), nav probe (thin navigation pointer), suction bovie (suction + cautery), plus a catch-all.
- Bilateral / primary / complex — bilateral = both sides; primary = first-time (not revision); complex = extra procedures added (septoplasty, turbinate work).
- Septoplasty / turbinate reduction — common add-on procedures that make a case “complex.”
- Mayo stand / scrub tech — the instrument tray, and the assistant who hands instruments to the surgeon. Where “handoffs” happen.
Project-specific terms (our invented vocabulary)
- Bout — one continuous stretch of using a single instrument. Must last ≥ 1 second to count (shorter = glitch).
- Swap / instrument change — switching from one instrument (bout) to another. Our headline workflow number — and the one the raw model over-counted 2–4×.
- The swap-count artifact — the model’s raw swap count (~434/case) was wildly inflated by frame-to-frame flicker, not real switches. The true number is ~150–200. A cautionary tale that “accurate detector ≠ correct metric.”
- Scope cleaning — when the surgeon pulls the endoscope out to wipe/defog it and reinserts. We label these by hand.
- Surgical pause — ≥ 5 seconds with no instrument working. Feeds the “dead time” metric.
- Dead time — non-instrument time during a case (~22% of recorded time). The grant’s headline efficiency number.
- Suction (any) / hierarchical rollup — because the three thin suction-like tools are easily confused, we keep the fine labels but fall back to a single “suction (any)” bucket when the model isn’t sure. Robust where unsure, precise where sure.
- Ontology — our structured “dictionary” mapping raw detections → meaningful surgical events → outcomes. The shared definitions everything else depends on. See detection-ontology.
- The four pillars (P1–P4) — our validation stages: P1 humans agree → P2 model matches humans → P3 the metrics mean something → P4 it holds on new surgeons. See validation-plan.
- CVAT — the (free, self-hosted) software annotators use to label video. See cvat-self-host-runbook.
- Efficiency Index — the planned single composite score summarizing a case’s efficiency (formula still to be defined).
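The workflow terms above (bout, swap, surgical pause, dead time) fit together roughly as in the sketch below. It is a sketch under stated assumptions, not the project's metric code. It assumes bouts are already smoothed, sorted, non-overlapping, and given as `(instrument, start_seconds, end_seconds)`. It counts a swap whenever consecutive bouts use different instruments, and it treats every gap between bouts (including before the first and after the last) as non-instrument time. The function name `workflow_metrics` and the example bouts are hypothetical.

```python
def workflow_metrics(bouts, case_start, case_end, pause_seconds=5.0):
    """Count swaps and pauses and compute the dead-time fraction for one case."""
    # A swap is a change of instrument between consecutive bouts.
    swaps = sum(1 for prev, cur in zip(bouts, bouts[1:]) if prev[0] != cur[0])

    # Gaps are the stretches with no instrument working:
    # case start -> first bout, between bouts, last bout -> case end.
    edges = [case_start]
    for _, start, end in bouts:
        edges.extend([start, end])
    edges.append(case_end)
    gaps = [edges[i + 1] - edges[i] for i in range(0, len(edges), 2)]

    # A surgical pause is a gap of at least pause_seconds.
    pauses = sum(1 for gap in gaps if gap >= pause_seconds)

    # Dead time: share of recorded time with no instrument in use.
    dead_fraction = sum(gaps) / (case_end - case_start)
    return {"swaps": swaps, "pauses": pauses, "dead_fraction": dead_fraction}


# Hypothetical 100-second excerpt.
example_bouts = [
    ("forceps", 0.0, 30.0),
    ("nav suction", 32.0, 50.0),
    ("forceps", 58.0, 90.0),
    ("forceps", 91.0, 100.0),  # same instrument resumes: not a swap
]
metrics = workflow_metrics(example_bouts, case_start=0.0, case_end=100.0)
print(metrics["swaps"], metrics["pauses"])  # 2 1
```

Whether a short gap followed by the same instrument should count as one bout or two, and whether time before the first instrument counts as dead time, are definitional choices this sketch simply assumes. The ontology and annotation protocol are the place to settle them.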
Links
- Start-here — read this first
- validation-plan · detection-ontology · annotation-protocol · p2-evaluation-plan