# YOLO Model Improvement Analysis

Project: Pharyvac FESS Computer Vision
Date: 2026-03-28
Data: 16 bilateral full primary FESS cases (Nov 2025, Mar 2026)
Current Performance
| Metric | Value |
|---|---|
| Overall Precision | 92.4% |
| Overall Recall | 85.3% |
| F1 Score (derived) | ~88.7% |
The model is tuned conservatively: high precision means few false positives, but the 85.3% recall indicates that ~15% of instrument-present frames go undetected. These misses surface directly as unlabelled frames in the dataset.
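The derived F1 figure follows from the harmonic mean of precision and recall. A minimal check, using only the two values from the table above:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both as fractions in [0, 1])."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Figures taken from the performance table above.
precision = 0.924
recall = 0.853

f1 = f1_score(precision, recall)
miss_rate = 1 - recall  # share of true instrument instances the model misses

print(f"F1 = {f1:.1%}, miss rate = {miss_rate:.1%}")
# prints: F1 = 88.7%, miss rate = 14.7%
```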
The Unlabelled Frames Problem
Unlabelled frames account for a mean of 25.4% of total recorded time (~41 minutes per case). This is the single largest opportunity for improvement.
| Case | Unlabelled % | Recorded Time | Notes |
|---|---|---|---|
| 1 | 33.2% | 1:59:15 | Highest; early case, model may not have been tuned yet |
| 2 | 29.6% | 4:35:53 | Long complex case |
| 3 | 22.2% | 3:49:15 | |
| 4 | 26.3% | 3:32:14 | |
| 5 | 21.3% | 3:12:00 | |
| 6 | 24.4% | 1:42:17 | |
| 7 | 22.2% | 3:17:30 | |
| 8 | 31.4% | 2:15:51 | Second highest |
| 9 | 22.2% | 1:48:12 | |
| 10 | 18.3% | 3:23:21 | Best performance |
| 11 | 27.4% | 1:45:53 | |
| 12 | 29.4% | 2:29:55 | |
| 13 | 20.9% | 3:13:55 | |
| 14 | 26.7% | 1:33:39 | |
| 15 | 24.4% | 2:46:47 | |
| 16 | 26.7% | 1:59:06 | |
Key observation: Unlabelled frame percentage does not correlate strongly with case length or complexity. This suggests the issue is primarily about frame-level detection quality (occlusion, angle, lighting) rather than fatigue or procedural phase.
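The summary figures can be re-derived from the table. The sketch below copies the sixteen rows as they appear above and computes the mean unlabelled share, the mean unlabelled minutes per case, and a Pearson correlation against recorded time. Case complexity is not in the table, so only the length half of the observation can be checked this way. The helper names are illustrative.

```python
from math import sqrt

# (unlabelled %, recorded time "H:MM:SS") per case, copied from the table above.
CASES = [
    (33.2, "1:59:15"), (29.6, "4:35:53"), (22.2, "3:49:15"), (26.3, "3:32:14"),
    (21.3, "3:12:00"), (24.4, "1:42:17"), (22.2, "3:17:30"), (31.4, "2:15:51"),
    (22.2, "1:48:12"), (18.3, "3:23:21"), (27.4, "1:45:53"), (29.4, "2:29:55"),
    (20.9, "3:13:55"), (26.7, "1:33:39"), (24.4, "2:46:47"), (26.7, "1:59:06"),
]


def to_minutes(hms: str) -> float:
    """Convert an H:MM:SS string to minutes."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 60 + m + s / 60


def pearson_r(xs, ys):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


pcts = [pct for pct, _ in CASES]
minutes = [to_minutes(t) for _, t in CASES]

mean_pct = sum(pcts) / len(pcts)
mean_unlabelled_min = sum(p / 100 * m for p, m in zip(pcts, minutes)) / len(CASES)
r_pct_vs_length = pearson_r(pcts, minutes)

print(f"mean unlabelled: {mean_pct:.1f}% (~{mean_unlabelled_min:.0f} min per case)")
print(f"Pearson r, unlabelled % vs recorded time: {r_pct_vs_length:.2f}")
```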
Unlabelled Frames vs Instrument Swaps: A Counterintuitive Finding
There is a negative correlation (r = -0.54) between unlabelled frame percentage and instrument swap count. When controlling for case duration, this strengthens to r = -0.67. In other words, cases with more instrument swaps have fewer unlabelled frames.
On reflection, this makes sense: more swaps mean instruments are constantly entering and leaving the frame, giving YOLO more opportunities to detect something. The hardest frames to classify aren’t the busy, instrument-heavy ones; they’re the quieter moments between active instrument use, where the surgeon is repositioning, examining the field, or using their hands without a recognizable instrument. This directly informs which frames are most valuable to label for retraining (see Swap Count Artifact Analysis).
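"Controlling for case duration" is commonly done with a first-order partial correlation. The source does not say which method produced r = -0.67, so the following is a sketch of one standard approach, not a reconstruction of the original analysis. The two duration correlations are placeholders, since they are not reported here.

```python
from math import sqrt


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """First-order partial correlation of x and y, controlling for z."""
    denom = sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
    if denom == 0:
        raise ValueError("x or y is perfectly correlated with z")
    return (r_xy - r_xz * r_yz) / denom


# x = unlabelled %, y = swap count, z = case duration.
r_unlabelled_swaps = -0.54      # reported above
r_unlabelled_duration = 0.0     # PLACEHOLDER: not reported in this document
r_swaps_duration = 0.0          # PLACEHOLDER: not reported in this document

adjusted = partial_correlation(
    r_unlabelled_swaps, r_unlabelled_duration, r_swaps_duration
)
print(adjusted)
```

With both placeholders at zero the function simply returns the raw correlation; the adjusted value only departs from it once the measured duration correlations are filled in.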
Likely Sources of Missed Detections
1. Instrument Transitions (~40% of unlabelled frames, estimated)
When the surgeon swaps instruments, there are frames where:
- One instrument is leaving the frame and another entering
- The surgeon’s hand fully occludes the instrument
- Instruments are partially in frame (tip only, handle only)
Note: The raw swap count (mean 434/case) is likely inflated 2-4x by detection flickering artifacts. See Swap Count Artifact Analysis for full details. The true physical swap count is probably 100-200 per case.
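To see how flickering inflates the count, consider a swap counter that increments whenever the detected class changes between consecutive labelled frames. How the raw swap count was actually computed is not stated here, so this definition is an assumption, and the frame sequences and the `suction` class name are synthetic.

```python
from itertools import groupby


def count_swaps(labels):
    """Count changes of instrument class, ignoring unlabelled (None) frames."""
    detected = [lab for lab in labels if lab is not None]
    runs = [key for key, _ in groupby(detected)]
    return max(len(runs) - 1, 0)


# Synthetic example: one physical swap (forceps -> nav_suction) ...
clean = ["forceps"] * 6 + [None] * 2 + ["nav_suction"] * 6
# ... versus the same footage with two single-frame misclassifications.
flickery = list(clean)
flickery[10] = "suction"
flickery[12] = "suction"

print(count_swaps(clean), count_swaps(flickery))
# prints: 1 5
```

Two misclassified frames turn one physical swap into five counted swaps, which is the kind of inflation described in the note above.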
Recommendation: Create a dedicated “transition” or “hand-only” class rather than leaving these unlabelled. Even if not useful for instrument time tracking, it would reduce true unlabelled frames and give a cleaner picture of what the model actually can’t see.
2. Small / Thin Instruments
The nav probe and suction bovie have the lowest time proportions (3.8% and 2.1% respectively) and are physically the thinnest instruments. They likely have fewer training examples and are harder to distinguish from each other and from suction tips.
Supporting evidence from swap analysis: swap count correlates with nav suction time at r = 0.83, the highest of any instrument. Nav suction is thin and visually similar to non-nav suction and nav probe. This strong correlation likely reflects detection flickering between suction-like classes, suggesting these instruments are where the model struggles most with class boundaries.
| Instrument | Correlation with swap count | Interpretation |
|---|---|---|
| Nav suction | r = 0.83 | Model likely flickers between suction classes |
| Forceps | r = -0.43 | Large/distinctive → stable detection |
| Nav probe | r = -0.57 | Cases with more nav probe time have fewer swaps overall |
| Microdebrider | r = -0.21 | Moderate distinctiveness |
Recommendation: Audit per-class precision and recall. If the nav probe and suction bovie have significantly lower recall, targeted data augmentation (rotation, scale, brightness jitter) on these classes would help. Also check whether these two are being confused with each other; a confusion matrix would reveal this.
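A per-class audit can start from paired ground-truth and predicted labels on a hand-labelled validation set. The sketch below works at frame level (one label per frame) and does not do box-level IoU matching, which a full detection evaluation would need. The label lists are synthetic, and `missed` is a hypothetical marker for frames where the model returned nothing.

```python
from collections import Counter


def per_class_metrics(y_true, y_pred):
    """Return ({class: (precision, recall)}, confusion Counter keyed by (true, pred))."""
    confusion = Counter(zip(y_true, y_pred))
    metrics = {}
    for c in sorted(set(y_true)):
        tp = confusion[(c, c)]
        predicted = sum(n for (_, p), n in confusion.items() if p == c)
        actual = sum(n for (t, _), n in confusion.items() if t == c)
        precision = tp / predicted if predicted else float("nan")
        recall = tp / actual if actual else float("nan")
        metrics[c] = (precision, recall)
    return metrics, confusion


# Synthetic validation labels, four frames per class.
y_true = ["nav_probe"] * 4 + ["suction_bovie"] * 4 + ["forceps"] * 4
y_pred = [
    "nav_probe", "nav_probe", "suction_bovie", "missed",
    "suction_bovie", "suction_bovie", "nav_probe", "suction_bovie",
    "forceps", "forceps", "forceps", "forceps",
]

metrics, confusion = per_class_metrics(y_true, y_pred)
for cls, (p, r) in metrics.items():
    print(f"{cls:14s} precision={p:.2f} recall={r:.2f}")
print("nav_probe predicted as suction_bovie:", confusion[("nav_probe", "suction_bovie")])
```

Off-diagonal entries of `confusion` are the interesting part: a large count for a pair of thin instruments would support the class-confusion hypothesis.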
3. GoPro Perspective & Motion Blur
The GoPro mounted on the surgeon captures a wide-angle, first-person view. This means:
- Instruments at the periphery of the frame are distorted by the fisheye lens
- Rapid head movements cause motion blur
- The endoscopic monitor (if in frame) creates a competing visual signal
Recommendation:
- Apply lens distortion correction as a preprocessing step before inference
- Add motion-blurred samples to training data (or apply motion blur augmentation)
- Mask out the endoscopic monitor region if it’s consistently in the same position
4. Lighting Variability
OR lighting changes throughout a case: overhead lights get adjusted, and the endoscope light reflects off instruments differently depending on angle.
Recommendation: Aggressive brightness/contrast augmentation during training. Consider histogram equalization as a preprocessing step.
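For reference, global histogram equalization remaps grey levels through the cumulative histogram so that the output spans the full range. The dependency-free sketch below operates on a flat list of 8-bit values to show the mechanics; in a real pipeline one would more likely apply a library routine such as OpenCV's `cv2.equalizeHist` to the luminance channel. The sample pixel values are made up.

```python
def equalize_histogram(pixels, levels=256):
    """Histogram-equalize a flat list of integer grey values in [0, levels)."""
    hist = [0] * levels
    for p in pixels:
        hist[p] += 1

    # Cumulative distribution of grey levels.
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    total = len(pixels)
    if total == cdf_min:  # constant image: nothing to stretch
        return list(pixels)

    # Map each level so the darkest used level goes to 0 and the brightest to levels-1.
    scale = (levels - 1) / (total - cdf_min)
    lut = [max(0, round((c - cdf_min) * scale)) for c in cdf]
    return [lut[p] for p in pixels]


# A dim, low-contrast strip of pixels (synthetic values).
dim = [50, 50, 52, 54, 54, 56, 58, 60]
out = equalize_histogram(dim)
print(min(dim), max(dim), "->", min(out), max(out))
```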
Building the Transition Class: Detailed Approach
The “transition” class is the single highest-impact improvement available. Here’s how to implement it properly.
The Challenge
“Transition” isn’t a single visual pattern like “forceps” is. A transition frame could look like many different things: a bare gloved hand, an instrument halfway out of frame, two instruments overlapping during a handoff, or just the surgical field with nothing in it.
Step 1: Categorize the Unlabelled Frames
Pull 50-100 currently unlabelled frames at random and manually sort them into buckets. You’ll likely find they cluster into:
- Hand-only: no instrument visible, just a gloved hand (~40% estimated)
- Instrument entering/leaving: partial visibility during handoff (~30%)
- Empty field: surgical field with no instrument, surgeon is looking but not working (~20%)
- Genuinely ambiguous: motion blur, heavy occlusion (~10%)
The relative proportions tell you whether a single “transition” class makes sense or whether you’d benefit from 2-3 sub-classes.
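Drawing the review sample reproducibly and tallying the manual buckets takes only a few lines. In this sketch the frame indices and the bucket names (`hand_only`, `entering_leaving`, `empty_field`, `ambiguous`) are hypothetical stand-ins for whatever the labelling tool exports.

```python
import random
from collections import Counter


def sample_frames(unlabelled_indices, k=100, seed=0):
    """Draw a reproducible random sample of frame indices for manual review."""
    rng = random.Random(seed)
    k = min(k, len(unlabelled_indices))
    return sorted(rng.sample(unlabelled_indices, k))


def bucket_shares(manual_labels):
    """Fraction of reviewed frames falling into each bucket, largest first."""
    counts = Counter(manual_labels)
    total = sum(counts.values())
    return {bucket: n / total for bucket, n in counts.most_common()}


# Hypothetical indices of unlabelled frames in one case.
unlabelled = list(range(0, 50_000, 7))
picked = sample_frames(unlabelled, k=100)

# Hypothetical result of manually sorting ten of the sampled frames.
reviewed = (
    ["hand_only"] * 4 + ["entering_leaving"] * 3 + ["empty_field"] * 2 + ["ambiguous"]
)
shares = bucket_shares(reviewed)
print(shares)
```

Fixing the seed means the same frames can be re-pulled later if a second reviewer needs to check the sorting.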
Step 2: Choose an Architectural Approach
Option A: Add detection classes to YOLO (simpler). Add “hand” and “no-instrument” as YOLO detection targets. The hand is a real object you can draw a bounding box around. “No-instrument” gets handled by a secondary classifier that runs only when YOLO returns no detections above threshold. This doesn’t change your existing pipeline.
Option B: Frame-level classifier on top of YOLO (more robust). After YOLO runs, a lightweight CNN or even a simple heuristic examines the frame and asks: “Is this a transition, or did YOLO genuinely find an instrument?” This two-stage approach can use global frame features (overall brightness, motion magnitude, hand presence) rather than relying solely on bounding boxes.
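The control flow shared by both options (run the second stage only when the detector has nothing confident to say) might look like the sketch below. `label_frame`, the `(class, confidence)` tuple format, and the stand-in fallback are all hypothetical; they are not part of any YOLO library API.

```python
from typing import Callable, List, Tuple

Detection = Tuple[str, float]  # (class name, confidence)


def label_frame(
    detections: List[Detection],
    fallback: Callable[[], str],
    conf_threshold: float = 0.5,
) -> str:
    """Return the top detection's class, or defer to the fallback classifier."""
    confident = [d for d in detections if d[1] >= conf_threshold]
    if confident:
        return max(confident, key=lambda d: d[1])[0]
    # No detection cleared the threshold: only now is the second stage run.
    return fallback()


def no_instrument_classifier() -> str:
    """Stand-in for the secondary classifier; always answers 'no-instrument'."""
    return "no-instrument"


print(label_frame([("forceps", 0.91), ("hand", 0.62)], no_instrument_classifier))
print(label_frame([("nav_probe", 0.22)], no_instrument_classifier))
# prints: forceps
# prints: no-instrument
```

Passing the fallback as a callable keeps the secondary model lazy, so it costs nothing on frames where the detector is already confident.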
Step 3: Temporal Smoothing (Fastest Win — No Retraining Needed)
Before retraining anything, you can reclaim a large chunk of unlabelled frames with post-processing logic:
    If unlabelled gap < N frames AND same instrument on both sides:
        → label as that instrument (brief occlusion during use)
    If unlabelled gap < N frames AND different instruments on both sides:
        → label as "transition"
    If unlabelled gap >= N frames:
        → keep as unlabelled (genuinely different activity)
At 30fps, N=15 (0.5 seconds) is conservative and safe. N=30 (1 second) is more aggressive but probably valid, since physical swaps rarely take less than a second. This same smoothing logic also fixes the swap count inflation problem; see Swap Count Artifact Analysis.
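The three rules above translate almost directly into code. The sketch assumes the per-frame output is a list with one class name per frame and `None` for unlabelled frames; that representation, the function name, and the decision to leave gaps at the very start or end of a video untouched are choices made for this illustration.

```python
from typing import List, Optional

TRANSITION = "transition"


def smooth_gaps(labels: List[Optional[str]], max_gap: int = 15) -> List[Optional[str]]:
    """Fill short unlabelled runs (None) according to the three rules above."""
    out = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] is not None:
            i += 1
            continue
        # Find the end of this unlabelled run: frames i .. j-1 are None.
        j = i
        while j < n and labels[j] is None:
            j += 1
        gap = j - i
        has_both_sides = i > 0 and j < n
        if has_both_sides and gap < max_gap:
            before, after = labels[i - 1], labels[j]
            fill = before if before == after else TRANSITION
            out[i:j] = [fill] * gap
        # Long gaps, and gaps touching the start or end of the video, stay None.
        i = j
    return out


# Tiny synthetic sequence; max_gap=3 keeps the example readable.
seq = (
    ["forceps"] * 3 + [None] * 2 + ["forceps"] * 3      # short gap, same instrument
    + [None] * 2 + ["nav_suction"] * 2                  # short gap, different instruments
    + [None] * 5 + ["nav_suction"]                      # long gap
)
smoothed = smooth_gaps(seq, max_gap=3)
print(smoothed)
```

In the example, the first gap is absorbed into `forceps`, the second becomes `transition`, and the five-frame gap stays unlabelled because it is not shorter than `max_gap`.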
Step 4: Training Data Strategy
The negative correlation between unlabelled % and swaps helps prioritize labeling effort:
- Cases 1 and 8 (33% and 31% unlabelled): richest sources of transition examples
- Cases 10 and 13 (18% and 21% unlabelled): where the model already performs best, good “clean” training data for instrument classes
Aim for 500-1000 transition frames, sampled disproportionately from the high-unlabelled cases. Suggested distribution: ~40% hand-only, ~30% instrument entering/leaving, ~20% empty field, ~10% ambiguous.
Step 5: Evaluation Impact
Adding a transition class changes what “recall” means. Currently, an unlabelled frame isn’t penalized; it’s just missing data. Once transition exists as a class, a frame that should be “transition” but gets labelled “forceps” becomes a false positive for forceps. Precision may dip slightly even as overall coverage improves.
Track a new metric alongside precision/recall: frame coverage rate (% of frames assigned any label, including transition). Current coverage: ~75%. Target: >90%.
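Frame coverage rate is simple to compute once per-frame labels are available. The sketch assumes one label per frame with `None` marking unlabelled frames; the example sequence is synthetic.

```python
def frame_coverage(labels):
    """Share of frames that carry any label (None marks an unlabelled frame)."""
    if not labels:
        return 0.0
    return sum(lab is not None for lab in labels) / len(labels)


# Synthetic ten-frame clip: transition frames count as covered.
clip = ["forceps"] * 6 + ["transition"] * 2 + [None] * 2
coverage = frame_coverage(clip)
print(f"coverage = {coverage:.0%}")
# prints: coverage = 80%
```

By this definition the current figure is the complement of the unlabelled share: 100% - 25.4% ≈ 75%.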
Improving Recall Without Sacrificing Precision
Short-term (next 5 cases)
- Lower confidence threshold selectively. If you’re using a single confidence threshold (e.g., 0.5) for all classes, try per-class thresholds. The microdebrider (large, distinctive) can afford a higher threshold; nav probe may need a lower one.
- Add a “transition” class. Label 200-300 frames of instrument swaps from existing footage. This alone could reclaim 5-10% of currently unlabelled time.
- Temporal smoothing. Apply a minimum bout duration filter (see above). This is a simple post-processing step that could boost effective recall by 3-5% and simultaneously fix the swap count inflation.
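Per-class thresholds, as suggested in the first bullet, can be applied as a post-filter on raw detections. The threshold values below are illustrative only, and the `(class, confidence)` tuple format is an assumption about how detections are exported.

```python
DEFAULT_THRESHOLD = 0.5

# Illustrative values only; real thresholds should come from per-class PR curves.
CLASS_THRESHOLDS = {
    "microdebrider": 0.6,   # large, distinctive: can afford to be stricter
    "nav_probe": 0.35,      # thin, hard to see: accept lower confidence
}


def filter_detections(detections, thresholds=CLASS_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep (class, confidence) pairs that clear their class-specific threshold."""
    return [
        (cls, conf)
        for cls, conf in detections
        if conf >= thresholds.get(cls, default)
    ]


raw = [("microdebrider", 0.55), ("nav_probe", 0.40), ("forceps", 0.45), ("forceps", 0.70)]
kept = filter_detections(raw)
print(kept)
# prints: [('nav_probe', 0.4), ('forceps', 0.7)]
```

For this to have any effect, inference has to run with a global confidence cutoff at or below the lowest per-class value; otherwise the low-confidence nav probe candidates never reach the filter.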
Medium-term (next 10-20 cases)
- Per-class recall audit. Break out precision/recall by instrument class. This is the most important diagnostic you don’t currently have.
- Confusion matrix. Which instruments get confused with each other? The r = 0.83 correlation between swaps and nav suction time strongly suggests suction-class confusion is a primary issue.
- Active learning. Use the model’s own low-confidence predictions to prioritize which frames to hand-label next. The frames where the model is uncertain (confidence 0.3-0.5) are the most valuable training data.
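The active learning bullet amounts to a selection rule over per-frame confidences. A minimal sketch, assuming a mapping from frame id to the top detection confidence for that frame (hypothetical data below), with a labelling budget as an added parameter:

```python
def select_for_labelling(frame_confidences, low=0.3, high=0.5, budget=200):
    """Pick up to `budget` frames whose top confidence lies in [low, high].

    Frames are returned lowest confidence first, ties broken by frame id.
    """
    candidates = [
        (conf, frame_id)
        for frame_id, conf in frame_confidences.items()
        if low <= conf <= high
    ]
    candidates.sort()
    return [frame_id for _, frame_id in candidates[:budget]]


# Hypothetical top confidence per frame id.
confs = {10: 0.92, 11: 0.45, 12: 0.31, 13: 0.12, 14: 0.50}
queue = select_for_labelling(confs, budget=2)
print(queue)
# prints: [12, 11]
```

Frames below the band (such as frame 13 here) are excluded on the assumption that they are mostly background; whether that holds for this footage is worth checking on a small sample before committing labelling time.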
Long-term
- Temporal model. Move from per-frame YOLO to a video-aware architecture (e.g., YOLO + LSTM, or a transformer-based approach) that uses temporal context. Surgeons don’t teleport instruments: knowing what was in the last 5 frames massively constrains what can be in the current frame.
- Multi-camera fusion. If a second angle (e.g., endoscopic camera feed) is available, fusing detections from both views would dramatically reduce occlusion-based misses.
Suggested Metrics to Track Going Forward
| Metric | Current | Target | Why |
|---|---|---|---|
| Overall Precision | 92.4% | >90% (maintain) | Don’t sacrifice this |
| Overall Recall | 85.3% | >92% | Close the gap |
| Unlabelled frame % | 25.4% | <15% | Direct impact on data quality |
| Frame coverage rate | ~75% | >90% | Includes transition class |
| Per-class recall (min) | Unknown | >80% all classes | No instrument should be a blind spot |
| F1 Score | ~88.7% | >91% | Balanced performance |
| Smoothed swap count | Unknown | Track after implementing | True physical swaps |