YOLO Model Improvement Analysis

Project: Pharyvac FESS Computer Vision
Date: 2026-03-28
Data: 16 bilateral full primary FESS cases (Nov 2025, Mar 2026)

Current Performance

Metric	Value
Overall Precision	92.4%
Overall Recall	85.3%
F1 Score (derived)	~88.7%

The model is tuned conservatively: high precision means few false positives, but the 85.3% recall indicates that ~15% of instrument-present frames go undetected. These missed detections contribute directly to the unlabelled frames in the dataset.
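
The derived F1 figure in the table is the harmonic mean of the two reported values. A quick check, using only the precision and recall stated above:

```python
# Harmonic mean of the reported overall precision and recall.
precision = 0.924
recall = 0.853

f1 = 2 * precision * recall / (precision + recall)
print(f"F1 = {f1:.1%}")  # F1 = 88.7%
```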

The Unlabelled Frames Problem

Unlabelled frames account for a mean of 25.4% of total recorded time (~41 minutes per case). This is the single largest opportunity for improvement.

Case	Unlabelled %	Recorded Time	Notes
1	33.2%	1:59:15	Highest, early case, model may not have been tuned
2	29.6%	4:35:53	Long complex case
3	22.2%	3:49:15
4	26.3%	3:32:14
5	21.3%	3:12:00
6	24.4%	1:42:17
7	22.2%	3:17:30
8	31.4%	2:15:51	Second highest
9	22.2%	1:48:12
10	18.3%	3:23:21	Best performance
11	27.4%	1:45:53
12	29.4%	2:29:55
13	20.9%	3:13:55
14	26.7%	1:33:39
15	24.4%	2:46:47
16	26.7%	1:59:06

Key observation: Unlabelled frame percentage does not correlate strongly with case length or complexity. This suggests the issue is primarily about frame-level detection quality (occlusion, angle, lighting) rather than fatigue or procedural phase.
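
The case-length half of this observation can be checked against the table itself. The sketch below copies the 16 rows from the table and computes a plain Pearson correlation between unlabelled percentage and recorded minutes. The helper names (`to_minutes`, `pearson_r`) are illustrative, and complexity is not covered because the table carries no complexity score.

```python
import math

# (unlabelled %, recorded time "H:MM:SS") for cases 1-16, copied from the table.
CASES = [
    (33.2, "1:59:15"), (29.6, "4:35:53"), (22.2, "3:49:15"), (26.3, "3:32:14"),
    (21.3, "3:12:00"), (24.4, "1:42:17"), (22.2, "3:17:30"), (31.4, "2:15:51"),
    (22.2, "1:48:12"), (18.3, "3:23:21"), (27.4, "1:45:53"), (29.4, "2:29:55"),
    (20.9, "3:13:55"), (26.7, "1:33:39"), (24.4, "2:46:47"), (26.7, "1:59:06"),
]


def to_minutes(hms: str) -> float:
    """Convert an H:MM:SS string to minutes."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 60 + minutes + seconds / 60


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


unlabelled = [pct for pct, _ in CASES]
minutes = [to_minutes(t) for _, t in CASES]
r = pearson_r(unlabelled, minutes)

print(f"mean unlabelled: {sum(unlabelled) / len(unlabelled):.1f}%")
print(f"Pearson r (unlabelled % vs recorded minutes): {r:.2f}")
```

On these 16 rows the coefficient comes out weak and negative, which is consistent with the observation. With so few cases it should be read as indicative only.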

Unlabelled Frames vs Instrument Swaps: A Counterintuitive Finding

There is a negative correlation (r = -0.54) between unlabelled frame percentage and instrument swap count. When controlling for case duration, this strengthens to r = -0.67. In other words, cases with more instrument swaps have fewer unlabelled frames.

Counterintuitive as it first appears, this has a plausible explanation: more swaps mean instruments are constantly entering and leaving the frame, giving YOLO more opportunities to detect something. The hardest frames to classify aren’t the busy, instrument-heavy ones; they’re the quieter moments between active instrument use, where the surgeon is repositioning, examining the field, or using their hands without a recognizable instrument. This directly informs which frames are most valuable to label for retraining (see Swap Count Artifact Analysis).
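
"Controlling for case duration" is usually done with a first-order partial correlation. The sketch below assumes that is how the r = -0.67 figure was obtained, which the analysis does not state. It needs all three pairwise correlations, and only the unlabelled-vs-swaps value (-0.54) is reported here, so the other two must be computed from the case data.

```python
import math


def partial_correlation(r_xy: float, r_xz: float, r_yz: float) -> float:
    """Correlation of x and y after removing the linear effect of z.

    Here x = unlabelled %, y = swap count, z = case duration (assumed mapping).
    """
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))


# Sanity check: if duration is unrelated to both variables,
# the partial correlation equals the raw correlation.
print(partial_correlation(-0.54, 0.0, 0.0))  # -0.54
```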

Likely Sources of Missed Detections

1. Instrument Transitions (~40% of unlabelled frames, estimated)

When the surgeon swaps instruments, there are frames where:

One instrument is leaving the frame and another entering
The surgeon’s hand fully occludes the instrument
Instruments are partially in frame (tip only, handle only)

Note: The raw swap count (mean 434/case) is likely inflated 2-4x by detection flickering artifacts. See Swap Count Artifact Analysis for full details. The true physical swap count is probably 100-200 per case.

Recommendation: Create a dedicated “transition” or “hand-only” class rather than leaving these unlabelled. Even if not useful for instrument time tracking, it would reduce true unlabelled frames and give a cleaner picture of what the model actually can’t see.

2. Small / Thin Instruments

The nav probe and suction bovie have the lowest time proportions (3.8% and 2.1% respectively) and are physically the thinnest instruments. They likely have fewer training examples and are harder to distinguish from each other and from suction tips.

Supporting evidence from swap analysis: Instrument swaps correlate r = 0.83 with nav suction time, the highest of any instrument. Nav suction is thin and visually similar to non-nav suction and nav probe. This strong correlation likely reflects detection flickering between suction-like classes, suggesting these instruments are where the model struggles most with class boundaries.

Instrument	Correlation with swap count	Interpretation
Nav suction	r = 0.83	Model likely flickers between suction classes
Forceps	r = -0.43	Large/distinctive → stable detection
Nav probe	r = -0.57	Cases with more nav probe time have fewer swaps overall
Microdebrider	r = -0.21	Moderate distinctiveness

Recommendation: Audit per-class precision and recall. If the nav probe and suction bovie have significantly lower recall, targeted data augmentation (rotation, scale, brightness jitter) on these classes would help. Consider also whether these two are being confused with each other; a confusion matrix would reveal this.
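
A minimal sketch of the audit, assuming frame-level ground-truth and predicted classes are available. The class names and counts below are hypothetical placeholders, not measured results. A real detection confusion matrix would also include a background or "no detection" row and column, which is omitted here for brevity.

```python
# Hypothetical classes and counts, for illustration only (not measured data).
CLASSES = ["nav_probe", "suction_bovie", "nav_suction"]

# confusion[i][j] = frames whose true class is CLASSES[i]
# and whose predicted class is CLASSES[j].
confusion = [
    [40, 8, 2],
    [6, 30, 4],
    [1, 3, 96],
]


def per_class_metrics(matrix, names):
    """Return {class_name: (precision, recall)} from a square confusion matrix."""
    metrics = {}
    for k, name in enumerate(names):
        true_positives = matrix[k][k]
        predicted_as_k = sum(row[k] for row in matrix)  # column sum
        actually_k = sum(matrix[k])                     # row sum
        precision = true_positives / predicted_as_k if predicted_as_k else 0.0
        recall = true_positives / actually_k if actually_k else 0.0
        metrics[name] = (precision, recall)
    return metrics


for name, (p, r) in per_class_metrics(confusion, CLASSES).items():
    print(f"{name:14s} precision={p:.2f} recall={r:.2f}")
```

Large off-diagonal counts between two classes (for example, the first two rows in this made-up matrix) are the signature of mutual confusion.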

3. GoPro Perspective & Motion Blur

The GoPro mounted on the surgeon captures a wide-angle, first-person view. This means:

Instruments at the periphery of the frame are distorted by the fisheye lens
Rapid head movements cause motion blur
The endoscopic monitor (if in frame) creates a competing visual signal

Recommendation:

Apply lens distortion correction as a preprocessing step before inference
Add motion-blurred samples to training data (or apply motion blur augmentation)
Mask out the endoscopic monitor region if it’s consistently in the same position

4. Lighting Variability

OR lighting changes throughout a case: overhead lights get adjusted, and the endoscope light reflects off instruments differently depending on angle.

Recommendation: Apply aggressive brightness/contrast augmentation during training. Consider histogram equalization as a preprocessing step.

Building the Transition Class: Detailed Approach

The “transition” class is the single highest-impact improvement available. Here’s how to implement it properly.

The Challenge

“Transition” isn’t a single visual pattern like “forceps” is. A transition frame could look like many different things: a bare gloved hand, an instrument halfway out of frame, two instruments overlapping during a handoff, or just the surgical field with nothing in it.

Step 1: Categorize the Unlabelled Frames

Pull 50-100 currently unlabelled frames at random and manually sort them into buckets. You’ll likely find they cluster into:

Hand-only: no instrument visible, just a gloved hand (~40% estimated)
Instrument entering/leaving: partial visibility during handoff (~30%)
Empty field: surgical field with no instrument, surgeon is looking but not working (~20%)
Genuinely ambiguous: motion blur, heavy occlusion (~10%)

The relative proportions tell you whether a single “transition” class makes sense or whether you’d benefit from 2-3 sub-classes.

Step 2: Choose an Architectural Approach

Option A: Add detection classes to YOLO (simpler)

Add “hand” and “no-instrument” as YOLO detection targets. The hand is a real object you can draw a bounding box around. “No-instrument” gets handled by a secondary classifier that runs only when YOLO returns no detections above threshold. This doesn’t change your existing pipeline.

Option B: Frame-level classifier on top of YOLO (more robust)

After YOLO runs, a lightweight CNN or even a simple heuristic examines the frame and asks: “Is this a transition, or did YOLO genuinely find an instrument?” This two-stage approach can use global frame features (overall brightness, motion magnitude, hand presence) rather than relying solely on bounding boxes.
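
The control flow shared by both options (use YOLO when it is confident, otherwise defer to a second stage) can be sketched as follows. `label_frame`, the `(class name, confidence)` detection format, and the `fallback` callable are hypothetical names for illustration; the fallback stands in for whatever secondary classifier or heuristic is chosen.

```python
from typing import Callable, Sequence, Tuple

Detection = Tuple[str, float]  # (class name, confidence); assumed format


def label_frame(
    detections: Sequence[Detection],
    fallback: Callable[[], str],
    threshold: float = 0.5,
) -> str:
    """Return the most confident instrument label, or defer to the fallback stage."""
    confident = [d for d in detections if d[1] >= threshold]
    if confident:
        return max(confident, key=lambda d: d[1])[0]
    # No detection cleared the threshold: let the second stage decide
    # (e.g. "hand-only", "empty-field", or "transition").
    return fallback()


print(label_frame([("forceps", 0.91), ("suction", 0.55)], lambda: "transition"))  # forceps
print(label_frame([("forceps", 0.20)], lambda: "hand-only"))                      # hand-only
```

Because the second stage only runs on frames YOLO could not label, existing instrument detections are left untouched.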

Step 3: Temporal Smoothing (Fastest Win — No Retraining Needed)

Before retraining anything, you can reclaim a large chunk of unlabelled frames with post-processing logic:

If unlabelled gap < N frames AND same instrument on both sides:
    → label as that instrument (brief occlusion during use)

If unlabelled gap < N frames AND different instruments on both sides:
    → label as "transition"

If unlabelled gap >= N frames:
    → keep as unlabelled (genuinely different activity)

At 30fps, N=15 (0.5 seconds) is conservative and safe. N=30 (1 second) is more aggressive but probably valid: physical swaps rarely take less than a second, so a shorter gap flanked by the same instrument is almost certainly a brief occlusion rather than a swap. This same smoothing logic also fixes the swap count inflation problem; see Swap Count Artifact Analysis.
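
A minimal Python sketch of the three rules above, assuming per-frame labels are stored as a list with `None` for unlabelled frames. The function name and the choice to leave gaps at the very start or end of a recording untouched are assumptions, not part of the original rules.

```python
from typing import List, Optional


def smooth_labels(labels: List[Optional[str]], max_gap: int = 15) -> List[Optional[str]]:
    """Fill unlabelled gaps shorter than max_gap frames using the neighbouring labels."""
    smoothed = list(labels)
    n = len(labels)
    i = 0
    while i < n:
        if labels[i] is not None:
            i += 1
            continue
        # Find the end of this run of unlabelled frames: labels[i:j].
        j = i
        while j < n and labels[j] is None:
            j += 1
        before = labels[i - 1] if i > 0 else None
        after = labels[j] if j < n else None
        # Gaps touching the start or end of the recording are left unlabelled.
        if (j - i) < max_gap and before is not None and after is not None:
            # Same instrument on both sides: brief occlusion. Otherwise: a swap.
            fill = before if before == after else "transition"
            smoothed[i:j] = [fill] * (j - i)
        i = j
    return smoothed


frames = (["forceps"] * 3 + [None] * 2 + ["forceps"] * 3
          + [None] * 2 + ["suction"] * 2 + [None] * 20 + ["suction"])
result = smooth_labels(frames, max_gap=15)
print(result[3:5])    # ['forceps', 'forceps']
print(result[8:10])   # ['transition', 'transition']
print(result[12:14])  # [None, None]
```

The function reads neighbours from the original labels rather than from already-filled values, so one filled gap never influences how the next gap is resolved.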

Step 4: Training Data Strategy

The negative correlation between unlabelled % and swaps helps prioritize labeling effort:

Cases 1 and 8 (33% and 31% unlabelled): richest sources of transition examples
Cases 10 and 13 (18% and 21% unlabelled): where the model already performs best; good “clean” training data for instrument classes

Aim for 500-1000 transition frames, sampled disproportionately from the high-unlabelled cases. Suggested distribution: ~40% hand-only, ~30% instrument entering/leaving, ~20% empty field, ~10% ambiguous.

Step 5: Evaluation Impact

Adding a transition class changes what “recall” means. Currently, an unlabelled frame isn’t penalized; it’s just missing data. Once transition exists as a class, a frame that should be “transition” but gets labelled “forceps” becomes a false positive for forceps. Precision may dip slightly even as overall coverage improves.

Track a new metric alongside precision/recall: frame coverage rate (% of frames assigned any label, including transition). Current coverage: ~75%. Target: >90%.
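
Frame coverage rate is simple to compute once per-frame labels exist. A minimal sketch, assuming labels are stored as a list with `None` for unlabelled frames (the function name is illustrative):

```python
from typing import Optional, Sequence


def frame_coverage(labels: Sequence[Optional[str]]) -> float:
    """Fraction of frames assigned any label, instrument or transition."""
    if not labels:
        return 0.0
    return sum(label is not None for label in labels) / len(labels)


print(frame_coverage(["forceps", None, "transition", "suction"]))  # 0.75

# The ~75% current figure is the complement of the 25.4% mean unlabelled rate.
print(f"{1 - 0.254:.1%}")  # 74.6%
```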

Improving Recall Without Sacrificing Precision

Short-term (next 5 cases)

Lower confidence threshold selectively. If you’re using a single confidence threshold (e.g., 0.5) for all classes, try per-class thresholds. The microdebrider (large, distinctive) can afford a higher threshold; nav probe may need a lower one.
Add “transition” class. Label 200-300 frames of instrument swaps from existing footage. This alone could reclaim 5-10% of currently unlabelled time.
Temporal smoothing. Apply a minimum bout duration filter (see above). This simple post-processing step could boost effective recall by 3-5% and simultaneously fix the swap count inflation.
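
The per-class threshold idea from the first item can be applied as a post-inference filter, without touching the model. In the sketch below the threshold values, class names, and the `(class name, confidence)` detection format are hypothetical. Real thresholds should be chosen from per-class precision/recall curves.

```python
# Hypothetical values for illustration only.
DEFAULT_THRESHOLD = 0.5
CLASS_THRESHOLDS = {
    "microdebrider": 0.60,  # large and distinctive: can afford a stricter cut-off
    "nav_probe": 0.35,      # thin and easily missed: accept lower confidence
}


def filter_detections(detections, thresholds=CLASS_THRESHOLDS, default=DEFAULT_THRESHOLD):
    """Keep (class_name, confidence) pairs that clear their class-specific threshold."""
    return [
        (name, conf)
        for name, conf in detections
        if conf >= thresholds.get(name, default)
    ]


raw = [("microdebrider", 0.55), ("nav_probe", 0.40), ("forceps", 0.45)]
print(filter_detections(raw))  # [('nav_probe', 0.4)]
```

For this to recover anything, the detector itself must be run with a confidence floor at or below the lowest per-class threshold.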

Medium-term (next 10-20 cases)

Per-class recall audit. Break out precision/recall by instrument class. This is the most important diagnostic you don’t currently have.
Confusion matrix. Which instruments get confused with each other? The r = 0.83 correlation between swaps and nav suction time strongly suggests suction-class confusion is a primary issue.
Active learning. Use the model’s own low-confidence predictions to prioritize which frames to hand-label next. The frames where the model is uncertain (confidence 0.3-0.5) are the most valuable training data.
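
The active-learning selection step can be sketched as a filter over per-frame confidences. The input format (frame index mapped to the top detection confidence) and the ranking rule (closest to the upper bound first) are assumptions made for illustration.

```python
def select_for_labeling(frame_confidences, low=0.3, high=0.5, budget=100):
    """Return up to `budget` frame indices whose top confidence lies in [low, high].

    Frames closest to `high` (assumed here to be the operating threshold)
    are returned first.
    """
    candidates = [
        (high - conf, idx)
        for idx, conf in frame_confidences.items()
        if low <= conf <= high
    ]
    candidates.sort()
    return [idx for _, idx in candidates[:budget]]


# Hypothetical per-frame top confidences.
confidences = {0: 0.92, 1: 0.45, 2: 0.31, 3: 0.12, 4: 0.49}
print(select_for_labeling(confidences, budget=2))  # [4, 1]
```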

Long-term

Temporal model. Move from per-frame YOLO to a video-aware architecture (e.g., YOLO + LSTM, or a transformer-based approach) that uses temporal context. Surgeons don’t teleport instruments: knowing what was in the last 5 frames massively constrains what can be in the current frame.
Multi-camera fusion. If a second angle (e.g., endoscopic camera feed) is available, fusing detections from both views would dramatically reduce occlusion-based misses.

Suggested Metrics to Track Going Forward

Metric	Current	Target	Why
Overall Precision	92.4%	>90% (maintain)	Don’t sacrifice this
Overall Recall	85.3%	>92%	Close the gap
Unlabelled frame %	25.4%	<15%	Direct impact on data quality
Frame coverage rate	~75%	>90%	Includes transition class
Per-class recall (min)	Unknown	>80% all classes	No instrument should be a blind spot
F1 Score	~88.7%	>91%	Balanced performance
Smoothed swap count	Unknown	Track after implementing	True physical swaps
