Overall results · 6 systems × all benchmarks
F1 per benchmark plus mean F1. Rows sorted by overall mean. Greener = better.
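The row ordering described above can be sketched as follows. This is a minimal illustration, assuming "mean F1" is the unweighted average of a system's per-benchmark F1 scores; the page does not state the weighting. The function name, system names, benchmark names, and scores are all hypothetical.

```python
from statistics import mean


def rank_by_mean_f1(f1_by_system):
    """Return (system, mean_f1) pairs sorted best first.

    Assumption: mean F1 is the unweighted average over benchmarks.
    """
    rows = [(name, mean(scores.values())) for name, scores in f1_by_system.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)


# Hypothetical scores, for illustration only (not results from this report).
example = {
    "system_a": {"bench_1": 0.90, "bench_2": 0.70},
    "system_b": {"bench_1": 0.85, "bench_2": 0.95},
}
ranking = rank_by_mean_f1(example)
```

If benchmarks differ a lot in size, a document-weighted mean would give a different ordering, so the averaging choice is worth stating next to the table.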
Ranking proof · one document, six systems side by side
Same input text, predictions overlaid. Panels are sorted by this document's F1: the top panel is the best system on this doc, the bottom panel is the worst. Green = correctly caught PHI, yellow = false positive, red = missed PHI (leakage).
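The color coding can be read as a three-way comparison between gold PHI spans and a system's predicted spans. The sketch below assumes exact matching on `(start, end)` character offsets. The page does not say whether matching is exact or overlap-based, so treat that as an assumption; `color_spans` is a hypothetical helper name.

```python
def color_spans(gold, predicted):
    """Map each span to an overlay color.

    Assumption: a prediction counts as correct only when its
    (start, end) offsets match a gold span exactly.
    """
    gold_set, pred_set = set(gold), set(predicted)
    colors = {}
    for span in gold_set | pred_set:
        if span in gold_set and span in pred_set:
            colors[span] = "green"   # correctly caught PHI
        elif span in pred_set:
            colors[span] = "yellow"  # false positive
        else:
            colors[span] = "red"     # missed PHI (leakage)
    return colors


# Hypothetical offsets, for illustration only.
overlay = color_spans(gold=[(0, 4), (10, 15)], predicted=[(0, 4), (20, 25)])
```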
Per-benchmark detail view (click to expand)
Per-system comparison
| System | P | R | F1 | Leak | Char leak | n |
|---|---|---|---|---|---|---|
Precision vs Recall
Leakage rate (lower = better)
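The page does not define how leakage is computed. One plausible reading of the "Leak" and "Char leak" columns is sketched below: span-level leakage as the share of gold PHI spans that no prediction touches, and character-level leakage as the share of gold PHI characters left uncovered. Both definitions and both function names are assumptions made for illustration.

```python
def span_leak_rate(gold_spans, predicted_spans):
    """Fraction of gold spans that no predicted span overlaps (assumed definition)."""
    if not gold_spans:
        return 0.0
    missed = sum(
        1
        for g_start, g_end in gold_spans
        if not any(p_start < g_end and g_start < p_end
                   for p_start, p_end in predicted_spans)
    )
    return missed / len(gold_spans)


def char_leak_rate(gold_spans, predicted_spans):
    """Fraction of gold PHI characters not covered by any prediction (assumed definition)."""
    gold_chars = set()
    for start, end in gold_spans:
        gold_chars.update(range(start, end))  # end offset is exclusive
    if not gold_chars:
        return 0.0
    covered = set()
    for start, end in predicted_spans:
        covered.update(range(start, end))
    return len(gold_chars - covered) / len(gold_chars)


# Hypothetical offsets: the second gold span is only half covered.
gold = [(0, 4), (10, 20)]
pred = [(0, 4), (10, 15)]
span_leak = span_leak_rate(gold, pred)
char_leak = char_leak_rate(gold, pred)
```

Under these definitions a partially redacted name counts as caught at span level but still leaks at character level, which would explain why the two columns can diverge.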
Per-document drill-down
Sampled docs where systems disagree most. Hover a span to see its label.
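How "disagree most" is measured is not stated. A simple candidate is the spread between the best and worst per-document F1 across systems. The sketch below uses that spread as an assumed disagreement score; `most_contested_docs` and the sample values are hypothetical.

```python
def most_contested_docs(f1_by_doc, k=5):
    """Return up to k document ids with the largest F1 spread across systems.

    Assumption: disagreement = max F1 minus min F1 on that document.
    """
    spread = {
        doc: max(scores.values()) - min(scores.values())
        for doc, scores in f1_by_doc.items()
    }
    return sorted(spread, key=spread.get, reverse=True)[:k]


# Hypothetical per-document F1 scores, for illustration only.
sample = {
    "doc_1": {"system_a": 0.9, "system_b": 0.2},
    "doc_2": {"system_a": 0.8, "system_b": 0.7},
    "doc_3": {"system_a": 1.0, "system_b": 0.5},
}
top_docs = most_contested_docs(sample, k=2)
```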