CustodianAI · PHI Benchmark Dashboard

Overall results · 6 systems × all benchmarks

F1 per benchmark plus mean F1. Rows sorted by overall mean. Greener = better.
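The ordering described above can be sketched as follows. This is a minimal illustration, assuming the mean is an unweighted average of per-benchmark F1 (the dashboard may weight benchmarks differently). The system names, benchmark names, scores, and the helper `rank_by_mean_f1` are all hypothetical placeholders, not values from the dashboard.

```python
from statistics import mean


def rank_by_mean_f1(f1_by_system):
    """Return (system, mean_f1) pairs sorted best first.

    Assumption: mean F1 is the unweighted average over benchmarks.
    """
    means = {
        name: mean(scores.values())
        for name, scores in f1_by_system.items()
    }
    # Highest mean F1 first, matching "rows sorted by overall mean".
    return sorted(means.items(), key=lambda item: item[1], reverse=True)


# Hypothetical scores, for illustration only.
f1_by_system = {
    "system_a": {"bench_1": 0.90, "bench_2": 0.80},
    "system_b": {"bench_1": 0.70, "bench_2": 0.95},
    "system_c": {"bench_1": 0.60, "bench_2": 0.50},
}

ranking = rank_by_mean_f1(f1_by_system)
for name, mean_f1 in ranking:
    print(f"{name}: mean F1 = {mean_f1:.3f}")
```

An unweighted mean treats every benchmark equally regardless of size; a document-weighted mean would be a reasonable alternative if benchmark sizes differ widely.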

Ranking proof · one document, six systems side by side

Same input text with each system's predictions overlaid. Panels are sorted by this document's F1: the top panel is the best system on this document and the bottom panel is the worst. Green = correctly caught PHI, yellow = false positive, red = missed PHI (leakage).
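The three colors correspond to true positives, false positives, and false negatives, from which precision, recall, F1, and a leakage rate can be derived. The sketch below is one way to compute them. It assumes spans are `(start, end)` character offsets and that a prediction counts only on an exact match, and it treats leakage as the fraction of gold PHI spans that were missed. The dashboard's actual matching rule and leakage definition may differ (for example, partial overlap or character-level counting). The function `score_spans` and the sample spans are hypothetical.

```python
def score_spans(gold, predicted):
    """Score predicted PHI spans against gold spans.

    Assumption: spans are (start, end) offsets and must match exactly.
    """
    gold, predicted = set(gold), set(predicted)
    tp = len(gold & predicted)   # green: correctly caught PHI
    fp = len(predicted - gold)   # yellow: false positive
    fn = len(gold - predicted)   # red: missed PHI (leakage)

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if precision + recall
        else 0.0
    )
    # Assumed definition: share of gold spans that slipped through.
    leak = fn / (tp + fn) if tp + fn else 0.0
    return {
        "tp": tp, "fp": fp, "fn": fn,
        "precision": precision, "recall": recall,
        "f1": f1, "leak": leak,
    }


# Hypothetical spans, for illustration only.
gold = [(0, 8), (15, 25), (40, 52), (60, 64)]
predicted = [(0, 8), (15, 25), (30, 35)]

result = score_spans(gold, predicted)
print(result)
```

Under this definition the leakage rate equals 1 minus recall, so a system can have high precision and still leak heavily.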

Per-benchmark detail view

Per-system comparison

System P R F1 Leak Char leak n

Precision vs Recall

Leakage rate (lower = better)

Per-document drill-down

Sampled docs where systems disagree most. Hover a span to see its label.