Residual PHI Risk in Clinical NLP: Why High F1 Scores Still Fail HIPAA Compliance

A de-identification model can achieve 97% F1 and still expose thousands of patients to re-identification risk. HIPAA does not evaluate token accuracy. It evaluates whether a real person can still be identified.

Timestamps:
00:00 The F1 score myth in clinical NLP
04:47 Structural failure modes in healthcare de-identification
11:20 Hybrid architectures for HIPAA-compliant AI systems

Vaibhava Lakshmi Ravideshik (AI & Growth Lead at GRAIL) breaks down the structural gap between NLP evaluation metrics and real-world healthcare privacy requirements. The central argument is uncomfortable but difficult to ignore: de-identification systems are optimized for the wrong objective. F1 scores measure token-level prediction accuracy, while HIPAA compliance depends on whether an entire patient record can still identify someone through context, rarity, or attribute combinations.

The presentation moves beyond benchmark optimism into deployment reality: distribution shift across hospitals, quasi-identifiers invisible to NER systems, probabilistic LLM outputs, and the operational difference between removing PHI tokens and preventing actual re-identification.

A layered architecture emerges as the practical answer: deterministic NER for Safe Harbor identifiers, on-premise open-source LLMs for contextual reasoning, vector similarity search for semantic rarity detection, statistical disclosure analysis, and human review with documented evidence trails.

This is less a talk about NLP performance and more a critique of how healthcare AI systems are currently evaluated, approved, and trusted.

📌 Applied Healthcare AI Summit 2026: what actually works in real-world healthcare AI, from pilots to production systems.

#HealthcareAI #ClinicalNLP #HIPAA #AICompliance #PrivacyEngineering #DeIdentification