HIR 09-006
Consortium for Healthcare Informatics Research - De-Identification
Matthew H. Samore, MD
VA Salt Lake City Health Care System, Salt Lake City, UT
Funding Period: February 2009 - January 2013
BACKGROUND/RATIONALE:
The privacy and confidentiality of a patient's health information are a cornerstone of the physician-patient relationship. Regulations protecting confidentiality require the patient's informed consent before their medical record is used for purposes other than their own health care, such as research. But obtaining informed consent from a large population of patients, especially in retrospective research, is difficult and costly. The informed consent requirement can be waived if the medical record is de-identified. To reduce the time and effort of manual de-identification, natural language processing (NLP) methods can be applied to automatically de-identify narrative text documents in the Electronic Health Record (EHR). Several systems for automated de-identification have been developed, but each is tailored to the document types and formats it was designed to process. VA CPRS narrative text documents differ significantly from documents in other systems, most prominently in their widespread use of templates. Any automated de-identification system would therefore require significant adaptation to be used with VA CPRS narrative text documents.
OBJECTIVE(S):
This project was driven by the following research questions:
1) Can automatic text de-identification be applied to VA clinical narratives with good performance?
2) What is the risk that a de-identified clinical note can be re-identified?
3) How much does automatic text de-identification impact subsequent uses of the clinical narratives?
The objectives of this study were to:
1. Evaluate existing automated text de-identification methods and develop a best-of-breed application for VA clinical narratives by combining the best performing methods for each type of identifier.
2. Evaluate the risk that de-identified clinical text can be linked to the identity of the corresponding patient.
3. Determine the influence of automated de-identification on the accuracy of information extraction and the optimal combination of both.
METHODS:
Development and evaluation in this project were based on a stratified random sample of VHA clinical narratives authored between April 1, 2008 and March 31, 2009, drawn from VHA patient EHRs in VISN 19. No patient criteria were used for selection. The 100 most frequent note types (addenda excluded) served as sampling strata. We then randomly selected eight documents from each stratum, for a total of 800 clinical documents.
The first objective included an evaluation of existing text de-identification methods (a comprehensive survey and an evaluation of selected algorithms and systems), the development of a best-of-breed automatic clinical text de-identification application, and the evaluation of this new application.
The second objective consisted of evaluating the level of anonymity of automatically de-identified clinical documents when presented to healthcare providers at various levels of proximity to the patient (e.g., a nurse working in the ward where a patient was hospitalized versus an attending physician consulting in the same hospital). Discharge summaries from a random sample of 100 patients hospitalized in acute medicine at the Salt Lake City VHA Medical Center between September and December 2012 were automatically de-identified for this survey with BoB, the best-of-breed application developed under the first objective. This objective also included an estimation of the re-identification risk based on the uniqueness of automatically de-identified clinical documents and on the other identified data sets that could be used for re-identification.
The third objective focused on evaluating the impact of automatic de-identification on clinical data (readability and interpretability) and on subsequent information extraction processes.
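The sampling step described in METHODS (the 100 most frequent note types as strata, eight documents drawn at random from each) can be illustrated with a minimal Python sketch. This is not the project's code: the function name `stratified_sample`, the `(doc_id, note_type)` input shape, the exact-match exclusion of "addendum", and the fixed seed are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def stratified_sample(documents, n_strata=100, per_stratum=8,
                      excluded=("addendum",), seed=0):
    """Return {note_type: [doc_id, ...]} for the most frequent note types.

    documents: iterable of (doc_id, note_type) pairs (hypothetical shape).
    excluded:  note types left out of the strata, compared case-insensitively.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Group document ids by note type, skipping excluded types.
    by_type = defaultdict(list)
    for doc_id, note_type in documents:
        if note_type.lower() not in excluded:
            by_type[note_type].append(doc_id)

    # Keep the n_strata note types with the most documents.
    top_types = sorted(by_type, key=lambda t: len(by_type[t]),
                       reverse=True)[:n_strata]

    # Draw per_stratum documents without replacement from each stratum
    # (or every document, if a stratum holds fewer than per_stratum).
    return {
        t: rng.sample(by_type[t], min(per_stratum, len(by_type[t])))
        for t in top_types
    }
```

With at least 100 eligible note types that each hold eight or more documents, the returned mapping contains 100 strata and 800 document ids in total, matching the sample size reported above.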
To guide our efforts and better understand the opinions of Information Security and Privacy Officers about the use of automated de-identification and de-identified notes in research, we conducted a survey of these VHA employees.
FINDINGS/RESULTS:
Each document in our sample of VHA clinical notes was independently annotated by two reviewers for PHI (Protected Health Information) and clinical eponyms; disagreements were adjudicated by a third reviewer. This annotated corpus served as the reference standard for training and testing.
First objective: We conducted and published a comprehensive survey of research and software developed for clinical text de-identification. We also implemented and evaluated several such applications with VHA clinical documents. Based on this evaluation and analysis, we chose the best methods and resources for each type of PHI and developed a best-of-breed automatic de-identification application for VHA clinical text (called BoB). A first version of BoB was released in December 2011. Subsequent performance optimization brought it to an overall sensitivity of 92.6% (98-100% for highly sensitive PHI) and a positive predictive value of 84.1%.
Second objective: The anonymity survey used 100 automatically de-identified notes, and none was formally identified by healthcare providers. Eight residents and four attending physicians in acute medicine at the Salt Lake City VHA Medical Center participated in the survey, and even residents who had cared for the patients within the past 3 months did not formally recognize them. The uniqueness of automatically de-identified clinical documents was estimated by automatically mapping ICD-9-CM and CPT-4 terms from clinical notes in the 2010 i2b2 NLP challenge corpus.
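One way to read this uniqueness estimate is as the share of notes carrying at least one code that no other note in the corpus carries. The Python sketch below follows that reading only as an assumption; the source does not specify the exact definition or the mapping tool. The function name `fraction_with_unique_code` and the input shape (note id mapped to a set of code strings) are hypothetical.

```python
from collections import Counter


def fraction_with_unique_code(note_codes):
    """Fraction of notes with at least one code found in no other note.

    note_codes: dict mapping a note id to the set of ICD-9-CM / CPT-4
                codes mapped from that note (hypothetical input shape).
    """
    if not note_codes:
        return 0.0

    # Count, for each code, how many distinct notes contain it.
    notes_per_code = Counter(
        code for codes in note_codes.values() for code in set(codes)
    )

    # A note counts as "unique" if any of its codes occurs in one note only.
    unique_notes = sum(
        1 for codes in note_codes.values()
        if any(notes_per_code[code] == 1 for code in codes)
    )
    return unique_notes / len(note_codes)
```

Under this reading, a note with a one-of-a-kind code could in principle be matched against an identified data set that records the same codes.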
About 23% of the notes had a unique ICD-9-CM or CPT-4 code, and might therefore be linked with an identified database that includes these codes.
Third objective: We studied the impact of de-identification on the readability and interpretability of clinical documents, and on subsequent information extraction, using an existing corpus of clinical notes from the 2010 i2b2 NLP challenge and part of our VHA clinical narratives corpus. The impact was minimal (0.81-1.87% of clinical terms).
IMPACT:
The creation of a de-identified patient data repository would have significant implications for the future of research within the VHA. Such a repository would give researchers greatly increased access to patient data across the entire VHA system, thereby facilitating research projects that are currently not possible within VHA research confines. The BoB system we have developed would enable the creation of such a repository, and our other findings could guide updated or new policies for access to this repository.
DRA: Health Systems
DRE: Technology Development and Assessment, Research Infrastructure
Keywords: none
MeSH Terms: none