FORUM - Translating research into quality health care for Veterans

October 2014

Commentary

Embracing "Big Data" and Data Science

Stephan Fihn, M.D., M.P.H., VHA Office of Analytics and Business Intelligence, Washington, D.C.

Toward the close of the 20th century, the HSR&D program established the Management Consultation Program to broker linkages between investigators and operational leaders in VHA. In response to requests from VHA leaders, often for information about their own programs, investigators crafted hypotheses, created data collection systems, extracted data from the patient treatment file, and conducted program evaluations. This process typically took years and required training researchers to use arcane corporate data.

Data Requests in the Modern Era

Fast forward two decades and HSR&D investigators now typically request permission to access data collected as part of routine operations and seek guidance on relevant questions. Not only has the directionality of the research-operational relationship reversed, but its fundamental nature has also changed. Today’s leaders are accustomed to examining and manipulating complex data and do not hesitate to interrogate analysts about technical details of complex analyses such as risk adjustment or propensity matching. One prior VA Secretary regularly performed analyses himself with specialized software, and our current Secretary conducts daily briefings, poring over exhaustive, data-rich reports prepared the prior day. In this fast-paced environment, the latency for data requests is measured in hours, and the expectation for accuracy is unforgiving.

The major contributor to these new circumstances is the massive amount of data now available. The Corporate Data Warehouse (CDW), for instance, features 4,000 CPUs and 1.5 petabytes of data representing 20 million patient records arrayed in 1,000 tables consisting of 20,000 columns and 80 billion rows. It is refreshed nightly with data from CPRS/VistA, and soon the refresh will occur every four hours, permitting “near real-time” analysis and reporting. The CDW, however, contains only a portion of the data collected within VHA. Excluded are much of VA’s financial information, data from specialized clinical systems (e.g., the ICU/Anesthesia system [CIS/ARK], the new surgery package [SQWM], and the RFID asset tracking system [RTLS]), patient-reported data from mobile devices, and data transmitted by an increasing number of medical devices, although these data may be added in the future.

Sadly, only a tiny fraction of these data are ever examined outside of the settings in which they were recorded and, when they are analyzed, traditional methods are employed rather than the sophisticated techniques of machine learning that are becoming widespread in industry. Such techniques include, for example, geometric data visualization, recursive and spatiotemporal analytics, and Bayesian networks. The failure to exploit the vast wealth of existing data is true not only in VA, but also in the rest of the health care sector. This tendency is changing rapidly, however, with the advent of initiatives such as VINCI (funded by HSR&D) and Big Data to Knowledge (BD2K - funded by NIH) as well as the emergence of commercially funded entities, such as Optum Labs.

In this era of “big data,” what accounts for the relative failure of the medical research community to seize the initiative? In a recent article, Krumholz cites several explanations, including:

Failure to appreciate the complexity of health and health care that cannot be understood using standard, reductionist approaches;
Stubborn adherence to methodologies that demand a priori hypotheses;
Routine rejection of data that are inconsistent with existing models or do not support causal inference;
Lack of exposure to and training in new fields of mathematics and data science;
Exaggerated concerns that inductive reasoning based on new approaches may be unduly subject to bias; and
An academic culture that does not promote or reward open-sourcing of data, methods, and results.¹

Added to these obstacles are suffocating compliance requirements and a tortuous funding system that ensures obsolescence of most results before they are reported.

VA’s Office of Analytics and Business Intelligence

In VA’s Office of Analytics and Business Intelligence (OABI), we have sought to address some of these hurdles. When charged to identify patients at the greatest risk of adverse outcomes, we constructed large, multivariate models selecting covariates from hundreds of candidates contained within numerous domains of the CDW. Year-on-year validation of these models yielded C-statistics approaching 0.9, confirming their predictive accuracy.² The interval between initiation of the work and its weekly application to all Veterans enrolled in VA primary care was approximately five months, including delays in adding key domains to the CDW. Currently, analysts in OABI develop predictive models for a variety of clinical events with similar degrees of accuracy in weeks whereas in the research community, such work still typically requires months to years.
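For readers less familiar with the C-statistic cited above, the following is a minimal sketch of how it can be computed for a binary outcome. It is offered as an illustration only and is not the OABI implementation: the function name `c_statistic` and the six predicted risks and outcomes are hypothetical, chosen for the example. The statistic is the share of (event, non-event) patient pairs in which the patient who had the event was assigned the higher predicted risk, with ties counted as half.

```python
from itertools import product


def c_statistic(scores, outcomes):
    """Concordance (C) statistic for a binary outcome.

    scores   -- predicted risks, one per patient
    outcomes -- observed outcomes, 1 if the event occurred, else 0
    """
    events = [s for s, y in zip(scores, outcomes) if y == 1]
    non_events = [s for s, y in zip(scores, outcomes) if y == 0]
    if not events or not non_events:
        raise ValueError("need at least one event and one non-event")

    concordant = 0.0
    # Compare every patient who had the event with every patient who did not.
    for event_score, non_event_score in product(events, non_events):
        if event_score > non_event_score:
            concordant += 1.0   # model ranked the pair correctly
        elif event_score == non_event_score:
            concordant += 0.5   # tie counts as half

    return concordant / (len(events) * len(non_events))


# Hypothetical data for illustration only (not drawn from the CDW).
risks = [0.9, 0.8, 0.35, 0.6, 0.2, 0.1]
observed = [1, 1, 1, 0, 0, 0]
print(round(c_statistic(risks, observed), 3))
```

A value of 0.5 corresponds to ranking no better than chance and 1.0 to perfect discrimination, which is why values approaching 0.9 are regarded as strong. The pairwise loop is written for clarity; at the scale of millions of patients, a rank-based calculation would be the more practical choice.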

OABI also undertakes projects lacking concrete hypotheses. When VA established the PACT initiative, existing approaches to assessing implementation of the patient-centered medical home were rudimentary. OABI staff created a large database and then, in partnership with the PACT demonstration laboratories and HSR&D, evaluated hundreds of candidate variables to construct an index that exhibits strong correlation with important objectives, such as reduced frequency of hospitalization and emergency visits, improved clinical quality, better patient experience, and lower staff burnout.³

An Imperative for HSR&D Investigators

The imperative for HSR&D investigators is to develop competencies in this rapidly evolving field of data science so that our health system and the Veterans we serve can benefit from knowledge that is presently sequestered. Contemporaneously, researchers must help to define the methods to discern meaningful signals from random noise or biased observation. These methods will not supplant hypothesis-driven tests, but will make them more efficient and greatly enhance our ability to anticipate critically important clinical events.

Revolutions in science often result from inductive reasoning coupled with novel methods from other fields. Now is the time for medical investigators to ascertain whether data science has the potential to revolutionize how we deliver health care.

References

Krumholz, H.M. “Big Data and New Knowledge in Medicine: the Thinking, Training, and Tools Needed for a Learning Health System,” Health Affairs (Millwood) 2014; 33:1263-70.
Wang, L. et al. “Predicting Risk of Hospitalization or Death among Patients Receiving Primary Care in the Veterans Health Administration,” Medical Care 2013; 51:368-73.
Nelson, K.M. et al. “Implementation of the Patient-centered Medical Home in the Veterans Health Administration: Associations with Patient Satisfaction, Quality of Care, Staff Burnout, and Hospital and Emergency Department Use,” JAMA Internal Medicine 2014; 174:1350-8.
