Scientists Create Labor-Saving Automated Method for Studying Electronic Health Records

Benjamin Glicksberg, Assistant Professor, Genetics and Genomic Sciences, Mount Sinai

In an article published in the journal Patterns, scientists at the Icahn School of Medicine at Mount Sinai described the creation of an automated, artificial intelligence-based algorithm that can learn to read patient data from electronic health records. In a side-by-side comparison, they showed that their method, called Phe2vec, identified patients with certain diseases as accurately as the traditional, “gold-standard” method, which requires much more manual labor to develop and perform.

“There continues to be an explosion in the amount and types of data electronically stored in a patient’s medical record. Disentangling this complex web of data can be highly burdensome, thus slowing advancements in clinical research,” said Benjamin S. Glicksberg, PhD, assistant professor of genetics and genomic sciences and a senior author of the study, which was led by Jessica K. De Freitas, a graduate student in Glicksberg’s lab. “In this study, we created a new method for mining data from electronic health records with machine learning that is faster and less labor intensive than the industry standard.”

Currently, scientists rely on a set of established computer programs, or algorithms, to mine medical records for new information. To study a disease, researchers first have to comb through reams of medical records looking for pieces of data, such as certain lab tests or prescriptions, which are uniquely associated with the disease. They then program the algorithm that guides the computer to search for patients who have those disease-specific pieces of data, which constitute a “phenotype.” In turn, the list of patients identified by the computer needs to be manually double-checked by researchers. Each time researchers want to study a new disease, they have to restart the process from scratch.
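The manual workflow described above can be sketched in a few lines of Python. This is only an illustration of rule-based phenotyping in general, not code from the study: the record structure, the code labels such as DX:type2_diabetes, and the "match at least two categories" threshold are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set


@dataclass
class PatientRecord:
    """Toy stand-in for one patient's record: an ID plus a set of coded entries."""
    patient_id: str
    codes: Set[str] = field(default_factory=set)


# Hypothetical hand-written phenotype definition. In practice, experts assemble
# lists like this by reviewing records for each disease they want to study.
T2D_RULE: Dict[str, Set[str]] = {
    "diagnoses": {"DX:type2_diabetes"},
    "medications": {"RX:metformin"},
    "labs": {"LAB:hba1c_high"},
}


def matches_rule(record: PatientRecord, rule: Dict[str, Set[str]],
                 min_categories: int = 2) -> bool:
    """Return True if the record has evidence in at least `min_categories` categories."""
    hits = sum(1 for codes in rule.values() if record.codes & codes)
    return hits >= min_categories


def find_cohort(records: List[PatientRecord], rule: Dict[str, Set[str]],
                min_categories: int = 2) -> List[str]:
    """Return the IDs of patients who satisfy the rule, in input order."""
    return [r.patient_id for r in records if matches_rule(r, rule, min_categories)]


records = [
    PatientRecord("p1", {"DX:type2_diabetes", "RX:metformin"}),
    PatientRecord("p2", {"RX:metformin"}),
    PatientRecord("p3", {"LAB:hba1c_high", "DX:type2_diabetes", "RX:insulin"}),
    PatientRecord("p4", {"DX:asthma"}),
]

cohort = find_cohort(records, T2D_RULE)
print(cohort)  # → ['p1', 'p3']
```

Studying a different disease would mean writing a new rule dictionary by hand and re-checking the resulting cohort, which is the repeated effort the article describes.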

In this study, the researchers tried a different approach, one in which the computer learns, on its own, how to spot disease phenotypes, thus saving researchers time and effort. The new Phe2vec method was based on studies the team had already conducted.

“Previously, we showed that unsupervised machine learning could be a highly efficient and effective strategy for mining electronic health records,” said Riccardo Miotto, PhD, a former assistant professor at Mount Sinai and a senior author of the study. “The potential advantage of our approach is that it learns representations of diseases from the data itself. Therefore, the machine does much of the work experts would normally do to define the combination of data elements from health records that best describes a particular disease.”
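To make the idea of learned representations concrete, here is a minimal Python sketch. The article does not describe Phe2vec's internals, so this should not be read as the authors' method. It assumes, hypothetically, that each medical concept already has a numeric vector (an embedding) learned from record data, and it uses hand-written toy vectors and invented code labels in place of learned ones. A disease phenotype is then taken to be the concepts closest to a seed concept, and patients are scored by how similar their records are to that phenotype.

```python
import math
from typing import Dict, Iterable, List, Tuple

Vector = Tuple[float, ...]

# Toy, hand-written vectors standing in for embeddings that would normally be
# learned from data. The concept names and values are hypothetical.
EMBEDDINGS: Dict[str, Vector] = {
    "DX:type2_diabetes": (0.9, 0.1, 0.0),
    "RX:metformin": (0.8, 0.2, 0.1),
    "LAB:hba1c_high": (0.85, 0.05, 0.1),
    "DX:asthma": (0.1, 0.9, 0.1),
    "RX:albuterol": (0.05, 0.95, 0.0),
    "DX:ankle_sprain": (0.1, 0.1, 0.9),
}


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def nearest_concepts(seed: str, k: int = 2) -> List[str]:
    """Return the k concepts whose vectors are most similar to the seed's."""
    seed_vec = EMBEDDINGS[seed]
    scored = [(cosine(seed_vec, vec), name)
              for name, vec in EMBEDDINGS.items() if name != seed]
    return [name for _, name in sorted(scored, reverse=True)[:k]]


def mean_vector(codes: Iterable[str]) -> Vector:
    """Average the vectors of the known codes; empty tuple if none are known."""
    vecs = [EMBEDDINGS[c] for c in codes if c in EMBEDDINGS]
    if not vecs:
        return ()
    return tuple(sum(dim) / len(vecs) for dim in zip(*vecs))


def patient_score(codes: Iterable[str], phenotype: Iterable[str]) -> float:
    """Similarity between a patient's averaged codes and the phenotype's averaged concepts."""
    patient_vec = mean_vector(codes)
    phenotype_vec = mean_vector(phenotype)
    if not patient_vec or not phenotype_vec:
        return 0.0
    return cosine(patient_vec, phenotype_vec)


seed = "DX:type2_diabetes"
phenotype = [seed] + nearest_concepts(seed, k=2)

score_a = patient_score({"RX:metformin", "LAB:hba1c_high"}, phenotype)
score_b = patient_score({"DX:asthma", "RX:albuterol"}, phenotype)
print(phenotype)
print(round(score_a, 3), round(score_b, 3))
```

In this sketch no one writes a disease-specific rule. The phenotype falls out of whichever concepts sit near the seed, so studying another disease would only require choosing a different seed concept.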

When the researchers compared the new and old systems, they found that for 90% of the diseases tested, the new Phe2vec system was as effective as, or slightly better than, the gold-standard phenotyping process at correctly identifying a diagnosis from electronic health records.

“Overall, our results are encouraging and suggest that Phe2vec is a promising technique for large-scale phenotyping of diseases in electronic health record data,” Glicksberg said. “With further testing and refinement, we hope that it could be used to automate many of the initial steps of clinical informatics research, thus allowing scientists to focus their efforts on downstream analyses like predictive modeling.”

Edited by Gary Cramer