Data-Tech Connect: Conducting Research with EMR Data? You Need an Epidemiologist

Randi Foraker, PhD, MA, FAHA, associate professor, The Ohio State University

Researchers have long characterized the etiology of disease using data from population-based cohort studies, such as the Framingham Heart Study.1 Cohort studies such as these comprise decades of follow-up and rich amounts of questionnaire and physical exam data; however, they are also expensive endeavors. Electronic medical record (EMR) data represent a resource-efficient source of information for researchers to use as they investigate myriad disease states, and these data are particularly pragmatic for answering health services–related research questions.

Due to the convenience of these data, which are collected electronically at medical centers and clinics across the United States,2 their use for research purposes is becoming ever more common.3 Still, using secondary data comes with inherent pitfalls stemming from the fact that the data were either not originally collected for research purposes, or were not collected to answer a particular research question. Further, data central to a research question may be missing from the dataset or characterized in a way that is not ideal for the analysis.4

As EMR data grow in popularity as a secondary data source for researchers, it is critical to establish best practices for data generation, study design, and data analysis. As experts in study design, epidemiologists have the skills and tools to anticipate, and in some cases account for, threats to internal validity (e.g., confounding, selection bias, information bias). They can also assess representativeness, which determines how well a study’s results translate to the population level to improve health.

The rest of this column addresses internal validity and representativeness, and suggests some related solutions and strategies for implementation across data platforms.

Internal Validity

Confounders are factors that explain, completely or in part, the effect of a given exposure on an outcome.4 In any dataset, confounders may be measured or unmeasured. Adequate control for confounding is a concern when using EMR data, since many confounders (e.g., comorbidities) go un- or under-documented. For example, one limitation of using EMR data for research purposes is that one often cannot adequately characterize the social and behavioral determinants of health from the available data.

Meanwhile, selection bias results from factors differentiating study participants from all those eligible to participate.4 For example, when a research sample is drawn from a population of inpatients, it must be acknowledged that the findings from that sample will represent the experience of patients who are generally sicker (e.g., in terms of disease stage or severity of illness) than others. Selection bias can also arise when choosing controls from a hospitalized patient population for a case-control study, as there is a risk of selecting controls who are not representative of the source population (i.e., all patients without the disease of interest) in terms of exposure prevalence. Choosing a sample of controls from a hospitalized patient cohort, for example, may yield a different exposure-outcome association than would have been observed had a random sample of the source population been selected instead.
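The effect of control selection can be illustrated with simple arithmetic. The sketch below is only an illustration: the exposure prevalences are hypothetical values chosen for the example, and the function names are not drawn from any particular library. It compares the odds ratio obtained with controls sampled from the source population against the odds ratio obtained with hospitalized controls, among whom the exposure is assumed to be more common.

```python
def odds(prevalence):
    """Convert an exposure prevalence (between 0 and 1) into odds."""
    return prevalence / (1 - prevalence)


def exposure_odds_ratio(case_prevalence, control_prevalence):
    """Odds ratio comparing exposure odds in cases with exposure odds in controls."""
    return odds(case_prevalence) / odds(control_prevalence)


# All values below are hypothetical.
case_prevalence = 0.60              # exposure prevalence among cases
source_control_prevalence = 0.20    # controls sampled from the source population
hospital_control_prevalence = 0.40  # controls sampled from hospitalized patients

or_source = exposure_odds_ratio(case_prevalence, source_control_prevalence)
or_hospital = exposure_odds_ratio(case_prevalence, hospital_control_prevalence)

print(f"OR with source-population controls: {or_source:.2f}")
print(f"OR with hospitalized controls:      {or_hospital:.2f}")
```

Under these assumed values, the hospitalized controls understate the association (an odds ratio of about 2.25 rather than 6.00) because the exposure is more common among them than in the source population. Had the exposure been less common among hospitalized patients, the bias would run in the opposite direction.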

Another internal validity concern, information bias, can arise from differential or nondifferential misclassification. In epidemiological studies, misclassification of the exposure, outcome, or confounders can occur.4 Nondifferential misclassification of a study outcome happens when the extent of misclassification does not vary according to level of exposure or the value of other variables, such as confounders. An example in the clinic setting would be if the outcome for all patients were determined by a laboratory test done with uncalibrated equipment.

Often of greater concern are the results of differential misclassification leading to an over- or under-estimation of an effect. In EMR data, an exposure such as smoking status may be more accurately defined and up-to-date for patients with lung cancer compared to patients without lung cancer. In this simple example, if current smoking is under-recorded in patients without lung cancer and accurately recorded among patients with lung cancer, then the observed relationship of smoking to lung cancer would be exaggerated in the dataset.
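A small worked example may make the direction of this bias concrete. The sketch below uses hypothetical counts and a hypothetical recording sensitivity (neither comes from a real dataset), and the function names are illustrative. It shows how the odds ratio is inflated when smoking is under-recorded only among patients without lung cancer.

```python
def odds_ratio(exposed_cases, unexposed_cases, exposed_noncases, unexposed_noncases):
    """Odds ratio from a 2x2 table of exposure by outcome."""
    return (exposed_cases * unexposed_noncases) / (unexposed_cases * exposed_noncases)


def recorded_counts(true_exposed, true_unexposed, sensitivity):
    """Counts as they would appear in the record when only a fraction
    `sensitivity` of truly exposed patients are recorded as exposed.
    Specificity is assumed to be perfect."""
    recorded_exposed = true_exposed * sensitivity
    recorded_unexposed = true_unexposed + true_exposed * (1 - sensitivity)
    return recorded_exposed, recorded_unexposed


# Hypothetical "true" counts of current smokers and non-smokers.
cases_smokers, cases_nonsmokers = 60, 40          # patients with lung cancer
noncases_smokers, noncases_nonsmokers = 200, 800  # patients without lung cancer

true_or = odds_ratio(cases_smokers, cases_nonsmokers,
                     noncases_smokers, noncases_nonsmokers)

# Assumption: smoking is fully recorded for patients with lung cancer, but
# only 70% of smokers without lung cancer have it recorded in the EMR.
obs_case_smk, obs_case_non = recorded_counts(cases_smokers, cases_nonsmokers, 1.0)
obs_non_smk, obs_non_non = recorded_counts(noncases_smokers, noncases_nonsmokers, 0.7)

observed_or = odds_ratio(obs_case_smk, obs_case_non, obs_non_smk, obs_non_non)

print(f"true OR     = {true_or:.2f}")      # 6.00
print(f"observed OR = {observed_or:.2f}")  # 9.21
```

The same functions can serve as a simple sensitivity analysis: varying the assumed sensitivity in each group shows how strongly the observed odds ratio depends on the accuracy of recording.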


Representativeness

Patients seen at a single clinic or medical center within the population’s catchment area may not represent the “typical” patient from the target population. That means that, even if the internal validity of the study is high, the observed associations may not apply beyond that sample of patients. Of note, internal validity should be prioritized above generalizability of the results, since lack of internal validity precludes external validity.

Proposed Solutions

Several strategies can be applied to enhance the internal validity of EMR data for answering our most important clinical research questions. Some of the solutions proposed below involve capitalizing on other sources of data, and others require more careful treatment of the EMR data at our fingertips (see Figure 1* for an overview of the strengths, opportunities, and limitations of such data). These solutions are:

  • Data on unmeasured confounders may be obtained by linking EMR data to other sources, such as socioeconomic data from the U.S. Census or patient-reported outcomes and behavioral data from questionnaires. If researchers suspect that data on a particular confounder are under-documented within a single field in the EMR, they may choose to develop an algorithm incorporating multiple sources of data from the EMR.
  • Results should always be interpreted consistent with how the data were collected. Investigators should also carefully consider from which population a comparison group will be selected; in many cases, publicly available data may be used to characterize controls selected from the general population.4
  • To combat information bias, data should be generated and stored in the same way, regardless of the exposure or disease status of the participant. Researchers using EMR data often do not have control of the manner in which the data are originally collected, and should be prepared to link EMR data with other sources (e.g., patient-reported outcome questionnaires) and to conduct sensitivity analyses to assess the robustness of the findings to potential misclassification.
  • Representativeness of data can be assessed by comparing the distribution of key demographics in the clinical sample to that of the target population. For a given clinic or medical center, the catchment area can be thought of as the target population. Quantifying these differences, if they exist, should be considered the first step. If few differences exist, stronger inferences can be made. Otherwise, once the researcher knows how much the study population differs from the target population, the second step should use the observed differences to weight or standardize the observations in the analytic dataset.5
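The two-step approach to representativeness in the last item can be sketched as follows. This is a minimal illustration under stated assumptions: the age strata, counts, catchment-area proportions, and stratum-specific prevalences are all hypothetical, and the function name is illustrative rather than taken from any library. The weights are simple post-stratification weights (target share divided by sample share), which is one of several ways to weight or standardize.

```python
def poststratification_weights(sample_counts, target_proportions):
    """Return one weight per stratum: target share divided by sample share."""
    total = sum(sample_counts.values())
    return {
        stratum: target_proportions[stratum] / (count / total)
        for stratum, count in sample_counts.items()
    }


# Step 1: compare the clinic sample with the catchment area (hypothetical values).
sample_counts = {"18-44": 200, "45-64": 500, "65+": 300}           # clinic patients
target_proportions = {"18-44": 0.50, "45-64": 0.30, "65+": 0.20}   # catchment area

weights = poststratification_weights(sample_counts, target_proportions)
for stratum, weight in weights.items():
    print(f"{stratum}: weight = {weight:.2f}")

# Step 2: use the weights to standardize an estimate.
# Hypothetical prevalence of some outcome within each age stratum of the sample.
prevalence = {"18-44": 0.05, "45-64": 0.15, "65+": 0.30}

total = sum(sample_counts.values())
crude = sum(prevalence[s] * sample_counts[s] for s in sample_counts) / total
weighted = (
    sum(prevalence[s] * sample_counts[s] * weights[s] for s in sample_counts)
    / sum(sample_counts[s] * weights[s] for s in sample_counts)
)

print(f"crude prevalence    = {crude:.3f}")
print(f"weighted prevalence = {weighted:.3f}")
```

In this hypothetical sample, older patients are over-represented relative to the catchment area, so the crude prevalence (about 0.175) overstates the standardized prevalence (about 0.130). A weight near 1 in every stratum would indicate that the sample already resembles the target population on that characteristic.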


Conclusion

A goal of investigators should be to accurately estimate the association between a given exposure and disease, or in some cases to build a predictive model to account for all sources of variability in the study outcome. A challenge to be faced when conducting observational studies in the EMR is that one doesn’t, and won’t, know the “true” association between a given exposure and a disease.

The researcher’s job is to use the tools at his or her disposal to approximate the truth as closely as possible. Enhancing the internal validity of studies using EMR data may require researchers to leverage additional sources of data, minimize the influence of selection factors, and conduct sensitivity analyses. Steps can be taken in the study design phase to ultimately enhance both internal and external validity. Researchers should also consider consulting a local epidemiologist for specific guidance regarding the research project at hand.


References

  1. Dawber TR, Meadors GF, Moore FE. 1951. Epidemiological approaches to heart disease: the Framingham Study. Am J Pub Health Nation’s Health 41(3):279–86.
  2. HIMSS Analytics. Healthcare Information and Management Systems Society. Electronic Medical Record Adoption Model.
  3. Kite B, Tangasi W, Kelley M, Bower J, Foraker R. 2014. Electronic medical records and their use in health promotion and population research of cardiovascular disease. Curr Cardiovasc Risk Rep 9(1):1–8.
  4. Rothman KJ, Greenland S, Lash TL. 2008. Modern Epidemiology (3rd ed). Philadelphia, Pa.: Lippincott Williams & Wilkins.
  5. Bower JK, Bollinger CE, Foraker RE, Hood DB, Shoben AB, Lai AM. 2017 (in press). Active use of electronic health records and personal health records for epidemiologic research. eGEMs.

Randi Foraker, PhD, MA, FAHA, is associate professor of epidemiology and biomedical informatics at The Ohio State University College of Public Health.

[DOI: 10.14524/CR-17-4008]

*To see all figures and/or tables published originally in this article, please visit the full-issue PDF of the April 2017 Clinical Researcher.