Supplementary data for 'How Confident Are We About Observational Findings in Healthcare: A Benchmark Study'

Supplementary data for:

Schuemie MJ, Cepeda MS, Suchard MA, Yang J, Tian Y, Schuler A, Ryan PB, Madigan D, Hripcsak G. How Confident Are We About Observational Findings in Healthcare: A Benchmark Study. Harvard Data Science Review, 2(1), 2020.

Article abstract

Health care professionals increasingly rely on observational health care data, such as administrative claims and electronic health records, to estimate the causal effects of interventions. However, limited prior studies raise concerns about the real-world performance of the statistical and epidemiological methods that are used. We present the “Observational Health Data Sciences and Informatics (OHDSI) Methods Benchmark” that aims to evaluate the performance of effect estimation methods on real data. The benchmark comprises a gold standard, a set of metrics, and a set of open source software tools. The gold standard is a collection of real negative controls (drug-outcome pairs where no causal effect appears to exist) and synthetic positive controls (drug-outcome pairs that augment negative controls with simulated causal effects). We apply the benchmark using four large health care databases to evaluate methods commonly used in practice: the new-user cohort, self-controlled cohort, case-control, case-crossover, and self-controlled case series designs. The results confirm the concerns about these methods, showing that for most methods the operating characteristics deviate considerably from nominal levels. For example, in most contexts, only half of the 95% confidence intervals we calculated contain the corresponding true effect size. We previously developed an ‘empirical calibration’ procedure to restore these characteristics and we also evaluate this procedure. While no one method dominates, self-controlled methods such as the empirically calibrated self-controlled case series perform well across a wide range of scenarios.
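The coverage metric mentioned in the abstract (the fraction of 95% confidence intervals that contain the true effect size) can be sketched as follows. This is an illustrative computation only, not the OHDSI benchmark code; the function name `ci_coverage` and the example estimates are hypothetical, and estimates are assumed to be on the log relative-risk scale with normal standard errors.

```python
def ci_coverage(estimates, true_log_rr):
    """Fraction of controls whose 95% CI contains the true effect.

    estimates: list of (log_rr, standard_error) pairs (illustrative).
    true_log_rr: the known true effect, e.g. 0.0 for negative controls.
    """
    z = 1.96  # normal quantile for a two-sided 95% interval
    covered = 0
    for log_rr, se in estimates:
        lower, upper = log_rr - z * se, log_rr + z * se
        if lower <= true_log_rr <= upper:
            covered += 1
    return covered / len(estimates)

# Negative controls have a true log(RR) of 0; nominal coverage would be 0.95.
coverage = ci_coverage([(0.1, 0.2), (0.8, 0.2), (-0.05, 0.3)], 0.0)
```

A coverage value well below 0.95, as reported in the article, indicates that the confidence intervals are too narrow or the estimates are biased.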


Since the publication of our manuscript in Harvard Data Science Review, we have discovered an error in our implementation of inverse probability of treatment weighting (IPTW), identified in this app as CohortMethod analysis 5. We have corrected the error and updated the performance statistics accordingly. Although the observed performance has increased somewhat, the correction has no material impact on our conclusions.
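For readers unfamiliar with the method named in the erratum, IPTW reweights subjects by the inverse of their estimated probability of receiving the treatment they actually received. The sketch below shows the standard (optionally stabilized) weight formula under that assumption; it is not the corrected CohortMethod implementation, and the function name `iptw_weights` is hypothetical.

```python
def iptw_weights(treated, propensity, stabilize=True):
    """Inverse probability of treatment weights.

    treated: list of 0/1 treatment flags.
    propensity: estimated P(treatment | covariates) for each subject.
    Stabilized weights multiply by the marginal treatment prevalence,
    which reduces variance inflation from extreme propensity scores.
    """
    p_treated = sum(treated) / len(treated)  # marginal treatment prevalence
    weights = []
    for t, ps in zip(treated, propensity):
        w = 1.0 / ps if t == 1 else 1.0 / (1.0 - ps)
        if stabilize:
            w *= p_treated if t == 1 else (1.0 - p_treated)
        weights.append(w)
    return weights
```

With well-estimated propensity scores, the weighted treated and comparator groups are balanced on measured covariates, mimicking a randomized comparison.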

External links

Figure S.2. Estimates with standard errors for the negative and positive controls, stratified by true effect size. Estimates that fall above the red dashed lines have a confidence interval that includes the truth. Hover the mouse over a point for more information.
Figure S.3. Receiver Operating Characteristic (ROC) curves for distinguishing positive controls from negative controls.
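The ROC curves in Figure S.3 summarize how well a method's effect estimates separate positive from negative controls. The area under such a curve can be computed directly as the probability that a randomly chosen positive control receives a higher estimate than a randomly chosen negative control (the Mann-Whitney formulation). A minimal sketch, assuming the scores are point estimates of effect size; the function name `auc` is hypothetical:

```python
def auc(negative_scores, positive_scores):
    """Area under the ROC curve via the Mann-Whitney statistic.

    Counts, over all positive/negative pairs, how often the positive
    control's score exceeds the negative control's (ties count 0.5).
    """
    wins = 0.0
    for p in positive_scores:
        for n in negative_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positive_scores) * len(negative_scores))
```

An AUC of 1.0 means perfect separation; 0.5 means the method's estimates carry no information about which controls have a true effect.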