
Secondary use of electronic medical records for clinical research: challenges and opportunities


Published 12 February 2018 © 2018 IOP Publishing Ltd
Citation: Wen-Wai Yim et al 2018 Converg. Sci. Phys. Oncol. 4 014001 DOI 10.1088/2057-1739/aaa905


Abstract

With increasingly ubiquitous electronic medical record (EMR) implementation accelerated by the adoption of the HITECH Act, there is much interest in the secondary use of collected data to improve outcomes and promote personalized medicine. A plethora of research has emerged using EMRs to investigate clinical research questions and assess variations in both treatments and outcomes. However, whether because of genuine complexities of modeling disease physiology or because of practical problems regarding data capture, data accuracy, and data completeness, the state of current EMR research is challenging and gives rise to concerns regarding study accuracy and reproducibility. This work explores challenges in how different experimental design decisions can influence results using a specific example of breast cancer patients undergoing excision and reconstruction surgeries from EMRs in an academic hospital and the Veterans Health Administration (VHA). We discuss emerging strategies that will mitigate these limitations, including data sharing, application of natural language processing, and improved EMR user design.


1. Introduction

With the passage of the HITECH Act as part of the 2009 American Recovery and Reinvestment Act [1, 2], government-led incentives for electronic medical record (EMR) adoption and demonstration of its meaningful use have propelled its ubiquity, as well as spurred interest in leveraging EMR data for scientific research. In parallel, President Barack Obama's promotion of the Precision Medicine Initiative [3, 4] and the Cancer Moonshot Initiative [5] has raised awareness of, as well as directed an influx of funding towards, leveraging large-scale computable EMR data to drive healthcare innovation and improve patient care.

However, despite the excitement, the resources, and the availability of these data, there remain immense hurdles, clinical and technical, to effectively applying EMR data for secondary use. These many difficulties have contributed to issues in the scientific community, in the form of contradictory and non-reproducible studies emerging from these new data [6–8].

In this work, we examine the challenges associated with EMR research by stepping through the life cycle in designing, developing, and implementing a clinical research study using retrospective EMR data, with a study of post-mastectomy pain as an example. Some challenges discussed are the role of multi-institutional data availability and study variable choice on study design, the differences inherent in datasets, the challenges of measuring outcomes, the effects of modeling choice and data sparsity on results, and the difficulties of handling longitudinal data. The goal of this paper is not to give clinical recommendations, but rather to illustrate that typical measures of 'statistical significance' in many clinical EMR studies can be undercut by the realities of the data collection, patient population biases, and limitations of data analytic tools, which are imperative to understand if we are to make meaningful use of EMR data.

2. Case study on pain outcomes following mastectomy

The study of pain following mastectomy and reconstructive surgeries is ideal for understanding the hurdles associated with EMR research. On one hand, breast cancer has a clear acute consequence without intervention (death); on the other hand, it is treatable with several combinations of treatment options, with some element of elective choice, and it is associated with many different symptoms. It is estimated that in the US one in eight women will develop breast cancer during their lifetime [9]. Following a diagnosis of breast cancer, women have many different treatment options, and variation in treatment pathways is well documented [10]. Often, treatment may include the surgical removal of cancerous tissue (e.g. mastectomy, lumpectomy) and a subsequent breast reconstruction surgery. Unfortunately, chronic pain after breast surgery does occur, with an estimated incidence of 20–40% [11]. However, little evidence exists regarding the association between adverse pain outcomes and different treatment choices, and therefore both clinicians and patients lack the evidence necessary to guide treatment decisions.

3. Methods

3.1. Data

The motivation for our case study is to characterize postoperative pain following different breast cancer treatments using EMRs from two healthcare systems: Stanford Healthcare Medical Center, a tertiary academic hospital, and the Veterans Health Administration (VHA). At Stanford Healthcare Medical Center, hereafter referred to as AH, patient records were identified from Epic's Clarity relational database. VHA clinical data were extracted from the VA Corporate Data Warehouse (CDW), which includes national data from multiple medical centers [12].

3.2. Patient population

Patients with breast cancer excision surgeries and a previously documented diagnosis of breast cancer, from 2008/01/01 to 2016/06/30, were extracted from the EMRs. Breast cancer diagnosis was identified by ICD-9-CM codes '174*' and '175*' and ICD-10-CM code 'C50*'. When multiple surgeries occurred, only the first excision surgery was included. Only patients with body mass index (BMI), preoperative pain, postoperative pain, pain at 30 d (+14/−7 d), and pain at 3 months (±14 d) were included. The numbers of patients with excision surgeries filtered by diagnosis codes were 3387 and 3308 for AH and VHA, respectively; after all exclusion criteria were applied, 735 and 1210 patients remained.
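As an illustration of this cohort-selection logic, the following is a minimal sketch in Python/pandas; the table and column names (diagnoses, excisions, icd_code, surgery_date) are hypothetical placeholders rather than the actual Clarity or CDW schema, which the study queried with SQL and Java.

```python
# A minimal cohort-selection sketch, assuming pandas DataFrames named
# `diagnoses` and `excisions` with hypothetical column names.
import pandas as pd

def build_cohort(diagnoses: pd.DataFrame, excisions: pd.DataFrame) -> pd.DataFrame:
    """First breast cancer excision per patient, 2008-01-01 to 2016-06-30."""
    # Breast cancer identified by ICD-9-CM 174*/175* and ICD-10-CM C50*
    codes = diagnoses["icd_code"].astype(str)
    is_breast_ca = (
        codes.str.startswith("174")
        | codes.str.startswith("175")
        | codes.str.startswith("C50")
    )
    ca_dx = diagnoses.loc[is_breast_ca, ["patient_id", "diagnosis_date"]]

    # Excisions within the study window that have a prior documented diagnosis
    in_window = excisions["surgery_date"].between(
        pd.Timestamp("2008-01-01"), pd.Timestamp("2016-06-30")
    )
    candidates = excisions[in_window].merge(ca_dx, on="patient_id")
    candidates = candidates[candidates["diagnosis_date"] <= candidates["surgery_date"]]

    # When multiple excision surgeries occur, keep only the first per patient
    return (
        candidates.sort_values("surgery_date")
        .drop_duplicates(subset="patient_id", keep="first")
    )
```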

3.3. Outcomes

To understand different aspects of patient pain after surgery, we assess multiple outcomes: postoperative length-of-stay in days (LOS), discharge day pain (a 0–10 numeric score), prescription of analgesics, physical therapy (PT) visits, and pain management (PM) visits (identified through institution-specific visit department names or classifications) at 30 d (+14/−7 d), 3 months (±14 d), and 1 year (±84 d) outcome time points. Each outcome is meant to be a proxy measure for postoperative pain in different ways. For example, a patient may have an extended LOS if their postoperative pain is not appropriately controlled. Similarly, PM, analgesics prescription, and PT may be prescribed for patients with high pain.
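A sketch of how a pain score might be assigned to one of these outcome windows is shown below; the column names and the choice of the score nearest the nominal follow-up day are assumptions made for illustration, not the study's exact extraction logic.

```python
# Sketch only: `pain_scores` is assumed to hold one row per recorded score,
# with `recorded_date` and `pain_score` columns.
import pandas as pd

WINDOWS = {              # window name: (nominal day, earliest day, latest day)
    "30d": (30, 30 - 7, 30 + 14),
    "3mo": (90, 90 - 14, 90 + 14),
    "1yr": (365, 365 - 84, 365 + 84),
}

def pain_at(pain_scores: pd.DataFrame, surgery_date: pd.Timestamp, window: str):
    """Pain score nearest the nominal follow-up day, or None if none recorded."""
    nominal, lo, hi = WINDOWS[window]
    days_out = (pain_scores["recorded_date"] - surgery_date).dt.days
    eligible = pain_scores.loc[days_out.between(lo, hi)].copy()
    if eligible.empty:
        return None      # patient would be excluded from this outcome time point
    eligible["dist"] = (days_out[eligible.index] - nominal).abs()
    return eligible.sort_values("dist").iloc[0]["pain_score"]
```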

3.4. Study variables

Patient demographic variables included age, race/ethnicity, gender, Charlson score, and BMI at the time of surgery. Race/ethnicity was merged as follows: if Hispanic ethnicity was marked, the patient was considered Hispanic; if more than one race was marked, the patient was considered 'other'; otherwise, the marked race was taken as the final race/ethnicity value. Only female and male genders were included in this study. The last BMI prior to surgery was taken. We also included diagnosis of anxiety (supplemental table 1 (stacks.iop.org/CSPO/4/014001/mmedia)) and diagnosis of depression (supplemental table 2), as they have been associated with longer LOS and postsurgical complications [13]. In AH, medication names are mapped to RxNorm, and in the VHA, drugs are mapped to the national drug formulary (supplemental table 3). We developed categories to rate the invasiveness of a breast cancer excision surgery. The list of current procedural terminology (CPT) codes was identified by examining the CPT ontology as well as including frequent billing codes used by the department. A domain expert reviewed the list of codes and classified the excision surgeries as mild, moderate, or severe (supplemental table 4). Reconstructions were similarly categorized as one of three categories: autologous, implant, and oncoplastic (supplemental table 5).
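The race/ethnicity merging rule can be expressed compactly; the sketch below assumes a per-patient Hispanic flag and a list of marked races, which is an illustrative simplification of the underlying EMR fields.

```python
# Minimal sketch of the race/ethnicity merging rule described above; the
# record structure (hispanic flag plus list of marked races) is an assumption.
def merge_race_ethnicity(hispanic_flag: bool, races: list) -> str:
    if hispanic_flag:
        return "Hispanic"            # Hispanic ethnicity takes precedence
    if len(races) > 1:
        return "Other"               # more than one race marked
    return races[0] if races else "Unknown"
```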

To control for confounding variables, we added information about other treatments that breast cancer patients may receive, measured over multiple time ranges. Included are additional breast excision and reconstruction surgeries, as well as common breast cancer treatments such as chemotherapy (supplemental table 6), radiation therapy (supplemental table 7), and hormone therapy (supplemental table 8) [10], in addition to PT, PM, and other surgery (identified through HCUP surgery flags, excluding excision/reconstruction surgery [14]). Time periods included several ranges prior to surgery {(any time more than 1 year prior to surgery), (between 1 year and 3 months prior to surgery), (3 months prior to the surgery date)}, as well as several time periods between the surgery date and an outcome time point {(surgery day to 3 months prior to the outcome), (3 months to 1 month prior to the outcome), (less than 1 month prior to the outcome)}. These are not factored into the simple bivariate analysis; however, they are included as variables in the multivariate analysis.
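As a sketch of how treatment events might be assigned to these relative time ranges, the following functions bucket an event by its distance from the surgery date or the outcome time point; the day boundaries (365, 90, and 30 d) are assumptions consistent with the ranges described above, not the study's exact implementation.

```python
# Illustrative bucketing of a treatment event into the pre- and post-operative
# time ranges described in the text; boundaries in days are assumptions.
def pre_op_bucket(days_before_surgery: int):
    if days_before_surgery > 365:
        return "more_than_1yr_pre_op"
    if 90 < days_before_surgery <= 365:
        return "1yr_to_3mo_pre_op"
    if 0 <= days_before_surgery <= 90:
        return "3mo_to_surgery"
    return None                           # event occurred after surgery

def post_op_bucket(days_after_surgery: int, outcome_day: int):
    days_to_outcome = outcome_day - days_after_surgery
    if days_after_surgery < 0 or days_to_outcome < 0:
        return None                       # outside the surgery-to-outcome interval
    if days_to_outcome > 90:
        return "surgery_to_3mo_pre_outcome"
    if 30 < days_to_outcome <= 90:
        return "3mo_to_1mo_pre_outcome"
    return "lt_1mo_pre_outcome"
```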

3.5. Statistical analysis

We measured bivariate associations for each variable against each outcome using Spearman correlation, Fisher's exact test, the chi-squared test, and the Kruskal–Wallis ANOVA test as appropriate. Specifically, for two numeric variables, we used Spearman correlation; for two categorical variables, Fisher's exact test was used if any cell values were less than five, otherwise the chi-squared test was used; finally, for one categorical and one numeric variable, we used the Kruskal–Wallis ANOVA. We developed regression models to test the association between our dependent and independent variables of interest. We applied linear regression for numeric outcomes and binary logistic regression for binary outcomes. For example, the dependent variables pain and LOS are numeric, so linear regression models were developed; meanwhile, whether or not patients had an analgesic prescription, a PM visit, or a PT visit are binary dependent outcomes, and therefore logistic regression was used. Processing of EMR data was done using SQL and Java. Statistical analysis was performed using the Python packages pandas and scipy and the R functions glm and lm. This study was approved by the IRB at both AH and the VHA.
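The test-selection rule described above can be sketched as follows using scipy; the dtype-based detection of variable types and the handling of the contingency table are simplifications (for instance, scipy's fisher_exact supports only 2 × 2 tables).

```python
# Sketch of the bivariate test-selection rule; not the study's exact code.
import pandas as pd
from scipy import stats

def bivariate_test(x: pd.Series, y: pd.Series):
    x_numeric = pd.api.types.is_numeric_dtype(x)
    y_numeric = pd.api.types.is_numeric_dtype(y)
    if x_numeric and y_numeric:
        rho, p = stats.spearmanr(x, y)                 # two numeric variables
        return "spearman", p
    if not x_numeric and not y_numeric:
        table = pd.crosstab(x, y)
        if (table < 5).any().any():                    # any cell count < 5
            _, p = stats.fisher_exact(table)           # scipy handles 2x2 tables
            return "fisher", p
        _, p, _, _ = stats.chi2_contingency(table)
        return "chi2", p
    # one categorical, one numeric: Kruskal-Wallis across category groups
    cat, num = (x, y) if not x_numeric else (y, x)
    groups = [num[cat == level] for level in cat.dropna().unique()]
    _, p = stats.kruskal(*groups)
    return "kruskal", p
```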

3.6. Qualitative analysis

In addition to the standard statistical analysis, we use our case study to exemplify challenges of clinical studies using EMR. This includes the effects of using multiple institutions with different data organization, the influence of EMR definitions on results, the bias of disease populations on outcomes, the choice of modeling on results, and the difficulty of incorporating long-term information in EMR studies. When possible, we include quantitative analysis in addition to qualitative analysis.

4. Case study quantitative study results

Table 1 compares the study populations at AH and VHA. As expected, the AH population includes more minority groups as well as a relatively younger population. The VHA population included more males, higher BMI, and higher reported rates of depression and anxiety based on ICD-9-CM coding. A majority of patients in both institutions were outpatient (discharged the same day) or were discharged the next day. In our study population, a minority of AH patients (20%) had a LOS of 2 nights or longer. AH did not include any severe-class excision surgeries.

Table 1. Patient demographics stratified by study site.

Variable AH VHA
N   735 1210
Age (year), mean (SD)   52  ±  12 59  ±  11
Gender, n (%) Female 729 (99.2) 986 (81.5)
Male a 224 (18.5)
BMI, mean (SD)   26  ±  6 31  ±  7
Race, n (%) Asian 182 (24.8) 0 (0)
Black 18 (2.4) 330 (27.3)
Hispanic 84 (11.4) 70 (5.8)
White 385 (52.4) 724 (59.8)
Otherb 66 (9.0) 86 (11.7)
Charlson score, mean (SD)   5  ±  3 5  ±  3
Diagnosed anxiety, n (%) No 632 (86.0) 774 (64.0)
Yes 103 (14.0) 436 (36.0)
Diagnosed depression, n (%) No 661 (89.9) 703 (58.1)
Yes 74 (10.1) 507 (41.9)
LOS (days), n (%) 0 309 (42.0) 940 (77.7)
1 265 (36.1) 148 (12.2)
2+   161 (21.9) 122 (10.1)
Excision surgery class, n (%) Mild 451 (61.4) 533 (44.0)
Moderate 284 (38.6) 613 (50.7)
Severe 0 (0) 64 (5.3)
Reconstruction type, n (%) Autologous 40 (5.4) 16 (1.3)
Implant 155 (21.1) 112 (9.3)
Oncoplastic 19 (2.6) a

aValue less than 10 at time of first surgery. bOther is an aggregate of Native American, Pacific Islander, and Unknown categories, as well as any mixed combinations.

Tables 2 and 3 show the factors found significant (as related to p-values for these models) in the bivariate and regression analyses. While there are recurring themes for each outcome between AH and VHA, there was no major consensus regarding the role of each study variable for the overall outcomes. Specifically, for each outcome, the 'clinically significant factors' change: (1) across different time points for the same outcome measurement (horizontal rows give different significant factors, e.g. pain at discharge, 30 d, and 1 year), (2) across different outcome measure proxies (vertical columns give different significant factors), (3) across different institutions (the same horizontal, vertical location in AH may have different factors in VHA), and (4) across the different analysis types (the same corresponding boxes in tables 2 and 3). Differences could be due to genuine differences in the populations, but could also be due to fitting to idiosyncrasies of the data.

Table 2. Bivariate analysis significant indicators. Bivariate association for each variable against each outcome using spearman correlation, Fisher's exact test, chi-squared test, and Kruskal–Wallis ANOVA test as appropriate.

  Discharge day 30 d 3 month 1 year
AH
LOS Surgery categorya
Reconstruction (autologous)a
Reconstruction (implant)a
Pain Surgery categorya Surgery categoryc Racec Surgery categoryc
Racec Diagnosed anxietyc Diagnosed anxietyc
Reconstruction (implant)a Reconstruction (autologous)c Reconstruction (implant)a
Reconstruction (oncoplastic)a
Analgesic Agec
Charlson scorea
Surgery categorya
Reconstruction (implant)a
PM
PT Agec Agec Ageb
Charlson scorec Charlson scoreb Charlson scoreb
Depressiona Depression statusb Diagnosed anxietyc
Diagnosed anxietya Diagnosed anxietya
VHA
LOS Surgery categorya
Raceb
Reconstruction (autologous)c
Reconstruction (implant)a
Pain Depressiona Depressiona Gendera Diagnosed anxietya Depressiona Depressiona
Gendera Gendera Gendera
Surgery categorya Diagnosed anxietya Diagnosed anxietya
Diagnosed anxietya Reconstruction (implant)a
Racec Raceb
Analgesic Agec Age- Agea
Charlson scoreb BMIc
Depressiona Charlson scoreb
Diagnosed anxiety- Depressiona
Racec Surgery categorya
Reconstruction (oncoplastic)c Reconstruction (implant)a
PM Diagnosed anxietyc   BMIb
Depressiona
Diagnosed anxietyc
PT Charlson scorea Charlson scoreb BMIc
Surgery categoryb Surgery classb Depressionb
Genderc
Diagnosed anxietyc

Signif. codes: a0.001, b0.01, c0.05. For two numeric variables, correlations larger than 0.6 are shown with '='.

Table 3. Significant study variables for both AH and VHA for regression analysis. Treatments such as breast cancer excision surgery, reconstruction surgery, radiation therapy, hormone therapy, and other surgery at different time points relative to the outcome points need to be controlled for, as they could be associated with our outcomes of interest. Other surgery-related factors such as post-operative pain and pre-operative pain were also included in the models. Here surgery classes 1, 2, and 3 refer to mild, moderate, and severe, respectively. Linear and logistic regression were used for numeric and binary outcomes, respectively.

  Discharge day 30 d 90 d 365 d
AH
LOS Charlson scorec
Surgery category (2)a
Reconstruction (autologous)a
Pain Ageb Agec Race (Asian)c Race (Black)c
Race (unknown)c Reconstruction (implant)c Race (Black)a Race (other)c
Reconstruction (implant)a
Analgesic   Racec Charlson scoreb
PM N/A N/A N/A
PT Agea Racec Ageb
Anxietyb Surgery category (2)c
Reconstruction (implant)c
VHA
LOS BMIc
Genderc
Surgery category (2)a
Surgery category (3)a
Race (Black)b
Race (Pacific Islander)c
Reconstruction (autologous)b
Reconstruction (implant)c
Pain Agea   Agec Agea
Charlson score-   Charlson scorec Depressionc
Depression-   Depression- Racec
Genderb   Surgery category (2)c  
Surgery category (2)a   Racea
Surgery category (3)b Reconstruction (implant)a   Reconstruction (autologous)c
Analgesic   Depression statusb  
Diagnosed anxietyc
PM N/A N/A BMIb
PT   Depression statusc BMIc
Surgery category (2)c Genderb

Signif. codes: a0.001, b0.01, c0.05.

5. Challenges in secondary use of retrospective EMR studies

In this section, we identify challenges that are part of the life cycle for engaging in any EMR study. We describe the problems both in general and with respect to our specific post-mastectomy pain case study described previously.

5.1. Discrepancies in data availability directly impact experimental design decisions

AH is a tertiary care center, and is therefore more likely to have patients come for treatment and receive postoperative care elsewhere. This results in a scenario in which many patients may be lost to follow-up. On the other hand, VA data represent a national system, including approximately 153 hospital facilities and 788 community-based outpatient clinics [15], in which patients going to multiple VA hospital clinics will still be captured by the system. Furthermore, pain (one of our defining filters) is not always uniformly recorded across visits (pain scores were not always collected, and when collected they were collected at different frequencies). In an attempt to capture long-term postoperative pain scores (i.e. 1 year follow-up), the time window of assessment had to be larger, ranging from 281 to 449 d after surgery. In AH, the population dropped off from 2277 patients at discharge to 64, 41, and 32% at the 30 d, 3 month, and 1 year follow-ups; in contrast, in the VHA, the drop-off from discharge (2429 patients) was to 74, 54, and 50%, respectively. Thus, despite such a lenient window, the rates of loss to follow-up were substantial, and only 32% had pain information recorded during the 1 year follow-up period.

Table 4 shows the raw number of patients who visited after surgery, prior to filtering by pain information available at different time points. The drop-off in patients is steeper for AH compared to the VHA; meanwhile, the percentage of patients with pain information is higher in the VHA. The pattern of returning patients may bias the resulting population toward people who had worse problems at both locations. VHA pain data, because of its larger care system and mandated collection of pain as a vital sign, showed substantially less dropout using the same time specifications.

Table 4. Frequency of patients with visits and pain scores, stratified by 90 d intervals following surgery.

  Days following surgery
1–90 91–180 181–270 271–360 361–450 451–540 541–630 631–720
AH (N  =  2378)a
Patients with visits, n (%) 2378 (100) 2107 (89) 2040 (86) 1892 (80) 1771 (74) 1538 (65) 1389 (58) 1229 (52)
Patients with a pain score, n (%) 2323 (98) 1719 (82) 1563 (77) 1278 (68) 1088 (61) 886 (58) 803 (58) 715 (58)
VHA (N  =  3255)a
Patients with visits, n (%) 3255 (100) 3130 (96) 3031 (93) 2973 (91) 2896 (89) 2776 (85) 2675 (82) 2514 (77)
Patients with a pain score, n (%) 3168 (97) 2783 (89) 2614 (86) 2538 (85) 2405 (83) 2264 (82) 2139 (80) 2011 (80)

aVisits are measured with respect to the first time period (between 1 d after surgery and 90 d); patients with a pain score are measured with respect to the number of patients with a visit within the given time period. Population numbers represent numbers prior to filtering by requiring a pain score at discharge and at subsequent time points.

5.2. Multi-institutional involvement drives study design to the lowest common (and less ideal) granularity

While AH is a single private institution, the VHA CDW contains national data already transformed from individual centers around the country, so some data granularity may be lost in data processing. Biases arise from accommodating differences in practice across institutions. For example, while 0–10 numeric pain scores were available in both AH and VHA, there were differences in the collection methods. AH pain score data had much more granularity, including location, descriptions, alleviating and aggravating factors, etc.

To normalize against the substantially less detailed VHA CDW data, anatomic locations of pain scores were dropped from the AH data. Therefore, it is possible (especially for time points further out) that the pain measured from the EMR was not in fact due to the breast excision or reconstruction surgery. These constraints had two consequences: (1) noisier time windows for later time points and (2) unspecified pain measurements.

5.3. Methods of billing will affect data completeness

Furthermore, differences in payment models are reflected in the records. For example, AH is paid for the care it provides (and thus has incentives to maintain coding systems that support its revenue). The VHA is capitated, and there is no incentive to code effectively; VHA physicians therefore tend to rely more on data in notes than on structured codes. For our cohort, AH patients had a maximum, minimum, and average of 552, 16, and 141 ICD-9 codes over their entire records; the corresponding values were 355, 1, and 96 in the VHA.
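A minimal sketch of this coding-density comparison is given below, assuming a pandas DataFrame `diagnoses` with one row per recorded ICD-9 code per patient; the schema is hypothetical.

```python
# Sketch of per-record coding density, assuming a hypothetical `diagnoses` table.
import pandas as pd

def icd_code_stats(diagnoses: pd.DataFrame):
    counts = diagnoses.groupby("patient_id")["icd_code"].count()
    # e.g. 552, 16, and 141 at AH versus 355, 1, and 96 in the VHA
    return counts.max(), counts.min(), counts.mean()
```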

5.4. Study variables are defined on inherently imperfect EMR element proxies which may be designed arbitrarily and may be institution-dependent

Despite the use of common vocabularies such as ICD-9, ICD-10, and CPT codes, there remain many different ways in which the same information, e.g. chemotherapy, can be identified. The chief reason for this is that EMR data are imperfect proxies for the events we wish to study. An event can be identified using different methods, which have different accuracies depending on the institution. Even for the same concept, e.g. diagnosis of anxiety, there may be different definitions available [13, 16]. In this study, chemotherapy was defined using CPT codes alone; however, it could also be defined using chemotherapy medications alone, or both in conjunction. In table 5, we show that different definitions can lead to different frequency counts. These minor differences may propagate to differences in analytical conclusions. For example, in our bivariate analysis, depending on the outcome examined, e.g. LOS, discharge pain, or 30 d, 3 month, or 1 year pain, different chemotherapy-related confounding variables were found to be significant, and at different levels. For discharge pain, whether or not the patient received chemotherapy between 1 year and 3 months prior to surgery and whether or not the patient received chemotherapy within 1 month prior to surgery were significant using both the CPT- and the medication-prescription-based definitions; however, for the former variable the significance levels differed between definitions, one reaching only an alpha level of 0.1 in contrast to 0.05 for the other. For 30 d pain, whether or not the patient received chemotherapy immediately after surgery was significant within 0.05 alpha levels for CPT codes, but only within 0.15 alpha levels for medications; and only the medication-based definition showed 0.05 significance for concurrent chemotherapy within the 30 d window.

Table 5. Frequency of patients receiving chemotherapy using CPT versus medication prescription for chemotherapy identification in AH.

  CPT Medication prescription
>1 year prior to surgery 12 18
1 year to 3 months prior to operation day 169 173
3 months prior to and up to operation day 172 172
Operation day   2
Operation day to 3 months prior to time-point 375 374
3 to 1 month prior to time-pointa 139 126
1 month prior to time-pointa 109 88
Time-pointa after surgery 119 107

aTime-point was set to 365 (±84) d.
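A sketch of how the two chemotherapy definitions compared in table 5 might be implemented is given below; the CPT codes shown are generic chemotherapy administration codes used purely as examples, not the study's code list, and the column names are hypothetical.

```python
# Illustrative comparison of CPT-based versus medication-based chemotherapy
# identification; code lists and column names are placeholders.
import pandas as pd

CHEMO_CPT = {"96413", "96415", "96417"}        # example codes, not the study's list

def chemo_dates_by_cpt(procedures: pd.DataFrame) -> pd.Series:
    return procedures.loc[procedures["cpt_code"].isin(CHEMO_CPT), "procedure_date"]

def chemo_dates_by_medication(orders: pd.DataFrame, chemo_drug_ids: set) -> pd.Series:
    return orders.loc[orders["drug_id"].isin(chemo_drug_ids), "order_date"]

def had_chemo_in_window(dates: pd.Series, start, end) -> bool:
    """Flag whether any identified chemotherapy event falls in [start, end)."""
    return bool(((dates >= start) & (dates < end)).any())
```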

Another challenge is that some variables may be institution-dependent, which introduces room for data mapping issues. For example, for this study we identified use of PT and PM services as visits to particular departments within the healthcare setting. This required manual review of institution-dependent coding. The closest corresponding data in the VHA were stop codes, which are classifications of visit type assigned after the time of the visit. In another example, identification of medication-related variables such as hormone therapy and analgesics required using two separate vocabularies (RxNorm in AH and NDF-RT in the VHA), which are not easily mappable [17].

5.5. Missingness and scarcity of observable adverse outcomes biases findings

In order to compare how certain processes are correlated with good and bad outcomes, the various procedures and outcomes must be recorded for a fair comparison. However, patients may obtain their health care services from multiple locations, e.g. a patient may receive surgery at one institution but continue care at a local, unconnected clinic. Therefore, the information recorded in one EMR may not be complete or accurate. Furthermore, health care is a service that is typically sought out when there are health concerns, which leads to data representation problems. For example, we can only capture pain scores or use of PT if patients request care; healthy patients are unlikely to return to the hospital for unnecessary care. For our outcomes, the percentage of patients with PT or PM visits was lower than 25% and average pain scores were lower than 3 (supplemental table 9). Thus, there is an inherent bias in conclusions derived from EMR studies, as they often represent sicker patients.

Besides challenges due to problems in follow-up, outcomes are difficult to measure for several reasons. First, patients do not always report their issues (patients may not seek help for pain). Second, some outcomes may be subjective, and between-patient variation can be significant. Third, even if reported by the patient, problems may not be recorded in the EMR. If recorded, they may not be in structured form, and therefore sophisticated data mining algorithms may be necessary to utilize these data. Finally, another challenge is that the number of patients with severe issues is typically much smaller than the number of patients with 'no problems'. In our study, in all time periods, pain score measurements were concentrated towards low pain scores, as indicated by the percentile values (supplemental table 8). For all time periods, the 75th percentiles are at a pain score of at most 4. Meanwhile, use of PM and PT is relatively sparsely populated, especially PM use, all less than 23% for both institutions.

5.6. Model choice and data idiosyncrasies can create arbitrary constructs of clinical significance

The differences in results between the bivariate and regression analyses and between the AH and VA populations are indicative of some very challenging issues faced by the clinical sciences. The first is the choice of model and the definition of which variables need to be controlled for in the population. For example, bivariate analysis gives a very simple view; however, it does not take into account a host of influencing variables available in the EMR, such as additional treatments after the initial operation. On the other hand, adding many other variables makes statistical inference more difficult; thus, without sufficient representation of the various combinations of treatments, true relationships may be masked by idiosyncrasies in the dataset. The second issue, related to the first, is the influence of the sample population on results. We see that in the VHA population diagnosed depression and anxiety are much more common; meanwhile, the representation of breast cancer among men is higher as a result of the large male population in the VA. These can be true indicators of some connection to the disease, or they can be spurious relationships within the dataset.

The use of multivariate analysis allows for the inclusion of confounding effects such as other treatments and operations. Our choice of simple linear regression for numeric outcomes and binary logistic regression for binary outcomes was based on simplicity and interpretability. While this offers only a first-order approximate picture of how individual study variables may affect outcomes, the complexity of human biology guarantees that there are many non-linear interactions amongst input variables that can affect outcomes and that are difficult to capture in such a simplistic way. Restated, the assumptions of each model (as well as its significance measures) may not be acceptable for every question, and since we do not know the answer, it is impossible to know when they are or are not appropriate. Other models that fit to the data, for example decision trees, higher-order regression models, or regression models with interaction terms, can better capture nonlinearities in the data. Unfortunately, again, it is impossible to know when you are doing the 'right' amount of fitting when you do not know the truth.
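To make this trade-off concrete, the following sketch contrasts a first-order logistic regression with one that adds pairwise interaction terms using statsmodels; the variables and synthetic data are placeholders, not the study's actual model specification.

```python
# Sketch contrasting a plain logistic regression with an interaction-term model.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data (the real study variables come from EMR-derived tables)
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "pm_visit": rng.integers(0, 2, 500),
    "age": rng.normal(55, 12, 500),
    "bmi": rng.normal(28, 6, 500),
    "surgery_class": rng.integers(1, 3, 500),
    "chemo_recent": rng.integers(0, 2, 500),
})

# First-order model: each variable contributes additively on the logit scale
plain = smf.logit("pm_visit ~ age + bmi + C(surgery_class) + chemo_recent", data=df).fit()

# Pairwise interactions ((...) ** 2 in the formula) let, e.g., the effect of
# surgery class depend on recent chemotherapy, but the parameter count grows
# quickly and sparse cells make estimates unstable.
interacted = smf.logit(
    "pm_visit ~ (age + bmi + C(surgery_class) + chemo_recent) ** 2", data=df
).fit()
print(len(plain.params), "vs", len(interacted.params), "parameters")
```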

5.7. Longitudinal studies require controls for confounding events but then create problems of data sparsity

Although understanding long-term treatment outcomes is crucial, longitudinal studies have more confounding variables to control for, as patients undergo various physiological changes and additional treatments over time. Indeed, we found that many of the variables significantly related to the outcomes were confounding variables such as recent or current (within the time frame) chemotherapy, radiation therapy, hormone therapy, reconstruction surgery, and past opioid prescription. However, some variables indicate longer-term pain problems (e.g. chemotherapy, PM, or pain prior to surgery). Effectively, longitudinal studies are riddled with a practical problem: if confounding variables are not included, the study is not realistic; if they are included, the data become sparse, making it even harder to draw conclusions. Table 6 shows the number of patients who had other treatments, e.g. chemotherapy, within the time frame of the study for AH, broken down relative to the operation day. For example, from the first row, 12 people had chemotherapy more than 1 year prior to the operative day. Models not only have to generalize over all the different types of treatments available over time (e.g. each box in the tables), but also have to deal with all permutations of them! Taking a logistic regression model as an example, a plain logistic regression with unprocessed inputs will miss many non-linear relationships if no interaction terms are allowed. However, if interaction terms are allowed, there are combinatorially more variables to reason over, for which it is easy to overfit, and for which there is not enough data to provide enough examples.

Table 6. Number of patients with confounding events relative to the operative day.

  Prior to operative day Operative day Relative to 365 d time window
More than 1 year 1 year to 3 months 3 months to operative day Operative day to 3 months prior to time point 3 months to 1 month prior to time point 1 month prior to and up to time point Time point
Chemotherapy 12 169 172   375 139 109 119
Excision         184 a a a
Hormone therapy 12 24 13 a 384 155 64 268
Analgesic prescription 124 198 144 735 635 140 96 261
PTc 33 26 25 a 128 55 39 88
RTb a a a 10 373 91 20 a
Reconstruction (autologous) a     40 19 a a 23
Reconstruction (implant) a     155 102 34 21 57
Reconstruction (oncoplastic) a a   19 47 21 11 69
Other surgery 46 26 13   173 59 39 111

aLess than 10 patients in cell. bRT: radiation therapy. cPT: physical therapy.

To get a sense of the patients in our selected cohort, we randomly sampled 10 patients from AH and manually reviewed all of their medical notes within their 1 year time periods. Only one patient was not receiving additional revision surgeries or biopsies, or undergoing chemo- or hormone therapy, during their 1 year time period. This manual review suggests that we are selecting patients who tend to have additional medical problems, whether or not they are related to the original breast cancer, a metastatic condition, additional medical procedures, or continuing health issues. While one can attempt to control for this in the multivariate analysis, it is a serious issue for bivariate analysis of long-term outcomes.

6. Discussion

In this study, which compared data on breast cancer surgeries from two healthcare systems, we found that throughout the process of formulating a retrospective clinical experiment there is a myriad of intertwining factors that pose deep challenges to unearthing clinically meaningful insights.

From conception, data availability and institutional differences in practice already impose limitations on experimental design. Specifically, outcomes at later time points may be more biased because of a decrease in visits. Moreover, measurable outcomes and variables are limited by institutional data capturing practices, often determined by the lowest common granularity of data available across multiple institutions, as shown in the identification of pain information. There are practical challenges regarding the transferability of study definitions, even within the same institution, though this is exacerbated with multiple institutions organized under dissimilar data models. This was demonstrated by our experiment using different chemotherapy identification techniques, as well as our discussion of RxNorm versus NDF-RT for medication identification. Furthermore, the generalizability of studies across disparate populations may confound experiments; our AH and VHA populations are good examples of this.

Finally, there are limitations in the ability of analytic methods to accurately model real-world effects. The nature of medicine and the complexity of biology leave deep impressions on the data, which affects the practicality of experimental designs. Some of this was exhibited by the choice of modeling technique between the bivariate and multivariate analyses, which revealed different significant variables per outcome across time. However, our experiments only scratched the surface: even within bivariate analysis, there are different analytical methods, e.g. Fisher's exact and chi-squared tests. For multivariate analysis, we chose linear regression; however, we could have easily chosen higher-order fitting (e.g. polynomial regression) or kernel methods. With many different variables, it is difficult for any algorithm to reason over the data and give clear results, though it is also difficult to reduce the number of variables, as real patient clinical histories are in fact riddled with many confounding, clinically relevant events.

Although many of the challenges discussed would be difficult to eliminate, several trends can mitigate the problems related to population differences and data sparsity: the movement towards shared data repositories, the application of natural language processing, and increasing attention towards user-centered design for software development. All three developments target a strategy of increasing data points (patient numbers) and adding more variety to the sampling. The last additionally has the potential to increase the quality of data capture.

6.1. Moving towards common data repositories and normalized data representation

Efforts are underway to develop shared data repositories as well as data normalization standards [18–21]. These include virtual warehouses that can query multiple institutions using the same data model to build larger cohorts. Such efforts will be critical for obtaining greater cohort population numbers for use in retrospective clinical studies. With a greater number of cases, many combinations of patient treatment and characteristic nuances may be accounted for without exacerbating the data sparsity problem (small numbers can give arbitrary attributions of clinical significance). Of course, pooling data for meaningful research is nontrivial, and before such efforts can take place, problems such as data security and privacy need careful attention.

6.2. Application of natural language processing into the clinical domain

Use of natural language processing (NLP) in biomedicine is increasing in prevalence [22]. As free text is a critical mode of communication among clinical teams, NLP can mitigate the difficulties of data incompleteness. Moreover, this area offers rich new information, as a majority of clinical patient information is locked in text. For example, a clinical note may include information about a patient's history at other institutions or provide detailed and nuanced medical histories. In fact, for several patients who had reconstruction surgeries prior to their first excision surgery, manual review of clinical notes revealed that 4 of 6 had a prior excision surgery that was not recorded in structured data. Though careful quantification of the errors associated with NLP necessitates additional quality assessments, if applied judiciously, NLP may lead to an increase in the quality of data as well as the quantity of data captured.

6.3. Push towards user-centered design for EMR software

Another strategy for maximizing information from data is to decrease noise by increasing the quality of data at the point of capture. However, the burden of entering data generally falls on over-worked clinical staff, whose first priority is to care for their patients. In fact, identified barriers to EMR adoption include perceived disruption to clinical workflow, software adaptability and complexity, as well as training and leadership engagement [23]. As the primary purpose of EMRs in a healthcare organization is billing and record keeping, the result is uneven data capture that is sensitive to local institutional protocol, software usability, and changes in billing practices. There is increased awareness of usability problems and frustrations with EMR software [23, 24]. This is leading to proposals for more user-centric design of EMR software that can shift the burden of making minute, nuanced data input decisions away from clinicians [25, 26]. With EMR software designed with an emphasis on usability, clinicians can focus on the care of patients, with the burden of producing correct documentation eased through better human interfaces. Cleaner and more consistent datasets would reduce data definition transferability problems and, with a greater number of data points, would help mitigate the large variances in modeling when identifying clinically significant variables.

7. Conclusions

This study highlights the importance of secondary use of EMR data for clinical research. The adoption of EMRs has significantly increased the amount of detailed treatment information available today, which would be difficult to obtain using survey data or costly and laborious manual chart review. However, coupled with these data are important variability issues that are magnified in multi-center studies. For example, while multi-center studies can promote generalizable solutions and address population generalizability, they also carry limitations such as defaulting to a lowest common denominator in variable definitions, population differences, and information capture problems. Fortunately, the movement towards collective data repositories and common data models will help mitigate these problems. Moreover, application of NLP, aptly used, will increase the quality and quantity of available data to build larger cohorts. Finally, better user-centric design of EMR software may increase the quality of data at the point of capture.

In conclusion, though EMRs offer opportunities for providing very granular and detailed items regarding a patient's clinical history and care, studies must account for many potential data modeling challenges, as we have highlighted in this work.

Acknowledgments

The authors would like to thank Tina Seto, Karishma Desai, and Yingjie Weng for their advice and contributions. We would also like to acknowledge the helpful advice of those in the Palo Alto Veterans Affairs fellows' work-in-progress group and to Erqi Pollum for clinical input.

This project was supported by the National Cancer Institute of the National Institutes of Health under Award Number R01CA183962. The content is solely the responsibility of the authors and does not necessarily represent the official views of the National Institutes of Health. WY was also supported by the Big Data Scientist Training Enhancement Program (BD-STEP) fellowship. CC was supported by the Department of Veterans Affairs. The views expressed in this article are those of the authors and do not necessarily reflect the position or policy of the Department of Veterans Affairs or the United States government.

Disclosures

No conflicts of interest, financial or otherwise, are declared by the authors.
