Automated identification of type 2 diabetes mellitus: code versus text
University of South Carolina
Vanessa L. Congdon
University of South Carolina - Columbia
Recommended Citation
Congdon, V. L. (2014). AUTOMATED IDENTIFICATION OF TYPE 2 DIABETES MELLITUS: CODE VERSUS TEXT. (Doctoral dissertation). Retrieved from 
This Open Access Dissertation is brought to you for free and open access by Scholar Commons. It has been accepted for inclusion in Theses and Dissertations by an authorized administrator of Scholar Commons.
AUTOMATED IDENTIFICATION OF TYPE 2 DIABETES MELLITUS: 
CODE VERSUS TEXT 
Vanessa L. Congdon 
Bachelor of Science 
Longwood University, 2007 
Submitted in Partial Fulfillment of the Requirements 
For the Degree of Master of Science in Public Health in 
The Norman J. Arnold School of Public Health 
University of South Carolina 
Anwar T. Merchant, Director of Thesis 
Robert Moran, Reader 
Linda J. Hazlett, Reader 
Lacy Ford, Vice Provost and Dean of Graduate Studies 
 Copyright by Vanessa L. Congdon, 2014 
All Rights Reserved 
This work is dedicated to my family and friends. Thank you all for believing in me and 
continually encouraging me to achieve my dreams. 
ACKNOWLEDGEMENTS 
This thesis would not have been possible without the continued support and 
guidance from a number of people. First, I would like to thank my committee chair, Dr. 
Anwar Merchant for his knowledge, guidance, and flexibility to work with me from afar. 
I am also indebted to my thesis committee members, Dr. Linda Hazlett and Dr. Robert 
Moran, for their time, honest critiques, and willingness to guide me through the entire 
process. I would also like to acknowledge my PPRNet mentors, Dr. Steven Ornstein and 
Dr. Ruth Jenkins for providing me with the PPRNet data and for molding me into a true 
My success as a student would not have been possible without the unwavering 
support of my family and friends. A special thanks to my parents for their unconditional 
love and support through my darkest of days during this long process. And lastly, thank 
you to my biggest cheerleader and best friend, Jason, for your limitless support and for 
making every day of this journey a lot more enjoyable. 
Background: The healthcare industry today places growing emphasis on 
demonstrating meaningful use of one's Electronic Health Record (EHR) system. As rates 
of chronic disease, including diabetes mellitus (DM), rise, it has become clear that 
accurate and timely disease surveillance could be greatly improved by utilizing the 
technologies available to clinicians today. As the Centers for Medicare and Medicaid 
Services (CMS) meaningful use incentive program deadlines fast approach, it remains 
unclear whether the program's limited attestation criteria clearly reflect its end goal of 
improving patient care. The objective of this research was to determine the diagnostic 
accuracy of an automated text-based algorithm for identifying patients with diabetes 
mellitus from the longitudinal PPRNet database. 
Methods: The longitudinal PPRNet database is comprised of McKesson's Practice 
Partner, Lytec or Medisoft EHR system users nationwide. The analysis included data 
from the 115 PPRNet practices that submitted their 4th quarter data extract in January 
2014. An unstructured free-text algorithm was used to determine the number of type 2 
diabetics among all active adult patients. This algorithm, which examines unstructured 
free-text data documented within the EHR title lines, was compared with a previously 
established protocol that used a combination of ICD-9 diagnostic codes and/or active 
DM prescriptions. 
Results: The patients identified as having diabetes varied considerably across the 
algorithm comparisons. Using the combination of ICD-9 diagnostic codes and/or active DM 
prescriptions as the comparison method, the resulting sensitivity was 77.8% and specificity 
was 97.2% for the free-text definition. Using diagnostic codes alone as the standard for 
comparison resulted in a much higher sensitivity (99.3%) and a lower specificity (91.9%). 
However, when we compared the free-text definition to the ICD-9 diagnostic codes 
alone, 70% of free-text identified cases were found to be un-coded. 
Conclusions: As EHR use continues to rise, it is crucial that we continue to develop 
ways to accurately translate patient data out of these systems in order to meaningfully 
utilize these powerful technologies. This thesis has helped clarify the need for further 
development of accurate data translation platforms in order to capture each patient's full 
and unique health story as well as for monitoring treatment and outcomes all while 
minimizing physician burden. 
TABLE OF CONTENTS 
DEDICATION 
ACKNOWLEDGEMENTS 
LIST OF TABLES 
LIST OF ABBREVIATIONS 
CHAPTER I – Introduction 
1.1 Statement of the Problem 
1.2 Purpose and Objectives 
1.3 Significance of Research 
CHAPTER II – Literature Review 
2.1 Diabetes Mellitus 
2.2 U.S. Healthcare's Transition to Electronic Health Record Systems 
2.3 Data Structure 
CHAPTER III – Methods 
3.1 Study Design 
3.2 Measurement 
3.3 Statistical Analysis 
CHAPTER IV – Results 
4.1 Sample Characteristics 
4.2 Sample Characteristics of Test Identified Diabetes Mellitus Population 
4.3 Algorithm Evaluation: DM Prevalence, Sensitivity and Specificity 
CHAPTER V – Discussion 
5.1 Strengths of Study 
5.2 Limitations of Study 
5.3 Future Research 
5.4 Conclusions 
LIST OF TABLES 
TABLE 2.1: Description of comparative studies that examine the reliability and validity of EHR derived algorithms for clinical quality measurement 
TABLE 3.1: Drugs for treatment of Type 2 Diabetes Mellitus 
TABLE 4.1: Sample Characteristics of PPRNet Population and Adults with Text-Identified Type 2 Diabetes Mellitus 
TABLE 4.2: 2-year DM Prevalence among All Active Adult Patients in 115 PPRNet Practice Sites by Algorithm 
TABLE 4.3: Sensitivity and Specificity of Unstructured Free-Text Algorithm Using Different Standards of Comparison 
LIST OF ABBREVIATIONS 
CDC: Centers for Disease Control and Prevention 
HITECH: Health Information Technology for Economic and Clinical Health 
PBRN: Primary Care Based Research Network 
PPRNet: Practice Partner Research Network 
1.1 Statement of the Problem 
Diabetes mellitus (DM) is one of the most prevalent, costly, and burdensome 
chronic illnesses in the U.S., with nearly 10% of the entire population diagnosed with 
diabetes and 35% with prediabetes. The American Diabetes Association predicts that as 
many as 1 in 3 Americans will have diabetes by 2050. As Americans become 
increasingly plagued by diabetes, accurate and timely disease surveillance is becoming 
increasingly important for clinicians, clinical researchers, policy makers and health plan 
administrators. Historically, disease surveillance required manual review of paper charts 
or large national surveys, both of which are time-consuming and costly; however, the 
nationwide shift to electronic health records (EHR) provides the potential for a more 
efficient alternative. 
The Health Information Technology for Economic and Clinical Health (HITECH) 
Act passed by the U.S Congress in 2009 is investing billions of dollars in incentives to 
clinicians who can demonstrate meaningful use of their EHR systems over the next 
several years. This act was set into motion with hopes of molding EHRs from data 
graveyards into data warehouses. Ideally, these warehouses will contain extractable, 
secure, comprehensive, and standardized health information. Meaningful use 
includes both a core set and a menu set of objectives that are specific to eligible 
providers, hospitals and critical access hospitals (CAH). There are a total of 24 
meaningful use objectives for eligible providers, and 23 objectives for eligible hospitals 
and CAHs. To qualify for an incentive payment, 19 of these 24 or 18 of the 23 objectives 
must be met. Due to the significant requirements for meaningful use attestation, the 
program is divided into 3 stages for qualification. In the first stage of participation, 
providers must demonstrate meaningful use for a 90-day EHR reporting period; in 
subsequent stages, providers will demonstrate meaningful use for a full year EHR 
reporting period. Programs are not required to demonstrate meaningful use in consecutive 
years; however, there are deadlines for attesting to each stage. All hospitals and practices 
that choose not to participate in the program will face reductions in Medicare 
reimbursement rates. 
The overarching goal of this meaningful use incentive program is to push the 
U.S. health care system to exploit and expand health information technology; however, 
this major overhaul presents many challenges to all parties involved. As the deadlines for 
qualifying as a stage 2 meaningful use vendor quickly approach, EHR software 
companies struggle to keep up, preventing proper usability assessments during 
development. A certified stage 2 meaningful use EHR vendor must enable providers 
to record data in a structured format, allowing data to be more easily retrieved and 
transferred, with hopes of optimizing health technology to improve patient care. 
Meanwhile, practitioners continue to struggle with insufficient interfaces, and 
clinical researchers suffer from a lack of standardized terminologies, yet both have little 
say in future system developments. EHRs contain two types of data: structured, 
coded data and unstructured, free-text data. Both types of data contain important 
information about the patient's unique health story. Many providers find that entering 
standardized data, rather than free text, takes more time and effort. Some feel that current 
software is lacking in standardized matches for many common chronic conditions. 
West et al. highlighted that the fragmentation of the U.S. healthcare system hinders chronic 
disease management as well as longitudinal research on these diseased populations. 
Because patients see multiple providers in their lifetime, tracking a patient's care remains 
extremely difficult. Researchers advise further validation of electronic database 
extraction techniques before using them to assess quality of care. 
Diabetes surveillance remains a top priority of the CDC, which developed and 
maintains the world's first diabetes surveillance system. These surveillance data rely on 
national and state-based household, telephone, and hospital-based surveys and vital 
statistics to monitor diabetes trends. In collaboration with the NIH, the CDC has also 
initiated the SEARCH for Diabetes in Youth study, the largest major surveillance system 
to quantify and track the diabetes burden in Americans under 20 years of age. The 
SEARCH study provides population-based information on the underlying factors, trends, 
impact and level of care provided as well as allows researchers to clarify the degree to 
which type 2 diabetes is affecting youth of different racial and ethnic backgrounds. 
Overall, the CDC's surveillance data are used to understand the diabetes epidemic, identify 
vulnerable at-risk populations, set prevention objectives and monitor successes of 
programs over time, all at the national level. 
1.2 Purpose and Objectives 
The purpose of this thesis is to optimize methods for identification of patients 
with type 2 diabetes mellitus (DM) from de-identified EHRs of primary care practices in 
the Practice Partner Research Network (PPRNet). PPRNet is a practice-based research 
network (PBRN) that was established in 1995 as a collaborative effort between the 
Department of Family Medicine at the Medical University of South Carolina (MUSC), 
McKesson in Seattle, WA, and participating primary care or internal medicine practices 
nationwide. The PPRNet database contains historical clinical data from 1987 through 
2013 from 340 practices and more than 5 million patients. Currently PPRNet has 151 
active member practices who electronically submit quarterly data extracts to PPRNet for 
aggregation and analysis. 
Our structured coded-data algorithm used for comparison was developed from the 
previously established definition that Miller et al. used in 2004 to auto-identify DM 
patients in the Department of Veterans Affairs database to calculate best estimates of DM 
prevalence and incidence rates. Our unstructured text data algorithm uses a 
developed data dictionary based on natural language processing to identify cases of DM 
through evaluation of unstructured text data from the title lines within the EHR. This 
thesis will test the diagnostic accuracy of the unstructured text algorithm in comparison 
with Miller's identification protocol. The specific aims for this thesis are: 
Specific Aim 1: Unstructured text data 
• Identify cases of DM from de-identified EHRs of primary care practices 
participating in PPRNet using developed algorithms, based on natural 
language processing, that evaluate unstructured text data from the title 
lines within the EHR. 
Specific Aim 2: Structured coded data 
• Identify cases of DM from de-identified EHRs of primary care practices 
participating in PPRNet using an algorithm established by Miller et al. that 
assesses ICD-9 codes and diabetes medications from structured diagnostic 
Specific Aim 3: Diagnostic accuracy 
• Compare the unstructured text-based algorithm versus Miller's algorithm that 
assesses ICD-9 codes and diabetes medication prescriptions for identifying 
patients with diabetes. 
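Specific Aim 3 amounts to cross-tabulating two yes/no classifications of the same patients. The sketch below is illustrative only: the function name, the `Accuracy` container and the toy flags are assumptions made for this example, not the thesis's actual analysis code. It treats the code-based algorithm as the reference standard and computes sensitivity and specificity for the text-based algorithm.

```python
from dataclasses import dataclass


@dataclass
class Accuracy:
    sensitivity: float
    specificity: float


def diagnostic_accuracy(test_flags, reference_flags):
    """Compare a test algorithm with a reference standard, patient by patient.

    Both arguments are equal-length sequences of booleans, where True means
    the algorithm classified that patient as having DM.
    """
    if len(test_flags) != len(reference_flags):
        raise ValueError("both algorithms must classify the same patients")

    tp = fp = fn = tn = 0
    for test, ref in zip(test_flags, reference_flags):
        if test and ref:
            tp += 1          # both algorithms say DM
        elif test and not ref:
            fp += 1          # text says DM, reference does not
        elif ref:
            fn += 1          # reference says DM, text misses it
        else:
            tn += 1          # both say no DM

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one reference positive and one negative")

    return Accuracy(sensitivity=tp / (tp + fn), specificity=tn / (tn + fp))


# Toy data (hypothetical): six patients classified by each algorithm.
text_based = [True, True, False, False, True, False]
code_based = [True, True, True, False, False, False]
result = diagnostic_accuracy(text_based, code_based)
print(f"sensitivity={result.sensitivity:.3f}, specificity={result.specificity:.3f}")
```

Swapping which algorithm is passed as `reference_flags` changes both figures, which is why the choice of comparison standard matters when reading the results.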
1.3 Significance of Research 
Specific aims of this thesis will assess the diagnostic accuracy of a new 
unstructured text-based algorithm in comparison to an established structured code-based 
algorithm. Several studies have been conducted to evaluate methods for estimating 
disease prevalence or identifying high-risk patients from structured EHR data, or claims 
data. Much existing research focuses on the use of automated data retrieval strategies to 
assess quality of care, although a study comparing the data documented within structured, 
coded fields with unstructured, narrative fields has yet to be performed. As the goals of 
the meaningful use EHR incentive program continue to propel the U.S healthcare system 
forward at a rapid rate, it's important to evaluate the current system operations in order to 
monitor the impact these changes have on achieving desired long-term outcomes. This 
thesis intends to not only present the diagnostic accuracy of this proposed diagnostic tool, 
but also highlight the fundamental differences between data recorded in structured and 
unstructured formats. 
CHAPTER II – Literature Review 
2.1 Diabetes Mellitus 
The prevalence of type 2 DM in the United States is increasing at a rapid rate, as 
are health care costs and other associated complications. From 1980 to 2011, the 
crude prevalence of diagnosed diabetes rose 176% (from 2.5% to 6.9%). The 
American Diabetes Association (ADA) reported as of March 2013, 25.8 million (8.3%) 
Americans have diabetes, listing 7.0 million of those as undiagnosed. The total annual 
costs attributable to diabetes are estimated to be nearly 245 billion dollars, accounting for 
20% of all health care expenditures in the U.S. Another 79 million Americans have 
prediabetes, of which only 7.3% have been told by their physician. Prediabetes, also 
commonly referred to as impaired glucose tolerance (IGT) or impaired fasting glucose 
(IFG), almost always precedes the development of type 2 diabetes. 
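The 176% figure is a relative increase, not a difference in percentage points. As a quick check using only the two prevalence values quoted above:

```latex
\[
\text{relative increase}
  = \frac{6.9\% - 2.5\%}{2.5\%} \times 100\%
  = \frac{4.4}{2.5} \times 100\%
  = 176\%
\]
```

The absolute change over the same period is 4.4 percentage points.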
While risk factors such as genetics, ethnicity, birth weight and metabolic 
syndrome certainly play a role in the development of diabetes, several controllable 
lifestyle factors, such as one's weight, diet, exercise regimen and smoking status also 
influence a person's probability of acquiring the disease. The ADA reported 85.2% of 
people with type 2 diabetes are overweight or obese. Given the magnitude of this 
problem, the U.S healthcare system needs accurate, automated data retrieval methods to 
estimate and monitor its prevalence and evaluate the quality of care. 
2.2 U.S. Healthcare's Transition to Electronic Health Record Systems 
Many large institutions nationwide have adopted EHR systems, while fewer small 
clinics and primary care practices, which treat a majority of Americans, have integrated 
health information technology (HIT) into their practices. Among these early adopters, 
few properly utilized advanced features such as clinical decision support, point of care 
alerts, patient activation, and overdue service reminder letter generation. While 
clinical decision support has been shown to improve measures such as preventive care 
screening rates among primary care doctors, an unintended inverse effect, alert fatigue, 
has surfaced when alerts are used too frequently. A lack of standard data definitions and 
interoperability hinders nationwide implementation of comprehensive Personal Health 
Records (PHR), highlighting the urgent need for clinical informatics. These patient 
portals are currently utilized by less than 1% of the U.S. population. The healthcare 
system recognizes the potential these portals could have for stimulating patient 
engagement. This platform would allow patients access to their personal health 
information, as well as educational material and tools, empowering them to become 
active participants in the management of their own health. 
The U.S. Congress enacted the Health Information Technology for Economic and 
Clinical Health (HITECH) Act as part of the American Recovery and Reinvestment Act 
of 2009 to allow the Centers for Medicare and Medicaid Services to provide incentives to 
clinicians and hospitals who demonstrate meaningful use of their EHR system. The 
requirements for participation gradually increase throughout the three stages, qualifying 
providers that attest to each stage with significant incentive payments, and penalizing 
those that don't successfully attest to stage two requirements at least three months before 
the end of the 2014 payment year. 
2.2.1 Electronic Health Records and Quality Clinical Care and Measurement 
As clinicians across the country strive to earn these meaningful use incentives, 
greater emphasis has been placed on the validity of current EHR-derived clinical quality 
measures. Although the potential rewards are enormous, the accompanying challenges 
should not be underestimated. Historically, clinical researchers, health plan 
administrators and policymakers have relied on administrative, claims-based databases, 
and self-report to deduce clinical context, often producing misleading results that 
underestimate quality-of-care measures. Self-report has been shown to 
overestimate diabetes quality of care measures. 
Claims databases were developed to collect insurance payments, not track clinical 
information. Consequently, much relevant health information that is unnecessary for 
processing payments may not be collected or recorded accurately. Pharmacy claims 
often fail to identify chronic conditions like diabetes and hypertension that are being 
controlled by diet alone. The comparison of claims with medical record data 
produced complementary information on diabetes quality of care measures, with 
mixed reliability: agreement was highest for microalbumin testing and lowest for 
eye examination. A later study compared a claims-based strategy and an EHR-based 
method with a manual review reference group in the identification of pharyngitis. 
Overall, a larger proportion of cases were correctly identified by the EHR-based strategy 
than the administrative data-based strategy. The administrative data-based strategy did, 
however, boast a higher specificity than the EHR-based method, emphasizing the need for 
more rigorously defined EMR-based retrieval strategies before utilizing them for quality 
of care measurement. In 2012, Ganz et al. extracted structured coded data on falls in 
the elderly and compared it with manual review. They found that only 54% of falls were 
identified within the coded data, and that much documentation regarding the care 
surrounding each event was recorded in non-structured form. In conclusion, because the 
accuracy of quality of care measures varies greatly between the types of care process being 
evaluated, and each presents unique challenges, future validation studies comparing 
automated algorithms to manual review will be beneficial. 
2.2.2 Chronic Disease identification within the Electronic Health Record 
Accurate chronic disease identification within the EHR is essential to surveillance 
efforts, the development of patient care plans, and clinical research advancements. 
Clinician documentation style remains the essential focus for improvement. Chronic 
disease management often requires the coordination of many physicians. Due to 
incongruent EHR systems, much treatment documentation from specialists fails to be 
entered into the EHR utilized by the patient's primary care providers. Most information 
that is relayed winds up in the free text portion of office notes, which automated searches 
do not detect. Shifting to a more team-based care approach is necessary for 
improved identification and care of chronic illness. 
Strict algorithms for identification also prove to be important. In 2004, a study was 
conducted to estimate DM rates over a three-year period within the Department of 
Veterans Affairs DEpic electronic database. This study compared varying combinations of 
EHR-derived DM criteria to self-reported DM cases. The algorithm with the highest 
sensitivity (93%) and specificity (98%) used DM medication prescription records in the 
current year and/or 2 or more diabetes codes from inpatient and/or outpatient visits (VA 
and Medicare) over a 24-month period. When similar algorithms were applied to claims 
databases in 2006, Solberg et al. reported final positive predictive values (PPV) between 
0.965 and 1.0. All algorithms were tested on a small sample population and then 
adapted, producing a final algorithm with the following inclusion criteria: 2 or more 
outpatient or 1 inpatient ICD-9 codes for diabetes within one year, or a filled prescription 
for diabetes-specific medication in the same calendar year. After initial chart review, 
metformin was found to be used to treat other conditions, such as polycystic ovary 
syndrome, infertility and reactive hyperglycemia, and was removed as a diabetes-specific 
medication from the final algorithm. 
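The claims-based inclusion criteria described above can be written as a single rule. The following is a minimal sketch under stated assumptions, not the published implementation: the function and parameter names are hypothetical, "within one year" is read here as the same calendar year, and the caller is assumed to pass only visits that already carry a diabetes ICD-9 code.

```python
from datetime import date

# Assumption for this sketch: metformin is the only drug excluded as
# non-specific, mirroring the adaptation described in the text.
NON_SPECIFIC_DRUGS = {"metformin"}


def meets_claims_criteria(outpatient_dx, inpatient_dx, rx_fills, year):
    """Return True if a patient meets the claims-style DM criteria for `year`.

    outpatient_dx, inpatient_dx: dates of visits with a diabetes ICD-9 code.
    rx_fills: (fill_date, drug_name) pairs for filled diabetes prescriptions.
    """
    outpatient = sum(1 for d in outpatient_dx if d.year == year)
    inpatient = sum(1 for d in inpatient_dx if d.year == year)
    specific_fill = any(
        fill_date.year == year and drug.lower() not in NON_SPECIFIC_DRUGS
        for fill_date, drug in rx_fills
    )
    # Two outpatient codes, OR one inpatient code, OR one specific prescription.
    return outpatient >= 2 or inpatient >= 1 or specific_fill


# Hypothetical patient: one outpatient code plus a metformin fill only.
print(meets_claims_criteria(
    outpatient_dx=[date(2005, 1, 10)],
    inpatient_dx=[],
    rx_fills=[(date(2005, 3, 1), "Metformin")],
    year=2005,
))
```

Excluding metformin trades some sensitivity for specificity: patients treated with metformin alone are identified only if they also have enough coded visits.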
2.3 Data Structure 
The data contained in an EHR can be classified into one of two types: 
structured, coded data, or unstructured, free-text data. Much recent research has focused 
on comparing the type of data stored in each form and its relation to clinical quality 
measurement. The meaningful use incentive program has identified many of the 
limitations in using unstructured data for these purposes, thus encouraging clinicians to 
document in structured, coded formats in order to attest in both stage 2 and stage 3. 
Many structured fields successfully capture all relevant information needed for some 
quality measures, such as blood pressure recorded in vital signs for hypertension 
measures. However, much of the literature suggests that the completeness of the 
medical records and ease of extractability vary greatly depending on the clinical area of 
focus. The literature referenced in the following sections presents the positive and 
negative attributes of both data types. 
2.3.1 Unstructured Data 
Unstructured, narrative text provides unique insight into the quality of care 
because it represents a provider's thought process, unrestricted by structured 
vocabularies. This extensive narrative data is made valuable through the use of natural 
language processing (NLP). Most challenges in NLP arise in the process of deriving 
meaning from human or natural language input. Although NLP continues to improve, 
recall and precision rates vary significantly between systems. Narrowly and consistently 
defined variables, such as gender, race and test results tend to demonstrate the highest 
rates of both, while variables with multiple definitions remain difficult to capture and 
Studies that have only evaluated structured data fields have regularly stated that 
the algorithms missed recognition because relevant information, such as exclusion 
criteria, was only documented in narrative form. Another study found that their NLP 
system consistently out-performed the use of ICD-9 billing codes in identifying the 
condition of interest. Overall, the condition of interest being evaluated has the 
largest impact on NLP results. 
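Recall and precision, the two rates used above to judge NLP systems, can be illustrated with a toy dictionary-based matcher. This is a sketch only: the term pattern, the sample title lines and their labels are invented for illustration and do not reproduce any data dictionary from the studies discussed.

```python
import re

# Hypothetical term dictionary, compiled as one case-insensitive pattern.
DM_PATTERN = re.compile(
    r"\b(type\s*(2|ii)\s*diabetes|t2dm|niddm|diabetes\s+mellitus)\b",
    re.IGNORECASE,
)


def flag_title(title):
    """Return True if a free-text title line mentions a dictionary term."""
    return bool(DM_PATTERN.search(title))


def precision_recall(predicted, actual):
    """Precision = TP / (TP + FP); recall = TP / (TP + FN)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Invented title lines with invented "true" labels.
titles = [
    "Type 2 diabetes follow-up",            # true case, matched
    "T2DM med refill",                      # true case, matched
    "Family history of diabetes mellitus",  # not a case, but matched
    "Hypertension check",                   # not a case, not matched
    "Adult onset DM",                       # true case, missed by the dictionary
]
actual = [True, True, False, False, True]
predicted = [flag_title(t) for t in titles]
precision, recall = precision_recall(predicted, actual)
print(f"precision={precision:.2f}, recall={recall:.2f}")
```

The family-history line lowers precision and the unlisted abbreviation lowers recall, the two failure modes that make variables with multiple definitions hard to capture.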
Existing literature highlights the limitations associated with manual review, the 
use of administrative data, EHR data structure and format, and extraction procedures. 
One major issue with auto-extracted data stems from under-recording in 
reasonably accessible fields such as medication lists. This type of automated 
recognition software has been applied to discharge summaries, radiology reports, and 
other qualitative data from limited sections of the patient's EHR, with validity 
ranging from low to high. When NLP was used in combination with ICD-9 codes, Zeng et 
al. found that accuracy improved. NLP systems have been shown to accurately identify 
risk factors and diagnostic criteria associated with certain medical conditions. Byrd et al. 
successfully developed NLP algorithms using Framingham criteria for early detection of 
heart failure patients. 
2.3.2 Structured Data 
Structured, coded data allows for interoperability between systems. This type of 
data also makes accurate secondary use easier. Readily available and directly 
analyzable EHR data reduces the need for extensive manual chart review, thus allowing 
for performance measures to be more easily assessed on a larger proportion of patients in 
care. When structured data was compared with full chart review results from the 
Veterans Health Administration's External Peer Review Program (EPRP) on several 
measures, over 80% of the data on these selected measures was found in a directly 
analyzable format within the EHR. While the EPRP data were found to be more 
complete, the correlation of measures between sources was very high (0.89-0.98). 
Much focus has been placed on standardizing EHR output, while very little emphasis, 
until recently, has been aimed at standardizing EHR data inputs. All clinicians are 
initially trained on proper documentation techniques in their EHR training. These 
techniques are often reinforced by quality improvement specialists; however, no 
mechanism within the EHR forces providers to document in a particular location in the 
chart. Intensive training, automatic prompts and proper feedback are necessary in 
standardizing their documentation habits to reflect the care given in EHR-derived quality 
Even standardized data comes with drawbacks. Botsis et al. found much 
inaccuracy within coded data. Oftentimes a non-specific ICD-9 code is selected, such as 
250 for diabetes, when a more accurate diagnosis is actually made at the point of care. 
Inconsistencies within the data also prove to be troublesome, with a record sometimes 
displaying both 250.01 and 250.02 for type-1 and type-2 diabetes respectively. They also 
highlight the lack of contextual information the current ICD-9 coding system supports. 
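Both coding problems described above, non-specific codes and conflicting diabetes types, can be screened for mechanically. The sketch below relies on the ICD-9-CM convention that the fifth digit of a 250.xx code carries the diabetes type (0 and 2: type 2 or unspecified type; 1 and 3: type 1). The function names and return labels are assumptions made for this illustration.

```python
def diabetes_type_from_icd9(code):
    """Classify an ICD-9 code by the diabetes type its fifth digit implies.

    Returns None for non-250 codes, "non-specific" when the fifth digit is
    absent (e.g. "250" or "250.0"), otherwise "type 1" or
    "type 2 or unspecified".
    """
    category, _, decimals = code.strip().partition(".")
    if category != "250":
        return None
    if len(decimals) < 2:
        return "non-specific"
    fifth_digit = decimals[1]
    if fifth_digit in ("1", "3"):
        return "type 1"
    if fifth_digit in ("0", "2"):
        return "type 2 or unspecified"
    return "non-specific"


def has_conflicting_types(codes):
    """True if one patient's codes imply both type 1 and type 2 diabetes."""
    types = {diabetes_type_from_icd9(c) for c in codes}
    return "type 1" in types and "type 2 or unspecified" in types


# Hypothetical problem list for a single patient.
print(has_conflicting_types(["250.01", "250.02", "401.9"]))
```

A record flagged by `has_conflicting_types` would still need manual review, since the code alone cannot say which entry is correct.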
Table 2.1: Description of Comparative Studies that Examine the Reliability and Validity of EHR-Derived Algorithms for Clinical Quality Measurement 
Columns: Citation; Attribute Examined; Study Population; Study Design; Findings 

Baker et al. 
Study population: [...] failure patient with 2 or more clinic visits within the 18 [...] 
Findings: Automated review of the EHR was comparable to manual review for left ventricular ejection fraction (LVEF) measurement (94.6% vs. 97.3%), prescription of beta blockers (90.9% vs. 92.8%), and prescription of ACE inhibitors or ARBs (93.9% vs. 98.7%). Performance was lower for prescription of warfarin for atrial fibrillation (70.4% vs. 93.6%). 

Baldwin et al. 
Attribute examined: Accuracy 
Study population: N = 60; women ≥ 40 years structured convenience sample [...] Health Center in 2001 
Findings: A significant difference between Natural Language Processing (NLP) methods and manual review was found. The NLP method found a false positive rate of 0 and a false negative rate of .035. 

Benin et al. 
Study population: N = 479; possible [...] analyzed using (1) EMR-based, (2) administrative data-based, and (3) manual review reference strategies 
Findings: When comparing each group to the reference, 91% of EMR-based strategy episodes were confirmed and 59% of the administrative data-based strategy. 

Fowles et al. 
Study population: [...] with diabetes, aged [...] Minnesota health maintenance organization 
Study design: Cross-sectional 
Findings: Reliability between primary medical record and claims varied by measure: eye examination (K = 0.371), oral agents (K = 0.699), insulin (K = 0.548), HbA1c (K = 0.678) and microalbumin (K = 0.748). 

Ganz et al. 
Study population: N = 215; falls data [...] initiative in primary care medical groups 
Findings: A structured visit note was found in 54% of charts within 3 months of the date patients had been identified as falling. The reliability of the codable-data algorithm was good (K = 0.61) compared with full medical record review for three care processes. 

Goulet et al. 
Study population: VA patients with [...] 
Findings: Over 80% of the selected measures were found in directly analyzable form within the EMR. The degree of correlation between automated algorithms assessing structured fields in comparison to the Veterans Health Administration's External Peer Review Program (EPRP) was high (0.89-0.98). 

Hivert et al. 
Study population: N = 122,715; active adult patients from [...] practices in eastern [...] 
Findings: Directly measured EHR-defined MetS had 73% sensitivity and 91% specificity. DM incidence was 1.4% in the No MetS group vs. 4% in the At-Risk-for-MetS [...] 

Miller et al. 
Study population: [...] Veterans Affairs patients recorded in the longitudinal, national database [...] 
Findings: The most accurate criterion was a prescription for diabetes medication in the current year and/or 2+ diabetes codes from inpatient and/or outpatient visits (VA and Medicare) over a 24-month period (Se = 93% and Sp = 98%) against patient self-report. 

Owen et al. 
Study population: [...] sample of inpatient and outpatient visits for schizophrenia patients from the [...] Administration database (VistA) 
Findings: The percent agreement between automated algorithms and manual review among patients with chlorpromazine equivalents < 300, 300-1,000, and > 1,000 was .11, .41, and .21, respectively, for inpatients, and .19, .21 and .40 for outpatients. The overall weighted Kappa for inpatients (K = 0.55) and outpatients (K = 0.63). 

Parsons et al. 
Attribute examined: Accuracy; [...] 
Study population: N = 4,081; patient EHR records from [...] 
Findings: The majority of diagnoses for chronic conditions had information documented in the problem list (a structured field) and were recognized by the automated quality measures, including diabetes (>91.4% across measures), hypertension (89.3%), ischemic cardiovascular disease (>78.8% across measures) and dyslipidemia (75.1%). 

Persell et al. 
Study population: N = 1,006; all CAD [...] medicine practice 
Findings: Performance on 7 quality measures varied from 81.6% for lipid measurement to 97.6% for blood pressure measurement. After including free-text data, the adherence rate increased, ranging from 87.5% for lipid measurement and low-density lipoprotein cholesterol to 99.2% for blood pressure measurement. 
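Several studies in Table 2.1 report agreement between two data sources as a kappa (K) statistic. As a reminder of what that number measures, here is a minimal sketch of the unweighted Cohen's kappa for two yes/no sources. The function name and toy data are assumptions made for illustration, and the weighted variant reported by one of the studies is not covered.

```python
def cohens_kappa(source_a, source_b):
    """Unweighted Cohen's kappa for two equal-length sequences of booleans."""
    if len(source_a) != len(source_b) or not source_a:
        raise ValueError("sources must be non-empty and of equal length")

    n = len(source_a)
    # Observed agreement: share of patients on whom the two sources agree.
    observed = sum(1 for a, b in zip(source_a, source_b) if a == b) / n

    # Agreement expected by chance, from each source's own "yes" rate.
    p_a = sum(1 for a in source_a if a) / n
    p_b = sum(1 for b in source_b if b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)

    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 1")
    return (observed - expected) / (1 - expected)


# Toy example (hypothetical): was an eye examination recorded for 10 patients?
medical_record = [True, True, True, False, False, False, True, False, True, False]
claims = [True, True, False, False, False, True, True, False, True, False]
print(round(cohens_kappa(medical_record, claims), 2))
```

Kappa discounts the agreement two sources would reach by chance alone, so it is lower than raw percent agreement whenever chance agreement is above zero.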
3.1 Study Design 
3.1.1 PPRNet 
We used a cross-sectional study of diagnostic accuracy design, analyzing data 
from the longitudinal PPRNet database. PPRNet was established in 1995 as a 
collaborative effort between the Department of Family Medicine at the Medical 
University of South Carolina (MUSC), Practice Partner/McKesson in Seattle, WA and 
participating primary care and internal medicine practices. PPRNet is a practice-based 
research network (PBRN) that strives to improve the quality of healthcare in its member 
practices by turning clinical data into actionable information, empirically testing 
theoretically sound quality improvement interventions, and disseminating successful 
interventions to primary care providers across the country. Currently PPRNet has 151 
physician practices, representing over 1068 health care providers, and approximately 1.4 
million patients located in 38 states. All of PPRNet's member practices currently use 
McKesson's Practice Partner, Lytec or Medisoft's EHR systems. These data are 
extracted and sent to PPRNet on a quarterly basis. Data are then cleaned, appended to the 
longitudinal database and analyzed to produce quality improvement reports on 65 clinical 
quality measures (CQM). These quality measures include ten diabetes mellitus measures 
and track the quality of care on several other common conditions such as cardiovascular 
disease, respiratory disease with other focuses on women's health, cancer screening, 
immunizations, mental health, substance abuse, and medication safety. 
3.1.2 Study population 
The eligible patient population comprised active patients from 115 
PPRNet practices that sent their fourth quarter data extract in January 2014. A patient 
was defined as active if he/she had a visit within 1 year and was not designated with a 
deceased or inactive status. A visit was determined by a progress note title that did not 
include text indicating a cancelled appointment or no show. Similarly, in either 
approach, the recorded data must not be designated with an inactive status or a resolved 
3.1.3 Inclusion and exclusion criteria 
The electronic health records of all active patients ≥ 18 years of age were evaluated 
for an active diagnosis of type 2 diabetes mellitus made within the last 2 years. 
Measurement 
The aims of this study were to assess DM diagnosis in a database of electronic 
medical records using 3 methods: NLP, Miller's protocol, and ICD-9 codes. NLP is a 
newer method that uses an algorithm based on unstructured text data, while the other two 
methods have been used in the past. 
3.2.1 Unstructured text evaluation 
 The unstructured text algorithm utilizes NLP techniques for automated 
identification of diagnoses. We first developed common text variations of DM, including 
full diagnosis names, ICD-9 codes, abbreviations, synonyms, and common misspellings. 
These 341 text string variations were then compared to the free text data, flagging 
possible diagnoses of type 2 DM and suggesting a corresponding ICD-9 code. All 
flagged diagnoses with a frequency of 4 or more were then manually reviewed by a 
research assistant for correctness. Text strings were then either classified as definite 
diagnoses of type 2 DM, or excluded from future analysis. These text string 
classifications were then reviewed by a clinician for accuracy. This review process is 
conducted on a quarterly basis. Each quarter, only new text variations with a frequency 
greater than 3 are flagged for manual review. Currently, the PPRNet database contains 
13,231 text variants included as DM. 
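The flagging step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the PPRNet implementation: the variant list `DM_VARIANTS`, the helper names `normalize` and `flag_for_review`, and the sample title lines are all hypothetical, and simple substring matching stands in for whatever matching logic the production algorithm uses. Only the frequency threshold of 4 comes from the text.

```python
from collections import Counter

# Hypothetical stand-ins for the study's 341 text string variations.
DM_VARIANTS = ["type 2 diabetes", "type ii diabetes", "dm type 2", "t2dm", "niddm", "250.00"]
MIN_FREQUENCY = 4  # strings seen 4 or more times are sent to manual review


def normalize(text):
    """Lowercase and collapse whitespace so trivial variations compare equal."""
    return " ".join(text.lower().split())


def flag_for_review(title_lines, variants=DM_VARIANTS, min_frequency=MIN_FREQUENCY):
    """Return {normalized line: count} for lines that contain a DM variant
    and occur at least `min_frequency` times."""
    counts = Counter(normalize(line) for line in title_lines)
    return {
        line: n
        for line, n in counts.items()
        if n >= min_frequency and any(v in line for v in variants)
    }


# Hypothetical free-text title lines.
title_lines = (
    ["Type 2 Diabetes, uncontrolled"] * 5
    + ["DM type 2 w/ neuropathy"] * 4
    + ["T2DM - new dx"] * 2   # contains a variant but is below the threshold
    + ["Hypertension"] * 10   # frequent but contains no DM variant
)
flagged = flag_for_review(title_lines)
```

In this sketch the flagged strings would still go to the research assistant and clinician for classification; the code only narrows down what needs human review.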
3.2.2 Structured data evaluation 
The coded, structured data evaluation algorithm we used is based on Miller's 
definition for DM identification in a VA population [Miller 2004]. This criterion 
included a prescription for a diabetes medication in the current year and/or 2 or more 
recorded type 2 diabetes ICD-9 diagnostic codes within a 24-month period. As of 
January 2014, the PPRNet database contained data through December 31, 2013 from 
115 practices. The DM codes included for analysis comprised the following 
ICD-9 codes: 250 (excluding type 1 codes), 357.2, 362.01, 362.02, and 366.41. These were 
extracted from the 4 code fields within the EHR. The medications included for DM 
treatment were taken from the most current Treatment Guidelines from The Medical 
Letter. The DM medications included in the analysis are listed in Table 3.1. 
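As a rough sketch of how this structured-data criterion could be applied to one patient, consider the Python below. It rests on several assumptions that are not in the text: the function names `is_dm_code` and `meets_miller_criterion` are hypothetical, "current year" is approximated as the 365 days before the extract date, the 24-month window as 730 days, and type 1 codes are taken to be 250.x1 and 250.x3 following the ICD-9-CM fifth-digit convention. The study itself was run in SAS, so this is not the authors' code.

```python
from datetime import date, timedelta

# Non-250 ICD-9 codes listed in the text.
OTHER_DM_CODES = {"357.2", "362.01", "362.02", "366.41"}


def is_dm_code(code):
    """True for the ICD-9 codes counted as type 2 DM in this sketch."""
    if code in OTHER_DM_CODES:
        return True
    if code.startswith("250"):
        # Assumption: fifth digits 1 and 3 (250.x1, 250.x3) denote type 1 diabetes.
        return not (len(code) == 6 and code[-1] in "13")
    return False


def meets_miller_criterion(rx_dates, dx_records, as_of):
    """rx_dates: dates of diabetes medication prescriptions.
    dx_records: (date, icd9_code) pairs from the diagnosis code fields.
    Returns True if there is a prescription in the last 365 days
    and/or 2 or more qualifying codes in the last 730 days."""
    rx_start = as_of - timedelta(days=365)
    dx_start = as_of - timedelta(days=730)
    has_rx = any(rx_start <= d <= as_of for d in rx_dates)
    n_codes = sum(1 for d, code in dx_records if dx_start <= d <= as_of and is_dm_code(code))
    return has_rx or n_codes >= 2


extract_date = date(2013, 12, 31)
by_rx = meets_miller_criterion([date(2013, 6, 1)], [], extract_date)
by_codes = meets_miller_criterion(
    [], [(date(2012, 3, 1), "250.00"), (date(2013, 5, 1), "250.02")], extract_date
)
type1_only = meets_miller_criterion(
    [], [(date(2013, 5, 1), "250.01"), (date(2013, 6, 1), "250.03")], extract_date
)
```

Because the rule uses "and/or", a single qualifying prescription is enough on its own, which is why drugs with non-diabetes indications can produce false positives under this definition.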
 
Table 3.1: Drugs for Treatment of Type 2 Diabetes Mellitus 
Formulation 
Biguanide 
500,850,1000 mg tabs 
Glucophage 
500,850,1000 mg tabs 
 extended- release – generic 
500, 750 mg tabs 
Glucophage XR 
500, 750 mg tabs 
500, 1000 mg tabs 
Fortamet  
500, 1000 mg tabs 
Riomet- liquid 
500 mg/ 5 mL (4, 16 oz) 
Second- Generation Sulfonylureas 
Glimepiride – generic 
Glipizide – generic 
Glucotrol 
 extended- release – generic 
2.5, 5, 10 mg tabs 
Glucotrol XL 
Glyburide – generic 
1.25, 1.5, 2.5, 3, 5, 6 mg tabs 
1.25, 2.5, 5 mg tabs 
Micronase 
1.25, 2.5, 5 mg tabs 
 micronized tablets – generic 
1.5, 3, 4.5, 6 mg tabs 
Glynase Prestab 
1.5, 3, 6 mg tabs 
Non-Sulfonylurea Secretagogues 
Nateglinide – generic 
Repaglinide -- 
Prandin 
0.5, 1, 2 mg tabs 
Pioglitazone – 
Actos 
15, 30, 45 mg tabs 
Rosiglitazone -- 
Avandia 
Alpha-Glucosidase Inhibitors 
Acarbose – generic 
25, 50, 100 mg tabs 
25, 50, 100 mg tabs 
25, 50, 100 mg tabs 
DPP-4 Inhibitors 
Sitagliptin -- 
Januvia 
25, 50, 100 mg tabs 
Saxagliptin -- 
Onglyza 
Linagliptin -- 
Tradjenta 
GLP-1 Agonists 
Exenatide – 
Byetta 
250 mcg/mL (1.2, 2.4 mL 
Liraglutide – 
Victoza 
6 mg/mL (3 mL prefilled pen) 
Colesevelam – 
Welchol 
Bromocriptine – 
Cycloset 
Pramlintide -- 
Symlin 
1000 mcg/mL (1.5, 2.7 mL 
Combination Products 
Metformin/glipizide – generic 
Metformin/glyburide 
Glucovance 
Metformin/pioglitazone 
500/15, 850/15 mg tabs 
Actoplus Met 
500/15, 850/15 mg tabs 
Actoplus Met XR 
1000/15, 1000/30 mg tabs 
Metformin/repaglinide – 
Prandimet 
500/1, 500/2 mg tabs 
Metformin/rosiglitazone – Avandamet 
500/2, 500/4, 1000/2, 1000/4 
Glimepiride/rosiglitazone – 
Avandaryl 
1/4, 2/4, 4/4, 2/8, 4/8 mg tabs 
Glimepiride/pioglitazone – 
Duetact 
2/30, 4/30 mg tabs 
Metformin/sitagliptin --
 Janumet 
500/50, 1000/50 mg tabs 
Metformin/saxagliptin -- 
Kombiglyze 
500/5, 1000/2.5, 1000/5 mg 
Statistical analysis 
Statistical analysis was performed using SAS software version 9.2 (SAS Institute, 
Cary, NC). The number of type 2 DM cases was calculated using both algorithms 
(described above), as well as an algorithm that evaluated ICD-9 diagnostic codes, alone. 
The accuracy of the unstructured text algorithm was compared to Miller's approach as 
well as the ICD-9 diagnostic code algorithm by calculating sensitivity and specificity. 
The unstructured text algorithm was used to calculate the 2-year prevalence of DM in 
PPRNet. Rates are presented overall and in population subsets defined by patient 
characteristics (age, sex, and body mass index [BMI]) and by practice type 
(internal medicine, family practice, a mix of both, multi-specialty, or "other"). 
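For readers who want to see the accuracy calculation spelled out, the following Python sketch computes sensitivity and specificity of one algorithm against a chosen standard of comparison. The function name `sensitivity_specificity` and the ten-patient data are hypothetical; the study's analysis was performed in SAS.

```python
def sensitivity_specificity(test_flags, reference_flags):
    """Return (sensitivity, specificity) of test_flags against reference_flags.

    Both arguments are equal-length sequences of booleans, one entry per patient.
    """
    if len(test_flags) != len(reference_flags):
        raise ValueError("both sequences must cover the same patients")
    tp = fp = fn = tn = 0
    for test, ref in zip(test_flags, reference_flags):
        if ref and test:
            tp += 1   # case by both definitions
        elif ref:
            fn += 1   # case by the standard only
        elif test:
            fp += 1   # case by the tested algorithm only
        else:
            tn += 1   # non-case by both definitions
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("the standard must contain both cases and non-cases")
    return tp / (tp + fn), tn / (tn + fp)


# Ten hypothetical patients; True means the definition flags type 2 DM.
text_algorithm = [True, True, True, False, False, False, False, False, True, False]
reference = [True, True, True, True, False, False, False, False, False, False]
sens, spec = sensitivity_specificity(text_algorithm, reference)
```

Swapping the second argument changes the standard of comparison, which is how the same free-text results can be scored once against Miller's definition and once against ICD-9 codes alone.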
Sample Characteristics 
There were a total of 368,384 active adult patients among the 115 practices who 
sent their 4th quarter data extracts to PPRNet in January 2014 (Table 4.1). More than half 
of the population was female (57.5%). Within the sample, 36.6% were aged 18-44 years 
old, 18.6% were 45-54 years old, 19.5% were 55-64 years old, 13.9% were 65-74 years 
old, 7.6% were 75-84 years old, and 3.2% were 85-108 years old. Nearly a quarter of the 
population was underweight/normal weight (24.7%), while 29.8% were overweight, and 
38.9% were obese. A majority of PPRNet practices are family practices, accounting for 
70.5% of the patient sample. The majority of remaining patients belong to internal 
medicine practices (17.1%). A small sample of patients belongs to mixed practices made 
up of both family practitioners and internists. Rounding out the sample are multispecialty 
practices (2.6%), and "other" which consists of Rheumatology, Pulmonary, Gynecology, 
Neurology, Urology and Pediatric practices (4.5%). 
Sample Characteristics of Text-identified Diabetes Mellitus Population 
Just over half of adult diabetics are female (51.1%). The percentage of diabetics 
increases with age before leveling off at age 74 and declining thereafter. As expected, 
most of these type-2 diabetics fell in the overweight (23.7%) or obese (63.0%) BMI 
categories. Less than 10% of PPRNet's diabetic patients are underweight (0.8%) or
normal weight (8.6%). The DM patient sample was representative of the full population 
with regard to practice type, as displayed in Table 4.1. 
Algorithm Evaluation: DM Prevalence, Sensitivity and Specificity 
Table 4.2 presents 2-year DM prevalence estimates based on each of the three 
algorithms (detailed description provided above in Section 3.2). Both the unstructured 
free-text algorithm and Miller's algorithm produced the same prevalence (11.1%), while 
the ICD-9 diagnostic code algorithm identified far fewer cases of DM, resulting in a 
prevalence of 3.4%. 
Across the algorithm comparisons, the patients identified as having diabetes 
varied considerably. When we compared the unstructured free-text algorithm to Miller's, 
each protocol found close to 10,000 patients that were missed by the opposing definition. 
Using Miller's protocol as the standard of comparison, the resulting sensitivity was 
77.8% and specificity was 97.2%. However, when we compared the free-text definition 
to the ICD-9 diagnostic codes alone, 70% of free-text identified cases were found to be 
un-coded. Only 86 additional patients had 2 or more recorded ICD-9 diagnostic codes but 
were not identified using the free-text algorithm. All 86 cases identified by the code 
definition alone were due to the low frequency of the corresponding text string. As 
described in detail in the methodology, only those unstructured text diagnoses that occur 
4 or more times within the data are included for review to be counted as a definite 
diagnosis of DM. Using diagnostic codes alone as the standard for comparison resulted 
in a much higher sensitivity (99.3%), and lower specificity (91.9%). 
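Two definitions can yield the same prevalence while identifying different patients. The short Python sketch below illustrates this with set operations; the function name `compare_case_sets`, the patient IDs, and the population size are hypothetical and are not the study's data.

```python
def compare_case_sets(text_ids, reference_ids, n_active):
    """Summarize agreement between two sets of patient IDs flagged as DM cases."""
    text_ids, reference_ids = set(text_ids), set(reference_ids)
    return {
        "both": len(text_ids & reference_ids),
        "text_only": len(text_ids - reference_ids),        # missed by the reference definition
        "reference_only": len(reference_ids - text_ids),   # missed by the text definition
        "text_prevalence_pct": round(100 * len(text_ids) / n_active, 1),
        "reference_prevalence_pct": round(100 * len(reference_ids) / n_active, 1),
    }


# Hypothetical example: 40 active patients, 4 flagged by each definition.
summary = compare_case_sets({1, 2, 3, 4}, {3, 4, 5, 6}, n_active=40)
```

Here both definitions give a prevalence of 10.0%, yet each flags two patients the other misses, which mirrors the pattern reported for the free-text and Miller's algorithms.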
Table 4.1: Sample Characteristics of PPRNet Population and Adults with Text-Identified Type 2 Diabetes Mellitus 
Population: All adult patients (≥18) 
Rows: Overall number and DM prevalence; Age (years); BMI: Underweight (< 18.5), Normal (18.5-25), Overweight (25-30); Practice type: Family Practice/Internal Medicine, Internal Medicine 
Table 4.2: 2-year DM Prevalence among All Active Adult Patients in 115 PPRNet Practice Sites by Algorithm 
Columns: Definition; No. (of 368,384); Prevalence (%) (2012-2013) 
Miller's structured-coded: Active medication prescription and/or 2+ ICD-9 codes recorded within the previous 2 years. Prevalence: 11.1% 
Unstructured free-text: Active text diagnoses recorded in unstructured title lines within previous 2 years. Prevalence: 11.1% 
ICD-9 diagnostic codes: 2+ ICD-9 diagnostic codes recorded within previous 2 years. Prevalence: 3.4% 
 
Table 4.3: Sensitivity and Specificity of Unstructured Free-Text Algorithm Using Different Standards of Comparison
Standard of comparison: Miller's structured-coded. Sensitivity: 77.8%; specificity: 97.2% 
Standard of comparison: ICD-9 diagnostic codes. Sensitivity: 99.3%; specificity: 91.9% 
The first aim of this study was to replicate, in PPRNet, the best definition for 
automated DM identification within EHR data from Miller's 2004 study, which compared 
various definitions for DM identification using the Department of Veterans Affairs 
electronic health record database. We found that although this method identified the same 
overall percentage of diabetic patients as the free-text method, several thousand patients 
had clear evidence of a free-text diagnosis but were missing a corresponding diagnostic 
code and were not on an active prescription for a DM medication. Similarly, close to the 
same number of diabetic patients were identified by Miller's definition alone and not by 
the free-text algorithm. Miller's best definition includes an active prescription for a DM 
medication recorded within the last year, or 2 or more ICD-9 diagnostic codes recorded 
within the last 2 years. One of the main limitations of this definition is that some 
commonly used medications for DM are also prescribed for other conditions. Metformin, for 
example, is the first-line drug of choice for type 2 diabetics who are overweight or obese 
and have normal kidney function, but it is also used in the treatment of polycystic ovary 
syndrome and other diseases where insulin resistance may be an important factor. 
Secondly, this paper aimed to test a newly developed unstructured free-text based 
algorithm for accurate identification of DM cases within an active PPRNet patient 
population. One overarching limitation was our inability to access and manually 
review each individual patient record, which left us with no true gold standard for 
comparison. We chose Miller's definition because it had been found to be quite accurate 
when compared with patient self-report. Using this standard of comparison, the free-text 
definition resulted in a fair sensitivity (77.8%) and very good specificity (97.2%). 
Although we did not manually review each patient record, a trained research assistant 
reviewed each unique text string with a frequency of 4 or more that was flagged by our 
automated DM text string dictionary, which consists of 341 unique text strings. Text 
diagnoses that were unclear were then also reviewed by a physician. While we cannot say 
with certainty that every case of DM identified using the text algorithm is an actual case 
of DM, we are confident that the rate of misclassification is very low because of this 
extensive review. After comparing our 
algorithm with ICD-9 diagnostic codes alone, it also appears that we are missing very 
few coded cases of DM, resulting in a very high sensitivity (99.3%) and specificity 
(91.9%). Several more cases were identified when adding prescriptions for DM to the 
definition, but as we previously stated, we cannot be sure that the medication is being 
used to treat DM. 
Strengths of the Study 
A major strength of this study is the large sample size. This sample represents the 
differing documentation styles of hundreds of physicians nationwide treating hundreds of 
thousands of patients in both urban and rural practice settings. 
Limitations of the Study 
PPRNet has very little variation in practice type and practice size, consisting of 
mostly small to mid-size family practices and internal medicine clinics. Another 
limitation is the fact that all PPRNet practices use one common EHR software product in 
an ever-growing marketplace of products with varying configurations. Lastly, we did not 
compare our free-text based algorithm with a gold standard (physician diagnosis), 
which prevented estimation of its true sensitivity and specificity. However, the development of 
the NLP algorithm is an iterative process. After a query is used to identify diabetes cases, 
a physician reviews the cases that the query identifies for accuracy. The query is then 
modified and the process is repeated. This happens on an ongoing basis. This rather 
efficient NLP algorithm was used to identify cases in this study. 
Future Research 
We recommend that similar studies in the future use databases that contain data 
from several EHR software systems to reduce bias. It would be interesting to replicate 
this study in a more diverse research network; stratifying by practice site characteristics 
such as size, location and specialty as well as provider characteristics such as degree and 
specialty. In looking at both practice and provider characteristics, we could get a better 
understanding of what major factors influence physician EHR documentation styles. It 
would also be useful to obtain patient records for manual chart review to use as a gold 
standard for comparison when testing new algorithms that could potentially aid in a 
variety of arenas, such as population health. In a similarly large research network, one 
could collect a randomized sample of a small percentage of the total population rather 
than manually review the charts of the entire population. 
Conclusions 
Our unstructured free-text evaluation performed quite well in accurately 
identifying Type 2 DM patients within the PPRNet active patient population. As EHR 
use is on the rise, it is crucial that we continue to develop ways to accurately translate 
patient data out of these systems in order to meaningfully utilize these powerful 
technologies. This paper has helped clarify the need for further development of accurate 
data translation platforms in order to capture each patient's full and unique health story as 
well as for monitoring treatment and outcomes all while minimizing physician burden. 
1. FAST FACTS Data and Statistics about Diabetes. In. 3/1/2013 ed: American Diabetes Association; 2013. p. 2.
2. Holmes C. The problem list beyond meaningful use. Part I: The problems with problem lists. J AHIMA 2011;82(2):30-3; quiz 34.
3. Prokosch HU, Ganslandt T. Perspectives for medical informatics. Reusing the electronic medical record for clinical research. Methods Inf Med 2009;48(1):38-44.
4. EHR Incentive Programs: Meaningful Use. In: Centers for Medicare and Medicaid Services; 2013.
5. Lobach DF, Detmer DE. Research challenges for electronic health records. Am J Prev Med 2007;32(5 Suppl):S104-11.
6. Richesson RL, Krischer J. Data standards in clinical research: gaps, overlaps, challenges and future directions. J Am Med Inform Assoc 2007;14(6):687-96.
7. West S, Blake C, Zhiwen L, McKoy J, Oertel M, Carey T. Reflections on the use of electronic health record data for clinical research. Health Informatics Journal 2009;15(2):108-21.
8. Benin AL, Vitkauskas G, Thornquist E, Shapiro ED, Concato J, Aslan M, et al. Validity of using an electronic medical record for assessing quality of care in an outpatient setting. Med Care 2005;43(7):691-8.
9. Miller DR, Safford MM, Pogach LM. Who has diabetes? Best estimates of diabetes prevalence in the Department of Veterans Affairs based on computerized patient data. Diabetes Care 2004;27 Suppl 2:B10-21.
10. Diabetes Data & Trends: Crude and Age-adjusted Percentage of Civilian Non-institutionalized Adults with Diagnosed Diabetes, United States, 1980-2011. In. Atlanta, GA: U.S. Department of Health and Human Services, Centers for Disease Control and Prevention; 2011.
11. Goetz Goldberg D, Kuzel AJ, Feng LB, DeShazo JP, Love LE. EHRs in primary care practices: benefits, challenges, and successful strategies. Am J Manag Care 2012;18(2):e48-54.
12. Greiver M, Barnsley J, Glazier RH, Moineddin R, Harvey BJ. Implementation of electronic medical records: effect on the provision of preventive services in a pay-for-performance environment. Can Fam Physician 2011;57(10):e381-9.
13. O'Connor PJ, Crain AL, Rush WA, Sperl-Hillen JM, Gutenkauf JJ, Duncan JE. Impact of an electronic medical record on diabetes quality of care. Ann Fam Med 2005;3(4):300-6.
14. Harrison MI, Koppel R, Bar-Lev S. Unintended consequences of information technologies in health care--an interactive sociotechnical analysis. J Am Med Inform Assoc 2007;14(5):542-9.
15. DeJesus RS, Angstman KB, Kesman R, Stroebel RJ, Bernard ME, Scheitel SM, et al. Use of a clinical decision support system to increase osteoporosis screening. J Eval Clin Pract 2010;18(1):89-92.
16. Katzan IL, Rudick RA. Time to integrate clinical and research informatics. Sci Transl Med 2012;4(162):162fs41.
17. Tang PC, Lansky D. The missing link: bridging the patient-provider health information gap. Health Aff (Millwood) 2005;24(5):1290-5.
18. Nagykaldi Z, Aspy CB, Chou A, Mold JW. Impact of a Wellness Portal on the delivery of patient-centered preventive care. J Am Board Fam Med 2012;25(2):158-67.
19. Blumenthal D, Tavenner M. The "meaningful use" regulation for electronic health records. New England Journal of Medicine 2010;363(6):501-4.
20. Pawlson LG, Scholle SH, Powers A. Comparison of administrative-only versus administrative plus chart review data for reporting HEDIS hybrid measures. Am J Manag Care 2007;13(10):553-8.
21. Tang PC, Ralston M, Arrigotti MF, Qureshi L, Graham J. Comparison of Methodologies for Calculating Quality Measures Based on Administrative Data versus Clinical Data from an Electronic Health Record System: Implications for Performance Measures. Journal of the American Medical Informatics Association 2007;14(1):10-15.
22. Fowles JB, Rosheim K, Fowler EJ, Craft C, Arrichiello L. The validity of self-reported diabetes quality of care measures. Int J Qual Health Care 1999;11(5):407-12.
23. Rector TS, Wickstrom SL, Shah M, Thomas Greeenlee N, Rheault P, Rogowski J, et al. Specificity and sensitivity of claims-based algorithms for identifying members of Medicare+Choice health plans that have chronic medical conditions. Health Serv Res 2004;39(6 Pt 1):1839-57.
24. Ganz DA, Almeida S, Roth CP, Reuben DB, Wenger NS. Can structured data fields accurately measure quality of care? The example of falls. J Rehabil Res Dev 2012;49(9):1411-20.
25. Roth CP, Lim YW, Pevnick JM, Asch SM, McGlynn EA. The challenge of measuring quality of care from the electronic health record. Am J Med Qual 2009;24(5):385-94.
26. Persell SD, Wright JM, Thompson JA, Kmetik KS, Baker DW. Assessing the validity of national quality measures for coronary artery disease using an electronic health record. Arch Intern Med 2006;166(20):2272-7.
27. Solberg LI, Engebretson KI, Sperl-Hillen JM, Hroscikoski MC, O'Connor PJ. Are claims data accurate enough to identify patients for performance measures or quality improvement? The case of diabetes, heart disease, and depression. Am J Med Qual 2006;21(4):238-45.
28. Borzecki AM, Wong AT, Hickey EC, Ash AS, Berlowitz DR. Can we use automated data to assess quality of hypertension care? Am J Manag Care 2004;10(7 Pt 2):473-9.
29. Weiskopf NG, Hripcsak G, Swaminathan S, Weng C. Defining and measuring completeness of electronic health records for secondary use. J Biomed Inform 2013.
30. Baldwin KB. Evaluating healthcare quality using natural language processing. J Healthc Qual 2008;30(4):24-9.
31. Baker DW, Persell SD, Thompson JA, Soman NS, Burgner KM, Liss D, et al. Automated review of electronic health records to assess quality of care for outpatients with heart failure. Annals of Internal Medicine 2007;146(4):270-7.
32. Pakhomov SS, Hemingway H, Weston SA, Jacobsen SJ, Rodeheffer R, Roger VL. Epidemiology of angina pectoris: role of natural language processing of the medical record. Am Heart J 2007;153(4):666-73.
33. Chan KS, Fowles JB, Weiner JP. Review: electronic health records and the reliability and validity of quality measures: a review of the literature. [Review]. Medical Care Research & Review 2010;67(5):503-27.
34. Parsons A, McCullough C, Wang J, Shih S. Validity of electronic health record-derived quality measurement for performance monitoring. J Am Med Inform Assoc 2011.
35. Tu K, Mitiku T, Lee DS, Guo H, Tu JV. Validation of physician billing and hospitalization data to identify patients with ischemic heart disease using data from the Electronic Medical Record Administrative data Linked Database (EMRALD). Canadian Journal of Cardiology 2010;26(7):e225-8.
36. Owen RR, Thrush CR, Cannon D, Sloan KL, Curran G, Hudson T, et al. Use of electronic medical record data for quality improvement in schizophrenia treatment. J Am Med Inform Assoc 2004;11(5):351-7.
37. Chapman WW, Fizman M, Chapman BE, Haug PJ. A comparison of classification algorithms to automatically identify chest X-ray reports that support pneumonia. J Biomed Inform 2001;34(1):4-14.
38. Denny JC, Peterson JF, Choma NN, Xu H, Miller RA, Bastarache L, et al. Extracting timing and status descriptors for colonoscopy testing from electronic medical records. J Am Med Inform Assoc 2010;17(4):383-8.
39. Hripcsak G, Friedman C, Alderson PO, DuMouchel W, Johnson SB, Clayton PD. Unlocking clinical data from narrative reports: a study of natural language processing. Ann Intern Med 1995;122(9):681-8.
40. Melton GB, Hripcsak G. Automated detection of adverse events using natural language processing of discharge summaries. J Am Med Inform Assoc 2005;12(4):448-57.
41. Zeng QT, Goryachev S, Weiss S, Sordo M, Murphy SN, Lazarus R. Extracting principal diagnosis, co-morbidity and smoking status for asthma research: evaluation of a natural language processing system. BMC Med Inform Decis Mak 2006;6:30.
42. Jain NL, Knirsch CA, Friedman C, Hripcsak G. Identification of suspected tuberculosis patients based on natural language processing of chest radiograph reports. Proc AMIA Annu Fall Symp 1996:542-6.
43. Byrd RJ, Steinhubl SR, Sun J, Ebadollahi S, Stewart WF. Automatic identification of heart failure diagnostic criteria, using text analysis of clinical notes from electronic health records. Int J Med Inform 2013.
44. Goulet JL, Erdos J, Kancir S, Levin FL, Wright SM, Daniels SM, et al. Measuring performance directly using the veterans health administration electronic medical record: a comparison with external peer review. Med Care 2007;45(1):73-9.
45. Botsis T, Hartvigsen G, Chen F, Weng C. Secondary Use of EHR: Data Quality Issues and Informatics Opportunities. AMIA Summits Transl Sci Proc 2010;2010:1-5.
46. Treatment Guidelines from the Medical Letter. The Medical Letter, Inc 
Source: http://scholarcommons.sc.edu/cgi/viewcontent.cgi?article=3812&context=etd