clinical documentation. This is the most clinically relevant information, representing
what was clinically important, but is also the most difficult to extract. Narrative text
allows a high degree of expressiveness and flexibility, but this same flexibility makes
it difficult to extract information from the records on a large scale. Researchers have
for years been refining approaches to natural language processing (NLP) to extract
information from narrative text [27], and more recently have been applying this to
records specifically for phenotype extraction [28].
Because the electronic health record holds multiple data types that require
different methods of extraction for phenotype representation, a significant amount
of the research on using EHR data for phenotypes has focused solely on extracting
data from electronic health records. The SHARPn project, for example, was funded
specifically to determine and demonstrate best approaches for extracting data from
EHRs for secondary data analysis [28]. Research in medical language processing
for extracting phenotypic information has grown substantially to become a significant
focus of the field [29]. Multiple initiatives have also emerged that focus on defining
the phenotypes that can be extracted from the different data sources in
EHRs [30, 31]. Usually, the actual extraction algorithm is a set of rules that query
different data sources. For example, a diabetes phenotype extraction
algorithm combines administrative visit data, laboratory results, diagnosis
codes, prescribed medications, and family history data from narrative text [32].
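The rule-based approach described above can be sketched in a few lines of Python. This is only an illustration of the general idea, not the algorithm from [32]: the PatientRecord layout, the medication list, the cutoff values, and the "at least two sources agree" rule are all assumptions chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class PatientRecord:
    """Hypothetical flattened view of one patient's EHR data."""
    patient_id: str
    diagnosis_codes: list = field(default_factory=list)  # e.g. ICD-9 strings
    hba1c_results: list = field(default_factory=list)    # lab values, in percent
    medications: list = field(default_factory=list)      # prescribed drug names
    diabetes_coded_visits: int = 0                        # from administrative visit data
    family_history_mention: bool = False                  # flag set by an NLP step


# Illustrative constants (assumptions for this sketch, not a validated rule set).
DIABETES_CODE_PREFIX = "250"  # ICD-9 code family for diabetes mellitus
DIABETES_MEDS = {"metformin", "insulin", "glipizide"}
HBA1C_CUTOFF = 6.5            # percent
MIN_CODED_VISITS = 2


def evidence(record):
    """Evaluate each data source separately and report which ones fire."""
    return {
        "diagnosis": any(c.startswith(DIABETES_CODE_PREFIX)
                         for c in record.diagnosis_codes),
        "lab": any(v >= HBA1C_CUTOFF for v in record.hba1c_results),
        "medication": any(m.lower() in DIABETES_MEDS
                          for m in record.medications),
        "visits": record.diabetes_coded_visits >= MIN_CODED_VISITS,
        "family_history": record.family_history_mention,
    }


def is_diabetes_case(record, min_sources=2):
    """Label a case when at least `min_sources` data sources agree."""
    return sum(evidence(record).values()) >= min_sources
```

Keeping the per-source results in a dictionary makes it easy to inspect why a given patient was or was not labeled a case, which helps when the rules are validated against manual chart review.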
4.3.2 Performing Genome-Wide Association Studies
As mentioned above, when researchers can successfully extract disease phenotypes
from EHR data, they can use this information to perform GWAS. A GWAS analyzes a
large number of genotypes and matched phenotypes. Genotypes must be derived
from biological samples, so the collection of biospecimens determines which
genotypes are available. Currently, most genotyping is done with chip-based
microarray techniques that identify millions of markers in the genetic code of one
individual, but it is anticipated that future sequencing techniques will provide the
full DNA sequence within a few years. Currently the markers used in the genotype
are single nucleotide polymorphisms, or SNPs. SNPs are single-base changes in the DNA
sequence that occur relatively frequently in the human genome. They typically do
not have a substantial impact on biological processes, but are helpful for marking
genetic variation among individuals. Regardless of the method of genotyping, it is
critical that the sample and genotype be matched to an identified subject, so that a
matching phenotype can be queried from the data in the EHR [13].
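A minimal sketch of why the subject link matters: genotypes and phenotypes arrive from separate sources and can only be tested for association once they are joined on a shared subject identifier. The data layout (dictionaries keyed by subject ID, genotypes stored as minor-allele counts of 0, 1, or 2) and the choice of a simple allelic chi-square test are assumptions for illustration. Real GWAS pipelines add quality control, covariate adjustment, and multiple-testing correction.

```python
import math


def link(genotypes, phenotypes):
    """Keep only subjects present in both sources, keyed by subject ID."""
    shared = genotypes.keys() & phenotypes.keys()
    return {sid: (genotypes[sid], phenotypes[sid]) for sid in shared}


def allelic_chi_square(linked, snp):
    """2x2 allelic test: minor vs. major allele counts in cases vs. controls.

    Returns (chi-square statistic, p-value with 1 degree of freedom).
    """
    # counts[is_case] = [minor allele count, major allele count]
    counts = {True: [0, 0], False: [0, 0]}
    for calls, is_case in linked.values():
        minor = calls.get(snp)
        if minor is None:  # missing genotype call for this subject
            continue
        counts[is_case][0] += minor
        counts[is_case][1] += 2 - minor  # each subject carries two alleles

    a, b = counts[True]
    c, d = counts[False]
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    if denom == 0:  # an empty row or column: no test is possible
        return 0.0, 1.0
    chi2 = n * (a * d - b * c) ** 2 / denom
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value
```

Subjects that appear in only one source are dropped by `link`, which mirrors the requirement above: a genotype without a matching phenotype, or the reverse, contributes nothing to the association test.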
The order of the two tasks (genotyping or phenotyping) is less important, as long
as a genotype is linked to a phenotype for the same patient. The methods of the study
will often dictate which must be done first based on dependencies. In some cases,
the genotype is first collected from all biospecimens in a population of subjects who
also have data in EHRs. Then extraction rules for a specific phenotype of interest
can be developed, validated and used to query that phenotype for the subjects from