Chapter 3 Application to Statistical Genetics
The goals of epidemiology explicitly revolve around studying the incidence, distribution, and control of disease. As one can imagine, the boundaries of the field are highly nebulous. So far, we’ve looked at Hidden Markov Models as mathematical objects devoid of any real context— aside from a toy frog example. This section intends to tie together the mathematical principles of HMMs with their connections to big areas in biostatistical & epidemiological research. We find that, yes, HMMs are cool because they can help save lives, bridge the gaps across survival disparities, and study our bodies.
3.1 Local Ancestry Inference
Local Ancestry Inference (LAI) is an approach in statistical genetics & bioinformatics that aims to study genetic diversity, disease prevalence, or evolutionary development by linking each segment of a genome to some ancestral population1. LAI has been used to study crop plant analysis2, Tilapia breeding3, and the genetic adaptations of cows in East Africa4. However, using LAI in public health studies is a critically important step when studying ethnically diverse cohorts.
Diasporic movement, mass migration, colonial empire, and slavery have profoundly changed the genetic landscape of the human population. Prior to these admixture events, human communities were largely isolated from each other, and over time, these isolated communities genetically diverged into human sub-populations. The differences between sub-populations are largely minuscule. While the differences are mostly nonexistent, generations of independent breeding did allow for the development of genes & mutations that are more common in some sub-populations rather than others5.
NOTE: Ancestry & Race are two different concepts that do not measure the same thing in the same way.
Admixed populations are products of the intense migratory & diasporic events that marked the 14th-19th centuries, which resulted in populations whose genetic ancestries were from 2 or more human sub-populations. We call these ancestrally-mixed populations admixed populations. Notable examples of admixed populations include African Americans, Latines, Ashkenazi Jews, and SWANA peoples. However, admixture is everpresent across most global populations. For example, this plot made by Hodgson et al. shows the admixture proportion across 81 different African populations with colorings for 10, 11, and 12 respective common ancestries 6.
Unfortunately, the violence of diaspora, colonialism, and enslavement also brought the dangers of inheriting potentially dangerous mutations by family, ancestry, or ethnicity. For example, the inheritance of Tay-Sachs, Ellis-van Creveld Syndrome, diabetes, and prostate cancer all occur with strong associations to ancestry7.
So, when studying disease with evidence of some ancestral association, LAI allows public researchers to first map the ancestry of an individual’s genome, before going on to find associations between disease & ancestral lineage.
Local Ancestry
Local Ancestry is formally defined as “the genetic ancestry of an individual at a particular chromosomal location, where an individual can have 0, 1, or 2 copies of an allele derived from each ancestral population.8”
Due to the structure of gene inheritance, we can model local ancestry as a Hidden Markov Model, with the observed! This idea has been formalized throughout the literature on statistical genetics, but we will look at one particular approach of using HMMs in the software HAPMIX to infer local ancestry in admixed populations.
HAPMIX
NOTE: Most content in this section originates from HAPMIX’s official release paper by Price et al.9.
Definitions
- A haplotype is a gene that is solely inherited from one parent and thus one parental population. Haplotypes are generally considered good markers of ancestral lineage.
Assumptions:
- We assume that the admixed population comes from only two ancestral populations
- We have access to reference data derived from unadmixed reference populations
- The unadmixed reference populations are closely related to the true ancestral populations.
Process
At every locus, HAPMIX estimates the likelihood that a “parental” haplotype from an admixed individual comes from a referenced ancestral population.
Loci that are next to each other are more likely to come from the same ancestral lineage10, but that doesn’t hold true after enough positions. So, we say that the number of genes inherited from an ancestral population at the \(j^th\) locus can be strictly predicted by the ancestry of the \({j-1}^{th}\) locus.
We can then use an HMM where the hidden states are the number of ancestral genes inherited \((0,1,2)\) at locus \(j\), and the observations are the likelihoods of observing some mutation at locus \(j\), given the ancestral reference panel.
Using a smoother analysis, similar to the Forward-Backward Algorithm, we can combine these likelihoods to provide a probabilistic estimate of ancestry at each locus.
Visual
Below is a visual made by Price et al. describing the HMM employed in HAPMIX.
- The thick bottom bar represents the true 2-state ancestral composition of an individual, and the dotted line above it represents the ancestral reference panel, from which we compute likelihoods.
- The line line represents a chromosomal segment of an individual, while the black circles represent typed mutations sites that can be distinguished between the ancestries.
- When the yellow line crosses a red or blue line, it is visualizing the selection of an ancestral group, given the individual’s observed chromosome. Notice that when the line crosses through a dot on one particular side, it will choose the side with equal or greater dots that are vertically aligned.
- The yellow line can misclassify the ancestries since this is a probabilistic model. Notice the portion of blue dots above the red bar.
3.1.1 Conclusion
LAI is only one application of HMMs. Variations of HMMs can be used to study the differential use of condoms for HIV prevention11, the race-differences in the progression of HIV for vulnerable, infected, and AIDS positive people between African American & White People12, or even when automatically monitoring the development of Influenza outbreaks13.
This is only a primer (with a de-emphasis on application), but I absolutely suggest reading through the footnote references to so how cool HMMs become when using them to find ways of helping people, treating illness, and more!
https://www.frontiersin.org/articles/10.3389/fgene.2014.00443/full↩︎
http://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1004393↩︎
https://onlinelibrary.wiley.com/doi/abs/10.1002/gepi.21819↩︎
https://journals.plos.org/plosgenetics/article?id=10.1371/journal.pgen.1000519↩︎
due to a concept called Linkage Disequilibrium↩︎
https://link.springer.com/chapter/10.1007/978-3-540-45231-7_48↩︎