Sebastian Zöllner

University of Michigan, USA

The benefit of coalescent theory in the framework of genetic association studies

Fine mapping of complex trait loci with coalescent methods in large case-control studies
Ziqian Geng, Paul Scheet and Sebastian Zöllner

Case-control studies are widely used to identify genomic regions containing disease variants. However, identifying the underlying risk variants for complex diseases is challenging because of the complicated genetic dependence structure caused by linkage disequilibrium (LD). By modeling the evolutionary process of a target region, coalescent-based approaches improve this identification by using all available haplotype information. Such methods estimate the genealogy at all sites in the region and thus jointly model the probability of carrying risk variants at all loci. From these probabilities we obtain Bayesian confidence intervals (CIs) where true risk variants are most likely to occur. Additionally, the genealogy at each position provides information about the shared ancestry of neighboring sites. Such careful modeling of the shared ancestry of sequences is also beneficial for haplotyping and variant calling in regions of interest (ROIs) where traditional hidden Markov approaches struggle. However, existing coalescent-based methods are computationally demanding and can only be applied to samples of fewer than 200 individuals. Here, we propose a novel approach that overcomes this limitation and can be applied to large-scale studies. First, we infer a set of clusters from the sampled haplotypes so that the haplotypes within each cluster are inherited from a common ancestor. Then, we apply coalescent-based approaches to approximate the genealogy of these ancestral haplotypes at different positions across the ROI. In doing so, the number of external nodes in the coalescent model is reduced from the total sample size to the number of clusters. Finally, we evaluate the position-specific cluster genealogies together with the phenotype distribution of each cluster's descendants, integrating over all positions to establish CIs where risk variants are most likely to occur.
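As a rough illustration of the final step, the sketch below turns position-wise posterior weights into an interval. It is not the authors' procedure: it assumes the weights are already available, that the interval is a single contiguous window, and that the shortest window reaching the target mass is wanted. The function name `shortest_interval` and the example numbers are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def shortest_interval(
    positions: Sequence[int],
    weights: Sequence[float],
    level: float = 0.95,
) -> Tuple[int, int]:
    """Return the shortest contiguous window (start, end) of sorted positions
    whose normalized weight is at least `level`.

    Assumption: `weights` are non-negative, position-wise posterior weights
    for carrying the risk variant; they are normalized here to sum to 1.
    """
    if len(positions) != len(weights) or not positions:
        raise ValueError("positions and weights must be non-empty and aligned")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have positive total mass")
    probs = [w / total for w in weights]

    eps = 1e-12  # tolerance for floating-point round-off at the threshold
    best: Optional[Tuple[int, int, int]] = None
    left = 0
    mass = 0.0
    for right in range(len(probs)):
        mass += probs[right]
        # Drop positions on the left while the window still holds enough mass.
        while left < right and mass - probs[left] >= level - eps:
            mass -= probs[left]
            left += 1
        if mass >= level - eps:
            width = positions[right] - positions[left]
            if best is None or width < best[2]:
                best = (positions[left], positions[right], width)
    if best is None:
        raise ValueError("no window reaches the requested level")
    return best[0], best[1]


# Hypothetical example: five positions with most weight on 300 and 400.
example_positions = [100, 200, 300, 400, 500]
example_weights = [0.01, 0.04, 0.60, 0.30, 0.05]
interval = shortest_interval(example_positions, example_weights, level=0.85)
```

A two-pointer sweep keeps the search linear in the number of positions, which matters when the ROI contains many sites. A highest-posterior-density set that allows disjoint segments would be an equally reasonable choice under different assumptions.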
In simulation studies, our method correctly localizes short segments around true risk positions for both rare (1%) and common (5%) risk variants in datasets with thousands of individuals. In summary, we have developed a novel approach to estimate the genealogy throughout sequenced regions. For fine mapping of complex trait loci, our method is applicable to large-scale case-control studies using sequencing data.