IGM identifies correlation between enhancer domains and gene pathogenicity

February 6, 2020

Understanding and interpreting the functional consequences of genetic variation, including disease-associated genetic variation in the 98% of the human genome that does not code for proteins, remains one of the signature challenges of the human genetics field. In the February issue of the American Journal of Human Genetics, authors Xinchen Wang, PhD, and David Goldstein, PhD, uncovered relationships between the total size of a gene's transcriptional enhancer elements, which are non-coding segments of DNA that control where and when a gene is active, and the gene’s importance in development and disease. This finding has practical implications for the discovery of genes involved in both rare and common diseases, and establishes a framework for identifying non-coding DNA variation with consequences in human disease.

The finding demonstrates that the size and number of enhancers linked to a gene reflect its likelihood of being associated with both rare and common human diseases. Moreover, genes with redundant enhancer domains are depleted of cis-acting genetic variation that disrupts gene expression, and are buffered against the effects of disruptive non-coding mutations. These results demonstrate that dosage-sensitive genes have evolved robustness to the disruptive effects of genetic variation by expanding their regulatory domains. This resolves a puzzle in the genetics literature about why genes associated with human disease are depleted of cis-eQTLs (cis-expression quantitative trait loci, which are instances where genetic variation affects gene expression). It also suggests that this depletion may complicate the use of eQTL information to identify causal genes in complex human diseases, and it establishes a framework for identifying non-coding regulatory variation with phenotypic consequences.

The “enhancer domain score” is strongly predictive of whether genes cause disease when mutated. This implies that important genes have evolved redundancy in their regulation, and it provides an explanation for why it has been difficult to identify regulatory mutations (i.e., those that lie outside protein-coding regions and instead affect regulation of gene expression) with major effects on gene expression. This work also establishes a framework for identifying non-coding DNA variation with consequences in human disease.

This work has implications for the discovery of genes involved in both rare and common diseases. For example, the authors created an "enhancer domain score" (EDS) that is predictive of disease-causing genes and has accuracy comparable to that of commonly used metrics of gene essentiality generated using data from human exome sequencing cohorts. When EDS differs from these population-scale metrics of gene constraint, EDS is often more effective at identifying developmental genes and genes associated with human disease, especially among genes with short coding sequences.

Another implication concerns the prioritization of genes involved in common human diseases. Most studies aimed at identifying inherited genetic variants that contribute to risk for complex human diseases have found that these variants lie within non-coding regions of DNA, rather than inside genes themselves. This has made it difficult to determine which genes, and therefore which biological pathways, are important for various diseases. To address this issue, the genetics community has adopted expression quantitative trait loci (eQTL) datasets, which link changes in gene expression to the genotype of a genetic variant, to infer which genes are affected. Wang and Goldstein’s work shows that the genes most likely to contribute to human disease are actually depleted of these eQTL relationships. This sets a prior expectation that eQTLs in disease-relevant regions are more likely to point to genes not involved in disease, which complicates the use of eQTL data for disease gene identification.