Software and Genomic Data Analysis


The analyst team performs a range of analyses on the sequencing data generated by the Institute of Genomic Medicine (IGM) pipeline. Our analyses include rare-variant collapsing studies as well as diagnostic and individualized reporting. We both optimize current analytic methods and design novel ones, implementing them in user-friendly programs. With a variety of backgrounds, programming language proficiencies, and specialized interests, we have formed an efficient team with a strong collaborative spirit.

In our efforts to conduct genome-wide analysis of human sequence data, our researchers use a number of powerful bioinformatics tools and programs to annotate, visualize, and interpret next-generation sequence data. Links to these tools and to publicly available genetic variant catalogs can be found here:

Software

ATAV (Analysis Tool for Annotated Variants)

ATAV (Analysis Tool for Annotated Variants) is a statistical toolset that is designed to detect complex disease-associated rare genetic variants by performing association analysis on annotated variants derived from whole-genome or whole-exome sequencing data.

Scores

RVIS (Residual Variation Intolerance Score)

RVIS (Residual Variation Intolerance Score) is a gene-based score intended to help in the interpretation of human sequence data. The intolerance score in its current form is based upon allele frequency data as represented in whole-exome sequence data from the gnomAD data set. The score is designed to rank genes in terms of whether they have more or less common functional genetic variation relative to the genome-wide expectation, given the amount of apparently neutral variation the gene has. A gene with a positive score has more common functional variation than expected, and a gene with a negative score has less and is referred to as "intolerant". By convention, we rank all genes in order from most intolerant to least. As an example, a gene such as ATP1A3 has an RVIS score of -1.53 and a percentile of 3.37%, meaning it is among the 3.37% most intolerant of human genes. Depending on what disease area you are studying, you may want to consider either intolerant genes (neurodevelopmental disease) or tolerant genes (some immunological diseases) as better candidates.
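The idea of scoring a gene by its departure from the genome-wide expectation can be illustrated with a minimal Python sketch. This is not the RVIS implementation: it assumes a simple least-squares line relating each gene's count of apparently neutral variants to its count of common functional variants, and it standardizes the raw residuals. The gene names, the counts, and the function names `intolerance_scores` and `intolerance_percentiles` are hypothetical and chosen only for illustration.

```python
from statistics import mean, pstdev

def intolerance_scores(neutral_counts, functional_counts):
    """Fit functional ~ neutral by least squares and return standardized residuals.

    A negative value means fewer common functional variants than the fitted
    line predicts for that gene's amount of neutral variation.
    """
    x_bar = mean(neutral_counts)
    y_bar = mean(functional_counts)
    sxx = sum((x - x_bar) ** 2 for x in neutral_counts)
    sxy = sum((x - x_bar) * (y - y_bar)
              for x, y in zip(neutral_counts, functional_counts))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [y - (intercept + slope * x)
                 for x, y in zip(neutral_counts, functional_counts)]
    spread = pstdev(residuals)  # simplification: plain standard deviation of residuals
    return [r / spread for r in residuals]

def intolerance_percentiles(scores):
    """Percentage of genes whose score is at or below each gene's score.

    Lower percentiles correspond to more intolerant genes.
    """
    n = len(scores)
    return [100.0 * sum(other <= s for other in scores) / n for s in scores]

# Hypothetical toy data: gene -> (neutral variant count, common functional variant count)
toy_genes = {
    "GENE_A": (10, 2),
    "GENE_B": (20, 12),
    "GENE_C": (30, 14),
    "GENE_D": (40, 24),
}
neutral = [counts[0] for counts in toy_genes.values()]
functional = [counts[1] for counts in toy_genes.values()]

scores = intolerance_scores(neutral, functional)
percentiles = intolerance_percentiles(scores)

# List genes from most intolerant (lowest score) to least
for name, score, pct in sorted(zip(toy_genes, scores, percentiles), key=lambda t: t[1]):
    print(f"{name}\tscore={score:.2f}\tpercentile={pct:.1f}%")
```

In this toy data, GENE_C has the most negative residual and so ranks as the most intolerant gene, mirroring how a low percentile is read for a gene such as ATP1A3 above. Published scores should always be taken from the RVIS resource itself rather than recomputed this way.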

subRVIS (sub-region Residual Variation Intolerance Score)

subRVIS is a gene sub-region-based score from the RVIS family intended to help in the interpretation of human sequence data. It provides users with a score denoting the degree of intolerance of the exon or protein domain in which a variant falls.

TraP (Transcript-inferred Pathogenicity score)

The Transcript-inferred Pathogenicity score, or TraP-score, is constructed to evaluate a single nucleotide variant’s ability to cause disease by damaging the final transcript. Possible effects on the final transcript may include changes in splicing machinery recognition of the pre-mRNA sequence, an introduction of a new cryptic splice site, a regulatory change that will affect exon inclusion or an intron retention that could result in a protein loss/gain of function or the mRNA’s nonsense-mediated decay. TraP does not predict the structure of the novel transcript isoforms, but evaluates whether a change will occur to the original transcripts of a gene in question.

LIMBR (Localized Intolerance Model using Bayesian Regression)

Different parts of a gene can be of differential importance to development and health. This regional heterogeneity is also apparent in the distribution of disease-associated mutations, which often cluster in particular regions of disease-associated genes. The ability to precisely estimate functionally important sub-regions of genes will be key in correctly deciphering relationships between genetic variation and disease. Localized Intolerance Model using Bayesian Regression (LIMBR) is a sub-regional (domains or exons) genic intolerance score. We fit a Bayesian hierarchical model explicitly characterizing depletion in functional variation at both the gene and sub-regional level.

EDS (Enhancer domain score)

Non-coding transcriptional regulatory elements are critical for controlling the spatiotemporal expression of genes. We demonstrated that the sizes and number of enhancers linked to a gene reflect its disease pathogenicity. Moreover, genes with redundant enhancer domains are depleted of cis-acting genetic variants that disrupt gene expression and are buffered against the effects of disruptive non-coding mutations. We have shown that the size and redundancy of a gene’s regulatory domains are closely related to the gene’s importance in development and disease. By combining multiple metrics reflecting enhancer domain size and redundancy, we constructed an “enhancer domain score” (EDS) that is predictive of disease-causing genes and has accuracy comparable to commonly used metrics of gene essentiality, including pLI, LOEUF and RVIS, that were generated using human exome sequencing data. When EDS differs from these population-scale metrics of gene constraint, EDS is often more effective at identifying developmental and disease genes, especially among genes with short coding sequences.

SynRVIS (Synonymous RVIS)

Synonymous codon usage has been identified as an important determinant of translational efficiency and mRNA stability in model organisms and human cell lines. However, synonymous variation is largely overlooked as a component of human genetic diversity. We introduced synRVIS (synonymous RVIS), which ranks genes by their constraint against changes in codon optimality. We have shown that synRVIS-intolerant genes are enriched for disease genes and dosage-sensitive genes.

If you experience any IGM Information Technology related issues, please contact us at igm-it@columbia.edu.

Director

Ali G Gharavi, MD
- Jay Meltzer, M.D. Professor of Nephrology and Hypertension (in Medicine)

Analysts

Ayan Malakar
- Bioinformatician
- am5153@cumc.columbia.edu

Joshua Motelow
- Assistant Professor of Pediatrics
- jm4279@cumc.columbia.edu
Josh focuses on using whole-exome and whole-genome sequencing data to probe genetic mechanisms in disease processes including epilepsy, ALS and pediatric critical illness. Josh received his MD and PhD (in neuroscience) from Yale University in 2015, where his work focused on subcortical mechanisms of arousal during complex partial limbic seizures. He then completed his pediatric residency at NewYork-Presbyterian - Columbia University Irving Medical Center in 2018 and is continuing his clinical training as a fellow in pediatric critical care medicine.