Below are important concepts, software, and standards you might encounter in these Docs and when working with SHAPEIT5.
- Genotype imputation
- Haplotype phasing
- Haplotype scaffold
- Low-coverage whole genome sequencing
- Reference panel
- SNP array
- Switch error rate
- UK Biobank Research Analysis Platform
- Whole exome sequencing (WES)
- Whole genome sequencing (WGS)
GLIMPSE (Genotype Likelihoods IMputation and PhaSing mEthod) is a method for haplotype phasing and genotype imputation of low-coverage WGS data, developed by Simone Rubinacci and Olivier Delaneau. The latest version of the tool is GLIMPSE2, which offers the best accuracy and computational performance at rare variants when used with large reference panels.
Genotype imputation is the probabilistic inference of unobserved genotypes, typically applied to SNP array data or low-coverage WGS. Imputation uses a large reference panel of phased haplotypes to identify long shared identity-by-descent segments, which then serve as the basis for the statistical inference. Two widely used tools for genotype imputation are IMPUTE5 for SNP array data and GLIMPSE for low-coverage WGS data.
Haplotype phasing is the process of separating genotype data into the two parentally inherited copies of each chromosome, known as haplotypes. SHAPEIT5 is a tool for statistical, population-based haplotype phasing that focuses specifically on extremely rare variants.
A haplotype scaffold is a set of highly confident haplotypes, typically on a subset of the data. SHAPEIT5 uses the haplotypes derived at common variants as haplotype scaffolds onto which heterozygous genotypes are phased one rare variant at a time.
IMPUTE is a method for genotype imputation of SNP array data, developed by Jonathan Marchini, Bryan Howie and Simone Rubinacci. The latest version of the tool is IMPUTE5, which offers efficient computational performance for large reference panels.
Low-coverage whole genome sequencing is whole genome sequencing performed at low depth, usually with a coverage between 0.1x and 8x. In contrast to high-coverage WGS, genotype calling in this case is uncertain, and the data very often needs to be imputed with specialised methods (e.g. GLIMPSE) before it can be used and processed.
A reference panel is a large set of deeply sequenced haplotypes used for genotype imputation and haplotype phasing, typically of cohorts genotyped using SNP arrays or low-coverage WGS.
SHAPEIT (Segmented HAPlotype Estimation and Imputation Tools) is a commonly used method for haplotype phasing, mainly developed by Olivier Delaneau. The latest version of the tool is SHAPEIT5, which offers the best accuracy at rare variants in large cohorts.
A SNP array is a DNA microarray used to detect common SNPs within a population. In the context of GWAS, SNP array data is often imputed using a reference panel of haplotypes (e.g. using IMPUTE5).
A singleton is a rare variant whose minor allele is carried by a single chromosome in the dataset (minor allele count of 1). Because the allele is unique in the dataset, no other sample provides information to phase it, so statistical population-based phasing has traditionally been no better than random at these sites (a switch error rate of 50%). The typical way to phase these variants is to use family information. However, SHAPEIT5 is able to provide, for the first time, non-random phasing at these sites without the need for family information, by using a simple coalescent-inspired model.
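The minor allele count criterion can be illustrated with a short Python sketch. It assumes a toy biallelic input in which each site is a list of 0/1 alleles, one per chromosome; the function names `minor_allele_count` and `singleton_sites` are hypothetical and are not part of SHAPEIT5.

```python
from typing import List, Sequence


def minor_allele_count(alleles: Sequence[int]) -> int:
    """Count of the less frequent allele at one biallelic site (alleles coded 0/1)."""
    alt_count = sum(alleles)
    return min(alt_count, len(alleles) - alt_count)


def singleton_sites(sites: Sequence[Sequence[int]]) -> List[int]:
    """Indices of sites whose minor allele is carried by exactly one chromosome."""
    return [i for i, alleles in enumerate(sites) if minor_allele_count(alleles) == 1]


# Toy data (illustrative only): 4 sites observed on 4 chromosomes.
toy_sites = [
    [0, 0, 0, 1],  # singleton: one chromosome carries the alternative allele
    [0, 1, 1, 0],  # minor allele count of 2
    [1, 1, 1, 0],  # singleton: one chromosome carries the reference allele
    [0, 0, 0, 0],  # monomorphic
]
singletons = singleton_sites(toy_sites)
```

In this toy example, the first and third sites are the ones that population-based phasing cannot resolve from allele sharing alone.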
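Switch error rate (SER) is a popular metric for measuring the quality of phased data. SER can be computed when the true maternal and paternal haplotypes are known (for example from family data) and is defined as the number of switch errors divided by the number of opportunities for switch errors. A switch error occurs when the phase of a heterozygous site relative to the previous heterozygous site differs from the truth.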
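The definition above can be sketched in a few lines of Python for a single sample. This is a minimal illustration, not the implementation used by SHAPEIT5: it assumes genotypes are given as pairs of 0/1 alleles in site order, that both inputs cover the same sites, and that the number of opportunities is the number of heterozygous sites minus one. The function name `switch_error_rate` is hypothetical.

```python
from typing import List, Sequence, Tuple

PhasedGenotype = Tuple[int, int]  # (allele on haplotype 1, allele on haplotype 2)


def switch_error_rate(
    truth: Sequence[PhasedGenotype], estimate: Sequence[PhasedGenotype]
) -> float:
    """Switch errors divided by opportunities, for one sample on one chromosome."""
    if len(truth) != len(estimate):
        raise ValueError("truth and estimate must cover the same sites")

    # For each heterozygous site, record whether the estimated phase has the
    # same orientation as the truth. Homozygous sites carry no phase information.
    same_orientation: List[bool] = []
    for true_gt, est_gt in zip(truth, estimate):
        if true_gt[0] == true_gt[1]:
            continue
        if sorted(true_gt) != sorted(est_gt):
            raise ValueError("unphased genotypes differ between truth and estimate")
        same_orientation.append(tuple(true_gt) == tuple(est_gt))

    # Each pair of consecutive heterozygous sites is one opportunity for a switch.
    opportunities = len(same_orientation) - 1
    if opportunities < 1:
        return float("nan")

    # A switch error is a change of orientation between consecutive het sites.
    switches = sum(a != b for a, b in zip(same_orientation, same_orientation[1:]))
    return switches / opportunities


# Toy example (illustrative only): 4 heterozygous sites, 1 homozygous site.
truth = [(0, 1), (0, 1), (1, 0), (0, 0), (0, 1)]
estimate = [(0, 1), (1, 0), (0, 1), (0, 0), (0, 1)]
ser = switch_error_rate(truth, estimate)
```

Here the estimate flips orientation after the first heterozygous site and flips back before the last one, giving 2 switch errors over 3 opportunities.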
The UK Biobank Research Analysis Platform (RAP), enabled by DNAnexus and Amazon Web Services (AWS), is a cloud-based platform that enables researchers to work with the UK Biobank WGS and WES data.
Whole exome sequencing (WES) is a technique for sequencing all of the protein-coding regions of genes in a genome. In the UK Biobank dataset, we combined WES with SNP array data and performed phasing with SHAPEIT5.
Whole genome sequencing (WGS) is the process of determining the entire DNA sequence of an organism's genome at a single time. In these Docs, WGS refers to high-coverage data, typically 30x or more, for which the quality of genotype calling is high. SHAPEIT5 is designed for large WGS cohorts (e.g. the UK Biobank), to separate the two inherited copies of each chromosome into haplotypes (see haplotype phasing).