GLIMPSE is a software for imputation and phasing of low-coverage datasets in the form of genotype gikelihoods (GLs) at all variant positions. This tutorial requires GLIMPSE v1.1.0 (or later) and HTSlib and BCFtools v1.10 (or later). Here we also used BCFtools to compute genotype likelihoods.

In this tutorial, we show how to run GLIMPSE to perform chromosome X imputation. Imputation of chromosome X is different from autosomes, as chromosome X contains a non pseudo-autosomal region (non-PAR) where males are haploid and females are diploids. The chromosome also contains two pseudo-autosomal regions (PAR1, PAR2), where all samples are diploid. PAR1 and PAR2 are both at the end of the chromosomes. A schematic representation of the regions of the chromosome is given below:

ChrX

This tutorial describes the process of getting refined genotype and haplotype calls starting from aligned sequencing reads data (BAM/CRAM file) for the non-PAR region of chromosome X. Imputation of PAR regions of the chromosome is not performed here, as it is not conceptually different from imputation of autosomes.

The data and the scripts for this tutorial can be downloaded HERE.
A minimal set of data to run this pipeline can be found in the folder tutorial_chrX. Here we assume that the folder is located as a subfolder of the GLIMPSE installation: tutorial_chrX. All datasets used here are in GRCh38/hg38 genome assembly

A dataset containing chromosome X downsampled 1x sequencing reads (BAM files) for 2 individuals (one male and one female) from ASW population is provided with this tutorial, together with a reference genome FASTA file needed for genotype likelihood computations.

1.1. Setting the environment and binaries

All scripts have been written assuming  >GLIMPSE/tutorial_chrX/ as current directory. Therefore, we move to the tutorial_chrX directory:

 > cd GLIMPSE/tutorial_chrX/

From this folder we build all the GLIMPSE software needed (chunk, phase, ligate, sample and concordance). For this reason, we require that GLIMPSE can correctly compile on your machine and you might be required to edit the Makefiles manually. See installation instructions if there are any problems at this stage. We created a script to configure a directory containing symbolic links to the GLIMPSE binaries we will need later. To run the setup script, simply run:

 > ./step1_script_setup.sh

If the script runs correctly, the software compiles and the bin folder containing symbolic links to the binaries has been created, everything is correctly setup.

1.2. Low coverage reads

In this example we downsampled the publicly available 30x data available for ASW population as part of the 1000 Genomes Project. We downloaded the full dataset, kept only non-PAR chromosome X reads and downsampled to 1x. The resulting BAM file can be found in:

asw_1x_bam/*.bam[.bai], where each bam file in the folder represents a single sample (2 in total).

In order to run accurate imputation we recommend to use a targeted reference with ancestry related to the target samples, if possible. For this tutorial we use the 1000 Genomes Project phase 3 reference panel. Since the reference panel contains data of our target samples, we need to remove them from the reference panel. This steps download the 1000 genomes b38 data from the EBI ftp site and remove the target samples from it.

2.1. Download of the reference panel files

The 1000 Genomes phase 3 reference panel is publicly available at EBI 1000 genomes ftp site. We can download chromosome X data using:

 > wget http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/supporting/GRCh38_positions/ALL.chrX_GRCh38.genotypes.20170504.vcf.gz{,.tbi}

The reference panel size is approximately 950 MB.

2.2. Remove ASW samples from the reference panel and basic QC

We use BCFtools to remove ASW samples from the reference panel and export the dataset in BCF file format (for efficiency reasons). We also performs a basic QC step that includes:

  1. Generate a text file containing space-spearated old and new chromosome names. This is required to rename the numerical chromosome names with 'chr' tag. Apply the new chromosome names with 'bcftools annotate'.
  2. Keep only SNPs and remove multiallelic records.
The original reference panel files are then deleted from the main tutorial_chrX folder:

 > for CHR in {1..23} X ; do
 > echo ${CHR} chr${CHR}
 > done >> chr_names.txt

 > CHR=X
 > bcftools annotate --rename-chrs chr_names.txt \
 >     ALL.chr${CHR}_GRCh38.genotypes.20170504.vcf.gz -Ou | \
 > bcftools view -m 2 -M 2 -v snps -S ^NA19625,NA19703 --threads 4 -Ob -o reference_panel/1000GP.chr${CHR}.filt.bcf
 > bcftools index -f reference_panel/1000GP.chrX.filt.bcf
 > rm ALL.chrX_GRCh38.genotypes.20170504.vcf.gz*
 > rm chr_names.txt

The resulting reference panel files can be found in:

reference_panel/1000GP.chrX.filt.bcf[.csi]

The reference panel can now be used for genotype likelihoods calculations (only the sites, not the data) and imputation.

GLIMPSE requires input data to take the form of Genotype Likelihoods (GLs). GLs need to be computed at all target individuals and all variant sites present in the reference panel of haplotypes used for the imputation. In this section, we describe a simple procedure to compute GLs from sequencing data using BCFtools. However, other callers, such as GATK for instance, can be also considered as soon as the GLs are encoded in a VCF/BCF file using the FORMAT/PL field.

Please note that BCFtools does not call correctly genotype likelihoods for indels and this might affect the quality of the imputation even at non-indels. For this reason in this tutorial we only focus on bi-allelic SNPs.

3.1. Extracting variable positions in the reference panel

We want this information to tell BCFtools at which position to make a call. Since BCFtools does not seem to compute correctly genotype likelihood for indels, here we only focus on SNPs (however, GLIMPSE can impute any type of variants as soon it is bi-allelic and has GLs being properly defined). To perform the extraction from the chromosome X of a reference panel 1000GP.chrX.filt.bcf, run first BCFtools as follows:

 > bcftools view -G -m 2 -M 2 -v snps reference_panel/1000GP.chrX.filt.bcf -Oz -o reference_panel/1000GP.chrX.filt.sites.vcf.gz
 > bcftools index -f reference_panel/1000GP.chrX.filt.sites.vcf.gz

Then, convert the output file 1000GP.chrX.filt.sites.vcf.gz into TSV format and index the file using tabix (requires htslib in PATH), using this command:

 > bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' reference_panel/1000GP.chrX.filt.sites.vcf.gz | bgzip -c > reference_panel/1000GP.chrX.filt.sites.tsv.gz
 > tabix -s1 -b2 -e2 reference_panel/1000GP.chrX.filt.sites.tsv.gz

In a general pipeline, we should generate this pair of files (*.vcf.gz + *.tsv.gz) for all chromosome separately.

3.2. Computing GLs for each individual at specific positions

Once ready the position files, we can then start performing the calling step. We use the downsampled 1x BAM files generated in section 1.1 asw_1x_bam/*.chrX.1x.bam.
BCFtools requires the reference genome in order to align the BAM file. We used the reference genome version used by the 1000 Genomes Project (reference_genome/hs38DH.chrX.fa.gz[.fai,.gzi])
We can run the following command:

 > list.txt
 > FILES=asw_1x_bam/asw_samples/*
 > for file in $FILES; do
 > f=$(basename $file)
 > BAM=asw_1x_bam/${f}.bam
 > VCF=reference_panel/1000GP.chrX.filt.sites.vcf.gz
 > TSV=reference_panel/1000GP.chrX.filt.sites.tsv.gz
 > REFGEN=reference_genome/hs38DH.chrX.fa.gz
 > OUT=asw_1x_vcf/${f}.chrX.1x.vcf.gz
 > bcftools mpileup -f ${REFGEN} -I -E -a 'FORMAT/DP' -T ${VCF} -r chrX ${BAM} -Ou | bcftools call --ploidy GRCh38 -S asw_1x_bam/asw_samples/${f} -Aim -C alleles -T ${TSV} -Oz -o ${OUT}
 > bcftools index -f ${OUT}
 > ls ${OUT} >> list.txt
 > done

Note here, that we use -T reference_panel/1000GP.chrX.filt.sites.vcf.gz in the first part of the command line and
-T reference_panel/1000GP.chrX.filt.sites.tsv.gz in the second part of the command line.

You may also tune the options of BCFtools to your specific needs, requirements and data.

As an output of this step we have a VCF file format containing genotype likelihoods at each site in the reference panel.

3.3. Merging genotype likelihoods of multiple individuals

As we have multiple target individuals, an additional step is to generate a single file containing genotype likelihoods for all target individuals. This can be easily done using BCFtools, for example:

 > bcftools merge -m none -r chrX:2781480-155701382 -Oz -o asw_1x_vcf/merged.chrX.1x.vcf.gz -l list.txt
 > bcftools index -f asw_1x_vcf/merged.chrX.1x.vcf.gz
 > rm list.txt

Where list.txt is a text file containing the full list of VCF/BCF files containing GLs of each target individual in the study, one individual file per line. The resulting file must be indexed and can be used in the subsequent steps of the pipeline.

One important step prior is to define the chunks where to run imputation and phasing. This step is not trivial because different long regions increase the running time, but small regions can drastically decrease accuracy. Also there are several variables to take into account: the amount of missingness in the defined region and the length in Mb. For these reasons we developed a tool in the GLIMPSE suite (GLIMPSE_chunk) that is able to quickly generate imputation chunks taking into account all this information.

4.1. Chunking a chromosome

We use GLIMPSE_chunk to generate imputation regions for the full non-PAR region of chromosome X, modifying few default parameters in this way:

 > ./bin/GLIMPSE_chunk --input reference_panel/1000GP.chrX.filt.sites.vcf.gz --region chrX:2781480-155701382 --output chunks.chrX.txt

This generates a file containing the imputation chunks and larger chunks including buffers that we will use to run GLIMPSE_phase.