1. Datasets and preliminaries
GLIMPSE is a tool of software for imputation and phasing of low-coverage datasets in the form of genotype gikelihoods (GLs) at all variant positions.
GLIMPSE requires HTSlib and BCFtools v1.7 (or later) as a requirement, since it makes use of indexed VCF/BCF files.
Here we also used BCFtools to compute genotype likelihoods.
In this tutorial, we show how to run GLIMPSE from sequencing reads data (BAM/CRAM file) to obtain refined genotype and haplotype calls.
A small dataset containing chromosome 22 downsampled 1x sequencing reads for one individual (NA12878) is provided with GLIMPSE,
together with a reference genome FASTA file (for chromsome 22) needed for genotype likelihood computations.
A minimal set of data to run this pipeline can be found in the folder GLIMPSE/tutorial
. All datasets used here are in GRCh38/hg38 genome assembly.
1.1. Setting the environment and binaries
All scripts have been written assuming > cd GLIMPSE/tutorial/
as current directory. Therefore, we require to move to the tutorial directory:
> cd GLIMPSE/tutorial/
From this folder we build all the GLIMPSE software needed (chunk, phase, ligate, sample and concordance). For this reason, we require that GLIMPSE can correctly compile on your machine and you might be required to edit the Makefiles manually. See installation instructions if there are any problems at this stage. We created a script to configure a directory containing symbolic links to the GLIMPSE binaries we will need later. To run the setup script, simply run:
> ./step1_script_setup.sh
If the script runs correctly, the software compiles and the bin folder containing symbolic links to the binaries has been created, everything is correctly setup.
1.2. Low coverage reads
In this example we downsampled the publicly available 30x data available for NA12878 provided by the Genome In A Bottle consortium (GIAB). We downloaded the full dataset, kept only chromosome 22 reads and downsampled to 1x. The resulting BAM file can be found in:
GLIMPSE/tutorial/NA12878_1x_bam/NA12878.bam[.bai]
Details of how we downsampled the dataset are provided in appendix section A1.1.
2. Reference panel preparation
In order to run accurate imputation we recommend to use a targeted reference with ancestry related to the target samples, if possible. For this tutorial we use the 1000 Genomes Project 30x NYGC reference panel. Since the reference panel contains data of our target sample and few relatives, we need to remove them from the reference panel. This steps downloads the 1000 genomes b38 data from the EBI ftp site.
2.1. Download of the reference panel files
The 1000 Genomes Project reference panel is publicly available at EBI 1000 genomes ftp site. We can download chromosome 22 data using:
> wget -c http://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000G_2504_high_coverage/working/20201028_3202_phased/CCDG_14151_B01_GRM_WGS_2020-08-05_chr22.filtered.shapeit2-duohmm-phased.vcf.gz{,.tbi}
Update (20/05/2021): users reported that the download of the reference panel can be slow and it is often interrupted. Unfortunately, we do not have control on that. However, using wget with the -c option allows to restart the download from where it was interrupted. Additionally, we recommend to compare the checksum to what it was reported. Instructions for this can be found in the following script:
step2_script_reference_panel.sh
2.2. Remove NA12878 (and family) from the reference panel and perform basic QC
We used BCFtools to remove sample NA12878 and related samples from the reference panel and we export the dataset in BCF file format (for efficiency reasons).
We also performs a basic QC step by keeping only SNPs and remove multiallelic records.
The reason for this is multiple: GLIMPSE only handles bi-allelic SNPs and Bcftools does not perform a good job at calling indels, causing potential problems at neighboring SNPs, therefore we remove them completely from the analysis.
Alternatively, it is possible to keep the indels in the reference panel and force missing-data imputation there.
The original reference panel files are then deleted from the main tutorial folder:
> CHR=22
> bcftools norm -m -any CCDG_14151_B01_GRM_WGS_2020-08-05_chr22.filtered.shapeit2-duohmm-phased.vcf.gz -Ou --threads 4 |
bcftools view -m 2 -M 2 -v snps -s ^NA12878,NA12891,NA12892 --threads 4 -Ob -o reference_panel/1000GP.chr22.noNA12878.bcf
> bcftools index -f reference_panel/1000GP.chr22.noNA12878.bcf --threads 4
> rm CCDG_14151_B01_GRM_WGS_2020-08-05_chr22.filtered.shapeit2-duohmm-phased.vcf.gz*
The resulting reference panel files can be found in:
GLIMPSE/tutorial/reference_panel/1000GP.chr22.noNA12878.bcf[.csi]
The reference panel can now be used for genotype likelihoods calculations (only the sites, not the data) and imputation.
3. Computation of genotype likelihoods
GLIMPSE requires input data to take the form of Genotype Likelihoods (GLs). GLs need to be computed at all target individuals and all variant sites present in the reference panel of haplotypes used for the imputation. In this section, we describe a simple procedure to compute GLs from sequencing data using BCFtools. However, other callers, such as GATK for instance, can be also considered as soon as the GLs are encoded in a VCF/BCF file using the FORMAT/PL field.
Please note that BCFtools does not call correctly genotype likelihoods for indels and this might affect the quality of the imputation even at non-indels. For this reason in this tutorial we only focus on bi-allelic SNPs.
For this tutorial we only follow step 3.1 and step 3.2, because step 3.3 is not necessary since we have only one target sample in our study panel.
The code for sections 3.1 and 3.2 is included in the script step2_script_GL.sh
.
3.1. Extracting variable positions in the reference panel
We a VCF/BCF file containing only sites to tell BCFtools at which positions to make a genotype call.
Since BCFtools does not compute correctly genotype likelihood for indels, here we only focus on SNPs (however, GLIMPSE can impute any type of variants as soon it is bi-allelic and has GLs being properly defined).
To perform the extraction from the chromosome 22 of a reference panel 1000GP.chr22.noNA12878.bcf
, run first BCFtools as follows:
> bcftools view -G -m 2 -M 2 -v snps reference_panel/1000GP.chr22.noNA12878.bcf -Oz -o reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz
> bcftools index -f reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz
Then, convert the output file 1000GP.chr22.noNA12878.sites.vcf.gz
into TSV format and index the file using tabix (requires htslib in PATH), using this command:
> bcftools query -f'%CHROM\t%POS\t%REF,%ALT\n' reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz | bgzip -c > reference_panel/1000GP.chr22.noNA12878.sites.tsv.gz
> tabix -s1 -b2 -e2 reference_panel/1000GP.chr22.noNA12878.sites.tsv.gz
In a general pipeline, we should generate this pair of files (*.vcf.gz + *.tsv.gz) for all chromosomes separately.
3.2. Computing GLs for a single individual at specific positions
Once we have the VCF and TSV files ready, we can then start performing genotype calling for our low-coverage sample.
We use the downsampled 1x BAM file generated in section 1.1 NA12878_1x_bam/NA12878.chr22.1x.bam
.
BCFtools requires the reference genome in order to align the BAM file. We here use the reference genome version used by the 1000 Genomes Project (reference_genome/hs38DH.chr22.fa.gz[.fai,.gzi]
)
We can run the following command:
> BAM=NA12878_1x_bam/NA12878.bam
> VCF=reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz
> TSV=reference_panel/1000GP.chr22.noNA12878.sites.tsv.gz
> REFGEN=reference-genome/hs38DH.chr22.fa.gz
> OUT=NA12878_1x_vcf/NA12878.chr22.1x.vcf.gz
> bcftools mpileup -f ${REFGEN} -I -E -a 'FORMAT/DP' -T ${VCF} -r chr22 ${BAM} -Ou | bcftools call -Aim -C alleles -T ${TSV} -Oz -o ${OUT}
> bcftools index -f ${OUT}
Note here, that we use -T reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz
in the first part of the command line and
-T reference_panel/1000GP.chr22.noNA12878.sites.tsv.gz
in the second part of the command line.
You may also tune the options of BCFtools to your specific needs, requirements and data (for example reducing the threshold for the minimum base/mapping quality to another positive number, in order to have less missing data).
The output of this step is a VCF file format containing genotype likelihoods at each site in the reference panel.
3.3. Merging genotype likelihoods of multiple individuals
In the case of multiple target individuals, an additional step, not required in this pipeline, is to generate a single file containing genotype likelihoods for all target individuals. This can be easily done using BCFtools, for example:
> bcftools merge -m none -r chr22 -Oz -o merged.chr22.1x.vcf.gz -l list.txt
Where list.txt
is a text file containing the full list of VCF/BCF files containing GLs of each target individual in the study, one individual file per line.
The resulting file must be indexed and can be used in the subsequent steps of the pipeline.
4. Split the genome into chunks
One important step prior is to define the chunks where to run imputation and phasing. This step is not trivial because different long regions increase the running time, but small regions can drastically decrease accuracy. Also there are several variables to take into account: the amount of missingness in the defined region and the length in Mb. For these reasons we developed a tool in the GLIMPSE suite (GLIMPSE_chunk) that is able to quickly generate imputation chunks taking into account all this information.
4.1. Chunking a chromosome
We use GLIMPSE_chunk to generate imputation regions for the full chromosome 22, modifying few default parameters in this way:
> bin/GLIMPSE_chunk --input reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz --region chr22 --window-size 2000000 --buffer-size 200000 --output chunks.chr22.txt
5. Impute and phase a whole chromosome
The core of GLIMPSE is the GLIMPSE_phase method. The algorithm works by iteratively refining the genotype likelihoods of the target individuals in the study. The output of the method is a VCF/BCF file containing:
- the best guess genotype in the FORMAT/GT field
- the imputed genotype dosage in the FORMAT/DS field
- the genotype probabilities in the FORMAT/GP field
- the last (max 15) sampled haplotypes in the FORMAT/HS field
5.1. Running GLIMPSE
To run GLIMPSE_phase we only need to run a job for each imputation chunk. Each job runs on 1 thread in this example. Input data are the dataset containing the genotype likelihoods, a reference panel of haplotypes, a fine-scale genetic map (e.g. from the International HapMap project) and buffered and imputation regions. We run GLIMPSE phase using default parameters that can apply to most datasets. For extremely low coverage and small reference panels, increasing the number of iterations might give more accuracy.
> VCF=NA12878_1x_vcf/NA12878.chr22.1x.vcf.gz
> REF=reference_panel/1000GP.chr22.noNA12878.bcf
> MAP=../maps/genetic_maps.b38/chr22.b38.gmap.gz
> while IFS="" read -r LINE || [ -n "$LINE" ];
> do
> printf -v ID "%02d" $(echo $LINE | cut -d" " -f1)
> IRG=$(echo $LINE | cut -d" " -f3)
> ORG=$(echo $LINE | cut -d" " -f4)
> OUT=GLIMPSE_imputed/NA12878.chr22.1x.${ID}.bcf
> bin/GLIMPSE_phase --input ${VCF} --reference ${REF} --map ${MAP} --input-region ${IRG} --output-region ${ORG} --output ${OUT}
> bcftools index -f ${OUT}
> done < chunks.chr22.txt
6. Ligate multiple chunks together
Merging together different chunks of the sample chromsome is a standard procedure and it is usually straightforward in the case of genotype data. However GLIMPSE phase generates phased information in addition of genotype information. In this case, ligation is a required step to merge multiple chunks together without losing phased imputation. For this purpose we developed a software in the GLIMPSE software suite, called GLIMPSE_ligate. The output of this step is a chromosome wide VCF/BCF file containing information coherent phasing information, stored in the FORMAT/HS field
6.1. Ligate whole chromosome chunks
GLIMPSE_ligate only requires a list containing the imputed files that need to be merged together:
> LST=GLIMPSE_ligated/list.chr22.txt
> ls GLIMPSE_imputed/NA12878.chr22.imputed.*.bcf > ${LST}
> OUT=GLIMPSE_ligated/NA12878.chr22.merged.bcf
> bin/GLIMPSE_ligate --input ${LST} --output $OUT
> bcftools index -f ${OUT}
In the case phased information is not needed, this can be the final step, providing chromosome-wide genotype information in the FORMAT/DS and FORMAT/GP field.
7. Sample haplotypes
In the case haplotype-level information is required, GLIMPSE_sample can sample from the FORMAT/HS field accurate, consensus based haplotypes. The output of this step is a VCF/BCF file containing information at the chromosome level, where the GT field contains accurate phased information.
7.1. Sample whole chromosome data
There are two main modes for GLIMPSE_sample can operate.
The first mode can sample haplotypes based on the probabilities given from the FORMAT/HS field (--sample), while the second mode outputs the most likely haplotypes given the values in the FORMAT/HS field (--solve).
We run GLIMPSE_sample using the --solve mode:
> VCF=GLIMPSE_ligated/NA12878.chr22.merged.bcf
> OUT=GLIMPSE_phased/NA12878.chr22.phased.bcf
> bin/GLIMPSE_sample --input ${VCF} --solve --output ${OUT}
> bcftools index -f ${OUT}
From the output dataset, accurate phased information, stored in the GT field can be directly used.
8. Check imputation accuracy
Since we downsampled the reads from the original 30x data, we might be interested now in checking how accurate the imputation is, compared the original 30x dataset. For this purpose we use the GLIMPSE_concordance tool, which can be used to compute the r2 correlation between imputed dosages (in MAF bins) and highly-confident genotype calls from the high-coverage dataset.
8.1. Running GLIMPSE_concordance
The GLIMPSE_concordance requires tool requires a file (--input) indicating:
- the region of interest;
- a file containing allele frequencies at each site;
- the validation dataset called at the same positions as the imputed file (in a similar way as showed in appendix A1, without the downsampling);
- the imputed data.
concordance.lst
having all the correct files in the right order for this tutorial.
Other parameters specify how confident we want a site to be in the validation data and the MAF bins.The GLIMPSE_concordance too can be run as follows:
> ./bin/GLIMPSE_concordance --input concordance.lst --minDP 8 --output GLIMPSE_concordance/output --minPROB 0.9999 \
> --bins 0.00000 0.00100 0.00200 0.00500 0.01000 0.05000 0.10000 0.20000 0.50000ex \
> --af-tag AF_nfe --thread 4
8.2. Visualising the results
The GLIMPSE_concordance folder contains several files describing the quality of the imputation. In particular, we are interested in the file GLIMPSE_concordance/output.rsquare.grp.txt.gz
.
We can visualise the results by going in the plot
folder and running:
> ./concordance_plot.py
The command requires python3 and matplotlib installed.
The plot shows that the 1000 Genomes reference panel can be used to impute accurately variants up to ~1% MAF and there is a drop at rare variants, as expected.
A1. Low coverage reads
A1.1. Downsample 1000 Genome Project NYGC high coverage data
30x data for NA12878 has been made available by the New York Genome Center and the 1000 Genomes project. Please refer to the official documentation to download the data:
https://www.internationalgenome.org/data-portal/data-collection/30x-grch38
Once the CRAM file (and index) have been downloaded, we create our validation dataset for chromosome 22 using samtools and bcftools (assuming the reference panel has already been downloaded and step 3.1. has already been performed):
> samtools view -T reference-genome/hs38DH.chr22.fa.gz -bo GLIMPSE_validation/NA12878.bam NA12878.final.cram chr22
> samtools index GLIMPSE_validation/NA12878.bam
> BAM=GLIMPSE_validation/NA12878.bam
> VCF=reference_panel/1000GP.chr22.noNA12878.sites.vcf.gz
> TSV=reference_panel/1000GP.chr22.noNA12878.sites.tsv.gz
> REFGEN=reference-genome/hs38DH.chr22.fa.gz
> OUT=GLIMPSE_validation/NA12878.chr22.validation.bcf
> bcftools mpileup -f ${REFGEN} -I -E -a 'FORMAT/DP' -T ${VCF} -r chr22 ${BAM} -Ou | bcftools call -Aim -C alleles -T ${TSV} -Ob -o ${OUT}
> bcftools index -f ${OUT}
> rm GLIMPSE_validation/NA12878.bam*
Finally, we can downsample the 30x data to low coverage (1x) using samtools
> samtools view -T reference-genome/hs38DH.chr22.fa.gz -s 1.03045 -bo NA12878_1x_bam/NA12878.bam NA12878.final.cram chr22
> samtools index NA12878_1x_bam/NA12878.bam
> rm NA12878.final.cram*
The output datasets NA12878_1x_bam/NA12878.bam
and GLIMPSE_validation/NA12878.chr22.validation.bcf
are the files provided in the GLIMPSE tutorial folder.