Introduction

SHAPEIT4 is a fast and accurate method for estimation of haplotypes (aka phasing) for SNP array and high coverage sequencing data. The version 4 is a refactored and improved version of the SHAPEIT algorithm with multiple key additional features:

If you use the SHAPEIT4 in your research work, please cite the following paper:

Olivier Delaneau, Jean-Francois Zagury, Matthew R Robinson, Jonathan Marchini, Emmanouil Dermitzakis. Integrative haplotype estimation with sub-linear complexity. BioRxiv 2018.

SHAPEIT4 is available under the MIT licence on the github repository https://github.com/odelaneau/shapeit4.

Previous versions of SHAPEIT are still available on original webpages: SHAPEIT2 and SHAPEIT3.

installation

Make sure that the two following libraries are installed on your system:

  1. HTSlib: A great C library for reading/writing high-throughput sequencing data.
  2. BOOST: A free peer-reviewed portable C++ source libraries. SHAPEIT4 uses two specific BOOST libraries: iostreams and program_options.

Make sure that the following standard library flags can be used by g++ on your system:

  • -lz, -lbz2 and -llzma for reading/writing compressed files.
  • -lm for basic math operations.
  • -lpthread for multi-threading.

You can do so by checking the outcome of the following commands:

locate -b '\libz.so'
locate -b '\libbz2.so'
locate -b '\liblzma.so'
locate -b '\libm.so'
locate -b '\libpthread.so'

Then, download the last version of the SHAPEIT4 code using:

git clone https://github.com/odelaneau/shapeit4.git

Navigate to the downloaded folder using cd shapeit4.

You'll find there the following folders and files:

  • bin: folder for the compiled binary.
  • docs: source code of this website.
  • maps: genetic maps for human taken from HapMap.
  • obj: folder with all binary objects.
  • src: folder with all SHAPIT4 source code.
  • test: folder with small example datasets.
  • README.md: front page of the GitHub.
  • LICENSE: copy of the MIT license.
  • makefile to compile the code.

Edit the makefile at lines 5-6 and 9-11 so that the following variables are correctly set up (look at the paths already there for an example):

  • HTSLIB_INC (line 5): path to the HTSlib header files,
  • HTSLIB_LIB (line 6): path to the static HTSlib library (file libhts.a),
  • BOOST_INC (line 9): path to the BOOST header files (often /usr/include),
  • BOOST_LIB_IO (line 10): path to the static BOOST iostreams library (file libboost_iostreams.a),
  • BOOST_LIB_PO (line 11): path to the static BOOST program_options library (file libboost_iostreams.a),

The static libraries (*.a files) can be located with this command:

locate libboost_program_options.a libboost_iostreams.a libhts.a

Once all paths correctly set up, proceed with the SHAPEIT4 compilation using make.

documentation

1. Simple run

To do so with default parameters, use the following command line:

shapeit4 --input unphased.vcf.gz --map chr20.b37.gmap.gz --region 20 --output phased.vcf.gz

All four options are mandatory and their descriptions are:

  • --input specifies the input file containing unphased genotype data.
  • --map specifies the genetic maps. Maps for humans can be found HERE.
  • --region specifies the genomic region to be phased.
  • --output specifies the output VCF/BCF file containing the phased genotype data (i.e. haplotypes).

The file provided to --input needs to be indexed. This can be done using bcftools index unphased.vcf.gz.

2. Phasing a chunk of data

To phase the 1Mb of input data located in the genomic interval 2Mb-3Mb, use:

shapeit4 --input unphased.vcf.gz --map chr20.b37.gmap.gz --region 20:2000000-3000000 --output phased.vcf.gz

Of note, the --region option is mandatory. Double check that the chromosome ID matches one of those specified in the VCF file. A common mistake is to use chr20 for chromosome 20 for example, while it is specified as 20 in the VCF file.

3. BCF files

SHAPEIT4 file automatically detects the format of the input file. To tell it to use BCF as output file, just use the correct filename extension as follows:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf

4. Multi-threading

To phase the example dataset on 4 CPU cores using 4 threads, use the --thread option as follows:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --thread 4

5. Iteration scheme

To specify your own iteration sequence, use the --mcmc-iterations option:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --mcmc-iterations 6b,1p,2b,1p,2b,8m

The three different kinds of iterations are:

  • b: burn-in iteration. Only haplotype sampling and no storage of transition probabilities. To be used for convergence purpose.
  • p: pruning iteration. New haplotype are sampled and transition probabilities are used to trim unlikely paths in the genotype graphs. Using few of them once the algotrithm has converged helps to get better estimates. They have to be used after few burn-in iterations and before main iterations. It is recommended to run them paired with burn-in iterations, i.e. 1b,1p.
  • m: main iteration. New haplotype are sampled and transition probabilities are stored and averaged in memory. To be used as final iteration stage to produce final estimates.

By default, the sequence used by SHAPEIT4 is 5b,1p,1b,1p,1b,1p,5m. It provided a good trade-off between speed and accuracy in our experiments. If running time is not an issue, you may consider increasing the number of iterations using --mcmc-iterations 10b,1p,1b,1p,1b,1p,1b,1p,10m for instance.

6. Tuning the PBWT based selection

Reducing the number of conditioning neighbours in the PBWT can be achieved using the --pbwt-depth option. The default value is 4. Decreasing it results in faster runs at the cost of some accuracy.

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --pbwt-depth 2

Changing variants at which PBWT indexes are stored can be done using --pbwt-modulo. The default value is 8. Decreasing this results in more memory usage.

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --pbwt-modulo 16

You can also change the size of the genomic window into which phasing is carried out using the option --window. The default value is 2e6 base pairs (i.e. 2Mb). Increasing it results in more conditioning haplotypes being used and therefore increased running times. It is preferable to use --pbwt-depth instead of --window to increase the number of conditioning haplotypes in order to increase accuracy. Anyway, this can be done as follows:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --window 5e6

7. Specifying initial estimates

By default, SHAPEIT4 uses a quick PBWT based algorithm to initialize the haplotypes. This can be disabled using --pbwt-disable-init. This is particularly useful if you want to specify your own starting point for the MCMC or if you want to refine some haplotypes you estimated in the past. Initial haplotype estimates are specified in the input VCF/BCF using the option --input and phased genotypes, e.g. 0|1. Example command line:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --pbwt-disable-init

8. Reproducibility

Making reproducible runs can sometimes be useful. To do so, you need to specify the random generator seed using --seed and to use a single thread. Using multi-threading prevents reproducibility.

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --seed 123456

9. Using a reference panel

Taking into account reference haplotypes when phasing can improve the accuracy of the estimates, especially when the number of reference haplotypes is much large than the number of individuals to be phased. To do so, proceed as follows:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --reference reference.bcf --output phased.bcf

SHAPEIT4 only retain variants that are in the overlap (i.e. intersection) between the two panels specified with --input and --reference. In practice, it proceeds exactly the same way than bcftool isec -c none and therefore keeps only variants with chromosome ID, position, REF and ALT alleles that perfectly match between the two panels.

The file specified by --reference needs to be indexed. This can be done using bcftools index reference.bcf.

10. Using haplotype scaffolds

Haplotype scaffolds can be used to incorporate prior information about the phase of some heterozygous genotypes, coming from large reference panels of haplotypes or family information for instance. A scaffold is usually defined at a sparser set of variants than the main panel. Example run:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --scaffold scaffold.bcf --output phased.bcf

Similarly to --reference, only variants that are in the intersection with the main panel are retained. Variants in the scaffold and the main panels are not phased and remain unchnaged. Variants not in the scaffold but in the main panel are phased onto the scaffold. Intersection is done similarly to bcftool isec -c none. Only individuals that appear in the two files are scaffolded (IDs need to match), those that are in the main panel and not in the scaffold are not scaffolded but appear in the output. Those that are in the scaffold and not in the main panel are not used and do not appear in the output.

The file specified by --scaffold needs to be indexed. This can be done using bcftools index scaffold.bcf.

11. Using phase sets

Phase sets are groups of heterozygous genotypes at which phase have been infered using haplotype assembly. To achieve haplotype assembly, we recommend using WhatsHap but other software can also be considered. Phase sets are specified in the VCF/BCF file format using the PS field. Note that the PS needs to be defined in the VCF/BCF as a 32bits integer since we read them using the C function bcf_get_format_int32. If you define them as string, they won't be taken into account. Example run:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --use-PS 0.0001 --output phased.bcf

The argument of the --use-PS defines the expected error rate in the phase sets. You need to change the value depending on your confidence in the haplotype assembly calls. In most of the cases, we recomment using 0.0001.

12. Complex run

SHAPEIT4 allows using all combinations of options described above, such as:

shapeit4 --input unphased.bcf \
          --map chr20.b37.gmap.gz \
          --region 20 \
          --reference reference.bcf \
          --scaffold scaffold.bcf \
          --use-PS 0.0001 \
          --thread 4 \
          --log phased.log\
          --output phased.bcf

This command allows phasing the example dataset using (i) a reference panel, (ii) a scaffold of haplotypes and (iii) the phase sets specified in the input file, all this in a multi-threaded manner.

13. Log file

To record all the verbose that appear on the screen, use the --log option as follows:

shapeit4 --input unphased.bcf --map chr20.b37.gmap.gz --region 20 --output phased.bcf --log phased.log

We strongly recommend to use this option for any run.

14. Option summary

The full list of options can be obtained by running the command:

shapeit4 --help

This should output this list of options:

Long form Short form Argument Description
--help NA NA Brief description for the options.
--seed NA INT Seed for random number generator. Default is 15052011.
--thread -T INT Seed for random number generator. Default is 15052011.
--input -I STRING Input VCF/BCF file to be phased.
--reference -H STRING Reference haplotypes in VCF/BCF.
--scaffold -S STRING Haplotype scaffold in VCF/BCF.
--map -M STRING Genetic map.
--region -R STRING Target region (e.g. chr20:1000000-2000000).
--use-PS NA FLOAT Use phase sets in input files assuming an error rate of FLOAT.
--mcmc-iterations NA STRING Iteration scheme for the MCMC. Default is 5b,1p,1b,1p,1b,1p,5m.
--pbwt-disable-init NA NA Do not initialise haplotypes by PBWT (rephase input haplotype data).
--pbwt-modulo NA INT Storage frequency of PBWT indexes in variant numbers. Default is 8 (i.e. storage every 8 variants).
--pbwt-depth NA INT Depth of PBWT indexes to condition on. Default is 4.
--window NA FLOAT Minimal size of the phasing window. Default is 2Mb (i.e. 2e6).
--output -O STRING Phased haplotypes in VCF/BCF format.
--log NA STRING Log file.

Using short form options, the complex command line described above (section 12) is:

shapeit4 -I unphased.bcf \
          -M chr20.b37.gmap.gz \
          -R 20 \
          -H reference.bcf \
          -S scaffold.bcf \
          --use-PS 0.0001 \
          -T 4 \
          --log phased.log \
          -O phased.bcf

15. Screen output decrypted

DEvelopment History

SHAPEIT has been developed by multiple researchers over the years:

  • Olivier Delaneau (OD), University of Lausanne, Switzerland. OD is the main developer of the code from version 1 to 4 and has supervised the development of version 4.
  • Jonathan Marchini (JM), University of Oxford, UK. JM has supervised the development of versions 2 and 3.
  • Jean-François Zagury (JFZ), Conservatoire des arts et métiers, France. JFZ has supervised the development of version 1.
  • Jared O’Connell (JO), University of Oxford, UK. JO has co-developed version 3 and extended it to process closely related samples.

SHAPEIT has been developed across the following institutions:

In 2011, version 1 has been released. This version introduced the compact graph representation of the sampling space so that Hidden Markov Model computations can be efficiently done [1].

In 2012, version 2 has been released. This version included the approach introduced in Impute2 to select subsets of conditioning haplotypes by Hamming distance minimization and therefore to speed up computations [2].

In 2013, version 2 has been extended to leverage phase information contained in sequencing reads [3].

In 2013, version 2 has been extended to process datasets with closely related individuals [4].

In 2014, version 2 has been extended to handle genotype likelihoods and use haplotype scaffolds in the frame of the 1000 Genomes project [5].

In 2016, version 3 has been released. This version was designed to process biobank scale datasets in the frame of the UK Biobank study [6].

In 2018, version 4 has been released. This version was completely refactored, includes most previous extensions into a unified framework and is open source [7].

[1] O. Delaneau, J. Marchini, JF. Zagury (2012) A linear complexity phasing method for thousands of genomes. Nat Methods. 9(2):179-81.

[2] O. Delaneau, JF. Zagury, J. Marchini (2013) Improved whole chromosome phasing for disease and population genetic studies. Nat Methods. 10(1):5-6.

[3] O. Delaneau, B. Howie, AJ. Cox, JF. Zagury JF, J. Marchini. Haplotype estimation using sequencing reads. Am J Hum Genet. 2013 Oct 3;93(4):687-96.

[4] J. O'Connell, D. Gurdasani, O. Delaneau, N. Pirastu, S. Ulivi, M. Cocca, M. Traglia, J. Huang, JE. Huffman, I. Rudan, R. McQuillan, RM. Fraser, H. Campbell, O. Polasek, G. Asiki, K. Ekoru, C. Hayward, AF. Wright, V. Vitart, P. Navarro, JF. Zagury, JF Wilson, D. Toniolo, P. Gasparini, N. Soranzo, MS. Sandhu, J. Marchini. A general approach for haplotype phasing across the full spectrum of relatedness. PLoS Genet. 2014 Apr 17;10(4):e1004234.

[5] O. Delaneau, J. Marchini; 1000 Genomes Project Consortium. Integrating sequence and array data to create an improved 1000 Genomes Project haplotype reference panel. Nat Commun. 2014 Jun 13;5:3934.

[6] J. O’Connell, K. Sharp, N. Shrine, L. Wain, I. Hall, M. Tobin, JF. Zagury, O. Delaneau, J. Marchini. Haplotype estimation for biobank scale datasets. Nat Genet. 2016 Jul; 48(7): 817–820.

[7] O. Delaneau, JF. Zagury, M. Robinson, J. Marchini, ET. Dermitzakis. Integrative haplotype estimation with sub-linear complexity. BioRxiv 2018.

Contact

Prof. Olivier Delaneau
Department of Computational Biology
University of Lausanne
olivier.delaneau@gmail.com
+41 21 692 40 55
https://odelaneau.github.io/lap-page/