Skip to main content Link Search Menu Expand Document (external link)

phase_rare

Table of contents

  1. Description
  2. Usage1: phasing unrelated samples
  3. Usage2: phasing related samples
  4. Usage3: phasing chromosome X data
  5. Command line options
    1. Basic options
    2. Input files
    3. PBWT parameters
    4. HMM parameters
    5. Output files

Description

Tool to phase rare variants onto a scaffold of common variants (output of phase_common + ligate). We recommend to use phase_rare for datasets with a sample size greater than 2,000 samples. For smaller smaple sizes, phase_common could do the job.

Usage1: phasing unrelated samples

Phasing large sequencing datasets happens in multiple steps.

First, let’s phase common variants (MAF>0.1%) using phase_common.

phase_common --input wgs/target.unrelated.bcf --filter-maf 0.001 --region 1 --map info/chr1.gmap.gz --output tmp/target.scaffold.bcf --thread 8

Second, let’s use the resulting haplotypes as a scaffold onto which rare variants are phased:

while read LINE; do
	CHK=$(echo $LINE | awk '{ print $1; }')
	SRG=$(echo $LINE | awk '{ print $3; }')
	IRG=$(echo $LINE | awk '{ print $4; }')
	phase_rare --input wgs/target.unrelated.bcf --scaffold tmp/target.scaffold.bcf --map info/chr1.gmap.gz --input-region $IRG --scaffold-region $SRG --output tmp/target.phased.chunk$CHK\.bcf  --thread 8
done < info/chunks.coordinates.txt

All chunk coordinates are given in the file info/chunks.coordinates.txt. This file should be generated using the chunking tool of GLIMPSE. In this toy example, the scaffold chunks are 3.5Mb and the input chunks are 2.5Mb. Only rare variants within the 2.5Mb regions are phased using a scaffold spanning 3.5Mb (0.5Mb buffer is used on each side for rare variants).

Notes:

  • The region defined by --scaffold-region must be larger than --input-region.
  • The file tmp/target.scaffold.bcf can also be generated by the ligate tool in case phase_common was also run in chunks.

Finally, we can bring all chunks of data together by concatenating files using bcftools concat –naive:

ls -1v tmp/target.phased.chunk*.bcf > tmp/files.txt
bcftools concat -n -Ob -o target.phased.bcf -f tmp/files.txt
bcftools index target.phased.bcf

To take into account duo/trio information while phasing, just give a FAM file specifying the family structures to the –pedigree option. This file contains one line per sample having parent(s) in the dataset and three columns (kidID fatherID and motherID), separated by TABs for spaces. You must give the exact same file to the three programs: phase_common, ligate and phase_rare.

To phase the WGS example dataset with family information, run:

phase_common --input wgs/target.family.bcf --filter-maf 0.001 --pedigree info/target.family.fam --region 1 --map info/chr1.gmap.gz --output tmp/target.scaffold.bcf --thread 8

while read LINE; do
	CHK=$(echo $LINE | awk '{ print $1; }')
	SRG=$(echo $LINE | awk '{ print $3; }')
	IRG=$(echo $LINE | awk '{ print $4; }')
	phase_rare --input wgs/target.family.bcf --scaffold tmp/target.scaffold.bcf --pedigree info/target.family.fam --map info/chr1.gmap.gz --input-region $IRG --scaffold-region $SRG --output $OUT  --thread 8
done < info/chunks.coordinates.txt

for CHK in $(seq 0 3); do echo tmp/target.phased.chunk$CHK\.bcf >> tmp/chunks.files.txt; done

bcftools concat -n -Ob -o target.phased.bcf -f tmp/chunks.files.txt
bcftools index target.phased.bcf

Usage3: phasing chromosome X data

To phase chromosome X data assuming haploidy for males, run;

phase_common --input wgs/target.haploid.bcf --filter-maf 0.001 --haploids info/target.haploid.txt --region 1 --map info/chr1.gmap.gz --output tmp/target.scaffold.bcf --thread 8

while read LINE; do
	CHK=$(echo $LINE | awk '{ print $1; }')
	SRG=$(echo $LINE | awk '{ print $3; }')
	IRG=$(echo $LINE | awk '{ print $4; }')
	phase_rare --input wgs/target.haploid.bcf --scaffold tmp/target.scaffold.bcf --haploids info/target.haploid.txt --map info/chr1.gmap.gz --input-region $IRG --scaffold-region $SRG --output $OUT  --thread 8
done < info/chunks.coordinates.txt

for CHK in $(seq 0 3); do echo tmp/target.phased.chunk$CHK\.bcf >> tmp/chunks.files.txt; done

bcftools concat -n -Ob -o target.phased.bcf -f tmp/chunks.files.txt
bcftools index target.phased.bcf

The file info/target.haploid.txt contains the list of all the haploid samples (i.e. males).


Command line options

Basic options

Option name Argument Default Description
--help NA NA Produces help message
--seed INT 15052011 Seed of the random number generator
-T [ --thread ] INT 1 Number of thread used

Input files

Option name Argument Default Description
--input STRING NA Genotypes to be phased in plain VCF/BCF format
--input-region STRING NA Region to be considered in --input-plain
--scaffold STRING NA Scaffold of haplotypes in VCF/BCF format
--scaffold-region STRING NA Region to be considered in --scaffold
--map STRING NA Genetic map
--pedigree STRING NA Pedigree information (offspring father mother)
--haploids STRING NA List of samples that are haploid (e.g. males on chrX)

PBWT parameters

Option name Argument Default Description
--pbwt-modulo FLOAT 0.1 Storage frequency of PBWT indexes in cM
--pbwt-depth-common INT 2 Depth of PBWT indexes at common sites to condition on
--pbwt-depth-rare INT 2 Depth of PBWT indexes at rare sites to condition on
--pbwt-mac INT 2 Minimal Minor Allele Count at which PBWT is evaluated
--pbwt-mdr FLOAT 0.1 Maximal Missing Data Rate at which PBWT is evaluated

HMM parameters

Option name Argument Default Description
--effective-size INT 15000 Effective size of the population

Output files

Option name Argument Default Description
-O [--output ] STRING NA Phased haplotypes in VCF/BCF format
--output-buffer STRING NA If specified, right and left buffers are printed in output
--log STRING NA Log file

Back to top

Copyright © 2022-2023 Olivier Delaneau | All Rights Reserved | SHAPEIT5 executables and source code are distributed under the MIT license.