Skip to main content Link Search Menu Expand Document (external link)

switch

Table of contents

  1. Description
  2. Usage1: Benchmark haplotypes using simulated data
  3. Usage2: Benchmark haplotypes using family data
  4. Usage3: Benchmark haplotypes using haploid chromosome X data
  5. Command line options
    1. Basic options
    2. Input files
    3. Output files

Description

Program to compute switch error rate (SER) and genotyping error rate (GER) given validation data. Validation data, in which haplotypes are known, can be obtained in multiple ways:

  • Simulated data, using msprime for instance.
  • Trio/duo data. Estimate haplotypes for offspring excluding the parents from the dataset. Use the parent and the offsprings in the validation set.
  • Chromosome X data. Pair male chromosomes X to make females in which phase in known.

Usage1: Benchmark haplotypes using simulated data

First, run a phasing run:

phase_common --input array/target.unrelated.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

The data in array/target.unrelated.bcf is phased simulated data, so we can also use it as validation set. Open the BCF file to check:

bcftools view -H array/target.unrelated.bcf | cut -f1-20 | head

Then, run thw switch program to compare the true haplotypes to those estimated by phase_common:

switch --validation array/target.unrelated.bcf --estimation target.phased.bcf --region 1 --output target.phased --thread 8

This command will produce a bunch of files:

  • target.phased.block.switch.txt.gzhas 2 columns (sample ID, position). This gives the coordinates of blocks of data coorectly phased. Can be used to produced a visual representation of the phasing . See supplemetary figure 1 of SHAPEIT4 paper.
  • target.phased.frequency.switch.txt.gz has 4 columns (MAC, #errors, #hets, SER). This gives SER stratified by MAC.
  • target.phased.sample.switch.txt.gz has 4 columns (sample ID, #errors, #hets, SER). This gives the SER per sample. This is the most important information given by the switch program.
  • target.phased.sample.typing.txt.gz has 4 columns (sample ID, #errors, #genotypes, GER). This gives GER per sample.
  • target.phased.variant.switch.txt.gz has 5 columns (rsid, position, #errors, #hets, SER). This gives the SER per variant relative to previous hets.
  • target.phased.variant.typing.txt.gz has 5 columns (rsid, position, #errors, #genotypes, GER). This gives the GER per variant.

To compute the SER across the entire dataset, we recommend to sum the numbers of errors and hets across all samples first:

zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'

Usage2: Benchmark haplotypes using family data

First, run build a benchmark dataset by removing parental genomes:

cat info/target.family.fam | cut -f2- | tr "\t" "\n" > parents.txt
bcftools view -Ob -o benchmark.data.bcf -S ^parents.txt array/target.family.bcf
bcftools index benchmark.data.bcf

Second, phase the benchmark dataset:

phase_common --input benchmark.data.bcf --region 1 --map info/chr1.gmap.gz --output target.phased.bcf --thread 8

Third, validate it using family data (original BCF + FAM file):

switch --validation array/target.family.bcf --estimation target.phased.bcf --pedigree info/target.family.fam --region 1 --output target.phased --thread 8

Fourth, compute SER:

zcat target.phased.sample.switch.txt.gz | awk 'BEGIN { e=0; t=0; } { e+=$2; t+=$3; } END { print "SER =", e*100/t; }'

Usage3: Benchmark haplotypes using haploid chromosome X data

To come!


Command line options

Basic options

Option name Argument Default Description
--help NA NA Produces help message
-T [ --thread ] INT 1 Number of thread used

Input files

Option name Argument Default Description
-V [--validation ] STRING NA Validation dataset in VCF/BCF format
-E [--estimation ] STRING NA Phased dataset in VCF/BCF format
-F [--frequency ] STRING NA Variant frequency in VCF/BCF format, to exaclude variants and/or stratify SER by MAC
-P [--pedigree ] STRING NA Pedigree information (offspring father mother)
-R [--region ] STRING NA Target region
--nbins INT 20 Number of bins used for calibration (for PP field)
--min-pp FLOAT 0 Minimal PP value for entering computations
--singleton STRING NA Singleton phase
--dupid STRING NA Duplicate ID for UKB matching IDs

Output files

Option name Argument Default Description
-O [--output ] STRING NA Phased haplotypes in VCF/BCF format
--log STRING NA Log file

Back to top

Copyright © 2022-2023 Olivier Delaneau | All Rights Reserved | SHAPEIT5 executables and source code are distributed under the MIT license.