Preparing Target Data

MendelImpute accepts VCF and PLINK files. Please make sure the following are true:

  • VCF files end in .vcf or .vcf.gz (phased or unphased, and may contain missing data)
  • For PLINK files, the full trio of files (.bed, .bim, .fam) is present in the same directory
  • Each file contains only 1 (non-sex) chromosome
  • Every record (SNP) is present in the reference panel. If this is not the case, you must first match the markers in the 2 VCF files (a quick position check is sketched after this list).
  • Given a SNP, its CHROM, POS, REF, and ALT fields are the same in the target data and the reference panel. MendelImpute uses the SNP position internally to align markers. Note this is not explicitly checked.
  • The position of every SNP is unique; thus multiallelic markers should be excluded rather than split (this requirement will eventually be lifted).
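Because the last three requirements are not explicitly checked by MendelImpute, it can be worth verifying them yourself. Below is a minimal sketch (not part of MendelImpute's API; file names are placeholders) that checks every target SNP position also appears in the reference panel, using the VCF reader from GeneticVariation and the openvcf helper from VCFTools, both of which also appear later in this tutorial:

# Sketch: check that every target SNP position occurs in the reference panel.
# Assumes both files contain a single chromosome; file names are placeholders.
using GeneticVariation, VCFTools # openvcf comes from VCFTools

function snp_positions(vcffile)
    reader = VCF.Reader(openvcf(vcffile, "r"))
    pos = [VCF.pos(record) for record in reader]
    close(reader)
    return pos
end

ref_pos = Set(snp_positions("ref.chr22.vcf.gz"))
tgt_pos = snp_positions("target.chr22.typedOnly.vcf.gz")
all(p -> p in ref_pos, tgt_pos) || @warn "some target SNPs are not in the reference panel"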
Note

We recommend against pre-imputation quality control (i.e. do not remove SNPs out of Hardy-Weinberg equilibrium, SNPs with low minor allele frequencies, etc.). Instead, reserve quality control for post-imputation. This is because every new set of typed SNPs requires a compressed reference haplotype panel that matches that set of typed SNPs.

Preparing Reference Haplotype Panel

Reference samples must all be phased and contain no missing genotypes. Reference VCF panels must first be compressed into .jlso format using the compress_haplotypes function. One must specify d, the maximum number of unique haplotypes per window. Larger d slows down computation but increases accuracy. For most purposes, we recommend $d \approx 1000$.
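A minimal call looks like the following sketch (the full walkthrough is Step 2 of the example below; file names here are placeholders):

# compress a phased reference VCF into .jlso format, keeping at most
# d = 1000 unique haplotypes per window (file names are placeholders)
using MendelImpute
compress_haplotypes("ref.chr22.vcf.gz", "target.chr22.vcf.gz", "ref.chr22.jlso", 1000)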

Detailed Example

We use the 1000 Genomes chromosome 22 data as an example. As shown below, this data contains 424147 SNPs and 2504 samples.

# load necessary packages in Julia
using VCFTools
using MendelImpute
using Random

# compute simple summary statistics
data = "chr22.1kg.phase3.v5a.vcf.gz"
@show nrecords(data)
@show nsamples(data);


nrecords(data) = 424147
nsamples(data) = 2504

More summary statistics can be computed using the gtstats function in VCFTools.jl, with example usage here.
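For instance, mirroring the gtstats call used in Step 1 below, one can extract the number of SNPs, the number of samples, and the per-SNP minor allele frequencies (the remaining outputs are discarded here):

# number of SNPs, number of samples, and minor allele frequency per record;
# same call pattern as in Step 1 below
total_snps, samples, _, _, _, maf_by_record, _ = gtstats("chr22.1kg.phase3.v5a.vcf.gz")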

Step 1: generating realistic reference and target data

First we generate a reference panel and an imputation target based on the 1000 Genomes data. More specifically, we take the 1000 Genomes chromosome 22 data and divide it so that

  • 100 samples are randomly selected as imputation targets, where
    • 100k SNPs with minor allele frequency $\ge 0.05$ are randomly selected to be the typed positions.
    • 0.1% of typed SNPs are masked (mimicking GWAS errors)
    • Genotypes are unphased
  • The remaining 2404 samples are used as reference haplotypes.
  • SNPs with duplicate positions are filtered out.
  • All multiallelic markers are filtered out.

Instruction: execute the code below in a Julia session or a Jupyter notebook:

# set random seed for reproducibility
Random.seed!(2020)

# download example data 
data = "chr22.1kg.phase3.v5a.vcf.gz"
if !isfile(data) 
    download("http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr22.1kg.phase3.v5a.vcf.gz", data)
end

# remove SNPs with the same positions, keep all samples, save result into new file
SNPs_to_keep = .!find_duplicate_marker(data) 
@time VCFTools.filter(data, SNPs_to_keep, 1:nsamples(data), des = "chr22.uniqueSNPs.vcf.gz")

# summarize data
total_snps, samples, _, _, _, maf_by_record, _ = gtstats("chr22.uniqueSNPs.vcf.gz")

# generate target file with 100 samples and 100k snps with maf>0.05
n = 100
p = 100000
record_idx = falses(total_snps)
large_maf = findall(x -> x > 0.05, maf_by_record)  
Random.shuffle!(large_maf)
record_idx[large_maf[1:p]] .= true
sample_idx = falses(samples)
sample_idx[1:n] .= true
Random.shuffle!(sample_idx)
@time VCFTools.filter("chr22.uniqueSNPs.vcf.gz", record_idx, sample_idx, 
    des = "target.chr22.typedOnly.vcf.gz", allow_multiallelic=false)

# unphase and mask 0.1% entries in target file
masks = falses(p, n)
missingprop = 0.001
for j in 1:n, i in 1:p
    rand() < missingprop && (masks[i, j] = true)
end
@time mask_gt("target.chr22.typedOnly.vcf.gz", masks, 
    des="target.chr22.typedOnly.masked.vcf.gz", unphase=true)

# generate target panel with all snps (this file contains true phase and genotypes)
@time VCFTools.filter("chr22.uniqueSNPs.vcf.gz", 1:total_snps, 
    sample_idx, des = "target.chr22.full.vcf.gz", allow_multiallelic=false)

# generate reference panel with 2404 samples
@time VCFTools.filter("chr22.uniqueSNPs.vcf.gz", 1:total_snps, .!sample_idx, 
    des = "ref.chr22.excludeTarget.vcf.gz", allow_multiallelic=false)
finding duplicate markers...100%|███████████████████████| Time: 0:03:56
filtering vcf file...100%|██████████████████████████████| Time: 0:04:46


292.131527 seconds (3.20 G allocations: 301.789 GiB, 7.89% gc time)


Progress: 100%|█████████████████████████████████████████| Time: 0:04:02
filtering vcf file...100%|██████████████████████████████| Time: 0:03:59


244.425505 seconds (3.18 G allocations: 301.694 GiB, 9.69% gc time)
  1.935526 seconds (20.00 M allocations: 1.491 GiB, 6.33% gc time)


filtering vcf file...100%|██████████████████████████████| Time: 0:04:10


255.505399 seconds (3.27 G allocations: 317.749 GiB, 9.95% gc time)


filtering vcf file...100%|██████████████████████████████| Time: 0:07:27


453.383147 seconds (6.16 G allocations: 566.535 GiB, 10.16% gc time)

Output explanation:

You just generated reference and target VCF files:

  • ref.chr22.excludeTarget.vcf.gz: Reference haplotype panel with 2404 samples
  • target.chr22.typedOnly.masked.vcf.gz: Imputation target file containing 100 samples at 100k SNPs. All genotypes are unphased and contain 0.1% missing data.

You also generated/downloaded:

  • chr22.1kg.phase3.v5a.vcf.gz: The original chromosome 22 data downloaded from Beagle's website.
  • chr22.uniqueSNPs.vcf.gz: The original chromosome 22 data with duplicate records (SNPs) removed based on marker position. When multiple SNPs share a position, the first is kept and all subsequent ones are removed.
  • target.chr22.full.vcf.gz: The complete data for imputation target, used for checking imputation accuracy. All genotypes are phased and non-missing.
  • target.chr22.typedOnly.vcf.gz: Complete target data at just the typed SNPs. All genotypes are phased and non-missing. This file is only a by-product of generating the other files and is not used downstream.
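As an optional sanity check (not part of the original pipeline), you can confirm the dimensions of the generated files with the same nrecords and nsamples functions used above:

# optional sanity check on the generated files
@show nrecords("target.chr22.typedOnly.masked.vcf.gz") # ≈ 100000 typed SNPs
@show nsamples("target.chr22.typedOnly.masked.vcf.gz") # 100 target samples
@show nrecords("ref.chr22.excludeTarget.vcf.gz")       # all unique, biallelic SNPs
@show nsamples("ref.chr22.excludeTarget.vcf.gz")       # 2404 reference samples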

Step 2: generating .jlso compressed reference panel

MendelImpute requires one to pre-process the reference panel for faster reading. This is achieved via the compress_haplotypes function.

# load necessary packages in Julia
using MendelImpute

max_d = 1000 # maximum number of unique haplotypes per window
reffile = "ref.chr22.excludeTarget.vcf.gz"
tgtfile = "target.chr22.typedOnly.masked.vcf.gz"
outfile = "ref.chr22.maxd1000.excludeTarget.jlso"
@time compress_haplotypes(reffile, tgtfile, outfile, max_d)
importing reference data...100%|████████████████████████| Time: 0:02:00


295.546081 seconds (2.09 G allocations: 209.215 GiB, 10.53% gc time)

Step 3: Run imputation and phasing

Below runs the main phase function. By default, all output genotypes will be phased and non-missing. A list of optional inputs can be found in the API.

# note: run twice for more accurate timing
reffile = "ref.chr22.maxd1000.excludeTarget.jlso" # jlso reference file
tgtfile = "target.chr22.typedOnly.masked.vcf.gz"  # target genotype file
outfile = "mendel.imputed.chr22.vcf.gz"           # output file name
phase(tgtfile, reffile, outfile);
Number of threads = 8
Importing reference haplotype data...
Total windows = 1634, averaging ~ 508 unique haplotypes per window.

Timings: 
    Data import                     = 9.9879 seconds
        import target data             = 1.59052 seconds
        import compressed haplotypes   = 8.39737 seconds
    Computing haplotype pair        = 4.487 seconds
        BLAS3 mul! to get M and N      = 0.203134 seconds per thread
        haplopair search               = 3.47789 seconds per thread
        initializing missing           = 0.0193766 seconds per thread
        allocating and viewing         = 0.0948972 seconds per thread
        index conversion               = 0.00162558 seconds per thread
    Phasing by win-win intersection = 0.661786 seconds
        Window-by-window intersection  = 0.0792241 seconds per thread
        Breakpoint search              = 0.48882 seconds per thread
        Recording result               = 0.0196465 seconds per thread
    Imputation                     = 1.10519 seconds
        Imputing missing               = 0.509987 seconds
        Writing to file                = 0.595203 seconds

    Total time                      = 16.243 seconds

The trailing ; suppresses the printed return value; otherwise the screen would be flooded with output.

Note

To run MendelImpute in parallel, type export JULIA_NUM_THREADS=4 before starting Julia or Jupyter notebooks. See Performance Gotchas #1 on the left for details.
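Once Julia has started, you can verify how many threads are available (this check is not part of the original tutorial):

# number of threads available to Julia; should match JULIA_NUM_THREADS
Threads.nthreads()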

Step 3.5: (only for simulated data) check imputation accuracy

Since we simulated data, we can check imputation accuracy.

X_truth  = convert_gt(Float64, "target.chr22.full.vcf.gz")    # import true genotypes
X_mendel = convert_gt(Float64, "mendel.imputed.chr22.vcf.gz") # import imputed genotypes
n, p = size(X_mendel)
println("error overall = ", sum(X_mendel .!= X_truth) / n / p)
error overall = 0.005273778941849357

Post-imputation: per-SNP Imputation Quality Score

By default, MendelImpute outputs a per-SNP imputation quality score. A score of 0 is best, and larger values are worse. At the $k$th SNP, this score is the least squares error between the observed genotype $x_{ik}$ and the 2 haplotypes $\mathbf{h}_1, \mathbf{h}_2$ selected for sample $i$, averaged over all $n$ samples:

\[e_k = \frac{1}{n}\sum_{i=1}^n \left(x_{ik} - h_{1k} - h_{2k}\right)^2\]

For untyped (i.e. imputed) SNPs, the quality score is the average of the scores of the 2 closest typed SNPs. By default, samples with a missing genotype at SNP $k$ are skipped and $n$ is adjusted accordingly.

To extract this score from a VCF file, one can do:

using GeneticVariation, Plots
reader = VCF.Reader(openvcf(outfile, "r"))
snpscores = Vector{Float64}(undef, nrecords(outfile))

# loop over SNPs, reading the quality score from the first INFO field of each record
for (i, record) in enumerate(reader)
    snpscores[i] = parse(Float64, VCF.info(record)[1].second)
end
close(reader)

# plot histogram of SNP scores
histogram(snpscores, label=:none, xlabel="error", ylabel="SNP counts", bins=30)

[Histogram of per-SNP quality scores: error (x-axis) vs. SNP counts (y-axis)]

Conclusion: Most SNPs are well imputed (least squares error close to $0.0$), but a few have a quality score above $0.01$. We recommend filtering out such SNPs.
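One possible way to do this (a sketch, not prescribed by MendelImpute) is to reuse VCFTools.filter from Step 1 with a boolean index built from the scores; the 0.01 cutoff and output file name are illustrative:

# keep only SNPs whose per-SNP quality score is below 0.01 (illustrative cutoff)
good_snps = snpscores .< 0.01
VCFTools.filter(outfile, good_snps, 1:nsamples(outfile),
    des = "mendel.imputed.chr22.filtered.vcf.gz")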

Note

The popular correlation metric $r^2$ (such as this one) is NOT a good metric for measuring imputation accuracy under MendelImpute's model, because all imputed SNPs will have $r^2 = 1$. For details, please see our paper.

Post-imputation: per-sample Imputation Quality score

MendelImpute also computes a rough quality score (saved to a file ending in sample.error) measuring how well each sample is imputed. This value is the sum of the least squares errors over all typed SNPs, scaled by the total number of observed SNPs. A value of 0 is best; higher values are worse.

using CSV, DataFrames
quality = CSV.read("mendel.imputed.chr22.sample.error", DataFrame) # import quality scores

# visualize the error distribution across samples
histogram(quality.error, label=:none, xlabel="error", ylabel="Number of samples")

[Histogram of per-sample quality scores: error (x-axis) vs. number of samples (y-axis)]

Conclusion: Most samples have small error, but some samples are indeed poorly imputed. From the histogram, we can safely filter out samples with error $> 0.004$, as that would remove poorly imputed individuals without reducing the sample size too much.
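One possible sketch for acting on this recommendation again reuses VCFTools.filter from Step 1; it assumes the row order of the sample.error file matches the sample order of the VCF, and the cutoff and output file name are illustrative:

# keep samples whose per-sample error is at most 0.004 (illustrative cutoff);
# assumes rows of the quality table follow the sample order of the VCF file
keep_samples = quality.error .<= 0.004
VCFTools.filter("mendel.imputed.chr22.vcf.gz",
    1:nrecords("mendel.imputed.chr22.vcf.gz"), keep_samples,
    des = "mendel.imputed.chr22.keepSamples.vcf.gz")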