Preparing Target Data
MendelImpute accepts VCF, PLINK (.bed/.bim/.fam), and BGEN files. Please make sure the following are true:
- Missing data is allowed. Genotypes do not have to be phased.
- VCF files end in `.vcf` or `.vcf.gz`.
- BGEN files end in `.bgen`. If available, index files should have the same prefix (e.g. `genotypes.bgen` should have an index file called `genotypes.bgen.bgi`). Sample identifiers may be either contained in the `.bgen` file or listed in an external `genotypes.sample` file.
- For PLINK files, all trios (`.bed`, `.bim`, `.fam`) are present in the same directory.
- Each file contains only 1 (non-sex) chromosome.
- Every record (SNP) in the imputation target is present in the reference panel. If this is not the case, you must first match the markers in the 2 VCF files (a sketch for checking this requirement appears after this list).
- Given a SNP, its CHROM, POS, REF, and ALT fields are the same in the target data and the reference panel. MendelImpute uses SNP positions internally to align markers.
- The position of every SNP is unique: multiallelic markers should therefore be excluded rather than split (this requirement will eventually be lifted).
Currently only BGEN inputs support index files. Indexing support for VCF files coming soon...
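Before imputing, one can verify the marker-matching requirement by comparing (CHROM, POS) pairs between the two files. Below is a minimal sketch using `openvcf` from VCFTools.jl and the VariantCallFormat.jl reader; the file names `tgt.vcf.gz` and `ref.vcf.gz` are placeholders for your own data.

using VCFTools, VariantCallFormat

# collect the (CHROM, POS) pair of every record in a VCF file
function snp_positions(vcffile::AbstractString)
    reader = VCF.Reader(openvcf(vcffile, "r"))
    positions = Set{Tuple{String,Int}}()
    for record in reader
        push!(positions, (VCF.chrom(record), VCF.pos(record)))
    end
    close(reader)
    return positions
end

# every target SNP should also appear in the reference panel
tgt, ref = snp_positions("tgt.vcf.gz"), snp_positions("ref.vcf.gz")
issubset(tgt, ref) || @warn "$(length(setdiff(tgt, ref))) target SNPs are absent from the reference panel"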
Preparing Reference Haplotype Panel
Reference samples must be in VCF (`.vcf` or `.vcf.gz`) or BGEN (`.bgen`) format; every record must be phased and contain no missing genotypes. Reference panels must first be compressed into `.jlso` format using the `compress_haplotypes` function. One must specify `d`, the maximum number of unique haplotypes per window: a larger `d` slows down computation but increases accuracy. For most purposes, we recommend $d \approx 1000$.
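One can also spot-check the phased/non-missing requirement before compressing. The sketch below scans GT fields with VariantCallFormat.jl accessors (phased genotypes use `|` as the allele separator; missing genotypes contain `.`); `ref.vcf.gz` is again a placeholder name.

using VCFTools, VariantCallFormat

reader = VCF.Reader(openvcf("ref.vcf.gz", "r"))
nsamp = length(VCF.header(reader).sampleID)
for record in reader
    for j in 1:nsamp
        gt = VCF.genotype(record, j, "GT")   # e.g. "0|1"
        occursin("|", gt) || error("unphased genotype at position $(VCF.pos(record))")
        occursin(".", gt) && error("missing genotype at position $(VCF.pos(record))")
    end
end
close(reader)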
Detailed Example
We use the 1000 Genomes chromosome 22 data as an example. As shown below, this data contains 424147 SNPs and 2504 samples.
# load necessary packages in Julia
using VCFTools
using MendelImpute
using Random
# compute simple summary statistics
data = "chr22.1kg.phase3.v5a.vcf.gz"
@show nrecords(data)
@show nsamples(data);
nrecords(data) = 424147
nsamples(data) = 2504
More summary statistics can be computed using the `gtstats` function in VCFTools.jl, with example usage here.
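For instance, a quick sketch following the same call pattern used in Step 1 below (the 6th return value of `gtstats` holds the per-record minor allele frequencies):

# count SNPs, samples, and rare variants
records, samples, _, _, _, maf_by_record, _ = gtstats(data)
@show records, samples
@show count(maf_by_record .< 0.05);  # number of rare SNPs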
Step 1: generating realistic reference and target data
First we generate a reference panel and imputation target based on the 1000 Genomes data. Specifically, we divide the 1000 Genomes chromosome 22 data so that:
- 100 samples are randomly selected as imputation targets, where
    - 100k SNPs with minor allele frequency $\ge 0.05$ are randomly selected to be the typed positions.
    - 0.1% of typed SNPs are masked (mimicking genotyping errors).
    - Genotypes are unphased.
- The remaining 2404 samples are used as reference haplotypes.
- SNPs with duplicate positions are filtered out.
- All multiallelic markers are filtered out.
Instruction: execute the code below in a Julia session or a Jupyter notebook:
# set random seed for reproducibility
Random.seed!(2020)
# download example data
data = "chr22.1kg.phase3.v5a.vcf.gz"
if !isfile(data)
download("http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr22.1kg.phase3.v5a.vcf.gz")
end
# remove SNPs with the same positions, keep all samples, save result into new file
SNPs_to_keep = .!find_duplicate_marker(data)
@time VCFTools.filter(data, SNPs_to_keep, 1:nsamples(data), des = "chr22.uniqueSNPs.vcf.gz")
# summarize data
total_snps, samples, _, _, _, maf_by_record, _ = gtstats("chr22.uniqueSNPs.vcf.gz")
# generate target file with 100 samples and 100k snps with maf>0.05
n = 100
p = 100000
record_idx = falses(total_snps)
large_maf = findall(x -> x > 0.05, maf_by_record)
Random.shuffle!(large_maf)
record_idx[large_maf[1:p]] .= true
sample_idx = falses(samples)
sample_idx[1:n] .= true
Random.shuffle!(sample_idx)
@time VCFTools.filter("chr22.uniqueSNPs.vcf.gz", record_idx, sample_idx,
des = "target.chr22.typedOnly.vcf.gz", allow_multiallelic=false)
# unphase and mask 0.1% entries in target file
masks = falses(p, n)
missingprop = 0.001
for j in 1:n, i in 1:p
rand() < missingprop && (masks[i, j] = true)
end
@time mask_gt("target.chr22.typedOnly.vcf.gz", masks,
des="target.chr22.typedOnly.masked.vcf.gz", unphase=true)
# generate target panel with all snps (this file contains true phase and genotypes)
@time VCFTools.filter("chr22.uniqueSNPs.vcf.gz", 1:total_snps,
sample_idx, des = "target.chr22.full.vcf.gz", allow_multiallelic=false)
# generate reference panel with 2404 samples
@time VCFTools.filter("chr22.uniqueSNPs.vcf.gz", 1:total_snps, .!sample_idx,
des = "ref.chr22.excludeTarget.vcf.gz", allow_multiallelic=false)
finding duplicate markers...100%|███████████████████████| Time: 0:03:56
filtering vcf file...100%|██████████████████████████████| Time: 0:04:46
292.131527 seconds (3.20 G allocations: 301.789 GiB, 7.89% gc time)
Progress: 100%|█████████████████████████████████████████| Time: 0:04:02
filtering vcf file...100%|██████████████████████████████| Time: 0:03:59
244.425505 seconds (3.18 G allocations: 301.694 GiB, 9.69% gc time)
1.935526 seconds (20.00 M allocations: 1.491 GiB, 6.33% gc time)
filtering vcf file...100%|██████████████████████████████| Time: 0:04:10
255.505399 seconds (3.27 G allocations: 317.749 GiB, 9.95% gc time)
filtering vcf file...100%|██████████████████████████████| Time: 0:07:27
453.383147 seconds (6.16 G allocations: 566.535 GiB, 10.16% gc time)
Output explanation:
You just generated reference and target VCF files:
- `ref.chr22.excludeTarget.vcf.gz`: Reference haplotype panel with 2404 samples.
- `target.chr22.typedOnly.masked.vcf.gz`: Imputation target file containing 100 samples at 100k SNPs. All genotypes are unphased and contain 0.1% missing data.
You also generated/downloaded:
- `chr22.1kg.phase3.v5a.vcf.gz`: The original chromosome 22 data downloaded from Beagle's website.
- `chr22.uniqueSNPs.vcf.gz`: The original chromosome 22 data excluding duplicate records (SNPs), detected by checking marker positions. For each duplicated position, the first SNP is kept and all subsequent SNPs are removed.
- `target.chr22.full.vcf.gz`: The complete data for the imputation target, used for checking imputation accuracy. All genotypes are phased and non-missing.
- `target.chr22.typedOnly.vcf.gz`: Complete target data on just the typed SNPs. All genotypes are phased and non-missing. This is only a by-product of generating the other files and is not used downstream.
Step 2: generating `.jlso` compressed reference panel
MendelImpute requires one to pre-process the reference panel for faster reading. This is achieved via the `compress_haplotypes` function.
# load necessary packages in Julia
using MendelImpute
max_d = 1000 # maximum number of unique haplotypes per window
reffile = "ref.chr22.excludeTarget.vcf.gz"
tgtfile = "target.chr22.typedOnly.masked.vcf.gz"
outfile = "ref.chr22.maxd1000.excludeTarget.jlso"
@time compress_haplotypes(reffile, tgtfile, outfile, max_d)
importing reference data...100%|████████████████████████| Time: 0:02:00
295.546081 seconds (2.09 G allocations: 209.215 GiB, 10.53% gc time)
Step 3: Run imputation and phasing
The phase function will perform imputation and phasing. By default, all output genotypes will be phased and non-missing. A list of optional inputs can be found in the API.
# note: run twice for more accurate timing
reffile = "ref.chr22.maxd1000.excludeTarget.jlso" # jlso reference file
tgtfile = "target.chr22.typedOnly.masked.vcf.gz" # target genotype file
outfile = "mendel.imputed.chr22.vcf.gz" # output file name
phase(tgtfile, reffile, outfile);
Number of threads = 1
Importing reference haplotype data...
Computing optimal haplotypes...100%|████████████████████| Time: 0:00:25
Phasing...100%|█████████████████████████████████████████| Time: 0:00:05
Total windows = 1634, averaging ~ 508 unique haplotypes per window.
Timings:
Data import = 12.6331 seconds
import target data = 3.54785 seconds
import compressed haplotypes = 9.08523 seconds
Computing haplotype pair = 25.8112 seconds
BLAS3 mul! to get M and N = 1.14179 seconds per thread
haplopair search = 19.8237 seconds per thread
initializing missing = 0.11383 seconds per thread
allocating and viewing = 0.183436 seconds per thread
index conversion = 0.0146743 seconds per thread
Phasing by win-win intersection = 5.31061 seconds
Window-by-window intersection = 0.56512 seconds per thread
Breakpoint search = 3.4102 seconds per thread
Recording result = 0.182498 seconds per thread
Imputation = 5.03281 seconds
Imputing missing = 1.15402 seconds
Writing to file = 3.87878 seconds
Total time = 48.9418 seconds
The trailing `;` suppresses printing of the returned value; otherwise the screen would be jammed with output.
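Note that the timing breakdown reports several phases per thread, and this run used a single thread (`Number of threads = 1`). `phase` is multithreaded, so starting Julia with more threads can speed up these steps. A minimal check (launching with `--threads` assumes Julia ≥ 1.5; on older versions set the `JULIA_NUM_THREADS` environment variable):

# start Julia with e.g. `julia --threads 4`, then verify before calling phase:
Threads.nthreads()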
Step 3.5: (only for simulated data) check imputation accuracy
Since we simulated data, we can check imputation accuracy.
X_truth = convert_gt(Float64, "target.chr22.full.vcf.gz") # import true genotypes
X_mendel = convert_gt(Float64, "mendel.imputed.chr22.vcf.gz") # import imputed genotypes
n, p = size(X_mendel)
println("error overall = ", sum(X_mendel .!= X_truth) / n / p)
error overall = 0.005273994412137202
Thus, we are looking at about 5 imputation errors per 1000 genotypes.
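To see whether these errors concentrate in a few individuals, one can also compute a per-sample error rate from the matrices already imported (a quick sketch; rows of `X_mendel` are samples):

# fraction of wrongly imputed genotypes in each sample
sample_error = vec(sum(X_mendel .!= X_truth, dims = 2)) ./ p
@show extrema(sample_error);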
Post-imputation: per-SNP Imputation Quality Score
Consider the observed genotype $x_{ij} \in [0, 2] \cup \{missing\}$ at SNP $i$ of sample $j$ and the corresponding imputed genotype $g_{ij}$ derived from the two extended haplotypes of $j$. If $S_i$ denotes the set of individuals with observed genotypes at the SNP, then MendelImpute's quality score $q_i$ for the SNP is defined as
\[q_i = 1 - \frac{1}{|S_i|}\sum_{j \in S_i} \left(\frac{x_{ij} - g_{ij}}{2}\right)^2.\]
Note that $0 \le q_i \le 1$ and that the larger the quality score, the more confidence in the imputed values. Because $q_i$ can only be computed for the typed SNPs, an untyped SNP is assigned the average of the quality scores for its two closest flanking typed SNPs.
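For intuition, the definition translates directly into code. Below is a minimal sketch (not MendelImpute's internal routine) that computes $q_i$ for a single typed SNP from hypothetical vectors of observed and imputed genotypes:

# x = observed genotypes (may contain `missing`), g = imputed genotypes
function snp_quality(x::AbstractVector, g::AbstractVector)
    S = findall(!ismissing, x)   # samples observed at this SNP
    return 1 - sum(abs2, (x[S] .- g[S]) ./ 2) / length(S)
end

snp_quality([0, 1, missing, 2], [0, 1, 1, 1])   # 1 - (1/3)(1/2)^2 ≈ 0.9167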
To extract this score from a VCF file, one can do:
using VCFTools, VariantCallFormat, Plots
reader = VCF.Reader(openvcf(outfile, "r"))
snpscores = Vector{Float64}(undef, nrecords(outfile))
# loop over SNPs
for (i, record) in enumerate(reader)
snpscores[i] = parse(Float64, VCF.info(record)[1].second)
end
close(reader)
# plot histogram of SNP scores
histogram(snpscores, label=:none, xlabel="per-SNP quality score",
ylabel="SNP counts", bins=30)
Conclusion: Most SNPs are well imputed (quality score close to $1.0$), but a few have subpar quality scores (e.g. below $0.999$). We recommend filtering out such SNPs, as sketched below.
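Following this recommendation, one way to drop low-quality SNPs is to reuse `VCFTools.filter` with a mask built from the scores just extracted (a sketch; the output file name is our choice):

# keep only SNPs whose quality score is at least 0.999
keep_snps = snpscores .>= 0.999
VCFTools.filter(outfile, keep_snps, 1:nsamples(outfile),
    des = "mendel.imputed.chr22.filtered.vcf.gz")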
The popular correlation metric $r^2$ (such as this one) is NOT a good metric for measuring imputation accuracy under MendelImpute's model, because all imputed SNPs will have $r^2 = 1$. For details, please see our paper.
Post-imputation: per-sample Imputation Quality Score
MendelImpute also computes a rough quality score (saved in a file ending in `sample.error`) measuring how well each sample is imputed. Analogous to the per-SNP quality score, it equals one minus the least squares error over all typed SNPs, scaled by the number of observed genotypes for that sample. A value of 1.0 is best, and 0 is worst.
using CSV, DataFrames
quality = CSV.read("mendel.imputed.chr22.sample.error", DataFrame) # import quality score
# visualize error distribution
histogram(quality[!, :error], label=:none, xlabel="per-sample quality score",
    ylabel="Number of samples")
Conclusion: Most samples are well imputed (e.g. score close to 1), but some samples are indeed poorly imputed. From the histogram, we can safely filter out samples with score $< 0.999$, as that removes poorly imputed individuals without reducing sample size too much. A sketch of this filtering step follows.
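As with SNPs, poorly imputed samples can be removed with `VCFTools.filter`, this time masking the sample dimension (a sketch; the output file name is our choice):

# keep only samples whose quality score is at least 0.999
keep_samples = quality[!, :error] .>= 0.999
VCFTools.filter("mendel.imputed.chr22.vcf.gz",
    1:nrecords("mendel.imputed.chr22.vcf.gz"), keep_samples,
    des = "mendel.imputed.chr22.filteredSamples.vcf.gz")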