API

Documentation for MendelImpute.jl's functions.

Functions

MendelImpute.phase - Function
phase(tgtfile::String, reffile::String, outfile::String; [impute::Bool],
    [phase::Bool], [dosage::Bool], [rescreen::Bool], [max_haplotypes::Int], 
    [stepwise::Int], [thinning_factor::Int], [scale_allelefreq::Bool], 
    [dynamic_programming::Bool])

Main function of the MendelImpute program. Phases (haplotypes) tgtfile against a pool of reference haplotypes in reffile using sliding windows, and saves the result in outfile. All SNPs in tgtfile must be present in reffile. A per-sample imputation score (lower is better) is saved in a file ending in sample.error.

Input

  • tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude .bim/.bed/.fam suffixes but the trio must all be present in the same directory.
  • reffile: Reference haplotype file ending in .vcf, .vcf.gz, or .jlso (compressed binary files).
  • outfile: output filename ending in .vcf.gz, .vcf, or .jlso. VCF output genotypes will have no missing data. If ending in .jlso, will output ultra-compressed data structure recording HaplotypeMosaicPairs for each sample.

Optional Inputs

  • impute: If true, imputes every SNP in reffile to tgtfile. Otherwise, only missing SNPs in tgtfile will be imputed.
  • phase: If true, all output genotypes will be phased, but observed data (minor allele count) may be changed. If false, all output genotypes will be unphased, but the observed minor allele count will not change.
  • dosage: If true, assumes the target matrix contains dosages for imputation. Note this means the genotype matrix will be entirely single precision.
  • rescreen: If true, saves a number of top haplotype pairs when solving the least squares objective, then re-minimizes least squares on just the observed data. This option is more computationally intensive but gives more accurate results.
  • max_haplotypes: Maximum number of haplotypes for using global search. Windows exceeding this number of unique haplotypes will be searched using a heuristic. A non-zero stepwise or thinning_factor needs to be specified.
  • stepwise: If an integer is specified, solves the least squares objective by first finding stepwise top haplotypes using a stepwise heuristic, then finding the next haplotype using global search. Uses max_haplotypes.
  • thinning_factor: If an integer is specified, solves the least squares objective on only thinning_factor unique haplotypes. Uses max_haplotypes.
  • scale_allelefreq: Boolean indicating whether to give rare SNPs more weight, scaled by wᵢ = 1 / √(2p(1-p)), where the maximum weight is 2.
  • dynamic_programming: Boolean indicating whether to phase with a global search that finds the longest haplotype stretch over all windows. (Currently broken, sorry!)
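
The scale_allelefreq weight can be illustrated with a minimal sketch. It assumes p is the allele frequency of SNP i and that the maximum of 2 is applied as a simple cap; the function name snp_weight is hypothetical and is not part of MendelImpute.

```julia
# Hypothetical helper, not MendelImpute's code.
# Assumption: p is the allele frequency of a SNP and the cap of 2 is applied with min.
snp_weight(p::Real) = min(2.0, 1 / sqrt(2 * p * (1 - p)))

# Common SNPs get weights below the cap; rare SNPs are capped at 2.
for p in (0.5, 0.25, 0.1, 0.01)
    println("p = ", p, "  weight = ", round(snp_weight(p), digits = 3))
end
```

Under these assumptions the cap is reached whenever 2p(1-p) ≤ 1/4, that is, for allele frequencies below roughly 0.146 (or above roughly 0.854).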
source
MendelImpute.compress_haplotypes - Function
compress_haplotypes(reffile::String, tgtfile::String, outfile::String, 
    [d::Int], [minwidth::Int], [overlap::Float64])

Cuts a haplotype matrix reffile into windows of variable width so that each window has fewer than d unique haplotypes. Saves the result to outfile in a compressed binary format. All SNPs in tgtfile must be present in reffile.

Why is tgtfile required?

The unique haplotypes in each window are computed on the typed SNPs only. A genotype matrix tgtfile is used to identify the typed SNPs. In the future, hopefully we can pre-compute compressed haplotype panels for all genotyping platforms and provide them as downloadable files. But currently, users must run this function themselves.

Inputs

  • reffile: reference haplotype file name (ends in .vcf or .vcf.gz)
  • tgtfile: target genotype file name (ends in .vcf or .vcf.gz)
  • outfile: Output file name (ends in .jlso)

Optional Inputs

  • d: Max number of unique haplotypes per genotype window (default d = 1000).
  • minwidth: Minimum number of typed SNPs per window (default 0)
  • overlap: Amount of overlap between adjacent genotype windows, as a percentage of each window's width (default 0.0)
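
To make the roles of d and minwidth concrete, here is a small sketch of one way to cut a haplotype matrix into variable-width windows. It uses a simple greedy rule chosen for illustration and is not necessarily the strategy compress_haplotypes uses; the names count_unique and greedy_windows are hypothetical, rows are assumed to be typed SNPs, and columns are assumed to be haplotypes. Overlap between windows is ignored.

```julia
# Number of distinct haplotypes (distinct columns) when restricted to the given SNP rows.
count_unique(H, rows) = length(unique(collect(eachcol(view(H, rows, :)))))

# Greedy sketch (an assumption, not MendelImpute's algorithm): grow each window
# one SNP at a time and close it just before it would reach d unique haplotypes.
function greedy_windows(H::AbstractMatrix{Bool}, d::Int; minwidth::Int = 1)
    nsnps = size(H, 1)
    windows = UnitRange{Int}[]
    start = 1
    while start <= nsnps
        # every window takes at least max(minwidth, 1) SNPs, even if that exceeds d
        stop = min(start + max(minwidth, 1) - 1, nsnps)
        # extend while the widened window still has fewer than d unique haplotypes
        while stop < nsnps && count_unique(H, start:stop+1) < d
            stop += 1
        end
        push!(windows, start:stop)
        start = stop + 1
    end
    return windows
end

H = Bool[0 0 1 1; 0 1 0 1; 0 0 0 0; 1 1 1 1]   # 4 typed SNPs x 4 haplotypes
windows = greedy_windows(H, 3)                  # each window has fewer than 3 unique haplotypes
```

Note that in this sketch a window forced to minwidth SNPs may contain d or more unique haplotypes; how the package resolves that conflict is not stated here.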
source
MendelImpute.paint - Function
paint(sample_phase::HaplotypeMosaicPair, panelID::Vector{String},
    refID_to_population::Dict{String, String}, [populations::Vector{String}])

Converts a person's phased haplotype lengths into segments of percentages. This makes it easier to plot a "painted chromosome".

Inputs

  • sample_phase: A HaplotypeMosaicPair storing phase information for a sample, including haplotype start positions and haplotype labels.
  • panelID: Sample IDs in the reference haplotype panel
  • refID_to_population: A dictionary mapping each ID in the haplotype reference panel to its population origin.

Optional inputs

  • populations: A unique list of populations present in refID_to_population

Output

  • composition: A list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]
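
As a rough illustration of turning haplotype lengths into segments of percentages, the sketch below converts segment start positions on one strand into fractions of the chromosome. The plain vector of start positions is a hypothetical stand-in for the start positions stored in a HaplotypeMosaicPair, and segment_fractions is not a MendelImpute function.

```julia
# Hypothetical simplified input: start[i] is the first SNP index of segment i
# on one strand, and nsnps is the total number of SNPs on that strand.
function segment_fractions(start::Vector{Int}, nsnps::Int)
    stops = [start[2:end] .- 1; nsnps]        # each segment ends where the next one begins
    return (stops .- start .+ 1) ./ nsnps     # fraction of the strand covered by each segment
end

fractions = segment_fractions([1, 41, 91], 100)   # three segments of 40, 50, and 10 SNPs
```

Each fraction can then be drawn as one colored block of a painted chromosome, with the color chosen from the population of the segment's reference haplotype.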
source
MendelImpute.composition - Function
composition(sample_phase::HaplotypeMosaicPair, panelID::Vector{String}, 
    refID_to_population::Dict{String, String}, [populations::Vector{String}])

Computes a sample's chromosome composition based on phase information. This makes it easier to plot a person's admixed proportions.

Inputs

  • sample_phase: A HaplotypeMosaicPair storing phase information for a sample, including haplotype start positions and haplotype labels.
  • panelID: Sample IDs in the reference haplotype panel
  • refID_to_population: A dictionary mapping each ID in the haplotype reference panel to its population origin.

Optional inputs

  • populations: A unique list of populations present in refID_to_population

Output

  • composition: A list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]
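
The relationship between composition[i] and populations[i] can be sketched as a length-weighted tally. This is an illustration only: it assumes each segment is already described by its length and the reference sample ID it was copied from (a hypothetical flattening of a HaplotypeMosaicPair), and ancestry_percentages, the sample IDs, and the population labels are made up for the example.

```julia
# Hypothetical sketch: seg_len[i] is the length of segment i and seg_ref[i] is the
# reference sample ID that segment was copied from.
function ancestry_percentages(seg_len::Vector{Int}, seg_ref::Vector{String},
        refID_to_population::Dict{String,String}, populations::Vector{String})
    composition = zeros(length(populations))
    for (len, id) in zip(seg_len, seg_ref)
        pop = refID_to_population[id]
        # add this segment's length to the slot of its population
        composition[findfirst(==(pop), populations)] += len
    end
    return 100 .* composition ./ sum(seg_len)   # convert lengths to percentages
end

ref_pop = Dict("ref1" => "popA", "ref2" => "popB")   # made-up IDs and labels
pct = ancestry_percentages([40, 50, 10], ["ref1", "ref2", "ref1"], ref_pop, ["popA", "popB"])
```

Here pct[1] is the percentage attributed to "popA" and pct[2] the percentage attributed to "popB", matching the order of the populations vector.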
source
MendelImpute.unique_populations - Function
unique_populations(x::Dict{String, String})

Computes the unique list of populations, preserving order. x is a Dict where each sample is a key and populations are values.
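
A minimal sketch of the same idea using only Julia's Base functions is shown below. It is not the package's implementation; unique_pops, the sample IDs, and the population labels are hypothetical. Base's unique keeps the first occurrence of each value in iteration order, and the iteration order of a Dict is not the insertion order.

```julia
# Hypothetical stand-in: collect each population once, in the order the Dict yields its values.
unique_pops(x::Dict{String,String}) = unique(values(x))

x = Dict("sample1" => "popA", "sample2" => "popB", "sample3" => "popA")   # made-up data
pops = unique_pops(x)   # a Vector{String} containing "popA" and "popB" once each
```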

source
MendelImpute.convert_compressed - Function
convert_compressed(t<:Real, phaseinfo::String, reffile::String)

Converts phaseinfo into a phased genotype matrix of type t using the full reference haplotype panel H.

Inputs

  • t: Type of matrix. If Bool, genotypes are converted to a BitMatrix
  • phaseinfo: Vector of HaplotypeMosaicPairs stored in .jlso format
  • reffile: The complete (uncompressed) haplotype reference file

Output

  • X1: allele 1 of the phased genotype. Each column is a sample. X = X1 + X2.
  • X2: allele 2 of the phased genotype. Each column is a sample. X = X1 + X2.
  • phase: the original data structure after phasing and imputation.
  • sampleID: The IDs of the imputed samples.
  • H: the complete reference haplotype panel. Columns of H are haplotypes.
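
The relation X = X1 + X2 can be illustrated for a single sample with a small sketch. It assumes a strand of the mosaic can be described by segment start positions and the column of H each segment copies from; this is a hypothetical simplification of HaplotypeMosaicPair, and copy_strand is not a MendelImpute function.

```julia
# Hypothetical stand-in for one strand of a mosaic: segment i starts at SNP
# start[i] and copies column label[i] of the reference panel H.
function copy_strand(H::AbstractMatrix{Bool}, start::Vector{Int}, label::Vector{Int})
    nsnps = size(H, 1)
    strand = falses(nsnps)
    for i in eachindex(start)
        stop = i < length(start) ? start[i + 1] - 1 : nsnps   # segment ends before the next start
        strand[start[i]:stop] .= H[start[i]:stop, label[i]]
    end
    return strand
end

H  = Bool[1 0 1; 0 0 1; 1 1 0; 0 1 0]     # 4 SNPs x 3 reference haplotypes (columns)
x1 = copy_strand(H, [1, 3], [1, 2])       # SNPs 1-2 from haplotype 1, SNPs 3-4 from haplotype 2
x2 = copy_strand(H, [1], [3])             # whole strand copied from haplotype 3
x  = x1 .+ x2                             # genotype as allele counts 0, 1, or 2
```

In the matrices described above, x1 and x2 would each form one column of X1 and X2 respectively, and x the matching column of X.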
source
convert_compressed(t<:Real, phaseinfo::Vector{HaplotypeMosaicPair}, H::AbstractMatrix)

Columns of H are haplotypes.

source