API

Documentation for MendelImpute.jl's functions.

Index

Functions

MendelImpute.phase - Function
phase(tgtfile::String, reffile::String, outfile::String; [impute::Bool],
    [phase::Bool], [dosage::Bool], [rescreen::Bool], [max_haplotypes::Int], 
    [stepwise::Int], [thinning_factor::Int], [scale_allelefreq::Bool], 
    [dynamic_programming::Bool])

Main function of the MendelImpute program. Phases (haplotypes) tgtfile against the pool of reference haplotypes in reffile using sliding windows, and saves the result in outfile. All SNPs in tgtfile must be present in reffile. Per-SNP quality scores are saved in outfile, while per-sample imputation scores are saved in a file ending in sample.error.

Input

  • tgtfile: VCF, PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory. BGEN files should end in .bgen.
  • reffile: Reference haplotype file ending in .jlso (compressed binary files). See compress_haplotypes.
  • outfile: output filename ending in .vcf.gz, .vcf, or .jlso. VCF output genotypes will have no missing data. If ending in .jlso, will output ultra-compressed data structure recording HaplotypeMosaicPairs for each sample.
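The file-naming conventions above can be checked before calling phase. The following is a minimal sketch in plain Julia; target_format is a hypothetical helper introduced here for illustration and is not part of MendelImpute.

```julia
# Hypothetical helper (not a MendelImpute function): classify a target file
# according to the naming conventions described above.
function target_format(path::AbstractString)
    if endswith(path, ".vcf") || endswith(path, ".vcf.gz")
        return :vcf
    elseif endswith(path, ".bgen")
        return :bgen
    elseif all(isfile(path * ext) for ext in (".bed", ".bim", ".fam"))
        # PLINK inputs are given without a suffix; all three files must exist
        return :plink
    else
        return :unknown
    end
end
```

A PLINK prefix is only recognized when the .bed, .bim, and .fam files all exist next to each other, which mirrors the requirement stated for tgtfile.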

Optional Inputs

  • impute: If true, imputes every SNP in reffile to tgtfile. Otherwise only missing SNPs in tgtfile will be imputed.
  • phase: If true, all output genotypes will be phased, but observed data (minor allele count) may be changed. If phase=false all output genotypes will be unphased but observed minor allele count will not change.
  • dosage: If true, assumes the target matrix contains dosages for imputation. Note this means the genotype matrix will be entirely single precision.
  • rescreen: This option is more computationally intensive but gives more accurate results. It saves a number of top haplotype pairs when solving the least squares objective, and re-minimizes least squares on just the observed data.
  • max_haplotypes: Maximum number of haplotypes for using global search. Windows exceeding this number of unique haplotypes will be searched using a heuristic. A non-zero stepwise or thinning_factor must be specified.
  • stepwise: If an integer is specified, solves the least squares objective by first finding the top stepwise haplotypes with a stepwise heuristic, then finding the next haplotype using global search. Uses max_haplotypes.
  • thinning_factor: If an integer is specified, solves the least squares objective on only thinning_factor unique haplotypes. Uses max_haplotypes.
  • scale_allelefreq: Boolean indicating whether to give rare SNPs more weight, scaled by wᵢ = 1 / √(2pᵢ(1 − pᵢ)), where pᵢ is the allele frequency of SNP i and the maximum weight is 2.
  • dynamic_programming: Boolean indicating whether to phase with a global search that finds the longest haplotype stretch over all windows. (Currently broken, sorry!)
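The scale_allelefreq weighting can be sketched in a few lines of Julia. This is a minimal sketch, assuming the square root covers the whole product 2p(1 − p) and that the cap of 2 is applied by taking a minimum; snp_weight is a hypothetical helper, not a MendelImpute function.

```julia
# Assumed form of the weight: 1 / sqrt(2p(1 - p)), capped at 2.
# `p` is the allele frequency of one SNP.
snp_weight(p::Real) = min(1 / sqrt(2 * p * (1 - p)), 2.0)

freqs = [0.5, 0.25, 0.01]      # common, intermediate, and rare SNPs
weights = snp_weight.(freqs)   # rarer SNPs get larger weights, up to the cap
```

Under these assumptions a SNP with frequency 0.5 gets weight √2, and a SNP with frequency 0.01 would exceed the cap and is assigned weight 2.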
MendelImpute.compress_haplotypes - Function
compress_haplotypes(reffile::String, tgtfile::String, outfile::String, 
    [d::Int], [minwidth::Int], [overlap::Float64])

Cuts a haplotype matrix reffile into windows of variable width so that each window has fewer than d unique haplotypes. Saves the result to outfile in a compressed binary format. All SNPs in tgtfile must be present in reffile. All genotypes in reffile must be phased and non-missing, and all genotype positions must be unique.

Inputs

  • reffile: reference haplotype file name (ends in .vcf, .vcf.gz, or .bgen)
  • tgtfile: VCF, PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude .bim/.bed/.fam suffixes but the trio must all be present in the same directory. BGEN files should end in .bgen.
  • outfile: Output file name (ends in .jlso)

Optional Inputs

  • d: Max number of unique haplotypes per genotype window (default d = 1000).
  • minwidth: Minimum number of typed SNPs per window (default 0)
  • overlap: How much overlap between adjacent genotype windows in percentage of each window's width (default 0.0)
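The idea of cutting a haplotype matrix into variable-width windows bounded by d can be illustrated with a toy example. This is a sketch under stated assumptions and not MendelImpute's implementation: greedy_windows is a hypothetical function, rows of H are SNPs and columns are haplotypes, and minwidth, overlap, and the restriction to typed SNPs are all ignored.

```julia
# Toy sketch: grow each window one SNP at a time and close it just before the
# number of unique haplotypes (unique columns) in the window would reach d.
# A window is always at least one SNP wide.
function greedy_windows(H::AbstractMatrix, d::Integer)
    nsnps = size(H, 1)
    windows = UnitRange{Int}[]
    start = 1
    while start <= nsnps
        stop = start
        # extend while the widened window still has fewer than d unique columns
        while stop < nsnps &&
              length(unique(eachcol(view(H, start:stop+1, :)))) < d
            stop += 1
        end
        push!(windows, start:stop)
        start = stop + 1
    end
    return windows
end

# 4 SNPs (rows) by 6 haplotypes (columns)
H = Bool[0 0 0 1 1 1;
         0 0 0 1 1 1;
         0 1 0 1 0 1;
         1 1 0 0 0 1]

windows = greedy_windows(H, 5)
```

With a smaller d the windows become narrower, because fewer SNPs can be added before the number of distinct haplotypes reaches the limit.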

Why is tgtfile required?

The unique haplotypes in each window are computed on the typed SNPs only. A genotype matrix tgtfile is used to identify the typed SNPs. In the future, we hope to pre-compute compressed haplotype panels for all genotyping platforms and provide them as downloadable files. For now, users must run this function themselves.

MendelImpute.convert_compressed - Function
convert_compressed(t<:Real, phaseinfo::String, reffile::String)

Converts phaseinfo into a phased genotype matrix of type t using the full reference haplotype panel H

Inputs

  • t: Type of matrix. If Bool, genotypes are converted to a BitMatrix.
  • phaseinfo: Vector of HaplotypeMosaicPairs stored in .jlso format
  • reffile: The complete (uncompressed) haplotype reference file

Output

  • X1: allele 1 of the phased genotype. Each column is a sample. X = X1 + X2.
  • X2: allele 2 of the phased genotype. Each column is a sample. X = X1 + X2.
  • phase: the original data structure after phasing and imputation.
  • sampleID: The IDs of the imputed samples.
  • H: the complete reference haplotype panel. Columns of H are haplotypes.
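The relation X = X1 + X2 can be illustrated with a toy mosaic. This is a minimal sketch: ToyMosaic and expand are hypothetical stand-ins introduced here, and the real HaplotypeMosaicPair layout may differ. It assumes rows of H are SNPs, columns of H are haplotypes, and the first segment of each strand starts at SNP 1.

```julia
# Hypothetical stand-in for one strand of a mosaic: segment k copies reference
# haplotype `haplotypes[k]` starting at SNP `starts[k]`.
struct ToyMosaic
    starts::Vector{Int}
    haplotypes::Vector{Int}
end

# Expand one strand into a vector of alleles of element type T.
function expand(m::ToyMosaic, H::AbstractMatrix, ::Type{T}) where {T<:Real}
    nsnps = size(H, 1)
    x = Vector{T}(undef, nsnps)
    for k in eachindex(m.starts)
        # a segment runs until the next segment starts, or to the last SNP
        stop = k < length(m.starts) ? m.starts[k+1] - 1 : nsnps
        x[m.starts[k]:stop] .= H[m.starts[k]:stop, m.haplotypes[k]]
    end
    return x
end

# 4 SNPs (rows) by 3 reference haplotypes (columns)
H = Bool[0 1 1;
         1 0 1;
         0 0 1;
         1 1 0]

strand1 = ToyMosaic([1, 3], [1, 2])  # SNPs 1-2 from haplotype 1, SNPs 3-4 from haplotype 2
strand2 = ToyMosaic([1], [3])        # all SNPs from haplotype 3

x1 = expand(strand1, H, Float64)
x2 = expand(strand2, H, Float64)
x = x1 + x2                          # unphased genotype of this sample
```

Storing only the segment boundaries and haplotype indices is what makes this kind of output compact: the full genotype can be rebuilt on demand from the reference panel.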
convert_compressed(t<:Real, phaseinfo::Vector{HaplotypeMosaicPair}, H::AbstractMatrix)

Columns of H are haplotypes.

MendelImpute.admixture_global - Function
admixture_global(tgtfile::String, reffile::String, 
    refID_to_population::Dict{String, String}, populations::Vector{String})

Computes global ancestry estimates for each sample in tgtfile using a labeled reference panel reffile.

Inputs

  • tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
  • reffile: Reference haplotype file ending in .jlso (compressed binary files). See compress_haplotypes.
  • refID_to_population: A dictionary mapping each sample ID in the haplotype reference panel to its population of origin. For examples, see the output of thousand_genome_population_to_superpopulation and thousand_genome_samples_to_super_population.
  • populations: A vector of String containing unique populations present in values(refID_to_population).

Optional Inputs

  • Q_outfile: Output file name for the estimated Q matrix. Default Q_outfile = "mendelimpute.ancestry.Q".
  • imputed_outfile: Output file name for the imputed genotypes, ending in .jlso. Default imputed_outfile = "mendelimpute.ancestry.Q.jlso".

Output

  • Q: A DataFrame containing estimated ancestry fractions. Each row is a sample. The table is saved to Q_outfile (by default, mendelimpute.ancestry.Q).
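The inputs refID_to_population and populations, and the meaning of one row of Q, can be illustrated with toy data. This is a sketch under assumptions: the sample IDs and copied lengths are made up, ancestry_fractions is a hypothetical helper, and whether admixture_global computes its estimates in exactly this way is not confirmed here.

```julia
# Toy labeled reference panel: reference sample ID => population
refID_to_population = Dict("ref1" => "AFR", "ref2" => "EUR", "ref3" => "EUR")

# Unique populations present in values(refID_to_population)
populations = sort(unique(values(refID_to_population)))

# Made-up data: number of SNPs one target sample copied from each reference sample
copied = Dict("ref1" => 300, "ref2" => 500, "ref3" => 200)

# Hypothetical helper: fraction of the copied material per population,
# returned in the same order as `populations`.
function ancestry_fractions(copied, refID_to_population, populations)
    totals = Dict(pop => 0.0 for pop in populations)
    for (id, len) in copied
        totals[refID_to_population[id]] += len
    end
    total = sum(values(totals))
    return [totals[pop] / total for pop in populations]
end

q = ancestry_fractions(copied, refID_to_population, populations)
```

Each row of Q plays the role of q here: one fraction per entry of populations, summing to 1 for a given sample.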
MendelImpute.admixture_local - Function
admixture_local(tgtfile::String, reffile::String, 
    refID_to_population::Dict{String, String}, populations::Vector{String},
    population_colors::Vector{RGB{FixedPointNumbers.N0f8}})

Computes local ancestry estimates for each sample in tgtfile using a labeled reference panel reffile.

Inputs

  • tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
  • reffile: Reference haplotype file ending in .jlso (compressed binary files). See compress_haplotypes.
  • refID_to_population: A dictionary mapping each sample ID in the haplotype reference panel to its population of origin. For examples, see the output of thousand_genome_population_to_superpopulation and thousand_genome_samples_to_super_population.
  • populations: A vector of String containing the unique populations present in values(refID_to_population).
  • population_colors: A vector of colors, one for each population. typeof(population_colors) should be Vector{RGB{FixedPointNumbers.N0f8}}.

Output

  • Q: Matrix containing estimated ancestry fractions. Each row is a haplotype. Sample 1's haplotypes are in rows 1 and 2, sample 2's are in rows 3 and 4, and so on.
  • pop_colors: Matrix with the same dimensions as Q storing colors.
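The row layout of Q (two consecutive haplotype rows per sample) can be made concrete with a short sketch. The helpers haplotype_rows and sample_fractions are hypothetical and the matrix below is made-up data; averaging the two haplotype rows is only one possible way to summarize a sample.

```julia
# Rows of Q that hold the two haplotypes of sample i
haplotype_rows(i::Integer) = (2i - 1, 2i)

# Hypothetical summary: average the two haplotype rows of each sample
function sample_fractions(Q::AbstractMatrix)
    nsamples = size(Q, 1) ÷ 2
    return [(Q[2i - 1, j] + Q[2i, j]) / 2 for i in 1:nsamples, j in 1:size(Q, 2)]
end

# Made-up Q for 2 samples (4 haplotype rows) and 2 populations (columns)
Q = [1.0 0.0;
     0.5 0.5;
     0.0 1.0;
     0.0 1.0]

per_sample = sample_fractions(Q)
```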
MendelImpute.thousand_genome_samples_to_population - Function
thousand_genome_samples_to_population()

Creates a dictionary mapping sample IDs of the 1000 Genomes Project to 26 population codes.

Population codes and super population codes are described here: https://www.internationalgenome.org/category/population/

MendelImpute.thousand_genome_samples_to_super_population - Function
thousand_genome_samples_to_super_population()

Creates a dictionary mapping sample IDs of the 1000 Genomes Project to 5 super population codes.

Population codes and super population codes are described here: https://www.internationalgenome.org/category/population/

MendelImpute.thousand_genome_population_to_superpopulation - Function
thousand_genome_population_to_superpopulation()

Creates a dictionary mapping population codes of the 1000 Genomes Project to their super population codes.

Population codes and super population codes are described here: https://www.internationalgenome.org/category/population/
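The three mappings relate to each other by composition: a sample's super population is the super population of its population. The sketch below uses small hand-written dictionaries with made-up sample IDs as stand-ins for the real outputs, whose exact contents are not reproduced here.

```julia
# Stand-ins for the dictionaries described above (sample IDs are made up)
sample_to_pop = Dict("sampleA" => "GBR", "sampleB" => "YRI")
pop_to_super  = Dict("GBR" => "EUR", "YRI" => "AFR")

# Compose the two maps: sample ID => super population code
sample_to_super = Dict(id => pop_to_super[pop] for (id, pop) in sample_to_pop)

# Unique super populations, e.g. for use as a `populations` argument
super_populations = sort(unique(values(sample_to_super)))
```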
