API
Documentation for MendelImpute.jl's functions.
Index
MendelImpute.composition
MendelImpute.compress_haplotypes
MendelImpute.convert_compressed
MendelImpute.paint
MendelImpute.phase
MendelImpute.unique_populations
Functions
MendelImpute.phase — Function
phase(tgtfile::String, reffile::String, outfile::String; [impute::Bool],
    [phase::Bool], [dosage::Bool], [rescreen::Bool], [max_haplotypes::Int],
    [stepwise::Int], [thinning_factor::Int], [scale_allelefreq::Bool],
    [dynamic_programming::Bool])
Main function of the MendelImpute program. Phases (haplotypes) tgtfile against the pool of haplotypes in reffile using sliding windows and saves the result in outfile. All SNPs in tgtfile must be present in reffile. A per-sample imputation score (lower is better) is saved in a file ending in sample.error.
Input
tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude .bim/.bed/.fam suffixes but the trio must all be present in the same directory.
reffile: Reference haplotype file ending in .vcf, .vcf.gz, or .jlso (compressed binary files).
outfile: Output filename ending in .vcf.gz, .vcf, or .jlso. VCF output genotypes will have no missing data. If ending in .jlso, will output an ultra-compressed data structure recording HaplotypeMosaicPairs for each sample.
Optional Inputs
impute: If true, imputes every SNP in reffile to tgtfile. Otherwise only missing SNPs in tgtfile will be imputed.
phase: If true, all output genotypes will be phased, but observed data (minor allele count) may be changed. If phase=false, all output genotypes will be unphased but the observed minor allele count will not change.
dosage: If true, will assume the target matrix contains dosages for imputation. Note this means the genotype matrix will be entirely single precision.
rescreen: This option is more computationally intensive but gives more accurate results. It saves a number of top haplotype pairs when solving the least squares objective, and re-minimizes least squares on just the observed data.
max_haplotypes: Maximum number of haplotypes for using global search. Windows exceeding this number of unique haplotypes will be searched using a heuristic. A non-zero stepwise or thinning_factor needs to be specified.
stepwise: If an integer is specified, will solve the least squares objective by first finding stepwise top haplotypes using a stepwise heuristic, then finding the next haplotype using global search. Uses max_haplotypes.
thinning_factor: If an integer is specified, will solve the least squares objective on only thinning_factor unique haplotypes. Uses max_haplotypes.
scale_allelefreq: Boolean indicating whether to give rare SNPs more weight, scaled by wᵢ = 1 / √(2p(1-p)), where the max weight is 2.
dynamic_programming: Boolean indicating whether to phase with a global search that finds the longest haplotype stretch over all windows. (Currently broken, sorry!)
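The scale_allelefreq weighting can be illustrated with a few lines of plain Julia. This is only a sketch of the formula as written above, not the package's internal code: the function name snp_weight, the maxweight keyword, and the reading of "max weight is 2" as a simple cap are all assumptions, as is treating p as the allele frequency of a SNP.

```julia
# Hypothetical helper (not part of MendelImpute): weight for a SNP with
# allele frequency p, following w = 1 / sqrt(2p(1-p)) and capped at maxweight.
# Interpreting "max weight is 2" as a cap via min() is an assumption.
snp_weight(p::Real; maxweight::Real = 2.0) =
    min(1 / sqrt(2 * p * (1 - p)), maxweight)

# Common SNPs get weights near sqrt(2); rare SNPs hit the cap.
freqs = [0.5, 0.2, 0.05, 0.01]
weights = snp_weight.(freqs)
```

Under this reading the weights range from about 1.41 (at p = 0.5) up to the cap of 2, so rare SNPs count at most about 1.4 times as much as the most common ones.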
MendelImpute.compress_haplotypes — Function
compress_haplotypes(reffile::String, tgtfile::String, outfile::String,
    [d::Int], [minwidth::Int], [overlap::Float64])
Cuts a haplotype matrix reffile into windows of variable width so that each window has fewer than d unique haplotypes. Saves the result to outfile in a compressed binary format. All SNPs in tgtfile must be present in reffile.
Why is tgtfile required?
The unique haplotypes in each window are computed on the typed SNPs only. A genotype matrix tgtfile is used to identify the typed SNPs. In the future, we hope to pre-compute compressed haplotype panels for all genotyping platforms and provide them as downloadable files. Currently, however, users must run this function themselves.
Inputs
reffile: Reference haplotype file name (ends in .vcf or .vcf.gz)
tgtfile: Target genotype file name (ends in .vcf or .vcf.gz)
outfile: Output file name (ends in .jlso)
Optional Inputs
d: Max number of unique haplotypes per genotype window (default d = 1000).
minwidth: Minimum number of typed SNPs per window (default 0)
overlap: How much overlap between adjacent genotype windows in percentage of each window's width (default 0.0)
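To make the idea of variable-width windows bounded by d unique haplotypes concrete, here is a toy greedy partition in plain Julia. It is not the algorithm used by compress_haplotypes, and it ignores minwidth, overlap, and the restriction to typed SNPs. The function names and the small matrix are hypothetical; rows are SNPs and columns are haplotypes.

```julia
# Number of distinct haplotypes (columns) when H is restricted to SNP rows `rows`.
count_unique_haplotypes(H::AbstractMatrix, rows::UnitRange{Int}) =
    size(unique(H[rows, :], dims = 2), 2)

# Toy greedy partition (an assumption, not the package's method): grow each
# window one SNP at a time and close it just before it would reach d unique
# haplotypes. A single-SNP window has at most 2 unique haplotypes, so the
# bound is only guaranteed for d > 2.
function greedy_windows(H::AbstractMatrix, d::Int)
    windows = UnitRange{Int}[]
    nsnps = size(H, 1)
    start = 1
    while start <= nsnps
        stop = start
        while stop < nsnps && count_unique_haplotypes(H, start:stop+1) < d
            stop += 1
        end
        push!(windows, start:stop)
        start = stop + 1
    end
    return windows
end

# 4 SNPs (rows) x 4 haplotypes (columns)
H = Bool[0 0 1 1;
         0 1 0 1;
         0 0 0 0;
         1 1 0 0]
windows = greedy_windows(H, 4)
```

Each returned range covers a run of consecutive SNPs, and together the ranges tile all rows of H without overlap.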
MendelImpute.paint — Function
paint(sample_phase::HaplotypeMosaicPair, panelID::Vector{String},
    refID_to_population::Dict{String, String}, populations::Vector{String})
Converts a person's phased haplotype lengths into segments of percentages. This function makes it easier to plot a "painted chromosome".
Inputs
sample_phase: A HaplotypeMosaicPair storing phase information for a sample, including haplotype start position and haplotype label.
panelID: Sample IDs in the reference haplotype panel
refID_to_population: A dictionary mapping each ID in the haplotype reference panel to its population origin.
Optional inputs
populations: A unique list of populations present in refID_to_population
Output
composition: A list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]
MendelImpute.composition — Function
composition(sample_phase::HaplotypeMosaicPair, panelID::Vector{String},
    refID_to_population::Dict{String, String}, [populations::Vector{String}])
Computes a sample's chromosome composition based on phase information. This function makes it easier to plot a person's admixed proportions.
Inputs
sample_phase: A HaplotypeMosaicPair storing phase information for a sample, including haplotype start position and haplotype label.
panelID: Sample IDs in the reference haplotype panel
refID_to_population: A dictionary mapping each ID in the haplotype reference panel to its population origin.
Optional inputs
populations: A unique list of populations present in refID_to_population
Output
composition: A list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]
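The shape of the composition output can be sketched without the package. The example below assumes a simplified, hypothetical input: a list of (segment length, population label) pairs standing in for what a HaplotypeMosaicPair, panelID, and refID_to_population would jointly provide. The function name toy_composition and the population labels are illustrative only.

```julia
# Hypothetical stand-in for composition(): sums segment lengths per population
# and converts them to percentages, ordered like `populations`.
function toy_composition(segments::Vector{Tuple{Int,String}},
                         populations::Vector{String})
    total = sum(first, segments)                      # total length covered
    lengths = Dict(pop => 0 for pop in populations)   # length per population
    for (len, pop) in segments
        lengths[pop] += len
    end
    return [100 * lengths[pop] / total for pop in populations]
end

# Illustrative data: segment lengths (in SNPs) and made-up population labels.
segments = [(300, "CEU"), (500, "YRI"), (200, "CEU")]
populations = ["CEU", "YRI", "CHB"]
pct = toy_composition(segments, populations)
```

Here pct[i] is the percentage attributed to populations[i], and the entries sum to 100, matching the description of the output above.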
MendelImpute.unique_populations — Function
unique_populations(x::Dict{String, String})
Computes the unique list of populations, preserving order. x is a Dict where each sample is a key and its population is the value.
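A minimal sketch of the same idea using only Base Julia, assuming "preserving order" means the order in which populations are first encountered while iterating over the Dict (a Dict itself has no guaranteed key order). The name toy_unique_populations and the sample IDs are hypothetical.

```julia
# Hypothetical equivalent: unique() keeps the first occurrence of each value,
# in the order values(x) yields them.
toy_unique_populations(x::Dict{String,String}) = unique(values(x))

# Made-up sample IDs mapped to made-up population labels.
refID_to_population = Dict("sample1" => "CEU", "sample2" => "YRI",
                           "sample3" => "CEU")
pops = toy_unique_populations(refID_to_population)
```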
MendelImpute.convert_compressed — Function
convert_compressed(t<:Real, phaseinfo::String, reffile::String)
Converts phaseinfo into a phased genotype matrix of type t using the full reference haplotype panel H.
Inputs
t: Type of matrix. If Bool, genotypes are converted to a BitMatrix
phaseinfo: Vector of HaplotypeMosaicPairs stored in .jlso format
reffile: The complete (uncompressed) haplotype reference file
Output
X1: Allele 1 of the phased genotype. Each column is a sample. X = X1 + X2.
X2: Allele 2 of the phased genotype. Each column is a sample. X = X1 + X2.
phase: The original data structure after phasing and imputation.
sampleID: The IDs of each imputed person.
H: The complete reference haplotype panel. Columns of H are haplotypes.
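The relationship X = X1 + X2 can be seen on a toy panel. The sketch below does not call convert_compressed. It assumes, for simplicity, that each sample copies a single reference haplotype per strand along the whole region, whereas a real HaplotypeMosaicPair records a mosaic of haplotype segments. The index vectors hap1 and hap2 are hypothetical.

```julia
# Toy reference panel: 3 SNPs (rows) x 3 haplotypes (columns).
H = Bool[1 0 1;
         0 0 1;
         1 1 0]

# Hypothetical phase for 2 samples: sample 1 copies haplotypes 1 and 3,
# sample 2 copies haplotype 2 on both strands.
hap1 = [1, 2]
hap2 = [3, 2]

X1 = H[:, hap1]   # allele 1, one column per sample
X2 = H[:, hap2]   # allele 2, one column per sample
X = X1 + X2       # genotype counts in {0, 1, 2}
```

Each column of X is one sample's genotype vector, with entries counting how many of the two strands carry the allele coded as 1 in H.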
convert_compressed(t<:Real, phaseinfo::Vector{HaplotypeMosaicPair}, H::AbstractMatrix)
Columns of H are haplotypes.