API
Documentation for MendelImpute.jl's functions.
Index
MendelImpute.admixture_global
MendelImpute.admixture_local
MendelImpute.compress_haplotypes
MendelImpute.convert_compressed
MendelImpute.phase
MendelImpute.thousand_genome_population_to_superpopulation
MendelImpute.thousand_genome_samples_to_population
MendelImpute.thousand_genome_samples_to_super_population
Functions
MendelImpute.phase — Function
phase(tgtfile::String, reffile::String, outfile::String; [impute::Bool],
[phase::Bool], [dosage::Bool], [rescreen::Bool], [max_haplotypes::Int],
[stepwise::Int], [thinning_factor::Int], [scale_allelefreq::Bool],
[dynamic_programming::Bool])
Main function of the MendelImpute program. Phases (haplotypes) tgtfile against the pool of haplotypes in reffile using sliding windows and saves the result in outfile. All SNPs in tgtfile must be present in reffile. A per-SNP quality score is saved in outfile, and a per-sample imputation score is saved in a file ending in sample.error.
Input
tgtfile: VCF, PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory. BGEN files should end in .bgen.
reffile: Reference haplotype file ending in .jlso (compressed binary file). See compress_haplotypes.
outfile: Output filename ending in .vcf.gz, .vcf, or .jlso. VCF output genotypes will have no missing data. If the name ends in .jlso, the output is an ultra-compressed data structure recording a HaplotypeMosaicPair for each sample.
Optional Inputs
impute: If true, imputes every SNP in reffile to tgtfile. Otherwise only missing SNPs in tgtfile will be imputed.
phase: If true, all output genotypes will be phased, but observed data (minor allele count) may be changed. If phase=false, all output genotypes will be unphased but the observed minor allele count will not change.
dosage: If true, treats the target matrix as dosages for imputation. Note that the genotype matrix will then be stored entirely in single precision.
rescreen: This option is more computationally intensive but gives more accurate results. It saves a number of top haplotype pairs when solving the least squares objective, then re-minimizes least squares on just the observed data.
max_haplotypes: Maximum number of haplotypes for using global search. Windows exceeding this number of unique haplotypes will be searched using a heuristic. A non-zero stepwise or thinning_factor needs to be specified.
stepwise: If an integer is specified, solves the least squares objective by first finding the top stepwise haplotypes using a stepwise heuristic, then finding the next haplotype using global search. Uses max_haplotypes.
thinning_factor: If an integer is specified, solves the least squares objective on only thinning_factor unique haplotypes. Uses max_haplotypes.
scale_allelefreq: Boolean indicating whether to give rare SNPs more weight, scaled by wᵢ = 1 / √(2p(1-p)), where the maximum weight is 2.
dynamic_programming: Boolean indicating whether to phase with a global search that finds the longest haplotype stretch over all windows. (Currently broken, sorry!)
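The weighting behind scale_allelefreq can be illustrated with a small standalone sketch. It assumes the formula reads w = 1 / √(2p(1-p)) with the result clipped at 2, as the option description suggests; allele_weight is a hypothetical helper written for this example and is not part of MendelImpute.

```julia
# Toy sketch of the rare-variant weighting described under scale_allelefreq.
# Assumption: w = 1 / sqrt(2p(1-p)), clipped at a maximum weight of 2.
# `allele_weight` is a hypothetical helper, not a MendelImpute function.
allele_weight(p::Real; maxweight::Real = 2.0) =
    min(1 / sqrt(2p * (1 - p)), maxweight)

freqs = [0.5, 0.3, 0.05]          # made-up alternate allele frequencies
weights = allele_weight.(freqs)   # common SNPs get weights near 1.4, rare SNPs hit the cap
```

Under this reading, a SNP with allele frequency 0.5 receives the smallest weight (√2), and any SNP rare enough that the raw weight exceeds 2 is capped.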
MendelImpute.compress_haplotypes — Function
compress_haplotypes(reffile::String, tgtfile::String, outfile::String,
[d::Int], [minwidth::Int], [overlap::Float64])
Cuts a haplotype matrix reffile into windows of variable width so that each window has fewer than d unique haplotypes. Saves the result to outfile in a compressed binary format. All SNPs in tgtfile must be present in reffile. All genotypes in reffile must be phased and non-missing, and all genotype positions must be unique.
Inputs
reffile: Reference haplotype file name (ends in .vcf, .vcf.gz, or .bgen).
tgtfile: VCF, PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory. BGEN files should end in .bgen.
outfile: Output file name (ends in .jlso).
Optional Inputs
d: Maximum number of unique haplotypes per genotype window (default d = 1000).
minwidth: Minimum number of typed SNPs per window (default 0).
overlap: How much adjacent genotype windows overlap, as a percentage of each window's width (default 0.0).
Why is tgtfile required?
The unique haplotypes in each window are computed on the typed SNPs only. A genotype matrix tgtfile is used to identify the typed SNPs. In the future, we hope to pre-compute compressed haplotype panels for all genotyping platforms and provide them as downloadable files. For now, users must run this function themselves.
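To see why the typed SNPs matter, the toy sketch below counts distinct haplotypes in one window twice: once using every SNP in the window and once using only the typed SNPs. The matrix and the typed indices are made up for illustration, and this is not how compress_haplotypes is implemented internally.

```julia
# Toy reference window: rows are SNPs, columns are haplotypes (made-up data).
H = Bool[
    1 1 0 0
    0 0 1 1
    1 0 1 0
    0 0 0 0
]

# Hypothetical indices of the SNPs in this window that are typed in tgtfile.
typed = [1, 2, 4]

# Number of distinct columns when all SNPs are used vs. typed SNPs only.
n_all   = size(unique(H, dims = 2), 2)
n_typed = size(unique(H[typed, :], dims = 2), 2)
```

Here the four haplotypes are all distinct on the full window but collapse to two once SNP 3 (untyped) is ignored, so the number of unique haplotypes per window depends on which SNPs the target data contains.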
MendelImpute.convert_compressed — Function
convert_compressed(t<:Real, phaseinfo::String, reffile::String)
Converts phaseinfo into a phased genotype matrix of type t using the full reference haplotype panel H.
Inputs
t: Type of matrix. If Bool, genotypes are converted to a BitMatrix.
phaseinfo: Vector of HaplotypeMosaicPairs stored in .jlso format.
reffile: The complete (uncompressed) haplotype reference file.
Output
X1: Allele 1 of the phased genotype. Each column is a sample. X = X1 + X2.
X2: Allele 2 of the phased genotype. Each column is a sample. X = X1 + X2.
phase: The original data structure after phasing and imputation.
sampleID: The IDs of the imputed samples.
H: The complete reference haplotype panel. Columns of H are haplotypes.
convert_compressed(t<:Real, phaseinfo::Vector{HaplotypeMosaicPair}, H::AbstractMatrix)
Columns of H are haplotypes.
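The relation X = X1 + X2 can be sketched on toy data. The example below rebuilds the two strands of a single sample from a made-up haplotype panel; the segment representation (start positions plus haplotype columns) and the helper fill_strand are hypothetical simplifications and do not reflect the actual HaplotypeMosaicPair layout.

```julia
# Toy panel: 4 SNPs (rows) x 3 haplotypes (columns), made-up data.
H = Bool[
    1 0 0
    0 1 0
    1 1 0
    0 0 1
]

# Hypothetical mosaic for one strand: `starts[k]` is the first SNP of
# segment k, and `haps[k]` is the column of H copied over that segment.
function fill_strand(H::AbstractMatrix, starts::Vector{Int}, haps::Vector{Int})
    nsnps = size(H, 1)
    x = zeros(Int, nsnps)
    for (k, s) in enumerate(starts)
        # A segment runs until the next segment starts, or to the last SNP.
        stop = k < length(starts) ? starts[k + 1] - 1 : nsnps
        x[s:stop] .= H[s:stop, haps[k]]
    end
    return x
end

x1 = fill_strand(H, [1, 3], [1, 2])  # haplotype 1 on SNPs 1-2, haplotype 2 on SNPs 3-4
x2 = fill_strand(H, [1], [3])        # haplotype 3 on every SNP
x  = x1 + x2                         # genotype as allele counts (0, 1, or 2)
```

Each strand is a mosaic of reference haplotypes, and summing the two strands gives the sample's genotype column.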
MendelImpute.admixture_global — Function
admixture_global(tgtfile::String, reffile::String,
refID_to_population::Dict{String, String}, populations::Vector{String})
Computes global ancestry estimates for each sample in tgtfile using a labeled reference panel reffile.
Inputs
tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
reffile: Reference haplotype file ending in .jlso (compressed binary file). See compress_haplotypes.
refID_to_population: A dictionary mapping each sample ID in the haplotype reference panel to its population of origin. For examples, see the output of thousand_genome_population_to_superpopulation and thousand_genome_samples_to_super_population.
populations: A vector of String containing the unique populations present in values(refID_to_population).
Optional Inputs
Q_outfile: Output file name for the estimated Q matrix. Default Q_outfile = "mendelimpute.ancestry.Q".
imputed_outfile: Output file name for the imputed genotypes, ending in .jlso. Default imputed_outfile = "mendelimpute.ancestry.Q.jlso".
Output
Q: A DataFrame containing estimated ancestry fractions. Each row is a sample. The matrix is saved to Q_outfile (mendelimpute.ancestry.Q by default).
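As a small illustration of how the refID_to_population and populations inputs relate, the sketch below derives the populations vector from a dictionary. The sample IDs and labels are hypothetical placeholders; a real dictionary would cover every sample in the reference panel.

```julia
# Hypothetical reference sample IDs and population labels, for illustration only.
refID_to_population = Dict(
    "ref1" => "EUR",
    "ref2" => "AFR",
    "ref3" => "EUR",
    "ref4" => "EAS",
)

# Unique labels present in the dictionary; sorting just makes the order reproducible.
populations = sort(unique(values(refID_to_population)))
```

Deriving populations from the dictionary this way guarantees the two inputs stay consistent with each other.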
MendelImpute.admixture_local — Function
admixture_local(tgtfile::String, reffile::String,
refID_to_population::Dict{String, String}, populations::Vector{String},
population_colors::Vector{RGB{FixedPointNumbers.N0f8}})
Computes local ancestry estimates for each sample in tgtfile using a labeled reference panel reffile.
Inputs
tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
reffile: Reference haplotype file ending in .jlso (compressed binary file). See compress_haplotypes.
refID_to_population: A dictionary mapping each sample ID in the haplotype reference panel to its population of origin. For examples, see the output of thousand_genome_population_to_superpopulation and thousand_genome_samples_to_super_population.
populations: A vector of String containing the unique populations present in values(refID_to_population).
population_colors: A vector of colors, one for each population. typeof(population_colors) should be Vector{RGB{FixedPointNumbers.N0f8}}.
Output
Q: Matrix containing estimated ancestry fractions. Each row is a haplotype. Sample 1's haplotypes are in rows 1 and 2, sample 2's are in rows 3 and 4, and so on.
pop_colors: Matrix with the same dimensions as Q storing colors.
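The row layout of Q can be made concrete with a toy matrix. The values below are made up, sample_rows is a hypothetical helper, and averaging the two haplotype rows into one row per sample is only one possible way to summarize the output, not something admixture_local does itself.

```julia
# Made-up per-haplotype values: 2 samples => 4 rows, 3 populations => 3 columns.
Q = [
    1.0 0.0 0.0
    0.5 0.5 0.0
    0.0 1.0 0.0
    0.0 0.0 1.0
]

# Rows 2i-1 and 2i hold the two haplotypes of sample i (hypothetical helper).
sample_rows(i::Int) = (2i - 1, 2i)

# One possible per-sample summary: average each sample's two haplotype rows.
Q_sample = (Q[1:2:end, :] .+ Q[2:2:end, :]) ./ 2
```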
MendelImpute.thousand_genome_samples_to_population — Function
thousand_genome_samples_to_population()
Creates a dictionary mapping sample IDs of the 1000 Genomes Project to 26 population codes.
Population codes and super-population codes are described here: https://www.internationalgenome.org/category/population/
MendelImpute.thousand_genome_samples_to_super_population — Function
thousand_genome_samples_to_super_population()
Creates a dictionary mapping sample IDs of the 1000 Genomes Project to 5 super-population codes.
Population codes and super-population codes are described here: https://www.internationalgenome.org/category/population/
MendelImpute.thousand_genome_population_to_superpopulation — Function
thousand_genome_population_to_superpopulation()
Creates a dictionary mapping population codes of the 1000 Genomes Project to their super-population codes.
Population codes and super-population codes are described here: https://www.internationalgenome.org/category/population/
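The three dictionaries are related by composition: mapping samples to populations and then populations to super-populations yields a sample to super-population mapping. The sketch below shows this on hand-written toy dictionaries. The sample IDs are hypothetical placeholders, and only three of the population codes are included.

```julia
# Hand-written toy dictionaries; sample IDs are hypothetical placeholders.
sample_to_pop = Dict("sampleA" => "GBR", "sampleB" => "YRI", "sampleC" => "CHB")
pop_to_super  = Dict("GBR" => "EUR", "YRI" => "AFR", "CHB" => "EAS")

# Compose the two mappings to label each sample with its super-population.
sample_to_super = Dict(id => pop_to_super[pop] for (id, pop) in sample_to_pop)
```

A dictionary of this shape is what the refID_to_population argument of the admixture functions expects, with keys taken from the reference panel's sample IDs.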