API
Documentation for MendelImpute.jl's functions.
Index
MendelImpute.composition
MendelImpute.compress_haplotypes
MendelImpute.convert_compressed
MendelImpute.paint
MendelImpute.phase
MendelImpute.unique_populations
Functions
MendelImpute.phase — Function
phase(tgtfile::String, reffile::String, outfile::String; [impute::Bool],
[phase::Bool], [dosage::Bool], [rescreen::Bool], [max_haplotypes::Int],
[stepwise::Int], [thinning_factor::Int], [scale_allelefreq::Bool],
[dynamic_programming::Bool])
Main function of the MendelImpute program. Phases (haplotypes) tgtfile from the pool of haplotypes in reffile by sliding windows and saves the result in outfile. All SNPs in tgtfile must be present in reffile. A per-sample imputation score (lower is better) is saved in a file ending in sample.error.
Input
tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
reffile: Reference haplotype file ending in .vcf, .vcf.gz, or .jlso (compressed binary files).
outfile: Output filename ending in .vcf.gz, .vcf, or .jlso. VCF output genotypes will have no missing data. If the name ends in .jlso, the output is an ultra-compressed data structure recording the HaplotypeMosaicPairs for each sample.
Optional Inputs
impute: If true, imputes every SNP in reffile into tgtfile. Otherwise only missing SNPs in tgtfile will be imputed.
phase: If true, all output genotypes will be phased, but observed data (minor allele count) may be changed. If phase=false, all output genotypes will be unphased but the observed minor allele count will not change.
dosage: If true, assumes the target matrix contains dosages for imputation. Note this means the genotype matrix will be entirely single precision.
rescreen: This option is more computationally intensive but gives more accurate results. It saves a number of top haplotype pairs when solving the least squares objective, then re-minimizes least squares on just the observed data.
max_haplotypes: Maximum number of haplotypes for using global search. Windows exceeding this number of unique haplotypes will be searched using a heuristic. A non-zero stepwise or thinning_factor needs to be specified.
stepwise: If an integer is specified, solves the least squares objective by first finding stepwise top haplotypes using a stepwise heuristic, then finds the next haplotype using global search. Uses max_haplotypes.
thinning_factor: If an integer is specified, solves the least squares objective on only thinning_factor unique haplotypes. Uses max_haplotypes.
scale_allelefreq: Boolean indicating whether to give rare SNPs more weight, scaled by wᵢ = 1 / √(2p(1-p)), where the max weight is 2.
dynamic_programming: Boolean indicating whether to phase with a global search that finds the longest haplotype stretch over all windows. (Currently broken, sorry!)
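The scale_allelefreq weighting can be illustrated with a short sketch. It assumes the formula reads w = 1 / √(2p(1-p)), with p the allele frequency of a SNP and the weight capped at 2. The helper name allelefreq_weight is hypothetical and is not part of MendelImpute.

```julia
# Hypothetical helper, not part of MendelImpute: weight for a SNP with
# allele frequency p, assuming w = 1 / sqrt(2p(1 - p)) capped at 2.
allelefreq_weight(p::Real) = min(1 / sqrt(2 * p * (1 - p)), 2.0)

# Common SNPs receive smaller weights; sufficiently rare SNPs hit the cap.
weights = allelefreq_weight.([0.5, 0.25, 0.01])
```

Under these assumptions the cap also guards the monomorphic case p = 0, where the uncapped weight would be infinite.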
MendelImpute.compress_haplotypes — Function
compress_haplotypes(reffile::String, tgtfile::String, outfile::String,
[d::Int], [minwidth::Int], [overlap::Float64])
Cuts the haplotype matrix in reffile into windows of variable width so that each window has fewer than d unique haplotypes. Saves the result to outfile in a compressed binary format. All SNPs in tgtfile must be present in reffile.
Why is tgtfile required?
The unique haplotypes in each window are computed on the typed SNPs only. A genotype matrix tgtfile is used to identify the typed SNPs. In the future, hopefully we can pre-compute compressed haplotype panels for all genotyping platforms and provide them as downloadable files. But currently, users must run this function themselves.
Inputs
reffile: Reference haplotype file name (ends in .vcf or .vcf.gz)
tgtfile: Target genotype file name (ends in .vcf or .vcf.gz)
outfile: Output file name (ends in .jlso)
Optional Inputs
d: Max number of unique haplotypes per genotype window (default d = 1000).
minwidth: Minimum number of typed SNPs per window (default 0)
overlap: How much overlap between adjacent genotype windows, in percentage of each window's width (default 0.0)
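To make the idea of variable-width windows with fewer than d unique haplotypes concrete, here is a minimal greedy sketch. It is an illustration only and not necessarily how compress_haplotypes works internally: the functions count_unique_haplotypes and greedy_windows are hypothetical, the sketch ignores minwidth and overlap, and it assumes H stores SNPs in rows and haplotypes in columns.

```julia
# Hypothetical helpers, not MendelImpute's implementation.
# Count distinct haplotypes (columns) restricted to the SNP rows in `rows`.
count_unique_haplotypes(H, rows) = length(unique(eachcol(view(H, rows, :))))

function greedy_windows(H::AbstractMatrix, d::Integer)
    nsnps = size(H, 1)
    windows = UnitRange{Int}[]
    start = 1
    while start <= nsnps
        stop = start
        # Extend the window while adding the next SNP keeps it under d unique haplotypes.
        while stop < nsnps && count_unique_haplotypes(H, start:stop+1) < d
            stop += 1
        end
        push!(windows, start:stop)
        start = stop + 1
    end
    return windows
end

# Toy panel: 4 SNPs (rows) by 4 haplotypes (columns).
H = Bool[0 0 1 1; 0 1 0 1; 0 0 0 0; 1 0 1 0]
windows = greedy_windows(H, 3)
```

On this toy panel the first two SNPs together already distinguish four haplotypes, so the first window stops after one SNP, while the remaining three SNPs fit in a single window.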
MendelImpute.paint — Function
paint(sample_phase::HaplotypeMosaicPair, panelID::Vector{String},
refID_to_population::Dict{String, String}, [populations::Vector{String}])
Converts a person's phased haplotype lengths into segments of percentages. This function makes it easier to plot a "painted chromosome".
Inputs
sample_phase: A HaplotypeMosaicPair storing phase information for a sample, including haplotype start position and haplotype label.
panelID: Sample IDs in the reference haplotype panel
refID_to_population: A dictionary mapping each ID in the haplotype reference panel to its population origin.
Optional inputs
populations: A unique list of populations present in refID_to_population
Output
composition: A list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]
MendelImpute.composition — Function
composition(sample_phase::HaplotypeMosaicPair, panelID::Vector{String},
refID_to_population::Dict{String, String}, [populations::Vector{String}])
Computes a sample's chromosome composition based on phase information. This function makes it easier to plot a person's admixed proportions.
Inputs
sample_phase: A HaplotypeMosaicPair storing phase information for a sample, including haplotype start position and haplotype label.
panelID: Sample IDs in the reference haplotype panel
refID_to_population: A dictionary mapping each ID in the haplotype reference panel to its population origin.
Optional inputs
populations: A unique list of populations present in refID_to_population
Output
composition: A list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]
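The output described above can be sketched with plain vectors. This is an illustration under stated assumptions, not the package's code: the function ancestry_percentages is hypothetical, and it takes already-extracted segment lengths and population labels rather than a HaplotypeMosaicPair.

```julia
# Hypothetical helper, not part of MendelImpute: turn haplotype segment
# lengths and their population labels into percentages per population.
function ancestry_percentages(seg_lengths::Vector{<:Real},
                              seg_populations::Vector{String},
                              populations::Vector{String})
    total = sum(seg_lengths)
    # Entry i is the share (in %) of total segment length from populations[i].
    return [100 * sum(seg_lengths[seg_populations .== pop]; init=0) / total
            for pop in populations]
end

# Made-up segments: lengths in SNPs and the population each was copied from.
percentages = ancestry_percentages([30, 50, 20], ["A", "B", "A"], ["A", "B", "C"])
```

Populations that contribute no segment receive 0, so the result always has one entry per element of populations.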
MendelImpute.unique_populations — Function
unique_populations(x::Dict{String, String})
Computes the unique list of populations, preserving order. x is a Dict where each sample is a key and populations are values.
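A rough equivalent can be written with Base functions alone. This is a sketch of the behaviour described above, not the package's implementation, and the sample IDs and population labels are made up.

```julia
# Hypothetical reference-panel mapping: sample ID => population label.
refID_to_population = Dict("sample1" => "A", "sample2" => "A", "sample3" => "B")

# `unique` keeps the first occurrence of each value in iteration order.
# Note that a Dict iterates in an internal order, not insertion order.
populations = unique(values(refID_to_population))
```

Because of the Dict iteration order, the resulting list is stable for a given dictionary but should not be assumed to follow the order in which samples were added.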
MendelImpute.convert_compressed — Function
convert_compressed(t<:Real, phaseinfo::String, reffile::String)
Converts phaseinfo into a phased genotype matrix of type t using the full reference haplotype panel H.
Inputs
t: Type of matrix. If Bool, genotypes are converted to a BitMatrix
phaseinfo: Vector of HaplotypeMosaicPairs stored in .jlso format
reffile: The complete (uncompressed) haplotype reference file
Output
X1: Allele 1 of the phased genotype. Each column is a sample. X = X1 + X2.
X2: Allele 2 of the phased genotype. Each column is a sample. X = X1 + X2.
phase: The original data structure after phasing and imputation.
sampleID: The IDs of each imputed person.
H: The complete reference haplotype panel. Columns of H are haplotypes.
convert_compressed(t<:Real, phaseinfo::Vector{HaplotypeMosaicPair}, H::AbstractMatrix)
Columns of H are haplotypes.
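The relation X = X1 + X2 can be illustrated for a single sample. The sketch below assumes a mosaic is described by segment start positions and haplotype labels (column indices of H), as the phase information is described in this page. The function mosaic_to_allele is hypothetical and does not use the real HaplotypeMosaicPair type.

```julia
# Hypothetical helper, not part of MendelImpute: copy each segment of a
# mosaic from the labelled reference haplotype (a column of H).
# Assumes starts[1] == 1 so that every SNP is covered.
function mosaic_to_allele(H::AbstractMatrix, starts::Vector{Int}, labels::Vector{Int})
    nsnps = size(H, 1)
    allele = falses(nsnps)
    for (k, s) in enumerate(starts)
        # A segment runs until the next segment starts, or to the last SNP.
        stop = k < length(starts) ? starts[k + 1] - 1 : nsnps
        allele[s:stop] = H[s:stop, labels[k]]
    end
    return allele
end

# Toy reference panel: 4 SNPs (rows) by 3 haplotypes (columns).
H = Bool[1 0 1; 0 0 1; 1 1 0; 0 1 0]

x1 = mosaic_to_allele(H, [1, 3], [1, 3])  # haplotype 1 for SNPs 1-2, then haplotype 3
x2 = mosaic_to_allele(H, [1], [2])        # haplotype 2 throughout
x = x1 + x2                               # unphased genotype: allele counts 0, 1, or 2
```

Storing only the start positions and labels is what keeps the .jlso output small: the alleles themselves are recovered from H on demand.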