API
Documentation for MendelImpute.jl's functions.
Index
MendelImpute.admixture_global
MendelImpute.admixture_local
MendelImpute.compress_haplotypes
MendelImpute.convert_compressed
MendelImpute.phase
MendelImpute.thousand_genome_population_to_superpopulation
MendelImpute.thousand_genome_samples_to_population
MendelImpute.thousand_genome_samples_to_super_population
Functions
MendelImpute.phase — Function
phase(tgtfile::String, reffile::String, outfile::String; [impute::Bool],
[phase::Bool], [dosage::Bool], [rescreen::Bool], [max_haplotypes::Int],
[stepwise::Int], [thinning_factor::Int], [scale_allelefreq::Bool],
[dynamic_programming::Bool])
Main function of the MendelImpute program. Phases (haplotypes) tgtfile from a pool of haplotypes reffile using sliding windows and saves the result in outfile. All SNPs in tgtfile must be present in reffile. Per-SNP quality scores are saved in outfile, while per-sample imputation scores are saved in a file ending in sample.error.
Input
tgtfile: VCF, PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory. BGEN files should end in .bgen.
reffile: Reference haplotype file ending in .jlso (compressed binary file). See compress_haplotypes.
outfile: Output file name ending in .vcf.gz, .vcf, or .jlso. VCF output genotypes will have no missing data. If the name ends in .jlso, the output is an ultra-compressed data structure recording HaplotypeMosaicPairs for each sample.
Optional Inputs
impute: If true, imputes every SNP in reffile to tgtfile. Otherwise only missing SNPs in tgtfile will be imputed.
phase: If true, all output genotypes will be phased, but observed data (minor allele count) may be changed. If phase=false, all output genotypes will be unphased but the observed minor allele count will not change.
dosage: If true, assumes the target matrix contains dosages for imputation. Note this means the genotype matrix will be entirely single precision.
rescreen: This option is more computationally intensive but gives more accurate results. It saves a number of top haplotype pairs when solving the least squares objective, then re-minimizes least squares on just the observed data.
max_haplotypes: Maximum number of haplotypes for using global search. Windows exceeding this number of unique haplotypes will be searched using a heuristic. A non-zero stepwise or thinning_factor needs to be specified.
stepwise: If an integer is specified, solves the least squares objective by first finding stepwise top haplotypes using a stepwise heuristic, then finds the next haplotype using global search. Uses max_haplotypes.
thinning_factor: If an integer is specified, solves the least squares objective on only thinning_factor unique haplotypes. Uses max_haplotypes.
scale_allelefreq: Boolean indicating whether to give rare SNPs more weight, scaled by wᵢ = 1 / √(2p(1-p)), where the max weight is 2.
dynamic_programming: Boolean indicating whether to phase with a global search that finds the longest haplotype stretch over all windows. (Currently broken, sorry!)
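The scale_allelefreq weighting can be illustrated with a short, self-contained Julia sketch. It assumes that p is the allele frequency of SNP i and that "max weight is 2" means the weight is capped at 2. The function name snp_weight is hypothetical and not part of MendelImpute.

```julia
# Hypothetical helper (not part of MendelImpute): weight for a SNP whose
# allele frequency is p, following wᵢ = 1 / √(2p(1-p)).
# Assumption: "max weight is 2" is read as capping the weight at 2.
function snp_weight(p::Real)
    # Monomorphic SNPs (p = 0 or p = 1) would divide by zero; return the cap.
    0 < p < 1 || return 2.0
    return min(2.0, 1 / sqrt(2 * p * (1 - p)))
end

common = snp_weight(0.5)    # common SNP: 1/√0.5, below the cap
rare   = snp_weight(0.01)   # rare SNP: the raw weight exceeds 2, so it is capped
```

Under this reading, every SNP with p below roughly 0.146 (or above roughly 0.854) receives the maximum weight.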
MendelImpute.compress_haplotypes — Function
compress_haplotypes(reffile::String, tgtfile::String, outfile::String,
[d::Int], [minwidth::Int], [overlap::Float64])
Cuts a haplotype matrix reffile into windows of variable width so that each window has fewer than d unique haplotypes. Saves the result to outfile in a compressed binary format. All SNPs in tgtfile must be present in reffile. All genotypes in reffile must be phased and non-missing, and all genotype positions must be unique.
Inputs
reffile: Reference haplotype file name (ends in .vcf, .vcf.gz, or .bgen).
tgtfile: VCF, PLINK, or BGEN file. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory. BGEN files should end in .bgen.
outfile: Output file name (ends in .jlso).
Optional Inputs
d: Max number of unique haplotypes per genotype window (default d = 1000).
minwidth: Minimum number of typed SNPs per window (default 0).
overlap: How much overlap between adjacent genotype windows, in percentage of each window's width (default 0.0).
Why is tgtfile required?
The unique haplotypes in each window are computed on the typed SNPs only. A genotype matrix tgtfile is used to identify the typed SNPs. In the future, hopefully we can pre-compute compressed haplotype panels for all genotyping platforms and provide them as downloadable files. But currently, users must run this function themselves.
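To make the idea of variable-width windows concrete, here is a toy Julia sketch that greedily grows each window while the number of unique haplotypes stays below d. It is an illustration under stated assumptions only: the function names greedy_windows and n_unique_haplotypes are hypothetical, the greedy strategy is not claimed to be what compress_haplotypes does internally, and minwidth, overlap, and the restriction to typed SNPs are all ignored.

```julia
# Number of unique haplotypes (columns of H) when restricted to the SNP rows `rows`.
n_unique_haplotypes(H, rows) = length(unique(eachcol(view(H, rows, :))))

# Hypothetical greedy partition of SNPs (rows of H) into consecutive windows,
# each extended one SNP at a time while it keeps fewer than d unique haplotypes.
# Assumes d > 2, since a single-SNP window can already hold 2 unique haplotypes.
function greedy_windows(H::AbstractMatrix{Bool}, d::Int)
    nsnps = size(H, 1)
    windows = UnitRange{Int}[]
    start = 1
    while start <= nsnps
        stop = start
        while stop < nsnps && n_unique_haplotypes(H, start:(stop + 1)) < d
            stop += 1
        end
        push!(windows, start:stop)
        start = stop + 1
    end
    return windows
end

# Toy panel: 4 SNPs (rows) by 4 haplotypes (columns).
H = Bool[0 1 0 1;
         0 0 1 1;
         1 1 1 1;
         0 1 0 1]
windows = greedy_windows(H, 3)
```

With d = 3 the toy panel splits into windows of different widths, because SNPs 2 and 3 together distinguish only two haplotypes while SNPs 1 and 2 together already distinguish four.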
MendelImpute.convert_compressed — Function
convert_compressed(t<:Real, phaseinfo::String, reffile::String)
Converts phaseinfo into a phased genotype matrix of type t using the full reference haplotype panel H.
Inputs
t: Type of matrix. If Bool, genotypes are converted to a BitMatrix.
phaseinfo: Vector of HaplotypeMosaicPairs stored in .jlso format.
reffile: The complete (uncompressed) haplotype reference file.
Output
X1: Allele 1 of the phased genotype. Each column is a sample. X = X1 + X2.
X2: Allele 2 of the phased genotype. Each column is a sample. X = X1 + X2.
phase: The original data structure after phasing and imputation.
sampleID: The IDs of each imputed person.
H: The complete reference haplotype panel. Columns of H are haplotypes.
convert_compressed(t<:Real, phaseinfo::Vector{HaplotypeMosaicPair}, H::AbstractMatrix)
Columns of H are haplotypes.
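The relation X = X1 + X2 can be seen on a toy panel. The sketch below is a deliberate simplification and does not call MendelImpute: it assumes each sample copies one whole haplotype per strand, whereas a real HaplotypeMosaicPair records a mosaic of haplotype segments. The variable names strand1 and strand2 are hypothetical.

```julia
# Toy reference panel: rows are SNPs, columns are haplotypes.
H = Bool[1 0 0;
         0 1 0;
         1 1 0;
         0 0 1]

# Hypothetical simplification: one whole haplotype per strand per sample.
strand1 = [1, 3]   # haplotype column used for allele 1 of samples 1 and 2
strand2 = [2, 3]   # haplotype column used for allele 2 of samples 1 and 2

X1 = H[:, strand1]   # allele 1; each column is a sample
X2 = H[:, strand2]   # allele 2; each column is a sample
X  = X1 + X2         # genotype matrix with entries 0, 1, or 2
```

Sample 1 is heterozygous at the first two SNPs and homozygous at the third, while sample 2 carries the same haplotype on both strands, so its entries are all 0 or 2.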
MendelImpute.admixture_global — Function
admixture_global(tgtfile::String, reffile::String,
refID_to_population::Dict{String, String}, populations::Vector{String})
Computes global ancestry estimates for each sample in tgtfile using a labeled reference panel reffile.
Inputs
tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
reffile: Reference haplotype file ending in .jlso (compressed binary file). See compress_haplotypes.
refID_to_population: A dictionary mapping each sample ID in the haplotype reference panel to its population origin. For examples, see the output of thousand_genome_population_to_superpopulation and thousand_genome_samples_to_super_population.
populations: A vector of String containing the unique populations present in values(refID_to_population).
Optional Inputs
Q_outfile: Output file name for the estimated Q matrix. Default Q_outfile = "mendelimpute.ancestry.Q".
imputed_outfile: Output file name for the imputed genotypes, ending in .jlso. Default imputed_outfile = "mendelimpute.ancestry.Q.jlso".
Output
Q: A DataFrame containing estimated ancestry fractions. Each row is a sample. The matrix will be saved in mendelimpute.ancestry.Q.
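As a sketch of how the refID_to_population and populations arguments relate, the following Julia snippet builds both from a small toy dictionary. The four sample IDs are a hand-picked subset used purely for illustration; a real analysis would use a dictionary covering every sample in the reference panel.

```julia
# Toy subset of a sample-ID-to-population dictionary (illustration only).
refID_to_population = Dict(
    "HG00096" => "GBR",
    "NA12878" => "CEU",
    "NA19238" => "YRI",
    "NA19239" => "YRI",
)

# Unique population labels present in the dictionary values,
# sorted here only so that the order is reproducible.
populations = sort(unique(values(refID_to_population)))
```

The resulting values have the types Dict{String, String} and Vector{String} that the signature above asks for.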
MendelImpute.admixture_local — Function
admixture_local(tgtfile::String, reffile::String,
refID_to_population::Dict{String, String}, populations::Vector{String},
population_colors::Vector{RGB{FixedPointNumbers.N0f8}})
Computes local ancestry estimates for each sample in tgtfile using a labeled reference panel reffile.
Inputs
tgtfile: VCF or PLINK files. VCF files should end in .vcf or .vcf.gz. PLINK files should exclude the .bim/.bed/.fam suffixes, but all three files must be present in the same directory.
reffile: Reference haplotype file ending in .jlso (compressed binary file). See compress_haplotypes.
refID_to_population: A dictionary mapping each sample ID in the haplotype reference panel to its population origin. For examples, see the output of thousand_genome_population_to_superpopulation and thousand_genome_samples_to_super_population.
populations: A vector of String containing the unique populations present in values(refID_to_population).
population_colors: A vector of colors for each population. typeof(population_colors) should be Vector{RGB{FixedPointNumbers.N0f8}}.
Output
Q: Matrix containing estimated ancestry fractions. Each row is a haplotype. Sample 1's haplotypes are in rows 1 and 2, sample 2's are in rows 3 and 4, etc.
pop_colors: Matrix with the same dimensions as Q storing colors.
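Because consecutive pairs of rows in Q belong to the same sample, per-haplotype rows can be collapsed into per-sample rows. The Julia sketch below does this on a made-up Q by averaging each pair. The averaging step is an illustrative assumption, not something admixture_local is documented to do, and the name Q_sample is hypothetical.

```julia
# Toy Q: 2 samples with 2 haplotypes each gives 4 rows; columns are 2 populations.
Q = [1.0 0.0;
     0.5 0.5;
     0.0 1.0;
     0.0 1.0]

nsamples = size(Q, 1) ÷ 2

# Average rows 2i-1 and 2i (the two haplotypes of sample i) in every column.
Q_sample = [(Q[2i - 1, j] + Q[2i, j]) / 2 for i in 1:nsamples, j in 1:size(Q, 2)]
```

Each row of Q_sample then describes one sample rather than one haplotype.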
MendelImpute.thousand_genome_samples_to_population — Function
thousand_genome_samples_to_population()
Creates a dictionary mapping sample IDs of the 1000 Genomes Project to 26 population codes.
Population codes and super-population codes are described here: https://www.internationalgenome.org/category/population/
MendelImpute.thousand_genome_samples_to_super_population — Function
thousand_genome_samples_to_super_population()
Creates a dictionary mapping sample IDs of the 1000 Genomes Project to 5 super-population codes.
Population codes and super-population codes are described here: https://www.internationalgenome.org/category/population/
MendelImpute.thousand_genome_population_to_superpopulation — Function
thousand_genome_population_to_superpopulation()
Creates a dictionary mapping population codes of the 1000 Genomes Project to their super-population codes.
Population codes and super-population codes are described here: https://www.internationalgenome.org/category/population/
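The three dictionaries are related by composition: a sample maps to a population, and a population maps to a super population. The Julia sketch below shows that composition on toy stand-ins rather than on the dictionaries the functions return. The variable names are hypothetical, and whether the composed result matches thousand_genome_samples_to_super_population() entry for entry is an assumption, not something verified here.

```julia
# Toy stand-ins for the sample-to-population and population-to-super-population
# dictionaries (two entries each, for illustration only).
sample_to_pop = Dict("HG00096" => "GBR", "NA19238" => "YRI")
pop_to_super  = Dict("GBR" => "EUR", "YRI" => "AFR")

# Compose the two mappings: sample ID -> population -> super population.
sample_to_super = Dict(id => pop_to_super[pop] for (id, pop) in sample_to_pop)
```

A dictionary built this way can be passed wherever a Dict{String, String} from sample IDs to labels is expected, such as the refID_to_population argument.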