API
Documentation for VCFTools.jl
's functions.
Index
VCFTools.aim_select
VCFTools.conformgt_by_id
VCFTools.conformgt_by_pos
VCFTools.convert_ds
VCFTools.convert_gt
VCFTools.convert_ht
VCFTools.copy_ds!
VCFTools.copy_gt!
VCFTools.copy_ht!
VCFTools.filter_genotype
VCFTools.filter_header
VCFTools.gtstats
VCFTools.mask_gt
VCFTools.nrecords
VCFTools.nsamples
VCFTools.openvcf
VCFTools.sampleID
VCFTools.write_vcf
Functions
VCFTools.openvcf
— Functionopenvcf(vcffile, [mode = "r"])
Open VCF file (.vcf
or .vcf.gz
) and return an IO stream.
VCFTools.nrecords
— Functionnrecords(vcffile)
Number of records (markers) in a VCF file. Each record is a row.
VCFTools.nsamples
— Functionnsamples(vcffile)
Number of samples (individuals) in a VCF file.
VCFTools.sampleID
— FunctionsampleID(vcffile::AbstractString)
Helper function for extracting sample IDs from a VCF file.
VCFTools.gtstats
— Functiongtstats(vcffile, [out=devnull])
Calculate genotype statistics for each marker with GT field in a VCF file.
Input
vcffile
: VCF file, ending with.vcf
or.vcf.gz
out
: output file name or IOStream. Default isout=devnull
(no output).
One line with 15 tab-delimited fiels is written per marker to out
: - 1-8) VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILT, INFO) - 9) Missing genotype count - 10) Missing genotype frequency - 11) ALT allele count - 12) ALT allele frequency - 13) Minor allele count (REF allele vs ALT alleles) - 14) Minor allele frequency (REF allele vs ALT alleles) - 15) HWE P-value (REF allele vs ALT alleles)
pval_method
: Method for calculating hardy weinburg equilibrium p-values. Can be:Pearson
(default) or:Fisher
.
Output
records
: number of records in the input VCF filesamples
: number of individuals in the input VCF filelines
: number of lines written toout
; equivalently number of markers with GT fieldmissings_by_sample
: number of missing genotypes in each samplemissings_by_record
: number of missing genotypes in each record (marker)maf_by_record
: minor allele frequency in each record (marker)minorallele_by_record
: a Boolean vector indicating the minor allele in each record (marker).minorallele_by_record[i]=true
means the minor allele is the REF allele for markeri
;minorallele_by_record[i]=false
means the minor allele is the ALT allele for markeri
hwe_by_record
: hardy weinburg equilibrium p-values in each record calculated usingpval_method
gtstats(record, [missings_by_sample=nothing])
Calculate genotype statistics for a VCF record with GT field.
Input
record
: a VCF recordmissings_by_sample
: accumulator of missings by sample,missings_by_sample[i]
is incremented by 1 ifi
-th individual has missing genotype in this record
Output
n00
: number of homozygote REF/REF or REF|REFn01
: number of heterozygote REF/ALT or REF|ALTn11
: number of homozygote ALT/ALT or ALT|ALTn0
: number of REF allelesn1
: number of ALT allelesaltfreq
: proportion of ALT allelesreffreq
: proportion of REF allelesmissings
: number of missing genotypesminorallele
: minor allele:false
(ALT is minor allele) ortrue
(REF is minor allele)maf
: minor allele frequencyhwepval
: Hardy-Weinberg p-value
Missing docstring for filter
. Check Documenter's build log for details.
Missing docstring for filter_record!
. Check Documenter's build log for details.
VCFTools.filter_header
— Functionfilter_header(reader, sample_mask)
Filters the header info of a VCF reader. Will not keep sample IDs for entry i
if sample_mask[i] = false
.
Inpute:
reader
: a VCF reader object.sample_mask
: a BitVector.sample_mask[i] = true
means keep samplei
.
TODO: Create VCFTool.jl signature in meta info. If filedate/signature exist already, delete them before writing new ones
VCFTools.filter_genotype
— Functionfilter_genotype(record, [genokey=["GT"]])
Filter a VCF record according to genokey
and output a VCF record with genotype formats only in genokey
.
filter_genotype(vcffile, outfile, [genokey=["GT"]])
Filter a VCF file according to genokey
(default ["GT"]
). Output a VCF file with genotype formats only in genokey
. Record (markers) with no fields in genokey
are skipped.
VCFTools.convert_gt
— FunctionConvert a two-bit genotype (of ALT alleles) to a real number of type t
according to specified SNP model.
convert_gt(t, vcffile; [impute=false], [center=false], [scale=false])
Convert the GT data from a VCF file to a matrix of type t
, allowing for missing values. Records with missing genotypes are converted to missing
.
Input
t
: a typet <: Real
vcffile
: VCF file path
Optional argument
model
: genetic model:additive
(default),:dominant
, or:recessive
impute
: impute missing genotype or not, defaultfalse
center
: center gentoype by 2maf or not, defaultfalse
scale
: scale genotype by 1/√2maf(1-maf) or not, defaultfalse
trans
: whether to import data transposed so that each column is 1 genotype, defaultfalse
msg
: A message that will be printed to indicate progress. Defaults to not printing.save_snp_info
: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors
Output
out
: The genotype matrix stored in column or rows depending ontrans
.sampleID
: The sample ID for each personrecord_chr
: The chromosome number for each recordrecord_pos
: The position for each recordrecord_ids
: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref
: The ref allele for each recordrecord_alt
: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.convert_ht
— Functionconvert_ht(t, vcffile)
Converts the GT data from a VCF file to a haplotype matrix of type t
. One VCF record will become 2 column/row in the resulting matrix. Records cannot have missing genotypes and the target file must be phased.
Input
t
: a typet <: Real
. Ift
isBool
, output will be a BitMatrix.vcffile
: a phased VCF filetrans
: whether to import data transposed so that each column is a haplotype, defaultfalse
.msg
: A message that will be printed to indicate import progress. Defaults to not printing.save_snp_info
: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors
Output
out
: The haplotype matrix stored in column or rows depending ontrans
.sampleID
: The sample ID for each personrecord_chr
: The chromosome number for each recordrecord_pos
: The position for each recordrecord_ids
: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref
: The ref allele for each recordrecord_alt
: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.convert_ds
— Functionconvert_ds(t, vcffile; [key = "DS"], ...)
Converts dosage data from a VCF file to a numeric matrix of type t
. Here key
specifies the FORMAT field of the VCF file that encodes the dosage (default = "DS").
Arguments
t
: type of output matrix.vcffile
: VCF file path
Optional argument
key
: The FIELD name of the VCF file that encodes dosages, defaults to"DS"
model
: genetic model:additive
(default),:dominant
, or:recessive
impute
: impute missing genotype with 2 times the ALT allele frequency or not. Defaultfalse
center
: center gentoype by 2maf or not, defaultfalse
scale
: scale genotype by 1/√2maf(1-maf) or not, defaultfalse
trans
: whether to import data transposed so that each column is 1 genotype, defaultfalse
msg
: A message that will be printed to indicate progress. Defaults to not printing.
Output
out
: Dosage matrixsampleID
: The sample ID for each personrecord_chr
: The chromosome number for each recordrecord_pos
: The position for each recordrecord_ids
: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref
: The ref allele for each recordrecord_alt
: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.copy_gt!
— Functioncopy_gt!(A, reader; ...)
Fill the rows of matrix A
by the GT data from VCF records in reader
. Records with missing genotypes are converted to missing
.
Input
A
: a matrix or vector such thateltype(A) <: Union{Missing, Real}
. Each row is a person's genotype.reader
: a VCF reader
Optional argument
model
: genetic model:additive
(default),:dominant
, or:recessive
impute
: impute missing genotype or not, defaultfalse
center
: center gentoype by 2maf or not, defaultfalse
scale
: scale genotype by 1/√2maf(1-maf) or not, defaultfalse
msg
: A message that will be printed to indicate progress. Defaults to not printing.sampleID
: The sample ID for each personrecord_chr
: The chromosome number for each recordrecord_pos
: The position for each recordrecord_ids
: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref
: The ref allele for each recordrecord_alt
: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
Output
A
:ismissing(A[i, j]) == true
indicates missing genotype. Ifimpute=true
,ismissing(A[i, j]) == false
for all entries.
VCFTools.copy_ht!
— Functioncopy_ht!(A, reader)
Fill 2 columns of A
by the GT data from VCF records in reader
, each record filling 2 columns. Missing data is not allowed and genotypes must be phased.
Input
A
: matrix where rows are haplotypes. Personi
's haplotype are filled in rows2i - 1
and2i
.reader
: a VCF reader
Optional inputs
msg
: A message that will be printed to indicate progress. Defaults to not printing.sampleID
: The sample ID for each personrecord_chr
: The chromosome number for each recordrecord_pos
: The position for each recordrecord_ids
: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref
: The ref allele for each recordrecord_alt
: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.copy_ds!
— Functioncopy_ds!(A, reader; [key="DS"], ...)
Fill the columns of matrix A
by the dosage data from VCF records in reader
. Records with missing fields (i.e. key
field is .
) are converted to missing
.
Input
A
: matrix where rows are dosages.reader
: a VCF reader
Optional argument
key
: The field key for accessing dosage values, default"DS"
model
: genetic model:additive
(default),:dominant
, or:recessive
impute
: impute missing genotype or not, defaultfalse
center
: center gentoype by 2maf or not, defaultfalse
scale
: scale genotype by 1/√2maf(1-maf) or not, defaultfalse
msg
: A message that will be printed to indicate progress. Defaults to not printing.sampleID
: The sample ID for each personrecord_chr
: The chromosome number for each recordrecord_pos
: The position for each recordrecord_ids
: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref
: The ref allele for each recordrecord_alt
: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.conformgt_by_id
— Functionconformgt_by_id(reffile, tgtfile, outfile, chrom, posrange, checkfreq)
Match the VCF records in tgtfile
to those in reffile
according to ID. The function will:
- Find corresponding VCF records in the target and reference files
- Exclude target VCF records whose ID cannot be matched to any reference VCF record
- Exclude target VCF records whose test of equal allele frequency is rejected at significance level
checkfreq
- Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
- The matched VCF records are written into files
outfile.tgt.vcf.gz
and
outfile.ref.vcf.gz
, both with only "GT" data
Input
reffile
: VCF file with reference genotype (GT) datatgtfile
: VCF file with target genotype (GT) dataoutfile
: the prefix for output filenameschrom
: chromosome name, must be identical in target and reference filesposrange
: position range in the reference filecheckfreq
: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is≤ checkfreq
, the records are not output. Settingcheckfreq=0
orcheckfreq=false
(default) implies not checking allele frequencies. Settingcheckfreq=1
effectively rejects all tests and no matched records are output
Output
lines
: number of matched VCF records
VCFTools.conformgt_by_pos
— Functionconformgt_by_pos(reffile, tgtfile, outfile, chrom, posrange, checkfreq)
Match the VCF GT records in tgtfile
to those in reffile
according to chromosome position. The function will:
- Find corresponding VCF records in the target and reference files
- Exclude target VCF records whose position cannot be matched to any reference VCF record
- Exclude target VCF records whose test of equal allele frequency is rejected at significance level
checkfreq
- Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
- The matched VCF records are written into files
outfile.tgt.vcf.gz
and
outfile.ref.vcf.gz
, both with only "GT" data
Input
reffile
: VCF file with reference genotype (GT) datatgtfile
: VCF file with target genotype (GT) dataoutfile
: the prefix for output filenameschrom
: chromosome name, must be identical in target and reference filesposrange
: position range in the reference filecheckfreq
: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is≤ checkfreq
, the records are not output. Settingcheckfreq=0
orcheckfreq=false
(default) implies not checking allele frequencies. Settingcheckfreq=1
effectively rejects all tests and no matched records are output
Output
lines
: number of matched VCF records
VCFTools.mask_gt
— Functionmask_gt(src, masks; [des = "masked." * src], [separator = '/'])
Creates a new VCF file des
where genotype entry (i, j) of src
is missing if masks[i, j]
is true. src
is unchanged.
Arguments
src
: Input VCF file name.masks
: Bit matrix.masks[i, j] = true
means mask entry (i, j).des
: output VCF file name.unphase
: Iftrue
, all heterozygous genotypes will be1/0
and homozygous separator will be/
. Iffalse
, all unmasked genotypes will be retained.
VCFTools.write_vcf
— Functionwrite_vcf(outfile, x1, x2, [chr], [pos], [ref], [alt], [sampleID], [SNPids])
Writes haplotypes in x1
and x2
into VCF file, tracking phase information.
Inputs
outfile
: Output file name (ends in.vcf
or.vcf.gz
)x1
:n × p
matrix of the 1st haplotype for each sample. Each row is a haplotype, and each entry is 0 or 1x2
:n × p
matrix of the 2nd haplotype for each sample.x = x1 + x2
. Each entry is 0 or 1
Optional Inputs
chr
: Vector of String.chr[i]
is chromosome number of SNPi
.pos
: Vector of Int.pos[i]
is position of SNPi
.ref
: Vector of String.ref[i]
is reference allele of SNPi
.alt
: Vector of String.alt[i]
is alternate allele of SNPi
.sampleID
: Vector of String.sampleID[i]
is reference allele of SNPi
.SNPids
: Vector of String.SNPids[i]
is reference allele of SNPi
.
Reference
For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
write_vcf(outfile, x, [chr], [pos], [ref], [alt], [sampleID], [SNPids])
Writes genotypes in x
into VCF file, not tracking phase information.
Inputs
outfile
: Output file name (ends in.vcf
or.vcf.gz
)x
:n × p
genotype matrix. Each row is a sample, and each entry is 0, 1, or 2.
Optional Inputs
chr
: Vector of String.chr[i]
is chromosome number of SNPi
.pos
: Vector of Int.pos[i]
is position of SNPi
.ref
: Vector of String.ref[i]
is reference allele of SNPi
.alt
: Vector of String.alt[i]
is alternate allele of SNPi
.sampleID
: Vector of String.sampleID[i]
is reference allele of SNPi
.SNPids
: Vector of String.SNPids[i]
is reference allele of SNPi
.
Reference
For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
VCFTools.aim_select
— Functionaim_select(vcffile::AbstractString, sampleID_to_population::Dict{String, String},
[maf_threshold::Float64=0.01])
Ranks SNPs by their ancestry information content. All people should be assigned an ancestry origin and be fully typed. P-values are computed via a likelihood ratio heterogeneity test. Sex chromosome is ignored (for now).
Inputs
vcffile
: VCF file, ending with.vcf
or.vcf.gz
sampleID_to_population
: A dictionary mapping every sample ID to a population origin.
Optional Inputs
maf_threshold
: Minimum minor allele frequency for SNPs considered (default 0.01)
Note:
Method is adapted from MendelAimSelection.jl https://github.com/OpenMendel/MendelAimSelection.jl