API
Documentation for VCFTools.jl's functions.
Index
VCFTools.aim_selectVCFTools.conformgt_by_idVCFTools.conformgt_by_posVCFTools.convert_dsVCFTools.convert_gtVCFTools.convert_htVCFTools.copy_ds!VCFTools.copy_gt!VCFTools.copy_ht!VCFTools.filter_genotypeVCFTools.filter_headerVCFTools.gtstatsVCFTools.mask_gtVCFTools.nrecordsVCFTools.nsamplesVCFTools.openvcfVCFTools.sampleIDVCFTools.write_vcf
Functions
VCFTools.openvcf — Functionopenvcf(vcffile, [mode = "r"])Open VCF file (.vcf or .vcf.gz) and return an IO stream.
VCFTools.nrecords — Functionnrecords(vcffile)Number of records (markers) in a VCF file. Each record is a row.
VCFTools.nsamples — Functionnsamples(vcffile)Number of samples (individuals) in a VCF file.
VCFTools.sampleID — FunctionsampleID(vcffile::AbstractString)Helper function for extracting sample IDs from a VCF file.
VCFTools.gtstats — Functiongtstats(vcffile, [out=devnull])Calculate genotype statistics for each marker with GT field in a VCF file.
Input
vcffile: VCF file, ending with.vcfor.vcf.gzout: output file name or IOStream. Default isout=devnull(no output).
One line with 15 tab-delimited fiels is written per marker to out:     - 1-8)  VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILT, INFO)     -   9)  Missing genotype count     -  10)  Missing genotype frequency     -  11)  ALT allele count     -  12)  ALT allele frequency     -  13)  Minor allele count             (REF allele vs ALT alleles)     -  14)  Minor allele frequency         (REF allele vs ALT alleles)     -  15)  HWE P-value                    (REF allele vs ALT alleles)
pval_method: Method for calculating hardy weinburg equilibrium p-values. Can be:Pearson(default) or:Fisher.
Output
records: number of records in the input VCF filesamples: number of individuals in the input VCF filelines: number of lines written toout; equivalently number of markers with GT fieldmissings_by_sample: number of missing genotypes in each samplemissings_by_record: number of missing genotypes in each record (marker)maf_by_record: minor allele frequency in each record (marker)minorallele_by_record: a Boolean vector indicating the minor allele in each record (marker).minorallele_by_record[i]=truemeans the minor allele is the REF allele for markeri;minorallele_by_record[i]=falsemeans the minor allele is the ALT allele for markerihwe_by_record: hardy weinburg equilibrium p-values in each record calculated usingpval_method
gtstats(record, [missings_by_sample=nothing])Calculate genotype statistics for a VCF record with GT field.
Input
record: a VCF recordmissings_by_sample: accumulator of missings by sample,missings_by_sample[i]is incremented by 1 ifi-th individual has missing genotype in this record
Output
n00: number of homozygote REF/REF or REF|REFn01: number of heterozygote REF/ALT or REF|ALTn11: number of homozygote ALT/ALT or ALT|ALTn0: number of REF allelesn1: number of ALT allelesaltfreq: proportion of ALT allelesreffreq: proportion of REF allelesmissings: number of missing genotypesminorallele: minor allele:false(ALT is minor allele) ortrue(REF is minor allele)maf: minor allele frequencyhwepval: Hardy-Weinberg p-value
Missing docstring for filter. Check Documenter's build log for details.
Missing docstring for filter_record!. Check Documenter's build log for details.
VCFTools.filter_header — Functionfilter_header(reader, sample_mask)Filters the header info of a VCF reader. Will not keep sample IDs for entry i if sample_mask[i] = false.
Inpute:
reader: a VCF reader object.sample_mask: a BitVector.sample_mask[i] = truemeans keep samplei.
TODO: Create VCFTool.jl signature in meta info. If filedate/signature exist already, delete them before writing new ones
VCFTools.filter_genotype — Functionfilter_genotype(record, [genokey=["GT"]])Filter a VCF record according to genokey and output a VCF record with genotype formats only in genokey.
filter_genotype(vcffile, outfile, [genokey=["GT"]])Filter a VCF file according to genokey (default ["GT"]). Output a VCF file with genotype formats only in genokey. Record (markers) with no fields in genokey are skipped.
VCFTools.convert_gt — FunctionConvert a two-bit genotype (of ALT alleles) to a real number  of type t according to specified SNP model. 
convert_gt(t, vcffile; [impute=false], [center=false], [scale=false])Convert the GT data from a VCF file to a matrix of type t, allowing for  missing values. Records with missing genotypes are converted to missing.
Input
t: a typet <: Realvcffile: VCF file path
Optional argument
model: genetic model:additive(default),:dominant, or:recessiveimpute: impute missing genotype or not, defaultfalsecenter: center gentoype by 2maf or not, defaultfalsescale: scale genotype by 1/√2maf(1-maf) or not, defaultfalsetrans: whether to import data transposed so that each column is 1 genotype, defaultfalsemsg: A message that will be printed to indicate progress. Defaults to not printing.save_snp_info: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors
Output
out: The genotype matrix stored in column or rows depending ontrans.sampleID: The sample ID for each personrecord_chr: The chromosome number for each recordrecord_pos: The position for each recordrecord_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref: The ref allele for each recordrecord_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.convert_ht — Functionconvert_ht(t, vcffile)Converts the GT data from a VCF file to a haplotype matrix of type t. One  VCF record will become 2 column/row in the resulting matrix. Records cannot have missing genotypes and the target file must be phased. 
Input
t: a typet <: Real. IftisBool, output will be a BitMatrix.vcffile: a phased VCF filetrans: whether to import data transposed so that each column is a haplotype, defaultfalse.msg: A message that will be printed to indicate import progress. Defaults to not printing.save_snp_info: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors
Output
out: The haplotype matrix stored in column or rows depending ontrans.sampleID: The sample ID for each personrecord_chr: The chromosome number for each recordrecord_pos: The position for each recordrecord_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref: The ref allele for each recordrecord_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.convert_ds — Functionconvert_ds(t, vcffile; [key = "DS"], ...)Converts dosage data from a VCF file to a numeric matrix of type t. Here key specifies the FORMAT field of the VCF file that encodes the dosage (default = "DS"). 
Arguments
t: type of output matrix.vcffile: VCF file path
Optional argument
key: The FIELD name of the VCF file that encodes dosages, defaults to"DS"model: genetic model:additive(default),:dominant, or:recessiveimpute: impute missing genotype with 2 times the ALT allele frequency or not. Defaultfalsecenter: center gentoype by 2maf or not, defaultfalsescale: scale genotype by 1/√2maf(1-maf) or not, defaultfalsetrans: whether to import data transposed so that each column is 1 genotype, defaultfalsemsg: A message that will be printed to indicate progress. Defaults to not printing.
Output
out: Dosage matrixsampleID: The sample ID for each personrecord_chr: The chromosome number for each recordrecord_pos: The position for each recordrecord_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref: The ref allele for each recordrecord_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.copy_gt! — Functioncopy_gt!(A, reader; ...)Fill the rows of matrix A by the GT data from VCF records in reader. Records with missing genotypes are converted to missing.
Input
A: a matrix or vector such thateltype(A) <: Union{Missing, Real}. Each row is a person's genotype.reader: a VCF reader
Optional argument
model: genetic model:additive(default),:dominant, or:recessiveimpute: impute missing genotype or not, defaultfalsecenter: center gentoype by 2maf or not, defaultfalsescale: scale genotype by 1/√2maf(1-maf) or not, defaultfalsemsg: A message that will be printed to indicate progress. Defaults to not printing.sampleID: The sample ID for each personrecord_chr: The chromosome number for each recordrecord_pos: The position for each recordrecord_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref: The ref allele for each recordrecord_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
Output
A:ismissing(A[i, j]) == trueindicates missing genotype. Ifimpute=true,ismissing(A[i, j]) == falsefor all entries.
VCFTools.copy_ht! — Functioncopy_ht!(A, reader)Fill 2 columns of A by the GT data from VCF records in reader, each record filling 2 columns. Missing data is not allowed and genotypes must be phased. 
Input
A: matrix where rows are haplotypes. Personi's haplotype are filled in rows2i - 1and2i.reader: a VCF reader
Optional inputs
msg: A message that will be printed to indicate progress. Defaults to not printing.sampleID: The sample ID for each personrecord_chr: The chromosome number for each recordrecord_pos: The position for each recordrecord_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref: The ref allele for each recordrecord_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.copy_ds! — Functioncopy_ds!(A, reader; [key="DS"], ...)Fill the columns of matrix A by the dosage data from VCF records in reader. Records with missing fields (i.e. key field is .) are converted to missing.
Input
A: matrix where rows are dosages.reader: a VCF reader
Optional argument
key: The field key for accessing dosage values, default"DS"model: genetic model:additive(default),:dominant, or:recessiveimpute: impute missing genotype or not, defaultfalsecenter: center gentoype by 2maf or not, defaultfalsescale: scale genotype by 1/√2maf(1-maf) or not, defaultfalsemsg: A message that will be printed to indicate progress. Defaults to not printing.sampleID: The sample ID for each personrecord_chr: The chromosome number for each recordrecord_pos: The position for each recordrecord_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.record_ref: The ref allele for each recordrecord_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.conformgt_by_id — Functionconformgt_by_id(reffile, tgtfile, outfile, chrom, posrange, checkfreq)Match the VCF records in tgtfile to those in reffile according to ID. The function will:
- Find corresponding VCF records in the target and reference files
 - Exclude target VCF records whose ID cannot be matched to any reference VCF record
 - Exclude target VCF records whose test of equal allele frequency is rejected at significance level 
checkfreq - Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
 - The matched VCF records are written into files 
outfile.tgt.vcf.gzand 
outfile.ref.vcf.gz, both with only "GT" data
Input
reffile: VCF file with reference genotype (GT) datatgtfile: VCF file with target genotype (GT) dataoutfile: the prefix for output filenameschrom: chromosome name, must be identical in target and reference filesposrange: position range in the reference filecheckfreq: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is≤ checkfreq, the records are not output. Settingcheckfreq=0orcheckfreq=false(default) implies not checking allele frequencies. Settingcheckfreq=1effectively rejects all tests and no matched records are output
Output
lines: number of matched VCF records
VCFTools.conformgt_by_pos — Functionconformgt_by_pos(reffile, tgtfile, outfile, chrom, posrange, checkfreq)Match the VCF GT records in tgtfile to those in reffile according to chromosome position. The function will:
- Find corresponding VCF records in the target and reference files
 - Exclude target VCF records whose position cannot be matched to any reference VCF record
 - Exclude target VCF records whose test of equal allele frequency is rejected at significance level 
checkfreq - Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
 - The matched VCF records are written into files 
outfile.tgt.vcf.gzand 
outfile.ref.vcf.gz, both with only "GT" data
Input
reffile: VCF file with reference genotype (GT) datatgtfile: VCF file with target genotype (GT) dataoutfile: the prefix for output filenameschrom: chromosome name, must be identical in target and reference filesposrange: position range in the reference filecheckfreq: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is≤ checkfreq, the records are not output. Settingcheckfreq=0orcheckfreq=false(default) implies not checking allele frequencies. Settingcheckfreq=1effectively rejects all tests and no matched records are output
Output
lines: number of matched VCF records
VCFTools.mask_gt — Functionmask_gt(src, masks; [des = "masked." * src], [separator = '/'])Creates a new VCF file des where genotype entry (i, j) of src  is missing if masks[i, j] is true. src is unchanged.
Arguments
src: Input VCF file name.masks: Bit matrix.masks[i, j] = truemeans mask entry (i, j).des: output VCF file name.unphase: Iftrue, all heterozygous genotypes will be1/0and homozygous separator will be/. Iffalse, all unmasked genotypes will be retained.
VCFTools.write_vcf — Functionwrite_vcf(outfile, x1, x2, [chr], [pos], [ref], [alt], [sampleID], [SNPids])Writes haplotypes in x1 and x2 into VCF file, tracking phase information. 
Inputs
outfile: Output file name (ends in.vcfor.vcf.gz)x1:n × pmatrix of the 1st haplotype for each sample. Each row is a haplotype, and each entry is 0 or 1x2:n × pmatrix of the 2nd haplotype for each sample.x = x1 + x2. Each entry is 0 or 1
Optional Inputs
chr: Vector of String.chr[i]is chromosome number of SNPi.pos: Vector of Int.pos[i]is position of SNPi.ref: Vector of String.ref[i]is reference allele of SNPi.alt: Vector of String.alt[i]is alternate allele of SNPi.sampleID: Vector of String.sampleID[i]is reference allele of SNPi.SNPids: Vector of String.SNPids[i]is reference allele of SNPi.
Reference
For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
write_vcf(outfile, x, [chr], [pos], [ref], [alt], [sampleID], [SNPids])Writes genotypes in x into VCF file, not tracking phase information. 
Inputs
outfile: Output file name (ends in.vcfor.vcf.gz)x:n × pgenotype matrix. Each row is a sample, and each entry is 0, 1, or 2.
Optional Inputs
chr: Vector of String.chr[i]is chromosome number of SNPi.pos: Vector of Int.pos[i]is position of SNPi.ref: Vector of String.ref[i]is reference allele of SNPi.alt: Vector of String.alt[i]is alternate allele of SNPi.sampleID: Vector of String.sampleID[i]is reference allele of SNPi.SNPids: Vector of String.SNPids[i]is reference allele of SNPi.
Reference
For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
VCFTools.aim_select — Functionaim_select(vcffile::AbstractString, sampleID_to_population::Dict{String, String}, 
    [maf_threshold::Float64=0.01])Ranks SNPs by their ancestry information content. All people should be assigned an ancestry origin and be fully typed. P-values are computed via a likelihood ratio heterogeneity test. Sex chromosome is ignored (for now).
Inputs
vcffile: VCF file, ending with.vcfor.vcf.gzsampleID_to_population: A dictionary mapping every sample ID to a population origin.
Optional Inputs
maf_threshold: Minimum minor allele frequency for SNPs considered (default 0.01)
Note:
Method is adapted from MendelAimSelection.jl https://github.com/OpenMendel/MendelAimSelection.jl