API
Documentation for VCFTools.jl's functions.
Index
- VCFTools.aim_select
- VCFTools.conformgt_by_id
- VCFTools.conformgt_by_pos
- VCFTools.convert_ds
- VCFTools.convert_gt
- VCFTools.convert_ht
- VCFTools.copy_ds!
- VCFTools.copy_gt!
- VCFTools.copy_ht!
- VCFTools.filter_genotype
- VCFTools.filter_header
- VCFTools.gtstats
- VCFTools.mask_gt
- VCFTools.nrecords
- VCFTools.nsamples
- VCFTools.openvcf
- VCFTools.sampleID
- VCFTools.write_vcf
Functions
VCFTools.openvcf — Functionopenvcf(vcffile, [mode = "r"])Open VCF file (.vcf or .vcf.gz) and return an IO stream.
VCFTools.nrecords — Functionnrecords(vcffile)Number of records (markers) in a VCF file. Each record is a row.
VCFTools.nsamples — Functionnsamples(vcffile)Number of samples (individuals) in a VCF file.
VCFTools.sampleID — FunctionsampleID(vcffile::AbstractString)Helper function for extracting sample IDs from a VCF file.
VCFTools.gtstats — Functiongtstats(vcffile, [out=devnull])Calculate genotype statistics for each marker with GT field in a VCF file.
Input
- vcffile: VCF file, ending with- .vcfor- .vcf.gz
- out: output file name or IOStream. Default is- out=devnull(no output).
One line with 15 tab-delimited fiels is written per marker to out:     - 1-8)  VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILT, INFO)     -   9)  Missing genotype count     -  10)  Missing genotype frequency     -  11)  ALT allele count     -  12)  ALT allele frequency     -  13)  Minor allele count             (REF allele vs ALT alleles)     -  14)  Minor allele frequency         (REF allele vs ALT alleles)     -  15)  HWE P-value                    (REF allele vs ALT alleles)
- pval_method: Method for calculating hardy weinburg equilibrium p-values. Can be- :Pearson(default) or- :Fisher.
Output
- records: number of records in the input VCF file
- samples: number of individuals in the input VCF file
- lines: number of lines written to- out; equivalently number of markers with GT field
- missings_by_sample: number of missing genotypes in each sample
- missings_by_record: number of missing genotypes in each record (marker)
- maf_by_record: minor allele frequency in each record (marker)
- minorallele_by_record: a Boolean vector indicating the minor allele in each record (marker).- minorallele_by_record[i]=truemeans the minor allele is the REF allele for marker- i;- minorallele_by_record[i]=falsemeans the minor allele is the ALT allele for marker- i
- hwe_by_record: hardy weinburg equilibrium p-values in each record calculated using- pval_method
gtstats(record, [missings_by_sample=nothing])Calculate genotype statistics for a VCF record with GT field.
Input
- record: a VCF record
- missings_by_sample: accumulator of missings by sample,- missings_by_sample[i]is incremented by 1 if- i-th individual has missing genotype in this record
Output
- n00: number of homozygote REF/REF or REF|REF
- n01: number of heterozygote REF/ALT or REF|ALT
- n11: number of homozygote ALT/ALT or ALT|ALT
- n0: number of REF alleles
- n1: number of ALT alleles
- altfreq: proportion of ALT alleles
- reffreq: proportion of REF alleles
- missings: number of missing genotypes
- minorallele: minor allele:- false(ALT is minor allele) or- true(REF is minor allele)
- maf: minor allele frequency
- hwepval: Hardy-Weinberg p-value
Missing docstring for filter. Check Documenter's build log for details.
Missing docstring for filter_record!. Check Documenter's build log for details.
VCFTools.filter_header — Functionfilter_header(reader, sample_mask)Filters the header info of a VCF reader. Will not keep sample IDs for entry i if sample_mask[i] = false.
Inpute:
- reader: a VCF reader object.
- sample_mask: a BitVector.- sample_mask[i] = truemeans keep sample- i.
TODO: Create VCFTool.jl signature in meta info. If filedate/signature exist already, delete them before writing new ones
VCFTools.filter_genotype — Functionfilter_genotype(record, [genokey=["GT"]])Filter a VCF record according to genokey and output a VCF record with genotype formats only in genokey.
filter_genotype(vcffile, outfile, [genokey=["GT"]])Filter a VCF file according to genokey (default ["GT"]). Output a VCF file with genotype formats only in genokey. Record (markers) with no fields in genokey are skipped.
VCFTools.convert_gt — FunctionConvert a two-bit genotype (of ALT alleles) to a real number  of type t according to specified SNP model. 
convert_gt(t, vcffile; [impute=false], [center=false], [scale=false])Convert the GT data from a VCF file to a matrix of type t, allowing for  missing values. Records with missing genotypes are converted to missing.
Input
- t: a type- t <: Real
- vcffile: VCF file path
Optional argument
- model: genetic model- :additive(default),- :dominant, or- :recessive
- impute: impute missing genotype or not, default- false
- center: center gentoype by 2maf or not, default- false
- scale: scale genotype by 1/√2maf(1-maf) or not, default- false
- trans: whether to import data transposed so that each column is 1 genotype, default- false
- msg: A message that will be printed to indicate progress. Defaults to not printing.
- save_snp_info: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors
Output
- out: The genotype matrix stored in column or rows depending on- trans.
- sampleID: The sample ID for each person
- record_chr: The chromosome number for each record
- record_pos: The position for each record
- record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
- record_ref: The ref allele for each record
- record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.convert_ht — Functionconvert_ht(t, vcffile)Converts the GT data from a VCF file to a haplotype matrix of type t. One  VCF record will become 2 column/row in the resulting matrix. Records cannot have missing genotypes and the target file must be phased. 
Input
- t: a type- t <: Real. If- tis- Bool, output will be a BitMatrix.
- vcffile: a phased VCF file
- trans: whether to import data transposed so that each column is a haplotype, default- false.
- msg: A message that will be printed to indicate import progress. Defaults to not printing.
- save_snp_info: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors
Output
- out: The haplotype matrix stored in column or rows depending on- trans.
- sampleID: The sample ID for each person
- record_chr: The chromosome number for each record
- record_pos: The position for each record
- record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
- record_ref: The ref allele for each record
- record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.convert_ds — Functionconvert_ds(t, vcffile; [key = "DS"], ...)Converts dosage data from a VCF file to a numeric matrix of type t. Here key specifies the FORMAT field of the VCF file that encodes the dosage (default = "DS"). 
Arguments
- t: type of output matrix.
- vcffile: VCF file path
Optional argument
- key: The FIELD name of the VCF file that encodes dosages, defaults to- "DS"
- model: genetic model- :additive(default),- :dominant, or- :recessive
- impute: impute missing genotype with 2 times the ALT allele frequency or not. Default- false
- center: center gentoype by 2maf or not, default- false
- scale: scale genotype by 1/√2maf(1-maf) or not, default- false
- trans: whether to import data transposed so that each column is 1 genotype, default- false
- msg: A message that will be printed to indicate progress. Defaults to not printing.
Output
- out: Dosage matrix
- sampleID: The sample ID for each person
- record_chr: The chromosome number for each record
- record_pos: The position for each record
- record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
- record_ref: The ref allele for each record
- record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.copy_gt! — Functioncopy_gt!(A, reader; ...)Fill the rows of matrix A by the GT data from VCF records in reader. Records with missing genotypes are converted to missing.
Input
- A: a matrix or vector such that- eltype(A) <: Union{Missing, Real}. Each row is a person's genotype.
- reader: a VCF reader
Optional argument
- model: genetic model- :additive(default),- :dominant, or- :recessive
- impute: impute missing genotype or not, default- false
- center: center gentoype by 2maf or not, default- false
- scale: scale genotype by 1/√2maf(1-maf) or not, default- false
- msg: A message that will be printed to indicate progress. Defaults to not printing.
- sampleID: The sample ID for each person
- record_chr: The chromosome number for each record
- record_pos: The position for each record
- record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
- record_ref: The ref allele for each record
- record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
Output
- A:- ismissing(A[i, j]) == trueindicates missing genotype. If- impute=true,- ismissing(A[i, j]) == falsefor all entries.
VCFTools.copy_ht! — Functioncopy_ht!(A, reader)Fill 2 columns of A by the GT data from VCF records in reader, each record filling 2 columns. Missing data is not allowed and genotypes must be phased. 
Input
- A: matrix where rows are haplotypes. Person- i's haplotype are filled in rows- 2i - 1and- 2i.
- reader: a VCF reader
Optional inputs
- msg: A message that will be printed to indicate progress. Defaults to not printing.
- sampleID: The sample ID for each person
- record_chr: The chromosome number for each record
- record_pos: The position for each record
- record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
- record_ref: The ref allele for each record
- record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.copy_ds! — Functioncopy_ds!(A, reader; [key="DS"], ...)Fill the columns of matrix A by the dosage data from VCF records in reader. Records with missing fields (i.e. key field is .) are converted to missing.
Input
- A: matrix where rows are dosages.
- reader: a VCF reader
Optional argument
- key: The field key for accessing dosage values, default- "DS"
- model: genetic model- :additive(default),- :dominant, or- :recessive
- impute: impute missing genotype or not, default- false
- center: center gentoype by 2maf or not, default- false
- scale: scale genotype by 1/√2maf(1-maf) or not, default- false
- msg: A message that will be printed to indicate progress. Defaults to not printing.
- sampleID: The sample ID for each person
- record_chr: The chromosome number for each record
- record_pos: The position for each record
- record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
- record_ref: The ref allele for each record
- record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
VCFTools.conformgt_by_id — Functionconformgt_by_id(reffile, tgtfile, outfile, chrom, posrange, checkfreq)Match the VCF records in tgtfile to those in reffile according to ID. The function will:
- Find corresponding VCF records in the target and reference files
- Exclude target VCF records whose ID cannot be matched to any reference VCF record
- Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
- Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
- The matched VCF records are written into files outfile.tgt.vcf.gzand
outfile.ref.vcf.gz, both with only "GT" data
Input
- reffile: VCF file with reference genotype (GT) data
- tgtfile: VCF file with target genotype (GT) data
- outfile: the prefix for output filenames
- chrom: chromosome name, must be identical in target and reference files
- posrange: position range in the reference file
- checkfreq: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is- ≤ checkfreq, the records are not output. Setting- checkfreq=0or- checkfreq=false(default) implies not checking allele frequencies. Setting- checkfreq=1effectively rejects all tests and no matched records are output
Output
- lines: number of matched VCF records
VCFTools.conformgt_by_pos — Functionconformgt_by_pos(reffile, tgtfile, outfile, chrom, posrange, checkfreq)Match the VCF GT records in tgtfile to those in reffile according to chromosome position. The function will:
- Find corresponding VCF records in the target and reference files
- Exclude target VCF records whose position cannot be matched to any reference VCF record
- Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
- Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
- The matched VCF records are written into files outfile.tgt.vcf.gzand
outfile.ref.vcf.gz, both with only "GT" data
Input
- reffile: VCF file with reference genotype (GT) data
- tgtfile: VCF file with target genotype (GT) data
- outfile: the prefix for output filenames
- chrom: chromosome name, must be identical in target and reference files
- posrange: position range in the reference file
- checkfreq: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is- ≤ checkfreq, the records are not output. Setting- checkfreq=0or- checkfreq=false(default) implies not checking allele frequencies. Setting- checkfreq=1effectively rejects all tests and no matched records are output
Output
- lines: number of matched VCF records
VCFTools.mask_gt — Functionmask_gt(src, masks; [des = "masked." * src], [separator = '/'])Creates a new VCF file des where genotype entry (i, j) of src  is missing if masks[i, j] is true. src is unchanged.
Arguments
- src: Input VCF file name.
- masks: Bit matrix.- masks[i, j] = truemeans mask entry (i, j).
- des: output VCF file name.
- unphase: If- true, all heterozygous genotypes will be- 1/0and homozygous separator will be- /. If- false, all unmasked genotypes will be retained.
VCFTools.write_vcf — Functionwrite_vcf(outfile, x1, x2, [chr], [pos], [ref], [alt], [sampleID], [SNPids])Writes haplotypes in x1 and x2 into VCF file, tracking phase information. 
Inputs
- outfile: Output file name (ends in- .vcfor- .vcf.gz)
- x1:- n × pmatrix of the 1st haplotype for each sample. Each row is a haplotype, and each entry is 0 or 1
- x2:- n × pmatrix of the 2nd haplotype for each sample.- x = x1 + x2. Each entry is 0 or 1
Optional Inputs
- chr: Vector of String.- chr[i]is chromosome number of SNP- i.
- pos: Vector of Int.- pos[i]is position of SNP- i.
- ref: Vector of String.- ref[i]is reference allele of SNP- i.
- alt: Vector of String.- alt[i]is alternate allele of SNP- i.
- sampleID: Vector of String.- sampleID[i]is reference allele of SNP- i.
- SNPids: Vector of String.- SNPids[i]is reference allele of SNP- i.
Reference
For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
write_vcf(outfile, x, [chr], [pos], [ref], [alt], [sampleID], [SNPids])Writes genotypes in x into VCF file, not tracking phase information. 
Inputs
- outfile: Output file name (ends in- .vcfor- .vcf.gz)
- x:- n × pgenotype matrix. Each row is a sample, and each entry is 0, 1, or 2.
Optional Inputs
- chr: Vector of String.- chr[i]is chromosome number of SNP- i.
- pos: Vector of Int.- pos[i]is position of SNP- i.
- ref: Vector of String.- ref[i]is reference allele of SNP- i.
- alt: Vector of String.- alt[i]is alternate allele of SNP- i.
- sampleID: Vector of String.- sampleID[i]is reference allele of SNP- i.
- SNPids: Vector of String.- SNPids[i]is reference allele of SNP- i.
Reference
For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23
VCFTools.aim_select — Functionaim_select(vcffile::AbstractString, sampleID_to_population::Dict{String, String}, 
    [maf_threshold::Float64=0.01])Ranks SNPs by their ancestry information content. All people should be assigned an ancestry origin and be fully typed. P-values are computed via a likelihood ratio heterogeneity test. Sex chromosome is ignored (for now).
Inputs
- vcffile: VCF file, ending with- .vcfor- .vcf.gz
- sampleID_to_population: A dictionary mapping every sample ID to a population origin.
Optional Inputs
- maf_threshold: Minimum minor allele frequency for SNPs considered (default 0.01)
Note:
Method is adapted from MendelAimSelection.jl https://github.com/OpenMendel/MendelAimSelection.jl