API

Documentation for VCFTools.jl's functions.

Index

Functions

VCFTools.openvcfFunction
openvcf(vcffile, [mode = "r"])

Open VCF file (.vcf or .vcf.gz) and return an IO stream.

source
VCFTools.nrecordsFunction
nrecords(vcffile)

Number of records (markers) in a VCF file. Each record is a row.

source
VCFTools.sampleIDFunction
sampleID(vcffile::AbstractString)

Helper function for extracting sample IDs from a VCF file.

source
VCFTools.gtstatsFunction
gtstats(vcffile, [out=devnull])

Calculate genotype statistics for each marker with GT field in a VCF file.

Input

  • vcffile: VCF file, ending with .vcf or .vcf.gz
  • out: output file name or IOStream. Default is out=devnull (no output).

One line with 15 tab-delimited fiels is written per marker to out: - 1-8) VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILT, INFO) - 9) Missing genotype count - 10) Missing genotype frequency - 11) ALT allele count - 12) ALT allele frequency - 13) Minor allele count (REF allele vs ALT alleles) - 14) Minor allele frequency (REF allele vs ALT alleles) - 15) HWE P-value (REF allele vs ALT alleles)

  • pval_method: Method for calculating hardy weinburg equilibrium p-values. Can be :Pearson (default) or :Fisher.

Output

  • records: number of records in the input VCF file
  • samples: number of individuals in the input VCF file
  • lines : number of lines written to out; equivalently number of markers with GT field
  • missings_by_sample: number of missing genotypes in each sample
  • missings_by_record: number of missing genotypes in each record (marker)
  • maf_by_record: minor allele frequency in each record (marker)
  • minorallele_by_record: a Boolean vector indicating the minor allele in each record (marker). minorallele_by_record[i]=true means the minor allele is the REF allele for marker i; minorallele_by_record[i]=false means the minor allele is the ALT allele for marker i
  • hwe_by_record: hardy weinburg equilibrium p-values in each record calculated using pval_method
source
gtstats(record, [missings_by_sample=nothing])

Calculate genotype statistics for a VCF record with GT field.

Input

  • record: a VCF record
  • missings_by_sample: accumulator of missings by sample, missings_by_sample[i] is incremented by 1 if i-th individual has missing genotype in this record

Output

  • n00: number of homozygote REF/REF or REF|REF
  • n01: number of heterozygote REF/ALT or REF|ALT
  • n11: number of homozygote ALT/ALT or ALT|ALT
  • n0: number of REF alleles
  • n1: number of ALT alleles
  • altfreq: proportion of ALT alleles
  • reffreq: proportion of REF alleles
  • missings: number of missing genotypes
  • minorallele: minor allele: false (ALT is minor allele) or true (REF is minor allele)
  • maf: minor allele frequency
  • hwepval: Hardy-Weinberg p-value
source
Missing docstring.

Missing docstring for filter. Check Documenter's build log for details.

Missing docstring.

Missing docstring for filter_record!. Check Documenter's build log for details.

VCFTools.filter_headerFunction
filter_header(reader, sample_mask)

Filters the header info of a VCF reader. Will not keep sample IDs for entry i if sample_mask[i] = false.

Inpute:

  • reader: a VCF reader object.
  • sample_mask: a BitVector. sample_mask[i] = true means keep sample i.

TODO: Create VCFTool.jl signature in meta info. If filedate/signature exist already, delete them before writing new ones

source
VCFTools.filter_genotypeFunction
filter_genotype(record, [genokey=["GT"]])

Filter a VCF record according to genokey and output a VCF record with genotype formats only in genokey.

source
filter_genotype(vcffile, outfile, [genokey=["GT"]])

Filter a VCF file according to genokey (default ["GT"]). Output a VCF file with genotype formats only in genokey. Record (markers) with no fields in genokey are skipped.

source
VCFTools.convert_gtFunction

Convert a two-bit genotype (of ALT alleles) to a real number of type t according to specified SNP model.

source
convert_gt(t, vcffile; [impute=false], [center=false], [scale=false])

Convert the GT data from a VCF file to a matrix of type t, allowing for missing values. Records with missing genotypes are converted to missing.

Input

  • t: a type t <: Real
  • vcffile: VCF file path

Optional argument

  • model: genetic model :additive (default), :dominant, or :recessive
  • impute: impute missing genotype or not, default false
  • center: center gentoype by 2maf or not, default false
  • scale: scale genotype by 1/√2maf(1-maf) or not, default false
  • trans: whether to import data transposed so that each column is 1 genotype, default false
  • msg: A message that will be printed to indicate progress. Defaults to not printing.
  • save_snp_info: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors

Output

  • out: The genotype matrix stored in column or rows depending on trans.
  • sampleID: The sample ID for each person
  • record_chr: The chromosome number for each record
  • record_pos: The position for each record
  • record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
  • record_ref: The ref allele for each record
  • record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
source
VCFTools.convert_htFunction
convert_ht(t, vcffile)

Converts the GT data from a VCF file to a haplotype matrix of type t. One VCF record will become 2 column/row in the resulting matrix. Records cannot have missing genotypes and the target file must be phased.

Input

  • t: a type t <: Real. If t is Bool, output will be a BitMatrix.
  • vcffile: a phased VCF file
  • trans: whether to import data transposed so that each column is a haplotype, default false.
  • msg: A message that will be printed to indicate import progress. Defaults to not printing.
  • save_snp_info: Boolean. If true, will also output sample ID, chrom, pos, snp id, ref, and alt vectors

Output

  • out: The haplotype matrix stored in column or rows depending on trans.
  • sampleID: The sample ID for each person
  • record_chr: The chromosome number for each record
  • record_pos: The position for each record
  • record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
  • record_ref: The ref allele for each record
  • record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
source
VCFTools.convert_dsFunction
convert_ds(t, vcffile; [key = "DS"], ...)

Converts dosage data from a VCF file to a numeric matrix of type t. Here key specifies the FORMAT field of the VCF file that encodes the dosage (default = "DS").

Arguments

  • t: type of output matrix.
  • vcffile: VCF file path

Optional argument

  • key: The FIELD name of the VCF file that encodes dosages, defaults to "DS"
  • model: genetic model :additive (default), :dominant, or :recessive
  • impute: impute missing genotype with 2 times the ALT allele frequency or not. Default false
  • center: center gentoype by 2maf or not, default false
  • scale: scale genotype by 1/√2maf(1-maf) or not, default false
  • trans: whether to import data transposed so that each column is 1 genotype, default false
  • msg: A message that will be printed to indicate progress. Defaults to not printing.

Output

  • out: Dosage matrix
  • sampleID: The sample ID for each person
  • record_chr: The chromosome number for each record
  • record_pos: The position for each record
  • record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
  • record_ref: The ref allele for each record
  • record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
source
VCFTools.copy_gt!Function
copy_gt!(A, reader; ...)

Fill the rows of matrix A by the GT data from VCF records in reader. Records with missing genotypes are converted to missing.

Input

  • A: a matrix or vector such that eltype(A) <: Union{Missing, Real}. Each row is a person's genotype.
  • reader: a VCF reader

Optional argument

  • model: genetic model :additive (default), :dominant, or :recessive
  • impute: impute missing genotype or not, default false
  • center: center gentoype by 2maf or not, default false
  • scale: scale genotype by 1/√2maf(1-maf) or not, default false
  • msg: A message that will be printed to indicate progress. Defaults to not printing.
  • sampleID: The sample ID for each person
  • record_chr: The chromosome number for each record
  • record_pos: The position for each record
  • record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
  • record_ref: The ref allele for each record
  • record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.

Output

  • A: ismissing(A[i, j]) == true indicates missing genotype. If impute=true, ismissing(A[i, j]) == false for all entries.
source
VCFTools.copy_ht!Function
copy_ht!(A, reader)

Fill 2 columns of A by the GT data from VCF records in reader, each record filling 2 columns. Missing data is not allowed and genotypes must be phased.

Input

  • A: matrix where rows are haplotypes. Person i's haplotype are filled in rows 2i - 1 and 2i.
  • reader: a VCF reader

Optional inputs

  • msg: A message that will be printed to indicate progress. Defaults to not printing.
  • sampleID: The sample ID for each person
  • record_chr: The chromosome number for each record
  • record_pos: The position for each record
  • record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
  • record_ref: The ref allele for each record
  • record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
source
VCFTools.copy_ds!Function
copy_ds!(A, reader; [key="DS"], ...)

Fill the columns of matrix A by the dosage data from VCF records in reader. Records with missing fields (i.e. key field is .) are converted to missing.

Input

  • A: matrix where rows are dosages.
  • reader: a VCF reader

Optional argument

  • key: The field key for accessing dosage values, default "DS"
  • model: genetic model :additive (default), :dominant, or :recessive
  • impute: impute missing genotype or not, default false
  • center: center gentoype by 2maf or not, default false
  • scale: scale genotype by 1/√2maf(1-maf) or not, default false
  • msg: A message that will be printed to indicate progress. Defaults to not printing.
  • sampleID: The sample ID for each person
  • record_chr: The chromosome number for each record
  • record_pos: The position for each record
  • record_ids: The record ID for each record. Only the first one is recorded when there are multiple IDs.
  • record_ref: The ref allele for each record
  • record_alt: The alternate allele for each record. Only the first one is recorded when there are multiple alternate alleles.
source
VCFTools.conformgt_by_idFunction
conformgt_by_id(reffile, tgtfile, outfile, chrom, posrange, checkfreq)

Match the VCF records in tgtfile to those in reffile according to ID. The function will:

  1. Find corresponding VCF records in the target and reference files
  2. Exclude target VCF records whose ID cannot be matched to any reference VCF record
  3. Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
  4. Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
  5. The matched VCF records are written into files outfile.tgt.vcf.gz and

outfile.ref.vcf.gz, both with only "GT" data

Input

  • reffile: VCF file with reference genotype (GT) data
  • tgtfile: VCF file with target genotype (GT) data
  • outfile: the prefix for output filenames
  • chrom: chromosome name, must be identical in target and reference files
  • posrange: position range in the reference file
  • checkfreq: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is ≤ checkfreq, the records are not output. Setting checkfreq=0 or checkfreq=false (default) implies not checking allele frequencies. Setting checkfreq=1 effectively rejects all tests and no matched records are output

Output

  • lines: number of matched VCF records
source
VCFTools.conformgt_by_posFunction
conformgt_by_pos(reffile, tgtfile, outfile, chrom, posrange, checkfreq)

Match the VCF GT records in tgtfile to those in reffile according to chromosome position. The function will:

  1. Find corresponding VCF records in the target and reference files
  2. Exclude target VCF records whose position cannot be matched to any reference VCF record
  3. Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
  4. Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
  5. The matched VCF records are written into files outfile.tgt.vcf.gz and

outfile.ref.vcf.gz, both with only "GT" data

Input

  • reffile: VCF file with reference genotype (GT) data
  • tgtfile: VCF file with target genotype (GT) data
  • outfile: the prefix for output filenames
  • chrom: chromosome name, must be identical in target and reference files
  • posrange: position range in the reference file
  • checkfreq: significance level for testing equal alelle frequencies between mached target and reference records. If the test pvalue is ≤ checkfreq, the records are not output. Setting checkfreq=0 or checkfreq=false (default) implies not checking allele frequencies. Setting checkfreq=1 effectively rejects all tests and no matched records are output

Output

  • lines: number of matched VCF records
source
VCFTools.mask_gtFunction
mask_gt(src, masks; [des = "masked." * src], [separator = '/'])

Creates a new VCF file des where genotype entry (i, j) of src is missing if masks[i, j] is true. src is unchanged.

Arguments

  • src: Input VCF file name.
  • masks: Bit matrix. masks[i, j] = true means mask entry (i, j).
  • des: output VCF file name.
  • unphase: If true, all heterozygous genotypes will be 1/0 and homozygous separator will be /. If false, all unmasked genotypes will be retained.
source
VCFTools.write_vcfFunction
write_vcf(outfile, x1, x2, [chr], [pos], [ref], [alt], [sampleID], [SNPids])

Writes haplotypes in x1 and x2 into VCF file, tracking phase information.

Inputs

  • outfile: Output file name (ends in .vcf or .vcf.gz)
  • x1: n × p matrix of the 1st haplotype for each sample. Each row is a haplotype, and each entry is 0 or 1
  • x2: n × p matrix of the 2nd haplotype for each sample. x = x1 + x2. Each entry is 0 or 1

Optional Inputs

  • chr: Vector of String. chr[i] is chromosome number of SNP i.
  • pos: Vector of Int. pos[i] is position of SNP i.
  • ref: Vector of String. ref[i] is reference allele of SNP i.
  • alt: Vector of String. alt[i] is alternate allele of SNP i.
  • sampleID: Vector of String. sampleID[i] is reference allele of SNP i.
  • SNPids: Vector of String. SNPids[i] is reference allele of SNP i.

Reference

For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23

source
write_vcf(outfile, x, [chr], [pos], [ref], [alt], [sampleID], [SNPids])

Writes genotypes in x into VCF file, not tracking phase information.

Inputs

  • outfile: Output file name (ends in .vcf or .vcf.gz)
  • x: n × p genotype matrix. Each row is a sample, and each entry is 0, 1, or 2.

Optional Inputs

  • chr: Vector of String. chr[i] is chromosome number of SNP i.
  • pos: Vector of Int. pos[i] is position of SNP i.
  • ref: Vector of String. ref[i] is reference allele of SNP i.
  • alt: Vector of String. alt[i] is alternate allele of SNP i.
  • sampleID: Vector of String. sampleID[i] is reference allele of SNP i.
  • SNPids: Vector of String. SNPids[i] is reference allele of SNP i.

Reference

For multithreaded write, see https://github.com/OpenMendel/MendelImpute.jl/blob/master/src/impute.jl#L23

source
VCFTools.aim_selectFunction
aim_select(vcffile::AbstractString, sampleID_to_population::Dict{String, String}, 
    [maf_threshold::Float64=0.01])

Ranks SNPs by their ancestry information content. All people should be assigned an ancestry origin and be fully typed. P-values are computed via a likelihood ratio heterogeneity test. Sex chromosome is ignored (for now).

Inputs

  • vcffile: VCF file, ending with .vcf or .vcf.gz
  • sampleID_to_population: A dictionary mapping every sample ID to a population origin.

Optional Inputs

  • maf_threshold: Minimum minor allele frequency for SNPs considered (default 0.01)

Note:

Method is adapted from MendelAimSelection.jl https://github.com/OpenMendel/MendelAimSelection.jl

source