API

Documentation for VCFTools.jl's functions.

Functions

# VCFTools.openvcf Function.

openvcf(vcffile, [mode = "r"])

Open VCF file (.vcf or .vcf.gz) and return an IO stream.

# VCFTools.nrecords Function.

nrecords(vcffile)

Number of records (markers) in a VCF file.

# VCFTools.nsamples Function.

nsamples(vcffile)

Number of samples (individuals) in a VCF file.

# VCFTools.gtstats Function.

gtstats(vcffile, [out=DevNull])

Calculate genotype statistics for each marker with GT field in a VCF file.

Input

  • vcffile: VCF file, ending with .vcf or .vcf.gz
  • out: output file name or IOStream. Default is out=DevNull (no output).

One line with 15 tab-delimited fields is written per marker to out:

  • 1-8) VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
  • 9) Missing genotype count
  • 10) Missing genotype frequency
  • 11) ALT allele count
  • 12) ALT allele frequency
  • 13) Minor allele count (REF allele vs ALT alleles)
  • 14) Minor allele frequency (REF allele vs ALT alleles)
  • 15) HWE P-value (REF allele vs ALT alleles)

Output

  • records: number of records in the input VCF file
  • samples: number of individuals in the input VCF file
  • lines: number of lines written to out; equivalently, the number of markers with a GT field
  • missings_by_sample: number of missing genotypes in each sample
  • missings_by_record: number of missing genotypes in each record (marker)
  • maf_by_record: minor allele frequency in each record (marker)
  • minorallele_by_record: a Boolean vector indicating the minor allele in each record (marker). minorallele_by_record[i]=true means the minor allele is the REF allele for marker i; minorallele_by_record[i]=false means the minor allele is the ALT allele for marker i
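
As a rough illustration of the 15-column layout described above, the sketch below parses one output line into named fields. The function name `parse_gtstats_line`, the field labels, and the sample line are made up for this example and are not part of VCFTools.jl. It also assumes that every numeric column can be read as a floating-point number, which the documentation does not state. The code is written for Julia 1.x.

```julia
# Parse one line of the 15-column, tab-delimited output described above.
# Field labels are hypothetical; columns 9-15 are assumed to be numeric.
function parse_gtstats_line(line::AbstractString)
    f = split(chomp(line), '\t')
    length(f) == 15 || error("expected 15 tab-delimited fields, got $(length(f))")
    return (
        chrom = String(f[1]), pos = parse(Int, f[2]), id = String(f[3]),
        ref = String(f[4]), alt = String(f[5]), qual = String(f[6]),
        filter = String(f[7]), info = String(f[8]),
        missing_count = parse(Float64, f[9]),
        missing_freq = parse(Float64, f[10]),
        alt_count = parse(Float64, f[11]),
        alt_freq = parse(Float64, f[12]),
        minor_count = parse(Float64, f[13]),
        maf = parse(Float64, f[14]),
        hwe_pval = parse(Float64, f[15]),
    )
end

# Made-up example line, not real data.
line = join(["22", "20000086", "snp1", "T", "C", "100", "PASS", ".",
             "2", "0.01", "7", "0.0177", "7", "0.0177", "1.0"], '\t')
row = parse_gtstats_line(line)
println(row.chrom, ":", row.pos, " maf = ", row.maf)
```

Returning a named tuple keeps the column order visible in one place, so a change in the layout only needs to be reflected in this function.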

gtstats(record, [missings_by_sample=nothing])

Calculate genotype statistics for a VCF record with GT field.

Input

  • record: a VCF record
  • missings_by_sample: accumulator of missings by sample; missings_by_sample[i] is incremented by 1 if the i-th individual has a missing genotype in this record

Output

  • n00: number of homozygote ALT/ALT or ALT|ALT
  • n01: number of heterozygote REF/ALT or REF|ALT
  • n11: number of homozygote REF/REF or REF|REF
  • n0: number of ALT alleles
  • n1: number of REF alleles
  • altfreq: proportion of ALT alleles
  • reffreq: proportion of REF alleles
  • missings: number of missing genotypes
  • minorallele: indicator of the minor allele, false (ALT allele) or true (REF allele)
  • maf: minor allele frequency
  • hwepval: Hardy-Weinberg p-value
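
To show how the allele-level outputs relate to the three genotype counts, here is a minimal sketch that derives them by simple counting, using the output names exactly as listed above. The function name `summarize_counts` is hypothetical, the tie-breaking rule for `minorallele` is an assumption, and the Hardy-Weinberg p-value is left out. This is not the package's implementation; it is written for Julia 1.x.

```julia
# Derive allele counts and frequencies from genotype counts, following the
# output names documented above (n0 from n00, n1 from n11).
# With no observed genotypes the frequencies come out as NaN.
function summarize_counts(n00::Integer, n01::Integer, n11::Integer)
    n0 = n01 + 2 * n00              # each n00 homozygote contributes two alleles
    n1 = n01 + 2 * n11              # each n11 homozygote contributes two alleles
    total = n0 + n1
    altfreq = n0 / total
    reffreq = n1 / total
    minorallele = reffreq < altfreq # true when REF is the minor allele (tie rule assumed)
    maf = min(altfreq, reffreq)
    return (n0 = n0, n1 = n1, altfreq = altfreq, reffreq = reffreq,
            minorallele = minorallele, maf = maf)
end

s = summarize_counts(10, 30, 60)
println(s)
```

With counts 10, 30, and 60 the sketch gives n0 = 50, n1 = 150, and a minor allele frequency of 0.25.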

# VCFTools.filter_genotype Function.

filter_genotype(record, [genokey=["GT"]])

Filter a VCF record according to genokey and output a VCF record with genotype formats only in genokey.

filter_genotype(vcffile, outfile, [genokey=["GT"]])

Filter a VCF file according to genokey (default ["GT"]). Output a VCF file with genotype formats only in genokey. Records (markers) with no fields in genokey are skipped.

# VCFTools.copy_gt! Function.

copy_gt!(A, reader; [model=:additive], [impute=false], [center=false], [scale=false])

Fill the columns of a nullable matrix A with the GT data from VCF records in reader. Each column of A corresponds to one record. A record without a GT field is converted to NaN.

Input

  • A: a nullable matrix or nullable vector
  • reader: a VCF reader

Optional argument

  • model: genetic model :additive (default), :dominant, or :recessive
  • impute: impute missing genotype or not, default false
  • center: center genotype by 2maf or not, default false
  • scale: scale genotype by 1/√(2maf(1-maf)) or not, default false

Output

  • A: isnull(A[i, j]) == true indicates missing genotype. If impute=true, isnull(A[i, j]) == false for all entries.
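
The center and scale options can be pictured with the following sketch, which standardizes one column of additive genotypes. It is an illustration under stated assumptions rather than the package's code: the function name `standardize_genotypes` is hypothetical, the minor allele frequency is estimated from the observed entries, no imputation is performed, and Julia 1.x `missing` stands in for the null entries described above.

```julia
# Center and scale one column of additive genotypes (minor allele counts 0, 1, 2).
# `missing` marks a missing genotype and is passed through unchanged.
# A monomorphic column (maf == 0) yields non-finite values when scale = true.
function standardize_genotypes(g::AbstractVector; center::Bool = false, scale::Bool = false)
    observed = collect(skipmissing(g))
    maf = sum(observed) / (2 * length(observed))   # allele frequency from observed entries
    out = Vector{Union{Missing, Float64}}(undef, length(g))
    for (i, x) in enumerate(g)
        if ismissing(x)
            out[i] = missing                       # no imputation in this sketch
        else
            v = Float64(x)
            center && (v -= 2 * maf)               # subtract the expected count 2maf
            scale && (v /= sqrt(2 * maf * (1 - maf)))
            out[i] = v
        end
    end
    return out
end

g = [0, 1, missing, 2, 1, 0, 0, 0]
c = standardize_genotypes(g; center = true)
z = standardize_genotypes(g; center = true, scale = true)
println(z)
```

Because 2maf equals the mean of the observed counts, the centered column sums to zero over its non-missing entries.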

# VCFTools.convert_gt Function.

Convert a two-bit genotype to a real number (minor allele count) of type t according to specified SNP model. Missing genotype is converted to null. minor_allele==true indicates REF is the minor allele; minor_allele==false indicates ALT is the minor allele.
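
The conversion rule can be sketched as below. The function name `minor_allele_count` and its argument order are hypothetical, and the sketch assumes that each of the two flags is `true` when the corresponding allele is ALT. The missing-genotype case is omitted. The code is written for Julia 1.x and is not taken from the package source.

```julia
# Map a pair of allele flags to a minor allele count under a genetic model.
# Assumption: a flag is `true` when that allele is ALT.
function minor_allele_count(::Type{T}, a1::Bool, a2::Bool, minor_allele::Bool,
                            model::Symbol = :additive) where {T <: Real}
    alt = Int(a1) + Int(a2)                 # number of ALT alleles: 0, 1, or 2
    k = minor_allele ? 2 - alt : alt        # REF is the minor allele when minor_allele == true
    if model == :additive
        return T(k)                         # count of minor alleles
    elseif model == :dominant
        return T(k >= 1)                    # at least one minor allele
    elseif model == :recessive
        return T(k == 2)                    # two minor alleles
    else
        throw(ArgumentError("unknown model: $model"))
    end
end

println(minor_allele_count(Float64, true, false, false))             # heterozygote, ALT minor
println(minor_allele_count(Float64, false, false, true, :recessive)) # REF/REF, REF minor
```

Under the additive model the result is the minor allele count itself; the dominant and recessive models collapse it to a 0/1 indicator.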

convert_gt(t, vcffile; [model=:additive], [impute=false], [center=false], [scale=false])

Convert the GT data from a VCF file to a nullable matrix of type t. Each column of the matrix corresponds to one VCF record. A record without a GT field is converted to the equivalent of missing genotypes.

Input

  • t: a type t <: Real
  • vcffile: VCF file path

Optional argument

  • model: genetic model :additive (default), :dominant, or :recessive
  • impute: impute missing genotype or not, default false
  • center: center genotype by 2maf or not, default false
  • scale: scale genotype by 1/√(2maf(1-maf)) or not, default false

Output

  • A: a nullable matrix of type NullableMatrix{T}. isnull(A[i, j]) == true indicates a missing genotype, even when A.values[i, j] may hold the imputed genotype

# VCFTools.conformgt_by_id Function.

conformgt_by_id(reffile, tgtfile, outfile, chrom, posrange, checkfreq)

Match the VCF records in tgtfile to those in reffile according to ID. The function will:

  1. Find corresponding VCF records in the target and reference files
  2. Exclude target VCF records whose ID cannot be matched to any reference VCF record
  3. Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
  4. Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
  5. The matched VCF records are written into files outfile.tgt.vcf.gz and outfile.ref.vcf.gz, both with only "GT" data

Input

  • reffile: VCF file with reference genotype (GT) data
  • tgtfile: VCF file with target genotype (GT) data
  • outfile: the prefix for output filenames
  • chrom: chromosome name, must be identical in target and reference files
  • posrange: position range in the reference file
  • checkfreq: significance level for testing equal allele frequencies between matched target and reference records. If the test p-value is ≤ checkfreq, the records are not output. Setting checkfreq=0 or checkfreq=false (default) disables the allele frequency check. Setting checkfreq=1 effectively rejects all tests, so no matched records are output

Output

  • lines: number of matched VCF records
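
Steps 1 and 2 of the list above amount to a lookup by ID, which the following sketch illustrates on plain named tuples instead of VCF records. The function name `match_by_id`, the record fields, and the sample IDs are all hypothetical; the actual function reads VCF files and also handles allele frequencies, strand, and allele order. The code is written for Julia 1.x.

```julia
# Pair each target record with the reference record that shares its ID.
# Target records without a matching reference ID are dropped.
function match_by_id(ref_records, tgt_records)
    ref_by_id = Dict(r.id => r for r in ref_records)   # index reference records by ID
    matched = []                                       # (target, reference) pairs
    for t in tgt_records
        r = get(ref_by_id, t.id, nothing)
        r === nothing && continue                      # no reference record with this ID
        push!(matched, (t, r))
    end
    return matched
end

# Made-up records for illustration only.
ref = [(id = "snpA", pos = 100), (id = "snpB", pos = 250), (id = "snpC", pos = 400)]
tgt = [(id = "snpB", pos = 250), (id = "snpX", pos = 300), (id = "snpC", pos = 400)]
matched_pairs = match_by_id(ref, tgt)
println(length(matched_pairs), " matched records")
```

Indexing the reference records in a dictionary keeps each lookup at constant expected cost, so the pass over the target records is linear in their number.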

# VCFTools.conformgt_by_pos Function.

conformgt_by_pos(reffile, tgtfile, outfile, chrom, posrange, checkfreq)

Match the VCF GT records in tgtfile to those in reffile according to chromosome position. The function will:

  1. Find corresponding VCF records in the target and reference files
  2. Exclude target VCF records whose position cannot be matched to any reference VCF record
  3. Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
  4. Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
  5. The matched VCF records are written into files outfile.tgt.vcf.gz and outfile.ref.vcf.gz, both with only "GT" data

Input

  • reffile: VCF file with reference genotype (GT) data
  • tgtfile: VCF file with target genotype (GT) data
  • outfile: the prefix for output filenames
  • chrom: chromosome name, must be identical in target and reference files
  • posrange: position range in the reference file
  • checkfreq: significance level for testing equal allele frequencies between matched target and reference records. If the test p-value is ≤ checkfreq, the records are not output. Setting checkfreq=0 or checkfreq=false (default) disables the allele frequency check. Setting checkfreq=1 effectively rejects all tests, so no matched records are output

Output

  • lines: number of matched VCF records
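
Step 4 of the list above requires deciding how a target record's alleles relate to the reference alleles. One common way to reason about this for single-nucleotide variants is sketched below. It is an assumption about the general technique, not a description of what conformgt_by_pos does internally; the names `COMPLEMENT` and `classify_alleles` are hypothetical. The code is written for Julia 1.x.

```julia
# Classify a target (REF, ALT) pair against the reference pair for a SNP.
# Only the bases A, C, G, T are handled; anything else raises a KeyError.
const COMPLEMENT = Dict('A' => 'T', 'T' => 'A', 'C' => 'G', 'G' => 'C')

function classify_alleles(ref::Tuple{Char, Char}, tgt::Tuple{Char, Char})
    flipped = (COMPLEMENT[tgt[1]], COMPLEMENT[tgt[2]])  # target read on the other strand
    if tgt == ref
        return :same
    elseif tgt == reverse(ref)
        return :swapped                      # REF and ALT are exchanged
    elseif flipped == ref
        return :strand_flipped
    elseif flipped == reverse(ref)
        return :strand_flipped_and_swapped
    else
        return :mismatch                     # alleles cannot be reconciled
    end
end

println(classify_alleles(('A', 'G'), ('G', 'A')))
println(classify_alleles(('A', 'G'), ('T', 'C')))
```

For A/T and C/G variants a swap and a strand flip look identical from the alleles alone, so this sketch reports them as :swapped; resolving such cases needs extra information such as allele frequencies.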
