API
Documentation for VCFTools.jl's functions.
Index
VCFTools.conformgt_by_id
VCFTools.conformgt_by_pos
VCFTools.convert_gt
VCFTools.copy_gt!
VCFTools.filter_genotype
VCFTools.gtstats
VCFTools.nrecords
VCFTools.nsamples
VCFTools.openvcf
Functions
VCFTools.openvcf
— Function.
openvcf(vcffile, [mode = "r"])
Open a VCF file (.vcf or .vcf.gz) and return an IO stream.
VCFTools.nrecords
— Function.
nrecords(vcffile)
Number of records (markers) in a VCF file.
VCFTools.nsamples
— Function.
nsamples(vcffile)
Number of samples (individuals) in a VCF file.
VCFTools.gtstats
— Function.
gtstats(vcffile, [out=DevNull])
Calculate genotype statistics for each marker with GT field in a VCF file.
Input
- vcffile: VCF file, ending with .vcf or .vcf.gz
- out: output file name or IOStream. Default is out=DevNull (no output).
One line with 15 tab-delimited fields is written per marker to out:
- 1-8) VCF fixed fields (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO)
- 9) Missing genotype count
- 10) Missing genotype frequency
- 11) ALT allele count
- 12) ALT allele frequency
- 13) Minor allele count (REF allele vs ALT alleles)
- 14) Minor allele frequency (REF allele vs ALT alleles)
- 15) HWE P-value (REF allele vs ALT alleles)
Output
- records: number of records in the input VCF file
- samples: number of individuals in the input VCF file
- lines: number of lines written to out; equivalently, the number of markers with GT field
- missings_by_sample: number of missing genotypes in each sample
- missings_by_record: number of missing genotypes in each record (marker)
- maf_by_record: minor allele frequency in each record (marker)
- minorallele_by_record: a Boolean vector indicating the minor allele in each record (marker). minorallele_by_record[i]=true means the minor allele is the REF allele for marker i; minorallele_by_record[i]=false means the minor allele is the ALT allele for marker i
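The 15-field line format written to out can be read back with a few lines of Julia. The following is a minimal sketch, not part of VCFTools.jl: the helper name parse_gtstats_line and the field names of the returned tuple are hypothetical, the sample line is made up, and the sketch assumes that counts are written as integers and frequencies as decimal numbers.

```julia
# Hypothetical helper (not part of VCFTools.jl): parse one line written by
# gtstats(vcffile, out) into a named tuple, following the 15-field layout.
# Assumes counts are written as integers and frequencies as decimals.
function parse_gtstats_line(line::AbstractString)
    f = split(chomp(line), '\t')
    length(f) == 15 || error("expected 15 tab-delimited fields, got $(length(f))")
    return (
        chrom         = String(f[1]),
        pos           = parse(Int, f[2]),
        id            = String(f[3]),
        ref           = String(f[4]),
        alt           = String(f[5]),
        missing_count = parse(Int, f[9]),
        missing_freq  = parse(Float64, f[10]),
        alt_count     = parse(Int, f[11]),
        alt_freq      = parse(Float64, f[12]),
        minor_count   = parse(Int, f[13]),
        maf           = parse(Float64, f[14]),
        hwe_pval      = parse(Float64, f[15]),
    )
end

# Made-up example line, for illustration only.
line = join(["22", "16050075", "rs1", "A", "G", ".", "PASS", ".",
             "0", "0.0", "3", "0.15", "3", "0.15", "0.8"], '\t')
stats = parse_gtstats_line(line)
println(stats.pos, " ", stats.maf)
```

Fields 6 to 8 (QUAL, FILTER, INFO) are skipped here only to keep the sketch short.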
gtstats(record, [missings_by_sample=nothing])
Calculate genotype statistics for a VCF record with GT field.
Input
- record: a VCF record
- missings_by_sample: accumulator of missings by sample; missings_by_sample[i] is incremented by 1 if the i-th individual has a missing genotype in this record
Output
- n00: number of homozygote ALT/ALT or ALT|ALT
- n01: number of heterozygote REF/ALT or REF|ALT
- n11: number of homozygote REF/REF or REF|REF
- n0: number of ALT alleles
- n1: number of REF alleles
- altfreq: proportion of ALT alleles
- reffreq: proportion of REF alleles
- missings: number of missing genotypes
- minorallele: minor allele: false (ALT allele) or true (REF allele)
- maf: minor allele frequency
- hwepval: Hardy-Weinberg p-value
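To make the relationship between genotype counts, allele frequencies, and the minor allele concrete, here is a minimal, self-contained sketch. It is not the package's implementation: the function name genotype_summary and its field names are hypothetical, it assumes biallelic markers whose GT strings use the standard VCF coding (0 for REF, 1 for ALT, . for missing), it labels ALT as the minor allele in a tie, and it leaves out the Hardy-Weinberg test.

```julia
# Hypothetical sketch (not the package's implementation): tally biallelic
# genotypes for one marker and derive allele frequencies from the counts.
function genotype_summary(gts::Vector{String})
    nrefref = nhet = naltalt = nmissing = 0
    for gt in gts
        alleles = split(gt, r"[/|]")      # unphased "/" or phased "|"
        if length(alleles) != 2 || any(a -> a == ".", alleles)
            nmissing += 1
        else
            nalt = count(a -> a == "1", alleles)
            if nalt == 0
                nrefref += 1
            elseif nalt == 1
                nhet += 1
            else
                naltalt += 1
            end
        end
    end
    nref_alleles = 2nrefref + nhet
    nalt_alleles = 2naltalt + nhet
    total = nref_alleles + nalt_alleles
    reffreq = nref_alleles / total
    altfreq = nalt_alleles / total
    return (nrefref = nrefref, nhet = nhet, naltalt = naltalt,
            nmissing = nmissing, reffreq = reffreq, altfreq = altfreq,
            minor_is_ref = reffreq < altfreq,   # tie: ALT is called minor
            maf = min(reffreq, altfreq))
end

s = genotype_summary(["0/0", "0|1", "1/1", "./.", "0/0"])
println(s)
```

With the five genotypes above, four are non-missing, giving 5 REF and 3 ALT alleles, so ALT is the minor allele with frequency 3/8.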
VCFTools.filter_genotype
— Function.
filter_genotype(record, [genokey=["GT"]])
Filter a VCF record according to genokey and output a VCF record with genotype formats only in genokey.
filter_genotype(vcffile, outfile, [genokey=["GT"]])
Filter a VCF file according to genokey (default ["GT"]). Output a VCF file with genotype formats only in genokey. Records (markers) with no fields in genokey are skipped.
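The filtering described above can be illustrated on a single tab-delimited VCF data line. This is a minimal sketch under stated assumptions, not the package's implementation: filter_format_line is a hypothetical name, the example line is made up, and the sketch works on raw text rather than on the record type that filter_genotype accepts.

```julia
# Hypothetical sketch (not the package's implementation): keep only the
# FORMAT keys listed in genokey for one tab-delimited VCF data line.
# Returns nothing when the line has no field in genokey (record skipped).
function filter_format_line(line::AbstractString, genokey = ["GT"])
    fields = split(line, '\t')
    length(fields) >= 10 || return nothing     # no genotype columns
    fmtkeys = split(fields[9], ':')            # FORMAT column
    keep = findall(k -> k in genokey, fmtkeys)
    isempty(keep) && return nothing
    out = String.(fields[1:8])                 # fixed fields are unchanged
    push!(out, join(fmtkeys[keep], ':'))
    for sample in fields[10:end]
        vals = split(sample, ':')
        # VCF allows trailing sample fields to be dropped, so guard the index.
        kept = [i <= length(vals) ? String(vals[i]) : "." for i in keep]
        push!(out, join(kept, ':'))
    end
    return join(out, '\t')
end

# Made-up example line, for illustration only.
line = "1\t100\trs1\tA\tG\t50\tPASS\t.\tGT:DP:GQ\t0/1:12:99\t1/1:8:60"
filtered = filter_format_line(line)
println(filtered)
```

Kept keys stay in the order in which they appear in the FORMAT column, regardless of their order in genokey.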
VCFTools.copy_gt!
— Function.
copy_gt!(A, reader; [model=:additive], [impute=false], [center=false], [scale=false])
Fill the columns of a nullable matrix A with the GT data from VCF records in reader. Each column of A corresponds to one record. A record without a GT field is converted to NaN.
Input
- A: a nullable matrix or nullable vector
- reader: a VCF reader
Optional argument
- model: genetic model, :additive (default), :dominant, or :recessive
- impute: impute missing genotype or not, default false
- center: center genotype by 2maf or not, default false
- scale: scale genotype by 1/√(2maf(1-maf)) or not, default false
Output
- A: isnull(A[i, j]) == true indicates missing genotype. If impute=true, isnull(A[i, j]) == false for all entries.
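The effect of the model, center, and scale options on a single genotype can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: encode_genotype is a hypothetical name, the input is assumed to be the number of minor-allele copies (0, 1, or 2), Julia's missing stands in for a null entry, centering and scaling are applied literally as documented regardless of the model, and imputation is left out because the imputation rule is not described here.

```julia
# Hypothetical sketch (not the package's implementation).
# nminor: number of minor-allele copies (0, 1, or 2), or missing.
# maf:    minor allele frequency of the marker.
function encode_genotype(nminor, maf::Real; model::Symbol = :additive,
                         center::Bool = false, scale::Bool = false)
    ismissing(nminor) && return missing
    x = if model == :additive
        Float64(nminor)                  # 0, 1, or 2 copies
    elseif model == :dominant
        nminor >= 1 ? 1.0 : 0.0          # at least one copy
    elseif model == :recessive
        nminor == 2 ? 1.0 : 0.0          # two copies
    else
        throw(ArgumentError("unknown model: $model"))
    end
    center && (x -= 2maf)                      # subtract the mean 2maf
    scale && (x /= sqrt(2maf * (1 - maf)))     # divide by sqrt(2maf(1-maf))
    return x
end

z = encode_genotype(2, 0.25; center = true, scale = true)
println(z)
```

Under the additive model with Hardy-Weinberg proportions, a genotype has mean 2maf and variance 2maf(1-maf), which is why centering and scaling by these quantities standardizes a column.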
VCFTools.convert_gt
— Function.
Convert a two-bit genotype to a real number (minor allele count) of type t according to the specified SNP model. Missing genotype is converted to null. minor_allele==true indicates REF is the minor allele; minor_allele==false indicates ALT is the minor allele.
convert_gt(t, vcffile; [impute=false], [center=false], [scale=false])
Convert the GT data from a VCF file to a nullable matrix of type t. Each column of the matrix corresponds to one VCF record. A record without a GT field is converted to the equivalent of missing genotypes.
Input
- t: a type t <: Real
- vcffile: VCF file path
Optional argument
- model: genetic model, :additive (default), :dominant, or :recessive
- impute: impute missing genotype or not, default false
- center: center genotype by 2maf or not, default false
- scale: scale genotype by 1/√(2maf(1-maf)) or not, default false
Output
- A: a nullable matrix of type NullableMatrix{T}. isnull(A[i, j]) == true indicates missing genotype, even when A.values[i, j] may hold the imputed genotype
VCFTools.conformgt_by_id
— Function.
conformgt_by_id(reffile, tgtfile, outfile, chrom, posrange, checkfreq)
Match the VCF records in tgtfile to those in reffile according to ID. The function will:
- Find corresponding VCF records in the target and reference files
- Exclude target VCF records whose ID cannot be matched to any reference VCF record
- Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
- Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
- The matched VCF records are written into files outfile.tgt.vcf.gz and outfile.ref.vcf.gz, both with only "GT" data
Input
- reffile: VCF file with reference genotype (GT) data
- tgtfile: VCF file with target genotype (GT) data
- outfile: the prefix for output filenames
- chrom: chromosome name, must be identical in target and reference files
- posrange: position range in the reference file
- checkfreq: significance level for testing equal allele frequencies between matched target and reference records. If the test p-value is ≤ checkfreq, the records are not output. Setting checkfreq=0 or checkfreq=false (default) implies not checking allele frequencies. Setting checkfreq=1 effectively rejects all tests and no matched records are output
Output
- lines: number of matched VCF records
VCFTools.conformgt_by_pos
— Function.
conformgt_by_pos(reffile, tgtfile, outfile, chrom, posrange, checkfreq)
Match the VCF GT records in tgtfile to those in reffile according to chromosome position. The function will:
- Find corresponding VCF records in the target and reference files
- Exclude target VCF records whose position cannot be matched to any reference VCF record
- Exclude target VCF records whose test of equal allele frequency is rejected at significance level checkfreq
- Adjust target VCF records so that chromosome strand and allele order match the VCF reference file
- The matched VCF records are written into files outfile.tgt.vcf.gz and outfile.ref.vcf.gz, both with only "GT" data
Input
- reffile: VCF file with reference genotype (GT) data
- tgtfile: VCF file with target genotype (GT) data
- outfile: the prefix for output filenames
- chrom: chromosome name, must be identical in target and reference files
- posrange: position range in the reference file
- checkfreq: significance level for testing equal allele frequencies between matched target and reference records. If the test p-value is ≤ checkfreq, the records are not output. Setting checkfreq=0 or checkfreq=false (default) implies not checking allele frequencies. Setting checkfreq=1 effectively rejects all tests and no matched records are output
Output
- lines: number of matched VCF records
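Both conformgt functions adjust target records so that allele order matches the reference. The idea behind the allele-order adjustment can be sketched on a single biallelic GT string. This is only an illustration, not the package's implementation: swap_alleles and conform_gt are hypothetical names, and strand flips (complementing the alleles) and the allele frequency test are deliberately left out.

```julia
# Hypothetical sketch (not the package's implementation): swap the allele
# codes 0 and 1 in a biallelic GT string, leaving separators and "." alone.
swap_alleles(gt::AbstractString) =
    map(c -> c == '0' ? '1' : c == '1' ? '0' : c, gt)

# Recode a target genotype so that its allele order matches the reference.
# Returns nothing when the alleles cannot be reconciled by reordering alone.
function conform_gt(gt, tgt_ref, tgt_alt, ref_ref, ref_alt)
    if tgt_ref == ref_ref && tgt_alt == ref_alt
        return gt                    # already in reference order
    elseif tgt_ref == ref_alt && tgt_alt == ref_ref
        return swap_alleles(gt)      # REF and ALT are exchanged
    else
        return nothing               # would need strand handling or exclusion
    end
end

println(conform_gt("0|1", "A", "G", "G", "A"))
```

In the example, the target lists A as REF and G as ALT while the reference lists them the other way round, so each 0 becomes 1 and vice versa.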