Ancestry Informative Markers
When sample origins are available, one can select SNPs that are most informative at predicting ancestry for your data — the best Ancestry Informative Markers (AIMs).
Example data
For illustration, we use chromosome 22 data (129 MB) from 1000 genomes project. As shown below, this data contains 424147 SNPs and 2504 samples.
using Revise
using VCFTools
using Random
using CSV
using DataFrames
using StatsBase
# download data
vcffile = "chr22.1kg.phase3.v5a.vcf.gz"
isfile(vcffile) ||
download("http://bochet.gcc.biostat.washington.edu/beagle/1000_Genomes_phase3_v5a/b37.vcf/chr22.1kg.phase3.v5a.vcf.gz",
joinpath(pwd(), vcffile))
# compute simple summary statistics
@show nrecords(vcffile)
@show nsamples(vcffile);
nrecords(vcffile) = 424147
nsamples(vcffile) = 2504
Note the population codes for 1000 genome's samples are explained here.
Preprocess data
To compute AIM markers, we need:
- Origin (country/continent/population) of every sample stored in a dictionary
- Complete SNP information. If any sample contains missing genotypes, the given SNP will have p-value of 1.
The goal is to save the sample origin information into a dictionary where keys are sample IDs and values are origin. Here, the population origin are stored in the file 1000genomes.population.txt
under VCFTools/test
.
# cd to test folder
cd(normpath(VCFTools.datadir()))
# read population origin into a dataframe
df = CSV.read("1000genomes.population.txt", DataFrame)
# create dictionary with key = ID, value = population
sampleID_to_population = Dict{String, String}()
for (id, population) in eachrow(df)
sampleID_to_population[id] = population
end
sampleID_to_population
Dict{String, String} with 2504 entries:
"HG01791" => "GBR"
"HG02736" => "PJL"
"HG00182" => "FIN"
"HG03914" => "BEB"
"HG00149" => "GBR"
"NA12156" => "CEU"
"HG02642" => "GWD"
"HG02851" => "GWD"
"NA19835" => "ASW"
"NA19019" => "LWK"
"HG01131" => "CLM"
"HG03578" => "MSL"
"NA18550" => "CHB"
"HG02401" => "CDX"
"HG01350" => "CLM"
"HG03973" => "ITU"
"NA07000" => "CEU"
"HG01709" => "IBS"
"HG01395" => "PUR"
"HG01980" => "PEL"
"HG01979" => "PEL"
"HG01122" => "CLM"
"HG03869" => "ITU"
"HG03729" => "ITU"
"NA19920" => "ASW"
⋮ => ⋮
Run AIM selection
A p-value will be computed for each SNP. Smaller p-values indicate more ancestry informative.
pvals = VCFTools.aim_select(vcffile, sampleID_to_population)
[32mProgress: 100%|█████████████████████████████████████████| Time: 0:05:45[39m
424147-element Vector{Float64}:
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
1.0
5.684678238417218e-120
2.2097628180404582e-104
1.0
⋮
1.0
1.0
3.918650824065397e-73
2.2771505483860234e-77
1.0
1.8351937329381248e-92
1.0
1.0
1.0
3.0769656913253715e-14
1.0
1.0
We can rank these p-values
aim_rank = ordinalrank(pvals)
df = DataFrame(pvalues=pvals, rank=aim_rank)
@show df[1:20, :]; # list 20 rows
df[1:20, :] = 20×2 DataFrame
Row │ pvalues rank
│ Float64 Int64
─────┼──────────────────────
1 │ 1.0 193453
2 │ 1.0 193454
3 │ 1.0 193455
4 │ 1.0 193456
5 │ 1.0 193457
6 │ 1.0 193458
7 │ 1.0 193459
8 │ 1.0 193460
9 │ 1.0 193461
10 │ 1.0 193462
11 │ 5.68468e-120 39800
12 │ 2.20976e-104 49662
13 │ 1.0 193463
14 │ 1.0 193464
15 │ 1.0 193465
16 │ 2.00626e-61 96498
17 │ 1.0 193466
18 │ 1.4409e-78 73608
19 │ 1.0 193467
20 │ 1.0 193468