Estimating ancestry

If samples in the reference haplotype panel are labeled with a population origin, MendelImpute can also be used for:

Estimate admixed proportions
Chromosome painting

We use the 1000 genomes chromosome 22 as illustration. Example code to generate plots are presented.

# first load all necessary packages
using MendelImpute
using VCFTools
using GeneticVariation
using Random
using DataFrames
using Plots
using JLSO

┌ Info: Precompiling MendelImpute [e47305d1-6a61-5370-bc5d-77554d143183]
└ @ Base loading.jl:1278

Data preparation

Step 0. Filter chromosome data

The original chromosome data are filtered into target and reference panels. Follow detailed example in Phasing and Imputation to obtain the same data.

Step 1. Get population data

Download population code for each 1000 genomes sample via the command below (note wget will probably not work on non-Mac OS). Different population code is explained here.

# run this code in terminal
# wget -r -l3 -N --no-parent ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/data_collections/1000_genomes_project/data/

For easier processing, copy the country of origin data into a folder called data. It should look contain these subfolders (where each population code contains the sample IDs that belong to the population):

;ls data

ACB
ASW
BEB
CDX
CEU
CHB
CHS
CLM
ESN
FIN
GBR
GIH
GWD
IBS
ITU
JPT
KHV
LWK
MSL
MXL
PEL
PJL
PUR
STU
TSI
YRI

Step 2. Process each sample's population origin

The goal here is to create a Dict{key, value} where each key is a sample ID and the value is the population code. This will be used for both the paint and composition function.

Here the population origin for different samples are encoded in weird subfolder directory way. We process them into the desired dictionary structure.

df = DataFrame(sample = String[], population = String[])
refID_to_population = Dict{String, String}()
for population in readdir("data/")
    population == ".DS_Store" && continue # skip auxiliary files
    for sample in readdir("data/" * population)
        sample == ".DS_Store" && continue # skip auxiliary files
        push!(df, (sample, population))
        refID_to_population[sample] = population
    end
end
refID_to_population

Dict{String,String} with 2709 entries:
  "HG01791" => "GBR"
  "HG02736" => "PJL"
  "HG00182" => "FIN"
  "HG03914" => "BEB"
  "HG00149" => "GBR"
  "NA12156" => "CEU"
  "HG02642" => "GWD"
  "HG02851" => "GWD"
  "NA19835" => "ASW"
  "NA19019" => "LWK"
  "HG01131" => "CLM"
  "HG03725" => "ITU"
  "HG03578" => "MSL"
  "NA18550" => "CHB"
  "HG02401" => "CDX"
  "HG01350" => "CLM"
  "HG03973" => "ITU"
  "NA07000" => "CEU"
  "HG01709" => "IBS"
  "HG01395" => "PUR"
  "HG02388" => "CDX"
  "HG01980" => "PEL"
  "HG01979" => "PEL"
  "HG01122" => "CLM"
  "HG03869" => "ITU"
  ⋮         => ⋮

Step 3. Compute phase information using MendelImpute

This is equivalent to running a typical imputation. Please ensure that:

The output file name ends with .jlso (save output to ultra-compressed format)
impute = true (so the output contains the entire chromosome)

Note data used here is prepared in Detailed Example.

# compute each person's phase information
tgtfile = "target.chr22.typedOnly.masked.vcf.gz"
reffile = "ref.chr22.maxd1000.excludeTarget.jlso"
outfile = "mendel.imputed.jlso"
@time ph = phase(tgtfile, reffile, outfile);

Number of threads = 1
Importing reference haplotype data...


[32mComputing optimal haplotypes...100%|████████████████████| Time: 0:00:22[39m


Total windows = 1634, averaging ~ 508 unique haplotypes per window.

Timings: 
    Data import                     = 13.1989 seconds
        import target data             = 3.39344 seconds
        import compressed haplotypes   = 9.80547 seconds
    Computing haplotype pair        = 22.7262 seconds
        BLAS3 mul! to get M and N      = 0.990657 seconds per thread
        haplopair search               = 17.9241 seconds per thread
        initializing missing           = 0.0950638 seconds per thread
        allocating and viewing         = 0.262792 seconds per thread
        index conversion               = 0.0134224 seconds per thread
    Phasing by win-win intersection = 4.74799 seconds
        Window-by-window intersection  = 0.490777 seconds per thread
        Breakpoint search              = 3.28471 seconds per thread
        Recording result               = 0.00808895 seconds per thread
    Imputation                     = 3.06857 seconds
        Imputing missing               = 0.0608952 seconds
        Writing to file                = 3.00767 seconds

    Total time                      = 43.917 seconds

 51.409108 seconds (99.91 M allocations: 5.675 GiB, 4.56% gc time)

Estimate admixture proportions

The composition will compute a list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]. Thus we simply have to plot the result. This illustration depends on data preparation above.

Step 1: import necessary data

# First import compressed reference panel
reffile = "ref.chr22.maxd1000.excludeTarget.jlso"
compressed_Hunique = JLSO.load(reffile)[:compressed_Hunique]
panelID = compressed_Hunique.sampleID

# also need target sample's ancestry
tgtfile = "target.chr22.typedOnly.masked.vcf.gz"
reader = VCF.Reader(openvcf(tgtfile, "r"))
tgtID  = VCF.header(reader).sampleID
sample_population = [refID_to_population[id] for id in tgtID];

# here is our sample population (sample 1 is GBR, 4 is CHS, 84 is LWK...etc)
sample_population

100-element Array{String,1}:
 "GBR"
 "FIN"
 "CHS"
 "CHS"
 "CDX"
 "CDX"
 "PUR"
 "PUR"
 "PUR"
 "PUR"
 "GBR"
 "CLM"
 "IBS"
 ⋮
 "MXL"
 "ASW"
 "ASW"
 "TSI"
 "TSI"
 "TSI"
 "TSI"
 "TSI"
 "TSI"
 "TSI"
 "GIH"
 "GIH"

Step 2: call `composition` function

The composition will compute a list of percentages where composition[i] equals the sample's ancestry (in %) from populations[i]. We are finally using the imputation result stored in ph.

populations = MendelImpute.unique_populations(refID_to_population)
@time sample1_comp = composition(ph[1], panelID, refID_to_population) # origin GBR
@time sample4_comp = composition(ph[4], panelID, refID_to_population) # origin CHS
@time sample84_comp = composition(ph[84], panelID, refID_to_population) # origin LWK

  0.003916 seconds (28 allocations: 2.719 KiB)
  0.000336 seconds (8 allocations: 1.250 KiB)
  0.000370 seconds (8 allocations: 1.250 KiB)





26-element Array{Float64,1}:
 0.03159153536944775
 0.004281873442329377
 0.0183269450383178
 0.0025138200248509064
 0.0014987155576730166
 0.11780598576459632
 0.0104802353893189
 0.21297801484829695
 0.0338623527918964
 0.004151394212468068
 0.07204967308369105
 0.0002681408026507634
 0.004550014244980141
 0.013386690160908423
 0.06359485840010917
 0.006903428611102466
 0.0006188785489751994
 0.1910251836884204
 0.01634940661876842
 0.06281796830671477
 0.0004357288043074905
 0.003690527207912069
 0.11661611095283356
 0.006875896296544575
 0.0028884989142691606
 0.0004381229186168723

We computed the population percentages for sample 1, 4, and 84 with respect to the 26 reference populations. Note the first run is slower because Julia has to compile the code.

Step 3: Plot the percentages

Here sample1_comp[i] equals the sample's estimated ancestry (in %) from populations[i]. Thus we simply have to create a bar plot for each:

barplot = bar(sample1_comp, xticks=(1:1:26, populations), xrotation=50, grid=false, 
    ylabel = "Ancestry proportions", label="Sample 1 (GBR)", alpha=0.8, legend=:top,
    xtickfont=font(10), ytickfont=font(11), legendfont=font(9), yguidefontsize=18)
bar!(barplot, sample4_comp, label="Sample 4 (CHS)", alpha=0.8)
bar!(barplot, sample84_comp, label="Sample 84 (LWK)", alpha=0.8)

svg

Chromosome painting

The main function is the paint function. For an imputed sample, it will convert each haplotype segment into a percentage indicating the segment's length in the chromosome. Then the list can be used for easy plotting.

Note: this illustration depends on data preparation above.

Step 1: Choose your colors

In this example, colors are arranged such that:

Blue ≈ European/American
Red ≈ Asian
Green ≈ African

Of course, Julia lets you plot your favoriate colors. We pick our colors here: https://mdigi.tools/color-shades/#008000.

# generated here: https://mdigi.tools/color-shades/#008000
# Blue ≈ European/American, Red ≈ Asian, Green ≈ Africa
goodcolors = [colorant"#c8c8ff", colorant"#ffeaea", colorant"#ffbfbf", colorant"#a4a4ff",
    colorant"#8080ff", colorant"#e3ffe3", colorant"#aaffaa", colorant"#71ff71", 
    colorant"#5b5bff", colorant"#ff9595", colorant"#39ff39", colorant"#ff6a6a",
    colorant"#ff4040", colorant"#3737ff", colorant"#1212ff", colorant"#0000c8", 
    colorant"#0000a4", colorant"#00ff00", colorant"#ff1515", colorant"#00c600", 
    colorant"#ea0000", colorant"#bf0000", colorant"#008e00", colorant"#00005b",
    colorant"#950000", colorant"#6a0000"]

svg

Step 2: Run `paint` funcion

This function convert the imputed haplotype segments into a list of percentages (one list for each strand). This is simply a post-processing routine so that data can be used for easy plotting later.

populations = unique_populations(refID_to_population)
@time sample1_s1_comp, sample1_s2_comp = paint(ph[1], panelID, refID_to_population)
@time sample4_s1_comp, sample4_s2_comp = paint(ph[4], panelID, refID_to_population)
@time sample84_s1_comp, sample84_s2_comp = paint(ph[84], panelID, refID_to_population);

  0.000260 seconds (12 allocations: 19.125 KiB)
  0.000234 seconds (12 allocations: 20.375 KiB)
  0.000252 seconds (12 allocations: 22.875 KiB)

Step 3: Generate plots for painted chromosomes

We found the StatsPlots.jl package to be more useful for this purpose, although the code below still did the plotting in a very roundabout way.

# assign a color to each haplotype segment
sample1_s1_colors = [goodcolors[findfirst(x -> x == pop, populations)] for pop in sample1_s1_comp[2]]
sample1_s1_colors = reshape(sample1_s1_colors, 1, length(sample1_s1_colors))

sample1_s2_colors = [goodcolors[findfirst(x -> x == pop, populations)] for pop in sample1_s2_comp[2]]
sample1_s2_colors = reshape(sample1_s2_colors, 1, length(sample1_s2_colors))

sample4_s1_colors = [goodcolors[findfirst(x -> x == pop, populations)] for pop in sample4_s1_comp[2]]
sample4_s1_colors = reshape(sample4_s1_colors, 1, length(sample4_s1_colors))

sample4_s2_colors = [goodcolors[findfirst(x -> x == pop, populations)] for pop in sample4_s2_comp[2]]
sample4_s2_colors = reshape(sample4_s2_colors, 1, length(sample4_s2_colors))

sample84_s1_colors = [goodcolors[findfirst(x -> x == pop, populations)] for pop in sample84_s1_comp[2]]
sample84_s1_colors = reshape(sample84_s1_colors, 1, length(sample84_s1_colors))

sample84_s2_colors = [goodcolors[findfirst(x -> x == pop, populations)] for pop in sample84_s2_comp[2]]
sample84_s2_colors = reshape(sample84_s2_colors, 1, length(sample84_s2_colors));

using StatsPlots, FixedPointNumbers

# some tedious and roundabout routine for making a bad groupedplot
sample1_s1l = length(sample1_s1_comp[1])
sample1_s2l = length(sample1_s2_comp[1])
sample4_s1l = length(sample4_s1_comp[1])
sample4_s2l = length(sample4_s2_comp[1])
sample84_s1l = length(sample84_s1_comp[1])
sample84_s2l = length(sample84_s2_comp[1])
maxlen = max(sample1_s1l, sample1_s2l, sample4_s1l, sample4_s2l, sample84_s1l, sample84_s2l)

mydata = zeros(6, maxlen)
copyto!(@view(mydata[1, 1:sample1_s1l]), sample1_s1_comp[1])
copyto!(@view(mydata[2, 1:sample1_s2l]), sample1_s2_comp[1])
copyto!(@view(mydata[3, 1:sample4_s1l]), sample4_s1_comp[1])
copyto!(@view(mydata[4, 1:sample4_s2l]), sample4_s2_comp[1])
copyto!(@view(mydata[5, 1:sample84_s1l]), sample84_s1_comp[1])
copyto!(@view(mydata[6, 1:sample84_s2l]), sample84_s2_comp[1])

mycolors = Matrix{RGB{Normed{UInt8,8}}}(undef, 6, maxlen)
copyto!(@view(mycolors[1, 1:sample1_s1l]), sample1_s1_colors)
copyto!(@view(mycolors[2, 1:sample1_s2l]), sample1_s2_colors)
copyto!(@view(mycolors[3, 1:sample4_s1l]), sample4_s1_colors)
copyto!(@view(mycolors[4, 1:sample4_s2l]), sample4_s2_colors)
copyto!(@view(mycolors[5, 1:sample84_s1l]), sample84_s1_colors)
copyto!(@view(mycolors[6, 1:sample84_s2l]), sample84_s2_colors)

# axis labels
xnames = ["Sample 1 hap1", "Sample 1 hap2", "Sample 4 hap1", "Sample 4 hap2", "Sample 84 hap1", "Sample 84 hap2"]
ynames = ["SNP 1", "SNP 208k", "SNP 417k"]

# final plot
chrom_plt = groupedbar(mydata, bar_position = :stack, bar_width=0.7, label=:none, 
    lw = 0, color=mycolors, xticks=(1:1:6, xnames), yticks=(0:0.5:1, ynames),
    ytickfont=font(12), xtickfont=font(12), xrotation=20)

svg

For more details, please refer to our paper, or file an issue on GitHub.