Plotting

This page will help you get started with plotting in R using Jupyter notebook. I will be going through some tutorials to create useful plots for ancestry inference.

Getting Started with Jupyter Notebook

You should alredy have jupyterlab and an r-kernel installed, and you can enter a jupyter session using the following command:

//You want to enter a port number for <localhost port> (e.g. 8000)

jupyter lab --no-browser --port=<localhost port>

If you want more information, check out this link.

If this does not work, it is also possible to connect to a remote server using Visual Studio Code and edit the Jupyter notebook directly. More instructions on that here:

To reiterate, we will be programming using R in a Jupyter Notebook. All of the code shown will be R.

PCA Plots

Once in Jupyter Notebook, here is a quick example on how to create a PCA plot in R. Here I will be using a subset of the 1000 Genomes Project reference dataset that only includes European, African, and East Asian samples. This dataset is merged with American admixed populations, meaning their genetic makeup is the result of a mixture of different reference populations.

First we have to load our libraries:

library("plyr")
library('dplyr')
library('ggplot2')
library('readr')
library('stats')

Next, we can load in our table(s). In this case, it is just one file with the eigenvector data obtained from my plink files:

eigenvec = read.table("../Data/Sample1KGP/sampleData1KGP.PCA.eigenvec", header = FALSE, sep = " ")

The next thing I did is reformatted the table to make plotting easier:

//Rowname changed from 1, 2, etc. to the specific sample identifier (ACB1, ACB2, ESN1, etc.)
    rownames(eigenvec) <- eigenvec[,2]


//Cutting out starting rows, leaving only eigenvector data
    eigenvec <- eigenvec[,3:ncol(eigenvec)]


//Formatting table to only include first two PCs, changing column headers
    colnames(eigenvec) <- paste('PC', c(1:20), sep = '')
    eigenvec$IndividualID = row.names(eigenvec)
    eigenvec = eigenvec[c("IndividualID", "PC1", "PC2")]
    row.names(eigenvec) <- NULL


//Adding a column named PopGroup, that includes the population group of each sample as a column
    eigenvec = transform(eigenvec, PopGroup = substr(IndividualID, start = 1, stop = 3))

Now we can move on to plotting. Using the package ggplot2, we can create a PCA plot from our eigenvector data:

//Color List

    colors  <- c("YRI" = "blue", "ACB" = "dark red", "ASW" = "dark red", "CEU" = "orange", "CHB" = "dark green", "ESN" = "blue", "GBR" = "orange", "JPT" = "dark green")


//Plotting PCA plot for the top two PCs.

    options(repr.plot.width=12, repr.plot.height=12)
    ggplot(eigenvec, aes(x=PC1, y=PC2, fill=PopGroup)) + scale_fill_manual(values=colors) +
        geom_point(size = 7, pch = 21, color="black") + theme_classic() + theme(axis.title=element_text(size=23)) + theme(axis.text.x=element_text(size=14)) + theme(axis.text.y=element_text(size=14)) +
        theme(legend.key.size = unit(1, 'cm'), #change legend key size
            legend.key.height = unit(1, 'cm'), #change legend key height
            legend.key.width = unit(1, 'cm'), #change legend key width
            legend.title = element_text(size=14), #change legend title font size
            legend.text = element_text(size=14)) #change legend text font size

Running this will create a plot similar to this, which is our working PCA plot.

Alt text

Admixture Plots

The following will be an example on how to make an admixture plot, with the same example data used for the PCA plot example above.

Note: These plots were created using output files from Admixture. Rye is also acceptable, however the output file will need to be manipulated differently to generate a plot using the same R code here.

First we have to load our libraries:

library("plyr")
library("dplyr")
library("ggplot2")
library("reshape2")

Now we can take our plink .fam file and format it to show the population and individual IDs for each sample

#Converting this to get Population Group and Individual ID data

GroupData = read.table('../Data/Sample1KGP/sampleData1KGP.fam', colClasses = c(rep("character", 2), rep("NULL", 4)), header = FALSE)  
colnames(GroupData) = c("PopGroup", "IndividualID")
GroupData <- subset(GroupData, select=c(2,1))

Now let's get their Ancestry estimates obtained from Admixture (.Q file)

admixtureAncestryEstimates = read.table("../Data/Sample1KGP/repTest/sampleData1KGP.3.Q", header = FALSE)
head(admixtureAncestryEstimates)

Now we can combine the ancestry estimates from the .Q file with Individual IDs and Population Groups from the .fam file

GroupData = cbind(GroupData, admixtureAncestryEstimates)
head(GroupData)

Ancestry Groups not assigned by Admixture, so we need to figure it out ourselves

options(scipen = 10000) 
GroupData %>% group_by(PopGroup) %>% summarise_at(vars(V1, V2, V3), funs(mean))

########################
Output:

PopGroup V1            V2        V3       
1 ACB      0.00128136458 0.1080905 0.8906281
2 ASW      0.02441562295 0.2064456 0.7691388
3 CEU      0.00001019192 0.9999798 0.0000100
4 CHB      0.99998000000 0.0000100 0.0000100
5 ESN      0.00001016162 0.0000100 0.9999798
6 GBR      0.00001021978 0.9999798 0.0000100
7 JPT      0.99998000000 0.0000100 0.0000100
8 YRI      0.00001008333 0.0000100 0.9999799

This allows us to determine what column corresponds to which ancestry group given our knowlege of our reference populations

Now we can rename the columns with the corresponding ancestry groups (In this example, V1 = East Asian, V2 = European, V3 = African)

GroupData = GroupData %>% dplyr::rename("European" = "V2", 
                                    "African" = "V3",
                                    "EastAsian" = "V1",)

Now we need to bring ancestry into one column instead of three:

GroupDataSorted = arrange(GroupData, EastAsian, European, African, group_by = PopGroup)

row.names(GroupDataSorted) <- NULL
GroupDataSorted$index= as.numeric(rownames(GroupDataSorted))

GroupDataSorted = melt(data = GroupDataSorted, id.vars = c("IndividualID", "PopGroup", "EastAsian", "European", "African", "index"), measure.vars = c("EastAsian", "European", "African"))

colnames(GroupDataSorted)[7] <- 'Ancestry'
colnames(GroupDataSorted)[8] <- 'AncestryFraction'

print(GroupDataSorted[c('IndividualID', "PopGroup", "Ancestry", "AncestryFraction", "index")])

This next step is not necessary, but I did it for simplicity for the final plot. Here I am subsetting my entire table to extract only admixed samples, which are our samples of interest in this example:

GroupDataSorted2 = GroupDataSorted[grep("A", GroupDataSorted$IndividualID),]
#GroupDataSorted2$index= as.numeric(rownames(GroupDataSorted2))

Lastly, we can now create our admixture plot!

colors  <- c("African" = "blue", "European" = "orange", "EastAsian" = "red")

Admix_Plot = ggplot(data=GroupDataSorted2, aes(x=as.character(index), y=AncestryFraction, fill=Ancestry)) +
    geom_bar(stat="identity", width=1) + facet_grid(cols = vars(PopGroup), scales = "free", space = "free", drop = TRUE, switch="both") +
    scale_fill_manual(values=colors) + 
    labs(y="Perxcentage Ancestry Estimate", title = "Admixture plot for four population groups from 1KGP", subtitle = "Groups: ACB, ASW\nRef Groups: CEU, CHB, ESN, GBR, JPT, YRI") +
    theme(axis.title.x=element_blank(), axis.text.x=element_blank(), axis.ticks.x=element_blank(), panel.spacing.x=unit(1, "lines"), 
        strip.background = element_blank(), panel.background = element_blank(), axis.text=element_text(size=12),
        axis.title=element_text(size=18), strip.text.x = element_text(size = 18), title = element_text(size = 20))

Final Plot:

Alt text