Principal Component Analysis (PCA)

Before we can perform ancestry inference, we need to run PCA on our data. PCA reduces the dimensionality of the genotype data to a small number of principal components, the axes along which the samples vary the most.

We will run the PCA calculations with PLINK.

File Types

We will use three file types together to run PCA.

  1. BED file

    A BED file is a binary biallelic genotype table. It is not human-readable, but PLINK uses it for PCA and other operations. More Information

  2. BIM file

    A BIM file is a text file containing information on every variant in a given dataset, such as its chromosome and variant identifier. More Information

  3. FAM file

    A FAM file is a text file containing information on each person in the dataset. More Information
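Before running PCA it can help to confirm that all three files are present. The sketch below is a hypothetical helper, not part of PLINK: `check_bfile` and its output format are made up for illustration. It relies only on the fact that the BIM file has one line per variant and the FAM file has one line per sample.

```shell
# Hypothetical helper (not part of PLINK): check that PREFIX.bed, PREFIX.bim
# and PREFIX.fam all exist, then report how many variants and samples they hold.
check_bfile() {
    prefix="$1"
    for ext in bed bim fam; do
        if [ ! -f "${prefix}.${ext}" ]; then
            echo "missing: ${prefix}.${ext}" >&2
            return 1
        fi
    done
    # One line per variant in the BIM file, one line per sample in the FAM file.
    variants=$(awk 'END { print NR }' "${prefix}.bim")
    samples=$(awk 'END { print NR }' "${prefix}.fam")
    echo "variants=${variants} samples=${samples}"
}

# Usage (replace mydata with the prefix of your own files):
#   check_bfile mydata
```

The BED file is only checked for existence, since it is binary and cannot be inspected line by line.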

Running PCA

Here is a script to run a simple PCA:

plink --bfile {myfile} --keep-allele-order --pca 20 --out {output}

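//Outputs
    - {output}.eigenvec (one row per sample: family ID, individual ID, then one column per principal component)
    - {output}.eigenval (one eigenvalue per line, largest first)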
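The eigenvalue file can be turned into a rough share of variance per component. This is a minimal sketch, assuming the file holds one eigenvalue per line; the function name `pve` is hypothetical. Because `--pca 20` writes only the top 20 eigenvalues, the shares are relative to the sum of the reported eigenvalues, not to the total variance in the dataset.

```shell
# Hypothetical helper: print each component's share of the summed eigenvalues.
# Assumes one eigenvalue per line; blank lines are skipped.
pve() {
    awk 'NF { v[++n] = $1; total += $1 }
         END {
             if (total > 0)
                 for (i = 1; i <= n; i++)
                     printf "PC%d\t%.4f\n", i, v[i] / total
         }' "$1"
}

# Usage (replace output with the value passed to --out):
#   pve output.eigenval
```

A sharp drop in these values after the first few components is a common, informal guide to how many components are worth plotting.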

Note: --bfile takes the shared prefix of the three PLINK files (BED, BIM, FAM), so all three must have the same file name apart from the extension. PCA can also be run on a VCF file by using the --vcf flag instead of --bfile.
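To illustrate the difference between the two input flags, the sketch below builds the command for either kind of input and prints it rather than running it. The function `pca_cmd` is hypothetical, and choosing the flag from the file extension is an assumption made for this example, not something PLINK does itself.

```shell
# Hypothetical helper: print (not run) the PCA command for a given input.
# Inputs ending in .vcf or .vcf.gz get --vcf; anything else is treated as a
# BED/BIM/FAM prefix and gets --bfile.
pca_cmd() {
    input="$1"
    out="$2"
    npcs="${3:-20}"   # number of components, 20 as in the command above
    case "$input" in
        *.vcf|*.vcf.gz) flag="--vcf" ;;
        *)              flag="--bfile" ;;
    esac
    echo "plink $flag $input --keep-allele-order --pca $npcs --out $out"
}

# Usage:
#   pca_cmd mydata output          # BED/BIM/FAM prefix
#   pca_cmd mydata.vcf.gz output   # VCF input
```

Printing the command first makes it easy to inspect before pasting it into a terminal or job script.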

Options

By default, PLINK may reassign which allele is treated as A1 when it loads data (it sets A1 to the minor allele), and the effect can differ between input file types. For this reason we recommend adding the --keep-allele-order flag to every plink command.

There are several options to fine-tune your PCA command depending on the input data; more information on that here.

Plotting

More information on plotting PCA results is in the plotting section.
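Most plotting tools are easier to use with a header row. The sketch below assumes the eigenvector file was written without a header (the default for the command above) and adds one, converting the columns to tab-separated text. The function name `add_pc_header` and the column labels are hypothetical.

```shell
# Hypothetical helper: prepend a header row (FID, IID, PC1..PCn) to an
# eigenvector file and write it tab-separated to standard output.
# Assumes the input has no header row of its own.
add_pc_header() {
    awk 'BEGIN { OFS = "\t" }
         NR == 1 {
             printf "FID\tIID"
             for (i = 3; i <= NF; i++) printf "\tPC%d", i - 2
             printf "\n"
         }
         { $1 = $1; print }' "$1"   # $1 = $1 rebuilds the line using tabs
}

# Usage (replace output with the value passed to --out):
#   add_pc_header output.eigenvec > output.eigenvec.tsv
```

If the file already has a header row, skip this step, otherwise the header would appear twice.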