Quality Control

Quality control (QC) is performed both before and after harmonizing/merging datasets, and is a vital step in ensuring that the data is suitable for ancestry inference.

PLINK provides several quality control filters that can be applied to a dataset. Choose filters and thresholds on a case-by-case basis; it is worth learning more about each filter before applying it.

Variant Level QC

Variant-level QC removes individual variants from the dataset based on per-variant statistics such as missingness, allele frequency, or deviation from Hardy-Weinberg equilibrium.

HWE Filter

Excludes variants whose Hardy-Weinberg equilibrium exact test p-value is below the given threshold.

--hwe <threshold>

// The threshold is a p-value, typically written in scientific notation (e.g. 1e-30)
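To make the idea of deviation from Hardy-Weinberg equilibrium concrete, here is a minimal Python sketch that compares observed genotype counts with the counts expected under HWE. It uses a simple chi-square statistic for illustration only; it is not the exact test that --hwe applies, and the function names are hypothetical.

```python
def hwe_expected_counts(n_aa, n_ab, n_bb):
    """Return expected (AA, AB, BB) counts under Hardy-Weinberg equilibrium."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p                        # frequency of allele B
    return (p * p * n, 2 * p * q * n, q * q * n)


def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square statistic of observed vs. expected genotype counts.

    Illustration only: PLINK's --hwe uses an exact test, not this statistic.
    """
    observed = (n_aa, n_ab, n_bb)
    expected = hwe_expected_counts(n_aa, n_ab, n_bb)
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)


if __name__ == "__main__":
    # A variant with no heterozygotes at all deviates strongly from HWE.
    print(hwe_chi_square(50, 0, 50))
    # A variant whose counts match the expectation exactly does not.
    print(hwe_chi_square(25, 50, 25))
```

A larger statistic corresponds to a smaller p-value, so variants like the first example are the ones a strict threshold is meant to catch.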

Indel Filter

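Removes indels (insertions and deletions, i.e. variants with an allele longer than one base). The --exclude flag removes every variant whose ID is listed in the given text file, so the list of indel IDs has to be generated first.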

--exclude <txt file with list of excluded variants>
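One way to build the exclusion list is to scan the .bim file for variants where either allele is longer than one character. The following is a minimal Python sketch under that assumption; the function name and the example variant IDs are hypothetical, and datasets that encode indels differently (for example with symbolic alleles) would need a different check.

```python
def indel_ids(bim_lines):
    """Return IDs of variants whose alleles are longer than one character.

    Each .bim line has six whitespace-separated fields:
    chromosome, variant ID, cM position, bp position, allele 1, allele 2.
    """
    ids = []
    for line in bim_lines:
        fields = line.split()
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        variant_id, allele1, allele2 = fields[1], fields[4], fields[5]
        if len(allele1) > 1 or len(allele2) > 1:
            ids.append(variant_id)
    return ids


if __name__ == "__main__":
    # Hypothetical .bim content; in practice, read the lines from your own file.
    example = [
        "1 rs1 0 100 A G",
        "1 rs2 0 200 AT A",
        "2 rs3 0 300 C CTT",
    ]
    # One variant ID per line is the format --exclude expects.
    print("\n".join(indel_ids(example)))
```

Writing the printed IDs to a text file gives a list that can be passed to --exclude.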

LD Pruning

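LD pruning removes variants that are in high linkage disequilibrium (LD) with nearby variants, leaving a subset of approximately independent variants. Within each window, variants are pruned until no pair has an r^2 above the threshold. Note that --indep-pairwise only writes lists of retained and pruned variant IDs (.prune.in and .prune.out files); the pruning is applied in a second run with --extract or --exclude. More on linkage disequilibrium here: https://www.nature.com/articles/nrg2361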

--indep-pairwise <window size> <step size> <r^2 threshold>
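The r^2 threshold refers to the squared correlation between two variants. As a rough illustration, the Python sketch below computes r^2 from allele counts (0, 1, or 2 copies of an allele per sample). It is a simplified sketch that assumes no missing genotypes; the function name is hypothetical, and it does not reproduce the windowed pruning itself.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two variants' allele counts.

    x and y are equal-length lists of 0/1/2 allele counts, one per sample.
    Assumes no missing genotypes.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # a monomorphic variant carries no correlation information
    return (cov * cov) / (var_x * var_y)


if __name__ == "__main__":
    variant_a = [0, 1, 2, 1]
    variant_b = [0, 1, 2, 1]  # identical to variant_a: perfectly correlated
    variant_c = [0, 2, 0, 2]
    variant_d = [0, 0, 2, 2]  # uncorrelated with variant_c
    print(r_squared(variant_a, variant_b))
    print(r_squared(variant_c, variant_d))
```

With a threshold of, say, 0.2, one variant of the first pair would be pruned while the second pair would both be kept.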

Geno Filter

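Removes variants whose missing call rate exceeds the given threshold. The threshold is a proportion between 0 and 1 (e.g. 0.1 for 10%), not a percentage.

--geno <missingness threshold>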
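The effect of this filter can be sketched in a few lines of Python. The example assumes genotypes are stored per variant as a list of calls, with None marking a missing call; the data layout and function names are hypothetical and only meant to show the logic.

```python
def variant_missing_rate(calls):
    """Fraction of samples with a missing call (None) for one variant."""
    return sum(1 for g in calls if g is None) / len(calls)


def apply_geno_filter(variants, threshold):
    """Keep variants whose missing call rate does not exceed the threshold.

    variants maps a variant ID to its list of genotype calls.
    """
    return {
        variant_id: calls
        for variant_id, calls in variants.items()
        if variant_missing_rate(calls) <= threshold
    }


if __name__ == "__main__":
    variants = {
        "rs1": [0, 1, None, 2],     # 25% missing
        "rs2": [0, 1, 2, 2],        # 0% missing
        "rs3": [None, None, 1, 0],  # 50% missing
    }
    kept = apply_geno_filter(variants, threshold=0.25)
    print(sorted(kept))
```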

MAF Filter

Only includes variants whose minor allele frequency (MAF) is at or above the given threshold; rarer variants are removed.

--maf <allele frequency threshold>
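As a small illustration of what is being thresholded, the Python sketch below computes the minor allele frequency of a single variant from allele counts (0, 1, or 2 copies of one allele per sample, with None for missing calls). The function name and data layout are hypothetical.

```python
def minor_allele_frequency(calls):
    """Minor allele frequency of one variant.

    calls holds 0/1/2 counts of one allele per sample; None marks a missing call.
    Missing calls are left out of both the numerator and the denominator.
    """
    observed = [g for g in calls if g is not None]
    if not observed:
        return 0.0
    allele_freq = sum(observed) / (2 * len(observed))
    # The minor allele is whichever of the two alleles is less common.
    return min(allele_freq, 1 - allele_freq)


if __name__ == "__main__":
    print(minor_allele_frequency([0, 0, 1, 2]))     # 3 of 8 alleles
    print(minor_allele_frequency([2, 2, 2, 1]))     # 7 of 8, so the other allele is minor
    print(minor_allele_frequency([0, None, 0, 0]))  # monomorphic among observed calls
```

A variant is kept by the filter when this value is at least the chosen threshold.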

Sample Level QC

Sample-level QC is quality control done on individual samples (the people in the dataset), rather than on specific variants.

Mind Filter

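Removes individuals whose proportion of missing genotypes exceeds the given threshold. The threshold is a proportion between 0 and 1 (e.g. 0.1 for 10%), not a percentage.

--mind <proportion of missing genotypes>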
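The per-sample calculation mirrors the per-variant one, but counts missing calls down each sample's column instead of across each variant's row. The Python sketch below assumes a genotype matrix with one row per variant and one column per sample, with None for missing calls; the names are hypothetical.

```python
def sample_missing_rates(matrix, sample_ids):
    """Map each sample ID to its fraction of missing genotype calls.

    matrix has one row per variant and one column per sample.
    """
    n_variants = len(matrix)
    rates = {}
    for col, sample_id in enumerate(sample_ids):
        missing = sum(1 for row in matrix if row[col] is None)
        rates[sample_id] = missing / n_variants
    return rates


def samples_to_remove(matrix, sample_ids, threshold):
    """Return samples whose missing call rate exceeds the threshold."""
    rates = sample_missing_rates(matrix, sample_ids)
    return [sid for sid in sample_ids if rates[sid] > threshold]


if __name__ == "__main__":
    sample_ids = ["s1", "s2", "s3"]
    matrix = [
        [0, None, 1],
        [1, None, 2],
        [2, 0, None],
        [0, 1, 1],
    ]
    print(samples_to_remove(matrix, sample_ids, threshold=0.25))
```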

QUAL Filter

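Excludes variants whose quality score is below the given threshold. Note that this filter acts on variants rather than on samples. In PLINK 1.9, --qual-threshold is used together with --qual-scores, which names the file containing the per-variant quality scores.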

--qual-threshold <quality threshold>