Data Harmonization

Data harmonization is important because it ensures that the reference file and the query (test) file contain the same variants before they are merged. This step is crucial because the merged dataset is what allows us to run PCA and Rye/Admixture.

Commands

Below is a generalized framework to harmonize and merge reference and query data so that PCA and ancestry inference can be run:

# This command extracts the variants that passed LD pruning.
# Run the same command on the reference dataset.
plink --bfile {Query Dataset} --keep-allele-order --extract {Variant List} --make-bed --out {Output Filename}


# Next, find the overlapping variants (R code)
library(data.table)  # provides fread()

# .bim files have no header row; column V2 holds the variant ID
reference_bim = as.data.frame(fread("{Path to reference .bim file}", header=FALSE))
query_bim = as.data.frame(fread("{Path to query .bim file}", header=FALSE))


# Extract the variant IDs present in both files
common_snps = intersect(reference_bim$V2, query_bim$V2)


# Write the overlapping variant IDs to a new text file, one per line
write.table(common_snps, file="{Overlapping Variant file path}", sep="\t", col.names=F, row.names=F, quote=F)


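# Now restrict both datasets to the overlapping variants.
# Use a different output name for each command so the second does not overwrite the first.
plink --bfile {Query Dataset} --keep-allele-order --extract {Overlap List} --make-bed --out {Query Output Filename}
plink --bfile {Reference Dataset} --keep-allele-order --extract {Overlap List} --make-bed --out {Reference Output Filename}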


# Finally, merge the files
plink --bfile {Query file path} --keep-allele-order --bmerge {Reference file path} --out {Outfile name}
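
The overlap step above uses R. If R is not available, the same list can be produced with standard shell tools. The following is a minimal sketch, not part of the original workflow: the function name `overlap_variants` and its argument order are hypothetical, and it assumes both .bim files are whitespace-delimited with the variant ID in column 2.

```shell
# Hypothetical helper (not a PLINK command): write the variant IDs that
# appear in both .bim files to an output file, one ID per line.
# Usage: overlap_variants <reference.bim> <query.bim> <output list>
overlap_variants() {
    ref_bim=$1
    query_bim=$2
    out_list=$3

    tmp_ref=$(mktemp)
    tmp_query=$(mktemp)

    # Column 2 of a .bim file is the variant ID. comm needs sorted input,
    # so sort both ID lists with the same (C) collation and drop duplicates.
    awk '{print $2}' "$ref_bim" | LC_ALL=C sort -u > "$tmp_ref"
    awk '{print $2}' "$query_bim" | LC_ALL=C sort -u > "$tmp_query"

    # comm -12 keeps only the lines common to both sorted files.
    LC_ALL=C comm -12 "$tmp_ref" "$tmp_query" > "$out_list"

    rm -f "$tmp_ref" "$tmp_query"
}
```

The resulting file can be passed to `--extract` in place of the list written by the R code. Note that this matches on variant ID only; it does not compare positions or alleles.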

Running the full workflow creates the following files:

  1. Merged dataset

    The PLINK binary fileset ({Outfile name}.bed/.bim/.fam) that combines the reference and query datasets.

  2. Overlap variant list

    The list of variants shared by the query and reference datasets.
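
As a sanity check, it can be useful to confirm that the merged dataset contains exactly the variants in the overlap list. The sketch below is one possible way to do that with standard shell tools; the function name `check_merged_variants` is hypothetical, and it assumes the variant ID sits in column 2 of the merged .bim file and that the overlap list has one ID per line.

```shell
# Hypothetical helper (not a PLINK command): succeed (exit status 0) only if
# the set of variant IDs in the merged .bim equals the set in the overlap list.
# Usage: check_merged_variants <merged.bim> <overlap list>
check_merged_variants() {
    merged_bim=$1
    overlap_list=$2

    tmp_merged=$(mktemp)
    tmp_overlap=$(mktemp)

    # Normalise both inputs to sorted, de-duplicated ID lists.
    awk '{print $2}' "$merged_bim" | LC_ALL=C sort -u > "$tmp_merged"
    LC_ALL=C sort -u "$overlap_list" > "$tmp_overlap"

    # cmp -s is silent and returns 0 only when the two files are identical.
    if cmp -s "$tmp_merged" "$tmp_overlap"; then
        status=0
    else
        status=1
    fi

    rm -f "$tmp_merged" "$tmp_overlap"
    return "$status"
}
```

A non-zero status means some variants were added or dropped between the overlap step and the merge, which is worth investigating before running PCA or ancestry inference.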