Data Harmonization
Data harmonization is important because it ensures that the reference dataset and the query (test) dataset contain the same variants before they are merged. This is a crucial step, because PCA and Rye/Admixture are then run on the merged dataset.
Commands
Below is a generalized framework to harmonize and merge reference and query data so that PCA and ancestry inference can be run:
# Extract the variants that pass LD pruning from the query dataset.
# Run the same command on the reference dataset.
plink --bfile {Query Dataset} --keep-allele-order --extract {Variant List} --make-bed --out {Output Filename}
# Next, find the overlapping variants (R code)
library(data.table)  # provides fread()
reference_bim = as.data.frame(fread("{Path to reference .bim file}"))
query_bim = as.data.frame(fread("{Path to query .bim file}"))
# Variant IDs (column V2 of a .bim file) that appear in both files
common_snps = intersect(reference_bim$V2, query_bim$V2)
# Write the overlapping variant IDs to a new text file, one ID per line
write.table(common_snps, file="{Overlapping Variant file path}", sep="\t", col.names=F, row.names=F, quote=F)
# Now restrict both datasets to the overlapping variants
# (use a different output name for each dataset so one does not overwrite the other)
plink --bfile {Query Dataset} --keep-allele-order --extract {Overlap List} --make-bed --out {Query Output Filename}
plink --bfile {Reference Dataset} --keep-allele-order --extract {Overlap List} --make-bed --out {Reference Output Filename}
# Finally, merge the files
plink --bfile {Query file path} --keep-allele-order --bmerge {Reference file path} --out {Outfile name}
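The overlap step in the commands above can also be sketched without R, which may be convenient when the rest of the workflow is already a shell script. This is a minimal sketch, not part of the original workflow: the helper name `overlap_ids`, the toy `.bim` files, and the rsIDs in them are all made up for illustration. It only assumes that the variant ID is in column 2 of each `.bim` file.

```shell
# Hypothetical helper: print the variant IDs (column 2 of a .bim file)
# that appear in both files. $1 = reference .bim, $2 = query .bim.
# IDs are printed in the order in which they occur in the query file.
overlap_ids() {
    awk 'NR == FNR { seen[$2] = 1; next } ($2 in seen) { print $2 }' "$1" "$2"
}

# Toy .bim files with made-up values
# (columns: chromosome, variant ID, cM position, bp position, allele 1, allele 2).
workdir=$(mktemp -d)
printf '1\trs1\t0\t100\tA\tG\n1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tG\tA\n' > "$workdir/reference.bim"
printf '1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tG\tA\n1\trs4\t0\t400\tT\tC\n' > "$workdir/query.bim"

# Write the shared IDs, one per line, in a form that --extract can read.
overlap_ids "$workdir/reference.bim" "$workdir/query.bim" > "$workdir/overlap.txt"
cat "$workdir/overlap.txt"
```

On the toy files only rs2 and rs3 are shared, so those are the two IDs written to the overlap list. On real data, the two `.bim` paths would be the LD-pruned reference and query files.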
Running these commands produces the following files:
- merged dataset: the dataset that combines the reference and query datasets
- overlap variant list: the list of variants shared by the query and reference datasets
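Before moving on to PCA, it can be worth checking that the merged dataset contains exactly the variants in the overlap list. The following is a minimal sketch under stated assumptions: the helper name `bim_matches_list` and the toy files are hypothetical, and the check assumes the merge was expected to keep every overlapping variant.

```shell
# Hypothetical helper: succeeds (exit status 0) only if the .bim file ($1)
# contains exactly the variant IDs listed in the text file ($2), ignoring order.
bim_matches_list() {
    bim_ids=$(awk '{ print $2 }' "$1" | sort)
    list_ids=$(sort "$2")
    [ "$bim_ids" = "$list_ids" ]
}

# Toy files with made-up values standing in for the merged .bim and the overlap list.
workdir=$(mktemp -d)
printf '1\trs2\t0\t200\tC\tT\n1\trs3\t0\t300\tG\tA\n' > "$workdir/merged.bim"
printf 'rs3\nrs2\n' > "$workdir/overlap.txt"

if bim_matches_list "$workdir/merged.bim" "$workdir/overlap.txt"; then
    echo "merged dataset matches the overlap list"
else
    echo "variant mismatch: inspect the PLINK log for the merge"
fi
```

A mismatch does not say which step went wrong; it only signals that the merged `.bim` file and the overlap list should be compared more closely.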