PRSice

About the tool

Polygenic risk score (PRS) software for calculating, applying, evaluating, and plotting the results of PRS analyses
All approaches to PRS calculation involve parameter optimization (for example, selecting the best P-value threshold), so the reported model fit is overfitted
The scores are normalized

Calculation of PRS

Let S be the summary statistic for the effective allele and G the number of effective alleles observed (0, 1, or 2). The models differ mainly in how the genotypes are coded:

For the additive model (add)

$$ G = G $$

For the dominant model (with respect to the effective allele of the base file)

$$ G = \begin{cases} 0 & \text{if } G = 0 \\ 1 & \text{otherwise} \end{cases} $$

For the recessive model (with respect to the effective allele of the base file)

$$ G = \begin{cases} 1 & \text{if } G = 2 \\ 0 & \text{otherwise} \end{cases} $$

For the heterozygous model

$$ G = \begin{cases} 1 & \text{if } G = 1 \\ 0 & \text{otherwise} \end{cases} $$
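
The four codings can be sketched as a small lookup. This is a minimal illustration, not PRSice's own code; the function name and the short labels "dom", "rec", and "het" are chosen for this sketch (only "add" appears in the notes above).

```python
def recode_genotype(g: int, model: str = "add") -> int:
    """Recode an effective-allele count (0, 1, or 2) under a genetic model.

    The model labels other than "add" are assumptions made for this sketch.
    """
    if g not in (0, 1, 2):
        raise ValueError("genotype must be 0, 1, or 2")
    if model == "add":   # additive: keep the allele count as is
        return g
    if model == "dom":   # dominant: any copy of the effective allele counts as 1
        return 0 if g == 0 else 1
    if model == "rec":   # recessive: only two copies count as 1
        return 1 if g == 2 else 0
    if model == "het":   # heterozygous: only exactly one copy counts as 1
        return 1 if g == 1 else 0
    raise ValueError("unknown model: " + model)


if __name__ == "__main__":
    for name in ("add", "dom", "rec", "het"):
        print(name, [recode_genotype(g, name) for g in (0, 1, 2)])
```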

The average score (--score avg) is calculated as

$$ PRS_j = \sum_i{\frac{S_i \times G_{ij}}{M_j}} $$
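
A minimal sketch of the average score for one individual, taking the score as the sum of S_i times G_ij divided by M_j. Two things here are assumptions of the sketch rather than statements about PRSice: missing genotypes (given as None) are simply skipped, and M_j is taken as the ploidy (2) times the number of SNPs that were not skipped.

```python
from typing import Optional, Sequence


def average_prs(effects: Sequence[float],
                genotypes: Sequence[Optional[int]],
                ploidy: int = 2) -> float:
    """Average PRS for one individual: sum_i(S_i * G_ij) / M_j."""
    if len(effects) != len(genotypes):
        raise ValueError("effects and genotypes must have the same length")
    # Assumption: SNPs with a missing genotype do not contribute to the score.
    used = [(s, g) for s, g in zip(effects, genotypes) if g is not None]
    # Assumption: M_j = ploidy * number of SNPs actually used.
    m_j = ploidy * len(used)
    if m_j == 0:
        raise ValueError("no genotypes available for this individual")
    return sum(s * g for s, g in used) / m_j


if __name__ == "__main__":
    # Two SNPs with effect sizes 0.5 and 0.25; the individual carries 2 and 0 copies.
    print(average_prs([0.5, 0.25], [2, 0]))
```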

To account for the overfitting that occurs during parameter optimization, an empirical P-value is calculated as

$$ \text{Empirical-}P = \frac{\sum_{n=1}^{N} I(P_{null} < P_0) + 1}{N + 1} $$

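Here

I() = Indicator function
P0 = P-value of association at the best P-value threshold for the observed data
Pnull = P-value of association at the best P-value threshold under the null (one value per permutation)
N = Number of permutations performed (10000)
Mj = The number of alleles included in the PRS of the j-th individual (used in the average score, not in the empirical P-value)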
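
The empirical P-value can be sketched in a few lines, assuming the indicator counts the permutations whose best-threshold P-value is smaller than the observed one. The function name and arguments are illustrative, not part of PRSice.

```python
from typing import Sequence


def empirical_p(p_observed: float, p_null: Sequence[float]) -> float:
    """(number of null P-values below the observed one + 1) / (N + 1)."""
    n = len(p_null)
    if n == 0:
        raise ValueError("at least one permutation is required")
    # I(P_null < P_0) summed over all permutations.
    more_extreme = sum(1 for p in p_null if p < p_observed)
    return (more_extreme + 1) / (n + 1)


if __name__ == "__main__":
    # Three permutations, one of which beats the observed P-value of 0.01.
    print(empirical_p(0.01, [0.5, 0.005, 0.2]))
```

The +1 in the numerator and denominator keeps the result strictly above zero, so with N permutations the smallest attainable value is 1 / (N + 1).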

Evaluation metrics

R2 - Proportion of phenotypic variance explained on the observed scale; not adjusted for ascertainment
Nagelkerke's R2 - Pseudo-R2 statistic for logistic regression; automatically calculated by PRSice
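Cox and Snell R2 - Likelihood-based pseudo-R2 that cannot reach 1 for a binary outcome; Nagelkerke's R2 is this statistic rescaled so that its maximum is 1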
AUC - Area under the ROC curve; a measure of how well the genomic profile predicts a binary phenotype
Independent of the proportion of cases and controls in the sample cohort
Ranges from 0.5 (no better than chance) to 1 (perfect discrimination)
Odds ratio - The odds of being a case in each PRS group relative to a reference group
The PRS distribution is usually cut into deciles (unless otherwise defined), where each decile includes both cases and controls
The odds for each decile are compared with those of a reference decile (PRSice selects the middle one)
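
Two of the metrics above can be sketched directly from their textbook definitions. This is an illustration under stated assumptions, not PRSice's implementation: AUC is computed as the probability that a randomly chosen case scores higher than a randomly chosen control (ties count as one half), and Nagelkerke's R2 is computed from the log-likelihoods of a null and a full logistic model, which are assumed to be supplied by the caller.

```python
import math
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC by pairwise comparison of cases (label 1) and controls (label 0)."""
    cases = [s for s, y in zip(scores, labels) if y == 1]
    controls = [s for s, y in zip(scores, labels) if y == 0]
    if not cases or not controls:
        raise ValueError("need at least one case and one control")
    wins = 0.0
    for c in cases:
        for k in controls:
            if c > k:
                wins += 1.0
            elif c == k:
                wins += 0.5  # a tie counts as half a win
    return wins / (len(cases) * len(controls))


def nagelkerke_r2(ll_null: float, ll_full: float, n: int) -> float:
    """Cox and Snell R2 divided by its maximum attainable value."""
    cox_snell = 1.0 - math.exp(2.0 * (ll_null - ll_full) / n)
    max_cox_snell = 1.0 - math.exp(2.0 * ll_null / n)
    return cox_snell / max_cox_snell


if __name__ == "__main__":
    print(auc([0.9, 0.8, 0.3, 0.1], [1, 1, 0, 0]))
    print(nagelkerke_r2(ll_null=-69.3, ll_full=-60.0, n=100))
```

A value below 0.5 from this pairwise AUC means the score ranks cases below controls, which usually indicates a reversed effect direction rather than a useless predictor.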

Input

Base dataset
Target dataset
Target dataset type
Phenotype file
Covariates file (Including age and sex as covariates)
Filters (if any)

Flags that can be used to adjust the analysis

To filter SNPs in the base dataset by imputation INFO score and minor allele frequency (MAF): --base-info <Info Name>:<Info Threshold> and --base-maf <MAF Name>:<MAF Threshold>
To filter SNPs by minor allele frequency among founder samples in the target dataset: --maf
To use a BGEN type target dataset: --type bgen
In this case, the input target dataset is given by --target <bgen prefix>,<sample file>
A covariates file can be added using --cov
Non-numerical columns can be specified using --cov-factor
A phenotype file containing 0 or 1 for all individuals in the cohort has to be provided using --pheno
FID can be ignored using --ignore-fid
Clumping can be disabled using --no-clump
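
As a rough sketch of how these flags combine, the helper below assembles an argument list for a BGEN target. The executable name, all file names, and the helper itself are hypothetical. --base and --out do not appear in the notes above and are included on the assumption that the base file and output prefix are passed that way. Joining several --cov-factor columns with commas is likewise an assumption about the expected syntax.

```python
from typing import List, Optional, Sequence


def build_prsice_args(base: str,
                      bgen_prefix: str,
                      sample_file: str,
                      pheno: str,
                      out: str,
                      cov: Optional[str] = None,
                      cov_factors: Sequence[str] = (),
                      ignore_fid: bool = False,
                      no_clump: bool = False,
                      executable: str = "PRSice") -> List[str]:
    """Assemble a PRSice-style argument list (hypothetical helper)."""
    args = [executable,
            "--base", base,                                 # assumed flag, see lead-in
            "--target", bgen_prefix + "," + sample_file,    # <bgen prefix>,<sample file>
            "--type", "bgen",
            "--pheno", pheno,
            "--out", out]                                   # assumed flag, see lead-in
    if cov is not None:
        args += ["--cov", cov]
    if cov_factors:
        # Assumption: several categorical columns are given as a comma-separated list.
        args += ["--cov-factor", ",".join(cov_factors)]
    if ignore_fid:
        args.append("--ignore-fid")
    if no_clump:
        args.append("--no-clump")
    return args


if __name__ == "__main__":
    # All file names below are placeholders.
    print(" ".join(build_prsice_args("base.assoc", "target", "target.sample",
                                     "pheno.txt", "result",
                                     cov="cov.txt", cov_factors=["Sex"],
                                     ignore_fid=True)))
```

Building the command as a list rather than a single string avoids shell quoting problems if it is later passed to subprocess.run.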

Output

The tool generates multiple output files including:

[Name].summary - Contains the following fields:
Phenotype - Name of phenotype
Set - Name of gene set
Threshold - Best P-value threshold
PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment
Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment
Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment
Prevalence - Population prevalence as indicated by the user. "-" if not provided
Coefficient - Regression coefficient of the model. Can provide insight into the direction of the effect.
P - P value of the model fit
Num_SNP - Number of SNPs included in the model
Empirical-P - Only provided if a permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting
[Name].best - Contains the PRS for each individual at the best-fit P-value threshold. Its columns are:
FID
IID
In_Regression
PRS at best threshold of first set
PRS at best threshold of second set, ...
[Name].prsice - Contains the PRS model fit across thresholds
[Name].log - Contains all the commands used for the analysis and information regarding filtering, field selected, etc.
[Name].all.score if --all-score is specified - Contains the PRS for each individual at all thresholds and all sets. It looks as follows:
FID
IID
PRS for first set at first threshold
PRS for first set at second threshold, ...
[Name]_BARPLOT_[date].png if --bar-levels is specified
[Name]_HIGHRES_PLOT_[date].png if --fastscore is not specified
[Name]_STRATA_PLOT_[date].png if --quantile [number of quantile] is specified
Usually, --quantile 100 is specified with --quant-break
[Name]_STRATA_[date].txt - Provides the data used to draw the strata plot
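
A [Name].summary file can be read with a few lines of Python. The sketch assumes a whitespace-delimited table with a header row and no empty fields, and the example content is made up purely for illustration (it uses the columns listed above, not real results).

```python
from typing import Dict, List

# Made-up example content; the values are illustrative only.
EXAMPLE_SUMMARY = """\
Phenotype Set Threshold PRS.R2 Full.R2 Null.R2 Prevalence Coefficient P Num_SNP
- Base 0.05 0.021 0.034 0.013 - 0.42 0.0003 1520
"""


def read_summary(text: str) -> List[Dict[str, str]]:
    """Parse summary text into one dict per row, keyed by column name."""
    lines = [ln for ln in text.splitlines() if ln.strip()]
    if not lines:
        return []
    header = lines[0].split()
    rows = []
    for ln in lines[1:]:
        fields = ln.split()
        if len(fields) != len(header):
            raise ValueError("row does not match header: " + ln)
        rows.append(dict(zip(header, fields)))
    return rows


if __name__ == "__main__":
    for row in read_summary(EXAMPLE_SUMMARY):
        # Values stay as strings because fields such as Prevalence may be "-".
        print(row["Set"], row["Threshold"], float(row["PRS.R2"]), float(row["P"]))
```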

Source

https://choishingwan.github.io/PRSice/