PRSice
About the tool
- Polygenic risk score (PRS) software for calculating, applying, evaluating, and plotting the results of PRS analyses
- All approaches to PRS calculation involve parameter optimization and are therefore overfitted
- The scores are normalized
Calculation of PRS
Assuming S is the summary statistic for the effective allele and G is the number of the effective alleles observed, then the main difference between the models is how the genotypes are coded,
For additive model (add)
$$ G =G $$
For dominant model (with respect to the effective allele of the base file)
$$ G=\begin{cases} \text{0} & \text{if $G=0$}\ \text{1} & \text{Otherwise} \end{cases} $$
For recessive model (with respect to the effective allele of the base file)
$$ G=\begin{cases} \text{1} & \text{if $G=2$}\ \text{0} & \text{Otherwise} \end{cases} $$
For heterozygous model
$$ G=\begin{cases} \text{1} & \text{if $G=1$}\ \text{0} & \text{Otherwise} \end{cases} $$
The average score (--score avg
)is calculated as
$$ PRS_j = \sum_i{\frac{S_i - G_{ij}}{M_{ij}}} $$
To account for overfitting that occurs during parameter optimization, empirical P-value is calculated as
$$ Empirical-P = \frac{\sum_{n=1}^\N I(P_{null}-P_0)+1}{N+1} $$
Here
- I() = Indicator function
- P0 = P-value of association of best P-value threshold
- Pnull = P-value of association of the best P-value threshold under the null
- N = 10000
- Mj = The number of alleles included in the PRS of the j-th individual
Evaluation metrics
- R2 - Observed phenotypic variance; remains unadjusted
- Nagelkerke's R2 - Pseudo-R2 statistic for logistic regression; automatically calculated by PRSice
- Cox and Snell R2 - Used for a quantitative trait
- AUC - A measure of how well the genomic profile predicts a binary phenotype
- Independent of the proportion of cases and controls in the sample cohort
- Ranges from 0.5 to 1
- Odds ratio - It shows the odds of the occurrence of a case in each group
- Distribution is usually cut into deciles (unless otherwise defined) where each decile includes both cases and controls
- The odds ratio for each decile is compared to a reference decile (PRSice selects the middle one)
Input
- Base dataset
- Target dataset
- Target dataset type
- Phenotype file
- Covariates file (Including age and sex as covariates)
- Filters (if any)
Flags that can be used to toggle with the output
- To filter SNPs based on base maximum allele frequency:
--base-info <Info Name>:<Info Threshold> and --base-maf <MAF Name>:<MAF Threshold>
- To filter SNPs based on founder samples (target dataset) maximum allele frequency:
--maf
- To use a BGEN type target dataset:
--type bgen
- In this case, the input target dataset is given by
--target <bgen prefix>,<sample file>
- A covariates file can be added using
--cov
- Non-numerical columns can be specified using
--cov-factor
- A phenotype file containing 0 or 1 for all individuals in the cohort has to be provided using
--pheno
- FID can be ignored using
--ignore-fid
- Clumping can be disabled using
--no-clump
Output
The tool generates multiple output files including:
- [Name].summary - Contains the following fields:
- Phenotype - Name of phenotype
- Set - Name of gene set
- Threshold - Best P-value threshold
- PRS.R2 - Variance explained by the PRS. If prevalence is provided, this will be adjusted for ascertainment
- Full.R2 - Variance explained by the full model (including the covariates). If prevalence is provided, this will be adjusted for ascertainment
- Null.R2 - Variance explained by the covariates. If prevalence is provided, this will be adjusted for ascertainment
- Prevalence - Population prevalence as indicated by the user. "-" if not provided
- Coefficient - Regression coefficient of the model. Can provide insight into the direction of the effect.
- P - P value of the model fit
- Num_SNP - Number of SNPs included in the model
- Empirical-P - Only provided if a permutation is performed. This is the empirical p-value and should account for multiple testing and over-fitting
- [Name].best - Contains the PRS for each individual at the best-fit PRS. It looks as follows:
- FID
- IID
- In_Regression
- PRS at best threshold of first set
- PRS at best threshold of second set, ...
- [Name].prsice - Contains the PRS model fit across thresholds
- [Name].log - Contains all the commands used for the analysis and information regarding filtering, field selected, etc.
- [Name].all.score if
--all-score
is specified - Contains the PRS for each individual at all thresholds and all sets. It looks as follows: - FID
- IID
- PRS for first set at first threshold
- PRS for first set at second threshold, ...
- [Name]_BARPLOT_[date].png if
--bar-levels
is specified - [Name]_HIGHRES_PLOT_[date].png if
--fastscore
is not specified - [Name]_STRATA_PLOT_[date].png if
--quantile [number of quantile]
is specified - Usually,
--quantile 100
is specified with--quant-break
- [Name]_STRATA_[date].txt- Provides data used for plotting the above plot