Statistical Methods in QSAR and Validation Metrics MCQs With Answers

Introduction:

This quiz collection covers statistical methods in QSAR (Quantitative Structure–Activity Relationship) modelling and the associated validation metrics, and is intended for M.Pharm students studying Computer Aided Drug Design. It addresses regression techniques, descriptor selection, multicollinearity, cross-validation strategies, and the external validation metrics used to judge model robustness and predictivity. Emphasis is placed on practical validation criteria (R2, Q2, RMSE, CCC, applicability domain, leverage, and Tropsha’s rules) so that students can critically evaluate the reliability of QSAR models and avoid pitfalls such as overfitting and chance correlation. These MCQs reinforce theoretical understanding and prepare students to apply rigorous statistical checks in real QSAR projects.

Q1. Which statistical metric primarily measures the proportion of variance in the observed activity explained by the QSAR model?

Root Mean Square Error (RMSE)
Concordance Correlation Coefficient (CCC)
Coefficient of determination (R2)
Mean Absolute Error (MAE)

Correct Answer: Coefficient of determination (R2)
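
As an illustration of Q1, the following is a minimal sketch of how R2 and its descriptor-penalised (adjusted) form can be computed. The activity values are hypothetical and invented only for this example.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_compounds, n_descriptors):
    """Penalise R2 for the number of descriptors in the model."""
    return 1.0 - (1.0 - r2) * (n_compounds - 1) / (n_compounds - n_descriptors - 1)


# Hypothetical pIC50 values, for illustration only.
observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]

r2 = r_squared(observed, predicted)
r2_adj = adjusted_r_squared(r2, n_compounds=4, n_descriptors=1)
print(f"R2 = {r2:.2f}, adjusted R2 = {r2_adj:.2f}")
```

The adjusted value is lower than R2 because it discounts the fit for every descriptor added to the model.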

Q2. Which cross-validation method leaves out one compound at a time to estimate internal predictivity?

k-fold cross-validation
Leave-One-Out (LOO) cross-validation
Bootstrapping
External validation

Correct Answer: Leave-One-Out (LOO) cross-validation
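
The LOO procedure in Q2 can be sketched for a one-descriptor linear model as follows. This is a simplified illustration, not a full QSAR workflow: the data are hypothetical, and Q2 is computed here as 1 - PRESS / SS_tot using the mean of all compounds (some authors use the training-fold mean instead).

```python
def fit_line(x, y):
    """Least-squares slope and intercept for one descriptor."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def q2_loo(x, y):
    """Leave-one-out Q2: refit without compound i, then predict compound i."""
    press = 0.0
    for i in range(len(x)):
        x_train = x[:i] + x[i + 1:]
        y_train = y[:i] + y[i + 1:]
        slope, intercept = fit_line(x_train, y_train)
        press += (y[i] - (slope * x[i] + intercept)) ** 2
    mean_y = sum(y) / len(y)
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - press / ss_tot


# Hypothetical descriptor and activity values.
descriptor = [1.0, 2.0, 3.0, 4.0, 5.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1]
print(f"Q2(LOO) = {q2_loo(descriptor, activity):.3f}")
```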

Q3. Which metric indicates how closely predicted values agree with observed values considering both precision and accuracy?

Q2 (cross-validated R2)
Concordance Correlation Coefficient (CCC)
Variance Inflation Factor (VIF)
Area Under Curve (AUC)

Correct Answer: Concordance Correlation Coefficient (CCC)
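
To show why CCC in Q3 captures accuracy as well as precision, here is a small sketch of Lin's concordance correlation coefficient using population (1/n) variances. The values are hypothetical: the shifted predictions correlate perfectly with the observations, yet CCC drops below 1 because of the constant offset.

```python
def ccc(observed, predicted):
    """Lin's concordance correlation coefficient (population variances)."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    var_o = sum((o - mean_o) ** 2 for o in observed) / n
    var_p = sum((p - mean_p) ** 2 for p in predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted)) / n
    return 2.0 * cov / (var_o + var_p + (mean_o - mean_p) ** 2)


observed = [5.0, 6.0, 7.0, 8.0]
shifted = [o + 1.0 for o in observed]  # perfectly correlated but biased by +1

print(f"CCC, identical predictions: {ccc(observed, observed):.3f}")
print(f"CCC, shifted predictions:   {ccc(observed, shifted):.3f}")
```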

Q4. In QSAR model validation, what does a high Variance Inflation Factor (VIF > 10) indicate?

The model has excellent predictive performance
Severe multicollinearity among descriptors
Low external predictivity
Good agreement between observed and predicted values

Correct Answer: Severe multicollinearity among descriptors
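
The VIF threshold in Q4 can be illustrated with the two-descriptor special case, where VIF reduces to 1 / (1 - r^2) with r the Pearson correlation between the descriptors. In general, VIF for descriptor j is 1 / (1 - Rj^2), with Rj^2 obtained by regressing descriptor j on all the others. The descriptor values below are hypothetical.

```python
def pearson_r(a, b):
    """Pearson correlation between two descriptor columns."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    saa = sum((ai - mean_a) ** 2 for ai in a)
    sbb = sum((bi - mean_b) ** 2 for bi in b)
    sab = sum((ai - mean_a) * (bi - mean_b) for ai, bi in zip(a, b))
    return sab / (saa * sbb) ** 0.5


def vif_two_descriptors(x1, x2):
    """VIF shared by both descriptors when the model has exactly two."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r ** 2)


x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2_collinear = [1.1, 2.0, 2.9, 4.2, 5.0]      # nearly a copy of x1
x3_independent = [1.0, -1.0, 0.0, -1.0, 1.0]  # uncorrelated with x1

print(f"VIF (collinear pair):   {vif_two_descriptors(x1, x2_collinear):.1f}")
print(f"VIF (independent pair): {vif_two_descriptors(x1, x3_independent):.1f}")
```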

Q5. Which of the following tests is used to detect chance correlation by randomly permuting response values?

Y-randomization (Y-scrambling)
Williams plot
Leverage analysis
Principal Component Analysis (PCA)

Correct Answer: Y-randomization (Y-scrambling)
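
A minimal sketch of the Y-randomization test from Q5, assuming a one-descriptor linear model and hypothetical data: the response values are shuffled repeatedly, the model is refitted each time, and the R2 of the real model is compared with the average R2 of the scrambled models. The function names, the number of rounds, and the seed are illustrative choices.

```python
import random


def r_squared_simple(x, y):
    """R2 of a one-descriptor least-squares line (equals Pearson r squared)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    syy = sum((yi - my) ** 2 for yi in y)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    return sxy ** 2 / (sxx * syy)


def y_randomization(x, y, n_rounds=100, seed=0):
    """Return R2 of the real model and the mean R2 over scrambled responses."""
    rng = random.Random(seed)
    real_r2 = r_squared_simple(x, y)
    scrambled = []
    for _ in range(n_rounds):
        y_shuffled = list(y)      # copy so the original order is preserved
        rng.shuffle(y_shuffled)
        scrambled.append(r_squared_simple(x, y_shuffled))
    return real_r2, sum(scrambled) / n_rounds


descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
activity = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1]
real_r2, mean_scrambled_r2 = y_randomization(descriptor, activity)
print(f"real R2 = {real_r2:.3f}, mean scrambled R2 = {mean_scrambled_r2:.3f}")
```

A real model whose R2 is not clearly higher than the scrambled average would be suspected of chance correlation.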

Q6. Tropsha’s external validation criterion includes comparing R2 and R0^2; what does R0^2 refer to?

R-squared for model fitted with shuffled descriptors
R-squared for regression of predicted vs observed with intercept forced to zero
Cross-validated R-squared (Q2)
R-squared adjusted for number of descriptors

Correct Answer: R-squared for regression of predicted vs observed with intercept forced to zero
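
The quantity in Q6 can be sketched as below, together with the slope k of the same through-origin regression. Conventions differ between papers on which variable is regressed on which; this sketch assumes observed values are regressed on predicted values, and the data are hypothetical.

```python
def slope_through_origin(observed, predicted):
    """Slope k of the line observed = k * predicted (no intercept)."""
    numerator = sum(o * p for o, p in zip(observed, predicted))
    denominator = sum(p * p for p in predicted)
    return numerator / denominator


def r0_squared(observed, predicted):
    """R0^2: determination coefficient with the intercept forced to zero."""
    k = slope_through_origin(observed, predicted)
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - k * p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


observed = [5.0, 6.0, 7.0, 8.0]
predicted = [5.5, 5.5, 7.5, 7.5]
k = slope_through_origin(observed, predicted)
r0_2 = r0_squared(observed, predicted)
print(f"k = {k:.2f}, R0^2 = {r0_2:.2f}")
```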

Q7. Which metric quantifies average magnitude of prediction errors without emphasizing large errors?

Root Mean Square Error (RMSE)
Mean Absolute Error (MAE)
Coefficient of determination (R2)
Leverage

Correct Answer: Mean Absolute Error (MAE)
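
The difference between MAE and RMSE can be seen in a short sketch with hypothetical values: both prediction sets below have the same MAE, but the set containing one large error has a higher RMSE because residuals are squared before averaging.

```python
import math


def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


def rmse(observed, predicted):
    """Root mean square error."""
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / len(observed))


observed = [5.0, 6.0, 7.0, 8.0]
even_errors = [5.5, 6.5, 7.5, 8.5]       # every compound off by 0.5
one_large_error = [5.0, 6.0, 7.0, 10.0]  # one compound off by 2.0

print(mae(observed, even_errors), rmse(observed, even_errors))
print(mae(observed, one_large_error), rmse(observed, one_large_error))
```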

Q8. In applicability domain analysis, which plot displays standardized residuals versus leverage to identify outliers and influential compounds?

ROC curve
Williams plot
Box plot
Scree plot

Correct Answer: Williams plot

Q9. Which method reduces descriptor dimensionality by creating orthogonal linear combinations of descriptors?

Multiple Linear Regression (MLR)
Principal Component Analysis (PCA)
Y-randomization
Leverage calculation

Correct Answer: Principal Component Analysis (PCA)

Q10. Which validation metric is most appropriate for binary classification QSAR models and measures discrimination capability?

Concordance Correlation Coefficient (CCC)
Area Under the ROC Curve (AUC)
RMSE
Adjusted R-squared

Correct Answer: Area Under the ROC Curve (AUC)
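
AUC can be read as the probability that a randomly chosen active compound is scored higher than a randomly chosen inactive one. The sketch below computes it by direct pair counting, with ties counted as half; the labels and scores are hypothetical.

```python
def roc_auc(labels, scores):
    """AUC as the fraction of (active, inactive) pairs ranked correctly."""
    actives = [s for label, s in zip(labels, scores) if label == 1]
    inactives = [s for label, s in zip(labels, scores) if label == 0]
    wins = 0.0
    for a in actives:
        for i in inactives:
            if a > i:
                wins += 1.0
            elif a == i:
                wins += 0.5
    return wins / (len(actives) * len(inactives))


labels = [1, 1, 0, 1, 0, 0]              # 1 = active, 0 = inactive
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]  # predicted probability of activity
print(f"AUC = {roc_auc(labels, scores):.3f}")
```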

Q11. Which of the following is a recommended threshold for an acceptable cross-validated Q2 in QSAR predictive models?

Q2 > 0.1
Q2 > 0.9
Q2 > 0.5
Q2 < 0.0

Correct Answer: Q2 > 0.5

Q12. Which descriptor selection technique uses evolutionary processes (selection, crossover, mutation) to find an optimal subset?

Stepwise regression
Genetic algorithm (GA)
Principal Component Regression (PCR)
Leverage-based pruning

Correct Answer: Genetic algorithm (GA)

Q13. For external validation, which statistic quantifies the prediction error on the test set and, when compared with the training-set error, helps reveal overfitting or systematic bias?

Root Mean Square Error of Prediction (RMSEP)
Q2 (leave-one-out)
Variance Inflation Factor (VIF)
k-fold cross-validation score

Correct Answer: Root Mean Square Error of Prediction (RMSEP)

Q14. Which of the following indicates an influential compound in leverage analysis?

Leverage value much lower than warning leverage (h*)
Standardized residual close to zero
Leverage value greater than the warning leverage (h*)
High Q2 value

Correct Answer: Leverage value greater than the warning leverage (h*)
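
A small sketch of the leverage check, assuming a one-descriptor model with an intercept and the commonly used warning leverage h* = 3(p + 1)/n, where p is the number of descriptors and n the number of training compounds. The descriptor values are hypothetical; the last compound lies far from the others in descriptor space.

```python
def leverages_one_descriptor(x):
    """Hat-matrix diagonal for a model with an intercept and one descriptor."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    return [1.0 / n + (xi - mean_x) ** 2 / sxx for xi in x]


def warning_leverage(n_compounds, n_descriptors):
    """Warning leverage h* = 3(p + 1) / n."""
    return 3.0 * (n_descriptors + 1) / n_compounds


descriptor = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0, 30.0]
h = leverages_one_descriptor(descriptor)
h_star = warning_leverage(n_compounds=len(descriptor), n_descriptors=1)

# Indices of compounds whose leverage exceeds the warning leverage.
flagged = [i for i, hi in enumerate(h) if hi > h_star]
print(f"h* = {h_star:.2f}, flagged compounds: {flagged}")
```

In a Williams plot these leverage values form the horizontal axis, with standardized residuals on the vertical axis.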

Q15. Matthews Correlation Coefficient (MCC) is preferred for imbalanced classification because it:

Only measures sensitivity
Combines TP, TN, FP, FN into a single balanced metric
Is identical to accuracy
Depends solely on prevalence

Correct Answer: Combines TP, TN, FP, FN into a single balanced metric
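
The point of Q15 can be illustrated with a hypothetical confusion matrix for a data set of 5 actives and 95 inactives: accuracy looks excellent, while MCC shows that the actives are poorly recognised.

```python
import math


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; returns 0.0 if undefined."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator


def accuracy(tp, tn, fp, fn):
    """Fraction of compounds classified correctly."""
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical counts: only 1 of the 5 actives is found.
tp, fn, tn, fp = 1, 4, 94, 1
print(f"accuracy = {accuracy(tp, tn, fp, fn):.2f}")
print(f"MCC      = {mcc(tp, tn, fp, fn):.3f}")
```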

Q16. Which regression approach is most appropriate when predictors are highly collinear and the number of descriptors exceeds samples?

Ordinary Least Squares (OLS)
Partial Least Squares (PLS)
Univariate linear regression
Y-randomization

Correct Answer: Partial Least Squares (PLS)

Q17. In external validation, what does a slope k close to 1 in the regression of predicted versus observed values through the origin imply?

Severe systematic underestimation
Model bias toward the mean
Good agreement without systematic scaling bias
An overfitted model

Correct Answer: Good agreement without systematic scaling bias

Q18. Which metric is most sensitive to large individual prediction errors because it squares residuals?

Mean Absolute Error (MAE)
Root Mean Square Error (RMSE)
R2
Concordance Correlation Coefficient (CCC)

Correct Answer: Root Mean Square Error (RMSE)

Q19. What is the main purpose of using an external test set separate from training during QSAR modelling?

To increase model complexity
To estimate true predictive performance on unseen compounds
To perform descriptor scaling
To calculate variance inflation

Correct Answer: To estimate true predictive performance on unseen compounds

Q20. Which set of conditions indicates a reliable QSAR model according to the Golbraikh and Tropsha criteria?

High VIF values for descriptors
Q2 > 0.5, test-set R2 > 0.6, (R2 - R0^2)/R2 < 0.1 and 0.85 ≤ k ≤ 1.15
Very low RMSE only for training set
Q2 < 0.2

Correct Answer: Q2 > 0.5, test-set R2 > 0.6, (R2 - R0^2)/R2 < 0.1 and 0.85 ≤ k ≤ 1.15
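
As a sketch, the thresholds commonly quoted from Golbraikh and Tropsha (Q2 > 0.5, test-set R2 > 0.6, (R2 - R0^2)/R2 < 0.1, and 0.85 ≤ k ≤ 1.15) can be collected into a single check. The function name and the input statistics are hypothetical; the statistics are assumed to have been computed beforehand.

```python
def passes_golbraikh_tropsha(q2, r2_test, r0_squared, k):
    """Return (overall verdict, per-criterion results) for precomputed statistics."""
    checks = {
        "Q2 > 0.5": q2 > 0.5,
        "R2 > 0.6": r2_test > 0.6,
        # Guard against division by zero when the test-set R2 is not positive.
        "(R2 - R0^2)/R2 < 0.1": r2_test > 0 and (r2_test - r0_squared) / r2_test < 0.1,
        "0.85 <= k <= 1.15": 0.85 <= k <= 1.15,
    }
    return all(checks.values()), checks


# Hypothetical statistics for two models.
ok, details = passes_golbraikh_tropsha(q2=0.72, r2_test=0.81, r0_squared=0.79, k=1.02)
bad, bad_details = passes_golbraikh_tropsha(q2=0.72, r2_test=0.81, r0_squared=0.60, k=1.02)
print(ok, bad)
```

Returning the per-criterion results alongside the verdict makes it easy to see which condition a rejected model fails.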


G S Sachin

I am a Registered Pharmacist under the Pharmacy Act, 1948, and the founder of PharmacyFreak.com. I hold a Bachelor of Pharmacy degree from Rungta College of Pharmaceutical Science and Research. With a strong academic foundation and practical knowledge, I am committed to providing accurate, easy-to-understand content to support pharmacy students and professionals. My aim is to make complex pharmaceutical concepts accessible and useful for real-world application.

Mail- Sachin@pharmacyfreak.com
