QSAR Statistical Methods: Regression and PLS MCQs With Answers

Introduction

This set of MCQs with answers on QSAR statistical methods is a concise study aid for M.Pharm students preparing for exams in Principles of Drug Discovery. The post focuses on the core statistical approaches used in QSAR modelling, with emphasis on multiple linear regression (MLR) and partial least squares (PLS). The questions cover descriptor selection, multicollinearity, model validation (R2, Q2, RMSE), cross-validation techniques, latent variable selection, and interpretation of PLS outputs such as VIP scores and loadings. Each question lists its options followed by the correct answer, to support active learning and build competence in developing robust predictive models for drug discovery.

Q1. What is the primary objective of regression methods in QSAR modelling?

To predict biological activity or property from molecular descriptors
To classify molecules into activity classes only
To optimize chemical synthesis pathways
To identify biological targets of compounds

Correct Answer: To predict biological activity or property from molecular descriptors

Q2. Which assumption is critical for reliable multiple linear regression (MLR) in QSAR?

No multicollinearity among descriptors
Descriptors must be categorical variables
The number of descriptors must exceed number of molecules by a large margin
Dependent variable must be non-numeric

Correct Answer: No multicollinearity among descriptors

Q3. What is the main advantage of Partial Least Squares (PLS) over ordinary MLR in QSAR?

PLS can handle highly correlated descriptors and reduces dimensionality simultaneously
PLS requires larger sample sizes than MLR
PLS ignores the response variable when extracting components
PLS always yields higher R2 regardless of model quality

Correct Answer: PLS can handle highly correlated descriptors and reduces dimensionality simultaneously

Q4. Which metric specifically reflects internal predictive ability estimated by cross-validation?

Q2 (cross-validated R2)
R2 (training set explained variance)
p-value of regression coefficients
Descriptor variance

Correct Answer: Q2 (cross-validated R2)
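To make the difference between R2 and Q2 concrete, here is a minimal sketch of leave-one-out Q2 for an MLR model, using Q2 = 1 - PRESS / SS_total. The synthetic data, the function names, and the choice of the full-data mean in SS_total are assumptions made for illustration; other conventions for the denominator exist.

```python
import numpy as np

def fit_mlr(X, y):
    """Ordinary least squares with an intercept; returns [intercept, slopes...]."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef

def predict_mlr(coef, X):
    return coef[0] + X @ coef[1:]

def r2_training(X, y):
    """Fraction of variance explained on the data used for fitting."""
    y_hat = predict_mlr(fit_mlr(X, y), X)
    return 1.0 - np.sum((y - y_hat) ** 2) / np.sum((y - y.mean()) ** 2)

def q2_loo(X, y):
    """Leave-one-out Q2: each compound is predicted by a model that never saw it."""
    n = len(y)
    press = 0.0
    for i in range(n):
        keep = np.arange(n) != i
        coef = fit_mlr(X[keep], y[keep])
        y_pred = predict_mlr(coef, X[i:i + 1])[0]
        press += (y[i] - y_pred) ** 2
    return 1.0 - press / np.sum((y - y.mean()) ** 2)

# Synthetic data (an assumption): 20 compounds, 2 descriptors, noisy linear activity
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))
y = 1.5 * X[:, 0] - 0.8 * X[:, 1] + rng.normal(scale=0.3, size=20)

r2 = r2_training(X, y)
q2 = q2_loo(X, y)
print(f"R2 = {r2:.3f}, Q2(LOO) = {q2:.3f}")
```

For ordinary least squares, the leave-one-out Q2 computed this way can never exceed the training R2, which is why a large gap between the two is read as a warning sign.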

Q5. What does RMSE represent in QSAR model evaluation?

Root mean squared error between predicted and observed values
Ratio of variance explained to unexplained variance
Number of latent variables in PLS
Average descriptor value

Correct Answer: Root mean squared error between predicted and observed values
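As a quick illustration, RMSE can be computed in a few lines. The observed and predicted values below are hypothetical pIC50 numbers chosen only for the example.

```python
import math

def rmse(observed, predicted):
    """Root mean squared error, in the same units as the response."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Hypothetical pIC50 values for four compounds
observed = [6.1, 5.4, 7.0, 4.8]
predicted = [6.0, 5.8, 6.5, 5.0]
print(round(rmse(observed, predicted), 3))  # about 0.339
```

Because the errors are squared before averaging, a single badly predicted compound raises RMSE more than several small misses do.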

Q6. What is a common drawback of leave-one-out (LOO) cross-validation?

It can produce high variance and overly optimistic estimates for small datasets
It requires an external test set for calculation
It cannot be used with PLS models
It always underestimates model performance

Correct Answer: It can produce high variance and overly optimistic estimates for small datasets

Q7. In PLS modelling, a Variable Importance in Projection (VIP) score greater than 1 usually indicates what?

The descriptor is influential for the model and contributes significantly to prediction
The descriptor should always be removed from the model
The descriptor is uncorrelated with other descriptors
The descriptor has no effect on model performance

Correct Answer: The descriptor is influential for the model and contributes significantly to prediction
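For reference, one commonly used definition of the VIP score for a single-response PLS model is given below. The symbols are assumptions introduced here: p is the number of descriptors, A the number of latent variables, w_ja the weight of descriptor j on component a, t_a the score vector, and q_a the y-loading of component a.

```latex
\mathrm{VIP}_j = \sqrt{\, p \cdot
  \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( w_{ja} / \lVert \mathbf{w}_a \rVert \right)^2}
       {\sum_{a=1}^{A} \mathrm{SSY}_a} },
\qquad
\mathrm{SSY}_a = q_a^{2}\, \mathbf{t}_a^{\top} \mathbf{t}_a
```

Under this definition the squared VIP scores average to exactly 1 across all descriptors, which is where the "greater than 1" rule of thumb comes from: such a descriptor contributes more than the average descriptor.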

Q8. Which criterion is commonly used to select the number of latent variables in a PLS model?

Minimizing cross-validated prediction error (e.g., lowest RMSE or PRESS)
Maximizing the number of latent variables regardless of error
Selecting the same number as descriptors
Using the number of latent variables equal to the number of observations

Correct Answer: Minimizing cross-validated prediction error (e.g., lowest RMSE or PRESS)
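The selection procedure can be sketched as follows: fit PLS models with 1, 2, 3, ... latent variables, estimate the cross-validated RMSE for each, and keep the number that gives the lowest error. The PLS1 (single response) NIPALS routine, the 5-fold split, and the synthetic collinear descriptors below are all assumptions made for illustration; in practice a library implementation such as scikit-learn's PLSRegression would normally be used.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """PLS1 by NIPALS on autoscaled X and mean-centered y."""
    x_mean, x_std = X.mean(axis=0), X.std(axis=0, ddof=1)
    y_mean = y.mean()
    E = (X - x_mean) / x_std          # X residual matrix
    f = y - y_mean                    # y residual vector
    W, P, q = [], [], []
    for _ in range(n_components):
        w = E.T @ f                   # weights: direction of maximal covariance with y
        w /= np.linalg.norm(w)
        t = E @ w                     # scores
        tt = t @ t
        p = E.T @ t / tt              # X loadings
        q_a = f @ t / tt              # y loading
        E = E - np.outer(t, p)        # deflate X
        f = f - q_a * t               # deflate y
        W.append(w); P.append(p); q.append(q_a)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    beta = W @ np.linalg.solve(P.T @ W, q)   # coefficients on the autoscaled descriptors
    return x_mean, x_std, y_mean, beta

def pls1_predict(model, X):
    x_mean, x_std, y_mean, beta = model
    return y_mean + ((X - x_mean) / x_std) @ beta

def cv_rmse(X, y, n_components, n_folds=5, seed=0):
    """k-fold cross-validated RMSE; scaling is refit inside every fold."""
    idx = np.random.default_rng(seed).permutation(len(y))
    sq_err = 0.0
    for fold in np.array_split(idx, n_folds):
        train = np.setdiff1d(idx, fold)
        model = pls1_fit(X[train], y[train], n_components)
        sq_err += np.sum((y[fold] - pls1_predict(model, X[fold])) ** 2)
    return np.sqrt(sq_err / len(y))

# Synthetic data (an assumption): 8 correlated descriptors driven by 2 hidden factors
rng = np.random.default_rng(1)
latent = rng.normal(size=(40, 2))
X = latent @ rng.normal(size=(2, 8)) + rng.normal(scale=0.1, size=(40, 8))
y = latent[:, 0] - 0.5 * latent[:, 1] + rng.normal(scale=0.2, size=40)

errors = {a: cv_rmse(X, y, a) for a in range(1, 7)}
best = min(errors, key=errors.get)
for a, e in errors.items():
    print(f"{a} latent variable(s): CV RMSE = {e:.3f}")
print("Selected number of latent variables:", best)
```

Many practitioners prefer the smallest number of latent variables whose error is close to the minimum rather than the strict minimum, since extra components mostly fit noise.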

Q9. What is the purpose of Y-scrambling (response permutation) in QSAR validation?

To test whether the model’s predictive power is due to chance correlations
To increase the number of descriptors available
To generate additional training samples
To normalize descriptor values

Correct Answer: To test whether the model’s predictive power is due to chance correlations
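A minimal Y-scrambling sketch is shown here, assuming an MLR model and synthetic data. The activities are shuffled many times while the descriptors stay fixed, and the R2 of each scrambled model is compared with the R2 of the real model. The function names and the number of permutations are illustrative choices.

```python
import numpy as np

def r2_fit(X, y):
    """Training R2 of an ordinary least squares model with intercept."""
    A = np.column_stack([np.ones(len(y)), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - (resid @ resid) / np.sum((y - y.mean()) ** 2)

def y_scramble(X, y, n_permutations=200, seed=0):
    """Refit the model on permuted responses to estimate chance-level R2."""
    rng = np.random.default_rng(seed)
    real = r2_fit(X, y)
    scrambled = np.array([r2_fit(X, rng.permutation(y))
                          for _ in range(n_permutations)])
    # Empirical p-value: share of scrambled models doing at least as well as the real one
    p_value = (np.sum(scrambled >= real) + 1) / (n_permutations + 1)
    return real, scrambled, p_value

# Synthetic data (an assumption): 30 compounds, 3 descriptors with a genuine signal
rng = np.random.default_rng(42)
X = rng.normal(size=(30, 3))
y = X @ np.array([1.0, -0.7, 0.5]) + rng.normal(scale=0.3, size=30)

real_r2, scrambled_r2, p_value = y_scramble(X, y)
print(f"real R2 = {real_r2:.3f}")
print(f"mean scrambled R2 = {scrambled_r2.mean():.3f}")
print(f"empirical p-value = {p_value:.3f}")
```

If the scrambled models score nearly as well as the real one, the original correlation is likely a chance result of having many descriptors relative to compounds.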

Q10. Which approach best assesses external predictivity of a QSAR model?

Use an independent external test set and report external Q2 or R2pred
Report only the training set R2
Report the number of descriptors used
Perform only leave-one-out cross-validation on the training set

Correct Answer: Use an independent external test set and report external Q2 or R2pred
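One common way to compute R2pred is sketched below. It assumes the convention in which test-set errors are compared against deviations from the training-set mean; some authors use the test-set mean instead, so the exact definition should be stated when reporting results. All numbers are hypothetical.

```python
def r2_pred(y_test, y_test_pred, y_train_mean):
    """External predictive R2 relative to the training-set mean (one common convention)."""
    press = sum((obs - pred) ** 2 for obs, pred in zip(y_test, y_test_pred))
    ss_ref = sum((obs - y_train_mean) ** 2 for obs in y_test)
    return 1.0 - press / ss_ref

# Hypothetical external test set of three compounds
y_train_mean = 5.5
y_test = [5.0, 6.0, 7.0]
y_test_pred = [5.2, 5.9, 6.6]
print(round(r2_pred(y_test, y_test_pred, y_train_mean), 3))  # about 0.924
```

The test compounds must play no part in descriptor selection, scaling, or model fitting, otherwise the value no longer measures external predictivity.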

Q11. Which scaling method is most commonly recommended before PLS modelling when descriptors have different units?

Autoscaling (mean-centering followed by unit variance scaling)
Leave raw values without scaling
Log-transform response only
Randomly permute descriptor values

Correct Answer: Autoscaling (mean-centering followed by unit variance scaling)
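Autoscaling can be sketched in a few lines, assuming the descriptors are held in a NumPy array with one row per compound. The function names and descriptor values are hypothetical. The key point is that the mean and standard deviation come from the training set only and are then reused for any new compounds.

```python
import numpy as np

def autoscale_fit(X_train):
    """Column means and sample standard deviations of the training descriptors."""
    mean = X_train.mean(axis=0)
    std = X_train.std(axis=0, ddof=1)
    std[std == 0] = 1.0   # constant descriptors are left unscaled to avoid dividing by zero
    return mean, std

def autoscale_apply(X, mean, std):
    """Mean-center, then scale each descriptor to unit variance."""
    return (X - mean) / std

# Hypothetical descriptors: molecular weight (hundreds) and logP (single digits)
X_train = np.array([[320.4, 2.1],
                    [451.2, 3.8],
                    [289.3, 1.2],
                    [512.7, 4.5]])
mean, std = autoscale_fit(X_train)
Z_train = autoscale_apply(X_train, mean, std)
print(Z_train.round(3))

# A new compound is scaled with the training statistics, not its own
Z_new = autoscale_apply(np.array([[400.0, 3.0]]), mean, std)
print(Z_new.round(3))
```

Without this step, molecular weight would dominate the PLS components simply because its numbers are larger, not because it matters more.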

Q12. Which pattern indicates potential overfitting in QSAR models?

Very high training R2 but poor external Q2 or large test RMSE
Low training R2 and similarly low test performance
Consistent residuals with no trends
High VIP scores for a few descriptors

Correct Answer: Very high training R2 but poor external Q2 or large test RMSE

Q13. How does Principal Component Regression (PCR) differ from PLS?

PCR extracts components from X only, whereas PLS extracts components considering both X and Y
PCR uses Y information during component extraction and PLS does not
PCR is designed for classification only, not continuous responses
PCR requires fewer samples than PLS in all cases

Correct Answer: PCR extracts components from X only, whereas PLS extracts components considering both X and Y

Q14. Which diagnostic measure helps detect multicollinearity among descriptors?

Variance Inflation Factor (VIF) with values >10 indicating problematic collinearity
Prediction residual sum of squares (PRESS)
External Q2 only
Number of latent variables in PLS

Correct Answer: Variance Inflation Factor (VIF) with values >10 indicating problematic collinearity
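The VIF of a descriptor is 1 / (1 - R2), where R2 comes from regressing that descriptor on all the other descriptors. A minimal sketch with synthetic data (an assumption for illustration) is shown here, in which two descriptors are deliberately made almost identical.

```python
import numpy as np

def vif(X):
    """Variance inflation factor for every column of the descriptor matrix X."""
    n, p = X.shape
    out = np.empty(p)
    for j in range(p):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ coef
        r2 = 1.0 - (resid @ resid) / np.sum((target - target.mean()) ** 2)
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic descriptors (an assumption): x2 is nearly a copy of x1, x3 is independent
rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
x2 = x1 + rng.normal(scale=0.1, size=50)
x3 = rng.normal(size=50)
X = np.column_stack([x1, x2, x3])

values = vif(X)
for name, v in zip(["x1", "x2", "x3"], values):
    flag = "problematic" if v > 10 else "ok"
    print(f"{name}: VIF = {v:.1f} ({flag})")
```

The same numbers appear on the diagonal of the inverse of the descriptor correlation matrix, which is a faster route when there are many descriptors.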

Q15. Which method performs simultaneous variable selection and regularization for regression models?

LASSO (Least Absolute Shrinkage and Selection Operator)
Standard stepwise MLR without penalty
Calculating pairwise correlations only
Principal Component Analysis without regression

Correct Answer: LASSO (Least Absolute Shrinkage and Selection Operator)
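To show how LASSO selects variables, here is a minimal coordinate-descent sketch that minimizes (1/2n) * ||y - Xb||^2 + alpha * sum(|b_j|) on standardized descriptors. The data, the penalty value alpha = 0.3, and the fixed iteration count are assumptions for illustration, and this is not how any particular library implements it.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink z toward zero by t; values within [-t, t] become exactly zero."""
    return np.sign(z) * max(abs(z) - t, 0.0)

def lasso_cd(X, y, alpha, n_iter=500):
    """LASSO by cyclic coordinate descent on standardized X and centered y."""
    n, p = X.shape
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)   # each column now has x_j . x_j / n = 1
    yc = y - y.mean()
    beta = np.zeros(p)
    for _ in range(n_iter):
        for j in range(p):
            # Residual with descriptor j's current contribution added back
            partial_resid = yc - Xs @ beta + Xs[:, j] * beta[j]
            rho = Xs[:, j] @ partial_resid / n
            beta[j] = soft_threshold(rho, alpha)
    return beta

# Synthetic data (an assumption): only descriptors 0 and 3 influence activity
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 6))
y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(scale=0.3, size=60)

beta = lasso_cd(X, y, alpha=0.3)
print("coefficients:", beta.round(3))
print("selected descriptors:", np.flatnonzero(beta))
```

The L1 penalty sets weak coefficients to exactly zero, so fitting and descriptor selection happen in one step. Increasing alpha removes more descriptors, and a large enough alpha removes all of them.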

Q16. In PLS, what do loadings represent compared to scores?

Loadings indicate how variables contribute to components; scores represent observations projected onto those components
Loadings represent observations and scores represent variables
Loadings measure model error and scores measure variable variance
Loadings and scores are identical concepts with different names

Correct Answer: Loadings indicate how variables contribute to components; scores represent observations projected onto those components

Q17. When is it appropriate to apply non-linear regression methods in QSAR?

When the relationship between descriptors and activity is clearly non-linear
Only when PLS fails due to multicollinearity
For datasets with fewer than 10 compounds exclusively
Non-linear methods are never appropriate in QSAR

Correct Answer: When the relationship between descriptors and activity is clearly non-linear

Q18. What should residuals ideally display in a well-fit regression model?

No systematic pattern and approximate normal distribution
A clear increasing trend with fitted values
A periodic oscillation correlated with descriptors
Residuals equal to zero for all observations

Correct Answer: No systematic pattern and approximate normal distribution

Q19. What is a common rule-of-thumb regarding the number of observations per descriptor in MLR?

At least 5–10 observations per descriptor
One observation per descriptor is sufficient
Descriptors should always outnumber observations
There is no relationship between observations and descriptors

Correct Answer: At least 5–10 observations per descriptor

Q20. What is the primary purpose of cross-validation when building PLS models?

To prevent overfitting and to determine the optimal number of latent variables
To increase the number of descriptors automatically
To guarantee a perfect fit on the training set
To avoid any descriptor preprocessing

Correct Answer: To prevent overfitting and to determine the optimal number of latent variables

G S Sachin

I am a Registered Pharmacist under the Pharmacy Act, 1948, and the founder of PharmacyFreak.com. I hold a Bachelor of Pharmacy degree from Rungta College of Pharmaceutical Science and Research. With a strong academic foundation and practical knowledge, I am committed to providing accurate, easy-to-understand content to support pharmacy students and professionals. My aim is to make complex pharmaceutical concepts accessible and useful for real-world application.

Mail: Sachin@pharmacyfreak.com