QSAR statistical methods: regression and PLS MCQs With Answers

Introduction

This set of multiple-choice questions on QSAR statistical methods is a concise study aid for M.Pharm students preparing for exams in Principles of Drug Discovery. It covers the core statistical approaches used in QSAR modelling, with emphasis on multiple linear regression (MLR) and partial least squares (PLS). The questions reinforce descriptor selection, multicollinearity, model validation (R2, Q2, RMSE), cross-validation techniques, latent variable selection, and interpretation of PLS outputs such as VIP scores and loadings. Each question lists its options followed by the correct answer, to support active learning and build competence in developing robust predictive models for drug discovery.

Q1. What is the primary objective of regression methods in QSAR modelling?

  • To predict biological activity or property from molecular descriptors
  • To classify molecules into activity classes only
  • To optimize chemical synthesis pathways
  • To identify biological targets of compounds

Correct Answer: To predict biological activity or property from molecular descriptors

Q2. Which assumption is critical for reliable multiple linear regression (MLR) in QSAR?

  • No strong multicollinearity among descriptors
  • Descriptors must be categorical variables
  • The number of descriptors must exceed the number of molecules by a large margin
  • Dependent variable must be non-numeric

Correct Answer: No strong multicollinearity among descriptors

Q3. What is the main advantage of Partial Least Squares (PLS) over ordinary MLR in QSAR?

  • PLS can handle highly correlated descriptors and reduces dimensionality simultaneously
  • PLS requires larger sample sizes than MLR
  • PLS ignores the response variable when extracting components
  • PLS always yields higher R2 regardless of model quality

Correct Answer: PLS can handle highly correlated descriptors and reduces dimensionality simultaneously

Q4. Which metric specifically reflects internal predictive ability estimated by cross-validation?

  • Q2 (cross-validated R2)
  • R2 (training set explained variance)
  • p-value of regression coefficients
  • Descriptor variance

Correct Answer: Q2 (cross-validated R2)
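To make the difference between R2 and Q2 concrete, here is a minimal sketch that computes both for a one-descriptor linear model using leave-one-out cross-validation. The data values and the function names (fit_line, r2_training, q2_loo) are invented for illustration, and Q2 is taken as 1 - PRESS / SS_tot with SS_tot measured around the overall mean of the response, which is one common convention among several.

```python
def fit_line(x, y):
    """Least-squares fit of y = a + b*x; returns (a, b)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def r2_training(x, y):
    """R2: fraction of variance explained when fitting on all compounds."""
    a, b = fit_line(x, y)
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - ss_res / ss_tot


def q2_loo(x, y):
    """Q2 = 1 - PRESS / SS_tot; each compound is predicted by a model that never saw it."""
    mean_y = sum(y) / len(y)
    press = 0.0
    for i in range(len(x)):
        # Refit without compound i, then predict compound i.
        a, b = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
        press += (y[i] - (a + b * x[i])) ** 2
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1 - press / ss_tot


# Made-up data: one descriptor (say logP) against pIC50 for eight compounds.
logp = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
pic50 = [4.1, 4.6, 4.8, 5.5, 5.6, 6.3, 6.4, 7.1]

print("R2 (training):", round(r2_training(logp, pic50), 3))
print("Q2 (LOO):     ", round(q2_loo(logp, pic50), 3))
```

For a least-squares model, Q2 computed this way cannot exceed the training R2, so a large gap between the two is a warning sign.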

Q5. What does RMSE represent in QSAR model evaluation?

  • Root mean squared error between predicted and observed values
  • Ratio of variance explained to unexplained variance
  • Number of latent variables in PLS
  • Average descriptor value

Correct Answer: Root mean squared error between predicted and observed values
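As a quick illustration of the definition, RMSE can be computed in a few lines. The function name rmse and the example values are hypothetical.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between observed and predicted values."""
    if len(observed) != len(predicted):
        raise ValueError("observed and predicted must have the same length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Made-up pIC50 values for four test compounds.
observed = [5.2, 6.1, 7.0, 4.8]
predicted = [5.0, 6.4, 6.7, 5.1]
print("RMSE:", round(rmse(observed, predicted), 3))
```

RMSE is expressed in the same units as the response, so it can be compared directly with the experimental error of the assay.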

Q6. What is a common drawback of leave-one-out (LOO) cross-validation?

  • It can produce high variance and overly optimistic estimates for small datasets
  • It requires an external test set for calculation
  • It cannot be used with PLS models
  • It always underestimates model performance

Correct Answer: It can produce high variance and overly optimistic estimates for small datasets

Q7. In PLS modelling, a Variable Importance in Projection (VIP) score greater than 1 usually indicates what?

  • The descriptor is influential for the model and contributes significantly to prediction
  • The descriptor should always be removed from the model
  • The descriptor is uncorrelated with other descriptors
  • The descriptor has no effect on model performance

Correct Answer: The descriptor is influential for the model and contributes significantly to prediction

Q8. Which criterion is commonly used to select the number of latent variables in a PLS model?

  • Minimizing cross-validated prediction error (e.g., lowest RMSE or PRESS)
  • Maximizing the number of latent variables regardless of error
  • Selecting the same number as descriptors
  • Using the number of latent variables equal to the number of observations

Correct Answer: Minimizing cross-validated prediction error (e.g., lowest RMSE or PRESS)
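The selection rule can be sketched as follows: fit PLS models with 1, 2, 3, ... latent variables, compute the leave-one-out PRESS for each, and keep the number that gives the lowest value. This is a minimal single-response (PLS1) NIPALS implementation in plain Python, written for illustration only. The function names and the data are made up, and only mean-centering is applied here, whereas in practice descriptors would usually be autoscaled first.

```python
import math


def pls1_fit(X, y, n_components):
    """NIPALS PLS1 on mean-centered data; returns what prediction needs."""
    n, k = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(k)]
    y_mean = sum(y) / n
    E = [[v - m for v, m in zip(row, x_mean)] for row in X]  # X residuals
    f = [v - y_mean for v in y]                              # y residuals
    components = []
    for _ in range(n_components):
        # Weights: direction in descriptor space that covaries most with y.
        w = [sum(E[i][j] * f[i] for i in range(n)) for j in range(k)]
        norm = math.sqrt(sum(v * v for v in w))
        if norm < 1e-12:
            break  # nothing left in y for X to explain
        w = [v / norm for v in w]
        t = [sum(a * b for a, b in zip(row, w)) for row in E]            # scores
        tt = sum(v * v for v in t)
        p = [sum(E[i][j] * t[i] for i in range(n)) / tt for j in range(k)]  # loadings
        q = sum(fi * ti for fi, ti in zip(f, t)) / tt                    # y loading
        # Deflate X and y before extracting the next latent variable.
        E = [[E[i][j] - t[i] * p[j] for j in range(k)] for i in range(n)]
        f = [fi - q * ti for fi, ti in zip(f, t)]
        components.append((w, p, q))
    return x_mean, y_mean, components


def pls1_predict(model, x):
    """Predict one compound by replaying the deflation steps."""
    x_mean, y_mean, components = model
    e = [v - m for v, m in zip(x, x_mean)]
    y_hat = y_mean
    for w, p, q in components:
        t = sum(a * b for a, b in zip(e, w))
        y_hat += q * t
        e = [a - t * b for a, b in zip(e, p)]
    return y_hat


def press_loo(X, y, n_components):
    """Leave-one-out prediction residual sum of squares."""
    press = 0.0
    for i in range(len(X)):
        model = pls1_fit(X[:i] + X[i + 1:], y[:i] + y[i + 1:], n_components)
        press += (y[i] - pls1_predict(model, X[i])) ** 2
    return press


def select_n_components(X, y, max_components):
    """Return the number of latent variables with the lowest LOO PRESS."""
    scores = {a: press_loo(X, y, a) for a in range(1, max_components + 1)}
    return min(scores, key=scores.get), scores


# Made-up data: three descriptors (the first two strongly correlated), eight compounds.
X = [[1.0, 2.1, 0.5], [2.0, 3.9, 1.5], [3.0, 6.2, 0.7], [4.0, 8.1, 1.9],
     [5.0, 9.8, 1.1], [6.0, 12.1, 2.4], [7.0, 14.2, 0.9], [8.0, 15.9, 2.0]]
y = [3.2, 5.1, 6.4, 8.6, 9.3, 12.0, 12.2, 14.4]

best, scores = select_n_components(X, y, 3)
for a, value in scores.items():
    print(a, "latent variable(s): PRESS =", round(value, 3))
print("Selected:", best)
```

Many practitioners prefer the smallest number of latent variables whose PRESS is not meaningfully worse than the minimum, rather than the strict minimum shown here.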

Q9. What is the purpose of Y-scrambling (response permutation) in QSAR validation?

  • To test whether the model’s predictive power is due to chance correlations
  • To increase the number of descriptors available
  • To generate additional training samples
  • To normalize descriptor values

Correct Answer: To test whether the model’s predictive power is due to chance correlations
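A Y-scrambling check can be sketched in a few lines: shuffle the activities, refit, and compare the R2 values from the scrambled data with the R2 of the real model. The example below uses a one-descriptor linear model and made-up data; the function names and the default of 200 permutations are assumptions chosen for illustration. In a real study the whole modelling procedure, including any descriptor selection, should be repeated for each permutation.

```python
import random


def r_squared(x, y):
    """R2 of a one-descriptor least-squares line (equals the squared Pearson r)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    return sxy * sxy / (sxx * syy)


def y_scramble(x, y, n_trials=200, seed=0):
    """Return the real R2 and the R2 values obtained after permuting y."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    scrambled = []
    for _ in range(n_trials):
        y_perm = y[:]
        rng.shuffle(y_perm)  # break the link between structure and activity
        scrambled.append(r_squared(x, y_perm))
    return r_squared(x, y), scrambled


# Made-up data: one descriptor against pIC50 for eight compounds.
logp = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5]
pic50 = [4.1, 4.6, 4.8, 5.5, 5.6, 6.3, 6.4, 7.1]

real_r2, scrambled_r2 = y_scramble(logp, pic50)
fraction = sum(r >= real_r2 for r in scrambled_r2) / len(scrambled_r2)
print("Real R2:", round(real_r2, 3))
print("Mean scrambled R2:", round(sum(scrambled_r2) / len(scrambled_r2), 3))
print("Fraction of scrambled models matching the real R2:", fraction)
```

If scrambled models often reach an R2 close to the real one, the original correlation is likely to be a chance result.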

Q10. Which approach best assesses external predictivity of a QSAR model?

  • Use an independent external test set and report external Q2 or R2pred
  • Report only the training set R2
  • Report the number of descriptors used
  • Perform only leave-one-out cross-validation on the training set

Correct Answer: Use an independent external test set and report external Q2 or R2pred

Q11. Which scaling method is most commonly recommended before PLS modelling when descriptors have different units?

  • Autoscaling (mean-centering followed by unit variance scaling)
  • Leave raw values without scaling
  • Log-transform response only
  • Randomly permute descriptor values

Correct Answer: Autoscaling (mean-centering followed by unit variance scaling)
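Autoscaling is simple enough to write out by hand. The sketch below assumes a descriptor matrix stored as a list of rows (one row per compound) and uses the sample standard deviation; the function names and the example values are hypothetical.

```python
import statistics


def autoscale(X):
    """Mean-center each descriptor column and scale it to unit variance."""
    columns = list(zip(*X))
    means = [statistics.mean(col) for col in columns]
    sds = [statistics.stdev(col) for col in columns]  # sample SD (n - 1)
    if any(sd == 0 for sd in sds):
        raise ValueError("constant descriptor: remove it before autoscaling")
    scaled = [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X]
    return scaled, means, sds


def apply_scaling(X_new, means, sds):
    """Scale new compounds with the training-set means and SDs."""
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in X_new]


# Made-up descriptors: molecular weight (g/mol) and logP (unitless).
X_train = [[180.2, 1.2], [250.3, 2.5], [310.4, 3.1], [420.5, 4.0]]
X_scaled, means, sds = autoscale(X_train)
for row in X_scaled:
    print([round(v, 2) for v in row])
```

Test-set compounds should be scaled with the training-set means and standard deviations, as apply_scaling does, so that no information from the test set leaks into the model.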

Q12. Which pattern indicates potential overfitting in QSAR models?

  • Very high training R2 but poor external Q2 or large test RMSE
  • Low training R2 and similarly low test performance
  • Consistent residuals with no trends
  • High VIP scores for a few descriptors

Correct Answer: Very high training R2 but poor external Q2 or large test RMSE

Q13. How does Principal Component Regression (PCR) differ from PLS?

  • PCR extracts components from X only, whereas PLS extracts components considering both X and Y
  • PCR uses Y information during component extraction and PLS does not
  • PCR is designed for classification only, not continuous responses
  • PCR requires fewer samples than PLS in all cases

Correct Answer: PCR extracts components from X only, whereas PLS extracts components considering both X and Y

Q14. Which diagnostic measure helps detect multicollinearity among descriptors?

  • Variance Inflation Factor (VIF) with values >10 indicating problematic collinearity
  • Prediction residual sum of squares (PRESS)
  • External Q2 only
  • Number of latent variables in PLS

Correct Answer: Variance Inflation Factor (VIF) with values >10 indicating problematic collinearity
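One way to compute VIF values without a statistics package is to use the identity that the VIF of descriptor j equals the j-th diagonal element of the inverse of the descriptor correlation matrix, which is equivalent to 1 / (1 - Rj2), where Rj2 comes from regressing descriptor j on all the others. The sketch below is illustrative: the function names and data are made up, and it assumes no descriptor is constant.

```python
import math


def correlation_matrix(X):
    """Pearson correlation matrix of the descriptor columns."""
    n = len(X)
    columns = list(zip(*X))
    means = [sum(col) / n for col in columns]
    dev = [[v - m for v in col] for col, m in zip(columns, means)]
    norms = [math.sqrt(sum(v * v for v in d)) for d in dev]
    k = len(columns)
    return [[sum(a * b for a, b in zip(dev[i], dev[j])) / (norms[i] * norms[j])
             for j in range(k)] for i in range(k)]


def invert(M):
    """Gauss-Jordan matrix inverse with partial pivoting."""
    k = len(M)
    A = [row[:] + [1.0 if i == j else 0.0 for j in range(k)]
         for i, row in enumerate(M)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("descriptors are perfectly collinear")
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [row[k:] for row in A]


def vif(X):
    """VIF of each descriptor: diagonal of the inverse correlation matrix."""
    inverse = invert(correlation_matrix(X))
    return [inverse[j][j] for j in range(len(inverse))]


# Made-up data: descriptor 2 is roughly twice descriptor 1, descriptor 3 is unrelated.
X = [[1.0, 2.1, 0.5], [2.0, 3.9, 1.5], [3.0, 6.2, 0.7],
     [4.0, 8.1, 1.9], [5.0, 9.8, 1.1], [6.0, 12.1, 2.4]]
for j, value in enumerate(vif(X), start=1):
    flag = "problematic" if value > 10 else "acceptable"
    print("Descriptor", j, "VIF =", round(value, 1), flag)
```

When two descriptors both show a high VIF, a common remedy is to drop one of them or to switch to a latent-variable method such as PLS.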

Q15. Which method performs simultaneous variable selection and regularization for regression models?

  • LASSO (Least Absolute Shrinkage and Selection Operator)
  • Standard stepwise MLR without penalty
  • Calculating pairwise correlations only
  • Principal Component Analysis without regression

Correct Answer: LASSO (Least Absolute Shrinkage and Selection Operator)

Q16. In PLS, what do loadings represent compared to scores?

  • Loadings indicate how variables contribute to components; scores represent observations projected onto those components
  • Loadings represent observations and scores represent variables
  • Loadings measure model error and scores measure variable variance
  • Loadings and scores are identical concepts with different names

Correct Answer: Loadings indicate how variables contribute to components; scores represent observations projected onto those components

Q17. When is it appropriate to apply non-linear regression methods in QSAR?

  • When the relationship between descriptors and activity is clearly non-linear
  • Only when PLS fails due to multicollinearity
  • For datasets with fewer than 10 compounds exclusively
  • Non-linear methods are never appropriate in QSAR

Correct Answer: When the relationship between descriptors and activity is clearly non-linear

Q18. What should residuals ideally display in a well-fit regression model?

  • No systematic pattern and an approximately normal distribution
  • A clear increasing trend with fitted values
  • A periodic oscillation correlated with descriptors
  • Residuals equal to zero for all observations

Correct Answer: No systematic pattern and an approximately normal distribution

Q19. What is a common rule-of-thumb regarding the number of observations per descriptor in MLR?

  • At least 5–10 observations per descriptor
  • One observation per descriptor is sufficient
  • Descriptors should always outnumber observations
  • There is no relationship between observations and descriptors

Correct Answer: At least 5–10 observations per descriptor

Q20. What is the primary purpose of cross-validation when building PLS models?

  • To prevent overfitting and to determine the optimal number of latent variables
  • To increase the number of descriptors automatically
  • To guarantee a perfect fit on the training set
  • To avoid any descriptor preprocessing

Correct Answer: To prevent overfitting and to determine the optimal number of latent variables
