Data Mining in Bioinformatics MCQs With Answers

This set of MCQs on data mining in bioinformatics is designed for M.Pharm students who want a firm foundation in computational techniques applied to biological and pharmaceutical data. It emphasizes preprocessing of high-throughput data, supervised and unsupervised learning, feature selection, association rules, and evaluation metrics relevant to gene expression, sequence, and proteomic analyses. Each question links data-mining methodology to a practical bioinformatics scenario (microarray/RNA-seq analysis, SNP and GWAS interpretation, pathway and network mining, or clinical data integration), so that students develop both the conceptual understanding and the applied reasoning needed for research, drug discovery, and translational studies.

Q1. What is the primary goal of data preprocessing in bioinformatics datasets such as microarray or RNA‑seq?

  • To increase the size of the dataset by adding synthetic samples
  • To remove noise, normalize measurements, and handle missing values for reliable downstream analysis
  • To convert all sequences into protein structures
  • To automatically annotate genes with functional terms

Correct Answer: To remove noise, normalize measurements, and handle missing values for reliable downstream analysis

Q2. Which distance or similarity metric is most appropriate for clustering gene expression profiles when the goal is to capture linear correlation (the shape of the profile) rather than absolute magnitude?

  • Euclidean distance
  • Manhattan distance
  • Pearson correlation
  • Hamming distance

Correct Answer: Pearson correlation
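
To see why correlation is preferred here, the following minimal sketch compares Pearson correlation with Euclidean distance on three made-up expression profiles. The gene names and values are hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression profiles across four conditions (made-up values).
gene_a = [1.0, 2.0, 3.0, 4.0]
gene_b = [10.0, 20.0, 30.0, 40.0]  # same shape as gene_a, ten times the magnitude
gene_c = [4.0, 3.0, 2.0, 1.0]      # similar magnitude to gene_a, opposite trend

r_ab = pearson(gene_a, gene_b)
r_ac = pearson(gene_a, gene_c)
d_ab = euclidean(gene_a, gene_b)
d_ac = euclidean(gene_a, gene_c)
```

In this toy case the correlation groups gene_a with gene_b (r = 1) and separates it from gene_c (r = -1), while Euclidean distance does the opposite (about 49.3 versus about 4.47) because it is dominated by magnitude.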

Q3. In supervised classification of drug response from gene expression data, which technique reduces dimensionality while preserving variance and can be used prior to classification?

  • k-means clustering
  • Principal component analysis (PCA)
  • Apriori association mining
  • Hierarchical clustering

Correct Answer: Principal component analysis (PCA)

Q4. Which method is typically used to control the false discovery rate (FDR) when testing thousands of genes for differential expression?

  • Bonferroni correction only
  • Benjamini–Hochberg procedure
  • Leave-one-out cross-validation
  • Maximum likelihood estimation

Correct Answer: Benjamini–Hochberg procedure
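
As an illustration, the step-up adjustment can be sketched in a few lines. This is a simplified version assuming a plain list of raw p-values; the example p-values are invented, and real analyses would normally rely on an established statistics package.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone adjusted values.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, p_values[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Invented raw p-values for five genes.
p = [0.001, 0.008, 0.039, 0.041, 0.60]
q = benjamini_hochberg(p)

# Genes declared significant at an FDR of 5% (the cutoff is an assumption).
significant = [i for i, value in enumerate(q) if value <= 0.05]
```

Each adjusted value is the raw p-value scaled by m/rank, then capped so that adjusted values never decrease as the raw p-values get smaller.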

Q5. Which of the following is a wrapper method for feature selection commonly used in predictive modeling of biomarkers?

  • Correlation-based feature selection (filter)
  • Recursive feature elimination with cross-validated classifiers
  • PCA (unsupervised)
  • Variance thresholding

Correct Answer: Recursive feature elimination with cross-validated classifiers

Q6. In association rule mining applied to adverse drug reactions, which metric indicates how much more often the antecedent and consequent occur together than expected if they were independent?

  • Support
  • Confidence
  • Lift
  • p-value

Correct Answer: Lift
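
The three rule metrics can be computed directly from co-occurrence counts. The sketch below assumes each report is a set of items; the drug and reaction names are hypothetical, as is the helper name `rule_metrics`.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_ac = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_ac / n              # P(A and C)
    confidence = n_ac / n_a         # P(C | A)
    lift = confidence / (n_c / n)   # P(C | A) / P(C)
    return support, confidence, lift

# Hypothetical adverse-event reports, each stored as a set of items.
reports = [
    {"drug_x", "rash"},
    {"drug_x", "rash"},
    {"drug_x", "nausea"},
    {"drug_y", "rash"},
    {"drug_y", "nausea"},
    {"drug_y"},
    {"drug_z", "nausea"},
    {"drug_z"},
]

support, confidence, lift = rule_metrics(reports, {"drug_x"}, {"rash"})
```

Here the rule drug_x -> rash has support 0.25, confidence 2/3, and lift 16/9 (about 1.78). A lift above 1 means the pair occurs together more often than independence would predict.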

Q7. Which clustering algorithm is most sensitive to initial seeds and may converge to local minima, often requiring multiple restarts for stable gene expression clustering?

  • Hierarchical agglomerative clustering
  • k-means clustering
  • DBSCAN
  • Affinity propagation

Correct Answer: k-means clustering
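
The restart strategy mentioned in the question can be sketched with a deliberately small one-dimensional k-means. The data values, the seed range, and the function names are assumptions made for illustration only.

```python
import random

def kmeans_1d(values, k, seed, n_iter=50):
    """Run a basic 1-D k-means from one random initialization.

    Returns (centers, inertia), where inertia is the within-cluster
    sum of squared distances.
    """
    rng = random.Random(seed)
    centers = rng.sample(values, k)  # initial seeds drawn from the data
    for _ in range(n_iter):
        clusters = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda j: (v - centers[j]) ** 2)
            clusters[nearest].append(v)
        # Keep the old center if a cluster ends up empty.
        new_centers = [
            sum(c) / len(c) if c else centers[j] for j, c in enumerate(clusters)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    inertia = sum(min((v - c) ** 2 for c in centers) for v in values)
    return centers, inertia

def best_of_restarts(values, k, n_restarts):
    """Run k-means from several seeds and keep the lowest-inertia solution."""
    runs = (kmeans_1d(values, k, seed) for seed in range(n_restarts))
    return min(runs, key=lambda result: result[1])

# Made-up expression values forming three loose groups.
values = [1.0, 1.2, 0.8, 5.0, 5.2, 4.8, 9.0, 9.2, 8.8]
best_centers, best_inertia = best_of_restarts(values, k=3, n_restarts=10)
```

Different seeds can settle on different local minima, so keeping the run with the lowest inertia is a common way to get a more stable result.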

Q8. When evaluating a binary classifier for predicting patient responders, which metrics are most informative when classes are highly imbalanced?

  • Accuracy
  • Precision, recall and F1-score
  • Mean squared error
  • Euclidean distance

Correct Answer: Precision, recall and F1-score
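
A short sketch shows how accuracy can mislead on imbalanced data. The 95:5 split and the "always predict non-responder" model are invented for illustration.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when a class is never predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}

# Hypothetical cohort: 95 non-responders (0) and 5 responders (1).
y_true = [0] * 95 + [1] * 5
always_negative = [0] * 100  # a useless model that never predicts a responder

metrics = binary_metrics(y_true, always_negative)
```

The useless model scores 95% accuracy, yet its recall and F1 for the responder class are both 0, which is why these metrics are preferred for rare classes.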

Q9. Which technique is specifically used to identify co-expressed gene modules and relate them to external traits in systems biology?

  • Support vector machine
  • Weighted gene co-expression network analysis (WGCNA)
  • Apriori algorithm
  • Hidden Markov Models

Correct Answer: Weighted gene co-expression network analysis (WGCNA)

Q10. For sequence motif discovery in promoter regions associated with drug-regulated genes, which tool or approach is commonly employed?

  • MEME (Multiple EM for Motif Elicitation)
  • k-means clustering
  • PCA
  • t-test for differential expression

Correct Answer: MEME (Multiple EM for Motif Elicitation)

Q11. In GWAS data mining, which problem arises from testing millions of SNPs and is typically addressed by stringent multiple testing correction?

  • High homogeneity
  • Multiple hypothesis testing and increased type I error
  • Insufficient sample variance
  • Missing gene annotations

Correct Answer: Multiple hypothesis testing and increased type I error
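
The scale of the problem is easy to see with simple arithmetic. The sketch below assumes one million independent tests and a nominal alpha of 0.05; both numbers are illustrative assumptions.

```python
def expected_false_positives(alpha, n_tests):
    """Expected number of false positives if every null hypothesis is true."""
    return alpha * n_tests

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under Bonferroni correction."""
    return alpha / n_tests

n_snps = 1_000_000  # assumed number of tested SNPs
alpha = 0.05        # assumed family-wise error rate

uncorrected_hits = expected_false_positives(alpha, n_snps)
per_snp_threshold = bonferroni_threshold(alpha, n_snps)
```

Without correction, about 50,000 SNPs would pass p < 0.05 by chance alone. The Bonferroni threshold for this setting is 5e-8, which matches the conventional genome-wide significance level.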

Q12. Which machine-learning algorithm is known for good performance with high-dimensional biological data, built-in feature importance measures, and robustness to overfitting when tuned properly?

  • Decision tree (single)
  • Random forest
  • k-nearest neighbors
  • Naive Bayes

Correct Answer: Random forest

Q13. What is the main advantage of using cross-validation (e.g., k-fold) in model evaluation for biomarker discovery?

  • It increases the size of the training data permanently
  • It provides a more reliable estimate of generalization performance and helps detect overfitting
  • It guarantees perfect prediction on new data
  • It eliminates the need for feature selection

Correct Answer: It provides a more reliable estimate of generalization performance and helps detect overfitting
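
The mechanics of k-fold splitting can be sketched as follows. This minimal version uses contiguous, unshuffled folds and a hypothetical helper name; practical pipelines usually shuffle or stratify the samples first.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_indices, test_indices) for k contiguous folds."""
    # Spread any remainder over the first folds so sizes differ by at most one.
    fold_sizes = [
        n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)
    ]
    start = 0
    for size in fold_sizes:
        test = list(range(start, start + size))
        train = list(range(0, start)) + list(range(start + size, n_samples))
        yield train, test
        start += size

# Ten hypothetical samples split into three folds.
folds = list(k_fold_indices(10, 3))
```

Every sample is held out exactly once, so each prediction is scored by a model that never saw that sample. For biomarker discovery, feature selection should be repeated inside each training fold; selecting features on the full dataset first leaks information into the test folds.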

Q14. Which normalization method is commonly used for RNA‑seq count data prior to differential expression analysis to account for library size?

  • RPKM/FPKM or TPM and methods like DESeq size factor normalization
  • Pearson correlation
  • z-score normalization across genes only
  • Min-max scaling to [0,1]

Correct Answer: RPKM/FPKM or TPM and methods like DESeq size factor normalization
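
The median-of-ratios idea behind DESeq-style size factors can be sketched in plain Python. This is a simplified illustration, not the package's implementation, and the count matrix is invented.

```python
import math
from statistics import median

def size_factors(counts):
    """Estimate per-sample size factors by the median-of-ratios idea.

    counts: list of genes, each a list of raw counts (one entry per sample).
    """
    n_samples = len(counts[0])
    # Use only genes with non-zero counts in every sample so logs are defined.
    usable = [row for row in counts if all(c > 0 for c in row)]
    # Log of each gene's geometric mean across samples.
    log_geo_means = [sum(math.log(c) for c in row) / n_samples for row in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [
            math.log(row[j]) - log_gm for row, log_gm in zip(usable, log_geo_means)
        ]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Invented counts: sample 2 was sequenced about twice as deeply as sample 1.
counts = [
    [100, 200],
    [50, 100],
    [30, 60],
    [0, 5],
]

sf = size_factors(counts)
normalized = [[c / f for c, f in zip(row, sf)] for row in counts]
```

Dividing by the size factors puts both samples on a common scale, so a gene with counts of 100 and 200 is no longer mistaken for a two-fold change.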

Q15. In clustering gene expression data, which linkage method tends to produce compact, spherical clusters and is sensitive to outliers?

  • Single linkage
  • Complete linkage
  • Average linkage
  • Ward’s method

Correct Answer: Ward’s method

Q16. Which algorithm is specifically designed for detecting conserved domains or profile matches in protein families and is widely used in sequence annotation?

  • BLAST (basic local alignment search tool)
  • HMMER (using Hidden Markov Models)
  • PCA
  • k-means clustering

Correct Answer: HMMER (using Hidden Markov Models)

Q17. In expression analysis, what does a volcano plot combine to help prioritize candidate genes?

  • Log2 fold change on x-axis and -log10 p-value (statistical significance) on y-axis
  • Gene length vs GC content
  • Principal component 1 vs principal component 2
  • Expression level vs transcript length

Correct Answer: Log2 fold change on x-axis and -log10 p-value (statistical significance) on y-axis
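
The two axes of a volcano plot are simple transformations of effect size and significance. In the sketch below the gene names, expression means, p-values, and cutoffs are all assumptions chosen for illustration.

```python
import math

def volcano_coordinates(mean_treated, mean_control, p_value):
    """Return (log2 fold change, -log10 p-value) for one gene."""
    log2_fc = math.log2(mean_treated / mean_control)
    neg_log10_p = -math.log10(p_value)
    return log2_fc, neg_log10_p

# Hypothetical genes: (mean in treated, mean in control, p-value).
genes = {
    "gene_a": (80.0, 10.0, 1e-6),
    "gene_b": (12.0, 10.0, 0.30),
    "gene_c": (5.0, 20.0, 1e-4),
}

# Assumed cutoffs: at least a two-fold change and p <= 0.01.
min_abs_log2_fc = 1.0
min_neg_log10_p = 2.0

hits = []
for name, (treated, control, p_value) in genes.items():
    x, y = volcano_coordinates(treated, control, p_value)
    if abs(x) >= min_abs_log2_fc and y >= min_neg_log10_p:
        hits.append(name)
```

Genes in the upper-left and upper-right corners of the plot (here gene_c and gene_a) combine a large change with strong statistical support, which is why they are prioritized.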

Q18. Which approach is most appropriate to handle class imbalance when training a classifier to predict rare adverse drug events?

  • Discard minority class samples
  • Use resampling techniques (SMOTE or undersampling) and cost-sensitive learning
  • Always predict the majority class
  • Remove all correlated features

Correct Answer: Use resampling techniques (SMOTE or undersampling) and cost-sensitive learning
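
One simple form of cost-sensitive learning weights each class inversely to its frequency. The sketch below shows that heuristic only; the function name and the 95:5 label split are assumptions, and SMOTE itself is not implemented here.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}

# Hypothetical labels: 95 patients without the adverse event, 5 with it.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)
```

With these weights a misclassified rare event costs 19 times as much as a misclassified majority case, so the classifier can no longer minimize its loss by ignoring the minority class.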

Q19. For pathway enrichment analysis of a gene list from differential expression, which resources/tools are commonly used to interpret biological pathways?

  • KEGG, Reactome, and Gene Ontology (GO) enrichment tools
  • BLAST only
  • k-means and hierarchical clustering
  • Support vector machines

Correct Answer: KEGG, Reactome, and Gene Ontology (GO) enrichment tools
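
Many over-representation tools score a pathway with a hypergeometric (one-sided Fisher) test. The sketch below assumes that approach; the gene counts are invented and the function name is hypothetical.

```python
from math import comb

def enrichment_p_value(n_background, n_pathway, n_selected, n_overlap):
    """One-sided hypergeometric p-value for pathway over-representation.

    n_background: genes in the background set
    n_pathway:    background genes annotated to the pathway
    n_selected:   genes in the differential expression list
    n_overlap:    selected genes that belong to the pathway
    Returns P(overlap >= n_overlap) under random sampling without replacement.
    """
    upper = min(n_pathway, n_selected)
    total = comb(n_background, n_selected)
    tail = sum(
        comb(n_pathway, i) * comb(n_background - n_pathway, n_selected - i)
        for i in range(n_overlap, upper + 1)
    )
    return tail / total

# Invented example: 20 background genes, 5 in the pathway,
# 4 differentially expressed genes, 3 of which are in the pathway.
p_value = enrichment_p_value(20, 5, 4, 3)
```

For this toy example the p-value is 155/4845, about 0.032. Because many pathways are tested at once, the resulting p-values would normally also be corrected for multiple testing.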

Q20. Which strategy helps prevent overfitting when developing predictive models from high-dimensional omics data?

  • Using overly complex models with no regularization
  • Feature selection, cross-validation, regularization (e.g., LASSO) and external validation
  • Training and testing on the same dataset without partitioning
  • Ignoring preprocessing and normalization steps

Correct Answer: Feature selection, cross-validation, regularization (e.g., LASSO) and external validation
