Data mining in bioinformatics MCQs With Answers

This set of MCQs on data mining in bioinformatics is designed for M.Pharm students to build a firm foundation in computational techniques applied to biological and pharmaceutical data. The collection emphasizes concepts such as preprocessing of high-throughput data, supervised and unsupervised learning, feature selection, association rules, and evaluation metrics specific to gene expression, sequence, and proteomic analyses. Each question links data-mining methodology to a practical bioinformatics scenario (microarray/RNA-seq analysis, SNP and GWAS interpretation, pathway and network mining, or clinical data integration), so that students develop both the conceptual understanding and the applied reasoning needed for research, drug discovery, and translational studies.

Q1. What is the primary goal of data preprocessing in bioinformatics datasets such as microarray or RNA‑seq?

To increase the size of the dataset by adding synthetic samples
To remove noise, normalize measurements, and handle missing values for reliable downstream analysis
To convert all sequences into protein structures
To automatically annotate genes with functional terms

Correct Answer: To remove noise, normalize measurements, and handle missing values for reliable downstream analysis

Q2. Which distance or similarity metric is most appropriate when clustering gene expression profiles that require capturing linear correlation rather than absolute magnitude?

Euclidean distance
Manhattan distance
Pearson correlation
Hamming distance

Correct Answer: Pearson correlation
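
To see why correlation is preferred when profile shape matters more than magnitude, the sketch below compares Pearson correlation with Euclidean distance on three made-up expression profiles. The gene names and values are hypothetical and chosen only for illustration.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def euclidean(x, y):
    """Euclidean distance between two equal-length profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

# Hypothetical expression profiles across five samples (made-up numbers).
gene_a = [1.0, 2.0, 3.0, 2.0, 1.0]
gene_b = [10.0, 20.0, 30.0, 20.0, 10.0]  # same shape as gene_a, 10x the magnitude
gene_c = [3.0, 2.0, 1.0, 2.0, 3.0]       # similar magnitude, opposite shape

r_ab = pearson(gene_a, gene_b)    # close to +1: same pattern
r_ac = pearson(gene_a, gene_c)    # close to -1: opposite pattern
d_ab = euclidean(gene_a, gene_b)  # large, driven by the magnitude difference
d_ac = euclidean(gene_a, gene_c)  # small, despite the opposite pattern
```

Here Euclidean distance places gene_a nearer to gene_c, while Pearson correlation groups gene_a with gene_b, which is the co-regulated pattern a clustering of expression profiles usually aims to capture.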

Q3. In supervised classification of drug response from gene expression data, which technique reduces dimensionality while preserving variance and can be used prior to classification?

k-means clustering
Principal component analysis (PCA)
Apriori association mining
Hierarchical clustering

Correct Answer: Principal component analysis (PCA)

Q4. Which method is typically used to control the false discovery rate (FDR) when testing thousands of genes for differential expression?

Bonferroni correction only
Benjamini–Hochberg procedure
Leave-one-out cross-validation
Maximum likelihood estimation

Correct Answer: Benjamini–Hochberg procedure
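
As a minimal sketch of how the Benjamini–Hochberg adjustment works, the function below converts raw p-values into adjusted values (often called q-values). The five p-values are made up for illustration, and a real analysis would normally rely on an established statistics package rather than this hand-written version.

```python
def benjamini_hochberg(pvalues):
    """Return BH-adjusted p-values in the same order as the input."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value to the smallest so that the
    # adjusted values stay monotone in the rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# Hypothetical p-values for five genes (made-up numbers).
pvals = [0.001, 0.008, 0.012, 0.035, 0.20]
qvals = benjamini_hochberg(pvals)

bh_hits = sum(q <= 0.05 for q in qvals)                       # FDR control at 5%
bonferroni_hits = sum(p <= 0.05 / len(pvals) for p in pvals)  # family-wise control
```

On this toy input the BH procedure retains four genes at a 5% FDR, whereas the Bonferroni threshold retains only two, which illustrates why FDR control is favored when thousands of genes are tested.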

Q5. Which of the following is a wrapper method for feature selection commonly used in predictive modeling of biomarkers?

Correlation-based feature selection (filter)
Recursive feature elimination with cross-validated classifiers
PCA (unsupervised)
Variance thresholding

Correct Answer: Recursive feature elimination with cross-validated classifiers

Q6. In association rule mining applied to adverse drug reactions, which metric indicates how much more often the antecedent and consequent occur together than expected if they were independent?

Support
Confidence
Lift
p-value

Correct Answer: Lift
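
The three rule metrics can be computed directly from co-occurrence counts. The sketch below does this for a hypothetical rule "drug_x → rash"; the report data and item names are invented for illustration only.

```python
def rule_metrics(transactions, antecedent, consequent):
    """Return (support, confidence, lift) for the rule antecedent -> consequent."""
    n = len(transactions)
    n_a = sum(1 for t in transactions if antecedent <= t)
    n_c = sum(1 for t in transactions if consequent <= t)
    n_both = sum(1 for t in transactions if (antecedent | consequent) <= t)
    support = n_both / n               # P(A and C)
    confidence = n_both / n_a          # P(C | A)
    lift = confidence / (n_c / n)      # P(C | A) / P(C)
    return support, confidence, lift

# Hypothetical adverse-event reports, each a set of items (made-up data).
reports = [
    {"drug_x", "rash"},
    {"drug_x", "rash"},
    {"drug_x", "nausea"},
    {"drug_y", "rash"},
    {"drug_y", "nausea"},
    {"drug_y"},
    {"drug_x", "rash", "nausea"},
    {"drug_y", "headache"},
]

support, confidence, lift = rule_metrics(reports, {"drug_x"}, {"rash"})
```

A lift above 1 (here 1.5) means rash is reported with drug_x more often than independence would predict; a lift of 1 would indicate no association.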

Q7. Which clustering algorithm is most sensitive to initial seeds and may converge to local minima, often requiring multiple restarts for stable gene expression clustering?

Hierarchical agglomerative clustering
k-means clustering
DBSCAN
Affinity propagation

Correct Answer: k-means clustering

Q8. When evaluating a binary classifier for predicting patient responders, which metric is most informative when classes are highly imbalanced?

Accuracy
Precision, recall and F1-score
Mean squared error
Euclidean distance

Correct Answer: Precision, recall and F1-score
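
The following sketch shows why accuracy misleads on imbalanced data. It assumes a hypothetical cohort of 100 patients with 5 responders and a trivial classifier that always predicts "non-responder".

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 for labels coded as 0/1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    accuracy = (tp + tn) / len(pairs)
    # Guard against division by zero when no positives are predicted or present.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}

# Hypothetical cohort: 5 responders (1) and 95 non-responders (0).
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100  # a classifier that always predicts the majority class

metrics = binary_metrics(y_true, y_pred)
```

The majority-class predictor reaches 95% accuracy while its recall and F1-score are both zero, so it never identifies a single responder.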

Q9. Which technique is specifically used to identify co-expressed gene modules and relate them to external traits in systems biology?

Support vector machine
Weighted gene co-expression network analysis (WGCNA)
Apriori algorithm
Hidden Markov Models

Correct Answer: Weighted gene co-expression network analysis (WGCNA)

Q10. For sequence motif discovery in promoter regions associated with drug-regulated genes, which tool or approach is commonly employed?

MEME (Multiple EM for Motif Elicitation)
k-means clustering
PCA
t-test for differential expression

Correct Answer: MEME (Multiple EM for Motif Elicitation)

Q11. In GWAS data mining, which problem arises from testing millions of SNPs and is typically addressed by stringent multiple testing correction?

High homogeneity
Multiple hypothesis testing and increased type I error
Insufficient sample variance
Missing gene annotations

Correct Answer: Multiple hypothesis testing and increased type I error

Q12. Which machine-learning algorithm is known for good performance with high-dimensional biological data, built-in feature importance measures, and robustness to overfitting when tuned properly?

Decision tree (single)
Random forest
k-nearest neighbors
Naive Bayes

Correct Answer: Random forest

Q13. What is the main advantage of using cross-validation (e.g., k-fold) in model evaluation for biomarker discovery?

It increases the size of the training data permanently
It gives a more reliable estimate of generalization performance than a single train/test split and helps detect overfitting
It guarantees perfect prediction on new data
It eliminates the need for feature selection

Correct Answer: It gives a more reliable estimate of generalization performance than a single train/test split and helps detect overfitting
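
The mechanics of k-fold splitting can be sketched as below: every sample is used for testing exactly once and for training in the remaining folds. The function name is hypothetical, and the sketch omits shuffling and stratification, which are usually applied in practice.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous, near-equal folds."""
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0)
                  for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(0, start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size

# Example: 10 samples split into 3 folds.
folds = list(k_fold_indices(10, 3))
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
```

In biomarker work, any feature selection step should be repeated inside each training fold; selecting features on the full dataset first leaks information from the test folds into the model.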

Q14. Which normalization method is commonly used for RNA‑seq count data prior to differential expression analysis to account for library size?

DESeq size-factor normalization, or depth-scaled units such as RPKM/FPKM and TPM
Pearson correlation
z-score normalization across genes only
Min-max scaling to [0,1]

Correct Answer: DESeq size-factor normalization, or depth-scaled units such as RPKM/FPKM and TPM
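
The idea behind DESeq-style size factors is the median-of-ratios approach: each sample is compared with a per-gene geometric mean, and the median ratio becomes that sample's scaling factor. The code below is a simplified sketch of that idea on made-up counts, not the package's own implementation.

```python
import math
from statistics import median

def size_factors(counts):
    """Median-of-ratios size factors.

    counts: list of genes, each a list of raw counts (one per sample).
    """
    n_samples = len(counts[0])
    # Only genes with nonzero counts in every sample have a usable geometric mean.
    usable = [g for g in counts if all(c > 0 for c in g)]
    log_geo_means = [sum(math.log(c) for c in g) / n_samples for g in usable]
    factors = []
    for j in range(n_samples):
        log_ratios = [math.log(g[j]) - lgm
                      for g, lgm in zip(usable, log_geo_means)]
        factors.append(math.exp(median(log_ratios)))
    return factors

# Hypothetical counts: sample 2 was sequenced about twice as deeply as sample 1.
raw_counts = [
    [10, 20],
    [100, 200],
    [50, 100],
    [0, 3],     # contains a zero, so it is excluded from the factor estimate
]

sf = size_factors(raw_counts)
normalized = [[c / f for c, f in zip(gene, sf)] for gene in raw_counts]
```

After dividing by the size factors, the first three genes have equal normalized counts in both samples, so the twofold difference in sequencing depth no longer looks like differential expression.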

Q15. In clustering gene expression data, which linkage method tends to produce compact, spherical clusters and is sensitive to outliers?

Single linkage
Complete linkage
Average linkage
Ward’s method

Correct Answer: Ward’s method

Q16. Which algorithm is specifically designed for detecting conserved domains or profile matches in protein families and is widely used in sequence annotation?

BLAST (basic local alignment search tool)
HMMER (using Hidden Markov Models)
PCA
k-means clustering

Correct Answer: HMMER (using Hidden Markov Models)

Q17. In expression analysis, what does a volcano plot combine to help prioritize candidate genes?

Log2 fold change on the x-axis and statistical significance (-log10 p-value) on the y-axis
Gene length vs GC content
Principal component 1 vs principal component 2
Expression level vs transcript length

Correct Answer: Log2 fold change on the x-axis and statistical significance (-log10 p-value) on the y-axis
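
The coordinates of a volcano plot can be sketched as follows. The function names, cutoffs (a twofold change and p ≤ 0.05) and input values are illustrative assumptions, not fixed conventions.

```python
import math

def volcano_point(mean_treated, mean_control, p_value):
    """Return (log2 fold change, -log10 p-value) for one gene."""
    return math.log2(mean_treated / mean_control), -math.log10(p_value)

def is_candidate(log2_fc, neg_log10_p, fc_cutoff=1.0, p_cutoff=0.05):
    """Flag genes that pass both the effect-size and the significance cutoff."""
    return abs(log2_fc) >= fc_cutoff and neg_log10_p >= -math.log10(p_cutoff)

# Hypothetical genes: (mean in treated, mean in control, p-value).
strong = volcano_point(8.0, 2.0, 0.001)       # large change, significant
small_effect = volcano_point(5.0, 4.0, 0.001) # significant but small change
not_signif = volcano_point(8.0, 2.0, 0.2)     # large change, not significant

flags = [is_candidate(*pt) for pt in (strong, small_effect, not_signif)]
```

Only the first gene lands in the upper outer region of the plot, where both effect size and significance are high; in practice the y-axis often uses adjusted p-values.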

Q18. Which approach is most appropriate to handle class imbalance when training a classifier to predict rare adverse drug events?

Discard minority class samples
Use resampling techniques (SMOTE or undersampling) and cost-sensitive learning
Always predict the majority class
Remove all correlated features

Correct Answer: Use resampling techniques (SMOTE or undersampling) and cost-sensitive learning
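
One simple form of cost-sensitive learning weights each class inversely to its frequency, so that errors on the rare class cost more during training. The sketch below computes such weights for a hypothetical dataset with 5 adverse events among 100 records; the function name is an assumption made for illustration.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * c) for cls, c in counts.items()}

# Hypothetical labels: 1 = adverse event (rare), 0 = no event.
labels = [0] * 95 + [1] * 5
weights = balanced_class_weights(labels)

# Total weight contributed by each class after reweighting.
total_weight = {cls: weights[cls] * labels.count(cls) for cls in weights}
```

With these weights each class contributes the same total weight to the loss, so a classifier can no longer minimize its training error by ignoring the rare events.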

Q19. For pathway enrichment analysis of a gene list from differential expression, which resources/tools are commonly used to interpret biological pathways?

KEGG, Reactome, and Gene Ontology (GO) enrichment tools
BLAST only
k-means and hierarchical clustering
Support vector machines

Correct Answer: KEGG, Reactome, and Gene Ontology (GO) enrichment tools
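
Many over-representation tools rest on a hypergeometric (one-sided Fisher) test that asks whether a pathway appears in the gene list more often than chance would predict. The sketch below shows that calculation with made-up numbers; it does not reproduce any specific tool's implementation.

```python
from math import comb

def enrichment_pvalue(background, in_pathway, list_size, overlap):
    """P(X >= overlap) under the hypergeometric distribution.

    background: number of genes in the background set
    in_pathway: background genes annotated to the pathway
    list_size:  number of genes in the differential expression list
    overlap:    list genes that are annotated to the pathway
    """
    upper = min(in_pathway, list_size)
    favorable = sum(
        comb(in_pathway, i) * comb(background - in_pathway, list_size - i)
        for i in range(overlap, upper + 1)
    )
    return favorable / comb(background, list_size)

# Hypothetical example: 20 background genes, 5 in the pathway,
# 4 genes in the list, 3 of which belong to the pathway.
p = enrichment_pvalue(20, 5, 4, 3)
```

Because many pathways are tested at once, the resulting p-values are normally corrected for multiple testing before any pathway is reported as enriched.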

Q20. Which strategy helps prevent overfitting when developing predictive models from high-dimensional omics data?

Using overly complex models with no regularization
Feature selection, cross-validation, regularization (e.g., LASSO) and external validation
Training and testing on the same dataset without partitioning
Ignoring preprocessing and normalization steps

Correct Answer: Feature selection, cross-validation, regularization (e.g., LASSO) and external validation


G S Sachin

I am a Registered Pharmacist under the Pharmacy Act, 1948, and the founder of PharmacyFreak.com. I hold a Bachelor of Pharmacy degree from Rungta College of Pharmaceutical Science and Research. With a strong academic foundation and practical knowledge, I am committed to providing accurate, easy-to-understand content to support pharmacy students and professionals. My aim is to make complex pharmaceutical concepts accessible and useful for real-world application.

Mail: Sachin@pharmacyfreak.com
