Sequence data collection and storage MCQs With Answers

Introduction: This quiz collection focuses on sequence data collection and storage, a crucial area in bioinformatics and computational biotechnology for M.Pharm students. It covers how nucleotide and protein sequences are generated, quality-controlled, formatted, annotated, archived and shared. Questions address sequencing platforms (short- and long-read), common file formats (FASTA, FASTQ, SAM/BAM/CRAM), metadata standards, public repositories, and practical concerns such as compression, checksums, data provenance, privacy and laboratory information management systems. Designed to reinforce classroom learning and practical skills, these MCQs emphasize real-world considerations for managing sequence data in pharmaceutical research and regulatory settings.

Q1. Which file format is primarily used to store raw sequencing reads together with their per-base quality scores?

  • FASTA
  • GFF
  • FASTQ
  • BAM

Correct Answer: FASTQ
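
Worked example: a FASTQ record conventionally occupies four lines (a header starting with "@", the bases, a separator starting with "+", and one quality character per base). The Python sketch below reads that layout from any text stream. It is a minimal illustration that assumes unwrapped four-line records; the names FastqRecord and read_fastq are made up for this example and do not come from any library.

```python
import io
from typing import Iterator, NamedTuple, TextIO


class FastqRecord(NamedTuple):
    name: str
    sequence: str
    quality: str


def read_fastq(handle: TextIO) -> Iterator[FastqRecord]:
    """Yield records from a four-line-per-record FASTQ stream.

    Iteration stops at end of file or at the first empty header line.
    """
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return
        sequence = handle.readline().rstrip("\n")
        separator = handle.readline().rstrip("\n")
        quality = handle.readline().rstrip("\n")
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"Malformed FASTQ record near {header!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"Sequence/quality length mismatch in {header!r}")
        yield FastqRecord(header[1:], sequence, quality)


# Two tiny in-memory records stand in for a real file.
example = io.StringIO("@read1\nACGT\n+\nIIII\n@read2\nGGCA\n+\n!!5I\n")
records = list(read_fastq(example))
for record in records:
    print(record.name, record.sequence, record.quality)
```

The length check reflects the pairing that the question describes: every base must have exactly one quality character.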

Q2. What does the quality score in a FASTQ file (Phred score) represent?

  • The position of the read in the sequencing run
  • The estimated probability that a base call is incorrect, encoded on a logarithmic scale
  • The GC content of the read
  • The length of homopolymer runs

Correct Answer: The estimated probability that a base call is incorrect, encoded on a logarithmic scale
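
Worked example: a Phred score Q relates to the error probability P by Q = -10 * log10(P), so Q20 corresponds to a 1 in 100 chance of a wrong base call and Q40 to 1 in 10,000. The sketch below decodes quality characters and converts them to probabilities. It assumes the Phred+33 ASCII offset used by most current FASTQ files; the function names are illustrative.

```python
from typing import List


def decode_quality(quality: str, offset: int = 33) -> List[int]:
    """Convert FASTQ quality characters to Phred scores (Phred+33 assumed)."""
    return [ord(char) - offset for char in quality]


def phred_to_error_probability(score: int) -> float:
    """Invert Q = -10 * log10(P) to recover the error probability P."""
    return 10 ** (-score / 10)


scores = decode_quality("!5I")
print(scores)  # [0, 20, 40]

probabilities = [phred_to_error_probability(q) for q in scores]
for q, p in zip(scores, probabilities):
    print(f"Q{q}: error probability {p:g}")
```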

Q3. Which public repository is primarily used to deposit raw high-throughput sequencing data and is part of the International Nucleotide Sequence Database Collaboration?

  • Protein Data Bank (PDB)
  • Sequence Read Archive (SRA)
  • UniProt
  • RefSeq

Correct Answer: Sequence Read Archive (SRA)

Q4. Which of the following formats stores sequence alignments and can be indexed for random access?

  • FASTQ
  • SAM/BAM
  • TXT
  • FASTA

Correct Answer: SAM/BAM

Q5. CRAM format is often preferred over BAM because it:

  • Removes all read names
  • Is a human-readable plain text format
  • Provides more efficient compression by referring to a reference genome
  • Only stores consensus sequences

Correct Answer: Provides more efficient compression by referring to a reference genome
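
Worked example: the idea behind reference-based compression can be shown with a toy encoder that stores only where a read starts on the reference and which bases differ from it. This is a conceptual sketch only, not the actual CRAM codec (which also handles indels, quality scores, read names and entropy coding); encode_read and decode_read are hypothetical helper names.

```python
from typing import List, Tuple

EncodedRead = Tuple[int, int, List[Tuple[int, str]]]


def encode_read(reference: str, position: int, read: str) -> EncodedRead:
    """Store a read as (start, length, mismatches) relative to the reference."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("Read extends past the end of the reference")
    mismatches = [
        (offset, base)
        for offset, (ref_base, base) in enumerate(zip(window, read))
        if ref_base != base
    ]
    return position, len(read), mismatches


def decode_read(reference: str, encoded: EncodedRead) -> str:
    """Rebuild the read by copying the reference and applying mismatches."""
    position, length, mismatches = encoded
    bases = list(reference[position:position + length])
    for offset, base in mismatches:
        bases[offset] = base
    return "".join(bases)


reference = "ACGTACGTACGTACGT"
read = "TACGAACG"  # matches the reference at position 3 except for one base
encoded = encode_read(reference, 3, read)
print(encoded)  # (3, 8, [(4, 'A')])
restored = decode_read(reference, encoded)
```

Because most reads match the reference closely, the mismatch list is usually short. The same property means the original reference must be kept available to decode the data.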

Q6. When submitting sequence data to public databases, which piece of information is considered essential metadata?

  • Sequencer operator’s home address
  • Library construction method and sample source
  • Preferred file compression algorithm
  • Name of the laboratory instrument vendor only

Correct Answer: Library construction method and sample source

Q7. Which checksum algorithm is commonly used to verify integrity of downloaded sequence files (e.g., from SRA or GenBank)?

  • ROT13
  • SHA-512
  • MD5
  • Base64

Correct Answer: MD5
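
Worked example: integrity checking compares the MD5 digest of the downloaded file with the digest published alongside it. The sketch below uses Python's standard hashlib module and reads the file in chunks so that large sequence files need not fit in memory. The temporary file and its contents are placeholders, and the "published" digest is computed locally only to keep the example self-contained.

```python
import hashlib
import os
import tempfile


def md5_of_file(path: str, chunk_size: int = 1024 * 1024) -> str:
    """Return the hexadecimal MD5 digest of a file, read in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_md5(path: str, expected_hex: str) -> bool:
    """Compare a file's digest with a published checksum string."""
    return md5_of_file(path) == expected_hex.strip().lower()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq")
    with open(path, "wb") as out:
        out.write(b"@read1\nACGT\n+\nIIII\n")

    # Stand-in for the checksum a repository would publish.
    published = md5_of_file(path)
    intact = verify_md5(path, published)

    # Simulate corruption by appending a stray byte.
    with open(path, "ab") as out:
        out.write(b"X")
    still_matches = verify_md5(path, published)

print(intact, still_matches)  # True False
```

MD5 is adequate for detecting accidental corruption during transfer; it is not considered secure against deliberate tampering.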

Q8. Adapter contamination in sequencing reads is best removed by which preprocessing step?

  • Indexing
  • Trimming
  • Annotation
  • Assembly

Correct Answer: Trimming
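
Worked example: trimming removes the adapter and everything after it from the 3' end of a read, and cuts the quality string at the same point. The sketch below handles only exact matches, including an adapter that is cut off by the end of the read. Dedicated trimmers such as Cutadapt or Trimmomatic also tolerate mismatches, which this simplified version does not. The adapter string and the trim_adapter helper are assumptions for illustration.

```python
from typing import Tuple


def trim_adapter(
    sequence: str, quality: str, adapter: str, min_overlap: int = 3
) -> Tuple[str, str]:
    """Remove a 3' adapter (exact match only) and everything after it."""
    cut = sequence.find(adapter)
    if cut == -1:
        # Look for a partial adapter: the read's end matching the adapter's start.
        longest = min(len(adapter) - 1, len(sequence))
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                cut = len(sequence) - overlap
                break
    if cut == -1:
        return sequence, quality
    return sequence[:cut], quality[:cut]


ADAPTER = "AGATCGGAAGAGC"  # illustrative adapter sequence

full = trim_adapter("ACGTACGT" + ADAPTER + "TT", "I" * 23, ADAPTER)
partial = trim_adapter("ACGTACGTAGATC", "I" * 13, ADAPTER)
clean = trim_adapter("ACGTACGT", "I" * 8, ADAPTER)
print(full, partial, clean)
```

The min_overlap parameter is a trade-off: a small value removes more adapter remnants but also trims genuine bases that happen to resemble the start of the adapter.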

Q9. Which standard or guideline is commonly used to describe sequence metadata to improve reproducibility and data reuse?

  • MIxS (Minimum Information about any (x) Sequence)
  • HTML5
  • ISO-9001
  • SMTP

Correct Answer: MIxS (Minimum Information about any (x) Sequence)

Q10. Paired-end sequencing differs from single-end sequencing primarily because paired-end reads:

  • Are always longer than single-end reads
  • Consist of two reads from opposite ends of the same DNA fragment
  • Contain quality scores while single-end does not
  • Do not require alignment

Correct Answer: Consist of two reads from opposite ends of the same DNA fragment

Q11. Which of the following describes the primary difference between raw and processed sequence data?

  • Raw data has been aligned; processed data is unaligned
  • Raw data is instrument output without significant transformation; processed data has undergone QC, trimming, alignment or assembly
  • Processed data is always larger in file size than raw data
  • Raw data cannot be stored in public repositories

Correct Answer: Raw data is instrument output without significant transformation; processed data has undergone QC, trimming, alignment or assembly

Q12. Which laboratory information system feature is most important for provenance tracking of sequence datasets in a pharmaceutical lab?

  • Automated invoicing module
  • Versioned sample and workflow audit trails
  • Graphical color themes
  • Email notification frequency settings

Correct Answer: Versioned sample and workflow audit trails

Q13. Which compression tool is commonly applied to FASTQ files to reduce storage while maintaining compatibility with many bioinformatics tools?

  • gzip
  • tar
  • zip (with proprietary extensions)
  • 7zip exclusive format

Correct Answer: gzip
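
Worked example: gzip-compressed FASTQ (usually named with a .fastq.gz suffix) can be written and read as text without decompressing to disk first. The sketch below uses Python's standard gzip module on a small synthetic file; the repeated record is artificial, so the compression ratio it shows is far better than real data would achieve.

```python
import gzip
import os
import tempfile

# Artificial, highly repetitive content used only for demonstration.
fastq_text = "@read1\nACGT\n+\nIIII\n" * 1000

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "reads.fastq.gz")

    # "wt" opens the compressed file in text mode for writing.
    with gzip.open(path, "wt") as out:
        out.write(fastq_text)
    compressed_size = os.path.getsize(path)

    # "rt" streams decompressed text line by line.
    with gzip.open(path, "rt") as handle:
        line_count = sum(1 for _ in handle)

    with gzip.open(path, "rt") as handle:
        restored = handle.read()

original_size = len(fastq_text.encode())
read_count = line_count // 4  # four lines per FASTQ record
print(original_size, compressed_size, read_count)
```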

Q14. Ethical considerations when sharing human sequencing data often require which additional protection?

  • Publishing raw reads with full patient identifiers
  • De-identification and controlled-access repository deposit
  • Removal of quality scores only
  • Conversion of FASTQ to plain text CSV

Correct Answer: De-identification and controlled-access repository deposit

Q15. Which accession identifier prefix is commonly associated with GenBank nucleotide sequence records?

  • PDB
  • SAM
  • A one- or two-letter prefix followed by digits, as in MN123456
  • UNI

Correct Answer: A one- or two-letter prefix followed by digits, as in MN123456
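
Worked example: accession strings such as MN123456 follow a letters-then-digits layout (one letter plus five digits, or two letters plus six or eight digits), optionally followed by a version such as ".1". Identifiers with a two-letter prefix and an underscore, such as NC_045512.2, are RefSeq accessions. The sketch below classifies a string by pattern only. It is a simplified assumption that does not cover whole-genome shotgun or protein accessions, and it does not check that the record actually exists.

```python
import re

# Simplified patterns; optional ".N" suffix is the sequence version.
GENBANK_NUCLEOTIDE = re.compile(
    r"^(?:[A-Z]\d{5}|[A-Z]{2}\d{6}|[A-Z]{2}\d{8})(?:\.\d+)?$"
)
REFSEQ = re.compile(r"^[A-Z]{2}_\d+(?:\.\d+)?$")


def classify_accession(accession: str) -> str:
    """Classify an accession string by its format alone."""
    candidate = accession.strip().upper()
    if REFSEQ.match(candidate):
        return "RefSeq"
    if GENBANK_NUCLEOTIDE.match(candidate):
        return "GenBank"
    return "unknown"


for example in ["MN123456", "MN908947.3", "U49845", "NC_045512.2", "1ABC"]:
    print(example, "->", classify_accession(example))
```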

Q16. Indexing a BAM file (creating a .bai) is important because it:

  • Makes the file human-readable
  • Allows efficient retrieval of alignments from specific genomic regions
  • Encrypts the data for security
  • Converts it to FASTQ

Correct Answer: Allows efficient retrieval of alignments from specific genomic regions

Q17. Which of the following best describes “data provenance” in the context of sequence data management?

  • A log of software UI color changes
  • Record of the origin, processing steps, parameters and versions used to generate the data
  • A list of publications citing the dataset only
  • Random metadata unrelated to the sequencing experiment

Correct Answer: Record of the origin, processing steps, parameters and versions used to generate the data
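
Worked example: a provenance record can be as simple as a structured document that ties a dataset to its origin and to each processing step, with tool versions and parameters. The sketch below builds such a record as JSON. The field names, sample identifier, tool names and versions are all hypothetical placeholders, not a published schema.

```python
import hashlib
import json
from datetime import datetime, timezone
from typing import Any, Dict, List


def make_provenance_record(
    sample_id: str, raw_data: bytes, steps: List[Dict[str, Any]]
) -> Dict[str, Any]:
    """Bundle origin, input checksum and processing steps into one record."""
    return {
        "sample_id": sample_id,
        "raw_data_sha256": hashlib.sha256(raw_data).hexdigest(),
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "steps": steps,
    }


record = make_provenance_record(
    "SAMPLE-001",  # hypothetical sample identifier
    b"@read1\nACGT\n+\nIIII\n",
    [
        # Hypothetical tools; real records would name the actual software.
        {"tool": "example-trimmer", "version": "0.0.0",
         "parameters": {"min_length": 30}},
        {"tool": "example-aligner", "version": "0.0.0",
         "parameters": {"reference": "example_reference.fa"}},
    ],
)

serialized = json.dumps(record, indent=2, sort_keys=True)
print(serialized)
```

Storing the checksum of the raw input alongside the steps lets a later reader confirm that a processed file was derived from the exact data described.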

Q18. Which ontology or controlled vocabulary would help standardize sample attributes like organism, tissue, and disease state?

  • Gene Ontology (GO)
  • Medical Subject Headings (MeSH) and ontologies like EFO or Uberon
  • JPEG
  • SMTP

Correct Answer: Medical Subject Headings (MeSH) and ontologies like EFO or Uberon

Q19. Which practice reduces the chance of accidental loss when storing large sequencing datasets?

  • Keeping a single copy on the local instrument only
  • Implementing automated off-site backups and checksums
  • Uploading to social media platforms
  • Renaming files daily without tracking

Correct Answer: Implementing automated off-site backups and checksums

Q20. Which factor is most important when choosing cloud storage for long-term archiving of sequence data in a regulated pharmaceutical environment?

  • Lowest possible latency for streaming videos
  • Compliance with regulatory standards (e.g., HIPAA/GxP), encryption, and auditability
  • Availability of free emoticons
  • Support for legacy proprietary office formats only

Correct Answer: Compliance with regulatory standards (e.g., HIPAA/GxP), encryption, and auditability
