The advancement of next-generation sequencing (NGS) has revolutionised biological research, allowing scientists to delve deep into genomics and transcriptomics. The journey from raw sequencing reads to meaningful biological insights is a multi-step process, starting with the crucial first step of quality checking raw sequencing reads.
Raw sequencing data can be prone to errors due to various factors such as machine imperfections, chemical reactions or even sample quality. Ensuring the integrity and accuracy of the initial data is paramount for the robustness of subsequent analyses and is an integral (and surprisingly time-consuming) part of INSIGENe’s data analysis pipeline. In this article, we will discuss the various QC parameters used to determine the quality of your data before pre-processing.
In high-throughput DNA and RNA sequencing, FASTQ is the standard file format for storing and sharing sequencing read data. It is an extension of the FASTA format that additionally stores a numeric quality score (Phred score) for each nucleotide in a sequence.
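To make the encoding concrete, here is a minimal Python sketch of a four-line FASTQ record and the conversion from quality characters to Phred scores, assuming the Phred+33 ASCII encoding used by modern Illumina platforms:

```python
# A FASTQ record is four lines; the fourth encodes one Phred score per base.
record = [
    "@read1",   # line 1: sequence identifier
    "ACGT",     # line 2: nucleotide sequence
    "+",        # line 3: separator
    "IIC!",     # line 4: quality string, one character per base
]

def phred_scores(quality_string, offset=33):
    """Convert an ASCII-encoded quality string to a list of Phred scores."""
    return [ord(c) - offset for c in quality_string]

print(phred_scores(record[3]))  # 'I' -> Q40, 'C' -> Q34, '!' -> Q0
```

Older Illumina pipelines used a Phred+64 offset, so it is worth confirming the encoding of your data before decoding.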
Tools such as FastQC are employed to evaluate FASTQ files for sequence quality and identify anomalies. Let’s go through the various metrics you can look at to determine the quality of your raw sequencing data:
Per base sequencing quality
This refers to the quality scores assigned to each nucleotide position in a sequencing read.
Purpose: It helps to assess the reliability of the base calls at each position along the length of the read. A high per base sequencing quality indicates high confidence in the accuracy of the nucleotide assignment at that position, while a lower score suggests potential sequencing errors or uncertainties. The most common reason for warnings and failures is a general degradation of quality over the duration of long runs.
Good Quality: A Phred score (Q score) of 30 or higher (≥Q30) is typically considered good quality.
Acceptable Quality: Q20 is often acceptable for some applications, although higher scores are preferred for critical analyses.
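A per base quality profile of the kind FastQC plots can be computed by averaging the Phred score at each position across all reads. A simple sketch, assuming Phred+33 quality strings of equal length (as in a fixed-length Illumina run):

```python
def per_base_mean_quality(quality_strings):
    """Mean Phred score at each read position (Phred+33, equal-length reads)."""
    n = len(quality_strings)
    length = len(quality_strings[0])
    totals = [0] * length
    for q in quality_strings:
        for i, c in enumerate(q):
            totals[i] += ord(c) - 33
    return [t / n for t in totals]

quals = ["IIII", "III#", "II##"]        # quality degrades toward the read end
means = per_base_mean_quality(quals)
below_q30 = [i for i, m in enumerate(means) if m < 30]  # positions to inspect
```

A run of trailing positions below Q30 is the typical signature of quality degradation over a long run, and can guide where to trim.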
Per tile sequence quality
This involves assessing the quality scores for each tile on the sequencing instrument’s flow cell.
Purpose: It helps identify variations in sequencing quality across different regions of the flow cell. Non-uniformities in per tile sequencing quality might indicate technical issues, such as problems with the sequencing chemistry or instrument performance for specific tiles. Variation in Phred scores may also appear when a flow cell is overloaded, or when there are smudges on the flow cell or debris inside a flow cell lane.
Good Quality: Uniform per tile sequencing quality across the flow cell.
Bad Quality: A tile whose mean Phred score is more than 5 below the mean for that base across all tiles, or consistently low quality scores for specific tiles.
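Per tile metrics rely on the tile number embedded in each read name. A hedged sketch, assuming the Illumina (CASAVA 1.8+) read-name convention `instrument:run:flowcell:lane:tile:x:y` and Phred+33 quality strings:

```python
from collections import defaultdict

def tile_of(read_id):
    """Extract the tile number from an Illumina-style read name."""
    return read_id.lstrip("@").split(":")[4]

def mean_quality_per_tile(records):
    """records: iterable of (read_id, quality_string) pairs, Phred+33."""
    totals = defaultdict(lambda: [0, 0])          # tile -> [score sum, base count]
    for read_id, qual in records:
        entry = totals[tile_of(read_id)]
        entry[0] += sum(ord(c) - 33 for c in qual)
        entry[1] += len(qual)
    return {tile: s / n for tile, (s, n) in totals.items()}

records = [("@M01:1:FC1:1:1101:5:6", "III"),     # tile 1101, all Q40
           ("@M01:1:FC1:1:2209:7:8", "###")]     # tile 2209, all Q2
per_tile = mean_quality_per_tile(records)
```

A tile whose mean sits well below the others (here 2209) would be the one to investigate for debris or an imaging problem.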
Per sequence quality scores
This evaluates the overall quality of each entire sequencing read.
Purpose: It provides a summary of the quality scores for an entire read. This metric is useful for gauging the overall reliability of a sequencing run or a set of reads. Low per sequence quality may indicate issues such as poor base calling, adapter contamination, or other factors affecting the entire read. Often, a subset of sequences will have universally poor quality because they are poorly imaged (e.g. at the edge of the field of view). However, these should represent only a small percentage of the total sequences (<0.2% error rate) for a FASTQ file passing this criterion.
Good quality: High average per sequence quality is desirable. For Illumina sequencing, an average Q score of 30 or higher is often considered good.
Bad Quality: A substantial decrease in average per sequence quality may indicate issues with the entire dataset.
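Per sequence quality reduces each read to a single number: its mean Phred score. A minimal sketch, again assuming Phred+33, with an illustrative Q30 cut-off:

```python
def mean_read_quality(quality_string):
    """Mean Phred score of one read (Phred+33)."""
    return sum(ord(c) - 33 for c in quality_string) / len(quality_string)

reads = {"read1": "IIII", "read2": "I#I#"}       # read2 averages only Q21
passing = [name for name, q in reads.items() if mean_read_quality(q) >= 30]
```

The distribution of these per-read means is what FastQC histograms; a second peak at low quality flags a poorly imaged subset of reads.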
Per Base Sequence Content
This assesses the distribution of individual nucleotides (A, T, G, C) at each position along the length of a sequencing read.
Purpose: It helps identify position-specific biases or errors that might affect downstream analysis which may be indicative of issues with the sequencing chemistry, adapter contamination, or other technical artifacts.
Good quality: A relatively stable distribution of nucleotides across positions is considered good quality. FastQC issues a warning if the difference between A and T, or between G and C, is greater than 10% at any position, and a failure if the difference is greater than 20%.
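The check behind this metric can be sketched as a per-position base tally followed by the FastQC-style A-vs-T and G-vs-C comparison, assuming equal-length reads:

```python
def base_content(sequences):
    """Fraction of each base at every position (equal-length reads assumed)."""
    n, length = len(sequences), len(sequences[0])
    counts = [{"A": 0, "C": 0, "G": 0, "T": 0} for _ in range(length)]
    for seq in sequences:
        for i, base in enumerate(seq):
            if base in counts[i]:
                counts[i][base] += 1
    return [{b: c / n for b, c in pos.items()} for pos in counts]

content = base_content(["ACGT", "TGCA", "ACGT", "TGCA"])  # perfectly balanced
# FastQC-style check: warn when A vs T (or G vs C) differ by >10 points
warn = any(abs(p["A"] - p["T"]) > 0.10 or abs(p["G"] - p["C"]) > 0.10
           for p in content)
```

In real libraries a modest bias in the first ~10 bases is common (random-priming artefacts), so a warning here does not always mean the data are unusable.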
Per sequence GC content
GC content represents the percentage of guanine (G) and cytosine (C) bases in the entire sequence.
Purpose: Examining GC content provides a broader overview of the nucleotide composition of the entire read. Sudden shifts in GC content can indicate biases introduced during library preparation, PCR amplification, or other stages of the sequencing process.
Good quality: Normal distribution of GC content where the central peak corresponds to the overall GC content of the underlying genome is expected from a normal random library.
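The per-read values that make up this distribution are straightforward to compute; a sketch:

```python
def gc_content(seq):
    """Fraction of G and C bases in a read."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

reads = ["ACGT", "GGCC", "ATAT"]
gc_values = [gc_content(r) for r in reads]   # histogram these per-read values
```

A bimodal histogram of `gc_values` (two peaks rather than one) is a classic sign of contamination from a second organism with a different genomic GC content.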
Overrepresented k-mers
These are short DNA sequences of length ‘k’ that appear more frequently than expected in the dataset.
Purpose: Identifying and addressing overrepresented k-mers is essential for detecting potential contaminants, adapter sequences or PCR artifacts in the sequencing data. These sequences might not align to the reference genome and could affect downstream analyses. However, overrepresented k-mers can also occur when analysing small RNA libraries, such as microRNAs, small interfering RNA (siRNAs), and other regulatory RNAs. These can be accounted for with careful pre-processing and normalisation of data.
Good quality: A low number of overrepresented k-mers is desirable.
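Counting k-mers is a sliding-window tally over every read; a minimal sketch:

```python
from collections import Counter

def kmer_counts(sequences, k):
    """Count every length-k substring across all reads."""
    counts = Counter()
    for seq in sequences:
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts

# In this toy read, "ACGT" repeats far more often than a random
# library would predict, so it would be flagged as overrepresented.
counts = kmer_counts(["ACGTACGTACGT"], k=4)
```

Tools like FastQC compare observed counts against the expectation for a random library; a large observed/expected ratio points at adapters, contaminants, or (legitimately) small RNA species.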
Ambiguous bases (N’s)
If a sequencer is unable to make a base call with sufficient confidence, it will normally substitute an N rather than a conventional base.
Purpose: N’s represent positions in the sequence where the base cannot be confidently determined. Sequencing reads with few, or no ambiguous bases are preferred for downstream analyses. High levels of N’s can introduce uncertainty into variant calling and other analyses. It is not unusual to see a very low proportion of Ns appearing in a sequence, especially nearer the end of a sequence.
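Screening for reads with excessive N content is a one-liner per read; a sketch with an illustrative 10% threshold (the cut-off is a judgment call, not a fixed standard):

```python
def n_fraction(seq):
    """Fraction of ambiguous (N) base calls in a read."""
    return seq.upper().count("N") / len(seq)

reads = ["ACGTN", "ACGTA", "NNNTA"]
flagged = [r for r in reads if n_fraction(r) > 0.10]   # threshold is illustrative
```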
Sequence duplication levels
This measures the proportion of duplicated sequences in the dataset.
Purpose: In a diverse library, most sequences will occur only once in the final set. A low level of duplication may indicate a high level of coverage of the target sequence, but a high level of duplication is more likely to indicate some kind of enrichment bias such as PCR over-amplification during library preparation.
Good quality: Low levels of duplicated reads are desirable.
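A simple exact-duplicate measure can be sketched as follows (real tools such as FastQC estimate this from a subsample, and alignment-based deduplicators also use mapping positions, but the idea is the same):

```python
from collections import Counter

def duplication_fraction(sequences):
    """Fraction of reads whose exact sequence occurs more than once."""
    counts = Counter(sequences)
    duplicated_reads = sum(c for c in counts.values() if c > 1)
    return duplicated_reads / len(sequences)

reads = ["ACGT", "ACGT", "TTTT", "GGGG"]
print(duplication_fraction(reads))  # 0.5: two of the four reads are duplicates
```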
Sequence length distribution
This refers to the number of nucleotides in a sequencing read.
Purpose: Uniform sequence lengths simplify downstream analysis, such as alignment and assembly. Deviations in length can indicate issues with library preparation or sequencing, impacting data reliability. However, different sequencing platforms may vary in this aspect, so it is best to keep in mind the technology-specific expectations.
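Checking length uniformity amounts to building a histogram of read lengths; a sketch:

```python
from collections import Counter

def length_distribution(sequences):
    """Histogram mapping read length -> number of reads."""
    return Counter(len(s) for s in sequences)

dist = length_distribution(["ACGT", "ACGT", "ACG"])
uniform = len(dist) == 1   # True only when every read has the same length
```

For fixed-length Illumina runs `uniform` should hold before trimming; long-read platforms such as Oxford Nanopore naturally produce a broad distribution, so non-uniformity is expected there.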
Presence of adapters
During library preparation, adapters are added to DNA or RNA samples to facilitate the attachment of the fragments to the sequencing platform. The adapters provide priming sites for amplification and sequencing. Unincorporated adapters or adapter dimers should be removed during the library purification step.
Purpose: Before pre-processing, detecting adapters helps assess the quality of the sequencing run and determine if there was successful removal of adapters during the library preparation phase. After pre-processing (which includes steps like adapter trimming), analysts perform a second check to confirm that all adapters have been effectively removed.
Good quality: Before pre-processing, data should show minimal to no presence of adapter sequences. After pre-processing, the data should be free from adapter sequences, confirming the successful removal of any residual adapter contamination during data cleaning.
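A crude exact-match screen for adapter read-through can be sketched as below. The adapter shown is the widely used Illumina TruSeq adapter prefix; substitute the sequence for your own library-prep kit, and note that production trimmers (e.g. Trim Galore, cutadapt) also handle partial and mismatched adapter hits, which this sketch does not:

```python
ADAPTER = "AGATCGGAAGAGC"   # common Illumina adapter prefix; kit-dependent

def adapter_rate(sequences, adapter=ADAPTER):
    """Fraction of reads containing an exact copy of the adapter."""
    return sum(1 for s in sequences if adapter in s) / len(sequences)

reads = ["ACGTAGATCGGAAGAGCACAC",   # adapter read-through
         "ACGTACGTACGTACGTACGTA"]
print(adapter_rate(reads))  # 0.5
```

Run before and after trimming: the rate should be near zero afterwards, confirming successful adapter removal.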
Using a combination of, or all of, the above parameters, you should have a snapshot of the quality of your raw data. Checking the quality of your data before pre-processing is a fundamental step in ensuring the accuracy and reliability of genomic or transcriptomic analyses. By assessing data quality upfront, QC ensures that the data used for analysis meets certain standards. It also enables you to optimise the next steps of pre-processing: for example, the presence of adapter sequences might necessitate adapter trimming, and knowledge of base quality can guide decisions on quality-filtering thresholds. Finally, establishing baseline metrics of raw data quality allows for the comparison of different datasets or experimental conditions.