Understanding Quality Control in Single-Cell RNA Sequencing: Part I – Detecting Low UMI Cells

June 28, 2024

Single-cell RNA sequencing (scRNA-seq) has revolutionized our ability to explore cellular heterogeneity, providing insights into the complexity of gene expression at an unprecedented resolution. However, the accuracy of the conclusions drawn from scRNA-seq data heavily relies on rigorous quality control (QC) procedures. In this series of blog posts, we will delve into various aspects of scRNA-seq QC, starting with the detection of low Unique Molecular Identifier (UMI) cells. We’ll use examples from the singleCellTK R package to illustrate these processes.

What Are UMIs and Why Do They Matter?

UMIs are short random sequences attached to individual RNA molecules before amplification. They make it possible to count the original number of RNA molecules accurately, reducing amplification bias. In scRNA-seq, the RNA from each cell is tagged with a cell barcode, and each RNA molecule within that cell is additionally tagged with a UMI. Reads that share the same cell barcode, gene, and UMI are treated as PCR copies of a single molecule and counted once, which helps researchers separate true biological variation from technical noise.
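To make the counting idea concrete, here is a minimal base R sketch. The read table, barcodes, and gene assignments are made up for illustration, and real pipelines such as Cell Ranger also correct sequencing errors in barcodes and UMIs, which this sketch ignores.

```r
# Toy read table: one row per sequenced read (hypothetical values).
reads <- data.frame(
  cell = c("AAAC", "AAAC", "AAAC", "AAAC", "TTTG", "TTTG"),
  gene = c("CD3E", "CD3E", "CD3E", "MS4A1", "CD3E", "CD3E"),
  umi  = c("GGTA", "GGTA", "CCAT", "GGTA", "ACGT", "ACGT"),
  stringsAsFactors = FALSE
)

# Reads with the same cell barcode, gene and UMI are PCR copies of one
# molecule, so duplicated rows collapse to a single row.
molecules <- unique(reads)

# Molecules per gene and cell: a tiny UMI count matrix.
umi_counts <- table(gene = molecules$gene, cell = molecules$cell)

# Total UMIs per cell, the quantity used for QC in this post.
umi_per_cell <- colSums(umi_counts)
print(umi_per_cell)
```

Six reads collapse to four molecules here: three in cell AAAC and one in cell TTTG.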

Why Detect Low UMI Cells?

Barcodes with low UMI counts often indicate poor-quality cells. These cells might be damaged or dying, or their RNA may have been captured inefficiently during library preparation. Including such cells in downstream analyses can skew results and lead to incorrect biological interpretations.

This is what the Cell Ranger documentation says:

“1. Filtering cell barcodes by UMI counts: The total UMI counts associated with a cell barcode represent the absolute number of observed transcripts in the droplet. Barcodes associated with unusually high UMI counts might be multiplets (i.e. one droplet containing multiple cells), whereas barcodes with low UMI counts might be droplets containing ambient RNAs but not real cells. Therefore, using UMI counts to filter cell barcodes may help to eliminate barcodes that do not represent a single cell. The choice of UMI count thresholds in published literature can vary between arbitrary cutoffs or the use of data-driven threshold, e.g. three to five times of standard deviation or median absolute deviation from the median (You et al., 2021, Ocasio et al., 2019). In Cell Ranger, UMI count is capped at 500 in the second step of cell calling – barcodes with less than 500 UMI counts will not be regarded as cells. However, when the sample is highly heterogeneous, one threshold on UMI count for the whole sample may not always be suitable as it may eliminate real single cells with very high or low RNA contents, for example, neutrophils.”

Step-by-Step Guide to Detecting Low UMI Cells with singleCellTK

singleCellTK (Hong et al. 2022) is a powerful tool for scRNA-seq analysis, providing comprehensive methods for QC, data processing, and visualisation. Here’s how to use it to detect low UMI cells:

Step 1: Load the Data

First, you need to load your scRNA-seq data into R. singleCellTK supports various data formats, including SingleCellExperiment objects and Seurat objects.

Step 2: Calculate QC Metrics

Next, calculate the QC metrics using the runPerCellQC() function, which provides various metrics such as total counts (UMIs per cell), number of detected genes, and mitochondrial gene counts.
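To show what these metrics mean, the sketch below computes them by hand in base R. It is an illustration of the arithmetic, not the runPerCellQC() implementation. The count matrix is simulated, the "MT-" gene names are an assumption, and the column names of the result are chosen for this example only.

```r
set.seed(42)

# Hypothetical toy count matrix: 200 genes x 50 cells (not real data).
counts <- matrix(
  rpois(200 * 50, lambda = 2), nrow = 200,
  dimnames = list(paste0("gene", 1:200), paste0("cell", 1:50))
)

# Pretend the first 10 genes are mitochondrial (assumption for illustration).
rownames(counts)[1:10] <- paste0("MT-", 1:10)
is_mito <- grepl("^MT-", rownames(counts))

qc <- data.frame(
  sum          = colSums(counts),      # total UMIs per cell
  detected     = colSums(counts > 0),  # genes with at least one UMI
  mito_percent = 100 * colSums(counts[is_mito, , drop = FALSE]) /
    colSums(counts)                    # share of UMIs from mitochondrial genes
)
head(qc)
```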

Step 3: Visualize UMI Distribution

Visualizing the distribution of UMIs across cells helps identify a threshold for low UMI cells. A common approach is a boxplot or violin plot.
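A minimal base R sketch of such a plot follows. The UMI totals are simulated (a main population plus a small low-UMI group), so the picture only illustrates the approach and does not reproduce any real dataset. With a singleCellTK object you would plot the per-cell totals computed in Step 2 instead.

```r
set.seed(1)

# Simulated per-cell UMI totals (hypothetical): mostly intact cells plus a
# small low-UMI population.
umi_sum <- c(
  round(rlnorm(950, meanlog = log(2500), sdlog = 0.4)),
  round(rlnorm(50, meanlog = log(200), sdlog = 0.5))
)

# A log scale keeps the low-UMI tail visible next to the bulk of the cells.
boxplot(log10(umi_sum),
        ylab = "log10(UMIs per cell)",
        main = "UMI distribution (simulated)")
abline(h = log10(500), lty = 2)  # reference line at 500 UMIs

# Numeric summary behind the plot.
umi_summary <- summary(umi_sum)
print(umi_summary)
```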

The resulting figure shows the distribution of UMI counts across the cells in the dataset:

Step 4.1: Set a Threshold for Low UMI Cells

Following the cutoff used by Cell Ranger, set a fixed threshold to filter out cells with low UMI counts. For instance, you might decide to remove cells with fewer than 500 UMIs.
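As a small sketch, a fixed cutoff is just a logical comparison on the per-cell totals. The totals below are simulated for illustration, and the 500 UMI cutoff is the example value from the text.

```r
set.seed(1)

# Simulated per-cell UMI totals (hypothetical data).
umi_sum <- c(
  round(rlnorm(950, meanlog = log(2500), sdlog = 0.4)),
  round(rlnorm(50, meanlog = log(200), sdlog = 0.5))
)

# TRUE marks cells below the fixed cutoff.
min_umi <- 500
low_umi_fixed <- umi_sum < min_umi

# How many cells fall on each side of the cutoff.
table(low_umi_fixed)
```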

Step 4.2: Use scuttle::isOutlier()

For a more data-driven approach, we can use the isOutlier() function from scuttle. This defines outliers based on the Median Absolute Deviation (MAD), which is the median of the absolute differences between each cell's value and the median value. By default, scuttle::isOutlier() flags values that lie more than 3 MADs from the median.
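The MAD rule can be sketched in base R as follows. This illustrates the idea on simulated totals and is not scuttle's code. Working on the log10 scale and flagging only the lower tail are choices made for this example, since UMI totals are right-skewed and only low values are of interest here. Note that base R's mad() multiplies the raw MAD by 1.4826 by default, so that it is comparable to a standard deviation for normally distributed data.

```r
set.seed(1)

# Simulated per-cell UMI totals (hypothetical data).
umi_sum <- c(
  round(rlnorm(950, meanlog = log(2500), sdlog = 0.4)),
  round(rlnorm(50, meanlog = log(200), sdlog = 0.5))
)

log_umi <- log10(umi_sum)
med <- median(log_umi)

# Raw MAD: median of the absolute deviations from the median.
raw_mad <- median(abs(log_umi - med))

# Scaled MAD as returned by mad(): 1.4826 * raw MAD.
scaled_mad <- mad(log_umi)

# Flag cells more than 3 scaled MADs below the median.
lower_limit <- med - 3 * scaled_mad
low_umi_mad <- log_umi < lower_limit

# Cutoff expressed back on the UMI scale, and the number of flagged cells.
print(10^lower_limit)
table(low_umi_mad)
```

Unlike the fixed cutoff, this limit moves with the data: a deeply sequenced sample gets a higher cutoff than a shallow one.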

Step 5: Filter Out Low UMI Cells

Remove the low UMI cells from the dataset for subsequent analyses.

The difference is small, but we can see that more cells are identified as outliers based on their UMI count when we use the MAD method.
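The filtering step and the comparison between the two rules can be sketched in base R as below. The count matrix is simulated, so the numbers of flagged cells will not match those of a real dataset. A SingleCellExperiment object is subset by column in the same way, with cells as columns.

```r
set.seed(1)

# Simulated sequencing depth per cell (hypothetical), then a toy
# 100 genes x 300 cells count matrix drawn around those depths.
depth <- c(rlnorm(280, meanlog = log(2500), sdlog = 0.4),
           rlnorm(20, meanlog = log(200), sdlog = 0.5))
counts <- sapply(depth, function(d) rpois(100, lambda = d / 100))
colnames(counts) <- paste0("cell", seq_along(depth))

umi_sum <- colSums(counts)

# Rule 1: fixed cutoff at 500 UMIs.
low_fixed <- umi_sum < 500

# Rule 2: more than 3 MADs below the median on the log10 scale.
log_umi <- log10(umi_sum)
low_mad <- log_umi < median(log_umi) - 3 * mad(log_umi)

# Cross-tabulate the two rules to see where they agree and differ.
table(fixed = low_fixed, mad = low_mad)

# Keep only the cells that pass the chosen rule (here: the MAD rule).
filtered <- counts[, !low_mad, drop = FALSE]
dim(filtered)
```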

Putting it all together

Let’s consider a practical example using the pbmc3k dataset, which contains peripheral blood mononuclear cells (PBMCs).
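One possible way to chain the steps is sketched below. It assumes that the singleCellTK and scuttle packages are installed from Bioconductor, that importExampleData("pbmc3k") can download the data, and that runPerCellQC() stores the per-cell UMI total in a colData column named sum. Argument names and column names may differ between package versions, so check the documentation of your installed release. Using log = TRUE and type = "lower" is a choice made for this sketch, not a package default.

```r
library(singleCellTK)
library(scuttle)

# Step 1: load the pbmc3k example data as a SingleCellExperiment
# (requires an internet connection on first use).
sce <- importExampleData("pbmc3k")

# Step 2: add per-cell QC metrics to colData(sce).
sce <- runPerCellQC(sce)

# Step 4.1: fixed cutoff at 500 UMIs (assumes the column is named `sum`).
low_fixed <- sce$sum < 500

# Step 4.2: MAD-based rule on the log scale, lower tail only.
low_mad <- as.logical(isOutlier(sce$sum, nmads = 3, type = "lower", log = TRUE))

# Compare the two rules.
table(fixed = low_fixed, mad = low_mad)

# Step 5: drop cells flagged by either rule.
sce_filtered <- sce[, !(low_fixed | low_mad)]
dim(sce_filtered)
```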

Conclusion

Detecting and filtering low UMI cells is a crucial first step in scRNA-seq quality control. By using singleCellTK, researchers can efficiently identify these cells, ensuring that subsequent analyses are based on high-quality data. In the next part of this series, we will explore methods to detect empty droplets, which represent another common source of noise in scRNA-seq datasets. Stay tuned!


By adhering to these guidelines, you can significantly improve the reliability of your scRNA-seq data, leading to more accurate and insightful biological discoveries.

References

Hong R, Koga Y, Bandyadka S, Leshchyk A, Wang Y, Akavoor V, Cao X, Sarfraz I, Wang Z, Alabdullatif S, Jansen F. Comprehensive generation, visualization, and reporting of quality control metrics for single-cell RNA sequencing data. Nature communications. 2022 Mar 30;13(1):1688.

Ocasio JK, Babcock B, Malawsky D, Weir SJ, Loo L, Simon JM, Zylka MJ, Hwang D, Dismuke T, Sokolsky M, Rosen EP. scRNA-seq in medulloblastoma shows cellular heterogeneity and lineage expansion support resistance to SHH inhibitor therapy. Nature communications. 2019 Dec 20;10(1):5829.

You Y, Tian L, Su S, Dong X, Jabbari JS, Hickey PF, Ritchie ME. Benchmarking UMI-based single-cell RNA-seq preprocessing workflows. Genome biology. 2021 Dec 14;22(1):339.

© 2024 INSiGENe Ltd.