Exploratory Data Analysis in Bioinformatics: Unveiling Patterns and Insights

March 11, 2024

Exploratory Data Analysis (EDA) is a fundamental step in bioinformatics analyses, providing a basis for the initial examination of datasets. The volume and complexity of biological data have grown exponentially in recent years due to advances in high-throughput technologies such as next-generation sequencing. Massive datasets, encompassing genomics, transcriptomics, proteomics and other -omics data, require robust analytical approaches.

What is EDA? It refers to a range of methods used to get an overview of datasets and is an important step in data analysis. EDA can help identify patterns, spot anomalies, confirm hypotheses and check assumptions, and it is used at different stages of the data analysis process. For example, EDA carried out prior to normalisation of data may identify technical biases or batch effects that could affect downstream analyses.

One of the most effective methods of EDA is graphical representation, as it allows researchers to visualise complex biological data in an accessible manner. Here are some plots commonly used in EDA of omics data, with examples from analyses previously carried out by our INSiGENe data science team.

Scatter plots

A scatter plot is a two-dimensional plot used to visualise the association between two continuous variables. The X and Y axes represent the two variables, and each point corresponds to one observation with its pair of values for those variables (for example, the height and weight of a patient). Scatter plots can reveal associations between the expression of genes (i.e. co-expression), identify potential outliers and highlight patterns in data distribution.

A scatter plot showing the correlation of genome-wide gene expression for patients at their first visit (x axis) and their second visit (y axis) using ranked expression values. Each point represents a gene, and for the majority of genes there is little variation in expression between the first and second visit. Credit: Dr Anya Jones
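
To put a number on the kind of agreement a scatter plot of ranked values shows, one option is a rank (Spearman) correlation. The sketch below is illustrative only: the expression values are hypothetical, and the helper names `average_ranks`, `pearson` and `spearman` are our own rather than part of the analysis shown in the figure.

```python
from statistics import mean


def average_ranks(values):
    """Rank values from 1 to n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j across the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical expression values for five genes at two visits.
visit1 = [5.1, 8.3, 2.2, 9.7, 6.4]
visit2 = [5.4, 8.0, 2.5, 9.9, 6.1]
rho = spearman(visit1, visit2)
print(f"Spearman rho = {rho:.2f}")
```

A value close to 1 means the genes keep nearly the same rank order at both visits, which is what a tight diagonal band in the scatter plot suggests.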

Bar Charts

Bar charts represent the association between a numeric variable and a categorical variable. A bar chart consists of rectangular bars of equal width, with the length of each bar corresponding to the value of the numeric variable within a specific category.

This bar chart shows the 260/280 absorbance ratio, the primary measure of RNA purity, of RNA samples in a transcriptomics dataset. The horizontal red line indicates the minimum recommended ratio (1.8). All samples are above this line, indicating good RNA quality. Credit: Dr Denise Anderson
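
The same check can be scripted before any plotting. The following is a minimal sketch using hypothetical sample IDs and ratios (one sample is deliberately set below the threshold to show the flagging); the function names are our own, and the text bars simply stand in for a plotted chart.

```python
MIN_RATIO = 1.8  # minimum recommended 260/280 ratio, as in the caption

# Hypothetical sample IDs and absorbance ratios.
ratios = {"S1": 2.02, "S2": 1.95, "S3": 2.08, "S4": 1.76}


def low_purity_samples(ratios, threshold=MIN_RATIO):
    """Return the IDs of samples whose ratio is below the threshold."""
    return sorted(s for s, r in ratios.items() if r < threshold)


def text_bar_chart(ratios, threshold=MIN_RATIO, scale=20):
    """Draw one text bar per sample; bar length is proportional to the ratio."""
    lines = []
    for sample, ratio in ratios.items():
        bar = "#" * round(ratio * scale)
        flag = "" if ratio >= threshold else "  <- below threshold"
        lines.append(f"{sample} {bar} {ratio:.2f}{flag}")
    return "\n".join(lines)


flagged = low_purity_samples(ratios)
chart = text_bar_chart(ratios)
print(chart)
print("Samples below threshold:", flagged)
```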

Box Plots

Also known as a box-and-whisker plot, a box plot is effective in highlighting the central tendency, spread, and skewness of the data. The key components of a box plot are the median (line inside the box), the lower and upper quartiles (box edges), the whiskers extending from the box, which show the range of the data not classed as outliers, and potential outliers, usually drawn as individual points beyond the whiskers. Box plots are particularly useful for comparing distributions across different groups or conditions.

This box plot shows RNA-Seq raw counts for each sample, colour coded by batch. No outlier samples were identified, so all samples can be carried forward for downstream analyses. Credit: Dr Denise Anderson
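
The numbers behind a box plot are easy to compute directly. The sketch below assumes the common 1.5 x IQR convention for whiskers and outliers (the article does not say which rule the plot above used) and runs on hypothetical values; `box_plot_stats` is our own helper name.

```python
from statistics import quantiles


def box_plot_stats(values, whisker=1.5):
    """Median, quartiles, whisker ends and outliers for one box."""
    q1, median, q3 = quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    lower_fence = q1 - whisker * iqr
    upper_fence = q3 + whisker * iqr
    # Whiskers reach the most extreme values that are still inside the fences.
    inside = [v for v in values if lower_fence <= v <= upper_fence]
    outliers = sorted(v for v in values if v < lower_fence or v > upper_fence)
    return {
        "q1": q1,
        "median": median,
        "q3": q3,
        "whisker_low": min(inside),
        "whisker_high": max(inside),
        "outliers": outliers,
    }


# Hypothetical per-sample values (for example, log2 library sizes).
values = [20.1, 20.4, 19.8, 20.0, 20.3, 20.2, 19.9, 16.5]
stats = box_plot_stats(values)
for name, value in stats.items():
    print(name, value)
```

Here the value 16.5 falls below the lower fence and would be drawn as a separate point, which is the kind of sample worth inspecting before downstream analysis.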

Violin Plots

A violin plot combines aspects of a box plot and a density plot. Instead of a simple box, a violin plot features a mirrored density plot on each side, resembling a violin or bean shape. The width of the “violin” at different points represents the data density, providing more information about the distribution compared to a traditional box plot. They are helpful for visualising the entire distribution of the data, offering a more detailed view of its shape and variability.

Violin plot showing the number of detected genes for each cell, across the four samples sequenced in this single-cell transcriptomics study. No extreme outlying cells are evident based on the number of detected genes. Credit: Dr Denise Anderson
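
The outline of a violin is a kernel density estimate of the data. As a rough sketch of what is being drawn, the code below computes a Gaussian kernel density estimate by hand. The gene counts are hypothetical, the bandwidth uses Scott's rule of thumb as an assumed default, and `gaussian_kde` here is our own function, not a library call.

```python
import math
from statistics import stdev


def gaussian_kde(values, grid, bandwidth=None):
    """Evaluate a Gaussian kernel density estimate at each grid point."""
    n = len(values)
    if bandwidth is None:
        # Scott's rule of thumb for one-dimensional data (an assumption).
        bandwidth = stdev(values) * n ** (-1 / 5)
    norm = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))
    return [
        norm * sum(math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values)
        for x in grid
    ]


# Hypothetical numbers of detected genes per cell.
genes_per_cell = [1800, 2100, 2200, 2250, 2300, 2400, 2450, 2600, 3100]

step = 50
grid = list(range(0, 5001, step))
density = gaussian_kde(genes_per_cell, grid)

area = sum(density) * step                # should be close to 1
peak = grid[density.index(max(density))]  # where the violin is widest
print(f"area = {area:.3f}, widest point near {peak} genes")
```

Mirroring `density` on either side of a vertical axis gives the violin shape; wide sections mark values shared by many cells.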

Heatmaps

Heatmaps provide a two-dimensional visual representation of data where a numeric value is represented on a colour scale. They are often used to display gene expression patterns across samples. Hierarchical clustering can be performed and illustrated on heatmaps to identify clusters of genes or samples with similar expression profiles.
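
Two preparation steps often sit behind an expression heatmap: scaling each gene so that its colours are comparable across samples, and computing a distance between samples for hierarchical clustering. The sketch below shows both with NumPy on a small hypothetical matrix. The choice of row z-scores and correlation distance is an assumption for illustration, not necessarily what was used in our analyses.

```python
import numpy as np

# Hypothetical expression matrix: rows are genes, columns are samples.
expr = np.array([
    [2.0, 2.2, 8.0, 8.4],
    [5.0, 5.1, 1.0, 1.2],
    [3.0, 3.3, 6.0, 6.5],
])
sample_names = ["A1", "A2", "B1", "B2"]

# Row-wise z-scores: each gene is centred and scaled across samples,
# so the colour scale shows relative rather than absolute expression.
row_mean = expr.mean(axis=1, keepdims=True)
row_sd = expr.std(axis=1, ddof=1, keepdims=True)
z = (expr - row_mean) / row_sd

# Correlation distance between samples (columns of z).
dist = 1.0 - np.corrcoef(z.T)

# Nearest neighbour of each sample, ignoring the sample itself.
masked = dist.copy()
np.fill_diagonal(masked, np.inf)
nearest = [sample_names[i] for i in masked.argmin(axis=1)]
print(dict(zip(sample_names, nearest)))
```

A hierarchical clustering routine would take `dist` as its input and merge the closest samples first, which is what the dendrogram beside a heatmap displays.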

An example of a volcano plot showing differential abundance results from a proteomics dataset. Each point represents a protein. The dashed horizontal line marks the adjusted p-value threshold chosen to denote statistical significance. The vertical dotted lines represent a two-fold change in abundance (right = up-regulated; left = down-regulated). Proteins highlighted in red are significant and have an absolute fold change of at least two. Proteins highlighted in blue are significant but did not meet the two-fold change threshold. Proteins highlighted in green have at least a two-fold change in abundance but are not significant, and proteins highlighted in black are neither significant nor met the two-fold change threshold. Credit: Dr Laura Harris
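
The colour coding described in this caption can be written as a simple rule. In the sketch below, a two-fold change corresponds to an absolute log2 fold change of at least 1. The adjusted p-value cut-off of 0.05 is an assumption (the caption does not state the value used), and the protein names and numbers are hypothetical.

```python
import math


def classify(log2_fc, adj_p, p_cutoff=0.05, fc_cutoff=1.0):
    """Return the volcano plot colour for one protein."""
    significant = adj_p < p_cutoff
    large_change = abs(log2_fc) >= fc_cutoff  # at least two-fold up or down
    if significant and large_change:
        return "red"
    if significant:
        return "blue"
    if large_change:
        return "green"
    return "black"


# Hypothetical results: protein -> (log2 fold change, adjusted p-value).
results = {
    "P1": (2.3, 0.001),
    "P2": (0.4, 0.01),
    "P3": (-1.6, 0.20),
    "P4": (0.1, 0.70),
    "P5": (-1.2, 0.004),
}

# Volcano plot coordinates: x = log2 fold change, y = -log10(adjusted p).
points = {
    name: (fc, -math.log10(p), classify(fc, p))
    for name, (fc, p) in results.items()
}
for name, (x, y, colour) in points.items():
    print(f"{name}: x={x:+.1f}, y={y:.2f}, colour={colour}")
```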

Principal Component Analysis

PCA is a dimensionality reduction method used to visualise the overall structure and patterns in high-dimensional datasets. It typically visualises the data in terms of its principal components, which are linear combinations of the original variables. The first principal component captures the maximum variance in the data, followed by subsequent components in decreasing order of variance. PCA can be used to check that your samples are clustering together by experimental group and whether your data is affected by confounders and/or other technical biases (e.g. batch effects) that may need to be accounted for in downstream analyses.

This example PCA plot visualises clustering of samples based on genome-wide gene expression data. The left plot shows clustering along the first principal component is due to different scan dates and the right plot shows that clustering along the second principal component is by the experimental variable of interest. Given that scan date is a confounding variable that is not directly of interest, it will need to be adjusted for in downstream analyses to ensure results are reliable and robust. Credit: Dr Anya Jones
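
To make the idea concrete, here is a minimal sketch of PCA computed with NumPy's singular value decomposition. The data are simulated: six hypothetical samples in two batches, with an artificial shift added to one batch, so the batch effect should appear on the first principal component. The function name `pca` and all parameter values are our own choices for illustration.

```python
import numpy as np


def pca(data, n_components=2):
    """PCA of a samples x features matrix via SVD.

    Returns the sample scores and the fraction of variance
    explained by each of the requested components.
    """
    centred = data - data.mean(axis=0)
    u, s, vt = np.linalg.svd(centred, full_matrices=False)
    scores = u[:, :n_components] * s[:n_components]
    explained = s ** 2 / np.sum(s ** 2)
    return scores, explained[:n_components]


rng = np.random.default_rng(0)

# Simulated data: 6 samples x 50 genes, with batch 1 shifted upwards.
batch = np.array([0, 0, 0, 1, 1, 1])
data = rng.normal(size=(6, 50)) + batch[:, None] * 3.0

scores, explained = pca(data)
for b, (pc1, pc2) in zip(batch, scores):
    print(f"batch {b}: PC1={pc1:+.2f}, PC2={pc2:+.2f}")
print("variance explained:", np.round(explained, 2))
```

Plotting the two columns of `scores` against each other and colouring the points by batch, or by experimental group, gives the kind of PCA plot shown above.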

By using a combination of these visualisation tools, you will be able to uncover patterns and clusters within your data before delving into more complex analyses. If you are interested in learning more about other methodologies to visualise your data, chat to our INSiGENe data science team today!

Contact us at info@insigene.com
