Metadata
The metadata is collected from the samplesheet.
| | Sample ID | Group ID |
|---|---|---|
| 1 | SRR25825482 | 3HB |
| 3 | SRR25825483 | 3HB |
| 5 | SRR25825484 | 3HB |
| 7 | SRR25825485 | 3HB |
| 9 | SRR25825486 | Control |
| 11 | SRR25825487 | Control |
| 13 | SRR25825488 | Control |
| 15 | SRR25825489 | Control |
Sample-level Quality control
A useful initial step in RNA-seq analysis is often to assess the overall similarity between samples:
- Which samples are similar to each other, which are different?
- Does this fit the expectation from the experiment’s design?
- What are the major sources of variation in the dataset?
To address these questions we provide a sample correlation plot, an MDS plot and a mean-variance trend plot.
Sample correlation
Sample correlation allows us to see how well our replicates cluster together and observe whether our experimental condition represents the major source of variation in the data. In our case, we measure correlation with the Pearson correlation coefficient. Sample outliers can also be detected using a sample correlation plot.
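As a rough illustration of the Pearson correlation described above, the following minimal sketch computes pairwise correlations between samples. The sample names and expression values are hypothetical and exist only for this example; the pipeline's own implementation is not shown in this report and may differ.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(ss_x * ss_y)

def correlation_matrix(samples):
    """Pairwise Pearson correlations for a dict of sample name -> values."""
    names = sorted(samples)
    return {a: {b: pearson(samples[a], samples[b]) for b in names} for a in names}

# Hypothetical log-scale expression values for four genes (illustration only).
log_expr = {
    "treated_1": [2.0, 5.1, 7.9, 3.2],
    "treated_2": [2.2, 5.0, 8.1, 3.0],
    "control_1": [4.9, 2.1, 8.0, 6.1],
}
corr = correlation_matrix(log_expr)
```

In this toy input the two treated replicates correlate more strongly with each other than with the control, which is the pattern one hopes to see between replicates of the same group.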
MDS Plot
A multidimensional scaling (MDS) plot is generated to inspect how samples cluster based on their relative normalization factors. Samples clustered together in the MDS plot are more alike than those further apart. Ideally, samples should cluster by group ID.
Mean-variance trend
Dispersion is a measure of spread or variability in the data. The DESeq2 dispersion estimates are inversely related to the mean and directly related to variance.
Based on this relationship, the dispersion is higher for small mean counts and lower for large mean counts. The dispersion estimates for genes with the same mean will differ only based on their variance. Therefore, the dispersion estimates reflect the variance in gene expression for a given mean value.
The plot of mean versus variance in count data below shows that the variance (y-axis) in gene expression increases with the mean expression (x-axis). Notice that the relationship between mean and variance is linear on the log scale, and that for higher means the variance can be predicted relatively accurately from the mean (less dispersion). For low mean counts, however, the variance estimates have a much larger spread, so the dispersion estimates differ much more between genes with small means.
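The relationship between mean, variance and dispersion can be made concrete with a small sketch. It assumes the Negative Binomial parameterization Var = mean + dispersion × mean², and uses a simple method-of-moments estimate. This is not the estimator DESeq2 applies (DESeq2 fits dispersions by likelihood and shrinks them toward the fitted trend); the function name and counts are hypothetical.

```python
from statistics import mean, variance

def moment_dispersion(counts):
    """Method-of-moments dispersion, assuming Var = mu + alpha * mu**2.

    Illustration only: DESeq2 itself uses a likelihood-based, shrunken estimate.
    """
    mu = mean(counts)
    if mu == 0:
        return 0.0
    var = variance(counts)  # sample variance (n - 1 denominator)
    # Clamp at zero: a variance below the mean implies no extra-Poisson spread.
    return max((var - mu) / mu ** 2, 0.0)

# Two hypothetical genes with the same mean (25) but different variance.
tight_gene = [24, 25, 26, 25]
noisy_gene = [10, 20, 30, 40]

alpha_tight = moment_dispersion(tight_gene)
alpha_noisy = moment_dispersion(noisy_gene)
```

Because both genes have the same mean, the difference between `alpha_tight` and `alpha_noisy` comes from their variance alone, matching the statement that dispersion estimates for genes with equal means differ only through their variance.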
Differential Gene Expression (DGE) analysis
Differential expression analysis is performed using DESeq2 and is based on the Negative Binomial (a.k.a. Gamma-Poisson) distribution. Differential expression testing aims to determine which genes are expressed at different levels between conditions. These genes can offer biological insight into the processes affected by the condition(s) of interest.
Gene-level Quality control
The pipeline omits genes that have little or no chance of being detected as differentially expressed. This will increase the power to detect differentially expressed genes.
The genes omitted fall into three categories:
- Genes with zero counts in all samples
- Genes with an extreme count outlier
- Genes with low mean normalized counts
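Two of the categories above can be sketched as a simple filter. This is a minimal illustration under stated assumptions: the `min_mean` cutoff is a hypothetical fixed value (DESeq2 chooses its filtering threshold automatically), the gene names and counts are invented, and the extreme-outlier check is left out because it depends on model fitting that is beyond a short example.

```python
def filter_genes(norm_counts, min_mean=1.0):
    """Split genes into kept and dropped, recording the reason for each drop.

    norm_counts: dict of gene name -> list of normalized counts per sample.
    min_mean: hypothetical cutoff on the mean normalized count.
    """
    kept, dropped = {}, {}
    for gene, counts in norm_counts.items():
        if all(c == 0 for c in counts):
            dropped[gene] = "zero counts in all samples"
        elif sum(counts) / len(counts) < min_mean:
            dropped[gene] = "low mean normalized count"
        else:
            kept[gene] = counts
    return kept, dropped

# Hypothetical normalized counts for three genes across four samples.
norm_counts = {
    "gene_zero": [0, 0, 0, 0],
    "gene_low": [0, 1, 0, 2],
    "gene_ok": [120, 98, 143, 110],
}
kept, dropped = filter_genes(norm_counts)
```

Dropping such genes before testing reduces the number of hypotheses tested, which is why filtering can increase the power to detect differentially expressed genes.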
The next step in the pipeline is to fit a curve (red line) to the per-gene dispersion estimates. The idea behind fitting this curve is that different genes will have different scales of biological variability but, overall, there will be a distribution of reasonable dispersion estimates. Therefore, the per-gene dispersion estimates are plotted together with the fitted mean-dispersion relationship.
PCA
Principal Component Analysis (PCA) is a dimensionality reduction technique that finds the directions of greatest variation in a dataset and assigns them to principal components (PCs). The PC explaining the greatest amount of variation in the dataset is PC1, the PC explaining the second greatest amount is PC2, and so on. In an ideal experiment, we would expect all the replicates for each sample group to cluster together and the sample groups to cluster apart. DESeq2 uses a regularized log transform (rlog) of the normalized counts for sample-level QC as it moderates the variance across the mean, improving the clustering.
Another technique is variance stabilizing transformation (vst). Both techniques aim to remove the dependence of the variance on the mean. In particular, genes with low expression levels and therefore low read counts tend to have high variance, which is not removed efficiently by the ordinary logarithmic transformation.
Technique chosen here to transform the normalized counts: rlog
Hierarchical clustering/Heatmap
Like PCA, hierarchical clustering is a complementary method for identifying strong patterns in a dataset and potential outliers. The hierarchical clustering uses the same transformation (rlog) as the PCA plot and shows the correlation between samples. Since the majority of genes are not differentially expressed, samples generally have high correlations with each other (>0.80). Samples with correlations below 0.80 may indicate an outlier in your data and/or sample contamination.
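The 0.80 guideline can be applied programmatically to a sample correlation matrix. The sketch below flags samples whose mean correlation with all other samples falls below the threshold. Averaging over the other samples is an assumption made for this example, not a rule stated by the pipeline, and the matrix values are hypothetical.

```python
def flag_low_correlation(corr, threshold=0.80):
    """Return samples whose mean correlation with the other samples is below threshold.

    corr: nested dict, corr[a][b] = correlation between samples a and b.
    """
    flagged = []
    for sample, row in corr.items():
        others = [r for other, r in row.items() if other != sample]
        if others and sum(others) / len(others) < threshold:
            flagged.append(sample)
    return sorted(flagged)

# Hypothetical correlation matrix in which S4 behaves differently from the rest.
corr = {
    "S1": {"S1": 1.00, "S2": 0.97, "S3": 0.95, "S4": 0.62},
    "S2": {"S1": 0.97, "S2": 1.00, "S3": 0.96, "S4": 0.60},
    "S3": {"S1": 0.95, "S2": 0.96, "S3": 1.00, "S4": 0.65},
    "S4": {"S1": 0.62, "S2": 0.60, "S3": 0.65, "S4": 1.00},
}
suspects = flag_low_correlation(corr)
```

A flagged sample is only a candidate for closer inspection; whether it is a true outlier or a contaminated sample still has to be judged against the experimental design.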
Pairwise comparisons
The order of the group names specified in the samplesheet determines the order in which the plots are displayed. In each comparison, the second element is the level used as the baseline. For example, in treatment (first element) vs control (second element), a log2 fold change of -2 means that gene expression is lower in the treatment than in the control, by a factor of four (2^2).
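The sign convention can be checked with a short calculation. This sketch takes the plain ratio of two mean expression values; DESeq2 estimates fold changes from a fitted model rather than from a raw ratio, so the function name and numbers here are purely illustrative.

```python
import math

def log2_fold_change(first_mean, baseline_mean):
    """log2 ratio of the first element over the baseline (second element)."""
    return math.log2(first_mean / baseline_mean)

# Hypothetical mean normalized counts for one gene.
treatment_mean = 25.0   # first element
control_mean = 100.0    # second element, used as the baseline

lfc = log2_fold_change(treatment_mean, control_mean)
fold = 2 ** abs(lfc)    # magnitude of the change on the linear scale
```

With these values `lfc` is negative because expression is lower in the treatment than in the baseline. Swapping the two arguments flips the sign but leaves the magnitude unchanged.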
(Pairwise) UP, DOWN and TOTAL-regulated significant genes
Below is a summary of up, down and total significant genes per pairwise comparison.
| Comparison | Down (Low) | Up (High) | Total |
|---|---|---|---|
| 3HB-Control | 3058 | 2914 | 5972 |
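A summary of this kind can be derived from a results table by counting significant genes by the sign of their log2 fold change. The sketch below assumes a significance cutoff of adjusted p-value < 0.05, which is not stated in this report, and uses invented gene records. The field names `log2FoldChange` and `padj` follow the column names of DESeq2 results.

```python
def summarise_regulation(results, alpha=0.05):
    """Count significantly down- and up-regulated genes.

    results: list of dicts with 'log2FoldChange' and 'padj' (None if not tested).
    alpha: assumed adjusted p-value cutoff.
    """
    significant = [r for r in results if r["padj"] is not None and r["padj"] < alpha]
    down = sum(1 for r in significant if r["log2FoldChange"] < 0)
    up = sum(1 for r in significant if r["log2FoldChange"] > 0)
    return {"down": down, "up": up, "total": down + up}

# Hypothetical results for five genes (illustration only).
results = [
    {"gene": "geneA", "log2FoldChange": -2.1, "padj": 0.001},
    {"gene": "geneB", "log2FoldChange": 1.4, "padj": 0.02},
    {"gene": "geneC", "log2FoldChange": 0.3, "padj": 0.6},
    {"gene": "geneD", "log2FoldChange": -1.0, "padj": None},
    {"gene": "geneE", "log2FoldChange": 3.0, "padj": 0.04},
]
summary = summarise_regulation(results)
```

As in the table above, the total is simply the sum of the two directional counts.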