Metadata
Below is an overview of the metadata of the count tables and samples. The high-quality count tables are produced and filtered by STARsolo. The Filename column gives the prefix of the uploaded sample files; it is included because it may differ from the sample name.
| Sample | Group ID | Filename |
|---|---|---|
| SRR32158738 | GDM | SRR32158738 |
| SRR32158739 | GDM | SRR32158739 |
| SRR32158740 | GDM | SRR32158740 |
| SRR32158741 | GDM | SRR32158741 |
| SRR32158742 | Control | SRR32158742 |
| SRR32158743 | Control | SRR32158743 |
| SRR32158744 | Control | SRR32158744 |
| SRR32158745 | Control | SRR32158745 |
Quality Control
Low-quality libraries from different cells can cluster together due to similarities in their damage-induced expression profiles. These low-quality libraries are not removed by the STARsolo filtering, so quality control is performed using three main metrics:
- Number of UMIs: check for cells with low total counts.
- Number of genes: check for cells with few detected genes.
- Percentage of mitochondrial counts: cells with a high percentage can form their own distinct clusters.
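The three metrics above can be sketched on a toy count matrix as follows. This is a minimal illustration in plain Python, not the pipeline's actual code: the function name `qc_metrics`, the toy counts, and the `MT-` prefix used to recognise mitochondrial genes (a convention for human gene symbols) are all assumptions made for the example.

```python
import math

def qc_metrics(counts, gene_names, mito_prefix="MT-"):
    """Compute per-cell QC metrics from a genes x cells count matrix.

    counts      : list of rows (one per gene), each a list of UMI counts per cell
    gene_names  : gene symbols in the same order as the rows
    mito_prefix : assumed prefix of mitochondrial gene symbols (hypothetical default)
    """
    n_cells = len(counts[0])
    metrics = []
    for j in range(n_cells):
        column = [row[j] for row in counts]
        n_umi = sum(column)                       # total UMI count of the cell
        n_gene = sum(1 for c in column if c > 0)  # genes with at least one UMI
        mito = sum(c for c, g in zip(column, gene_names)
                   if g.startswith(mito_prefix))
        metrics.append({
            "nUMI": n_umi,
            "nGene": n_gene,
            "percent_mito": 100.0 * mito / n_umi if n_umi > 0 else 0.0,
            "log10nUMI": math.log10(n_umi) if n_umi > 0 else float("-inf"),
            "log10nGene": math.log10(n_gene) if n_gene > 0 else float("-inf"),
        })
    return metrics

# Toy data: 4 genes x 3 cells (made-up numbers for illustration only)
genes = ["ACTB", "GAPDH", "MT-CO1", "MT-ND1"]
counts = [
    [50, 0, 5],
    [30, 2, 0],
    [15, 6, 0],
    [5, 2, 5],
]
metrics = qc_metrics(counts, genes)
for i, m in enumerate(metrics):
    print(i, m["nUMI"], m["nGene"], round(m["percent_mito"], 1))
```

In this toy example the second and third cells combine low total counts with a high mitochondrial percentage, which is the pattern the QC step is meant to catch.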
QC plots with cell density are created instead of the usual violin plots from the Seurat package, as these are more intuitive to understand. Cells that are outliers for the various QC metrics are identified using the median absolute deviation (MAD) from the median value of each metric across all cells. Specifically, a value is considered an outlier if it is more than 3 MADs from the median in the “problematic” direction. This is loosely motivated by the fact that such a filter will retain 99% of non-outlier values that follow a normal distribution.
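The MAD rule described above can be sketched as follows. This is an illustrative, standard-library-only version and not the code used to produce this report; the function name `mad_outliers` and the example values are hypothetical. It assumes the MAD is scaled by the constant 1.4826 so that it is comparable to a standard deviation for normally distributed data (the default in R's `mad()`), which is what makes the "3 MADs" rule behave like a 3-standard-deviation cut-off.

```python
import statistics

def mad_outliers(values, nmads=3.0, direction="lower"):
    """Flag values more than `nmads` MADs from the median.

    direction: "lower"  flags only unusually small values (e.g. log10nUMI),
               "higher" flags only unusually large values (e.g. percent mito),
               "both"   flags either side.
    Returns (flags, lower_threshold, upper_threshold).
    """
    med = statistics.median(values)
    # Assumed scaling constant 1.4826, as in R's mad() default
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])
    lower = med - nmads * mad if direction in ("lower", "both") else float("-inf")
    upper = med + nmads * mad if direction in ("higher", "both") else float("inf")
    flags = [v < lower or v > upper for v in values]
    return flags, lower, upper

# Hypothetical log10nUMI values; the last cell has far fewer counts
log10_numi = [3.5, 3.6, 3.55, 3.45, 3.5, 3.65, 3.4, 2.0]
flags, lower, upper = mad_outliers(log10_numi, nmads=3.0, direction="lower")
print("lower threshold:", round(lower, 3))
print("flagged cells:", [i for i, f in enumerate(flags) if f])
```

For count-based metrics the "problematic" direction is the lower tail, whereas for the mitochondrial percentage it is the upper tail, so `direction="higher"` would be the natural choice there.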
The count tables of each sample are converted to a Seurat object [1]. Filtering is based on the MAD (median absolute deviation), with a default threshold of 3 MADs from the median.
The QC metrics (log10nUMI, log10nGene, percentage mito) have been calculated and added to the metadata, and plots for up to 5 samples are shown here. The remaining plots can be found in the QC folder within the results folder.
The UMI counts per cell should generally be above 500, which is the low end of what we expect. Cells with between 500 and 1000 UMI counts are usable, but they probably should have been sequenced more deeply.
Note: MAD-based filtering assumes that every batch is of high quality. Combining samples from multiple batches can distort the MAD: if sequencing coverage is lower in one batch, it will drag down the median and inflate the MAD, which reduces the suitability of the adaptive threshold for the other batches.
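The effect described in this note can be illustrated with a small, made-up example. The sketch below compares one threshold computed on pooled cells with thresholds computed per batch; the helper `low_outliers`, the batch names, and all values are hypothetical, and the MAD is again assumed to be scaled by 1.4826.

```python
import statistics

def low_outliers(values, nmads=3.0):
    """Return (flags, threshold) for values more than nmads MADs below the median."""
    med = statistics.median(values)
    mad = 1.4826 * statistics.median([abs(v - med) for v in values])  # assumed scaling
    threshold = med - nmads * mad
    return [v < threshold for v in values], threshold

# Hypothetical log10nUMI values: batch A is deeply sequenced and contains one
# poor cell (3.5); batch B has lower coverage overall but no poor cells.
batch_a = [4.0, 4.05, 3.95, 4.1, 3.9, 3.5]
batch_b = [3.0, 3.05, 2.95, 3.1, 2.9, 3.0]

# Pooled: the low-coverage batch pulls the median down and widens the MAD
pooled_flags, pooled_thr = low_outliers(batch_a + batch_b)

# Per batch: each batch gets its own median and MAD
a_flags, a_thr = low_outliers(batch_a)
b_flags, b_thr = low_outliers(batch_b)

print("pooled threshold:", round(pooled_thr, 2), "flagged:", sum(pooled_flags))
print("batch A threshold:", round(a_thr, 2), "flagged:", sum(a_flags))
print("batch B threshold:", round(b_thr, 2), "flagged:", sum(b_flags))
```

With these toy numbers the pooled threshold drops to roughly 1.6 on the log10 scale and flags nothing, while the batch A threshold of roughly 3.6 flags the poor cell. Computing thresholds per batch is one possible way to address this; whether it is appropriate depends on each batch being mostly high quality.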