SAVER (Single-cell Analyses Via Expression Recovery) is a method for denoising single-cell RNA sequencing data by borrowing information across genes and cells. Here, we demonstrate how to use the method. For more information, take a look at the Github page and the paper.
You can install the most recent updates of SAVER from github with:
To install the stable release of SAVER:
Once we have the package installed, we can load the package.
In this tutorial, we will run SAVER on the mouse cortex data from Zeisel (2015).
data.path <- "../data/expression_mRNA_17-Aug-2014.txt"
# Need to remove first 10 rows of metadata
raw.data <- read.table(data.path, header = FALSE, skip = 11, row.names = 1,
check.names = FALSE)
cortex <- as.matrix(raw.data[, -1])
cellnames <- read.table(data.path, skip = 7, nrows = 1, row.names = 1,
stringsAsFactors = FALSE)
colnames(cortex) <- cellnames[-1]
dim(cortex)
The input to SAVER is a matrix of gene expression counts with genes along the rows and cells along the columns. SAVER was developed for use on UMI count data but it seems to work well with non-UMI TPM data. Standard quality control such as removing low quality cells and filtering out genes with low expression should be performed prior to running SAVER.
Here, the Zeisel dataset has already been preprocessed.
The main function to run SAVER is the saver
function.
Since SAVER is computationally intensive for large datasets, we
recommend running it in parallel on a cluster either by creating your
own parallel environment or by specifying ncores
. We will
run SAVER on the Zeisel dataset across 12 cores.
cortex.saver
is a saver object with the following
components:
estimate
gives the library size normalized SAVER
estimates.se
gives the standard error of the estimatesinfo
gives the more information about the run.If you are only interested in the estimate, you can run
saver
setting estimates.only = TRUE
. For
example,
By default, SAVER takes in an unnormalized count matrix and performs
library size normalization during the denoising step. However, if your
data is already normalized or normalization is not desired, you can set
size.factor = 1
. In addition, if you want to use custom
cell size factors, you can set size.factor
to be a vector
of size factors.
SAVER has two main steps: the first is the prediction step and the second is a shrinkage step. The prediction step is the time-consuming part of SAVER. If you are only interested in a handful of genes, you can specify those genes and generate predictions for those genes only and choose to return the entire data matrix or for those genes only. For example,
# Identify the indices of the genes of interest
genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19")
genes.ind <- which(rownames(cortex) %in% genes)
# Generate predictions for those genes and return entire dataset
cortex.saver.genes <- saver(cortex, pred.genes = genes.ind,
estimates.only = TRUE)
# Generate predictions for those genes and return only those genes
cortex.saver.genes.only <- saver(cortex, pred.genes = genes.ind,
pred.genes.only = TRUE, estimates.only = TRUE)
The main contribution of SAVER is the shrinkage step after generating
the predictions. The shrinkage step allows the algorithm to adaptively
evaluate the quality of the predictions and weigh the estimates either
towards the predictions or the observed value. SAVER performs a Lasso
Poisson regression for each gene using other genes as covariates to
generate the predictions. However, if you want to use other predictions
such as MAGIC or
DCA, you can input the
results in matrix format into the mu
parameter of the
saver
function.
By default, if you specify ncores
without already having
a parallel backend set up, saver
will make a default
cluster using the makeCluster
function in the parallel
package and register a parallel backend via the
registerDoParallel
function in the doParallel package.
However, if you want to set up a more complicated backend, such as using
MPI across nodes, you can do so and run saver
without
specifying the ncores
option.
Running SAVER on large datasets may run into memory issues,
especially when parallelized across cores where each core has limited
memory. One solution would be to register a parallel backend using SOCK
or MPI to parallelize across nodes. Another solution would be to split
the genes into multiple runs using pred.genes
and
pred.genes.only
. For example, say we have a dataset
x
with 10,000 genes. We can split this up into 4 runs on
different machines and combine using the combine.saver
function. We recommend setting do.fast = FALSE
when
splitting the dataset.
saver1 <- saver(x, pred.genes = 1:2500, pred.genes.only = TRUE,
do.fast = FALSE)
saver2 <- saver(x, pred.genes = 2501:5000, pred.genes.only = TRUE,
do.fast = FALSE)
saver3 <- saver(x, pred.genes = 5001:7500, pred.genes.only = TRUE,
do.fast = FALSE)
saver4 <- saver(x, pred.genes = 7501:10000, pred.genes.only = TRUE,
do.fast = FALSE)
saver.all <- combine.saver(list(saver1, saver2, saver3, saver4))
Oftentimes for downstream analysis, you may want to sample from the
posterior distributions to account for estimation uncertainty. This is
similar to performing multiple imputation to obtain multiple imputed
datasets. To sample from the posterior distributions, we use the
function sample.saver
, which takes in the
saver
result and outputs sampled datasets. For example,
Because the SAVER estimates contain uncertainty, correlations between
genes and cells cannot be directly calculated using the SAVER estimates.
To adjust for the uncertainty, we provide the functions
cor.genes
and cor.cells
to construct
gene-to-gene and cell-to-cell correlation matrices respectively for the
SAVER output. These functions take in the saver
result and
outputs the gene-to-gene or cell-to-cell correlation matrix. For
example,