Title: | Single-Cell RNA-Seq Gene Expression Recovery |
---|---|
Description: | An implementation of a regularized regression prediction and empirical Bayes method to recover the true gene expression profile in noisy and sparse single-cell RNA-seq data. See Huang M, et al (2018) <doi:10.1038/s41592-018-0033-z> for more details. |
Authors: | Mo Huang [aut, cre], Nancy Zhang [aut], Mingyao Li [aut] |
Maintainer: | Mo Huang <[email protected]> |
License: | MIT + file LICENSE |
Version: | 1.1.3 |
Built: | 2024-12-28 03:08:37 UTC |
Source: | https://github.com/mohuangx/saver |
The SAVER package implements SAVER, a gene expression recovery method for single-cell RNA sequencing (scRNA-seq). Borrowing information across all genes and cells, SAVER provides estimates for true expression levels as well as posterior distributions to account for estimation uncertainty. See Huang et al (2018) doi:10.1038/s41592-018-0033-z for more details.
Mo Huang (Maintainer), [email protected]
Nancy Zhang, [email protected]
Mingyao Li, [email protected]
https://github.com/mohuangx/SAVER
Finds the prior parameter that maximizes the marginal likelihood given the prediction.
calc.a(y, mu, sf) calc.b(y, mu, sf) calc.k(y, mu, sf)
calc.a(y, mu, sf) calc.b(y, mu, sf) calc.k(y, mu, sf)
y |
A vector of observed gene counts. |
mu |
A vector of predictions from |
sf |
Vector of normalized size factors. |
calc.a
returns a prior alpha parameter assuming constant
coefficient of variation. calc.b
returns a prior beta parameter
assuming constant Fano factor. calc.k
returns a prior variance
parameter assuming constant variance.
A vector with the optimized parameter and the negative log-likelihood.
Calculates SAVER estimate
calc.estimate( x, x.est, cutoff = 0, coefs = NULL, sf, scale.sf, pred.gene.names, pred.cells, null.model, nworkers, calc.maxcor, estimates.only ) calc.estimate.mean(x, sf, scale.sf, mu, nworkers, estimates.only) calc.estimate.null(x, sf, scale.sf, nworkers, estimates.only)
calc.estimate( x, x.est, cutoff = 0, coefs = NULL, sf, scale.sf, pred.gene.names, pred.cells, null.model, nworkers, calc.maxcor, estimates.only ) calc.estimate.mean(x, sf, scale.sf, mu, nworkers, estimates.only) calc.estimate.null(x, sf, scale.sf, nworkers, estimates.only)
x |
An expression count matrix. The rows correspond to genes and the columns correspond to cells. |
x.est |
The log-normalized predictor matrix. The rows correspond to cells and the columns correspond to genes. |
cutoff |
Maximum absolute correlation to determine whether a gene should be predicted. |
coefs |
Coefficients of a linear fit of log-squared ratio of largest lambda to lambda of lowest cross-validation error. Used to estimate model with lowest cross-validation error. |
sf |
Normalized size factor. |
scale.sf |
Scale of size factor. |
pred.gene.names |
Names of genes to perform regression prediction. |
pred.cells |
Index of cells to perform regression prediction. |
null.model |
Whether to use mean gene expression as prediction. |
nworkers |
Number of cores registered to parallel backend. |
calc.maxcor |
Whether to calculate maximum absolute correlation. |
estimates.only |
Only return SAVER estimates. Default is FALSE. |
mu |
Matrix of prior means |
The SAVER method starts by estimating the prior mean and variance for the
true expression level for each gene and cell. The prior mean is obtained
through predictions from a LASSO Poisson regression for each gene
implemented using the glmnet
package. Then, the variance is estimated
through maximum likelihood assuming constant variance, Fano factor, or
coefficient of variation variance structure for each gene. The posterior
distribution is calculated and the posterior mean is reported as the SAVER
estimate.
A list with the following components
est |
Recovered (normalized) expression |
se |
Standard error of estimates |
maxcor |
Maximum absolute correlation for each gene. 2 if not calculated |
lambda.max |
Smallest value of lambda which gives the null model. |
lambda.min |
Value of lambda from which the prediction model is used |
sd.cv |
Difference in the number of standard deviations in deviance between the model with lowest cross-validation error and the null model |
ct |
Time taken to generate predictions. |
vt |
Time taken to estimate variance. |
Calculates the marginal likelihood given the prediction under constant coefficient of variation (a), Fano factor (b), and variance (k).
calc.loglik.a(a, y, mu, sf) calc.loglik.b(b, y, mu, sf) calc.loglik.k(k, y, mu, sf)
calc.loglik.a(a, y, mu, sf) calc.loglik.b(b, y, mu, sf) calc.loglik.k(k, y, mu, sf)
a , b , k
|
Prior parameter. |
y |
A vector of observed gene counts. |
mu |
A vector of predictions from |
sf |
Vector of normalized size factors. |
calc.loglik.a
returns the shifted negative log-likelihood under
constant coefficient of variation.
calc.loglik.b
returns the shifted negative log-likelihood under
constant Fano factor.
calc.loglik.k
returns the shifted negative log-likelihood under
constant variance.
A shifted negative marginal log-likelihood.
Calculates the maximum absolute correlation between two matrices along the columns
calc.maxcor(x1, x2)
calc.maxcor(x1, x2)
x1 |
Named matrix 1 |
x2 |
Named matrix 2 |
This function calculates the maximum absolute correlation for each column of
x2
against each column of x1
. The matrices are named and if
the names overlap between x1
and x2
, the correlation between
the same named entries is set to zero.
A vector of maximum absolute correlations
x1 <- matrix(rnorm(500), 100, 5) x2 <- x1 + matrix(rnorm(500), 100, 5) colnames(x1) <- c("A", "B", "C", "D", "E") colnames(x2) <- c("A", "B", "C", "D", "E") cor(x1, x2) calc.maxcor(x1, x2)
x1 <- matrix(rnorm(500), 100, 5) x2 <- x1 + matrix(rnorm(500), 100, 5) colnames(x1) <- c("A", "B", "C", "D", "E") colnames(x2) <- c("A", "B", "C", "D", "E") cor(x1, x2) calc.maxcor(x1, x2)
Given prediction and prior variance, calculates the Gamma posterior distribution parameters for a single gene.
calc.post(y, mu, sf, scale.sf)
calc.post(y, mu, sf, scale.sf)
y |
A vector of observed gene counts. |
mu |
A vector of prior means. |
sf |
Vector of normalized size factors. |
scale.sf |
Mean of the original size factors. |
Let be the shape parameter and
be the rate
parameter of the prior Gamma distribution. Then, the posterior Gamma
distribution will be
where y is the observed gene count and sf is the size factor.
A list with the following components
estimate |
Recovered (normalized) expression |
se |
Standard error of expression estimate |
Combines SAVER objects
combine.saver(saver.list)
combine.saver(saver.list)
saver.list |
List of SAVER objects |
If SAVER was applied to a dataset for chunks of genes (by specifying
pred.genes
and pred.genes.only = TRUE
), this function combines
the individual SAVER objects into one SAVER object.
A combined SAVER object
data("linnarsson") ## Not run: a <- saver(linnarsson, pred.genes = 1:5, pred.genes.only = TRUE) b <- saver(linnarsson, pred.genes = 6:10, pred.genes.only = TRUE) ab <- combine.saver(list(a, b)) ## End(Not run)
data("linnarsson") ## Not run: a <- saver(linnarsson, pred.genes = 1:5, pred.genes.only = TRUE) b <- saver(linnarsson, pred.genes = 6:10, pred.genes.only = TRUE) ab <- combine.saver(list(a, b)) ## End(Not run)
Adjusts for SAVER estimation uncertainty by calculating and adjusting gene-to-gene and cell-to-cell correlation matrices
cor.genes(x, cor.mat = NULL) cor.cells(x, cor.mat = NULL)
cor.genes(x, cor.mat = NULL) cor.cells(x, cor.mat = NULL)
x |
A |
cor.mat |
If a correlation matrix of the SAVER estimates was already obtained, then it can be provided as an input to avoid recomputation. |
The SAVER estimates that are produced have varying levels of uncertainty depending on the gene and the cell. These functions adjust the gene-to-gene and cell-to-cell correlations of the SAVER estimates to reflect the estimation uncertainty.
An adjusted correlation matrix.
data("linnarsson_saver") gene.cor <- cor.genes(linnarsson_saver)
data("linnarsson_saver") gene.cor <- cor.genes(linnarsson_saver)
Uses cv.glmnet
from the glmnet
package to return the SAVER
prediction.
expr.predict( x, y, pred.cells = 1:length(y), seed = NULL, lambda.max = NULL, lambda.min = NULL )
expr.predict( x, y, pred.cells = 1:length(y), seed = NULL, lambda.max = NULL, lambda.min = NULL )
x |
A log-normalized expression count matrix of genes to be used in the prediction. |
y |
A normalized expression count vector of the gene to be predicted. |
pred.cells |
Index of cells to use for prediction. Default is to use all cells. |
seed |
Sets the seed for reproducible results. |
lambda.max |
Maximum value of lambda which gives null model. |
lambda.min |
Value of lambda from which the prediction model is used |
The SAVER method starts with predicting the prior mean for each cell for a
specific gene. The prediction is performed using the observed normalized
gene count as the response and the normalized gene counts of other genes as
predictors. cv.glmnet
from the glmnet
package is used to fit
the LASSO Poisson regression. The model with the lowest cross-validation
error is chosen and the fitted response values are returned and used as the
SAVER prediction.
A vector of predicted gene expression.
Outputs prior predictions
get.mu(x, saver.obj)
get.mu(x, saver.obj)
x |
Original count matrix. |
saver.obj |
SAVER output. |
This function outputs prior mean predictions used in fitting the
SAVER model.
A matrix of prior mean predictions
data("linnarsson") data("linnarsson_saver") mu <- get.mu(linnarsson, linnarsson_saver)
data("linnarsson") data("linnarsson_saver") mu <- get.mu(linnarsson, linnarsson_saver)
3,529 genes and 200 cells from a mouse brain scRNA-seq dataset.
linnarsson
linnarsson
An object of class matrix
(inherits from array
) with 3529 rows and 200 columns.
Zeisel, A., Munoz-Manchado, A. B., Codeluppi, S., Lonnerberg, P., La Manno, G., Jureus, A., ... Linnarsson, S. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226), 1138-1142.
Output of running 'saver' on the 'linnarsson' dataset.
linnarsson_saver
linnarsson_saver
An object of class saver
of length 3.
Zeisel, A., Munoz-Manchado, A. B., Codeluppi, S., Lonnerberg, P., La Manno, G., Jureus, A., ... Linnarsson, S. (2015). Cell types in the mouse cortex and hippocampus revealed by single-cell RNA-seq. Science, 347(6226), 1138-1142.
Samples from the posterior distribution output by SAVER.
sample.saver(x, rep = 1, efficiency.known = FALSE, seed = NULL)
sample.saver(x, rep = 1, efficiency.known = FALSE, seed = NULL)
x |
A |
rep |
Number of sampled datasets. Default is 1. |
efficiency.known |
Whether the efficiency is known. Default is
|
seed |
seed used in |
The SAVER method outputs a posterior distribution, which we can sample from for downstream analysis. The posterior distribution accounts for uncertainty in the SAVER estimation procedure. If the efficiency is known, negative binomial sampling is performed; otherwise, gamma sampling is performed.
A matrix of expression values sampled from SAVER posterior. If
rep
> 1, a list of matrices is returned
data("linnarsson_saver") samp1 <- sample.saver(linnarsson_saver, seed = 50)
data("linnarsson_saver") samp1 <- sample.saver(linnarsson_saver, seed = 50)
Recovers expression using the SAVER method.
saver( x, do.fast = TRUE, ncores = 1, size.factor = NULL, npred = NULL, pred.cells = NULL, pred.genes = NULL, pred.genes.only = FALSE, null.model = FALSE, mu = NULL, estimates.only = FALSE )
saver( x, do.fast = TRUE, ncores = 1, size.factor = NULL, npred = NULL, pred.cells = NULL, pred.genes = NULL, pred.genes.only = FALSE, null.model = FALSE, mu = NULL, estimates.only = FALSE )
x |
An expression count matrix. The rows correspond to genes and the columns correspond to cells. Can be sparse. |
do.fast |
Approximates the prediction step. Default is TRUE. |
ncores |
Number of cores to use. Default is 1. |
size.factor |
Vector of cell size normalization factors.
If |
npred |
Number of genes for regression prediction. Selects the top
|
pred.cells |
Indices of cells to perform regression prediction. Default is all cells. |
pred.genes |
Indices of specific genes to perform regression
prediction. Overrides |
pred.genes.only |
Return expression levels of only |
null.model |
Whether to use mean gene expression as prediction. |
mu |
Matrix of prior means. |
estimates.only |
Only return SAVER estimates. Default is FALSE. |
The SAVER method starts by estimating the prior mean and variance for the
true expression level for each gene and cell. The prior mean is obtained
through predictions from a LASSO Poisson regression for each gene
implemented using the glmnet
package. Then, the variance is estimated
through maximum likelihood assuming constant variance, Fano factor, or
coefficient of variation variance structure for each gene. The posterior
distribution is calculated and the posterior mean is reported as the SAVER
estimate.
If 'estimates.only = TRUE', then a matrix of SAVER estimates.
If 'estimates.only = FALSE', a list with the following components
estimate |
Recovered (normalized) expression. |
se |
Standard error of estimates. |
info |
Information about dataset. |
The info
element is a list with the following components:
size.factor |
Size factor used for normalization. |
maxcor |
Maximum absolute correlation for each gene. 2 if not calculated |
lambda.max |
Smallest value of lambda which gives the null model. |
lambda.min |
Value of lambda from which the prediction model is used |
sd.cv |
Difference in the number of standard deviations in deviance between the model with lowest cross-validation error and the null model |
pred.time |
Time taken to generate predictions. |
var.time |
Time taken to estimate variance. |
maxcor |
Maximum absolute correlation cutoff used to determine if a gene should be predicted. |
lambda.coefs |
Coefficients for estimating lambda with lowest cross-validation error. |
total.time |
Total time for SAVER estimation. |
data("linnarsson") ## Not run: system.time(linnarsson_saver <- saver(linnarsson, ncores = 12)) ## End(Not run) # predictions for top 5 highly expressed genes ## Not run: saver2 <- saver(linnarsson, npred = 5) ## End(Not run) # predictions for certain genes ## Not run: genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19") genes.ind <- which(rownames(linnarsson) %in% genes) saver3 <- saver(linnarsson, pred.genes = genes.ind) ## End(Not run) # return only certain genes ## Not run: saver4 <- saver(linnarsson, pred.genes = genes.ind, pred.genes.only = TRUE) ## End(Not run)
data("linnarsson") ## Not run: system.time(linnarsson_saver <- saver(linnarsson, ncores = 12)) ## End(Not run) # predictions for top 5 highly expressed genes ## Not run: saver2 <- saver(linnarsson, npred = 5) ## End(Not run) # predictions for certain genes ## Not run: genes <- c("Thy1", "Mbp", "Stim2", "Psmc6", "Rps19") genes.ind <- which(rownames(linnarsson) %in% genes) saver3 <- saver(linnarsson, pred.genes = genes.ind) ## End(Not run) # return only certain genes ## Not run: saver4 <- saver(linnarsson, pred.genes = genes.ind, pred.genes.only = TRUE) ## End(Not run)
Fits SAVER object
saver.fit( x, x.est, do.fast, ncores, sf, scale.sf, pred.genes, pred.cells, null.model, ngenes = nrow(x), ncells = ncol(x), gene.names = rownames(x), cell.names = colnames(x), estimates.only ) saver.fit.mean( x, ncores, sf, scale.sf, mu, ngenes = nrow(x), ncells = ncol(x), gene.names = rownames(x), cell.names = colnames(x), estimates.only ) saver.fit.null( x, ncores, sf, scale.sf, ngenes = nrow(x), ncells = ncol(x), gene.names = rownames(x), cell.names = colnames(x), estimates.only )
saver.fit( x, x.est, do.fast, ncores, sf, scale.sf, pred.genes, pred.cells, null.model, ngenes = nrow(x), ncells = ncol(x), gene.names = rownames(x), cell.names = colnames(x), estimates.only ) saver.fit.mean( x, ncores, sf, scale.sf, mu, ngenes = nrow(x), ncells = ncol(x), gene.names = rownames(x), cell.names = colnames(x), estimates.only ) saver.fit.null( x, ncores, sf, scale.sf, ngenes = nrow(x), ncells = ncol(x), gene.names = rownames(x), cell.names = colnames(x), estimates.only )
x |
An expression count matrix. The rows correspond to genes and the columns correspond to cells. |
x.est |
The log-normalized predictor matrix. The rows correspond to cells and the columns correspond to genes. |
do.fast |
Approximates the prediction step. Default is TRUE. |
ncores |
Number of cores to use. Default is 1. |
sf |
Normalized size factor. |
scale.sf |
Scale of size factor. |
pred.genes |
Index of genes to perform regression prediction. |
pred.cells |
Index of cells to perform regression prediction. |
null.model |
Whether to use mean gene expression as prediction. |
ngenes |
Number of genes. |
ncells |
Number of cells. |
gene.names |
Name of genes. |
cell.names |
Name of cells. |
estimates.only |
Only return SAVER estimates. Default is FALSE. |
mu |
Matrix of prior means. |
The SAVER method starts by estimating the prior mean and variance for the
true expression level for each gene and cell. The prior mean is obtained
through predictions from a Lasso Poisson regression for each gene
implemented using the glmnet
package. Then, the variance is estimated
through maximum likelihood assuming constant variance, Fano factor, or
coefficient of variation variance structure for each gene. The posterior
distribution is calculated and the posterior mean is reported as the SAVER
estimate.
A list with the following components
estimate |
Recovered (normalized) expression |
se |
Standard error of estimates |
info |
Information about fit |