spateo.tools.ST_regression.generalized_lm
Generalized linear model (GLM) regression for spatially-aware modeling of spatial transcriptomic (gene expression) data. Rather than assuming that the response variable is normally distributed, this module allows models whose response variable follows other distributions (e.g. Poisson or gamma), while still supporting normal (Gaussian) modeling. It can also perform elastic net regularized regression.
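Module Contents
Classes
- GLM: Fitting generalized linear models (Gaussian, Poisson, negative binomial, gamma) for modeling gene expression.
- GLMCV: For estimating regularized generalized linear models (GLM) along a regularization path with warm restarts.
Functions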
- _z: Computes z, an intermediate comprising the result of a linear regression, just before non-linearity is applied.
- _nl: Applies nonlinear operation to linear estimation.
- _grad_nl: Derivative of the non-linearity.
- batch_grad: Computes the gradient (for parameter updating) via batch gradient descent.
- log_likelihood: Computes negative log-likelihood of an observation, based on true values and predictions from the regression.
- _loss: Objective function, comprised of a combination of the log-likelihood and regularization losses.
- pseudo_r2: Compute r^2 using log-likelihood, taking into account the observed and predicted distributions as well as the observed and predicted values.
- deviance: Deviance goodness-of-fit.
- fit_glm: Wrapper for fitting a generalized elastic net linear model to large biological data, with automated finding of optimum lambda regularization parameter and optional further grid search for parameter optimization.
- spateo.tools.ST_regression.generalized_lm._z(beta0: float, beta: numpy.ndarray, X: numpy.ndarray, fit_intercept: bool) -> numpy.ndarray
Computes z, an intermediate comprising the result of a linear regression, just before non-linearity is applied.
- Parameters
- beta0
The intercept
- beta
Array of shape [n_features,]; learned model coefficients
- X
Array of shape [n_samples, n_features]; input data
- fit_intercept
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
- Returns
Array of shape [n_samples,]; the linear prediction of the target values, before the non-linearity is applied
- Return type
z
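As a rough sketch of the computation described above (not spateo's actual implementation), the linear predictor can be written in plain NumPy. The function name `linear_predictor` is hypothetical, and simply adding the intercept when `fit_intercept` is True is an assumption made for illustration.

```python
import numpy as np


def linear_predictor(beta0, beta, X, fit_intercept=True):
    """Hypothetical stand-in for _z: the linear part of the model, before any non-linearity."""
    z = X @ beta  # one value per sample
    if fit_intercept:
        z = z + beta0  # assumption: the intercept is simply added when requested
    return z


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
beta = np.array([0.5, -1.0])
z = linear_predictor(0.25, beta, X)
```

Here `z` holds one entry per row of `X`; a non-linearity would then be applied elementwise to obtain predictions.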
- spateo.tools.ST_regression.generalized_lm._nl(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], z: numpy.ndarray, eta: float, fit_intercept: bool) -> numpy.ndarray
Applies nonlinear operation to linear estimation.
- Parameters
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”
- z
Array of shape [n_samples,]; linear prediction of the target values (the output of _z)
- eta
A threshold parameter that linearizes the exp() function above threshold eta
- fit_intercept
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.
- Returns
Array of shape [n_samples,]; result following application of the nonlinear layer
- Return type
nl
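To illustrate what a non-linearity with a threshold `eta` can look like, the sketch below uses exp(z) up to `eta` and continues with the tangent line of exp at `eta` beyond it, plus a softplus alternative. The helper names are hypothetical, and the tangent-line continuation is an assumption about how the exponential is linearized, not a statement about spateo's code.

```python
import numpy as np


def linearized_exp(z, eta=2.0):
    """Hypothetical helper: exp(z) up to eta, tangent line of exp at eta beyond it."""
    z = np.asarray(z, dtype=float)
    below = np.exp(np.minimum(z, eta))  # capped so large z cannot overflow
    above = np.exp(eta) * (1.0 + z - eta)  # tangent line: exp(eta) + exp(eta) * (z - eta)
    return np.where(z > eta, above, below)


def softplus(z):
    """log(1 + exp(z)), computed in a numerically stable way."""
    return np.logaddexp(0.0, z)


mu = linearized_exp([0.0, 2.0, 3.0], eta=2.0)
```

Capping the growth this way keeps predictions finite for large linear predictors, which helps gradient descent stay stable.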
- spateo.tools.ST_regression.generalized_lm._grad_nl(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], z: numpy.ndarray, eta: float)
Derivative of the non-linearity.
- Parameters
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”
- z
Array of shape [n_samples,]; linear prediction of the target values (the output of _z)
- eta
A threshold parameter that linearizes the exp() function above threshold eta
- Returns
Array of shape [n_samples,]; first derivative of the non-linearity, evaluated at each entry of z
- Return type
grad_nl
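A derivative of this kind can be checked numerically. The sketch below assumes a linearized exponential (exp(z) up to `eta`, tangent line above it), for which the derivative is exp(min(z, eta)), and compares it with central finite differences. Both helper names are hypothetical.

```python
import numpy as np


def linearized_exp(z, eta=2.0):
    """Assumed non-linearity: exp(z) up to eta, tangent line of exp at eta beyond it."""
    z = np.asarray(z, dtype=float)
    return np.where(z > eta, np.exp(eta) * (1.0 + z - eta), np.exp(np.minimum(z, eta)))


def grad_linearized_exp(z, eta=2.0):
    """Derivative of linearized_exp: exp(z) below eta, the constant slope exp(eta) above."""
    return np.exp(np.minimum(np.asarray(z, dtype=float), eta))


z = np.array([-1.0, 0.5, 3.0])
h = 1e-6
numeric = (linearized_exp(z + h) - linearized_exp(z - h)) / (2.0 * h)  # central differences
analytic = grad_linearized_exp(z)
```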
- spateo.tools.ST_regression.generalized_lm.batch_grad(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], alpha: float, reg_lambda: float, X: numpy.ndarray, y: numpy.ndarray, beta: numpy.ndarray, Tau: Union[None, numpy.ndarray] = None, eta: float = 2.0, theta: float = 1.0, fit_intercept: bool = True) -> numpy.ndarray
Computes the gradient (for parameter updating) via batch gradient descent
- Parameters
- distr
Distribution family- can be “gaussian”, “softplus”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.
- alpha
The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
- reg_lambda
Regularization parameter \(\lambda\) of penalty term
- X
Array of shape [n_samples, n_features]; input data
- y
Array of shape [n_samples, 1]; labels or targets for the data
- beta
Array of shape [n_features,]; learned model coefficients
- Tau
optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
- eta
A threshold parameter that linearizes the exp() function above threshold eta
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”
- fit_intercept
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function. Defaults to True.
- Returns
Gradient for each parameter
- Return type
g
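For intuition, here is a minimal sketch of a full-batch gradient for the Gaussian case only. It assumes the smooth part of the objective is the mean squared error plus the L2 (Tikhonov) term weighted by `reg_lambda * (1 - alpha)`, with the L1 term handled separately by a proximal step. The function names, the use of a 1D `y`, and the layout of the returned vector (intercept first) are all assumptions for illustration.

```python
import numpy as np


def smooth_loss(beta0, beta, X, y, alpha, reg_lambda, Tau=None):
    """Mean squared-error term plus the L2 (Tikhonov) part of the penalty."""
    Tau = np.eye(beta.size) if Tau is None else Tau
    resid = beta0 + X @ beta - y
    return 0.5 * np.mean(resid**2) + 0.5 * reg_lambda * (1.0 - alpha) * np.sum((Tau @ beta) ** 2)


def gaussian_batch_grad(beta0, beta, X, y, alpha, reg_lambda, Tau=None):
    """Gradient of smooth_loss over the full batch; intercept gradient comes first."""
    Tau = np.eye(beta.size) if Tau is None else Tau
    resid = beta0 + X @ beta - y
    g0 = resid.mean()
    g = X.T @ resid / X.shape[0] + reg_lambda * (1.0 - alpha) * (Tau.T @ Tau @ beta)
    return np.concatenate(([g0], g))


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)  # 1D targets in this sketch
beta0, beta = 0.3, np.array([0.5, -1.0, 2.0])
g = gaussian_batch_grad(beta0, beta, X, y, alpha=0.5, reg_lambda=0.1)
```

With `Tau` left as None the penalty gradient reduces to `reg_lambda * (1 - alpha) * beta`, i.e. the ordinary ridge term.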
- spateo.tools.ST_regression.generalized_lm.log_likelihood(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], y: numpy.ndarray, y_hat: Union[numpy.ndarray, float], theta: float = 1.0) -> float
Computes negative log-likelihood of an observation, based on true values and predictions from the regression.
- Parameters
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
- y
Target values
- y_hat
Predicted values, either array of predictions or scalar value
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”
- Returns
Numerical value for the log-likelihood
- Return type
logL
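As an illustration of likelihood-based scoring, the sketch below evaluates Poisson and Gaussian log-likelihoods up to additive constants that do not depend on the predictions. The function names, the dropped constants, and the small `eps` guard inside the logarithm are assumptions; spateo's own function may differ in sign convention and constants.

```python
import numpy as np


def poisson_log_likelihood(y, y_hat, eps=np.finfo(float).eps):
    """Poisson log-likelihood up to the log(y!) constant, which does not depend on y_hat."""
    y = np.asarray(y, dtype=float)
    y_hat = np.asarray(y_hat, dtype=float)  # a scalar broadcasts against y
    return float(np.sum(y * np.log(y_hat + eps) - y_hat))


def gaussian_log_likelihood(y, y_hat):
    """Gaussian log-likelihood with unit variance, up to an additive constant."""
    return float(-0.5 * np.sum((np.asarray(y) - np.asarray(y_hat)) ** 2))


y = np.array([1.0, 2.0])
ll_perfect = poisson_log_likelihood(y, y)  # predictions equal to the observations
ll_off = poisson_log_likelihood(y, y + 1.0)  # predictions shifted by one count
```

The log-likelihood is highest when predictions match the observations, which is why its negative can serve as a loss.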
- spateo.tools.ST_regression.generalized_lm._loss(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], alpha: float, reg_lambda: float, X: numpy.ndarray, y: numpy.ndarray, beta: numpy.ndarray, Tau: Union[None, numpy.ndarray] = None, eta: float = 2.0, theta: float = 1.0, fit_intercept: bool = True) -> float
Objective function, comprised of a combination of the log-likelihood and regularization losses.
- Parameters
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
- alpha
The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
- reg_lambda
Regularization parameter \(\lambda\) of penalty term
- X
Array of shape [n_samples, n_features]; input data
- y
Array of shape [n_samples, 1]; labels or targets for the data
- beta
Array of shape [n_features,]; learned model coefficients
- Tau
optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
- eta
A threshold parameter that linearizes the exp() function above threshold eta
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”
- fit_intercept
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function. Defaults to True.
- Returns
Numerical value for loss
- Return type
loss
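The regularization part of such an objective can be worked through by hand. The sketch below assumes the common elastic net form, alpha times the L1 norm plus (1 - alpha) times half the squared L2 norm of `Tau @ beta`, scaled by `reg_lambda`. The function name and the exact weighting are assumptions for illustration.

```python
import numpy as np


def elastic_net_penalty(beta, alpha, Tau=None):
    """alpha * L1 + (1 - alpha) * 0.5 * squared L2 norm of Tau @ beta (assumed form)."""
    Tau = np.eye(beta.size) if Tau is None else Tau  # identity gives the plain ridge term
    l1 = np.sum(np.abs(beta))
    l2 = 0.5 * np.sum((Tau @ beta) ** 2)
    return alpha * l1 + (1.0 - alpha) * l2


beta = np.array([3.0, -4.0])
penalty = elastic_net_penalty(beta, alpha=0.5)  # 0.5 * 7 + 0.5 * 12.5
regularization_loss = 0.1 * penalty  # scaled by reg_lambda = 0.1
```

Setting `alpha=1.0` leaves only the L1 (lasso) term, and `alpha=0.0` leaves only the L2 (ridge) term.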
- spateo.tools.ST_regression.generalized_lm.pseudo_r2(y: numpy.ndarray, yhat: numpy.ndarray, ynull_: float, distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], theta: float)
Compute r^2 using log-likelihood, taking into account the observed and predicted distributions as well as the observed and predicted values.
- Parameters
- y
Array of shape [n_samples,]; target values for regression
- yhat
Predicted targets of shape [n_samples,]
- ynull_
Mean of the target labels (null model prediction)
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
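A likelihood-based r^2 compares the fitted model with a null model that always predicts the mean. The sketch below assumes the form 1 - L1 / L0 for the Gaussian case, where L1 and L0 are the log-likelihoods of the model and the null model; with a unit-variance Gaussian log-likelihood this reduces to the familiar 1 - SSE / SST. Function names and the chosen form are assumptions.

```python
import numpy as np


def gaussian_log_likelihood(y, y_hat):
    """Unit-variance Gaussian log-likelihood, up to an additive constant."""
    return -0.5 * np.sum((y - y_hat) ** 2)


def gaussian_pseudo_r2(y, yhat, ynull_):
    """1 - L1 / L0, with L1 the model and L0 the null-model log-likelihood (assumed form)."""
    L1 = gaussian_log_likelihood(y, yhat)
    L0 = gaussian_log_likelihood(y, ynull_)
    return 1.0 - L1 / L0


y = np.array([1.0, 2.0, 3.0, 4.0])
yhat = np.array([1.5, 2.0, 3.0, 3.5])
r2 = gaussian_pseudo_r2(y, yhat, ynull_=y.mean())  # SSE = 0.5, SST = 5.0
```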
- spateo.tools.ST_regression.generalized_lm.deviance(y: numpy.ndarray, yhat: numpy.ndarray, distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"], theta: float)
Deviance goodness-of-fit
- Parameters
- y
Array of shape [n_samples,]; target values for regression
- yhat
Predicted targets of shape [n_samples,]
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
- Returns
Deviance of the predicted labels
- Return type
score
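For reference, the standard Poisson deviance is twice the gap between the log-likelihood of the saturated model (predictions equal to the observations) and that of the fitted model. The sketch below implements that textbook formula; the function name is hypothetical and spateo's implementation may handle edge cases differently.

```python
import numpy as np


def poisson_deviance(y, yhat):
    """2 * sum(y * log(y / yhat) - (y - yhat)), with y * log(y) taken as 0 when y == 0."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    ratio = np.where(y > 0, y / yhat, 1.0)  # log(1) = 0 handles the y == 0 entries
    return float(2.0 * np.sum(y * np.log(ratio) - (y - yhat)))


y = np.array([0.0, 1.0, 2.0])
d_model = poisson_deviance(y, np.array([0.5, 1.0, 2.0]))  # only the first entry is off
d_saturated = poisson_deviance(y[1:], y[1:])  # perfect predictions
```

Lower deviance indicates a better fit, and a perfectly predictive model has deviance zero.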
- class spateo.tools.ST_regression.generalized_lm.GLM(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"] = 'poisson', alpha: float = 0.5, Tau: Union[None, numpy.ndarray] = None, reg_lambda: float = 0.1, learning_rate: float = 0.2, max_iter: int = 1000, tol: float = 1e-06, eta: float = 2.0, clip_coeffs: float = 0.01, score_metric: Literal["deviance", "pseudo_r2"] = 'deviance', fit_intercept: bool = True, random_seed: int = 888, theta: float = 1.0, verbose: bool = True)
Bases: sklearn.base.BaseEstimator
Fits generalized linear models (Gaussian, Poisson, negative binomial, gamma) for modeling gene expression.
Note: ‘Tau’ is the Tikhonov matrix (a square factorization of the inverse covariance matrix), which sets the degree to which the algorithm tends towards solutions with smaller norms. If not given, defaults to the ridge (L2) penalty.
- Parameters
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
- alpha
The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
- Tau
optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
- reg_lambda
Regularization parameter \(\lambda\) of penalty term
- learning_rate
Governs the magnitude of parameter updates for the gradient descent algorithm
- max_iter
Maximum number of iterations for the solver
- tol
Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.
- eta
A threshold parameter that linearizes the exp() function above eta.
- clip_coeffs
Coefficients of lower absolute value than this threshold are set to zero.
- score_metric
Scoring metric. Options:
  - “deviance”: Uses the difference between the saturated (perfectly predictive) model and the fitted model.
  - “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.
- fit_intercept
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
- random_seed
Seed of the random number generator used to initialize the solution. Default: 888
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
- verbose
If True, will display information about number of iterations until convergence. Defaults to True.
- beta0_
The intercept
- beta_
Learned parameters
- n_iter
Number of iterations
- _prox(beta: numpy.ndarray, thresh: float)
Proximal operator to slowly guide convergence during gradient descent.
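For an L1 penalty, the proximal operator is the soft-thresholding function, sketched below. Whether `_prox` is implemented exactly this way is an assumption, and the function name `soft_threshold` is hypothetical.

```python
import numpy as np


def soft_threshold(beta, thresh):
    """Shrink every coefficient towards zero by thresh; entries within thresh of zero become 0."""
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)


beta = np.array([3.0, -0.5, -2.0])
shrunk = soft_threshold(beta, thresh=1.0)  # large entries shrink, the small one is zeroed
```

Applying this after each gradient step is what lets an elastic net or lasso fit produce exactly sparse coefficients.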
- fit(X: numpy.ndarray, y: numpy.ndarray)
Fits the model to the given data.
- Parameters
- X
2D array of shape [n_samples, n_features]; input data
- y
1D array of shape [n_samples,]; target data
- Returns
Fitted instance of class GLM
- Return type
self
- predict(X: numpy.ndarray) -> numpy.ndarray
Given predictor values, reconstruct expression of dependent/target variables.
- Parameters
- X
Array of shape [n_samples, n_features]; input data for prediction
- Returns
Predicted targets of shape [n_samples,]
- Return type
yhat
- fit_predict(X: numpy.ndarray, y: numpy.ndarray)
Fit the model and predict on the same data.
- Parameters
- X
array of shape [n_samples, n_features]; input data to fit and predict
- y
array of shape [n_samples,]; target values for regression
- Returns
Predicted targets of shape [n_samples,]
- Return type
yhat
- score(X: numpy.ndarray, y: numpy.ndarray)
Score model by computing either the deviance or R^2 for predicted values.
- Parameters
- X
array of shape [n_samples, n_features]; input data to fit and predict
- y
array of shape [n_samples,]; target values for regression
- Returns
Value of chosen metric (any positive number for deviance, 0-1 for R^2)
- Return type
score
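To show how the parameters above (learning_rate, max_iter, tol, clip_coeffs, alpha, reg_lambda) can fit together, here is a small self-contained sketch of a proximal gradient loop for the Gaussian case. It is not the class's actual code: the function `fit_gaussian_enet`, the zero initialization, and the exact stopping rule are assumptions for illustration.

```python
import numpy as np


def soft_threshold(beta, thresh):
    """Proximal operator of the L1 penalty."""
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)


def fit_gaussian_enet(X, y, alpha=0.5, reg_lambda=0.1, learning_rate=0.2,
                      max_iter=1000, tol=1e-6, clip_coeffs=0.01):
    """Hypothetical proximal gradient loop for a Gaussian elastic net model."""
    n_samples, n_features = X.shape
    beta0, beta = 0.0, np.zeros(n_features)
    for n_iter in range(1, max_iter + 1):
        resid = beta0 + X @ beta - y
        # Gradient step on the smooth part (squared error + L2 penalty)
        new_beta0 = beta0 - learning_rate * resid.mean()
        grad = X.T @ resid / n_samples + reg_lambda * (1.0 - alpha) * beta
        # Proximal step handles the non-smooth L1 penalty
        new_beta = soft_threshold(beta - learning_rate * grad, learning_rate * reg_lambda * alpha)
        step = np.sqrt((new_beta0 - beta0) ** 2 + np.sum((new_beta - beta) ** 2))
        norm = np.sqrt(new_beta0**2 + np.sum(new_beta**2))
        beta0, beta = new_beta0, new_beta
        if step <= tol * max(norm, 1e-12):  # relative change in parameter norm
            break
    beta[np.abs(beta) < clip_coeffs] = 0.0  # drop negligible coefficients
    return beta0, beta, n_iter


rng = np.random.default_rng(888)
X = rng.normal(size=(200, 3))
y = 1.0 + 2.0 * X[:, 0]  # only the first feature matters
beta0_, beta_, n_iter = fit_gaussian_enet(X, y)
```

In this toy problem the penalty shrinks the first coefficient slightly below its true value of 2 and suppresses the two irrelevant features.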
- class spateo.tools.ST_regression.generalized_lm.GLMCV(distr: Literal["gaussian", "poisson", "softplus", "neg-binomial", "gamma"] = 'poisson', alpha: float = 0.5, Tau: Union[None, numpy.ndarray] = None, reg_lambda: Union[None, List[float]] = None, n_lambdas: int = 25, cv: int = 5, learning_rate: float = 0.2, max_iter: int = 1000, tol: float = 1e-06, eta: float = 2.0, clip_coeffs: float = 0.01, score_metric: Literal["deviance", "pseudo_r2"] = 'deviance', fit_intercept: bool = True, random_seed: int = 888, theta: float = 1.0)
Bases: sklearn.base.BaseEstimator
Estimates regularized generalized linear models (GLMs) along a regularization path with warm restarts.
- Parameters
- distr
Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
- alpha
The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
- Tau
optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
- reg_lambda
Optional list of candidate values for the regularization parameter \(\lambda\) of the penalty term. If not given, a path of n_lambdas values is used.
- n_lambdas
Number of lambdas along the regularization path. Defaults to 25.
- cv
Number of cross-validation repeats
- learning_rate
Governs the magnitude of parameter updates for the gradient descent algorithm
- max_iter
Maximum number of iterations for the solver
- tol
Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.
- eta
A threshold parameter that linearizes the exp() function above eta.
- clip_coeffs
Absolute value below which to set coefficients to zero.
- score_metric
Scoring metric. Options:
  - “deviance”: Uses the difference between the saturated (perfectly predictive) model and the fitted model.
  - “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.
- fit_intercept
Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
- random_seed
Seed of the random number generator used to initialize the solution. Default: 888
- theta
Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
- verbose
If True, returns logging information as program runs. Recommended to set to False for any parallelized processes.
- beta0_
The intercept
- beta_
Learned parameters
- glm_
The GLM object with the best score
- reg_lambda_opt
The value of reg_lambda for the best GLM model
- n_iter
Number of iterations
- fit(X: numpy.ndarray, y: numpy.ndarray)
Fits the model to the given data.
- Parameters
- X
2D array of shape [n_samples, n_features]; input data
- y
1D array of shape [n_samples,]; target data
- Returns
Fitted instance of class GLMCV
- Return type
self
- predict(X: numpy.ndarray) -> numpy.ndarray
Using the best scoring model, predict target values.
- Parameters
- X
Array of shape [n_samples, n_features]; input data for prediction
- Returns
Predicted targets based on the model with optimal reg_lambda, of shape [n_samples,]
- Return type
yhat
- fit_predict(X: numpy.ndarray, y: numpy.ndarray)
Fit the model and, after finding the best model, predict on the same data using that model.
- Parameters
- X
array of shape [n_samples, n_features]; input data to fit and predict
- y
array of shape [n_samples,]; target values for regression
- Returns
Predicted targets based on the model with optimal reg_lambda, of shape [n_samples,]
- Return type
yhat
- score(X: numpy.ndarray, y: numpy.ndarray)
Score model by computing either the deviance or R^2 for predicted values.
- Parameters
- X
array of shape [n_samples, n_features]; input data to fit and predict
- y
array of shape [n_samples,]; target values for regression
- Returns
Value of chosen metric (any positive number for deviance, 0-1 for R^2) for the optimal reg_lambda
- Return type
score
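The idea of choosing reg_lambda by cross-validation over a path of candidate values can be sketched in a few lines. To stay short, this example swaps in closed-form ridge regression as a stand-in for the GLM solver and does not use warm restarts; the function names, the log-spaced lambda grid, and the squared-error scoring are all assumptions for illustration.

```python
import numpy as np


def ridge_fit(X, y, reg_lambda):
    """Closed-form ridge solution, used here only as a cheap stand-in for the GLM solver."""
    n_samples, n_features = X.shape
    A = X.T @ X / n_samples + reg_lambda * np.eye(n_features)
    return np.linalg.solve(A, X.T @ y / n_samples)


def select_lambda(X, y, lambdas, cv=5, seed=888):
    """Return the lambda with the lowest mean held-out squared error over cv folds."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), cv)
    mean_errors = []
    for reg_lambda in lambdas:
        fold_errors = []
        for held_out in folds:
            train = np.setdiff1d(np.arange(len(y)), held_out)
            beta = ridge_fit(X[train], y[train], reg_lambda)
            fold_errors.append(np.mean((y[held_out] - X[held_out] @ beta) ** 2))
        mean_errors.append(np.mean(fold_errors))
    return lambdas[int(np.argmin(mean_errors))], np.array(mean_errors)


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = X @ np.array([1.0, -2.0])  # noise-free, so the weakest penalty should win
lambdas = np.logspace(0, -3, 25)  # 25 values from 1 down to 0.001
best_lambda, errors = select_lambda(X, y, lambdas)
```

With noisy data the selected value would typically sit somewhere inside the path rather than at its smallest end.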
- spateo.tools.ST_regression.generalized_lm.fit_glm(X: Union[numpy.ndarray, pandas.DataFrame], adata: anndata.AnnData, y_feat, calc_first_moment: bool = True, log_transform: bool = True, gs_params: Union[None, dict] = None, n_gs_cv: Union[None, int] = None, return_model: bool = True, **kwargs) -> Tuple[numpy.ndarray, numpy.ndarray, float, numpy.ndarray, Union[None, GLMCV]]
Wrapper for fitting a generalized elastic net linear model to large biological data, with automated finding of optimum lambda regularization parameter and optional further grid search for parameter optimization.
- Parameters
- X
Array or DataFrame containing data for fitting- all columns in this array will be used as independent variables
- adata
AnnData object from which gene expression values for the dependent variable will be taken
- y_feat
Name of the feature in ‘adata’ corresponding to the dependent variable
- log_transform
If True, will log transform expression. Defaults to True.
- calc_first_moment
If True, will alleviate dropout effects by computing the first moment of each gene across cells, consistent with the approach used in the original RNA velocity method (La Manno et al., 2018). Defaults to True.
- gs_params
Optional dictionary where keys are variable names for either the classifier or the regressor and values are lists of potential values for which to find the best combination using grid search. Classifier parameters should be given in the following form: ‘classifier__{parameter name}’.
- n_gs_cv
Number of folds for cross-validation, will only be used if gs_params is not None. If None, will default to a 5-fold cross-validation.
- return_model
If True, returns fitted model. Defaults to True.
- kwargs
Additional named arguments that will be provided to GLMCV. Valid options are:
  - distr: Distribution family- can be “gaussian”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.
  - alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
  - Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
  - reg_lambda: Regularization parameter \(\lambda\) of penalty term
  - n_lambdas: Number of lambdas along the regularization path. Only used if ‘reg_lambda’ is not given.
  - cv: Number of cross-validation repeats
  - learning_rate: Governs the magnitude of parameter updates for the gradient descent algorithm
  - max_iter: Maximum number of iterations for the solver
  - tol: Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.
  - eta: A threshold parameter that linearizes the exp() function above eta.
  - score_metric: Scoring metric. Options:
    - “deviance”: Uses the difference between the saturated (perfectly predictive) model and the fitted model.
    - “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.
  - fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
  - random_seed: Seed of the random number generator used to initialize the solution. Default: 888
  - theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
- Returns
  - Beta: Array of shape [n_parameters, 1]; contains the weight for each parameter
  - rex: Array of shape [n_samples, 1]; reconstructed dependent variable values
  - reg: Instance of the regression model. Returned only if ‘return_model’ is True.