`spateo.tools.ST_regression.generalized_lm`#

Generalized linear model (GLM) regression for spatially-aware modeling of spatial transcriptomic (gene expression) data. Rather than assuming that the response variable is normally distributed, the module allows models whose response variable follows other distributions (e.g. Poisson or gamma), while still supporting normal (Gaussian) modeling. It can also perform elastic net regularized regression.

Module Contents#

Classes#

`GLM`	Fitting generalized linear models (Gaussian, Poisson, negative binomial, gamma) for modeling gene expression.
`GLMCV`	For estimating regularized generalized linear models (GLM) along a regularization path with warm restarts.

Functions#

`_z`(→ numpy.ndarray)	Computes z, an intermediate comprising the result of a linear regression, just before non-linearity is applied.
`_nl`(→ numpy.ndarray)	Applies nonlinear operation to linear estimation.
`_grad_nl`(distr, z, eta)	Derivative of the non-linearity.
`batch_grad`(→ numpy.ndarray)	Computes the gradient (for parameter updating) via batch gradient descent
`log_likelihood`(→ float)	Computes negative log-likelihood of an observation, based on true values and predictions from the regression.
`_loss`(→ float)	Objective function, comprised of a combination of the log-likelihood and regularization losses.
`pseudo_r2`(y, yhat, ynull_, distr, theta)	Compute r^2 using log-likelihood, taking into account the observed and predicted distributions as well as the observed and predicted values.
`deviance`(y, yhat, distr, theta)	Deviance goodness-of-fit
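`fit_glm`(→ Tuple[numpy.ndarray, numpy.ndarray, float, ...)	Wrapper for fitting a generalized elastic net linear model to large biological data, with automated finding of optimum lambda regularization parameter and optional further grid search for parameter optimization.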
`calc_1nd_moment`(X, W[, normalize_W])

spateo.tools.ST_regression.generalized_lm._z(beta0: float, beta: numpy.ndarray, X: numpy.ndarray, fit_intercept: bool) → numpy.ndarray[source]#

Computes z, an intermediate comprising the result of a linear regression, just before non-linearity is applied.

Parameters

beta0: The intercept
beta: Array of shape [n_features,]; learned model coefficients
X: Array of shape [n_samples, n_features]; input data
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

Returns

Array of shape [n_samples,]; the linear predictor for each sample, before the non-linearity is applied

Return type

numpy.ndarray
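
As a minimal sketch of this step (not the library's implementation), the linear predictor can be written with NumPy as below. The helper name `linear_predictor` is hypothetical, and treating a disabled intercept as simply dropping `beta0` is an assumption.

```python
import numpy as np


def linear_predictor(beta0, beta, X, fit_intercept=True):
    """Hypothetical stand-in for the linear step: z = beta0 + X @ beta."""
    z = X @ beta
    if fit_intercept:
        z = z + beta0
    return z


X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # [n_samples, n_features]
beta = np.array([0.5, -1.0])                        # [n_features,]
z = linear_predictor(0.1, beta, X)                  # one value per sample
```

The result has one entry per row of `X`, which is the shape the non-linearity then operates on.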

spateo.tools.ST_regression.generalized_lm._nl(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], z: numpy.ndarray, eta: float, fit_intercept: bool) → numpy.ndarray[source]#

Applies nonlinear operation to linear estimation.

Parameters

distr: Distribution family; can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”
z: Array of shape [n_samples,]; linear predictor computed before the non-linearity is applied
eta: A threshold parameter that linearizes the exp() function above threshold eta
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

Returns

Array of the same shape as z; result following application of the nonlinear layer

Return type

numpy.ndarray
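
The role of `eta` can be illustrated for the Poisson case. The sketch below assumes that above `eta` the exponential is replaced by its tangent line at `eta`, so the curve stays continuous but grows only linearly; the exact form used by `_nl` may differ, and `poisson_mean` is a hypothetical name.

```python
import numpy as np


def poisson_mean(z, eta=2.0):
    """Hypothetical inverse link: exp(z) for z <= eta, tangent line at eta above it."""
    z = np.asarray(z, dtype=float)
    # First-order Taylor expansion of exp around eta.
    linear = np.exp(eta) * (1.0 + z - eta)
    # np.minimum keeps exp() from overflowing on entries that use the linear branch.
    return np.where(z > eta, linear, np.exp(np.minimum(z, eta)))


mu = poisson_mean(np.array([0.0, 2.0, 5.0]), eta=2.0)
```

Capping the growth this way keeps large linear predictors from producing overflowing means and exploding gradients.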

spateo.tools.ST_regression.generalized_lm._grad_nl(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], z: numpy.ndarray, eta: float)[source]#

Derivative of the non-linearity.

Parameters

distr: Distribution family; can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”
z: Array of shape [n_samples,]; linear predictor computed before the non-linearity is applied
eta: A threshold parameter that linearizes the exp() function above threshold eta

Returns

Array of the same shape as z; first derivative of the non-linearity evaluated at each element of z

Return type

grad_nl
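
A derivative like this can be checked numerically. The sketch below assumes a Poisson non-linearity that is exp(z) up to `eta` and linear above it (an assumption about the form, not the library's code) and compares the analytic derivative with central finite differences. Both helper names are hypothetical.

```python
import numpy as np


def poisson_mean(z, eta=2.0):
    """Assumed non-linearity: exp(z) below eta, tangent line at eta above it."""
    z = np.asarray(z, dtype=float)
    linear = np.exp(eta) * (1.0 + z - eta)
    return np.where(z > eta, linear, np.exp(np.minimum(z, eta)))


def poisson_mean_grad(z, eta=2.0):
    """Derivative of poisson_mean: exp(z) below eta, the constant exp(eta) above it."""
    z = np.asarray(z, dtype=float)
    return np.where(z > eta, np.exp(eta), np.exp(np.minimum(z, eta)))


z = np.array([-1.0, 0.5, 3.0])
h = 1e-6
numeric = (poisson_mean(z + h) - poisson_mean(z - h)) / (2 * h)
analytic = poisson_mean_grad(z)
```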

spateo.tools.ST_regression.generalized_lm.batch_grad(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], alpha: float, reg_lambda: float, X: numpy.ndarray, y: numpy.ndarray, beta: numpy.ndarray, Tau: Union[None, numpy.ndarray] = None, eta: float = 2.0, theta: float = 1.0, fit_intercept: bool = True) → numpy.ndarray[source]#

Computes the gradient (for parameter updating) via batch gradient descent

Parameters

distr: Distribution family- can be “gaussian”, “softplus”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.
alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
reg_lambda: Regularization parameter \(\lambda\) of penalty term
X: Array of shape [n_samples, n_features]; input data
y: Array of shape [n_samples, 1]; labels or targets for the data
beta: Array of shape [n_features,]; learned model coefficients
Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
eta: A threshold parameter that linearizes the exp() function above threshold eta
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function. Defaults to True.

Returns

Gradient for each parameter

Return type

numpy.ndarray
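
To make the gradient concrete, the sketch below works through the Gaussian case only. It assumes the smooth part of the objective is half the mean squared error plus the L2 part of the elastic net with `Tau` equal to the identity, and that the L1 part is handled separately by a proximal step. The function names are hypothetical and this is not the library's code.

```python
import numpy as np


def smooth_loss(beta0, beta, X, y, alpha, reg_lambda):
    """Half the mean squared error plus the L2 part of the elastic net."""
    resid = y - (beta0 + X @ beta)
    return 0.5 * np.mean(resid ** 2) + 0.5 * reg_lambda * (1 - alpha) * beta @ beta


def smooth_grad(beta0, beta, X, y, alpha, reg_lambda):
    """Analytic gradient of smooth_loss with respect to beta0 and beta."""
    resid = y - (beta0 + X @ beta)
    g0 = -np.mean(resid)
    g = -(X.T @ resid) / len(y) + reg_lambda * (1 - alpha) * beta
    return g0, g


rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = rng.normal(size=20)
beta0, beta = 0.3, np.array([0.5, -0.2, 0.1])
g0, g = smooth_grad(beta0, beta, X, y, alpha=0.5, reg_lambda=0.1)

# Central finite differences as an independent check of the first coefficient.
h = 1e-6
step = np.array([h, 0.0, 0.0])
numeric = (smooth_loss(beta0, beta + step, X, y, 0.5, 0.1)
           - smooth_loss(beta0, beta - step, X, y, 0.5, 0.1)) / (2 * h)
```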

spateo.tools.ST_regression.generalized_lm.log_likelihood(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], y: numpy.ndarray, y_hat: Union[numpy.ndarray, float], theta: float = 1.0) → float[source]#

Computes negative log-likelihood of an observation, based on true values and predictions from the regression.

Parameters

distr: Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
y: Target values
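y_hat: Predicted values, either array of predictions or scalar value
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”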

Returns

Numerical value for the log-likelihood

Return type

logL
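
For intuition, a Poisson log-likelihood can be sketched as follows. This version drops the log(y!) term, which does not depend on the prediction, and accepts either an array or a scalar for `y_hat`. It is a sketch under those assumptions; the function name is hypothetical and the sign convention of the library function should be confirmed against its source.

```python
import numpy as np


def poisson_loglik(y, y_hat, eps=1e-12):
    """Poisson log-likelihood up to the additive log(y!) constant."""
    y = np.asarray(y, dtype=float)
    y_hat = np.broadcast_to(np.asarray(y_hat, dtype=float), y.shape)
    # eps guards against log(0) when a prediction is exactly zero.
    return float(np.sum(y * np.log(y_hat + eps) - y_hat))


y = np.array([0.0, 1.0, 3.0, 2.0])
ll_perfect = poisson_loglik(y, y)       # predictions equal to the observations
ll_null = poisson_loglik(y, y.mean())   # scalar prediction: the sample mean
```

Predictions that match the observations give a higher value than the constant mean prediction, which is the comparison that likelihood-based scores build on.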

spateo.tools.ST_regression.generalized_lm._loss(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], alpha: float, reg_lambda: float, X: numpy.ndarray, y: numpy.ndarray, beta: numpy.ndarray, Tau: Union[None, numpy.ndarray] = None, eta: float = 2.0, theta: float = 1.0, fit_intercept: bool = True) → float[source]#

Objective function, comprised of a combination of the log-likelihood and regularization losses.

Parameters

distr: Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
reg_lambda: Regularization parameter \(\lambda\) of penalty term
X: Array of shape [n_samples, n_features]; input data
y: Array of shape [n_samples, 1]; labels or targets for the data
beta: Array of shape [n_features,]; learned model coefficients
Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
eta: A threshold parameter that linearizes the exp() function above threshold eta
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function. Defaults to True.

Returns

Numerical value for loss

Return type

loss
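
The way `alpha`, `reg_lambda` and `Tau` combine can be sketched for the Gaussian case. The form below (L1 norm weighted by `alpha`, half the squared norm of `Tau @ beta` weighted by `1 - alpha`, the sum scaled by `reg_lambda`) is a common elastic net convention and is assumed here rather than taken from the source. Both function names are hypothetical.

```python
import numpy as np


def elastic_net_penalty(beta, alpha, Tau=None):
    """alpha * L1 norm + (1 - alpha) * half the squared L2 norm of Tau @ beta."""
    Tau = np.eye(len(beta)) if Tau is None else Tau
    l1 = np.sum(np.abs(beta))
    l2 = 0.5 * np.sum((Tau @ beta) ** 2)
    return alpha * l1 + (1 - alpha) * l2


def objective(beta0, beta, X, y, alpha, reg_lambda, Tau=None):
    """Gaussian data term (half the mean squared error) plus the weighted penalty."""
    resid = y - (beta0 + X @ beta)
    return 0.5 * np.mean(resid ** 2) + reg_lambda * elastic_net_penalty(beta, alpha, Tau)


beta = np.array([1.0, -2.0])
lasso_pen = elastic_net_penalty(beta, alpha=1.0)  # pure L1 penalty
ridge_pen = elastic_net_penalty(beta, alpha=0.0)  # pure L2 penalty
```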

spateo.tools.ST_regression.generalized_lm.pseudo_r2(y: numpy.ndarray, yhat: numpy.ndarray, ynull_: float, distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], theta: float)[source]#

Compute r^2 using log-likelihood, taking into account the observed and predicted distributions as well as the observed and predicted values.

Parameters

y: Array of shape [n_samples,]; target values for regression
yhat: Predicted targets of shape [n_samples,]
ynull_: Mean of the target labels (null model prediction)
distr: Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
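
One common likelihood-based definition compares the fitted model with the null model and with the saturated model (predictions equal to the observations). The Poisson sketch below uses that definition as an assumption; whether `pseudo_r2` uses exactly this formula should be checked against the source. The function names are hypothetical.

```python
import numpy as np


def poisson_loglik(y, y_hat, eps=1e-12):
    """Poisson log-likelihood up to the additive log(y!) constant."""
    y = np.asarray(y, dtype=float)
    y_hat = np.broadcast_to(np.asarray(y_hat, dtype=float), y.shape)
    return float(np.sum(y * np.log(y_hat + eps) - y_hat))


def poisson_pseudo_r2(y, yhat, ynull_):
    """1 - (LL_saturated - LL_fitted) / (LL_saturated - LL_null)."""
    ll_sat = poisson_loglik(y, y)
    ll_fit = poisson_loglik(y, yhat)
    ll_null = poisson_loglik(y, ynull_)
    return 1.0 - (ll_sat - ll_fit) / (ll_sat - ll_null)


y = np.array([0.0, 1.0, 3.0, 2.0])
r2_null = poisson_pseudo_r2(y, np.full_like(y, y.mean()), y.mean())
r2_perfect = poisson_pseudo_r2(y, y, y.mean())
```

Under this definition a model no better than the mean scores 0 and a perfect reconstruction scores 1.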

spateo.tools.ST_regression.generalized_lm.deviance(y: numpy.ndarray, yhat: numpy.ndarray, distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], theta: float)[source]#

Deviance goodness-of-fit

Parameters

y: Array of shape [n_samples,]; target values for regression
yhat: Predicted targets of shape [n_samples,]
distr: Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.

Returns

Deviance of the predicted labels

Return type

score
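
As an illustration, the textbook Poisson deviance is twice the gap between the saturated and fitted log-likelihoods. The sketch below implements that formula directly; it is not the library's code and `poisson_deviance` is a hypothetical name.

```python
import numpy as np


def poisson_deviance(y, yhat):
    """2 * sum(y * log(y / yhat) - (y - yhat)), with 0 * log(0) taken as 0."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    ratio = np.ones_like(y)
    pos = y > 0
    ratio[pos] = y[pos] / yhat[pos]  # entries with y == 0 keep ratio 1, so log(ratio) = 0
    return float(2.0 * np.sum(y * np.log(ratio) - (y - yhat)))


y = np.array([0.0, 1.0, 3.0, 2.0])
dev_perfect = poisson_deviance(y, y)
dev_mean = poisson_deviance(y, np.full_like(y, y.mean()))
```

Lower values indicate a better fit, with zero reached only when the predictions reproduce the observations.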

class spateo.tools.ST_regression.generalized_lm.GLM(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'] = 'poisson', alpha: float = 0.5, Tau: Union[None, numpy.ndarray] = None, reg_lambda: float = 0.1, learning_rate: float = 0.2, max_iter: int = 1000, tol: float = 1e-06, eta: float = 2.0, clip_coeffs: float = 0.01, score_metric: Literal['deviance', 'pseudo_r2'] = 'deviance', fit_intercept: bool = True, random_seed: int = 888, theta: float = 1.0, verbose: bool = True)[source]#

Bases: sklearn.base.BaseEstimator

Fitting generalized linear models (Gaussian, Poisson, negative binomial, gamma) for modeling gene expression.

Note: ‘Tau’ is the Tikhonov matrix (a square factorization of the inverse covariance matrix), used to set the degree to which the algorithm tends towards solutions with smaller norms. If not given, the penalty defaults to the standard ridge (L2) penalty.

Parameters

distr: Distribution family- can be “gaussian”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.
alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
reg_lambda: Regularization parameter \(\lambda\) of penalty term
learning_rate: Governs the magnitude of parameter updates for the gradient descent algorithm
max_iter: Maximum number of iterations for the solver
tol: Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.
eta: A threshold parameter that linearizes the exp() function above eta.
clip_coeffs: Coefficients of lower absolute value than this threshold are set to zero.
score_metric: Scoring metric. Options:
- “deviance”: Uses the difference between the saturated (perfectly predictive) model and the fitted model.
- “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
random_seed: Seed of the random number generator used to initialize the solution. Default: 888
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
verbose: If True, will display information about number of iterations until convergence. Defaults to True.

beta0_#: The intercept

beta_#: Learned parameters

n_iter#: Number of iterations

__repr__()[source]#: Return repr(self).

_prox(beta: numpy.ndarray, thresh: float)[source]#: Proximal operator to slowly guide convergence during gradient descent.
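
For an L1 penalty, the standard proximal operator is soft-thresholding: each coefficient is pulled toward zero by `thresh`, and any coefficient smaller than `thresh` in absolute value becomes exactly zero. The sketch below assumes `_prox` follows this standard form; `soft_threshold` is a hypothetical name.

```python
import numpy as np


def soft_threshold(beta, thresh):
    """Shrink each coefficient toward zero by thresh, zeroing the small ones."""
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)


shrunk = soft_threshold(np.array([-1.5, -0.2, 0.0, 0.3, 2.0]), thresh=0.5)
# The three middle entries become zero; the outer two move to -1.0 and 1.5.
```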

fit(X: numpy.ndarray, y: numpy.ndarray)[source]#

The fit function.

Parameters

X: 2D array of shape [n_samples, n_features]; input data
y: 1D array of shape [n_samples,]; target data

Returns

Fitted instance of class GLM

Return type

self

predict(X: numpy.ndarray) → numpy.ndarray[source]#

Given predictor values, reconstruct expression of dependent/target variables.

Parameters

X: Array of shape [n_samples, n_features]; input data for prediction

Returns

Predicted targets of shape [n_samples,]

Return type

yhat

fit_predict(X: numpy.ndarray, y: numpy.ndarray)[source]#

Fit the model and predict on the same data.

Parameters

X: array of shape [n_samples, n_features]; input data to fit and predict
y: array of shape [n_samples,]; target values for regression

Returns

Predicted targets of shape [n_samples,]

Return type

yhat

score(X: numpy.ndarray, y: numpy.ndarray)[source]#

Score model by computing either the deviance or R^2 for predicted values.

Parameters

X: array of shape [n_samples, n_features]; input data to fit and predict
y: array of shape [n_samples,]; target values for regression

Returns

Value of chosen metric (any positive number for deviance, 0-1 for R^2)

Return type

score

class spateo.tools.ST_regression.generalized_lm.GLMCV(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'] = 'poisson', alpha: float = 0.5, Tau: Union[None, numpy.ndarray] = None, reg_lambda: Union[None, List[float]] = None, n_lambdas: int = 25, cv: int = 5, learning_rate: float = 0.2, max_iter: int = 1000, tol: float = 1e-06, eta: float = 2.0, clip_coeffs: float = 0.01, score_metric: Literal['deviance', 'pseudo_r2'] = 'deviance', fit_intercept: bool = True, random_seed: int = 888, theta: float = 1.0)[source]#

Bases: sklearn.base.BaseEstimator

For estimating regularized generalized linear models (GLM) along a regularization path with warm restarts.

Parameters

distr: Distribution family- can be “gaussian”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.
alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
reg_lambda: Regularization parameter \(\lambda\) of penalty term
n_lambdas: Number of lambdas along the regularization path. Defaults to 25.
cv: Number of cross-validation repeats
learning_rate: Governs the magnitude of parameter updates for the gradient descent algorithm
max_iter: Maximum number of iterations for the solver
tol: Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.
eta: A threshold parameter that linearizes the exp() function above eta.
clip_coeffs: Absolute value below which to set coefficients to zero.
score_metric: Scoring metric. Options:
- “deviance”: Uses the difference between the saturated (perfectly predictive) model and the fitted model.
- “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
random_seed: Seed of the random number generator used to initialize the solution. Default: 888
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
verbose: If True, returns logging information as program runs. Recommended to set to False for any parallelized processes.
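
The idea of a regularization path with warm restarts can be sketched independently of this class. The example below fits a Gaussian lasso by proximal gradient descent for a decreasing sequence of lambdas, starting each fit from the previous solution. It omits cross-validation and the intercept, all names are hypothetical, and it is an illustration of the technique rather than the class's implementation.

```python
import numpy as np


def soft_threshold(b, t):
    """Proximal operator of the L1 norm."""
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)


def fit_lasso(X, y, reg_lambda, beta_init, learning_rate=0.1, max_iter=500, tol=1e-6):
    """Proximal gradient descent for a Gaussian lasso, started from beta_init."""
    beta = beta_init.copy()
    for n_iter in range(1, max_iter + 1):
        grad = -(X.T @ (y - X @ beta)) / len(y)
        new = soft_threshold(beta - learning_rate * grad, learning_rate * reg_lambda)
        # Stop when the relative change in the parameter norm is below tol.
        done = np.linalg.norm(new - beta) <= tol * max(np.linalg.norm(beta), 1e-12)
        beta = new
        if done:
            break
    return beta, n_iter


rng = np.random.default_rng(888)
X = rng.normal(size=(100, 5))
y = X @ np.array([2.0, 0.0, -1.0, 0.0, 0.0]) + 0.1 * rng.normal(size=100)

lambdas = np.logspace(0, -3, 10)  # strongest to weakest regularization
beta = np.zeros(X.shape[1])
path = []
for lam in lambdas:
    beta, n_iter = fit_lasso(X, y, lam, beta_init=beta)  # warm restart
    path.append(beta.copy())
```

Starting each fit from the neighboring solution usually means fewer iterations per lambda than starting from zero every time.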

beta0_#: The intercept

beta_#: Learned parameters

glm_#: The GLM object with the best score

reg_lambda_opt#: The value of reg_lambda for the best GLM model

n_iter#: Number of iterations

__repr__()[source]#: Return repr(self).

fit(X: numpy.ndarray, y: numpy.ndarray)[source]#

The fit function.

Parameters

X: 2D array of shape [n_samples, n_features]; input data
y: 1D array of shape [n_samples,]; target data

Returns

Fitted instance of class GLMCV

Return type

self

predict(X: numpy.ndarray) → numpy.ndarray[source]#

Using the best scoring model, predict target values.

Parameters

X: Array of shape [n_samples, n_features]; input data for prediction

Returns

Predicted targets based on the model with optimal reg_lambda, of shape [n_samples,]

Return type

yhat

fit_predict(X: numpy.ndarray, y: numpy.ndarray)[source]#

Fit the model and, after finding the best model, predict on the same data using that model.

Parameters

X: array of shape [n_samples, n_features]; input data to fit and predict
y: array of shape [n_samples,]; target values for regression

Returns

Predicted targets based on the model with optimal reg_lambda, of shape [n_samples,]

Return type

yhat

score(X: numpy.ndarray, y: numpy.ndarray)[source]#

Score model by computing either the deviance or R^2 for predicted values.

Parameters

X: array of shape [n_samples, n_features]; input data to fit and predict
y: array of shape [n_samples,]; target values for regression

Returns

Value of chosen metric (any positive number for deviance, 0-1 for R^2) for the optimal reg_lambda

Return type

score

spateo.tools.ST_regression.generalized_lm.fit_glm(X: Union[numpy.ndarray, pandas.DataFrame], adata: anndata.AnnData, y_feat, calc_first_moment: bool = True, log_transform: bool = True, gs_params: Union[None, dict] = None, n_gs_cv: Union[None, int] = None, return_model: bool = True, **kwargs) → Tuple[numpy.ndarray, numpy.ndarray, float, numpy.ndarray, Union[None, GLMCV]][source]#

Wrapper for fitting a generalized elastic net linear model to large biological data, with automated finding of optimum lambda regularization parameter and optional further grid search for parameter optimization.

Parameters

X

Array or DataFrame containing data for fitting- all columns in this array will be used as independent variables

adata

AnnData object from which dependent variable gene expression values will be taken

y_feat

Name of the feature in ‘adata’ corresponding to the dependent variable

log_transform

If True, will log transform expression. Defaults to True.

calc_first_moment

If True, will alleviate dropout effects by computing the first moment of each gene across cells, consistent with the method used by the original RNA velocity method (La Manno et al., 2018). Defaults to True.

gs_params

Optional dictionary where keys are variable names for either the classifier or the regressor and values are lists of potential values for which to find the best combination using grid search. Classifier parameters should be given in the following form: ‘classifier__{parameter name}’.

n_gs_cv

Number of folds for cross-validation, will only be used if gs_params is not None. If None, will default to a 5-fold cross-validation.

return_model

If True, returns fitted model. Defaults to True.

kwargs

Additional named arguments that will be provided to GLMCV. Valid options are:

distr: Distribution family; can be “gaussian”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.
alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function
Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.
reg_lambda: Regularization parameter \(\lambda\) of penalty term
n_lambdas: Number of lambdas along the regularization path. Only used if ‘reg_lambda’ is not given.
cv: Number of cross-validation repeats
learning_rate: Governs the magnitude of parameter updates for the gradient descent algorithm
max_iter: Maximum number of iterations for the solver
tol: Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.
eta: A threshold parameter that linearizes the exp() function above eta.
score_metric: Scoring metric. Options:
- “deviance”: Uses the difference between the saturated (perfectly predictive) model and the fitted model.
- “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.
fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function
random_seed: Seed of the random number generator used to initialize the solution. Default: 888
theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.

Returns

Beta: Array of shape [n_parameters, 1]; contains the weight for each parameter.
rex: Array of shape [n_samples, 1]; reconstructed dependent variable values.
reg: Instance of the regression model. Returned only if ‘return_model’ is True.

Return type

Tuple[numpy.ndarray, numpy.ndarray, float, numpy.ndarray, Union[None, GLMCV]]

spateo.tools.ST_regression.generalized_lm.calc_1nd_moment(X, W, normalize_W=True)[source]#
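
This function is listed without a description. Judging only from its signature and from the `calc_first_moment` option of `fit_glm`, it plausibly smooths expression values over a cell-by-cell weight matrix `W`. The sketch below illustrates that idea under this assumption; it is not taken from the source, and `first_moment` is a hypothetical name.

```python
import numpy as np


def first_moment(X, W, normalize_W=True):
    """Hypothetical neighbor averaging: each output row is a weighted mean of rows of X."""
    W = np.asarray(W, dtype=float)
    if normalize_W:
        W = W / W.sum(axis=1, keepdims=True)  # make each row of W sum to 1
    return W @ X


X = np.array([[0.0], [2.0], [4.0]])              # expression of one gene in three cells
W = np.array([[1, 1, 0], [1, 1, 1], [0, 1, 1]])  # neighbor graph including self-loops
smoothed = first_moment(X, W)                    # each cell averaged with its neighbors
```

Averaging over neighbors in this way fills in zeros caused by dropout using values from similar cells.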
