spateo.tools.ST_regression.generalized_lm#

Generalized linear model (GLM) regression for spatially-aware modeling of spatial transcriptomic (gene expression) data. Rather than assuming that the response variable is normally distributed, the module allows models whose response follows other distributions (e.g. Poisson or gamma), while still supporting normal (Gaussian) modeling. It also supports elastic net regularized regression.

Module Contents#

Classes#

GLM

Fitting generalized linear models (Gaussian, Poisson, negative binomial, gamma) for modeling gene expression.

GLMCV

For estimating regularized generalized linear models (GLM) along a regularization path with warm restarts.

Functions#

_z(→ numpy.ndarray)

Computes z, an intermediate comprising the result of a linear regression, just before non-linearity is applied.

_nl(→ numpy.ndarray)

Applies nonlinear operation to linear estimation.

_grad_nl(distr, z, eta)

Derivative of the non-linearity.

batch_grad(→ numpy.ndarray)

Computes the gradient (for parameter updating) via batch gradient descent

log_likelihood(→ float)

Computes negative log-likelihood of an observation, based on true values and predictions from the regression.

_loss(→ float)

Objective function, comprised of a combination of the log-likelihood and regularization losses.

pseudo_r2(y, yhat, ynull_, distr, theta)

Compute r^2 using log-likelihood, taking into account the observed and predicted distributions as well as the observed and predicted values.

deviance(y, yhat, distr, theta)

Deviance goodness-of-fit

fit_glm(→ Tuple[numpy.ndarray, numpy.ndarray, float, ...)

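Wrapper for fitting a generalized elastic net linear model to large biological data, with automated finding of optimum lambda regularization parameter and optional further grid search for parameter optimization.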

calc_1nd_moment(X, W[, normalize_W])

spateo.tools.ST_regression.generalized_lm._z(beta0: float, beta: numpy.ndarray, X: numpy.ndarray, fit_intercept: bool) numpy.ndarray[source]#

Computes z, an intermediate comprising the result of a linear regression, just before non-linearity is applied.

Parameters
beta0

The intercept

beta

Array of shape [n_features,]; learned model coefficients

X

Array of shape [n_samples, n_features]; input data

fit_intercept

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

Returns

Array of shape [n_samples, n_features]; prediction of the target values

Return type

z
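As a minimal sketch of the computation described above (not the library's source), the linear predictor can be written with NumPy. The function name `linear_predictor` and the toy data are illustrative assumptions, and 1D coefficient and output arrays are assumed for simplicity.

```python
import numpy as np

def linear_predictor(beta0, beta, X, fit_intercept=True):
    """Hypothetical stand-in for _z: the linear term X @ beta, plus an optional intercept."""
    z = X @ beta
    if fit_intercept:
        z = z + beta0
    return z

X = np.array([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])  # [n_samples, n_features]
beta = np.array([0.5, -1.0])                         # [n_features,]
z = linear_predictor(2.0, beta, X)
z_no_intercept = linear_predictor(2.0, beta, X, fit_intercept=False)
print(z, z_no_intercept)
```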

spateo.tools.ST_regression.generalized_lm._nl(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], z: numpy.ndarray, eta: float, fit_intercept: bool) numpy.ndarray[source]#

Applies nonlinear operation to linear estimation.

Parameters
distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”

z

Array of shape [n_samples, n_features]; prediction of the target values

eta

A threshold parameter that linearizes the exp() function above threshold eta

fit_intercept

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function.

Returns

An array of size [n_samples, n_features]; result following application of the nonlinear layer

Return type

nl
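To illustrate what "linearizes the exp() function above threshold eta" could mean, here is a hedged sketch of an inverse link for three of the families. It assumes the linearization is the tangent line of exp() at eta (a first-order Taylor expansion); the function name `inverse_link` is hypothetical and the library's actual implementation may differ.

```python
import numpy as np

def inverse_link(distr, z, eta=2.0):
    """Illustrative inverse link (assumed forms, not the library's code)."""
    z = np.asarray(z, dtype=float)
    if distr == "gaussian":
        return z  # identity link
    if distr == "softplus":
        return np.logaddexp(0.0, z)  # log(1 + exp(z)), computed stably
    if distr == "poisson":
        # exp(z) up to eta; above eta, continue along the tangent line at eta
        tangent = np.exp(eta) * (1.0 + z - eta)
        return np.where(z > eta, tangent, np.exp(np.minimum(z, eta)))
    raise ValueError(f"unsupported distr in this sketch: {distr}")

z = np.array([-1.0, 0.0, 2.0, 5.0])
mu = inverse_link("poisson", z, eta=2.0)
print(mu)
```

Capping the exponential this way keeps predictions from overflowing when the linear predictor becomes large, at the cost of no longer being an exact log link above eta.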

spateo.tools.ST_regression.generalized_lm._grad_nl(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], z: numpy.ndarray, eta: float)[source]#

Derivative of the non-linearity.

Parameters
distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”

z

Array of shape [n_samples, n_features]; prediction of the target values

eta

A threshold parameter that linearizes the exp() function above threshold eta

Returns

Array of size [n_samples, n_features]; first derivative of each parameter estimate

Return type

grad_nl
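The derivative of a thresholded exponential can be checked numerically. The sketch below assumes the Poisson non-linearity is exp(z) up to eta and its tangent line above eta; both helper names are hypothetical and only illustrate the idea.

```python
import numpy as np

def linearized_exp(z, eta):
    # exp(z) below the threshold, tangent line at eta above it (assumed form)
    return np.where(z > eta, np.exp(eta) * (1.0 + z - eta), np.exp(np.minimum(z, eta)))

def grad_linearized_exp(z, eta):
    # slope is exp(z) below the threshold and the constant exp(eta) above it
    return np.exp(np.minimum(z, eta))

z = np.array([-1.0, 0.5, 1.9, 3.0, 6.0])
eta = 2.0
h = 1e-6
numeric = (linearized_exp(z + h, eta) - linearized_exp(z - h, eta)) / (2 * h)
analytic = grad_linearized_exp(z, eta)
max_err = np.max(np.abs(numeric - analytic))
print(max_err)
```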

spateo.tools.ST_regression.generalized_lm.batch_grad(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], alpha: float, reg_lambda: float, X: numpy.ndarray, y: numpy.ndarray, beta: numpy.ndarray, Tau: Union[None, numpy.ndarray] = None, eta: float = 2.0, theta: float = 1.0, fit_intercept: bool = True) numpy.ndarray[source]#

Computes the gradient (for parameter updating) via batch gradient descent

Parameters
distr

Distribution family- can be “gaussian”, “softplus”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.

alpha

The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function

reg_lambda

Regularization parameter \(\lambda\) of penalty term

X

Array of shape [n_samples, n_features]; input data

y

Array of shape [n_samples, 1]; labels or targets for the data

beta

Array of shape [n_features,]; learned model coefficients

Tau

optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.

eta

A threshold parameter that linearizes the exp() function above threshold eta

theta

Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”

fit_intercept

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function. Defaults to True.

Returns

Gradient for each parameter

Return type

g
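For intuition, the following sketch computes a batch gradient for the Gaussian case only and verifies it against finite differences. It assumes that the gradient covers the smooth part of the objective (mean squared error plus the Tikhonov/L2 term weighted by `1 - alpha`), that the L1 term is handled separately, and that the intercept is stored as the first entry of the parameter vector. These conventions and all function names are assumptions for illustration.

```python
import numpy as np

def smooth_loss(params, X, y, alpha, reg_lambda, Tau):
    b0, b = params[0], params[1:]
    r = y - (b0 + X @ b)
    ridge = 0.5 * reg_lambda * (1.0 - alpha) * np.sum((Tau @ b) ** 2)
    return 0.5 * np.mean(r ** 2) + ridge

def gaussian_batch_grad(params, X, y, alpha, reg_lambda, Tau):
    b0, b = params[0], params[1:]
    r = y - (b0 + X @ b)                      # residuals over the full batch
    g0 = -np.mean(r)                          # intercept gradient (not penalized)
    g = -(X.T @ r) / X.shape[0] + reg_lambda * (1.0 - alpha) * (Tau.T @ Tau @ b)
    return np.concatenate(([g0], g))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
params = np.array([0.3, 0.1, -0.2, 0.4])      # [beta0, beta...]
Tau = np.diag([1.0, 2.0, 3.0])                # example Tikhonov matrix
args = (X, y, 0.5, 0.1, Tau)
g = gaussian_batch_grad(params, *args)

# Central finite differences along each coordinate direction
h = 1e-6
numeric = np.array([
    (smooth_loss(params + h * e, *args) - smooth_loss(params - h * e, *args)) / (2 * h)
    for e in np.eye(params.size)
])
print(np.max(np.abs(g - numeric)))
```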

spateo.tools.ST_regression.generalized_lm.log_likelihood(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], y: numpy.ndarray, y_hat: Union[numpy.ndarray, float], theta: float = 1.0) float[source]#

Computes negative log-likelihood of an observation, based on true values and predictions from the regression.

Parameters
distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

y

Target values

y_hat

Predicted values, either array of predictions or scalar value

Returns

Numerical value for the log-likelihood

Return type

logL
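As a worked example for one family, the Poisson log-likelihood of observations y given predicted means y_hat is the sum of y·log(y_hat) − y_hat − log(y!). The sketch below implements that textbook formula with a hypothetical helper; the sign convention and any constant terms used by the library function are not confirmed here.

```python
import math
import numpy as np

def poisson_log_likelihood(y, y_hat):
    """Textbook Poisson log-likelihood; y_hat may be an array or a scalar."""
    y = np.asarray(y, dtype=float)
    y_hat = np.broadcast_to(np.asarray(y_hat, dtype=float), y.shape)
    log_factorial = np.array([math.lgamma(v + 1.0) for v in y])  # log(y!)
    return float(np.sum(y * np.log(y_hat) - y_hat - log_factorial))

y = np.array([0.0, 1.0, 3.0])
ll_model = poisson_log_likelihood(y, np.array([0.5, 1.0, 3.0]))
ll_null = poisson_log_likelihood(y, y.mean())   # scalar prediction (null model)
print(ll_model, ll_null)
```

A prediction that tracks the observations more closely yields a higher log-likelihood than the constant mean prediction.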

spateo.tools.ST_regression.generalized_lm._loss(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], alpha: float, reg_lambda: float, X: numpy.ndarray, y: numpy.ndarray, beta: numpy.ndarray, Tau: Union[None, numpy.ndarray] = None, eta: float = 2.0, theta: float = 1.0, fit_intercept: bool = True) float[source]#

Objective function, comprised of a combination of the log-likelihood and regularization losses.

Parameters
distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

alpha

The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function

reg_lambda

Regularization parameter \(\lambda\) of penalty term

X

Array of shape [n_samples, n_features]; input data

y

Array of shape [n_samples, 1]; labels or targets for the data

beta

Array of shape [n_features,]; learned model coefficients

Tau

optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.

eta

A threshold parameter that linearizes the exp() function above threshold eta

theta

Shape parameter of the negative binomial distribution (number of successes before the first failure). Used only if ‘distr’ is “neg-binomial”

fit_intercept

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function. Defaults to True.

Returns

Numerical value for loss

Return type

loss
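The regularization part of such an objective can be sketched as a standard elastic net penalty. The exact scaling used by the library is not stated above, so the 1/2 factor on the L2 term and the helper name `elastic_net_penalty` are assumptions.

```python
import numpy as np

def elastic_net_penalty(beta, alpha, reg_lambda, Tau=None):
    """Assumed form: lambda * (alpha * ||beta||_1 + (1 - alpha) * 0.5 * ||Tau @ beta||_2^2)."""
    if Tau is None:
        Tau = np.eye(beta.shape[0])  # identity Tikhonov matrix gives plain ridge
    l1 = np.sum(np.abs(beta))
    l2 = 0.5 * np.sum((Tau @ beta) ** 2)
    return reg_lambda * (alpha * l1 + (1.0 - alpha) * l2)

beta = np.array([1.0, -2.0, 0.0])
lasso_only = elastic_net_penalty(beta, alpha=1.0, reg_lambda=0.1)
ridge_only = elastic_net_penalty(beta, alpha=0.0, reg_lambda=0.1)
mixed = elastic_net_penalty(beta, alpha=0.5, reg_lambda=0.1)
print(lasso_only, ridge_only, mixed)
```

With `alpha=1.0` only the L1 term contributes, with `alpha=0.0` only the L2 term does, and intermediate values blend the two.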

spateo.tools.ST_regression.generalized_lm.pseudo_r2(y: numpy.ndarray, yhat: numpy.ndarray, ynull_: float, distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], theta: float)[source]#

Compute r^2 using log-likelihood, taking into account the observed and predicted distributions as well as the observed and predicted values.

Parameters
y

Array of shape [n_samples,]; target values for regression

yhat

Predicted targets of shape [n_samples,]

ynull_

Mean of the target labels (null model prediction)

distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

theta

Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.
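One common likelihood-based definition of pseudo-R^2 compares the fitted model against a saturated model (predictions equal to the observations) and a null model (constant prediction). The Poisson sketch below uses that definition; whether the library uses exactly this formula is an assumption, and the helper names are hypothetical.

```python
import numpy as np

def poisson_ll_kernel(y, mu):
    # Log-likelihood without the -log(y!) term, which cancels in the ratio below.
    mu = np.broadcast_to(np.asarray(mu, dtype=float), y.shape)
    # Where y == 0 the term y * log(mu) is 0; substitute 1 to avoid log(0).
    # Assumes mu > 0 wherever y > 0.
    safe_mu = np.where(y > 0, mu, 1.0)
    return float(np.sum(y * np.log(safe_mu) - mu))

def pseudo_r2_poisson(y, yhat, ynull):
    ll_sat = poisson_ll_kernel(y, y)        # saturated model
    ll_model = poisson_ll_kernel(y, yhat)   # fitted model
    ll_null = poisson_ll_kernel(y, ynull)   # null model (mean of y)
    return 1.0 - (ll_sat - ll_model) / (ll_sat - ll_null)

y = np.array([0.0, 1.0, 3.0, 4.0])
r2_perfect = pseudo_r2_poisson(y, y, y.mean())
r2_null = pseudo_r2_poisson(y, np.full_like(y, y.mean()), y.mean())
r2_partial = pseudo_r2_poisson(y, np.array([0.5, 1.5, 2.5, 3.5]), y.mean())
print(r2_perfect, r2_null, r2_partial)
```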

spateo.tools.ST_regression.generalized_lm.deviance(y: numpy.ndarray, yhat: numpy.ndarray, distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'], theta: float)[source]#

Deviance goodness-of-fit

Parameters
y

Array of shape [n_samples,]; target values for regression

yhat

Predicted targets of shape [n_samples,]

distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

theta

Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.

Returns

Deviance of the predicted labels

Return type

score
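For reference, the textbook Poisson deviance is 2·Σ[y·log(y/yhat) − (y − yhat)], which is zero for a perfect fit and grows as predictions move away from the observations. The helper below is a hypothetical sketch of that formula only; the other families use different expressions.

```python
import numpy as np

def poisson_deviance(y, yhat):
    """Textbook Poisson deviance; assumes yhat > 0 everywhere."""
    y = np.asarray(y, dtype=float)
    yhat = np.asarray(yhat, dtype=float)
    # For y == 0 the term y * log(y / yhat) is defined as 0, so use a ratio of 1 there.
    ratio = np.where(y > 0, y / yhat, 1.0)
    return float(2.0 * np.sum(y * np.log(ratio) - (y - yhat)))

y = np.array([0.0, 1.0, 3.0])
dev_close = poisson_deviance(y, np.array([0.5, 1.0, 3.0]))
dev_far = poisson_deviance(y, np.array([2.0, 2.0, 2.0]))
print(dev_close, dev_far)
```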

class spateo.tools.ST_regression.generalized_lm.GLM(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'] = 'poisson', alpha: float = 0.5, Tau: Union[None, numpy.ndarray] = None, reg_lambda: float = 0.1, learning_rate: float = 0.2, max_iter: int = 1000, tol: float = 1e-06, eta: float = 2.0, clip_coeffs: float = 0.01, score_metric: Literal['deviance', 'pseudo_r2'] = 'deviance', fit_intercept: bool = True, random_seed: int = 888, theta: float = 1.0, verbose: bool = True)[source]#

Bases: sklearn.base.BaseEstimator

Fitting generalized linear models (Gaussian, Poisson, negative binomial, gamma) for modeling gene expression.

NOTES: ‘Tau’ is the Tikhonov matrix (a square factorization of the inverse covariance matrix), used to set the degree to which the algorithm tends towards solutions with smaller norms. If not given, defaults to the ridge (L2) penalty.

Parameters
distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

alpha

The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function

Tau

optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.

reg_lambda

Regularization parameter \(\lambda\) of penalty term

learning_rate

Governs the magnitude of parameter updates for the gradient descent algorithm

max_iter

Maximum number of iterations for the solver

tol

Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.

eta

A threshold parameter that linearizes the exp() function above eta.

clip_coeffs

Coefficients of lower absolute value than this threshold are set to zero.

score_metric

Scoring metric. Options:

  • “deviance”: Uses the difference between the saturated (perfectly predictive) model and the true model.

  • “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.

fit_intercept

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function

random_seed

Seed of the random number generator used to initialize the solution. Default: 888

theta

Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.

verbose

If True, will display information about number of iterations until convergence. Defaults to True.

beta0_#

The intercept

beta_#

Learned parameters

n_iter#

Number of iterations

__repr__()[source]#

Return repr(self).

_prox(beta: numpy.ndarray, thresh: float)[source]#

Proximal operator to slowly guide convergence during gradient descent.
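For an L1 penalty, the proximal operator is the soft-thresholding function, which shrinks each coefficient toward zero by `thresh` and sets small coefficients exactly to zero. The sketch below shows that standard operator; that `_prox` is implemented exactly this way is an assumption, and the function name is hypothetical.

```python
import numpy as np

def soft_threshold(beta, thresh):
    """Proximal operator of thresh * ||beta||_1 (soft thresholding)."""
    return np.sign(beta) * np.maximum(np.abs(beta) - thresh, 0.0)

beta = np.array([-0.5, -0.05, 0.0, 0.08, 1.2])
shrunk = soft_threshold(beta, 0.1)
print(shrunk)
```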

fit(X: numpy.ndarray, y: numpy.ndarray)[source]#

Fit the model to the given data.

Parameters
X

2D array of shape [n_samples, n_features]; input data

y

1D array of shape [n_samples,]; target data

Returns

Fitted instance of class GLM

Return type

self
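The parameters documented for this class (learning_rate, max_iter, tol, alpha, reg_lambda) suggest a proximal gradient descent loop. The following is a self-contained sketch of such a loop for the Gaussian case only, under stated assumptions: coefficients start at zero rather than from a seeded random draw, Tau is the identity, and the stopping rule is the relative change in parameter norm. It is not the library's `fit` implementation, and all names are hypothetical.

```python
import numpy as np

def soft_threshold(b, t):
    return np.sign(b) * np.maximum(np.abs(b) - t, 0.0)

def fit_elastic_net(X, y, alpha=0.5, reg_lambda=0.1, learning_rate=0.2,
                    max_iter=1000, tol=1e-6):
    n, p = X.shape
    b0, b = 0.0, np.zeros(p)
    for n_iter in range(1, max_iter + 1):
        r = y - (b0 + X @ b)
        # Gradient of the smooth part: squared error plus the L2 share of the penalty
        g0 = -r.mean()
        g = -(X.T @ r) / n + reg_lambda * (1.0 - alpha) * b
        b0_new = b0 - learning_rate * g0
        # Gradient step followed by the proximal step for the L1 share
        b_new = soft_threshold(b - learning_rate * g, learning_rate * reg_lambda * alpha)
        old = np.concatenate(([b0], b))
        new = np.concatenate(([b0_new], b_new))
        b0, b = b0_new, b_new
        # Stop when the relative change in the parameter vector falls below tol
        if np.linalg.norm(new - old) <= tol * max(np.linalg.norm(new), 1e-12):
            break
    return b0, b, n_iter

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 1.0 + X @ np.array([2.0, 0.0, -1.0, 0.0]) + 0.1 * rng.normal(size=200)
b0, b, n_iter = fit_elastic_net(X, y, alpha=0.9, reg_lambda=0.05)
print(b0, b, n_iter)
```

In this toy setup the two informative coefficients are recovered with a small amount of shrinkage, while the uninformative ones are pulled toward zero by the L1 step.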

predict(X: numpy.ndarray) numpy.ndarray[source]#

Given predictor values, reconstruct expression of dependent/target variables.

Parameters
X

Array of shape [n_samples, n_features]; input data for prediction

Returns

Predicted targets of shape [n_samples,]

Return type

yhat

fit_predict(X: numpy.ndarray, y: numpy.ndarray)[source]#

Fit the model and predict on the same data.

Parameters
X

array of shape [n_samples, n_features]; input data to fit and predict

y

array of shape [n_samples,]; target values for regression

Returns

Predicted targets of shape [n_samples,]

Return type

yhat

score(X: numpy.ndarray, y: numpy.ndarray)[source]#

Score model by computing either the deviance or R^2 for predicted values.

Parameters
X

array of shape [n_samples, n_features]; input data to fit and predict

y

array of shape [n_samples,]; target values for regression

Returns

Value of chosen metric (any positive number for deviance, between 0 and 1 for R^2)

Return type

score

class spateo.tools.ST_regression.generalized_lm.GLMCV(distr: Literal['gaussian', 'poisson', 'softplus', 'neg-binomial', 'gamma'] = 'poisson', alpha: float = 0.5, Tau: Union[None, numpy.ndarray] = None, reg_lambda: Union[None, List[float]] = None, n_lambdas: int = 25, cv: int = 5, learning_rate: float = 0.2, max_iter: int = 1000, tol: float = 1e-06, eta: float = 2.0, clip_coeffs: float = 0.01, score_metric: Literal['deviance', 'pseudo_r2'] = 'deviance', fit_intercept: bool = True, random_seed: int = 888, theta: float = 1.0)[source]#

Bases: sklearn.base.BaseEstimator

For estimating regularized generalized linear models (GLM) along a regularization path with warm restarts.

Parameters
distr

Distribution family- can be “gaussian”, “poisson”, “softplus”, “neg-binomial”, or “gamma”. Case sensitive.

alpha

The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function

Tau

optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.

reg_lambda

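Optional list of values for the regularization parameter \(\lambda\) of the penalty term. If not given, n_lambdas values along the regularization path are used.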

n_lambdas

Number of lambdas along the regularization path. Defaults to 25.

cv

Number of cross-validation repeats

learning_rate

Governs the magnitude of parameter updates for the gradient descent algorithm

max_iter

Maximum number of iterations for the solver

tol

Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.

eta

A threshold parameter that linearizes the exp() function above eta.

clip_coeffs

Absolute value below which to set coefficients to zero.

score_metric

Scoring metric. Options:

  • “deviance”: Uses the difference between the saturated (perfectly predictive) model and the true model.

  • “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.

fit_intercept

Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function

random_seed

Seed of the random number generator used to initialize the solution. Default: 888

theta

Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.

verbose

If True, returns logging information as program runs. Recommended to set to False for any parallelized processes.

beta0_#

The intercept

beta_#

Learned parameters

glm_#

The GLM object with the best score

reg_lambda_opt#

The value of reg_lambda for the best GLM model

n_iter#

Number of iterations

__repr__()[source]#

Return repr(self).

fit(X: numpy.ndarray, y: numpy.ndarray)[source]#

Fit the model to the given data.

Parameters
X

2D array of shape [n_samples, n_features]; input data

y

1D array of shape [n_samples,]; target data

Returns

Fitted instance of class GLMCV

Return type

self
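The idea of a regularization path with warm restarts can be sketched as follows: fit models for a decreasing sequence of lambda values, initializing each fit from the previous solution, and keep the lambda with the best validation score. This simplified example uses ridge-penalized least squares, plain gradient descent, and a single hold-out split instead of k-fold cross-validation; all of these choices and names are assumptions for illustration, not the behavior of GLMCV.

```python
import numpy as np

def ridge_gd(X, y, reg_lambda, beta_init, learning_rate=0.2, max_iter=500, tol=1e-6):
    beta = beta_init.copy()
    for n_iter in range(1, max_iter + 1):
        grad = -(X.T @ (y - X @ beta)) / len(y) + reg_lambda * beta
        step = learning_rate * grad
        beta = beta - step
        if np.linalg.norm(step) <= tol * max(np.linalg.norm(beta), 1e-12):
            break
    return beta, n_iter

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 5))
y = X @ np.array([1.0, -2.0, 0.0, 0.5, 0.0]) + 0.5 * rng.normal(size=300)
X_train, y_train = X[:200], y[:200]
X_val, y_val = X[200:], y[200:]

lambdas = np.logspace(0, -3, 10)      # strongest penalty first
beta = np.zeros(X.shape[1])
scores = []
for lam in lambdas:
    # Warm restart: start from the solution found for the previous lambda
    beta, _ = ridge_gd(X_train, y_train, lam, beta_init=beta)
    scores.append(np.mean((y_val - X_val @ beta) ** 2))

best_lambda = lambdas[int(np.argmin(scores))]
print(best_lambda, min(scores))
```

Starting each fit from the previous solution typically reduces the number of iterations needed, because neighboring lambda values have similar optima.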

predict(X: numpy.ndarray) numpy.ndarray[source]#

Using the best scoring model, predict target values.

Parameters
X

Array of shape [n_samples, n_features]; input data for prediction

Returns

Predicted targets based on the model with optimal reg_lambda, of shape [n_samples,]

Return type

yhat

fit_predict(X: numpy.ndarray, y: numpy.ndarray)[source]#

Fit the model and, after finding the best model, predict on the same data using that model.

Parameters
X

array of shape [n_samples, n_features]; input data to fit and predict

y

array of shape [n_samples,]; target values for regression

Returns

Predicted targets based on the model with optimal reg_lambda, of shape [n_samples,]

Return type

yhat

score(X: numpy.ndarray, y: numpy.ndarray)[source]#

Score model by computing either the deviance or R^2 for predicted values.

Parameters
X

array of shape [n_samples, n_features]; input data to fit and predict

y

array of shape [n_samples,]; target values for regression

Returns

Value of chosen metric (any positive number for deviance, between 0 and 1 for R^2) for the optimal reg_lambda

Return type

score

spateo.tools.ST_regression.generalized_lm.fit_glm(X: Union[numpy.ndarray, pandas.DataFrame], adata: anndata.AnnData, y_feat, calc_first_moment: bool = True, log_transform: bool = True, gs_params: Union[None, dict] = None, n_gs_cv: Union[None, int] = None, return_model: bool = True, **kwargs) Tuple[numpy.ndarray, numpy.ndarray, float, numpy.ndarray, Union[None, GLMCV]][source]#

Wrapper for fitting a generalized elastic net linear model to large biological data, with automated finding of optimum lambda regularization parameter and optional further grid search for parameter optimization.

Parameters
X

Array or DataFrame containing data for fitting- all columns in this array will be used as independent variables

adata

AnnData object from which dependent variable gene expression values will be taken

y_feat

Name of the feature in ‘adata’ corresponding to the dependent variable

log_transform

If True, will log transform expression. Defaults to True.

calc_first_moment

If True, will alleviate dropout effects by computing the first moment of each gene across cells, consistent with the method used by the original RNA velocity method (La Manno et al., 2018). Defaults to True.

gs_params

Optional dictionary where keys are variable names for either the classifier or the regressor and values are lists of potential values for which to find the best combination using grid search. Classifier parameters should be given in the following form: ‘classifier__{parameter name}’.

n_gs_cv

Number of folds for cross-validation, will only be used if gs_params is not None. If None, will default to a 5-fold cross-validation.

return_model

If True, returns fitted model. Defaults to True.

kwargs

Additional named arguments that will be provided to GLMCV. Valid options are:

  • distr: Distribution family- can be “gaussian”, “poisson”, “neg-binomial”, or “gamma”. Case sensitive.

  • alpha: The weighting between L1 penalty (alpha=1.) and L2 penalty (alpha=0.) term of the loss function

  • Tau: optional array of shape [n_features, n_features]; the Tikhonov matrix for ridge regression. If not provided, Tau will default to the identity matrix.

  • reg_lambda: Regularization parameter \(\lambda\) of penalty term

  • n_lambdas: Number of lambdas along the regularization path. Only used if ‘reg_lambda’ is not given.

  • cv: Number of cross-validation repeats

  • learning_rate: Governs the magnitude of parameter updates for the gradient descent algorithm

  • max_iter: Maximum number of iterations for the solver

  • tol: Convergence threshold or stopping criteria. Optimization loop will stop when relative change in parameter norm is below the threshold.

  • eta: A threshold parameter that linearizes the exp() function above eta.

  • score_metric: Scoring metric. Options:
    • “deviance”: Uses the difference between the saturated (perfectly predictive) model and the true model.

    • “pseudo_r2”: Uses the coefficient of determination between the true and predicted values.

  • fit_intercept: Specifies if a constant (a.k.a. bias or intercept) should be added to the decision function

  • random_seed: Seed of the random number generator used to initialize the solution. Default: 888

  • theta: Shape parameter of the negative binomial distribution (number of successes before the first failure). It is used only if ‘distr’ is equal to “neg-binomial”, otherwise it is ignored.

Returns

  • Beta: Array of shape [n_parameters, 1]; contains the weight for each parameter.

  • rex: Array of shape [n_samples, 1]; reconstructed independent variable values.

  • reg: Instance of the regression model. Returned only if ‘return_model’ is True.

spateo.tools.ST_regression.generalized_lm.calc_1nd_moment(X, W, normalize_W=True)[source]#
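This function is listed without a description. Based on its signature and on the `calc_first_moment` option of `fit_glm`, a plausible reading is that it smooths expression by averaging over neighbors given a weight matrix `W`. The sketch below illustrates that reading only; the row normalization, the handling of empty rows, and the function name `first_moment` are assumptions rather than the library's confirmed behavior.

```python
import numpy as np

def first_moment(X, W, normalize_W=True):
    """Weighted neighborhood average of X (cells x genes) using weights W (cells x cells)."""
    W = np.asarray(W, dtype=float)
    if normalize_W:
        row_sums = W.sum(axis=1, keepdims=True)
        # Leave all-zero rows unchanged instead of dividing by zero
        W = W / np.where(row_sums == 0, 1.0, row_sums)
    return W @ X

X = np.array([[0.0, 4.0],
              [2.0, 0.0],
              [4.0, 2.0]])            # 3 cells, 2 genes
W = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])       # each cell is connected to itself and its neighbors
X_smooth = first_moment(X, W)
print(X_smooth)
```

Averaging over neighbors in this way replaces isolated zeros with local means, which is how first-moment smoothing can alleviate dropout effects.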