Documentation

Information Criterion

pypunisher.metrics.criterion.aic(model, X_train, y_train)[source]

Compute the Akaike Information Criterion (AIC).

AIC’s objective is to prevent model overfitting by adding a penalty that grows with model complexity. Its formal definition is:

\[-2\ln(L) + 2k\]

where \(L\) is the maximized value of the likelihood function and \(k\) is the number of parameters. A smaller AIC value suggests that the model is a better fit for the data, relative to competing models.

Parameters:
  • model (fitted sklearn model object) – A fitted sklearn model.
  • X_train (2d ndarray) – The data used to train model.
  • y_train (1d ndarray) – The response variable used to train model.
Returns:

aic (float)

The AIC value if the sample size is sufficient. If n/k < 40, where n is the number of observations and k is the number of features, the small-sample-corrected AICc is returned instead.
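
For illustration, a minimal usage sketch; the LinearRegression model and the random data below are assumptions made for the example, not part of pypunisher:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from pypunisher.metrics.criterion import aic

    # Illustrative data: 100 observations, 3 features (assumed for the example).
    rng = np.random.RandomState(0)
    X_train = rng.rand(100, 3)
    y_train = X_train @ np.array([1.5, -2.0, 0.5]) + rng.rand(100)

    model = LinearRegression().fit(X_train, y_train)
    print(aic(model, X_train, y_train))  # smaller is better, relative to competing models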


pypunisher.metrics.criterion.bic(model, X_train, y_train)[source]

Compute the Bayesian Information Criterion (BIC).

BIC’s objective is to prevent model overfitting by adding a penalty that grows with model complexity. Its formal definition is:

\[-2\ln(L) + k\ln(n)\]

where \(L\) is the maximized value of the likelihood function, \(k\) is the number of parameters, and \(n\) is the number of observations. A smaller BIC value suggests that the model is a better fit for the data, relative to competing models.

Parameters:
  • model (fitted sklearn model object) – A fitted sklearn model.
  • X_train (2d ndarray) – The data used to train model.
  • y_train (1d ndarray) – The response variable used to train model.
Returns:

bic (float)

Bayesian Information Criterion value.
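
Continuing the AIC sketch above, BIC is computed the same way; note that its penalty, \(k\ln(n)\), exceeds AIC’s \(2k\) once \(n \geq 8\):

    from pypunisher.metrics.criterion import bic

    # Same fitted model and training data as in the AIC example above.
    print(bic(model, X_train, y_train))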


Forward and Backward Selection Algorithms

class pypunisher.selection_engines.selection.Selection(model, X_train, y_train, X_val, y_val, criterion=None, verbose=True)[source]

Forward and Backward Selection Algorithms.

Parameters:
  • model (sklearn model) – Any sklearn model with .fit(), .predict(), and .score() methods.
  • X_train (2d ndarray) – A 2D numpy array of shape (observations, features).
  • y_train (1d ndarray) – A 1D array of target classes for X_train.
  • X_val (2d ndarray) – A 2D numpy array of shape (observations, features).
  • y_val (1d ndarray) – A 1D array of target classes for X_val.
  • criterion (str or None) –

    Model selection criterion.

    • ‘aic’: use the Akaike Information Criterion.
    • ‘bic’: use the Bayesian Information Criterion.
    • None: use the model’s default (i.e., call .score()).
  • verbose (bool) – If True, print additional information as selection occurs. Defaults to True.
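
A minimal construction sketch; the LinearRegression model, the synthetic data, and the train/validation split below are assumptions made for the example:

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from pypunisher.selection_engines.selection import Selection

    rng = np.random.RandomState(1)
    X = rng.rand(200, 10)                      # 200 observations, 10 candidate features
    y = X[:, :3] @ np.array([2.0, -1.0, 0.5])  # only the first 3 features are informative
    X_train, X_val = X[:150], X[150:]
    y_train, y_val = y[:150], y[150:]

    selector = Selection(LinearRegression(), X_train, y_train, X_val, y_val,
                         criterion='aic', verbose=False)
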
forward(n_features=0.5, min_change=None, **kwargs)[source]

Perform Forward Selection on a Sklearn model.

Parameters:
  • n_features (int or float) – The maximum number of features to select. Floats will be regarded as proportions of the total number of features and must lie on (0, 1). Note: min_change must be None for n_features to operate.
  • min_change (int or float) – The smallest change to be considered significant. Note: n_features must be None in order for min_change to operate.
  • kwargs (Keyword Args) –
    • _do_not_skip (bool):
      Explore loop exhaustion. For internal use only.
Returns:

S (list)

The column indices of X_train (and X_val) that denote the chosen features.

Raises:

Raised if n_features and min_change are both non-None.
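
Continuing the Selection sketch above, a usage sketch for forward(); the choice of n_features=3 is illustrative:

    # Grow the feature set one feature at a time until 3 features are selected.
    S = selector.forward(n_features=3, min_change=None)
    print(S)  # column indices of the chosen features in X_train / X_val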

backward(n_features=0.5, min_change=None, **kwargs)[source]

Perform Backward Selection on a Sklearn model.

Parameters:
  • n_features (int or float) – The number of features to select. Floats will be regarded as proportions of the total that must lie on (0, 1). min_change must be None for n_features to operate.
  • min_change (int or float) – The smallest change to be considered significant. n_features must be None for min_change to operate.
  • kwargs (Keyword Args) –
    • _do_not_skip (bool):
      Explore loop exhaustion. For internal use only.
    • _last_score_punt (bool):
      Relax the defeated_last_iter_score decision boundary. For internal use only.
Returns:

S (list)

The column indices of X_train (and X_val) that denote the chosen features.

Raises:

Raised if n_features and min_change are both non-None.
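
And a usage sketch for backward(), again continuing the same Selection example; the thresholds shown are illustrative:

    # Prune features until only 3 remain.
    S = selector.backward(n_features=3, min_change=None)

    # Alternatively, prune until no removal changes the score by at least 0.01.
    S = selector.backward(n_features=None, min_change=0.01)
    print(S)  # column indices of the retained features in X_train / X_val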