heamy package

Submodules

heamy.cache module

class heamy.cache.Cache(hashval, prefix='', cache_dir='.cache/heamy/')[source]

Bases: object

available
retrieve(key)[source]

Retrieves a cached array if possible.

store(key, data)[source]

Takes an array and stores it in the cache.
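
Examples

A minimal usage sketch (assuming a writable cache directory; np_hash below supplies the hash value):

>>> import numpy as np
>>> from heamy.cache import Cache, np_hash
>>> arr = np.arange(10, dtype=np.float64)
>>> cache = Cache(np_hash(arr), prefix='example.')
>>> cache.store('preds', arr)           # persist the array under this key
>>> restored = cache.retrieve('preds')  # load it back from the cache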

heamy.cache.np_hash(x)[source]
heamy.cache.numpy_buffer(ndarray)[source]

Creates a buffer from a C-contiguous numpy ndarray.

heamy.dataset module

class heamy.dataset.Dataset(X_train=None, y_train=None, X_test=None, y_test=None, preprocessor=None, use_cache=True)[source]

Bases: object

Dataset wrapper.

Parameters:

X_train : pd.DataFrame or np.ndarray, optional

y_train : pd.DataFrame, pd.Series or np.ndarray, optional

X_test : pd.DataFrame or np.ndarray, optional

y_test : pd.DataFrame, pd.Series or np.ndarray, optional

preprocessor : function, optional

A callable function that returns preprocessed data.

use_cache : bool, default True

If use_cache=True, the preprocessing step will be cached until the function's code changes.

Examples

>>> # function-based definition
>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split
>>> def boston_dataset():
>>>     data = load_boston()
>>>     X, y = data['data'], data['target']
>>>     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
>>>     return X_train, y_train, X_test, y_test
>>> dataset = Dataset(preprocessor=boston_dataset)
>>> # class-based definition
>>> class BostonDataset(Dataset):
>>>     def preprocess(self):
>>>         data = load_boston()
>>>         X, y = data['data'], data['target']
>>>         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
>>>         return X_train, y_train, X_test, y_test
X_test
X_train
hash

Return the md5 hash of the current dataset.

kfold(k=5, stratify=False, shuffle=True, seed=33)[source]

K-Fold cross-validation iterator.

Parameters:

k : int, default 5

stratify : bool, default False

shuffle : bool, default True

seed : int, default 33

Yields:

X_train, y_train, X_test, y_test, train_index, test_index
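
Examples

A sketch of iterating over the folds, assuming dataset is the Dataset defined above:

>>> for X_tr, y_tr, X_te, y_te, train_index, test_index in dataset.kfold(k=5, seed=111):
>>>     print(X_tr.shape, X_te.shape)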

load()[source]
loaded
merge(ds, inplace=False, axis=1)[source]

Merge two datasets.

Parameters:

axis : {0,1}

ds : Dataset

inplace : bool, default False

Returns:

Dataset
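
Examples

A short sketch; other_ds stands for a hypothetical second Dataset with a compatible shape along the chosen axis:

>>> combined = dataset.merge(other_ds, axis=1)  # returns a new Dataset unless inplace=True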

name
split(test_size=0.1, stratify=False, inplace=False, seed=33, indices=None)[source]

Splits the train set into two parts (train/test).

Parameters:

test_size : float, default 0.1

stratify : bool, default False

inplace : bool, default False

If True, the dataset's train/test sets are replaced with the new split.

seed : int, default 33

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing.

Returns:

X_train : np.ndarray

y_train : np.ndarray

X_test : np.ndarray

y_test : np.ndarray

Examples

>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = dataset.split(indices=(train_index, test_index))
>>> res = dataset.split(test_size=0.3, seed=1111)
to_csc()[source]

Convert Dataset to scipy’s Compressed Sparse Column matrix.

to_csr()[source]

Convert Dataset to scipy’s Compressed Sparse Row matrix.

to_dense()[source]

Convert sparse Dataset to dense matrix.
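
Examples

A brief sketch of the conversion helpers (sparse formats suit wide, mostly-zero feature matrices):

>>> dataset.to_csr()    # convert features to Compressed Sparse Row format
>>> dataset.to_dense()  # convert a sparse dataset back to a dense matrix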

y_test
y_train

heamy.estimator module

class heamy.estimator.BaseEstimator(dataset, estimator=None, parameters=None, name=None, use_cache=True)[source]

Bases: object

blend(proportion=0.2, stratify=False, seed=100, indices=None)[source]

Blend a single model. You should rarely need this method; use ModelsPipeline.blend instead.

Parameters:

proportion : float, default 0.2

Size of the test holdout.

stratify : bool, default False

seed : int, default 100

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing: (train_index, test_index).

Returns:

Dataset

estimator_name
hash
name
predict()[source]
problem = None
stack(k=5, stratify=False, shuffle=True, seed=100, full_test=True)[source]

Stack a single model. You should rarely need this method; use ModelsPipeline.stack instead.

Parameters:

k : int, default 5

stratify : bool, default False

shuffle : bool, default True

seed : int, default 100

full_test : bool, default True

If True, predictions for the test set come from a model fitted on the full training data; otherwise they are the mean of the per-fold predictions.

Returns:

Dataset with out-of-fold predictions.

validate(scorer=None, k=1, test_size=0.1, stratify=False, shuffle=True, seed=100, indices=None)[source]

Evaluate score by cross-validation.

Parameters:

scorer : function(y_true,y_pred), default None

Scikit-learn-like metric that returns a score.

k : int, default 1

The number of folds for validation.

If k=1, X_train is randomly split into two parts; otherwise a k-fold approach is used.

test_size : float, default 0.1

Size of the test holdout if k=1.

stratify : bool, default False

shuffle : bool, default True

seed : int, default 100

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing: (train_index, test_index).

Returns:

y_true: list

Actual labels.

y_pred: list

Predicted labels.

Examples

>>> # Custom indices
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = model_rf.validate(mean_absolute_error, indices=(train_index, test_index))
class heamy.estimator.Classifier(dataset, estimator=None, parameters=None, name=None, use_cache=True, probability=True)[source]

Bases: heamy.estimator.BaseEstimator

Wrapper for classification problems.

Parameters:

dataset : Dataset object

estimator : a callable with a scikit-learn-like interface, or a custom function/class, optional

parameters : dict, optional

Arguments for estimator object.

name : str, optional

The unique name of Estimator object.

use_cache : bool, optional

If True, validate/predict/stack/blend results will be cached.

problem = 'classification'
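
Examples

A minimal sketch; LogisticRegression and its parameters are illustrative assumptions, not part of this API:

>>> from sklearn.linear_model import LogisticRegression
>>> model = Classifier(dataset=dataset, estimator=LogisticRegression, parameters={'C': 1.0}, name='lr')
>>> predictions = model.predict()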
class heamy.estimator.Regressor(dataset, estimator=None, parameters=None, name=None, use_cache=True)[source]

Bases: heamy.estimator.BaseEstimator

Wrapper for regression problems.

Parameters:

dataset : Dataset object

estimator : a callable with a scikit-learn-like interface, or a custom function/class, optional

parameters : dict, optional

Arguments for estimator object.

name : str, optional

The unique name of Estimator object.

use_cache : bool, optional

If True, validate/predict/stack/blend results will be cached.

problem = 'regression'
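
Examples

A minimal sketch that also sets up the model_rf and model_lr objects referenced by the pipeline examples below; the estimator choices and parameters are illustrative assumptions:

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.linear_model import LinearRegression
>>> model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50}, name='rf')
>>> model_lr = Regressor(dataset=dataset, estimator=LinearRegression, name='lr')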

heamy.feature module

heamy.pipeline module

class heamy.pipeline.ModelsPipeline(*args)[source]

Bases: object

Combines a sequence of models.

add(model)[source]

Adds a single model.

Parameters:

model : Estimator
apply(func)[source]

Applies a function to the models' output.

Parameters:

func : function

Arbitrary function with one argument.

Returns:

PipeApply

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.apply(lambda x: np.max(x, axis=0)).execute()
blend(proportion=0.2, stratify=False, seed=100, indices=None, add_diff=False)[source]

Blends a sequence of models.

Parameters:

proportion : float, default 0.2

stratify : bool, default False

seed : int, default 100

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing.

add_diff : bool, default False

Returns:

DataFrame

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.blend(seed=15)
>>> # Custom indices
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = pipeline.blend(indices=(train_index, test_index))
find_weights(scorer, test_size=0.2, method='SLSQP')[source]

Finds optimal weights for a weighted average of the models' predictions.

Parameters:

scorer : function

Scikit-learn-like metric.

test_size : float, default 0.2

method : str

Type of solver. Should be one of:

  • ‘Nelder-Mead’
  • ‘Powell’
  • ‘CG’
  • ‘BFGS’
  • ‘Newton-CG’
  • ‘L-BFGS-B’
  • ‘TNC’
  • ‘COBYLA’
  • ‘SLSQP’
  • ‘dogleg’
  • ‘trust-ncg’
Returns:

list
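
Examples

A sketch pairing find_weights with weight; mean_absolute_error is an assumed metric for a regression setup:

>>> from sklearn.metrics import mean_absolute_error
>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> weights = pipeline.find_weights(mean_absolute_error)
>>> result = pipeline.weight(weights)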

gmean()[source]

Returns the geometric mean of the models' predictions.

Returns:

PipeApply
max()[source]

Returns the max of the models' predictions.

Returns:

PipeApply
mean()[source]

Returns the mean of the models' predictions.

Returns:

PipeApply

Examples

>>> # Execute
>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.mean().execute()
>>> # Validate
>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.mean().validate()
min()[source]

Returns the min of the models' predictions.

Returns:

PipeApply
stack(k=5, stratify=False, shuffle=True, seed=100, full_test=True, add_diff=False)[source]

Stacks a sequence of models.

Parameters:

k : int, default 5

Number of folds.

stratify : bool, default False

shuffle : bool, default True

seed : int, default 100

full_test : bool, default True

If True, predictions for the test set come from a model fitted on the full training data; otherwise they are the mean of the per-fold predictions.

add_diff : bool, default False

Returns:

DataFrame

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> stack_ds = pipeline.stack(k=10, seed=111)
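
The stacked Dataset is typically used to train a second-stage model; a sketch, with LinearRegression as an assumed meta-estimator:

>>> from heamy.estimator import Regressor
>>> from sklearn.linear_model import LinearRegression
>>> stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
>>> predictions = stacker.predict()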
weight(weights)[source]

Applies a weighted mean to the models' predictions.

Parameters:

weights : list

Returns:

np.ndarray

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.weight([0.8, 0.2])
class heamy.pipeline.PipeApply(function, models)[source]

Bases: object

execute()[source]
validate(scorer=None, k=1, test_size=0.1, stratify=False, shuffle=True, seed=100, indices=None)[source]
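
Examples

PipeApply objects come from ModelsPipeline methods such as mean() and apply(); a sketch of executing and validating a combined prediction, with mean_absolute_error as an assumed scorer:

>>> from sklearn.metrics import mean_absolute_error
>>> combined = ModelsPipeline(model_rf, model_lr).mean()
>>> combined.execute()                                  # combined predictions for the test set
>>> combined.validate(scorer=mean_absolute_error, k=5)  # cross-validated score of the combination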

Module contents