heamy package

Submodules

heamy.cache module

class heamy.cache.Cache(hashval, prefix='', cache_dir='.cache/heamy/')[source]

Bases: object

available
retrieve(key)[source]

Retrieves a cached array if possible.

store(key, data)[source]

Takes an array and stores it in the cache.
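
Examples

A minimal usage sketch (assuming a writable cache directory; np_hash below supplies the hash value):

>>> import numpy as np
>>> from heamy.cache import Cache, np_hash
>>> arr = np.arange(10, dtype=np.float64)
>>> cache = Cache(np_hash(arr), prefix='example.')
>>> cache.store('preds', arr)           # persist the array under this key
>>> restored = cache.retrieve('preds')  # load it back from the cache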

heamy.cache.np_hash(x)[source]
heamy.cache.numpy_buffer(ndarray)[source]

Creates a buffer from a C-contiguous numpy ndarray.

heamy.dataset module

class heamy.dataset.Dataset(X_train=None, y_train=None, X_test=None, y_test=None, preprocessor=None, use_cache=True)[source]

Bases: object

Dataset wrapper.

Parameters:

X_train : pd.DataFrame or np.ndarray, optional

y_train : pd.DataFrame, pd.Series or np.ndarray, optional

X_test : pd.DataFrame or np.ndarray, optional

y_test : pd.DataFrame, pd.Series or np.ndarray, optional

preprocessor : function, optional

A callable function that returns preprocessed data.

use_cache : bool, default True

If use_cache=True, the preprocessing step will be cached until the function's code changes.

Examples

>>> # function-based definition
>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split
>>> def boston_dataset():
>>>     data = load_boston()
>>>     X, y = data['data'], data['target']
>>>     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
>>>     return X_train, y_train, X_test, y_test
>>> dataset = Dataset(preprocessor=boston_dataset)
>>> # class-based definition
>>> class BostonDataset(Dataset):
>>>     def preprocess(self):
>>>         data = load_boston()
>>>         X, y = data['data'], data['target']
>>>         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
>>>         return X_train, y_train, X_test, y_test
X_test
X_train
hash

Return the md5 hash of the current dataset.

kfold(k=5, stratify=False, shuffle=True, seed=33)[source]

K-Fold cross-validation iterator.

Parameters:

k : int, default 5

stratify : bool, default False

shuffle : bool, default True

seed : int, default 33

Yields:

X_train, y_train, X_test, y_test, train_index, test_index
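
Examples

A sketch of iterating over the folds, assuming dataset is the Dataset defined above:

>>> for X_tr, y_tr, X_te, y_te, train_index, test_index in dataset.kfold(k=5, seed=111):
>>>     print(X_tr.shape, X_te.shape)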

load()[source]
loaded
merge(ds, inplace=False, axis=1)[source]

Merge two datasets.

Parameters:

axis : {0,1}

ds : Dataset

inplace : bool, default False

Returns:

Dataset
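
Examples

A short sketch; other_ds stands for a hypothetical second Dataset with a compatible shape along the chosen axis:

>>> combined = dataset.merge(other_ds, axis=1)  # returns a new Dataset unless inplace=True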

name
split(test_size=0.1, stratify=False, inplace=False, seed=33, indices=None)[source]

Splits the train set into two parts (train/test).

Parameters:

test_size : float, default 0.1

stratify : bool, default False

inplace : bool, default False

If True, the dataset's train/test sets are replaced with the new split.

seed : int, default 33

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing.

Returns:

X_train : np.ndarray

y_train : np.ndarray

X_test : np.ndarray

y_test : np.ndarray

Examples

>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = dataset.split(indices=(train_index, test_index))
>>> res = dataset.split(test_size=0.3, seed=1111)
to_csc()[source]

Convert Dataset to scipy’s Compressed Sparse Column matrix.

to_csr()[source]

Convert Dataset to scipy’s Compressed Sparse Row matrix.

to_dense()[source]

Convert sparse Dataset to dense matrix.
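
Examples

A brief sketch of the conversion helpers (sparse formats suit wide, mostly-zero feature matrices):

>>> dataset.to_csr()    # convert features to Compressed Sparse Row format
>>> dataset.to_dense()  # convert a sparse dataset back to a dense matrix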

y_test
y_train

heamy.estimator module

class heamy.estimator.BaseEstimator(dataset, estimator=None, parameters=None, name=None, use_cache=True)[source]

Bases: object

blend(proportion=0.2, stratify=False, seed=100, indices=None)[source]

Blend a single model. You should rarely need this method; use ModelsPipeline.blend instead.

Parameters:

proportion : float, default 0.2

Size of the test holdout.

stratify : bool, default False

seed : int, default 100

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing: (train_index, test_index).

Returns:

Dataset

estimator_name
hash
name
predict()[source]
problem = None
stack(k=5, stratify=False, shuffle=True, seed=100, full_test=True)[source]

Stack a single model. You should rarely need this method; use ModelsPipeline.stack instead.

Parameters:

k : int, default 5

stratify : bool, default False

shuffle : bool, default True

seed : int, default 100

full_test : bool, default True

If True, predictions for the test set come from a model fitted on the full training data; otherwise they are the mean of the per-fold predictions.

Returns:

Dataset with out-of-fold predictions.

validate(scorer=None, k=1, test_size=0.1, stratify=False, shuffle=True, seed=100, indices=None)[source]

Evaluate score by cross-validation.

Parameters:

scorer : function(y_true,y_pred), default None

Scikit-learn-like metric that returns a score.

k : int, default 1

The number of folds for validation.

If k=1, X_train is randomly split into two parts; otherwise a k-fold approach is used.

test_size : float, default 0.1

Size of the test holdout if k=1.

stratify : bool, default False

shuffle : bool, default True

seed : int, default 100

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing: (train_index, test_index).

Returns:

y_true: list

Actual labels.

y_pred: list

Predicted labels.

Examples

>>> # Custom indices
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = model_rf.validate(mean_absolute_error, indices=(train_index, test_index))
class heamy.estimator.Classifier(dataset, estimator=None, parameters=None, name=None, use_cache=True, probability=True)[source]

Bases: heamy.estimator.BaseEstimator

Wrapper for classification problems.

Parameters:

dataset : Dataset object

estimator : a callable with a scikit-learn-like interface, or a custom function/class, optional

parameters : dict, optional

Arguments for estimator object.

name : str, optional

The unique name of Estimator object.

use_cache : bool, optional

If True, validate/predict/stack/blend results will be cached.

problem = 'classification'
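
Examples

A minimal sketch; LogisticRegression and its parameters are illustrative assumptions, not part of this API:

>>> from sklearn.linear_model import LogisticRegression
>>> model = Classifier(dataset=dataset, estimator=LogisticRegression, parameters={'C': 1.0}, name='lr')
>>> predictions = model.predict()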
class heamy.estimator.Regressor(dataset, estimator=None, parameters=None, name=None, use_cache=True)[source]

Bases: heamy.estimator.BaseEstimator

Wrapper for regression problems.

Parameters:

dataset : Dataset object

estimator : a callable with a scikit-learn-like interface, or a custom function/class, optional

parameters : dict, optional

Arguments for estimator object.

name : str, optional

The unique name of Estimator object.

use_cache : bool, optional

If True, validate/predict/stack/blend results will be cached.

problem = 'regression'
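
Examples

A minimal sketch that also sets up the model_rf and model_lr objects referenced by the pipeline examples below; the estimator choices and parameters are illustrative assumptions:

>>> from sklearn.ensemble import RandomForestRegressor
>>> from sklearn.linear_model import LinearRegression
>>> model_rf = Regressor(dataset=dataset, estimator=RandomForestRegressor, parameters={'n_estimators': 50}, name='rf')
>>> model_lr = Regressor(dataset=dataset, estimator=LinearRegression, name='lr')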

heamy.feature module

heamy.pipeline module

class heamy.pipeline.ModelsPipeline(*args)[source]

Bases: object

Combines a sequence of models.

add(model)[source]

Adds a single model.

Parameters:

model : Estimator
apply(func)[source]

Applies a function to the models' output.

Parameters:

func : function

Arbitrary function with one argument.

Returns:

PipeApply

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.apply(lambda x: np.max(x, axis=0)).execute()
blend(proportion=0.2, stratify=False, seed=100, indices=None, add_diff=False)[source]

Blends a sequence of models.

Parameters:

proportion : float, default 0.2

stratify : bool, default False

seed : int, default 100

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing.

add_diff : bool, default False

Returns:

DataFrame

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.blend(seed=15)
>>> # Custom indices
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = pipeline.blend(indices=(train_index, test_index))
find_weights(scorer, test_size=0.2, method='SLSQP')[source]

Finds optimal weights for a weighted average of the models' predictions.

Parameters:

scorer : function

Scikit-learn-like metric.

test_size : float, default 0.2

method : str

Type of solver. Should be one of:

  • ‘Nelder-Mead’
  • ‘Powell’
  • ‘CG’
  • ‘BFGS’
  • ‘Newton-CG’
  • ‘L-BFGS-B’
  • ‘TNC’
  • ‘COBYLA’
  • ‘SLSQP’
  • ‘dogleg’
  • ‘trust-ncg’
Returns:

list
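
Examples

A sketch pairing find_weights with weight; mean_absolute_error is an assumed metric for a regression setup:

>>> from sklearn.metrics import mean_absolute_error
>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> weights = pipeline.find_weights(mean_absolute_error)
>>> result = pipeline.weight(weights)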

gmean()[source]

Returns the geometric mean of the models' predictions.

Returns:

PipeApply
max()[source]

Returns the max of the models' predictions.

Returns:

PipeApply
mean()[source]

Returns the mean of the models' predictions.

Returns:

PipeApply

Examples

>>> # Execute
>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.mean().execute()
>>> # Validate
>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.mean().validate()
min()[source]

Returns the min of the models' predictions.

Returns:

PipeApply
stack(k=5, stratify=False, shuffle=True, seed=100, full_test=True, add_diff=False)[source]

Stacks a sequence of models.

Parameters:

k : int, default 5

Number of folds.

stratify : bool, default False

shuffle : bool, default True

seed : int, default 100

full_test : bool, default True

If True, predictions for the test set come from a model fitted on the full training data; otherwise they are the mean of the per-fold predictions.

add_diff : bool, default False

Returns:

DataFrame

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> stack_ds = pipeline.stack(k=10, seed=111)
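
The stacked Dataset is typically used to train a second-stage model; a sketch, with LinearRegression as an assumed meta-estimator:

>>> from heamy.estimator import Regressor
>>> from sklearn.linear_model import LinearRegression
>>> stacker = Regressor(dataset=stack_ds, estimator=LinearRegression)
>>> predictions = stacker.predict()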
weight(weights)[source]

Applies a weighted mean to the models' predictions.

Parameters:

weights : list

Returns:

np.ndarray

Examples

>>> pipeline = ModelsPipeline(model_rf, model_lr)
>>> pipeline.weight([0.8, 0.2])
class heamy.pipeline.PipeApply(function, models)[source]

Bases: object

execute()[source]
validate(scorer=None, k=1, test_size=0.1, stratify=False, shuffle=True, seed=100, indices=None)[source]
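
Examples

PipeApply objects come from ModelsPipeline methods such as mean() and apply(); a sketch of executing and validating a combined prediction, with mean_absolute_error as an assumed scorer:

>>> from sklearn.metrics import mean_absolute_error
>>> combined = ModelsPipeline(model_rf, model_lr).mean()
>>> combined.execute()                                  # combined predictions for the test set
>>> combined.validate(scorer=mean_absolute_error, k=5)  # cross-validated score of the combination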

Module contents