heamy.dataset module

class heamy.dataset.Dataset(X_train=None, y_train=None, X_test=None, y_test=None, preprocessor=None, use_cache=True)

Dataset wrapper.

Parameters:

X_train : pd.DataFrame or np.ndarray, optional

y_train : pd.DataFrame, pd.Series or np.ndarray, optional

X_test : pd.DataFrame or np.ndarray, optional

y_test : pd.DataFrame, pd.Series or np.ndarray, optional

preprocessor : callable, optional

A callable that returns the preprocessed data as (X_train, y_train, X_test, y_test).

use_cache : bool, default True

If True, the preprocessing step is cached until the preprocessor's code changes.
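One way such source-based caching can work is to key a cache on a hash of the preprocessor's compiled code, so the stored result is reused until the code itself changes. The sketch below is illustrative only and is not heamy's actual cache implementation:

```python
import hashlib

_cache = {}

def cached_preprocess(preprocessor):
    """Re-run `preprocessor` only when its code changes (illustrative sketch)."""
    # Key the cache on an md5 of the function's compiled bytecode.
    key = hashlib.md5(preprocessor.__code__.co_code).hexdigest()
    if key not in _cache:
        _cache[key] = preprocessor()
    return _cache[key]

def make_data():
    return list(range(5))

first = cached_preprocess(make_data)
second = cached_preprocess(make_data)  # cache hit: the stored result is returned
```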

Examples

>>> # function-based definition
>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split
>>> def boston_dataset():
...     data = load_boston()
...     X, y = data['data'], data['target']
...     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
...     return X_train, y_train, X_test, y_test
>>> dataset = Dataset(preprocessor=boston_dataset)

>>> # class-based definition
>>> class BostonDataset(Dataset):
...     def preprocess(self):
...         data = load_boston()
...         X, y = data['data'], data['target']
...         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
...         return X_train, y_train, X_test, y_test
hash

Return the md5 hash of the current dataset.
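Such a fingerprint can be computed by hashing the byte contents of the dataset's parts. A minimal illustration (not heamy's actual implementation, which operates on numpy arrays):

```python
import hashlib

def dataset_md5(X_train, y_train):
    """Illustrative md5 fingerprint of a dataset's contents."""
    m = hashlib.md5()
    for part in (X_train, y_train):
        # Hash a byte representation of each part; equal data gives equal hashes.
        m.update(repr(part).encode("utf-8"))
    return m.hexdigest()

h1 = dataset_md5([[1, 2], [3, 4]], [0, 1])
h2 = dataset_md5([[1, 2], [3, 4]], [0, 1])
h3 = dataset_md5([[1, 2], [3, 5]], [0, 1])  # one changed value changes the hash
```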

kfold(k=5, stratify=False, shuffle=True, seed=33)

K-fold cross-validation iterator.

Parameters:

k : int, default 5

Number of folds.

stratify : bool, default False

If True, preserve the class distribution of y_train in each fold.

shuffle : bool, default True

If True, shuffle the data before splitting.

seed : int, default 33

Random seed used for shuffling.

Yields:

X_train, y_train, X_test, y_test, train_index, test_index
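The index logic behind such an iterator can be sketched in plain Python (heamy's own implementation may differ; this only illustrates how k disjoint test folds cover the training set):

```python
import random

def kfold_indices(n, k=5, shuffle=True, seed=33):
    """Yield (train_index, test_index) pairs for k-fold CV (illustrative sketch)."""
    indices = list(range(n))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # Distribute n samples over k folds as evenly as possible.
    fold_sizes = [n // k + (1 if i < n % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_index = indices[start:start + size]
        train_index = indices[:start] + indices[start + size:]
        yield train_index, test_index
        start += size

folds = list(kfold_indices(10, k=5))
```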

merge(ds, inplace=False, axis=1)

Merge two datasets.

Parameters:

ds : Dataset

The dataset to merge into this one.

inplace : bool, default False

If True, modify this dataset in place instead of returning a new one.

axis : {0, 1}, default 1

Axis along which to merge: 0 stacks rows, 1 appends columns.

Returns:

Dataset
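The axis semantics follow the usual array-concatenation convention: axis=0 stacks rows, axis=1 appends columns. A plain-Python illustration of the two cases (not heamy's code):

```python
def merge_rows(a, b):
    """axis=0: stack the rows of b below a."""
    return a + b

def merge_cols(a, b):
    """axis=1: append b's columns to each row of a."""
    return [ra + rb for ra, rb in zip(a, b)]

a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
```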

split(test_size=0.1, stratify=False, inplace=False, seed=33, indices=None)

Splits the train set into two parts (train/test).

Parameters:

test_size : float, default 0.1

stratify : bool, default False

inplace : bool, default False

If True, the dataset's train/test sets are replaced with the new split.

seed : int, default 33

indices : list(np.ndarray, np.ndarray), default None

Two numpy arrays that contain indices for train/test slicing.

Returns:

X_train : np.ndarray

y_train : np.ndarray

X_test : np.ndarray

y_test : np.ndarray

Examples

>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250, 333))
>>> res = dataset.split(indices=(train_index, test_index))
>>> res = dataset.split(test_size=0.3, seed=1111)
to_csc()

Convert the Dataset to scipy's Compressed Sparse Column (CSC) matrix.

to_csr()

Convert the Dataset to scipy's Compressed Sparse Row (CSR) matrix.

to_dense()

Convert a sparse Dataset to a dense matrix.
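CSR format stores only the non-zero entries as three arrays (data, column indices, and row pointers); converting to dense expands them back. A small pure-Python illustration of the round trip (SciPy handles this internally):

```python
def to_csr(dense):
    """Compress a dense 2-D list into (data, indices, indptr) CSR arrays."""
    data, indices, indptr = [], [], [0]
    for row in dense:
        for j, value in enumerate(row):
            if value != 0:
                data.append(value)    # non-zero value
                indices.append(j)     # its column index
        indptr.append(len(data))      # where the next row starts in `data`
    return data, indices, indptr

def to_dense(data, indices, indptr, n_cols):
    """Expand CSR arrays back into a dense 2-D list."""
    dense = []
    for r in range(len(indptr) - 1):
        row = [0] * n_cols
        for k in range(indptr[r], indptr[r + 1]):
            row[indices[k]] = data[k]
        dense.append(row)
    return dense

m = [[0, 2, 0], [1, 0, 3]]
csr = to_csr(m)
```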