heamy.dataset module
class heamy.dataset.Dataset(X_train=None, y_train=None, X_test=None, y_test=None, preprocessor=None, use_cache=True)[source]

    Dataset wrapper.
    Parameters:
        X_train : pd.DataFrame or np.ndarray, optional
        y_train : pd.DataFrame, pd.Series or np.ndarray, optional
        X_test : pd.DataFrame or np.ndarray, optional
        y_test : pd.DataFrame, pd.Series or np.ndarray, optional
        preprocessor : function, optional
            A callable that returns the preprocessed data as (X_train, y_train, X_test, y_test).
        use_cache : bool, default True
            If True, the preprocessing step is cached and reused until the preprocessor's code changes.
Examples
    >>> # function-based definition
    >>> from sklearn.datasets import load_boston
    >>> from sklearn.model_selection import train_test_split
    >>> def boston_dataset():
    ...     data = load_boston()
    ...     X, y = data['data'], data['target']
    ...     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
    ...     return X_train, y_train, X_test, y_test
    >>> dataset = Dataset(preprocessor=boston_dataset)
    >>> # class-based definition
    >>> class BostonDataset(Dataset):
    ...     def preprocess(self):
    ...         data = load_boston()
    ...         X, y = data['data'], data['target']
    ...         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
    ...         return X_train, y_train, X_test, y_test
hash

    Return the md5 hash of the current dataset.
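The idea behind the hash can be sketched without heamy: digest the raw bytes of each array into a single md5 hex string, so that identical data always maps to the same cache key. This is only an illustration of the concept, not heamy's actual implementation; `dataset_md5` is a hypothetical helper name.

```python
import hashlib

import numpy as np


def dataset_md5(*arrays):
    """Illustrative sketch: fold the bytes of each non-None array into one md5 digest."""
    md5 = hashlib.md5()
    for arr in arrays:
        if arr is not None:
            # ascontiguousarray guarantees a stable byte layout before hashing
            md5.update(np.ascontiguousarray(arr).tobytes())
    return md5.hexdigest()


X_train = np.arange(12, dtype=np.float64).reshape(4, 3)
y_train = np.array([0.0, 1.0, 0.0, 1.0])
h = dataset_md5(X_train, y_train)
print(h)  # 32-character hex digest, identical across runs for the same data
```

Because the digest depends only on the array contents, it can serve as a stable cache key for the preprocessing step.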
kfold(k=5, stratify=False, shuffle=True, seed=33)[source]

    K-Folds cross-validation iterator.

    Parameters:
        k : int, default 5
        stratify : bool, default False
        shuffle : bool, default True
        seed : int, default 33

    Yields:
        X_train, y_train, X_test, y_test, train_index, test_index
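The yielded six-tuple can be reproduced with scikit-learn's `KFold` directly; the sketch below shows the shape of what each iteration produces, assuming the defaults above (it is not heamy's internal code, and `kfold_iter` is a hypothetical name).

```python
import numpy as np
from sklearn.model_selection import KFold


def kfold_iter(X, y, k=5, shuffle=True, seed=33):
    """Yield (X_train, y_train, X_test, y_test, train_index, test_index) per fold."""
    kf = KFold(n_splits=k, shuffle=shuffle, random_state=seed)
    for train_index, test_index in kf.split(X):
        yield (X[train_index], y[train_index],
               X[test_index], y[test_index],
               train_index, test_index)


X = np.arange(20).reshape(10, 2)
y = np.arange(10)
folds = list(kfold_iter(X, y, k=5))
print(len(folds))  # 5
```

With `stratify=True` heamy switches to stratified folds, which in plain scikit-learn corresponds to `StratifiedKFold(n_splits=k, ...).split(X, y)`.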
merge(ds, inplace=False, axis=1)[source]

    Merge two datasets.

    Parameters:
        ds : Dataset
        axis : {0, 1}
        inplace : bool, default False

    Returns:
        Dataset
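The `axis` semantics follow NumPy's: `axis=1` (the default) appends the other dataset's columns as extra features, while `axis=0` stacks rows. A minimal sketch of those two cases, using plain `np.concatenate` rather than heamy itself:

```python
import numpy as np

# axis=1: same number of rows, feature columns are appended side by side
a = np.ones((4, 2))
b = np.zeros((4, 3))
merged_cols = np.concatenate([a, b], axis=1)
print(merged_cols.shape)  # (4, 5)

# axis=0: same number of columns, rows are stacked on top of each other
merged_rows = np.concatenate([np.ones((2, 3)), np.zeros((4, 3))], axis=0)
print(merged_rows.shape)  # (6, 3)
```

With `inplace=False` the merged result is returned as a new Dataset, leaving both originals untouched.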
split(test_size=0.1, stratify=False, inplace=False, seed=33, indices=None)[source]

    Splits the train set into two parts (train/test).

    Parameters:
        test_size : float, default 0.1
        stratify : bool, default False
        inplace : bool, default False
            If True, the dataset's train/test sets are replaced with the new data.
        seed : int, default 33
        indices : list(np.ndarray, np.ndarray), default None
            Two numpy arrays containing the indices for train/test slicing.

    Returns:
        X_train : np.ndarray
        y_train : np.ndarray
        X_test : np.ndarray
        y_test : np.ndarray
    Examples

    >>> train_index = np.array(range(250))
    >>> test_index = np.array(range(250, 333))
    >>> res = dataset.split(indices=(train_index, test_index))
    >>> res = dataset.split(test_size=0.3, seed=1111)
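The two calling styles above correspond to deterministic index slicing versus a seeded random split. A self-contained sketch of both, using NumPy indexing and scikit-learn's `train_test_split` in place of heamy (the array shapes are made up for illustration):

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(333 * 2).reshape(333, 2)
y = np.arange(333)

# indices=...: slice deterministically by explicit row indices
train_index = np.array(range(250))
test_index = np.array(range(250, 333))
X_train, y_train = X[train_index], y[train_index]
X_test, y_test = X[test_index], y[test_index]
print(X_train.shape, X_test.shape)  # (250, 2) (83, 2)

# test_size=...: random split, reproducible via the seed
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1111)
```

When `indices` is given, `test_size`, `stratify`, and `seed` are irrelevant, since the split is fully determined by the two index arrays.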