heamy package
Subpackages
Submodules
heamy.cache module
heamy.dataset module
class heamy.dataset.Dataset(X_train=None, y_train=None, X_test=None, y_test=None, preprocessor=None, use_cache=True)
Bases: object
Dataset wrapper.
Parameters: X_train : pd.DataFrame or np.ndarray, optional
y_train : pd.DataFrame, pd.Series or np.ndarray, optional
X_test : pd.DataFrame or np.ndarray, optional
y_test : pd.DataFrame, pd.Series or np.ndarray, optional
preprocessor : function, optional
A callable that returns the preprocessed data.
use_cache : bool, default True
If True, the preprocessing step is cached until the function's code changes.
Examples
>>> # function-based definition
>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import train_test_split
>>> def boston_dataset():
...     data = load_boston()
...     X, y = data['data'], data['target']
...     X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
...     return X_train, y_train, X_test, y_test
>>> dataset = Dataset(preprocessor=boston_dataset)
>>> # class-based definition
>>> class BostonDataset(Dataset):
...     def preprocess(self):
...         data = load_boston()
...         X, y = data['data'], data['target']
...         X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.1, random_state=111)
...         return X_train, y_train, X_test, y_test
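One way to picture the use_cache behaviour (the preprocessing result is reused until the function's code changes) is the sketch below. It is not heamy's implementation: the names code_key, cached_preprocess and _CACHE are hypothetical, and keying the cache on the compiled bytecode is an assumption made for illustration.

```python
import hashlib

_CACHE = {}  # hypothetical in-memory store; heamy's real cache backend is not shown here


def code_key(func):
    """Build an md5 key from the function's bytecode and constants (an assumed keying scheme)."""
    code = func.__code__
    payload = code.co_code + repr(code.co_consts).encode("utf-8")
    return hashlib.md5(payload).hexdigest()


def cached_preprocess(func):
    """Run func once per distinct code key and reuse the stored result afterwards."""
    key = code_key(func)
    if key not in _CACHE:
        _CACHE[key] = func()
    return _CACHE[key]


calls = []


def toy_dataset():
    calls.append(1)  # record each real execution
    X_train, y_train = [[1.0], [2.0], [3.0]], [1.0, 2.0, 3.0]
    X_test, y_test = [[4.0]], [4.0]
    return X_train, y_train, X_test, y_test


first = cached_preprocess(toy_dataset)
second = cached_preprocess(toy_dataset)  # served from the cache; toy_dataset is not re-run
```

Editing the body of toy_dataset would change its bytecode or constants, produce a different key, and so trigger a fresh run.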
X_test
X_train
hash
Return md5 hash for current dataset.
kfold(k=5, stratify=False, shuffle=True, seed=33)
K-Folds cross validation iterator.
Parameters: k : int, default 5
stratify : bool, default False
shuffle : bool, default True
seed : int, default 33
Yields: X_train, y_train, X_test, y_test, train_index, test_index
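The index bookkeeping behind a K-fold iterator can be sketched with the standard library alone. This is an illustrative sketch, not heamy's code: kfold_indices is a hypothetical helper, it yields only the index pairs (the real method also yields the sliced data), and stratification is left out.

```python
import random


def kfold_indices(n_samples, k=5, shuffle=True, seed=33):
    """Yield (train_index, test_index) lists for k roughly equal folds."""
    indices = list(range(n_samples))
    if shuffle:
        random.Random(seed).shuffle(indices)
    # The first n_samples % k folds get one extra sample.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        test_index = indices[start:start + size]
        train_index = indices[:start] + indices[start + size:]
        yield train_index, test_index
        start += size


folds = list(kfold_indices(10, k=3, shuffle=True, seed=33))
for train_index, test_index in folds:
    print(len(train_index), len(test_index))
```

Every sample lands in exactly one test fold, and a fixed seed makes the folds reproducible.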
loaded
merge(ds, inplace=False, axis=1)
Merge two datasets.
Parameters: axis : {0,1}
ds : Dataset
inplace : bool, default False
Returns: Dataset
name
split(test_size=0.1, stratify=False, inplace=False, seed=33, indices=None)
Splits train set into two parts (train/test).
Parameters: test_size : float, default 0.1
stratify : bool, default False
inplace : bool, default False
If True, the dataset's train/test sets are replaced with the new split.
seed : int, default 33
indices : list(np.ndarray, np.ndarray), default None
Two numpy arrays that contain indices for train/test slicing.
Returns: X_train : np.ndarray
y_train : np.ndarray
X_test : np.ndarray
y_test : np.ndarray
Examples
>>> import numpy as np
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250,333))
>>> res = dataset.split(indices=(train_index,test_index))
>>> res = dataset.split(test_size=0.3,seed=1111)
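The two calling styles above (explicit indices versus test_size with a seed) can be mimicked with a small stand-alone function. This is a sketch under assumptions: holdout_split is a hypothetical helper working on plain lists, the rounding of the holdout size is a choice made here, and stratify and inplace are not modelled.

```python
import random


def holdout_split(X, y, test_size=0.1, seed=33, indices=None):
    """Return X_train, y_train, X_test, y_test from explicit indices or a seeded random split."""
    if indices is not None:
        train_index, test_index = indices
    else:
        order = list(range(len(X)))
        random.Random(seed).shuffle(order)
        n_test = max(1, int(round(len(X) * test_size)))
        test_index, train_index = order[:n_test], order[n_test:]
    X_train = [X[i] for i in train_index]
    y_train = [y[i] for i in train_index]
    X_test = [X[i] for i in test_index]
    y_test = [y[i] for i in test_index]
    return X_train, y_train, X_test, y_test


X = [[float(i)] for i in range(20)]
y = [2.0 * i for i in range(20)]

# Explicit indices: rows 0-14 for training, rows 15-19 for testing.
res_fixed = holdout_split(X, y, indices=(list(range(15)), list(range(15, 20))))

# Random split: 30% of the rows are held out, reproducibly for a given seed.
res_random = holdout_split(X, y, test_size=0.3, seed=1111)
```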
y_test
y_train
heamy.estimator module
class heamy.estimator.BaseEstimator(dataset, estimator=None, parameters=None, name=None, use_cache=True)
Bases: object
blend(proportion=0.2, stratify=False, seed=100, indices=None)
Blend a single model. This method is rarely needed directly; use ModelsPipeline.blend instead.
Parameters: proportion : float, default 0.2
Size of the test holdout.
stratify : bool, default False
seed : int, default 100
indices : list(np.ndarray,np.ndarray), default None
Two numpy arrays that contain indices for train/test slicing. (train_index,test_index)
Returns: Dataset
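As a rough picture of holdout blending: part of the training set is set aside, the model is fit on the remainder, and its predictions for the holdout rows become the training features of a second-level model. The sketch below assumes that reading; blend_single and fit_mean are hypothetical names, and the toy learner simply predicts the training mean.

```python
import random


def fit_mean(X, y):
    """Hypothetical toy learner: returns a predict function that outputs the training mean."""
    mean = sum(y) / len(y)
    return lambda rows: [mean for _ in rows]


def blend_single(fit, X, y, X_test, proportion=0.2, seed=100):
    """Fit on the non-holdout rows, then predict the holdout rows and the test rows."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    n_holdout = max(1, int(round(len(X) * proportion)))
    holdout, rest = order[:n_holdout], order[n_holdout:]
    predict = fit([X[i] for i in rest], [y[i] for i in rest])
    # Holdout predictions become the training features of the next level.
    meta_X_train = predict([X[i] for i in holdout])
    meta_y_train = [y[i] for i in holdout]
    meta_X_test = predict(X_test)
    return meta_X_train, meta_y_train, meta_X_test


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
meta_X_train, meta_y_train, meta_X_test = blend_single(fit_mean, X, y, X_test=[[10.0], [11.0]])
```

Only the holdout rows carry over to the next level, so a larger proportion gives the second-level model more rows at the cost of less data for the first-level fit.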
estimator_name
hash
name
problem = None
stack(k=5, stratify=False, shuffle=True, seed=100, full_test=True)
Stack a single model. This method is rarely needed directly; use ModelsPipeline.stack instead.
Parameters: k : int, default 5
stratify : bool, default False
shuffle : bool, default True
seed : int, default 100
full_test : bool, default True
If True, predict the test set with a model fit on the full training data; otherwise take the mean of the test predictions from every fold.
Returns: Dataset with out of fold predictions.
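Out-of-fold predictions and the two full_test modes can be illustrated with a toy model. This is a sketch, not heamy's code: stack_single and MeanRegressor are hypothetical, folds are interleaved instead of shuffled, and the reading of full_test (refit on all training rows versus averaging per-fold test predictions) is an assumption.

```python
class MeanRegressor:
    """Hypothetical toy model: predicts the mean of the training targets."""

    def fit(self, X, y):
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, X):
        return [self.mean_ for _ in X]


def stack_single(make_model, X, y, X_test, k=5, full_test=True):
    """Return out-of-fold predictions for X and one prediction per row of X_test."""
    n = len(X)
    oof = [None] * n
    fold_test_preds = []
    for fold in range(k):
        # Interleaved folds keep the sketch short; shuffling and stratifying are omitted.
        test_index = [i for i in range(n) if i % k == fold]
        train_index = [i for i in range(n) if i % k != fold]
        model = make_model().fit([X[i] for i in train_index], [y[i] for i in train_index])
        for i, pred in zip(test_index, model.predict([X[i] for i in test_index])):
            oof[i] = pred  # each row is predicted by a model that never saw it
        fold_test_preds.append(model.predict(X_test))
    if full_test:
        # Refit on all training rows and predict the test set once.
        test_pred = make_model().fit(X, y).predict(X_test)
    else:
        # Average the k per-fold test predictions.
        test_pred = [sum(col) / k for col in zip(*fold_test_preds)]
    return oof, test_pred


X = [[float(i)] for i in range(10)]
y = [float(i) for i in range(10)]
oof, test_pred = stack_single(MeanRegressor, X, y, X_test=[[10.0]], k=5, full_test=True)
```

Because every training row gets a prediction, the second-level model can train on as many rows as the original training set.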
validate(scorer=None, k=1, test_size=0.1, stratify=False, shuffle=True, seed=100, indices=None)
Evaluate score by cross-validation.
Parameters: scorer : function(y_true,y_pred), default None
Scikit-learn like metric that returns a score.
k : int, default 1
The number of folds for validation.
If k=1, randomly split X_train into two parts; otherwise use the K-fold approach.
test_size : float, default 0.1
Size of the test holdout if k=1.
stratify : bool, default False
shuffle : bool, default True
seed : int, default 100
indices : list(np.ndarray,np.ndarray), default None
Two numpy arrays that contain indices for train/test slicing. (train_index,test_index)
Returns: y_true: list
Actual labels.
y_pred: list
Predicted labels.
Examples
>>> # Custom indices
>>> import numpy as np
>>> from sklearn.metrics import mean_absolute_error
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250,333))
>>> res = model_rf.validate(mean_absolute_error,indices=(train_index,test_index))
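The scorer contract (a function of y_true and y_pred that returns a single score) and validation on custom indices can be sketched without any dependencies. The helpers validate_with_indices and predict_training_mean are hypothetical, and returning the score alongside y_true and y_pred is a choice made for this sketch.

```python
def mean_absolute_error(y_true, y_pred):
    """Scikit-learn-like metric: takes true and predicted values, returns one float."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def predict_training_mean(X_train, y_train, X_new):
    """Hypothetical toy model: predicts the mean of y_train for every new row."""
    mean = sum(y_train) / len(y_train)
    return [mean for _ in X_new]


def validate_with_indices(fit_predict, X, y, scorer, indices):
    """Train on train_index, predict test_index, and score the predictions."""
    train_index, test_index = indices
    y_pred = fit_predict(
        [X[i] for i in train_index],
        [y[i] for i in train_index],
        [X[i] for i in test_index],
    )
    y_true = [y[i] for i in test_index]
    return y_true, y_pred, scorer(y_true, y_pred)


X = [[float(i)] for i in range(6)]
y = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y_true, y_pred, score = validate_with_indices(
    predict_training_mean, X, y, mean_absolute_error, indices=([0, 1, 2, 3], [4, 5])
)
```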
class heamy.estimator.Classifier(dataset, estimator=None, parameters=None, name=None, use_cache=True, probability=True)
Bases: heamy.estimator.BaseEstimator
Wrapper for classification problems.
Parameters: dataset : Dataset object
estimator : a callable scikit-learn like interface, custom function/class, optional
parameters : dict, optional
Arguments for estimator object.
name : str, optional
The unique name of Estimator object.
use_cache : bool, optional
If True, validate/predict/stack/blend results are cached.
problem = 'classification'
class heamy.estimator.Regressor(dataset, estimator=None, parameters=None, name=None, use_cache=True)
Bases: heamy.estimator.BaseEstimator
Wrapper for regression problems.
Parameters: dataset : Dataset object
estimator : a callable scikit-learn like interface, custom function/class, optional
parameters : dict, optional
Arguments for estimator object.
name : str, optional
The unique name of Estimator object.
use_cache : bool, optional
If True, validate/predict/stack/blend results are cached.
problem = 'regression'
heamy.feature module
heamy.pipeline module
class heamy.pipeline.ModelsPipeline(*args)
Bases: object
Combines a sequence of models.
apply(func)
Applies a function along the models' output.
Parameters: func : function
Arbitrary function with one argument.
Returns: PipeApply
Examples
>>> import numpy as np
>>> pipeline = ModelsPipeline(model_rf,model_lr)
>>> pipeline.apply(lambda x: np.max(x,axis=0)).execute()
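To see what a one-argument function applied along the models' output does, picture the predictions stacked with one row per model. The sketch below assumes that layout; apply_along_models and columnwise_max are hypothetical names, and columnwise_max plays the role of np.max(x, axis=0) on plain lists.

```python
def apply_along_models(predictions, func):
    """Call func once on the stacked model outputs (one row per model)."""
    return func(predictions)


def columnwise_max(rows):
    """For each sample, keep the largest prediction across models."""
    return [max(column) for column in zip(*rows)]


pred_rf = [0.2, 0.9, 0.4]  # made-up predictions from a first model
pred_lr = [0.3, 0.7, 0.5]  # made-up predictions from a second model
combined = apply_along_models([pred_rf, pred_lr], columnwise_max)
```

The result has one value per sample, so it can be scored like the output of a single model.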
blend(proportion=0.2, stratify=False, seed=100, indices=None, add_diff=False)
Blends sequence of models.
Parameters: proportion : float, default 0.2
stratify : bool, default False
seed : int, default 100
indices : list(np.ndarray,np.ndarray), default None
Two numpy arrays that contain indices for train/test slicing.
add_diff : bool, default False
Returns: Dataset
Examples
>>> pipeline = ModelsPipeline(model_rf,model_lr)
>>> pipeline.blend(seed=15)
>>> # Custom indices
>>> train_index = np.array(range(250))
>>> test_index = np.array(range(250,333))
>>> res = pipeline.blend(indices=(train_index,test_index))
find_weights(scorer, test_size=0.2, method='SLSQP')
Finds optimal weights for weighted average of models.
Parameters: scorer : function
Scikit-learn like metric.
test_size : float, default 0.2
method : str, default 'SLSQP'
Type of solver. Should be one of:
- ‘Nelder-Mead’
- ‘Powell’
- ‘CG’
- ‘BFGS’
- ‘Newton-CG’
- ‘L-BFGS-B’
- ‘TNC’
- ‘COBYLA’
- ‘SLSQP’
- ‘dogleg’
- ‘trust-ncg’
Returns: list
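The solver names listed above are also method names accepted by scipy.optimize.minimize. To keep the illustration dependency-free, the sketch below swaps the numerical solver for a plain grid search over a single weight for two models. find_two_weights is a hypothetical helper, the predictions are made up, and it assumes a scorer where lower is better.

```python
def mean_absolute_error(y_true, y_pred):
    """Scikit-learn-like metric where lower is better."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def find_two_weights(scorer, y_true, pred_a, pred_b, steps=100):
    """Grid-search w in [0, 1] minimizing scorer(y_true, w * pred_a + (1 - w) * pred_b)."""
    best_w, best_score = 0.0, None
    for step in range(steps + 1):
        w = step / steps
        blended = [w * a + (1 - w) * b for a, b in zip(pred_a, pred_b)]
        score = scorer(y_true, blended)
        if best_score is None or score < best_score:
            best_w, best_score = w, score
    return [best_w, 1 - best_w], best_score


y_true = [1.0, 2.0, 3.0]
pred_a = [2.0, 3.0, 4.0]  # made-up model that overshoots by 1
pred_b = [0.0, 1.0, 2.0]  # made-up model that undershoots by 1
weights, best_score = find_two_weights(mean_absolute_error, y_true, pred_a, pred_b)
```

Here the two models err in opposite directions by the same amount, so equal weights cancel the error exactly. With more than two models a grid becomes impractical, which is where a numerical solver earns its place.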
mean()
Returns the mean of the models predictions.
Returns: PipeApply
Examples
>>> # Execute
>>> pipeline = ModelsPipeline(model_rf,model_lr)
>>> pipeline.mean().execute()
>>> # Validate
>>> pipeline = ModelsPipeline(model_rf,model_lr)
>>> pipeline.mean().validate()
stack(k=5, stratify=False, shuffle=True, seed=100, full_test=True, add_diff=False)
Stacks sequence of models.
Parameters: k : int, default 5
Number of folds.
stratify : bool, default False
shuffle : bool, default True
seed : int, default 100
full_test : bool, default True
If True, predict the test set with a model fit on the full training data; otherwise take the mean of the test predictions from every fold.
add_diff : bool, default False
Returns: Dataset
Examples
>>> pipeline = ModelsPipeline(model_rf,model_lr)
>>> stack_ds = pipeline.stack(k=10, seed=111)