hypernets.tabular package

Submodules

hypernets.tabular.cache module

class hypernets.tabular.cache.CacheCallback[source]

Bases: object

on_apply(fn, cached_data, *args, **kwargs)[source]

Fired before applying cached data. Raise an exception to skip applying the cached data.

on_enter(fn, *args, **kwargs)[source]

Fired before checking the cache. Raise an exception to disable the cache for this call.

on_leave(fn, *args, **kwargs)[source]

Fired before leaving the fn call. Raise an exception to skip storing the cache.

on_store(fn, cached_data, *args, **kwargs)[source]

Fired before storing the cache. Raise an exception to skip storing the cache.

exception hypernets.tabular.cache.SkipCache[source]

Bases: Exception

hypernets.tabular.cache.cache(strategy=None, arg_keys=None, attr_keys=None, attrs_to_restore=None, transformer=None, callbacks=None, cache_dir=None)[source]
hypernets.tabular.cache.clear(cache_dir=None, fn=None)[source]
hypernets.tabular.cache.decorate(fn, *, cache_dir, strategy, arg_keys=None, attr_keys=None, attrs_to_restore=None, transformer=None, callbacks=None)[source]
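
Example (a minimal sketch, not from the library's docs: the comma-separated string form of attr_keys/attrs_to_restore, the callbacks argument taking a list, and the SlowEncoder class are assumptions inferred from the signatures above):

>>> from hypernets.tabular.cache import cache, clear, CacheCallback
>>> class LoggingCallback(CacheCallback):
...     # any hook may raise an exception to skip the corresponding step
...     def on_enter(self, fn, *args, **kwargs):
...         print('checking cache for', fn.__name__)
>>> class SlowEncoder:
...     def __init__(self, columns=None):
...         self.columns = columns
...     @cache(attr_keys='columns', attrs_to_restore='columns,mapping_',
...            callbacks=[LoggingCallback()])
...     def fit_transform(self, X, y=None):
...         # attr_keys identifies the call; attrs_to_restore lists the fitted
...         # attributes rebuilt on a cache hit instead of re-running the body
...         self.mapping_ = {c: i for i, c in enumerate(X.columns)}
...         return X.rename(columns=self.mapping_)
>>> clear()  # drop previously stored cache entries from the default cache_dir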

hypernets.tabular.cfg module

hypernets.tabular.collinearity module

Adapted from https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html. Multicollinearity is handled by performing hierarchical clustering on the features’ Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster.

class hypernets.tabular.collinearity.MultiCollinearityDetector[source]

Bases: object

detect(X, method=None)[source]
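
Example (a hedged sketch; the exact form of the return value, assumed here to describe which features are kept and which are dropped, should be checked against the source):

>>> import numpy as np
>>> import pandas as pd
>>> from hypernets.tabular.collinearity import MultiCollinearityDetector
>>> rs = np.random.RandomState(0)
>>> df = pd.DataFrame({'a': rs.rand(100), 'c': rs.rand(100)})
>>> df['b'] = df['a'] * 2 + rs.rand(100) * 0.01   # nearly collinear with 'a'
>>> detector = MultiCollinearityDetector()
>>> result = detector.detect(df)   # clusters Spearman correlations, keeps one feature per cluster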

hypernets.tabular.column_selector module

class hypernets.tabular.column_selector.AutoCategoryColumnSelector(pattern=None, *, dtype_include=None, dtype_exclude=None, cat_exponent=0.5)[source]

Bases: hypernets.tabular.column_selector.ColumnSelector

class hypernets.tabular.column_selector.ColumnSelector(pattern=None, *, dtype_include=None, dtype_exclude=None)[source]

Bases: sklearn.compose._column_transformer.make_column_selector

Create a callable to select columns to be used with ColumnTransformer.

make_column_selector() can select columns based on datatype or the columns name with a regex. When using multiple selection criteria, all criteria must match for a column to be selected.

Parameters:
  • pattern (str, default=None) – Columns whose name contains this regex pattern will be included. If None, column selection will not be based on the pattern.
  • dtype_include (column dtype or list of column dtypes, default=None) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
  • dtype_exclude (column dtype or list of column dtypes, default=None) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().
Returns:

selector – Callable for column selection to be used by a ColumnTransformer.

Return type:

callable

See also

ColumnTransformer
Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.compose import make_column_selector
>>> import numpy as np
>>> import pandas as pd  # doctest: +SKIP
>>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                   'rating': [5, 3, 4, 5]})  # doctest: +SKIP
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        make_column_selector(dtype_include=np.number)),  # rating
...       (OneHotEncoder(),
...        make_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)  # doctest: +SKIP
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
class hypernets.tabular.column_selector.CompositedColumnSelector(selectors)[source]

Bases: object

class hypernets.tabular.column_selector.LatLongColumnSelector[source]

Bases: object

class hypernets.tabular.column_selector.MinMaxColumnSelector(min=None, max=None)[source]

Bases: object

class hypernets.tabular.column_selector.TextColumnSelector(pattern=None, *, dtype_include=None, dtype_exclude=None, word_count_threshold=10)[source]

Bases: hypernets.tabular.column_selector.ColumnSelector

hypernets.tabular.column_selector.calc_skewness_kurtosis(X_1, X_2, columns=None, smooth_fn=<ufunc 'log'>)[source]
hypernets.tabular.column_selector.column_min_max(X, min_value=None, max_value=None)[source]
hypernets.tabular.column_selector.column_skewness_kurtosis(X, skew_threshold=0.5, kurtosis_threshold=0.5, columns=None)[source]
hypernets.tabular.column_selector.column_skewness_kurtosis_diff(X_1, X_2, diff_threshold=5, columns=None, smooth_fn=<ufunc 'log'>, skewness_weights=1, kurtosis_weights=0)[source]
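
Example (a sketch of the helper functions above; the return values are assumed to be lists of matching column names):

>>> import numpy as np
>>> import pandas as pd
>>> from hypernets.tabular import column_selector as cs
>>> rs = np.random.RandomState(0)
>>> df = pd.DataFrame({'skewed': rs.exponential(size=1000),   # strongly right-skewed
...                    'normal': rs.normal(size=1000)})
>>> cs.column_skewness_kurtosis(df, skew_threshold=0.5, kurtosis_threshold=0.5)
>>> cs.column_min_max(df, min_value=0, max_value=100)   # columns whose values fall in [0, 100]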

hypernets.tabular.data_cleaner module

class hypernets.tabular.data_cleaner.DataCleaner(nan_chars=None, correct_object_dtype=True, drop_constant_columns=True, drop_duplicated_columns=False, drop_label_nan_rows=True, drop_idness_columns=True, replace_inf_values=nan, drop_columns=None, reserve_columns=None, reduce_mem_usage=False, int_convert_to='float')[source]

Bases: object

append_drop_columns(columns)[source]
clean_data(X, y, *, df_meta=None, reduce_mem_usage)[source]
fit_transform(X, y=None, copy_data=True)[source]
static get_helper(X, y)[source]
get_params()[source]
transform(X, y=None, copy_data=True)[source]
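
Example (a minimal sketch assuming fit_transform returns the cleaned (X, y) pair and transform returns the cleaned X when y is omitted; the toy frame is illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> from hypernets.tabular.data_cleaner import DataCleaner
>>> X = pd.DataFrame({'id': range(5),                 # idness column, dropped
...                   'const': ['a'] * 5,             # constant column, dropped
...                   'f1': [1.0, 2.0, np.nan, 4.0, np.inf],
...                   'f2': ['x', 'y', 'x', None, 'y']})
>>> y = pd.Series([0, 1, 0, None, 1])                 # row with NaN label is dropped
>>> cleaner = DataCleaner(drop_idness_columns=True, drop_constant_columns=True)
>>> X_clean, y_clean = cleaner.fit_transform(X, y)
>>> X_new = cleaner.transform(X)                      # reuse the fitted rules on new data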

hypernets.tabular.data_hasher module

class hypernets.tabular.data_hasher.DataHasher(method='md5')[source]

Bases: object

hypernets.tabular.dataframe_mapper module

Adapted from https://github.com/scikit-learn-contrib/sklearn-pandas, with two changes: 1. fixed the confusion of column names; 2. the columns selector may be a callable object.

class hypernets.tabular.dataframe_mapper.DataFrameMapper(features, default=False, df_out=False, input_df=False, df_out_dtype_transforms=None)[source]

Bases: sklearn.base.BaseEstimator

Map Pandas data frame column subsets to their own sklearn transformation.

features : a list of tuples with feature definitions.
The first element is the pandas column selector. This can be a string (for one column) or a list of strings. The second element is an object that supports sklearn’s transform interface, or a list of such objects. The third element is optional and, if present, must be a dictionary with the options to apply to the transformation. Example: {‘alias’: ‘day_of_week’}
default : default transformer to apply to the columns not
explicitly selected in the mapper. If False (default), discard them. If None, pass them through untouched. Any other transformer will be applied to all the unselected columns as a whole, taken as a 2d-array.
df_out : return a pandas data frame, with each column named using
the pandas column that created it (if there is only one input and output), or the input columns joined with ‘_’ if there are multiple inputs, and the name suffixed with ‘_1’, ‘_2’, etc. if there are multiple outputs.
input_df : if True, pass the selected columns to the transformers
as a pandas DataFrame or Series. Otherwise pass them as a numpy array. Defaults to False.
fitted_features_
Type:list of tuple(column_name list, fitted transformer, options)
fit(X, y=None)[source]
fit_transform(X, y=None, *fit_args)[source]
transform(X)[source]
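
Example (a sketch based on the parameter descriptions above; the alias option and column names are illustrative):

>>> import pandas as pd
>>> from sklearn.preprocessing import StandardScaler, OrdinalEncoder
>>> from hypernets.tabular.dataframe_mapper import DataFrameMapper
>>> df = pd.DataFrame({'age': [23, 41, 35], 'city': ['NY', 'SF', 'NY']})
>>> mapper = DataFrameMapper(
...     features=[(['age'], StandardScaler()),
...               (['city'], OrdinalEncoder(), {'alias': 'city_code'})],
...     input_df=True,    # pass pandas objects to the transformers
...     df_out=True)      # return a DataFrame with generated column names
>>> out = mapper.fit_transform(df)
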
class hypernets.tabular.dataframe_mapper.TransformerPipeline(steps)[source]

Bases: sklearn.pipeline.Pipeline

Pipeline that expects all steps to be transformers taking a single X argument, an optional y argument, and having fit and transform methods.

Code is copied from sklearn’s Pipeline

fit(X, y=None, **fit_params)[source]

Fit the model

Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.

Parameters:
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns:

self – This estimator

Return type:

Pipeline

fit_transform(X, y=None, **fit_params)[source]

Fit the model and transform with the final estimator

Fits all the transforms one after the other and transforms the data, then uses fit_transform on transformed data with the final estimator.

Parameters:
  • X (iterable) – Training data. Must fulfill input requirements of first step of the pipeline.
  • y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  • **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
Returns:

Xt – Transformed samples

Return type:

array-like of shape (n_samples, n_transformed_features)

hypernets.tabular.dataframe_mapper.make_transformer_pipeline(*steps)[source]

Construct a TransformerPipeline from the given estimators.

hypernets.tabular.drift_detection module

class hypernets.tabular.drift_detection.DriftDetector(preprocessor=None, estimator=None, random_state=None)[source]

Bases: object

fit(X_train, X_test, sample_balance=True, max_test_samples=None, cv=5)[source]
predict_proba(X)[source]
train_test_split(X, y, test_size=0.25, remain_for_train=0.3)[source]
class hypernets.tabular.drift_detection.FeatureSelectionCallback[source]

Bases: object

on_remove_shift_variable(shift_score, remove_features)[source]
on_round_end(round_no, auc, features, remove_features, elapsed)[source]
on_round_start(round_no, features)[source]
on_task_break(round_no, auc, features)[source]
on_task_finished(round_no, auc, features)[source]
class hypernets.tabular.drift_detection.FeatureSelectorWithDriftDetection(remove_shift_variable=True, variable_shift_threshold=0.7, variable_shift_scorer=None, auc_threshold=0.55, min_features=10, remove_size=0.1, sample_balance=True, max_test_samples=None, cv=5, random_state=None, callbacks=None)[source]

Bases: object

static get_detector(preprocessor=None, estimator=None, random_state=None)[source]
parallelizable = True
select(X_train, X_test, *, preprocessor=None, estimator=None, copy_data=False)[source]
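
Example (a sketch of the adversarial-validation workflow above; the synthetic drift and the assumption that select() returns the retained feature names are illustrative):

>>> import numpy as np
>>> import pandas as pd
>>> from hypernets.tabular.drift_detection import DriftDetector, FeatureSelectorWithDriftDetection
>>> rs = np.random.RandomState(0)
>>> X_train = pd.DataFrame({'f1': rs.normal(0, 1, 500), 'f2': rs.normal(0, 1, 500)})
>>> X_test = pd.DataFrame({'f1': rs.normal(2, 1, 500),   # 'f1' drifts between splits
...                        'f2': rs.normal(0, 1, 500)})
>>> dd = DriftDetector(random_state=42)
>>> dd.fit(X_train, X_test)            # adversarial classifier: train rows vs. test rows
>>> proba = dd.predict_proba(X_train)  # how much each row looks like test data
>>> selector = FeatureSelectorWithDriftDetection(auc_threshold=0.55, min_features=1)
>>> remained = selector.select(X_train, X_test)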

hypernets.tabular.estimator_detector module

class hypernets.tabular.estimator_detector.EstimatorDetector(name_or_cls, task, *, init_kwargs=None, fit_kwargs=None, n_samples=100, n_features=5)[source]

Bases: object

create_estimator(estimator_cls)[source]
fit_estimator(estimator, X, y)[source]
get_estimator_cls()[source]
prepare_data()[source]

hypernets.tabular.metrics module

class hypernets.tabular.metrics.Metrics[source]

Bases: object

calc_score(y_preds, y_proba=None, metrics=('accuracy', ), task='binary', pos_label=1, classes=None, average=None)
evaluate(X, y, metrics, *, task=None, pos_label=None, classes=None, average=None, threshold=0.5, n_jobs=-1)
metric_to_scoring(task='binary', pos_label=None)
predict(X, *, task=None, classes=None, threshold=0.5, n_jobs=None)
predict_proba(X, *, n_jobs=None)
proba2predict(*, task=None, threshold=0.5, classes=None)
hypernets.tabular.metrics.calc_score(y_true, y_preds, y_proba=None, metrics=('accuracy', ), task='binary', pos_label=1, classes=None, average=None)[source]
hypernets.tabular.metrics.evaluate(estimator, X, y, metrics, *, task=None, pos_label=None, classes=None, average=None, threshold=0.5, n_jobs=-1)[source]
hypernets.tabular.metrics.metric_to_scoring(metric, task='binary', pos_label=None)[source]
hypernets.tabular.metrics.predict(estimator, X, *, task=None, classes=None, threshold=0.5, n_jobs=None)[source]
hypernets.tabular.metrics.predict_proba(estimator, X, *, n_jobs=None)[source]
hypernets.tabular.metrics.proba2predict(proba, *, task=None, threshold=0.5, classes=None)[source]
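
Example (a sketch of calc_score; the metric aliases 'auc' and 'f1' are assumed to be accepted names):

>>> import numpy as np
>>> from hypernets.tabular.metrics import calc_score
>>> y_true = np.array([0, 1, 1, 0, 1])
>>> y_proba = np.array([0.2, 0.8, 0.6, 0.4, 0.3])
>>> y_pred = (y_proba >= 0.5).astype(int)
>>> calc_score(y_true, y_pred, y_proba=y_proba,
...            metrics=('accuracy', 'auc', 'f1'),
...            task='binary', pos_label=1)   # -> dict mapping metric name to value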

hypernets.tabular.persistence module

hypernets.tabular.pseudo_labeling module

class hypernets.tabular.pseudo_labeling.PseudoLabeling(strategy, threshold=None, quantile=None, number=None)[source]

Bases: object

DEFAULT_STRATEGY_SETTINGS = {'default_number': 0.2, 'default_quantile': 0.8, 'default_strategy': 'threshold', 'default_threshold': 0.8}
static detect_strategy(strategy, threshold=None, quantile=None, number=None)[source]
np = <module 'numpy'>
select(X_test, classes, proba)[source]
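
Example (a sketch of the threshold strategy; the assumption that select() returns the confidently-predicted samples and their pseudo labels should be checked against the source):

>>> import numpy as np
>>> from hypernets.tabular.pseudo_labeling import PseudoLabeling
>>> proba = np.array([[0.95, 0.05],
...                   [0.30, 0.70],
...                   [0.10, 0.90],
...                   [0.55, 0.45]])
>>> classes = np.array(['no', 'yes'])
>>> X_test = np.arange(4).reshape(-1, 1)
>>> pl = PseudoLabeling(strategy='threshold', threshold=0.8)
>>> X_pseudo, y_pseudo = pl.select(X_test, classes, proba)   # rows 0 and 2 pass the 0.8 threshold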

hypernets.tabular.sklearn_ex module

class hypernets.tabular.sklearn_ex.AsTypeTransformer(*, dtype)[source]

Bases: sklearn.base.BaseEstimator

fit(X, y=None)[source]
fit_transform(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.CategorizeEncoder(columns=None, remain_numeric=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.ColumnEncoder[source]

Bases: sklearn.base.BaseEstimator

Encode each column in the dataset with a separate encoder.

create_encoder(X, y)[source]
fit(X, y=None, **kwargs)[source]
fit_transform(X, y=None, *, copy=True, **kwargs)[source]
transform(X, *, copy=True)[source]
class hypernets.tabular.sklearn_ex.ConstantImputer(missing_values=nan, fill_value=None, copy=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X, y=None)[source]
class hypernets.tabular.sklearn_ex.DataFrameWrapper(transform, columns=None)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.DatetimeEncoder(columns=None, include=None, exclude=None, extra=None, drop_constants=True)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

all_items = {'day': 'day', 'dayofyear': 'dayofyear', 'hour': 'hour', 'minute': 'minute', 'month': 'month', 'second': 'second', 'timestamp': <function DatetimeEncoder.<lambda>>, 'week': 'week', 'weekday': 'weekday', 'year': 'year'}
default_include = ['month', 'day', 'hour', 'minute', 'week', 'weekday', 'dayofyear']
fit(X, y=None)[source]
static to_dataframe(X)[source]
transform(X, y=None)[source]
transform_column(Xc)[source]
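
Example (a sketch; the include values are taken from default_include above, and fit_transform comes from TransformerMixin):

>>> import pandas as pd
>>> from hypernets.tabular.sklearn_ex import DatetimeEncoder
>>> df = pd.DataFrame({'ts': pd.date_range('2022-01-01', periods=5, freq='D')})
>>> enc = DatetimeEncoder(include=['month', 'weekday'], drop_constants=False)
>>> out = enc.fit_transform(df)   # expands 'ts' into month and weekday columns
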
class hypernets.tabular.sklearn_ex.FeatureImportanceSelection(importances, quantile, min_features=3)[source]

Bases: sklearn.base.BaseEstimator

feature_usage()[source]
fit(X, y=None, **kwargs)[source]
fit_transform(X, y=None, **kwargs)[source]
important_features
transform(X)[source]
class hypernets.tabular.sklearn_ex.FeatureImportancesSelectionTransformer(task=None, strategy=None, threshold=None, quantile=None, number=None, data_clean=True)[source]

Bases: sklearn.base.BaseEstimator

fit(X, y)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.FeatureSelectionTransformer(task=None, max_train_samples=10000, max_test_samples=10000, max_cols=10000, ratio_select_cols=0.1, n_max_cols=100, n_min_cols=10, reserved_cols=None)[source]

Bases: sklearn.base.BaseEstimator

feature_score(F_train, y_train, F_test, y_test)[source]
fit(X, y)[source]
get_categorical_features(X)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.FloatOutputImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)[source]

Bases: sklearn.impute._base.SimpleImputer

transform(X)[source]

Impute all missing values in X.

Parameters:X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The input data to complete.
class hypernets.tabular.sklearn_ex.GaussRankScaler[source]

Bases: sklearn.base.BaseEstimator

fit_transform(X, y=None)[source]
class hypernets.tabular.sklearn_ex.LgbmLeavesEncoder(cat_vars, cont_vars, task, **params)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.LocalizedTfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)[source]

Bases: sklearn.feature_extraction.text.TfidfVectorizer

decode(doc)[source]

Decode the input into a string of unicode symbols.

The decoding strategy depends on the vectorizer parameters.

Parameters:doc (str) – The string to decode.
Returns:doc – A string of unicode symbols.
Return type:str
class hypernets.tabular.sklearn_ex.LogStandardScaler(copy=True, with_mean=True, with_std=True)[source]

Bases: sklearn.base.BaseEstimator

fit(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.MultiKBinsDiscretizer(columns=None, bins=None, strategy='quantile')[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.MultiLabelEncoder(columns=None, dtype=None)[source]

Bases: sklearn.base.BaseEstimator

fit(X, y=None)[source]
fit_transform(X, *args)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.MultiTargetEncoder(n_folds=4, smooth=None, seed=42, split_method='interleaved', dtype=None)[source]

Bases: hypernets.tabular.sklearn_ex.ColumnEncoder

create_encoder(X, y)[source]
fit(X, y=None, **kwargs)[source]
fit_transform(X, y=None, **kwargs)[source]
label_encoder_cls

alias of sklearn.preprocessing._label.LabelEncoder

target_encoder_cls

alias of SlimTargetEncoder

class hypernets.tabular.sklearn_ex.MultiVarLenFeatureEncoder(features)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

fit(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.PassThroughEstimator[source]

Bases: sklearn.base.BaseEstimator

fit(X, y=None)[source]
fit_transform(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.SafeLabelEncoder[source]

Bases: sklearn.preprocessing._label.LabelEncoder

transform(y)[source]

Transform labels to normalized encoding.

Parameters:y (array-like of shape (n_samples,)) – Target values.
Returns:y
Return type:array-like of shape (n_samples,)
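
Example (a sketch contrasting the two encoders above; the exact code assigned to unseen categories is an assumption):

>>> import pandas as pd
>>> from hypernets.tabular.sklearn_ex import SafeLabelEncoder, MultiLabelEncoder
>>> le = SafeLabelEncoder().fit(pd.Series(['a', 'b', 'c']))
>>> le.transform(pd.Series(['a', 'b', 'd']))   # unseen 'd' does not raise, unlike LabelEncoder
>>> mle = MultiLabelEncoder(columns=['city'])  # one SafeLabelEncoder per selected column
>>> df = pd.DataFrame({'city': ['NY', 'SF', 'NY'], 'n': [1, 2, 3]})
>>> df_enc = mle.fit_transform(df)
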
class hypernets.tabular.sklearn_ex.SafeOneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')[source]

Bases: sklearn.preprocessing._encoders.OneHotEncoder

get_feature_names(input_features=None)[source]

Override this method to remove non-alphanumeric chars from feature names

class hypernets.tabular.sklearn_ex.SafeOrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None)[source]

Bases: sklearn.preprocessing._encoders.OrdinalEncoder

Adapted from sklearn’s OrdinalEncoder. Encode categorical features as an integer array.

The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.

Read more in the User Guide.

New in version 0.20.

Parameters:
  • categories ('auto' or a list of array-like, default='auto') –

    Categories (unique values) per feature:

    • ’auto’ : Determine categories automatically from the training data.
    • list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.

    The used categories can be found in the categories_ attribute.

  • dtype (number type, default np.float64) – Desired dtype of output.
  • handle_unknown ({'error', 'use_encoded_value'}, default='error') –

    When set to ‘error’ an error will be raised in case an unknown categorical feature is present during transform. When set to ‘use_encoded_value’, the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform(), an unknown category will be denoted as None.

    New in version 0.24.

  • unknown_value (int or np.nan, default=None) –

    When the parameter handle_unknown is set to ‘use_encoded_value’, this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype.

    New in version 0.24.

categories_

The categories of each feature determined during fit (in order of the features in X and corresponding with the output of transform). This does not include categories that weren’t seen during fit.

Type:list of arrays

See also

OneHotEncoder
Performs a one-hot encoding of categorical features.
LabelEncoder
Encodes target labels with values between 0 and n_classes-1.

Examples

Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.

>>> from sklearn.preprocessing import OrdinalEncoder
>>> enc = OrdinalEncoder()
>>> X = [['Male', 1], ['Female', 3], ['Female', 2]]
>>> enc.fit(X)
OrdinalEncoder()
>>> enc.categories_
[array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)]
>>> enc.transform([['Female', 3], ['Male', 1]])
array([[0., 2.],
       [1., 0.]])
>>> enc.inverse_transform([[1, 0], [0, 1]])
array([['Male', 1],
       ['Female', 2]], dtype=object)
inverse_transform(X)[source]

Convert the data back to the original representation.

Parameters:X ({array-like, sparse matrix} of shape (n_samples, n_features)) – The transformed data.
Returns:X_tr – Inverse transformed array.
Return type:ndarray of shape (n_samples, n_features)
transform(X, y=None)[source]

Transform X to ordinal codes.

Parameters:X (array-like of shape (n_samples, n_features)) – The data to encode.
Returns:X_out – Transformed input.
Return type:ndarray of shape (n_samples, n_features)
class hypernets.tabular.sklearn_ex.SafeSimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)[source]

Bases: sklearn.impute._base.SimpleImputer

A SimpleImputer that passes bool columns through unchanged.

fit(X, y=None)[source]

Fit the imputer on X.

Parameters:X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.
Returns:self
Return type:SimpleImputer
transform(X)[source]

Impute all missing values in X.

Parameters:X ({array-like, sparse matrix}, shape (n_samples, n_features)) – The input data to complete.
class hypernets.tabular.sklearn_ex.SkewnessKurtosisTransformer(transform_fn=None, skew_threshold=0.5, kurtosis_threshold=0.5)[source]

Bases: sklearn.base.BaseEstimator

fit(X, y=None)[source]
transform(X)[source]
class hypernets.tabular.sklearn_ex.SlimTargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved', dtype=None, output_2d=False)[source]

Bases: hypernets.tabular.sklearn_ex.TargetEncoder

A slimmed TargetEncoder whose ‘train’ and ‘train_encode’ attributes are set to None.

fit(X, y)[source]

Fit a TargetEncoder instance to a set of categories

Parameters:
  • x (cudf.Series or cudf.DataFrame or cupy.ndarray) – Categories to be encoded. Its elements may or may not be unique.
  • y (cudf.Series or cupy.ndarray) – Series containing the target variable.
Returns:

self – A fitted instance of itself to allow method chaining

Return type:

TargetEncoder

fit_transform(X, y)[source]

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) TargetEncoder().fit(X, y).transform(X)

split_method
transform(X)[source]

Transform an input into its categorical keys.

This is intended for test data. For fitting and transforming the training data, prefer fit_transform.

Parameters:x (cudf.Series) – Input keys to be transformed. Its values do not have to match the categories given to fit
Returns:encoded – The ordinally encoded input series
Return type:cupy.ndarray
class hypernets.tabular.sklearn_ex.TargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved')[source]

Bases: sklearn.base.BaseEstimator

Adapted from cuml.preprocessing.TargetEncoder

fit(x, y)[source]

Fit a TargetEncoder instance to a set of categories

Parameters:
  • x (cudf.Series or cudf.DataFrame or cupy.ndarray) – Categories to be encoded. Its elements may or may not be unique.
  • y (cudf.Series or cupy.ndarray) – Series containing the target variable.
Returns:

self – A fitted instance of itself to allow method chaining

Return type:

TargetEncoder

fit_transform(x, y)[source]

Simultaneously fit and transform an input

This is functionally equivalent to (but faster than) TargetEncoder().fit(x, y).transform(x)

transform(x)[source]

Transform an input into its categorical keys.

This is intended for test data. For fitting and transforming the training data, prefer fit_transform.

Parameters:x (cudf.Series) – Input keys to be transformed. Its values do not have to match the categories given to fit
Returns:encoded – The ordinally encoded input series
Return type:cupy.ndarray
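
Example (a sketch assuming the sklearn_ex adaptation accepts pandas/numpy inputs rather than the cudf/cupy types mentioned in the inherited docstrings):

>>> import numpy as np
>>> import pandas as pd
>>> from hypernets.tabular.sklearn_ex import TargetEncoder
>>> x = pd.Series(['a', 'b', 'a', 'c', 'b', 'a'])
>>> y = np.array([1, 0, 1, 0, 1, 0])
>>> te = TargetEncoder(n_folds=2, smooth=0, seed=42)
>>> x_train_enc = te.fit_transform(x, y)              # out-of-fold target means for training data
>>> x_test_enc = te.transform(pd.Series(['a', 'd']))  # unseen 'd' is allowed at transform time
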
class hypernets.tabular.sklearn_ex.TfidfEncoder(columns=None, flatten=False, **kwargs)[source]

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

create_encoder()[source]
fit(X, y=None)[source]
transform(X, y=None)[source]
class hypernets.tabular.sklearn_ex.VarLenFeatureEncoder(sep='|')[source]

Bases: object

fit(X: pandas.core.series.Series)[source]
max_element_length
n_classes
static pad_sequences(sequences, maxlen=None, dtype='int32', padding='pre', truncating='pre', value=0.0)[source]

Adapted from tensorflow.python.keras.preprocessing.sequence.pad_sequences

transform(X: pandas.core.series.Series)[source]
hypernets.tabular.sklearn_ex.root_mean_squared_error(y_true, y_pred, sample_weight=None, multioutput='uniform_average', squared=True)[source]
hypernets.tabular.sklearn_ex.subsample(X, y, max_samples, train_samples, task, random_state=9527)[source]

hypernets.tabular.toolbox module

class hypernets.tabular.toolbox.ToolBox[source]

Bases: object

STRATEGY_NUMBER = 'number'
STRATEGY_QUANTILE = 'quantile'
STRATEGY_THRESHOLD = 'threshold'
classmethod accept(*args)[source]
acceptable_types = (<class 'numpy.ndarray'>, <class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.series.Series'>)
static array_to_df(arr, *, columns=None, index=None, meta=None)[source]
static collapse_last_dim(arr, keep_dim=True)[source]

Collapse the last dimension.

Parameters:
  • arr – data array
  • keep_dim – keep the last dim as one or not

classmethod collinearity_detector()[source]
column_selector = <module 'hypernets.tabular.column_selector'>
compute_class_weight(*, classes, y)

Estimate class weights for unbalanced datasets.

Parameters:
  • class_weight (dict, 'balanced' or None) – If ‘balanced’, class weights will be given by n_samples / (n_classes * np.bincount(y)). If a dictionary is given, keys are classes and values are corresponding class weights. If None is given, the class weights will be uniform.
  • classes (ndarray) – Array of the classes occurring in the data, as given by np.unique(y_org) with y_org the original class labels.
  • y (array-like of shape (n_samples,)) – Array of original class labels per sample.
Returns:

class_weight_vect – Array with class_weight_vect[i] the weight for i-th class.

Return type:

ndarray of shape (n_classes,)

References

The “balanced” heuristic is inspired by Logistic Regression in Rare Events Data, King, Zeng, 2001.

static compute_sample_weight(y)[source]
static concat_df(dfs, axis=0, repartition=False, random_state=9527, **kwargs)[source]
classmethod data_cleaner(nan_chars=None, correct_object_dtype=True, drop_constant_columns=True, drop_duplicated_columns=False, drop_label_nan_rows=True, drop_idness_columns=True, replace_inf_values=nan, drop_columns=None, reserve_columns=None, reduce_mem_usage=False, int_convert_to='float')[source]
classmethod data_hasher(method='md5')[source]
static detect_strategy(strategy, *, threshold=None, quantile=None, number=None, default_strategy, default_threshold, default_quantile, default_number)[source]
classmethod detect_strategy_of_feature_selection_by_importance(strategy, *, threshold=None, quantile=None, number=None)[source]
static df_to_array(df)[source]
classmethod drift_detector(preprocessor=None, estimator=None, random_state=None)[source]
classmethod estimator_detector(name_or_cls, task, *, init_kwargs=None, fit_kwargs=None, n_samples=100, n_features=5)[source]
classmethod feature_selector_with_drift_detection(remove_shift_variable=True, variable_shift_threshold=0.7, variable_shift_scorer=None, auc_threshold=0.55, min_features=10, remove_size=0.1, sample_balance=True, max_test_samples=None, cv=5, random_state=None, callbacks=None)[source]
classmethod feature_selector_with_feature_importances(strategy=None, threshold=None, quantile=None, number=None)[source]
static fix_binary_predict_proba_result(proba)[source]
static from_local(*data)[source]
static gc()[source]
classmethod general_estimator(X, y=None, estimator=None, task=None)[source]
classmethod general_preprocessor(X, y=None)[source]
static get_shape(X, allow_none=False)[source]
classmethod greedy_ensemble(task, estimators, need_fit=False, n_folds=5, method='soft', random_state=9527, scoring='neg_log_loss', ensemble_size=0)[source]
classmethod hstack_array(arrs)[source]
classmethod infer_task_type(y, excludes=None)[source]
classmethod kfold(n_splits=5, *, shuffle=False, random_state=None)[source]
static load_data(data_path, *, reset_index=False, reader_mapping=None, **kwargs)[source]
static mean_oof(probas)[source]
static memory_free()[source]
static memory_total()[source]
static memory_usage(*data)[source]
static merge_oof(oofs)[source]
Parameters:oofs – list of tuple(idx,proba)
Returns:merged proba
metrics

alias of hypernets.tabular.metrics.Metrics

static nunique_df(df)[source]
static parquet()[source]
static permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)[source]

see: sklearn.inspection.permutation_importance

classmethod permutation_importance_batch(estimators, X, y, scoring=None, n_repeats=5, n_jobs=None, random_state=None)[source]

Evaluate the importance of features of a set of estimators

Parameters:
  • estimators (list) – A set of estimators that has already been fitted and is compatible with scorer.
  • X (ndarray or DataFrame, shape (n_samples, n_features)) – Data on which permutation importance will be computed.
  • y (array-like or None, shape (n_samples, ) or (n_samples, n_classes)) – Targets for supervised or None for unsupervised.
  • scoring (string, callable or None, default=None) – Scorer to use. It can be a single string (see scoring_parameter) or a callable (see scoring). If None, the estimator’s default scorer is used.
  • n_repeats (int, default=5) – Number of times to permute a feature.
  • n_jobs (int or None, default=None) – The number of jobs to use for the computation. None means 1 unless in a joblib.parallel_backend context. -1 means using all processors. See Glossary for more details.
  • random_state (int, RandomState instance, or None, default=None) – Pseudo-random number generator to control the permutations of each feature. See random_state.
Returns:

result – Dictionary-like object, with attributes:

importances_mean : ndarray, shape (n_features, )

Mean of feature importance over n_repeats.

importances_std : ndarray, shape (n_features, )

Standard deviation over n_repeats.

importances : ndarray, shape (n_features, n_repeats)

Raw permutation importance scores.

Return type:

Bunch

classmethod pseudo_labeling(strategy, threshold=None, quantile=None, number=None)[source]
static reset_index(df)[source]
static select_1d(arr, indices)[source]

Select by indices along the first axis (0).

static select_df(df, indices)[source]

Select dataframe by row indices.

classmethod select_feature_by_importance(feature_importance, strategy=None, threshold=None, quantile=None, number=None)[source]
static select_valid_oof(y, oof)[source]
static stack_array(arrs, axis=0)[source]
classmethod statified_kfold(n_splits=5, *, shuffle=False, random_state=None)[source]
static take_array(arr, indices, axis=None)[source]
static to_local(*data)[source]
train_test_split(*, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)

Split arrays or matrices into random train and test subsets

Quick utility that wraps input validation and next(ShuffleSplit().split(X, y)) and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.

Read more in the User Guide.

Parameters:
  • *arrays (sequence of indexables with same length / shape[0]) – Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
  • test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
  • train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
  • random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See Glossary.
  • shuffle (bool, default=True) – Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
  • stratify (array-like, default=None) – If not None, data is split in a stratified fashion, using this as the class labels. Read more in the User Guide.
Returns:splitting – List containing train-test split of inputs.

New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.

Return type:list, length=2 * len(arrays)

Examples

>>> import numpy as np
>>> from sklearn.model_selection import train_test_split
>>> X, y = np.arange(10).reshape((5, 2)), range(5)
>>> X
array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])
>>> list(y)
[0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split(
...     X, y, test_size=0.33, random_state=42)
...
>>> X_train
array([[4, 5],
       [0, 1],
       [6, 7]])
>>> y_train
[2, 0, 3]
>>> X_test
array([[2, 3],
       [8, 9]])
>>> y_test
[1, 4]
>>> train_test_split(y, shuffle=False)
[[0, 1, 2], [3, 4]]
static unique(y)[source]
static value_counts(ar)[source]
classmethod vstack_array(arrs)[source]
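
Example (a sketch of a typical ToolBox workflow; the return forms of infer_task_type, general_preprocessor, and general_estimator are assumptions based on their names and common sklearn conventions):

>>> import numpy as np
>>> import pandas as pd
>>> from hypernets.tabular.toolbox import ToolBox
>>> tb = ToolBox
>>> df = pd.DataFrame({'f_num': np.random.rand(100),
...                    'f_cat': np.random.choice(list('abc'), 100)})
>>> y = pd.Series(np.random.randint(0, 2, 100))
>>> task = tb.infer_task_type(y)           # e.g. task name plus class labels (form assumed)
>>> prep = tb.general_preprocessor(df, y)  # default preprocessing for mixed dtypes
>>> est = tb.general_estimator(df, y)      # default estimator for the inferred task
>>> X_enc = prep.fit_transform(df, y)
>>> est.fit(X_enc, y)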

Module contents