hypernets.tabular package

Subpackages

Submodules

hypernets.tabular.cache module
class hypernets.tabular.cache.CacheCallback

Bases: object

- on_apply(fn, cached_data, *args, **kwargs)
  Fired before applying cached data. Raise an exception to skip applying.

- on_enter(fn, *args, **kwargs)
  Fired before checking the cache. Raise an exception to disable caching.
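A minimal sketch of a custom callback, relying only on the two hooks documented above; the subclass name and the logging are illustrative, not part of the library:

from hypernets.tabular.cache import CacheCallback

class LoggingCacheCallback(CacheCallback):
    # Illustrative subclass; uses only the hooks documented above.

    def on_enter(self, fn, *args, **kwargs):
        # Fired before the cache is checked; raising here disables caching.
        print(f'checking cache for {fn.__name__}')

    def on_apply(self, fn, cached_data, *args, **kwargs):
        # Fired before cached data is applied; raising here skips applying.
        print(f'applying cached result of {fn.__name__}')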
hypernets.tabular.cfg module

hypernets.tabular.collinearity module
Adapted from https://scikit-learn.org/stable/auto_examples/inspection/plot_permutation_importance_multicollinear.html. Multicollinearity is handled by performing hierarchical clustering on the features' Spearman rank-order correlations, picking a threshold, and keeping a single feature from each cluster.
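A compact sketch of that recipe using SciPy directly; the helper name and the 0.3 threshold are illustrative, not this module's actual implementation:

import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
from scipy.stats import spearmanr

def pick_one_feature_per_cluster(df, threshold=0.3):
    # Hierarchical clustering on the features' Spearman rank-order correlations.
    corr = spearmanr(df).correlation
    # Turn |correlation| into a distance and cluster on it.
    distance = 1.0 - np.abs(corr)
    z = linkage(squareform(distance, checks=False), method='average')
    cluster_ids = fcluster(z, t=threshold, criterion='distance')
    # Keep a single feature from each cluster.
    keep = {}
    for col, cid in zip(df.columns, cluster_ids):
        keep.setdefault(cid, col)
    return list(keep.values())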
hypernets.tabular.column_selector module

class hypernets.tabular.column_selector.AutoCategoryColumnSelector(pattern=None, *, dtype_include=None, dtype_exclude=None, cat_exponent=0.5)
class hypernets.tabular.column_selector.ColumnSelector(pattern=None, *, dtype_include=None, dtype_exclude=None)

Bases: sklearn.compose._column_transformer.make_column_selector

Create a callable to select columns to be used with ColumnTransformer. make_column_selector() can select columns based on datatype or on column name with a regex. When using multiple selection criteria, all criteria must match for a column to be selected.

Parameters:
- pattern (str, default=None) – Names of columns containing this regex pattern will be included. If None, column selection will not be based on the pattern.
- dtype_include (column dtype or list of column dtypes, default=None) – A selection of dtypes to include. For more details, see pandas.DataFrame.select_dtypes().
- dtype_exclude (column dtype or list of column dtypes, default=None) – A selection of dtypes to exclude. For more details, see pandas.DataFrame.select_dtypes().

Returns: selector – Callable for column selection to be used by a ColumnTransformer.
Return type: callable

See also
ColumnTransformer – Class that allows combining the outputs of multiple transformer objects used on column subsets of the data into a single feature space.

Examples

>>> from sklearn.preprocessing import StandardScaler, OneHotEncoder
>>> from sklearn.compose import make_column_transformer
>>> from sklearn.compose import make_column_selector
>>> import numpy as np
>>> import pandas as pd  # doctest: +SKIP
>>> X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw'],
...                   'rating': [5, 3, 4, 5]})  # doctest: +SKIP
>>> ct = make_column_transformer(
...       (StandardScaler(),
...        make_column_selector(dtype_include=np.number)),  # rating
...       (OneHotEncoder(),
...        make_column_selector(dtype_include=object)))  # city
>>> ct.fit_transform(X)  # doctest: +SKIP
array([[ 0.90453403,  1.        ,  0.        ,  0.        ],
       [-1.50755672,  1.        ,  0.        ,  0.        ],
       [-0.30151134,  0.        ,  1.        ,  0.        ],
       [ 0.90453403,  0.        ,  0.        ,  1.        ]])
class hypernets.tabular.column_selector.MinMaxColumnSelector(min=None, max=None)

Bases: object
class hypernets.tabular.column_selector.TextColumnSelector(pattern=None, *, dtype_include=None, dtype_exclude=None, word_count_threshold=10)
hypernets.tabular.column_selector.calc_skewness_kurtosis(X_1, X_2, columns=None, smooth_fn=<ufunc 'log'>)

hypernets.tabular.data_cleaner module
class hypernets.tabular.data_cleaner.DataCleaner(nan_chars=None, correct_object_dtype=True, drop_constant_columns=True, drop_duplicated_columns=False, drop_label_nan_rows=True, drop_idness_columns=True, replace_inf_values=nan, drop_columns=None, reserve_columns=None, reduce_mem_usage=False, int_convert_to='float')

Bases: object
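A hedged usage sketch; the fit_transform(X, y) call returning a cleaned (X, y) pair is an assumption based on common usage, not documented in this reference:

import numpy as np
import pandas as pd
from hypernets.tabular.data_cleaner import DataCleaner

df = pd.DataFrame({'f1': [1.0, 2.0, np.inf, 4.0],
                   'const': [1, 1, 1, 1],             # constant -> dropped
                   'rid': ['r0', 'r1', 'r2', 'r3']})  # idness -> dropped
y = pd.Series([0, 1, 0, 1])

cleaner = DataCleaner(nan_chars=['\\N'], drop_constant_columns=True,
                      drop_idness_columns=True)
# Assumed API: returns the cleaned frame and the aligned labels.
X_clean, y_clean = cleaner.fit_transform(df, y)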
hypernets.tabular.data_hasher module

hypernets.tabular.dataframe_mapper module
Adapted from https://github.com/scikit-learn-contrib/sklearn-pandas, with two changes: (1) fixed the confusion of column names; (2) columns may be given as a callable object.
class hypernets.tabular.dataframe_mapper.DataFrameMapper(features, default=False, df_out=False, input_df=False, df_out_dtype_transforms=None)

Bases: sklearn.base.BaseEstimator

Map pandas DataFrame column subsets to their own sklearn transformation.

Parameters:
- features – A list of tuples with feature definitions. The first element is the pandas column selector; it can be a string (for one column) or a list of strings. The second element is an object that supports sklearn's transform interface, or a list of such objects. The third element is optional and, if present, must be a dictionary with the options to apply to the transformation, e.g. {'alias': 'day_of_week'}.
- default – Default transformer to apply to the columns not explicitly selected in the mapper. If False (the default), discard them. If None, pass them through untouched. Any other transformer will be applied to all the unselected columns as a whole, taken as a 2d-array.
- df_out – Return a pandas DataFrame, with each column named after the pandas column that created it (if there is only one input and output), or after the input columns joined with '_' if there are multiple inputs, with '_1', '_2', etc. appended if there are multiple outputs.
- input_df – If True, pass the selected columns to the transformers as a pandas DataFrame or Series; otherwise pass them as a numpy array. Defaults to False.

- fitted_features_
  Type: list of tuple(column_name list, fitted transformer, options)
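A short usage sketch following the tuple format described above and sklearn-pandas conventions; the specific transformers and column names are illustrative:

import pandas as pd
from sklearn.preprocessing import LabelBinarizer, StandardScaler
from hypernets.tabular.dataframe_mapper import DataFrameMapper

df = pd.DataFrame({'pet': ['cat', 'dog', 'dog', 'fish'],
                   'children': [4.0, 6.0, 3.0, 3.0]})
mapper = DataFrameMapper(
    features=[('pet', LabelBinarizer()),          # string selector -> 1d input
              (['children'], StandardScaler())],  # list selector -> 2d input
    df_out=True)  # return a DataFrame with derived column names
Xt = mapper.fit_transform(df)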
class hypernets.tabular.dataframe_mapper.TransformerPipeline(steps)

Bases: sklearn.pipeline.Pipeline

Pipeline that expects all steps to be transformers taking a single X argument and an optional y argument, and having fit and transform methods.

Code is copied from sklearn's Pipeline.
- fit(X, y=None, **fit_params)
  Fit the model. Fit all the transforms one after the other and transform the data, then fit the transformed data using the final estimator.
  Parameters:
  - X (iterable) – Training data. Must fulfill input requirements of the first step of the pipeline.
  - y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  - **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
  Returns: self – This estimator.
  Return type: TransformerPipeline
- fit_transform(X, y=None, **fit_params)
  Fit the model and transform with the final estimator. Fits all the transforms one after the other and transforms the data, then uses fit_transform on the transformed data with the final estimator.
  Parameters:
  - X (iterable) – Training data. Must fulfill input requirements of the first step of the pipeline.
  - y (iterable, default=None) – Training targets. Must fulfill label requirements for all steps of the pipeline.
  - **fit_params (dict of string -> object) – Parameters passed to the fit method of each step, where each parameter name is prefixed such that parameter p for step s has key s__p.
  Returns: Xt – Transformed samples.
  Return type: array-like of shape (n_samples, n_transformed_features)

hypernets.tabular.drift_detection module
class hypernets.tabular.drift_detection.DriftDetector(preprocessor=None, estimator=None, random_state=None)

Bases: object
class hypernets.tabular.drift_detection.FeatureSelectorWithDriftDetection(remove_shift_variable=True, variable_shift_threshold=0.7, variable_shift_scorer=None, auc_threshold=0.55, min_features=10, remove_size=0.1, sample_balance=True, max_test_samples=None, cv=5, random_state=None, callbacks=None)

Bases: object

- parallelizable = True
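The parameters above (note auc_threshold and variable_shift_threshold) suggest the adversarial-validation pattern: a classifier is trained to tell training rows from test rows, and an AUC well above chance signals drift. A self-contained sketch of that idea, not this module's actual code:

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def drift_auc(X_train, X_test, cv=5, random_state=None):
    # Label train rows 0 and test rows 1, then try to tell them apart.
    X = pd.concat([X_train, X_test], axis=0)
    y = np.r_[np.zeros(len(X_train)), np.ones(len(X_test))]
    clf = RandomForestClassifier(random_state=random_state)
    # AUC near 0.5 -> distributions match; well above 0.55 -> drift.
    return cross_val_score(clf, X, y, cv=cv, scoring='roc_auc').mean()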
hypernets.tabular.estimator_detector module

hypernets.tabular.metrics module
class hypernets.tabular.metrics.Metrics

Bases: object

- calc_score(y_preds, y_proba=None, metrics=('accuracy',), task='binary', pos_label=1, classes=None, average=None)
- evaluate(X, y, metrics, *, task=None, pos_label=None, classes=None, average=None, threshold=0.5, n_jobs=-1)
- metric_to_scoring(task='binary', pos_label=None)
- predict(X, *, task=None, classes=None, threshold=0.5, n_jobs=None)
- predict_proba(X, *, n_jobs=None)
- proba2predict(*, task=None, threshold=0.5, classes=None)
hypernets.tabular.metrics.calc_score(y_true, y_preds, y_proba=None, metrics=('accuracy',), task='binary', pos_label=1, classes=None, average=None)

hypernets.tabular.metrics.evaluate(estimator, X, y, metrics, *, task=None, pos_label=None, classes=None, average=None, threshold=0.5, n_jobs=-1)
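An illustrative call to the module-level function above; the metric names and the assumption that a metric-name to value mapping is returned are inferred from the signature, not stated in this reference:

from hypernets.tabular.metrics import calc_score

y_true  = [0, 1, 1, 0]
y_preds = [0, 1, 0, 0]
y_proba = [0.2, 0.9, 0.4, 0.1]

# 'accuracy' is computed from y_preds, 'auc' from y_proba.
scores = calc_score(y_true, y_preds, y_proba,
                    metrics=('accuracy', 'auc'), task='binary', pos_label=1)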
hypernets.tabular.persistence module

hypernets.tabular.pseudo_labeling module
class hypernets.tabular.pseudo_labeling.PseudoLabeling(strategy, threshold=None, quantile=None, number=None)

Bases: object

- DEFAULT_STRATEGY_SETTINGS = {'default_number': 0.2, 'default_quantile': 0.8, 'default_strategy': 'threshold', 'default_threshold': 0.8}
hypernets.tabular.sklearn_ex module
class hypernets.tabular.sklearn_ex.AsTypeTransformer(*, dtype)

Bases: sklearn.base.BaseEstimator
class hypernets.tabular.sklearn_ex.CategorizeEncoder(columns=None, remain_numeric=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.ColumnEncoder

Bases: sklearn.base.BaseEstimator

Encode each column in the dataset with a separate encoder.
class hypernets.tabular.sklearn_ex.ConstantImputer(missing_values=nan, fill_value=None, copy=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.DataFrameWrapper(transform, columns=None)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.DatetimeEncoder(columns=None, include=None, exclude=None, extra=None, drop_constants=True)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin

- all_items = {'day': 'day', 'dayofyear': 'dayofyear', 'hour': 'hour', 'minute': 'minute', 'month': 'month', 'second': 'second', 'timestamp': <function DatetimeEncoder.<lambda>>, 'week': 'week', 'weekday': 'weekday', 'year': 'year'}
- default_include = ['month', 'day', 'hour', 'minute', 'week', 'weekday', 'dayofyear']
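A hedged usage sketch based on the constructor and the include items listed above; the exact derived column names are an assumption:

import pandas as pd
from hypernets.tabular.sklearn_ex import DatetimeEncoder

df = pd.DataFrame({'ts': pd.to_datetime(['2021-01-01 10:30:00',
                                         '2021-06-15 08:00:00'])})
enc = DatetimeEncoder(include=['month', 'weekday', 'hour'])
Xt = enc.fit_transform(df)  # expands 'ts' into month/weekday/hour features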
class hypernets.tabular.sklearn_ex.FeatureImportanceSelection(importances, quantile, min_features=3)

Bases: sklearn.base.BaseEstimator

- important_features
class hypernets.tabular.sklearn_ex.FeatureImportancesSelectionTransformer(task=None, strategy=None, threshold=None, quantile=None, number=None, data_clean=True)

Bases: sklearn.base.BaseEstimator
class hypernets.tabular.sklearn_ex.FeatureSelectionTransformer(task=None, max_train_samples=10000, max_test_samples=10000, max_cols=10000, ratio_select_cols=0.1, n_max_cols=100, n_min_cols=10, reserved_cols=None)

Bases: sklearn.base.BaseEstimator
class hypernets.tabular.sklearn_ex.FloatOutputImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

Bases: sklearn.impute._base.SimpleImputer
class hypernets.tabular.sklearn_ex.LgbmLeavesEncoder(cat_vars, cont_vars, task, **params)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.LocalizedTfidfVectorizer(*, input='content', encoding='utf-8', decode_error='strict', strip_accents=None, lowercase=True, preprocessor=None, tokenizer=None, analyzer='word', stop_words=None, token_pattern='(?u)\b\w\w+\b', ngram_range=(1, 1), max_df=1.0, min_df=1, max_features=None, vocabulary=None, binary=False, dtype=<class 'numpy.float64'>, norm='l2', use_idf=True, smooth_idf=True, sublinear_tf=False)

Bases: sklearn.feature_extraction.text.TfidfVectorizer
class hypernets.tabular.sklearn_ex.LogStandardScaler(copy=True, with_mean=True, with_std=True)

Bases: sklearn.base.BaseEstimator
class hypernets.tabular.sklearn_ex.MultiKBinsDiscretizer(columns=None, bins=None, strategy='quantile')

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.MultiLabelEncoder(columns=None, dtype=None)

Bases: sklearn.base.BaseEstimator
class hypernets.tabular.sklearn_ex.MultiTargetEncoder(n_folds=4, smooth=None, seed=42, split_method='interleaved', dtype=None)

Bases: hypernets.tabular.sklearn_ex.ColumnEncoder

- label_encoder_cls
  alias of sklearn.preprocessing._label.LabelEncoder
- target_encoder_cls
  alias of SlimTargetEncoder
class hypernets.tabular.sklearn_ex.MultiVarLenFeatureEncoder(features)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.SafeLabelEncoder

Bases: sklearn.preprocessing._label.LabelEncoder
class hypernets.tabular.sklearn_ex.SafeOneHotEncoder(*, categories='auto', drop=None, sparse=True, dtype=<class 'numpy.float64'>, handle_unknown='error')

Bases: sklearn.preprocessing._encoders.OneHotEncoder
class hypernets.tabular.sklearn_ex.SafeOrdinalEncoder(*, categories='auto', dtype=<class 'numpy.float64'>, handle_unknown='error', unknown_value=None)

Bases: sklearn.preprocessing._encoders.OrdinalEncoder

Adapted from sklearn's OrdinalEncoder. Encode categorical features as an integer array.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are converted to ordinal integers. This results in a single column of integers (0 to n_categories - 1) per feature.
Read more in the User Guide.
New in version 0.20.
Parameters:
- categories ('auto' or a list of array-like, default='auto') – Categories (unique values) per feature:
  - 'auto' : Determine categories automatically from the training data.
  - list : categories[i] holds the categories expected in the ith column. The passed categories should not mix strings and numeric values, and should be sorted in case of numeric values.
  The used categories can be found in the categories_ attribute.
- dtype (number type, default np.float64) – Desired dtype of output.
- handle_unknown ({'error', 'use_encoded_value'}, default='error') – When set to 'error', an error will be raised if an unknown categorical feature is present during transform. When set to 'use_encoded_value', the encoded value of unknown categories will be set to the value given for the parameter unknown_value. In inverse_transform(), an unknown category will be denoted as None. New in version 0.24.
- unknown_value (int or np.nan, default=None) – When handle_unknown is set to 'use_encoded_value', this parameter is required and will set the encoded value of unknown categories. It has to be distinct from the values used to encode any of the categories in fit. If set to np.nan, the dtype parameter must be a float dtype. New in version 0.24.
- categories_
  The categories of each feature determined during fit (in order of the features in X, and corresponding with the output of transform). This does not include categories that weren't seen during fit.
  Type: list of arrays
See also
OneHotEncoder – Performs a one-hot encoding of categorical features.
LabelEncoder – Encodes target labels with values between 0 and n_classes-1.
Examples
Given a dataset with two features, we let the encoder find the unique values per feature and transform the data to an ordinal encoding.
>>> from sklearn.preprocessing import OrdinalEncoder >>> enc = OrdinalEncoder() >>> X = [['Male', 1], ['Female', 3], ['Female', 2]] >>> enc.fit(X) OrdinalEncoder() >>> enc.categories_ [array(['Female', 'Male'], dtype=object), array([1, 2, 3], dtype=object)] >>> enc.transform([['Female', 3], ['Male', 1]]) array([[0., 2.], [1., 0.]])
>>> enc.inverse_transform([[1, 0], [0, 1]]) array([['Male', 1], ['Female', 2]], dtype=object)
class hypernets.tabular.sklearn_ex.SafeSimpleImputer(*, missing_values=nan, strategy='mean', fill_value=None, verbose=0, copy=True, add_indicator=False)

Bases: sklearn.impute._base.SimpleImputer

Passes bool columns through unchanged.
- fit(X, y=None)
  Fit the imputer on X.
  Parameters: X ({array-like, sparse matrix}, shape (n_samples, n_features)) – Input data, where n_samples is the number of samples and n_features is the number of features.
  Returns: self
  Return type: SimpleImputer
class hypernets.tabular.sklearn_ex.SkewnessKurtosisTransformer(transform_fn=None, skew_threshold=0.5, kurtosis_threshold=0.5)

Bases: sklearn.base.BaseEstimator
class hypernets.tabular.sklearn_ex.SlimTargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved', dtype=None, output_2d=False)

Bases: hypernets.tabular.sklearn_ex.TargetEncoder

A slimmed TargetEncoder whose 'train' and 'train_encode' attributes are set to None.
- fit(X, y)
  Fit a TargetEncoder instance to a set of categories.
  Parameters:
  - X (cudf.Series, cudf.DataFrame or cupy.ndarray) – Categories to be encoded. Its elements may or may not be unique.
  - y (cudf.Series or cupy.ndarray) – Series containing the target variable.
  Returns: self – A fitted instance of itself to allow method chaining.

- fit_transform(X, y)
  Simultaneously fit and transform an input.
  This is functionally equivalent to (but faster than) TargetEncoder().fit(y).transform(y).

- split_method

- transform(X)
  Transform an input into its categorical keys.
  This is intended for test data. For fitting and transforming the training data, prefer fit_transform.
  Parameters: X (cudf.Series) – Input keys to be transformed. Its values don't have to match the categories given to fit.
  Returns: encoded – The ordinally encoded input series.
  Return type: cupy.ndarray
class hypernets.tabular.sklearn_ex.TargetEncoder(n_folds=4, smooth=0, seed=42, split_method='interleaved')

Bases: sklearn.base.BaseEstimator

Adapted from cuml.preprocessing.TargetEncoder.
- fit(x, y)
  Fit a TargetEncoder instance to a set of categories.
  Parameters:
  - x (cudf.Series, cudf.DataFrame or cupy.ndarray) – Categories to be encoded. Its elements may or may not be unique.
  - y (cudf.Series or cupy.ndarray) – Series containing the target variable.
  Returns: self – A fitted instance of itself to allow method chaining.

- fit_transform(x, y)
  Simultaneously fit and transform an input.
  This is functionally equivalent to (but faster than) TargetEncoder().fit(y).transform(y).

- transform(x)
  Transform an input into its categorical keys.
  This is intended for test data. For fitting and transforming the training data, prefer fit_transform.
  Parameters: x (cudf.Series) – Input keys to be transformed. Its values don't have to match the categories given to fit.
  Returns: encoded – The ordinally encoded input series.
  Return type: cupy.ndarray
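A hedged usage sketch of the documented fit_transform/transform API; acceptance of plain pandas inputs alongside the cudf/cupy types listed above is an assumption:

import pandas as pd
from hypernets.tabular.sklearn_ex import TargetEncoder

x = pd.Series(['a', 'b', 'a', 'b', 'a'])
y = pd.Series([1, 0, 1, 1, 0])

enc = TargetEncoder(n_folds=2, smooth=0)
train_codes = enc.fit_transform(x, y)  # out-of-fold target means for train
test_codes = enc.transform(pd.Series(['a', 'c']))  # 'c' unseen at fit time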
class hypernets.tabular.sklearn_ex.TfidfEncoder(columns=None, flatten=False, **kwargs)

Bases: sklearn.base.BaseEstimator, sklearn.base.TransformerMixin
class hypernets.tabular.sklearn_ex.VarLenFeatureEncoder(sep='|')

Bases: object

- max_element_length
- n_classes
hypernets.tabular.toolbox module
class hypernets.tabular.toolbox.ToolBox

Bases: object

- STRATEGY_NUMBER = 'number'
- STRATEGY_QUANTILE = 'quantile'
- STRATEGY_THRESHOLD = 'threshold'
- acceptable_types = (<class 'numpy.ndarray'>, <class 'pandas.core.frame.DataFrame'>, <class 'pandas.core.series.Series'>)
- static collapse_last_dim(arr, keep_dim=True)
  Collapse the last dimension.
  Parameters:
  - arr – data array
  - keep_dim – whether to keep the last dim as one
- column_selector
  alias of the hypernets.tabular.column_selector module
- compute_class_weight(*, classes, y)
  Estimate class weights for unbalanced datasets.
  Parameters:
  - class_weight (dict, 'balanced' or None) – If 'balanced', class weights will be given by n_samples / (n_classes * np.bincount(y)). If a dictionary is given, keys are classes and values are the corresponding class weights. If None is given, the class weights will be uniform.
  - classes (ndarray) – Array of the classes occurring in the data, as given by np.unique(y_org) with y_org the original class labels.
  - y (array-like of shape (n_samples,)) – Array of original class labels per sample.
  Returns: class_weight_vect – Array with class_weight_vect[i] the weight for the i-th class.
  Return type: ndarray of shape (n_classes,)
  References
  The "balanced" heuristic is inspired by Logistic Regression in Rare Events Data, King, Zeng, 2001.
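A worked example of the 'balanced' formula, shown with scikit-learn's function of the same name (the docstring above matches sklearn's):

import numpy as np
from sklearn.utils.class_weight import compute_class_weight

y = np.array([0, 0, 0, 1])  # 3 samples of class 0, 1 of class 1
w = compute_class_weight('balanced', classes=np.array([0, 1]), y=y)
# n_samples / (n_classes * np.bincount(y)) = 4 / (2 * [3, 1])
print(w)  # [0.667..., 2.0]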
- classmethod data_cleaner(nan_chars=None, correct_object_dtype=True, drop_constant_columns=True, drop_duplicated_columns=False, drop_label_nan_rows=True, drop_idness_columns=True, replace_inf_values=nan, drop_columns=None, reserve_columns=None, reduce_mem_usage=False, int_convert_to='float')
- static detect_strategy(strategy, *, threshold=None, quantile=None, number=None, default_strategy, default_threshold, default_quantile, default_number)
- classmethod detect_strategy_of_feature_selection_by_importance(strategy, *, threshold=None, quantile=None, number=None)
- classmethod estimator_detector(name_or_cls, task, *, init_kwargs=None, fit_kwargs=None, n_samples=100, n_features=5)
- classmethod feature_selector_with_drift_detection(remove_shift_variable=True, variable_shift_threshold=0.7, variable_shift_scorer=None, auc_threshold=0.55, min_features=10, remove_size=0.1, sample_balance=True, max_test_samples=None, cv=5, random_state=None, callbacks=None)
- classmethod feature_selector_with_feature_importances(strategy=None, threshold=None, quantile=None, number=None)
- classmethod greedy_ensemble(task, estimators, need_fit=False, n_folds=5, method='soft', random_state=9527, scoring='neg_log_loss', ensemble_size=0)
- metrics
  alias of hypernets.tabular.metrics.Metrics
- static permutation_importance(estimator, X, y, *, scoring=None, n_repeats=5, n_jobs=None, random_state=None, sample_weight=None, max_samples=1.0)
  See sklearn.inspection.permutation_importance.
- classmethod permutation_importance_batch(estimators, X, y, scoring=None, n_repeats=5, n_jobs=None, random_state=None)
  Evaluate the importance of features for a set of estimators.
  Parameters:
  - estimators (list) – A set of estimators that have already been fitted and are compatible with scorer.
  - X (ndarray or DataFrame, shape (n_samples, n_features)) – Data on which permutation importance will be computed.
  - y (array-like or None, shape (n_samples,) or (n_samples, n_classes)) – Targets for supervised learning, or None for unsupervised.
  - scoring (string, callable or None, default=None) – Scorer to use. It can be a single string (see scoring_parameter) or a callable (see scoring). If None, the estimator's default scorer is used.
  - n_repeats (int, default=5) – Number of times to permute a feature.
  - n_jobs (int or None, default=None) – The number of jobs to use for the computation. None means 1 unless in a joblib.parallel_backend context; -1 means using all processors. See Glossary for more details.
  - random_state (int, RandomState instance, or None, default=None) – Pseudo-random number generator to control the permutations of each feature. See random_state.
  Returns: result – Dictionary-like object, with attributes:
  - importances_mean (ndarray, shape (n_features,)) – Mean of feature importance over n_repeats.
  - importances_std (ndarray, shape (n_features,)) – Standard deviation over n_repeats.
  - importances (ndarray, shape (n_features, n_repeats)) – Raw permutation importance scores.
  Return type: Bunch
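An illustrative call following the documented signature and return attributes; calling the classmethod on the base ToolBox with in-memory numpy data is assumed to work (acceptable_types above includes ndarray):

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from hypernets.tabular.toolbox import ToolBox

X, y = make_classification(n_samples=200, n_features=5, random_state=0)
estimators = [RandomForestClassifier(random_state=0).fit(X, y),
              LogisticRegression(max_iter=1000).fit(X, y)]
result = ToolBox.permutation_importance_batch(
    estimators, X, y, scoring='accuracy', n_repeats=5, random_state=0)
print(result.importances_mean)  # one mean importance per feature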
- classmethod select_feature_by_importance(feature_importance, strategy=None, threshold=None, quantile=None, number=None)
- train_test_split(*, test_size=None, train_size=None, random_state=None, shuffle=True, stratify=None)
  Split arrays or matrices into random train and test subsets.
  Quick utility that wraps input validation, next(ShuffleSplit().split(X, y)), and application to input data into a single call for splitting (and optionally subsampling) data in a one-liner.
  Read more in the User Guide.
  Parameters:
  - *arrays (sequence of indexables with same length / shape[0]) – Allowed inputs are lists, numpy arrays, scipy-sparse matrices or pandas dataframes.
  - test_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the test split. If int, represents the absolute number of test samples. If None, the value is set to the complement of the train size. If train_size is also None, it will be set to 0.25.
  - train_size (float or int, default=None) – If float, should be between 0.0 and 1.0 and represent the proportion of the dataset to include in the train split. If int, represents the absolute number of train samples. If None, the value is automatically set to the complement of the test size.
  - random_state (int, RandomState instance or None, default=None) – Controls the shuffling applied to the data before applying the split. Pass an int for reproducible output across multiple function calls. See Glossary.
  - shuffle (bool, default=True) – Whether or not to shuffle the data before splitting. If shuffle=False then stratify must be None.
  - stratify (array-like, default=None) – If not None, data is split in a stratified fashion, using this as the class labels. Read more in the User Guide.
  Returns: splitting – List containing train-test split of inputs. New in version 0.16: If the input is sparse, the output will be a scipy.sparse.csr_matrix. Else, output type is the same as the input type.
  Return type: list, length = 2 * len(arrays)

  Examples
>>> import numpy as np >>> from sklearn.model_selection import train_test_split >>> X, y = np.arange(10).reshape((5, 2)), range(5) >>> X array([[0, 1], [2, 3], [4, 5], [6, 7], [8, 9]]) >>> list(y) [0, 1, 2, 3, 4]
>>> X_train, X_test, y_train, y_test = train_test_split( ... X, y, test_size=0.33, random_state=42) ... >>> X_train array([[4, 5], [0, 1], [6, 7]]) >>> y_train [2, 0, 3] >>> X_test array([[2, 3], [8, 9]]) >>> y_test [1, 4]
>>> train_test_split(y, shuffle=False) [[0, 1, 2], [3, 4]]