Transformers that prepare data for other estimators. This module is styled after scikit-learn's preprocessing module: https://scikit-learn.org/stable/modules/preprocessing.html.
Classes
KBinsDiscretizer
KBinsDiscretizer(
n_bins: int = 5, strategy: typing.Literal["uniform", "quantile"] = "quantile"
)
Bin continuous data into intervals.
Parameters | |
---|---|
Name | Description |
n_bins |
int, default 5
The number of bins to produce. Raises ValueError if |
strategy |
{'uniform', 'quantile'}, default='quantile'
Strategy used to define the widths of the bins. 'uniform': All bins in each feature have identical widths. 'quantile': All bins in each feature have the same number of points. |
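The difference between the two strategies can be illustrated with a small standard-library sketch. This is not the library's implementation: the helper names (`uniform_edges`, `quantile_edges`, `assign_bins`) are hypothetical, and the exact quantile interpolation the real transformer uses is an assumption.

```python
from bisect import bisect_right
from statistics import quantiles


def uniform_edges(values, n_bins):
    """Interior bin edges that split [min, max] into equal-width bins."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    return [lo + i * width for i in range(1, n_bins)]


def quantile_edges(values, n_bins):
    """Interior bin edges chosen so each bin holds about the same number of points."""
    # Assumption: "inclusive" interpolation; the real transformer may differ.
    return quantiles(values, n=n_bins, method="inclusive")


def assign_bins(values, edges):
    """Map each value to the index of the bin it falls into."""
    return [bisect_right(edges, v) for v in values]


data = [1, 2, 3, 4, 5, 6, 7, 100]  # one large outlier

uniform_bins = assign_bins(data, uniform_edges(data, 4))
quantile_bins = assign_bins(data, quantile_edges(data, 4))

print(uniform_bins)   # [0, 0, 0, 0, 0, 0, 0, 3]
print(quantile_bins)  # [0, 0, 1, 1, 2, 2, 3, 3]
```

With an outlier in the data, 'uniform' puts almost every point in the first bin, while 'quantile' spreads the points evenly across bins.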
LabelEncoder
LabelEncoder(
min_frequency: typing.Optional[int] = None,
max_categories: typing.Optional[int] = None,
)
Encode target labels with value between 0 and n_classes-1.
This transformer should be used to encode target values, i.e. y, and not the input X.
Parameters | |
---|---|
Name | Description |
min_frequency |
Optional[int], default None
Specifies the minimum frequency below which a category will be considered infrequent. Default None. int: categories with a smaller cardinality will be considered infrequent as index 0. |
max_categories |
Optional[int], default None
Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. Default None, which sets the limit to 1,000,000. |
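The effect of min_frequency can be sketched in plain Python. This is a conceptual illustration only, not the library's code: the helper names are hypothetical, and the alphabetical ordering and 1-based indexes for frequent labels are assumptions. max_categories is not modeled here.

```python
from collections import Counter


def fit_label_encoder(labels, min_frequency=None):
    """Build a label -> index mapping; index 0 is reserved for infrequent labels."""
    counts = Counter(labels)
    frequent = sorted(
        label
        for label, count in counts.items()
        if min_frequency is None or count >= min_frequency
    )
    # Assumption: frequent labels are numbered from 1 in sorted order.
    return {label: index for index, label in enumerate(frequent, start=1)}


def encode_labels(labels, mapping):
    """Encode labels; anything missing from the mapping becomes index 0."""
    return [mapping.get(label, 0) for label in labels]


y = ["cat", "dog", "cat", "bird", "dog", "cat"]
mapping = fit_label_encoder(y, min_frequency=2)
encoded = encode_labels(y, mapping)

print(mapping)  # {'cat': 1, 'dog': 2}
print(encoded)  # [1, 2, 1, 0, 2, 1]
```

Here "bird" appears only once, which is below min_frequency=2, so it is grouped under index 0.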
MaxAbsScaler
MaxAbsScaler()
Scale each feature by its maximum absolute value.
This estimator scales and translates each feature individually such that the maximal absolute value of each feature in the training set will be 1.0. It does not shift/center the data, and thus does not destroy any sparsity.
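As a rough illustration of the arithmetic, the following plain-Python sketch divides a column by its maximum absolute value. The function names are hypothetical and this is not the library's implementation.

```python
def fit_max_abs(column):
    """Return the maximum absolute value seen in the training column."""
    return max(abs(value) for value in column)


def transform_max_abs(column, max_abs):
    """Divide by the fitted maximum; zeros stay zero, so sparsity is preserved."""
    if max_abs == 0:
        return [0.0 for _ in column]  # all-zero training column: nothing to scale
    return [value / max_abs for value in column]


train = [-4.0, 0.0, 2.0]
max_abs = fit_max_abs(train)
scaled = transform_max_abs(train, max_abs)

print(scaled)  # [-1.0, 0.0, 0.5]
```

Because the data is only divided and never shifted, a zero entry in the input is still zero in the output.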
MinMaxScaler
MinMaxScaler()
Transform features by scaling each feature to a given range.
This estimator scales and translates each feature individually such that it is in the given range on the training set, e.g. between zero and one.
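The sketch below shows the underlying arithmetic for a target range of zero to one, which is assumed here because the constructor takes no range argument. The helper names are hypothetical and the code is a conceptual illustration, not the library's implementation.

```python
def fit_min_max(column):
    """Return the minimum and maximum of the training column."""
    return min(column), max(column)


def transform_min_max(column, lo, hi):
    """Rescale values so the training minimum maps to 0 and the maximum to 1."""
    span = hi - lo
    if span == 0:
        return [0.0 for _ in column]  # constant training column
    return [(value - lo) / span for value in column]


train = [10.0, 20.0, 30.0]
lo, hi = fit_min_max(train)

scaled_train = transform_min_max(train, lo, hi)
scaled_new = transform_min_max([40.0], lo, hi)

print(scaled_train)  # [0.0, 0.5, 1.0]
print(scaled_new)    # [1.5]
```

Only the training set is guaranteed to land inside the range; later data that exceeds the fitted minimum or maximum maps outside it, as the value 40.0 does here.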
OneHotEncoder
OneHotEncoder(
drop: typing.Optional[typing.Literal["most_frequent"]] = None,
min_frequency: typing.Optional[int] = None,
max_categories: typing.Optional[int] = None,
)
Encode categorical features as a one-hot format.
The input to this transformer should be an array-like of integers or strings, denoting the values taken on by categorical (discrete) features. The features are encoded using a one-hot (aka 'one-of-K' or 'dummy') encoding scheme.
Note that this method deviates from scikit-learn; instead of producing sparse binary columns, the encoding is a single column of STRUCT<index INT64, value DOUBLE>.
Examples:
Given a dataset with two features, we let the encoder find the unique
values per feature and transform the data to a binary one-hot encoding.
>>> from bigframes.ml.preprocessing import OneHotEncoder
>>> import bigframes.pandas as bpd
>>> bpd.options.display.progress_bar = None
>>> enc = OneHotEncoder()
>>> X = bpd.DataFrame({"a": ["Male", "Female", "Female"], "b": ["1", "3", "2"]})
>>> enc.fit(X)
OneHotEncoder()
>>> print(enc.transform(bpd.DataFrame({"a": ["Female", "Male"], "b": ["1", "4"]})))
onehotencoded_a onehotencoded_b
0 [{'index': 1, 'value': 1.0}] [{'index': 1, 'value': 1.0}]
1 [{'index': 2, 'value': 1.0}] [{'index': 0, 'value': 1.0}]
<BLANKLINE>
[2 rows x 2 columns]
Parameters | |
---|---|
Name | Description |
drop |
Optional[Literal["most_frequent"]], default None
Specifies a methodology to use to drop one of the categories per feature. This is useful in situations where perfectly collinear features cause problems, such as when feeding the resulting data into an unregularized linear regression model. However, dropping one category breaks the symmetry of the original representation and can therefore induce a bias in downstream models, for instance for penalized linear classification or regression models. Default None: retain all the categories. "most_frequent": Drop the most frequent category found in the string expression. Selecting this value causes the function to use dummy encoding. |
min_frequency |
Optional[int], default None
Specifies the minimum frequency below which a category will be considered infrequent. Default None. int: categories with a smaller cardinality will be considered infrequent as index 0. |
max_categories |
Optional[int], default None
Specifies an upper limit to the number of output features for each input feature when considering infrequent categories. If there are infrequent categories, max_categories includes the category representing the infrequent categories along with the frequent categories. Default None, which sets the limit to 1,000,000. |
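To make the STRUCT<index, value> output format more concrete, here is a plain-Python sketch that reproduces the values shown in the example above. It is not the library's implementation: the helper names are hypothetical, and the rules it encodes (categories indexed from 1 in sorted order, index 0 for values not seen during fit) are inferred from that example output.

```python
def fit_one_hot(column):
    """Map each distinct category to a 1-based index in sorted order."""
    # Assumption: sorted order and 1-based indexes, inferred from the example output.
    return {category: index for index, category in enumerate(sorted(set(column)), start=1)}


def transform_one_hot(column, mapping):
    """Encode each value as a list holding one {index, value} record."""
    # Values not seen during fit fall back to index 0.
    return [[{"index": mapping.get(value, 0), "value": 1.0}] for value in column]


a_map = fit_one_hot(["Male", "Female", "Female"])
b_map = fit_one_hot(["1", "3", "2"])

encoded_a = transform_one_hot(["Female", "Male"], a_map)
encoded_b = transform_one_hot(["1", "4"], b_map)

print(encoded_a)  # [[{'index': 1, 'value': 1.0}], [{'index': 2, 'value': 1.0}]]
print(encoded_b)  # [[{'index': 1, 'value': 1.0}], [{'index': 0, 'value': 1.0}]]
```

Storing only the active index keeps the representation compact for high-cardinality features, since no column is materialized per category.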
StandardScaler
StandardScaler()
Standardize features by removing the mean and scaling to unit variance.
The standard score of a sample x is calculated as:
z = (x - u) / s
where u is the mean of the training samples or zero if with_mean=False, and s is the standard deviation of the training samples or one if with_std=False.
Centering and scaling happen independently on each feature by computing the relevant statistics on the samples in the training set. Mean and standard deviation are then stored to be used on later data using transform.
Standardization of a dataset is a common requirement for many machine learning estimators: they might behave badly if the individual features do not more or less look like standard normally distributed data (e.g. Gaussian with 0 mean and unit variance).
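The formula z = (x - u) / s can be checked by hand with a short standard-library sketch. This is a conceptual illustration with hypothetical helper names, not the library's implementation, and it assumes the population standard deviation; the real transformer may use the sample standard deviation instead, which would give different values.

```python
from statistics import fmean, pstdev


def fit_standard(column):
    """Return the mean u and standard deviation s of the training column."""
    # Assumption: population standard deviation (divides by n, not n - 1).
    return fmean(column), pstdev(column)


def transform_standard(column, u, s):
    """Apply z = (x - u) / s using the statistics stored at fit time."""
    return [(x - u) / s for x in column]


train = [0, 0, 1, 1]
u, s = fit_standard(train)  # u = 0.5, s = 0.5 under the stated assumption

z_train = transform_standard(train, u, s)
z_new = transform_standard([2], u, s)

print(z_train)  # [-1.0, -1.0, 1.0, 1.0]
print(z_new)    # [3.0]
```

The new value 2 is scaled with the mean and standard deviation learned from the training column, not with statistics of its own.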
Examples:
.. code-block::
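from bigframes.ml.preprocessing import StandardScaler
import bigframes.pandas as bpd
scaler = StandardScaler()
data = bpd.DataFrame({"a": [0, 0, 1, 1], "b": [0, 0, 1, 1]})
# Learn the per-column mean and standard deviation from the training data.
scaler.fit(data)
# Standardize the training data with the fitted statistics.
print(scaler.transform(data))
# New data is scaled with the same stored statistics.
print(scaler.transform(bpd.DataFrame({"a": [2], "b": [2]})))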