How to Run CatBoost in a Sklearn Pipeline

Xovee, 2024-07-01 12:33

Introduction

One of CatBoost's main strengths is that it handles categorical features well. When it is placed inside a Sklearn Pipeline, however, the following error can occur:

_catboost.CatBoostError: 'data' is numpy array of floating point numerical type, it means no categorical features, but 'cat_features' parameter specifies nonzero number of categorical features

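The error occurs because CatBoost checks cat_features against the training data it receives, which in the usual case is a pandas.DataFrame. Inside a Pipeline, the data that reaches the classifier's .fit() has already been modified by the preceding steps: the DataFrame's column names are gone and the columns are identified only by integer positions. As the error message indicates, the classifier here receives a floating-point NumPy array rather than a DataFrame.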
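To make this concrete, here is a minimal sketch of what a preprocessing step can do to a DataFrame. The toy data and the ColumnTransformer setup are assumptions made for illustration; the original post does not show how its `preprocessor` is built.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data: one categorical column and one numeric column.
X = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 4.5],
})

# Hypothetical preprocessor: integer-code the categorical column, scale the numeric one.
preprocessor = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["color"]),
    ("num", StandardScaler(), ["size"]),
])

out = preprocessor.fit_transform(X)

print(type(out))                # a NumPy array, not a DataFrame
print(out.dtype)                # a floating-point dtype
print(hasattr(out, "columns"))  # the column names "color" and "size" are gone
```

With sklearn's default output settings, this is the kind of input the final pipeline step would see, which matches the situation described in the error message.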

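Solution

We run the Pipeline's preprocessing step on the data ahead of time, convert the raw data into its preprocessed form, find which of the resulting columns are categorical features, and pass their indices to CatBoost.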

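import pandas as pd
from sklearn.pipeline import Pipeline

# Define your pipeline; `preprocessor`, `model` (the CatBoost estimator),
# `X_train` and `y_train` are assumed to be defined already.
pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', model),
])

# Run the preprocessing step on its own to see what the classifier will receive.
preprocessor.fit(X_train)
transformed_X_train = pd.DataFrame(preprocessor.transform(X_train)).convert_dtypes()

# Treat the integer and boolean columns of the transformed data as categorical
# and collect their positions.
cat_columns = transformed_X_train.select_dtypes(include=['int64', 'bool']).columns
new_cat_feature_idx = [transformed_X_train.columns.get_loc(col) for col in cat_columns]

# The `classifier__` prefix routes cat_features to the fit() of the 'classifier' step.
pipeline.fit(X_train, y_train, classifier__cat_features=new_cat_feature_idx)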
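
The index-detection step can be tried on its own, without CatBoost. The sketch below uses hypothetical toy data and a hypothetical ColumnTransformer (neither appears in the original post) and applies the same convert_dtypes() and select_dtypes() calls as above.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# Hypothetical toy data: one categorical column and one numeric column.
X_train = pd.DataFrame({
    "color": ["red", "blue", "red", "green"],
    "size": [1.0, 2.5, 3.0, 4.5],
})

# Hypothetical preprocessor: integer-code the categorical column, scale the numeric one.
preprocessor = ColumnTransformer([
    ("cat", OrdinalEncoder(), ["color"]),
    ("num", StandardScaler(), ["size"]),
])

preprocessor.fit(X_train)
transformed_X_train = pd.DataFrame(preprocessor.transform(X_train)).convert_dtypes()

# convert_dtypes() turns the whole-number column into a nullable integer dtype,
# which is what the dtype filter below picks up.
cat_columns = transformed_X_train.select_dtypes(include=["int64", "bool"]).columns
new_cat_feature_idx = [transformed_X_train.columns.get_loc(col) for col in cat_columns]

print(transformed_X_train.dtypes)
print(new_cat_feature_idx)
```

Here only position 0, the encoded "color" column, is reported as categorical. Two caveats apply to this heuristic. A numeric column that happens to contain only whole numbers would also be flagged as categorical, so the resulting list is worth checking by hand. And this toy preprocessor still outputs a floating-point array, which the error message quoted earlier says CatBoost rejects when cat_features is set, so in a real pipeline the categorical columns presumably need to reach the classifier as integers or strings.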