Python Machine Learning Notes (16: Data Representation and Feature Engineering, Categorical Variables)

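In machine learning we usually assume that the data is a two-dimensional array of floating-point numbers, where each column is a **continuous feature** describing the data points. Real-world data is not always collected this way.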

A particularly common type of feature is the categorical feature, also known as a discrete feature. Such features are usually not numeric.

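The difference between categorical and continuous features is analogous to the difference between classification and regression, only on the input side rather than the output side. Examples of continuous features we have already seen are pixel brightnesses and size measurements of iris flowers. Examples of categorical features are the brand of a product, its color, or the department (books, clothing, hardware) it is sold in. These are all properties that describe a product, but they do not vary in a continuous way. A product belongs either to the clothing department or to the books department. There is no middle ground between books and clothing, and no natural order among the categories (books is neither greater nor less than clothing, hardware is not between books and clothing, and so on).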

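Regardless of the types of features your data consists of, how you represent them can have an enormous effect on the performance of machine learning models. Scaling is one example: for models that are sensitive to feature scale, it makes a difference whether an unscaled measurement is expressed in centimeters or in inches. It can also help to augment the data with additional features, such as interactions (products) of features or more general polynomials.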

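The question of how to best represent the data is known as feature engineering, and it is one of the main tasks of data scientists and machine learning practitioners trying to solve real-world problems. Representing the data in the right way can have a bigger influence on the performance of a supervised model than the exact parameters you choose.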

I. Categorical Variables

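We will learn through an example dataset and code. The dataset is adult, a dataset of adult incomes in the United States from 1994. The task is to predict whether a worker's income is above or below $50,000. The features of the adult dataset include the worker's age, the way they are employed (workclass), their education, their gender, their working hours per week, their occupation, and more.

We load the dataset with pandas and print the first five rows: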

import pandas as pd

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
print(data.head())

Output:

| | age | workclass | education | gender | hours-per-week | occupation | income |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 39 | State-gov | Bachelors | Male | 40 | Adm-clerical | <=50K |
| 1 | 50 | Self-emp-not-inc | Bachelors | Male | 13 | Exec-managerial | <=50K |
| 2 | 38 | Private | HS-grad | Male | 40 | Handlers-cleaners | <=50K |
| 3 | 53 | Private | 11th | Male | 40 | Handlers-cleaners | <=50K |
| 4 | 28 | Private | Bachelors | Female | 40 | Prof-specialty | <=50K |

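This is a classification task with two classes: income <=50K and income >50K. Predicting the exact income would instead be a regression task, which is a much harder problem. We stick to classification here, and understanding the 50K division is interesting in its own right.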

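In this dataset, age and hours-per-week are continuous features, while workclass, education, gender, and occupation are categorical features. Each of them takes values from a fixed list of possibilities (rather than a range) and denotes a qualitative property (rather than a quantity).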

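To start, suppose we want to learn a logistic regression classifier on this data. Logistic regression makes its prediction ŷ with the following formula:

ŷ = w[0] * x[0] + w[1] * x[1] + ... + w[p] * x[p] + b > 0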

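Here w[i] and b are coefficients learned from the training set, and x[i] are the input features. The formula makes sense when the x[i] are numbers, but not when x[2] is "Masters" or "Bachelors". To apply logistic regression we need to represent the data in a different way.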
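The point about numeric inputs can be checked directly. The following is a minimal sketch with made-up coefficients (the values of `w`, `b`, and both data points are assumptions chosen only for illustration, not learned from the adult data):

```python
# Hypothetical coefficients and intercept, chosen only for illustration
w = [0.5, -1.0, 2.0]
b = -0.25

# With purely numeric features the weighted sum is well defined
x_numeric = [1.0, 0.5, 0.25]
decision = sum(w_i * x_i for w_i, x_i in zip(w, x_numeric)) + b
y_hat = decision > 0
print(decision, y_hat)

# With a string in place of x[2], the same expression cannot be evaluated
x_mixed = [1.0, 0.5, "Masters"]
try:
    sum(w_i * x_i for w_i, x_i in zip(w, x_mixed)) + b
    string_feature_failed = False
except TypeError as err:
    string_feature_failed = True
    print("Cannot compute the weighted sum:", err)
```

The second computation raises a TypeError because a float cannot be multiplied by a string, which is the practical form of "the formula does not make sense".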

1. One-Hot Encoding (Dummy Variables)

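By far the most common way to represent categorical variables is one-hot encoding, also called one-out-of-N encoding or dummy variables. The idea is to replace a categorical variable with one or more new features that take the values 0 and 1. The values 0 and 1 make sense in the formula for linear binary classification (and for all other models in scikit-learn), and we can represent any number of categories by introducing one new feature per category.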

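For example, suppose the workclass feature has the possible values "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". To encode these four possible values, we create four new features, called "Government Employee", "Private Employee", "Self Employed", and "Self Employed Incorporated". A feature is 1 if the person's workclass has the corresponding value and 0 otherwise, so exactly one of the four new features is 1 for each data point. This is why the scheme is called one-hot or one-out-of-N encoding. The principle is illustrated in the figure below: a single feature is encoded using four new features. When this data is used in a machine learning algorithm, the original workclass feature is dropped and only the 0-1 features are kept.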
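The encoding described above can be written out by hand. This is a minimal sketch on four made-up rows (the sample values in `workclass` are assumptions for illustration; the category names come from the text):

```python
import pandas as pd

categories = ["Government Employee", "Private Employee",
              "Self Employed", "Self Employed Incorporated"]

# Hypothetical workclass values for four people
workclass = pd.Series(["Government Employee", "Private Employee",
                       "Self Employed", "Private Employee"], name="workclass")

# One new 0/1 column per category: 1 where the row has that value, else 0
one_hot = pd.DataFrame({cat: (workclass == cat).astype(int) for cat in categories})
print(one_hot)

# Each row contains exactly one 1 across the four new columns
print(one_hot.sum(axis=1).tolist())  # [1, 1, 1, 1]
```

Note that "Self Employed Incorporated" still gets a column even though no row takes that value, because the list of categories was fixed up front.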

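Note: one-hot encoding is quite similar to, but not identical to, the dummy encoding used in statistics. For simplicity we encode each category as its own binary feature. In statistics it is common to encode a categorical feature with k possible values into k-1 features (the last value is represented by all of them being zero). This is done to simplify the analysis (it keeps the data matrix from becoming rank-deficient).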
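For comparison, pandas can produce the k-1 variant through the `drop_first` parameter of `get_dummies`. A minimal sketch on made-up data follows; note that pandas drops the first category rather than the last, so here the all-zero row stands for the first value:

```python
import pandas as pd

# Hypothetical column with k = 4 distinct values
df = pd.DataFrame({"workclass": ["Government Employee", "Private Employee",
                                 "Self Employed", "Self Employed Incorporated"]})

full = pd.get_dummies(df, dtype=int)                      # k columns (one-hot)
reduced = pd.get_dummies(df, drop_first=True, dtype=int)  # k-1 columns

print(full.shape[1], reduced.shape[1])  # 4 3
print(reduced)
```

Which category is dropped does not change the information content; it only changes which value is treated as the baseline.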

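You can convert data to a one-hot encoding of categorical variables with either pandas or scikit-learn. Using pandas is slightly easier, so we start there. After loading the dataset, first check whether each column actually contains meaningful categorical data. In the gender column, for example, some people might have entered "male" and others "man", and we want both to be represented by the same category. The value_counts method of a pandas Series shows the unique values and how often each appears: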

import pandas as pd

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

# Inspect the gender column
print(data.gender.value_counts())

Output:

gender
Male      21790
Female    10771
Name: count, dtype: int64
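When a column does contain inconsistent spellings such as "male" and "man", it can be normalized before encoding. The following is a minimal sketch on made-up responses (the sample values and the `synonyms` mapping are assumptions, not taken from adult.data):

```python
import pandas as pd

# Hypothetical survey responses with inconsistent spellings
gender = pd.Series([" Male", "male", "man", "Female", " female ", "woman"])

# Strip whitespace and lowercase, then map known synonyms onto one label each
synonyms = {"man": "male", "woman": "female"}
cleaned = gender.str.strip().str.lower().replace(synonyms)

print(cleaned.value_counts())
```

Stripping whitespace matters in practice: a value read with a leading space (" Male") is a different category from "Male" as far as pandas is concerned.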

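The gender column contains exactly two values, Male and Female, so the data is already in a good format to be represented with one-hot encoding. In a real application you should go through all columns and check their values; we skip that here for brevity. pandas offers a very simple way to encode the data: the get_dummies function, which automatically transforms all columns that have object type (such as strings) or are categorical (covered in my earlier pandas notes):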

import pandas as pd

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]
print("Original features:\n", list(data.columns), "\n")
data_dummies = pd.get_dummies(data)
print("Features after get_dummies:\n", list(data_dummies.columns))

Output:

Original features:
 ['age', 'workclass', 'education', 'gender', 'hours-per-week', 'occupation', 'income']

Features after get_dummies:
 ['age', 'hours-per-week', 'workclass_ ?', 'workclass_ Federal-gov', 'workclass_ Local-gov', 'workclass_ Never-worked', 'workclass_ Private', 'workclass_ Self-emp-inc', 'workclass_ Self-emp-not-inc', 'workclass_ State-gov', 'workclass_ Without-pay', 'education_ 10th', 'education_ 11th', 'education_ 12th', 'education_ 1st-4th', 'education_ 5th-6th', 'education_ 7th-8th', 'education_ 9th', 'education_ Assoc-acdm', 'education_ Assoc-voc', 'education_ Bachelors', 'education_ Doctorate', 'education_ HS-grad', 'education_ Masters', 'education_ Preschool', 'education_ Prof-school', 'education_ Some-college', 'gender_ Female', 'gender_ Male', 'occupation_ ?', 'occupation_ Adm-clerical', 'occupation_ Armed-Forces', 'occupation_ Craft-repair', 'occupation_ Exec-managerial', 'occupation_ Farming-fishing', 'occupation_ Handlers-cleaners', 'occupation_ Machine-op-inspct', 'occupation_ Other-service', 'occupation_ Priv-house-serv', 'occupation_ Prof-specialty', 'occupation_ Protective-serv', 'occupation_ Sales', 'occupation_ Tech-support', 'occupation_ Transport-moving', 'income_ <=50K', 'income_ >50K']

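As the output shows, the continuous features age and hours-per-week were left untouched, while each categorical feature was expanded into one new feature per possible value: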

import pandas as pd

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

data_dummies = pd.get_dummies(data)
# Print the new feature representation
print(data_dummies.head())

Output:

4	28	40	False	False	...	False	False	True	False

5 rows × 46 columns

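Next, the values attribute can be used to convert the data_dummies DataFrame into a NumPy array, on which a machine learning model can be trained. Before training, be careful to separate the target variable (now encoded in the two income columns) from the data. Including the output variable, or some derived property of it, in the feature representation is a very common mistake when building supervised machine learning models.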

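Note: label-based column slicing in pandas includes the end of the range, so 'age':'occupation_ Transport-moving' is inclusive of occupation_ Transport-moving. This differs from slicing a NumPy array, where the end of the range is excluded: for example, np.arange(11)[0:10] does not include the element at index 10.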
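The difference between the two slicing conventions is easy to verify on a toy example. A minimal sketch (the column names `a` to `d` are assumptions for illustration):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame([[1, 2, 3, 4]], columns=["a", "b", "c", "d"])

# Label-based slicing with .loc includes the end label "c"
loc_cols = list(df.loc[:, "a":"c"].columns)
print(loc_cols)   # ['a', 'b', 'c']

# Position-based slicing with .iloc excludes the end position
iloc_cols = list(df.iloc[:, 0:2].columns)
print(iloc_cols)  # ['a', 'b']

# NumPy slicing also excludes the end: the last element is 9, not 10
arr = np.arange(11)[0:10]
print(arr[-1])    # 9
```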

We extract only the columns containing features, that is, all columns from age to occupation_ Transport-moving. This range contains all the features but not the target:

import pandas as pd

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

data_dummies = pd.get_dummies(data)
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values
print("X.shape: {} y.shape: {}".format(X.shape, y.shape))

Output: X.shape: (32561, 44) y.shape: (32561,)

The data is now represented in a way that scikit-learn can work with, so we can move on to the next step:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

data_dummies = pd.get_dummies(data)
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

Output:

ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Test score: 0.81

The score is printed, but with a warning: the algorithm did not converge to a stable solution within the default maximum number of iterations.

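What the warning says:

ConvergenceWarning: the algorithm failed to converge during its iterations.
lbfgs failed to converge (status=1): lbfgs is the optimization algorithm in use, and it reports that it did not converge.
STOP: TOTAL NO. of ITERATIONS REACHED LIMIT: the number of iterations reached the configured limit.
Increase the number of iterations (max_iter): a suggestion to raise the maximum number of iterations.
scale the data: a suggestion to scale the data in order to improve convergence.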

Let's increase the number of iterations and see whether it affects the result:

import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

data = pd.read_csv("data/adult.data", 
                   header=None, index_col=False, 
                   names=['age', 'workclass', 'fnlwgt', 'education',  'education-num', 
                          'marital-status', 'occupation', 'relationship', 'race', 'gender', 
                          'capital-gain', 'capital-loss', 'hours-per-week', 'native-country', 
                          'income'])
data = data[['age', 'workclass', 'education', 'gender', 'hours-per-week',
             'occupation', 'income']]

data_dummies = pd.get_dummies(data)
features = data_dummies.loc[:, 'age':'occupation_ Transport-moving']
# Extract NumPy arrays
X = features.values
y = data_dummies['income_ >50K'].values

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
logreg = LogisticRegression(solver='lbfgs', max_iter=10000)
logreg.fit(X_train, y_train)
print("Test score: {:.2f}".format(logreg.score(X_test, y_test)))

With the maximum number of iterations raised to 10000, the output is Test score: 0.81 and the warning no longer appears.
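The warning also suggested scaling the data as an alternative to raising max_iter. The following is a minimal sketch of that approach on synthetic data rather than adult.data (the generated dataset, the factor of 1000 applied to two columns, and the variable names are all assumptions made for illustration). Whether the warning disappears on a given dataset still has to be checked by running it.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data; two columns are inflated to mimic unscaled features
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X[:, :2] *= 1000.0

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The scaler is fit on the training split only and then applied to both splits
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train, y_train)

n_iter = model.named_steps["logisticregression"].n_iter_[0]
test_score = model.score(X_test, y_test)
print("Iterations used: {}".format(n_iter))
print("Test score: {:.2f}".format(test_score))
```

Putting the scaler inside a pipeline keeps test-set statistics out of the preprocessing step, which is the usual reason to prefer it over scaling the full array up front.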

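In this example we called get_dummies on a DataFrame containing both the training and the test data. This is important to ensure that categorical values are represented in the same way in the training set and the test set.

Imagine the training and test sets live in two different DataFrames. If the "Private Employee" value of the workclass feature does not appear in the test set, pandas will assume this feature has only three possible values and will create only three new dummy features. The training and test sets now have different numbers of features, and the model learned on the training set can no longer be applied to the test set.

Even worse, imagine the workclass feature has the values "Government Employee" and "Private Employee" in the training set, and "Self Employed" and "Self Employed Incorporated" in the test set. In both cases pandas creates two new dummy features, so the encoded DataFrames have the same number of features, but the two dummy features mean entirely different things in the two sets. The column that stands for "Government Employee" in the training set encodes "Self Employed" in the test set. A machine learning model built on this data would perform very badly, because it assumes that each column means the same thing in both sets when in fact the columns mean very different things.

To fix this, either call get_dummies on a DataFrame that contains both the training and the test data points, or make sure the column names are the same for the training and test sets after calling get_dummies, so that they have the same semantics.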
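One way to enforce identical column names when the two sets are encoded separately is to reindex the test columns against the training columns. This is a minimal sketch on made-up data (the two small DataFrames are assumptions; `reindex` is one possible technique, not the only one):

```python
import pandas as pd

# Hypothetical training and test sets kept in separate DataFrames
train = pd.DataFrame({"workclass": ["Government Employee", "Private Employee"]})
test = pd.DataFrame({"workclass": ["Self Employed", "Government Employee"]})

train_dummies = pd.get_dummies(train, dtype=int)
test_dummies = pd.get_dummies(test, dtype=int)

# Reuse the training columns: categories unseen in training are dropped,
# and training categories absent from the test set become all-zero columns
test_aligned = test_dummies.reindex(columns=train_dummies.columns, fill_value=0)

print(list(train_dummies.columns))
print(test_aligned)
```

After alignment, the test row with "Self Employed" is all zeros: the model never saw that category, so there is no column for it to activate.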

2. Numbers Can Encode Categorical Variables

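In the adult example, the categorical variables were encoded as strings. On the one hand, this opens up the possibility of spelling errors; on the other hand, it clearly marks a variable as categorical. Often, whether for ease of storage or because of the way the data was collected, categorical variables are encoded as integers. For example, imagine the census data in the adult dataset was collected with a questionnaire, and the answers for workclass were recorded as 0 (first box ticked), 1 (second box ticked), 2 (third box ticked), and so on. The column would now contain the numbers 0 to 8 instead of strings like "Private". Someone looking at the table representing the dataset could not tell at a glance whether this variable should be treated as continuous or categorical. Knowing that the numbers indicate employment status, however, it is clear that they are distinct states and should not be modeled by a single continuous variable.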

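The pandas get_dummies function treats all numbers as continuous and does not create dummy variables for them. To get around this, you can either use scikit-learn's OneHotEncoder, for which you can specify which variables are continuous and which are discrete, or convert the numeric columns in the DataFrame to strings. To illustrate, we create a DataFrame with two columns, one containing strings and one containing integers: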

import pandas as pd

# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1], 
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})
print(demo_df)

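Output:

| | Integer Feature | Categorical Feature |
| --- | --- | --- |
| 0 | 0 | socks |
| 1 | 1 | fox |
| 2 | 2 | socks |
| 3 | 1 | box |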

Using get_dummies encodes only the string feature and leaves the integer feature unchanged:

import pandas as pd

# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1], 
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

features = pd.get_dummies(demo_df)
print(features)

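Output:

| | Integer Feature | Categorical Feature_box | Categorical Feature_fox | Categorical Feature_socks |
| --- | --- | --- | --- | --- |
| 0 | 0 | False | False | True |
| 1 | 1 | False | True | False |
| 2 | 2 | False | False | True |
| 3 | 1 | True | False | False |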

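If you want dummy variables for the "Integer Feature" column as well, you can explicitly list the columns to encode using the columns parameter (the code below also converts the column to strings first). Both features are then treated as categorical: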

import pandas as pd

# Create a DataFrame with an integer feature and a categorical string feature
demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1], 
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

demo_df['Integer Feature'] = demo_df['Integer Feature'].astype(str)
features = pd.get_dummies(demo_df, columns=['Integer Feature', 'Categorical Feature'])
print(features)

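Output:

| | Integer Feature_0 | Integer Feature_1 | Integer Feature_2 | Categorical Feature_box | Categorical Feature_fox | Categorical Feature_socks |
| --- | --- | --- | --- | --- | --- | --- |
| 0 | True | False | False | False | False | True |
| 1 | False | True | False | False | True | False |
| 2 | False | False | True | False | False | True |
| 3 | False | True | False | True | False | False |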
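The scikit-learn route mentioned earlier, OneHotEncoder, can be sketched on the same demo DataFrame. This is a minimal sketch, assuming both columns should be treated as categorical; the transformer name "onehot" and the choice of `sparse_threshold=0` to force a dense array are illustrative choices:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

demo_df = pd.DataFrame({'Integer Feature': [0, 1, 2, 1],
                        'Categorical Feature': ['socks', 'fox', 'socks', 'box']})

# List the columns to one-hot encode explicitly; integers need no string conversion.
# handle_unknown="ignore" encodes categories unseen during fit as all zeros.
ct = ColumnTransformer(
    [("onehot", OneHotEncoder(handle_unknown="ignore"),
      ['Integer Feature', 'Categorical Feature'])],
    sparse_threshold=0)  # always return a dense NumPy array

encoded = ct.fit_transform(demo_df)
print(encoded.shape)  # (4, 6)
print(encoded)
```

Because the encoder is fit once and then reused through `transform`, the same columns are produced for training and test data, which sidesteps the column-mismatch problem that separate get_dummies calls can cause.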