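Case Study: Product Recall Prediction
Case Background
The data used in this case come from a pre-recall survey for a product. To protect sensitive information, the original field names have been replaced. The main fields are the spending and usage duration for each of four channels, together with the number of contacts with customer service, among others.
Loading and Splitting the Data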
python
# Import the required packages
import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import precision_score, recall_score, roc_auc_score, roc_curve, accuracy_score
# Load the data (the CSV file is GBK-encoded)
data = pd.read_csv('./case_random_forest.csv', encoding='gbk')
# Preview the first few rows
data.head()
| | Service 1 usage count | Channel 1 duration | Channel 1 visits | Channel 1 spend | Channel 2 duration | Channel 2 visits | Channel 2 spend | Channel 3 duration | Channel 3 visits | Channel 3 spend | Channel 4 duration | Channel 4 visits | Channel 4 spend | Customer service contacts | isrun |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 25 | 265.1 | 110 | 45.07 | 197.4 | 99 | 16.78 | 244.7 | 91 | 11.01 | 10.0 | 3 | 2.70 | 1 | False |
| 1 | 26 | 161.6 | 123 | 27.47 | 195.5 | 103 | 16.62 | 254.4 | 103 | 11.45 | 13.7 | 3 | 3.70 | 1 | False |
| 2 | 0 | 243.4 | 114 | 41.38 | 121.2 | 110 | 10.30 | 162.6 | 104 | 7.32 | 12.2 | 5 | 3.29 | 0 | False |
| 3 | 0 | 299.4 | 71 | 50.90 | 61.9 | 88 | 5.26 | 196.9 | 89 | 8.86 | 6.6 | 7 | 1.78 | 2 | False |
| 4 | 0 | 166.7 | 113 | 28.34 | 148.3 | 122 | 12.61 | 186.9 | 121 | 8.41 | 10.1 | 3 | 2.73 | 3 | False |
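Because the split that follows is stratified on `isrun` and missing values are filled with column means, it can help to look at the class balance and the per-column missing counts first. A minimal sketch on a small made-up frame; the `toy` DataFrame, its column names, and its values are illustrative assumptions, not the case data:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the case data; values are made up for illustration.
toy = pd.DataFrame({
    "duration": [265.1, 161.6, np.nan, 299.4, 166.7, 120.0],
    "visits": [110, 123, 114, 71, 113, 90],
    "isrun": [False, False, False, True, False, True],
})

# Count of each class in the target column.
class_counts = toy["isrun"].value_counts()

# Share of positive (True) labels; the mean of a boolean column is its True rate.
positive_rate = float(toy["isrun"].mean())

# Number of missing values per column.
missing_counts = toy.isna().sum()

print(class_counts)
print(f"Positive rate: {positive_rate:.3f}")
print(missing_counts)
```

On the real data the same three calls would be applied to `data` before `isrun` is popped off as the label.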
python
# Split the data into training and test sets at a 7:3 ratio, stratified by label
labels = np.array(data.pop("isrun"))
train, test, train_labels, test_labels = train_test_split(data, labels,
                                                          stratify = labels,
                                                          test_size = 0.3,
                                                          random_state = 114)
# Fill missing values with each set's own column means
train = train.fillna(train.mean())
test = test.fillna(test.mean())
# Features for feature importances
features = list(train.columns)
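The cell above fills missing values in the test set with the test set's own column means. A common alternative is to compute the means on the training set only and reuse them for the test set, so that no test-set statistics enter preprocessing. A minimal sketch on made-up data; the `toy_train` and `toy_test` frames are hypothetical, and this is a variation rather than what the original cell does:

```python
import numpy as np
import pandas as pd

# Hypothetical frames standing in for the train/test split.
toy_train = pd.DataFrame({"duration": [10.0, 20.0, np.nan, 30.0]})
toy_test = pd.DataFrame({"duration": [np.nan, 100.0]})

# Column means estimated from the training rows only (NaN is skipped by default).
train_means = toy_train.mean()

# Apply the same training means to both sets.
toy_train_filled = toy_train.fillna(train_means)
toy_test_filled = toy_test.fillna(train_means)

print(toy_test_filled)
```

With this variant the imputed test values do not depend on which rows happen to land in the test set.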
Building and Training the Model
python
# Build the model
trees = RandomForestClassifier(n_estimators=20, random_state=114, criterion='gini', max_features = 'sqrt')
trees.fit(train, train_labels)
RandomForestClassifier(max_features='sqrt', n_estimators=20, random_state=114)
python
# Inspect the fitted random forest: node count and maximum depth of each tree, averaged over the forest
n_nodes = []
max_depths = []
for ind_tree in trees.estimators_:
    n_nodes.append(ind_tree.tree_.node_count)
    max_depths.append(ind_tree.tree_.max_depth)
print(f'Average number of nodes {int(np.mean(n_nodes))}')
print(f'Average maximum depth {int(np.mean(max_depths))}')
Average number of nodes 379
Average maximum depth 19
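The forest above sets no `max_depth`, so each tree grows until its leaves are pure, which is consistent with the several hundred nodes per tree reported here. To see how a depth limit changes tree size, the same statistics can be computed for a depth-limited forest. A sketch on synthetic data from `make_classification`; the data, the helper `forest_size`, and the choice `max_depth=5` are assumptions for illustration:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data standing in for the case data.
X, y = make_classification(n_samples=1000, n_features=14,
                           n_informative=6, random_state=0)


def forest_size(forest):
    """Return (mean node count, mean maximum depth) over the trees of a fitted forest."""
    nodes = [t.tree_.node_count for t in forest.estimators_]
    depths = [t.tree_.max_depth for t in forest.estimators_]
    return float(np.mean(nodes)), float(np.mean(depths))


# Same settings except for the depth cap.
full = RandomForestClassifier(n_estimators=20, random_state=0).fit(X, y)
capped = RandomForestClassifier(n_estimators=20, max_depth=5,
                                random_state=0).fit(X, y)

full_nodes, full_depth = forest_size(full)
capped_nodes, capped_depth = forest_size(capped)

print(f"Unlimited: {full_nodes:.0f} nodes, depth {full_depth:.1f}")
print(f"max_depth=5: {capped_nodes:.0f} nodes, depth {capped_depth:.1f}")
```

Whether a depth cap helps on the case data would need to be checked against the test metrics; this only shows how to measure the effect on tree size.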
python
# Evaluate on the test set
probs = trees.predict_proba(test)[:, 1]
predictions = trees.predict(test)
print(f'Test ACC Score: {accuracy_score(test_labels, predictions)}')
print(f'Test ROC AUC Score: {roc_auc_score(test_labels, probs)}')
Test ACC Score: 0.906
Test ROC AUC Score: 0.821895543456342
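Accuracy can look high when one class dominates, so it is worth comparing it with the majority-class baseline and with precision and recall for the positive class. `precision_score` and `recall_score` are imported at the top of the notebook but are not used in the cell above. A sketch on synthetic imbalanced data; the data and variable names are assumptions, and the numbers it prints are not the case-study results:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic data with roughly 15% positives, standing in for the case data.
X, y = make_classification(n_samples=2000, n_features=14, n_informative=6,
                           weights=[0.85, 0.15], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)
pred = model.predict(X_te)

# Accuracy obtained by always predicting the majority class.
positive_share = float(np.mean(y_te))
baseline_acc = max(positive_share, 1 - positive_share)

acc = accuracy_score(y_te, pred)
prec = precision_score(y_te, pred, zero_division=0)
rec = recall_score(y_te, pred, zero_division=0)

print(f"Baseline ACC: {baseline_acc:.3f}")
print(f"Model ACC: {acc:.3f}  Precision: {prec:.3f}  Recall: {rec:.3f}")
```

Applied to the case data, the same calls would take `test_labels` and `predictions` as arguments.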
python
# Rank the variables by importance
fi_model = pd.DataFrame({'feature': features,
                         'importance': trees.feature_importances_}).\
    sort_values('importance', ascending = False)
fi_model.head(10)
| | feature | importance |
|---|---|---|
| 1 | Channel 1 duration | 0.149806 |
| 3 | Channel 1 spend | 0.148304 |
| 13 | Customer service contacts | 0.132728 |
| 4 | Channel 2 duration | 0.078845 |
| 6 | Channel 2 spend | 0.078460 |
| 7 | Channel 3 duration | 0.056919 |
| 12 | Channel 4 spend | 0.051537 |
| 2 | Channel 1 visits | 0.048397 |
| 8 | Channel 3 visits | 0.047986 |
| 10 | Channel 4 duration | 0.045726 |
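The `feature_importances_` values are impurity-based and computed from the training data, which can favor features with many distinct values. Permutation importance on held-out data is a common cross-check. A sketch using `sklearn.inspection.permutation_importance` on synthetic data; the data, the column names `f0` to `f7`, the `roc_auc` scorer, and `n_repeats=5` are assumptions for illustration:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic data standing in for the case data.
X, y = make_classification(n_samples=1000, n_features=8,
                           n_informative=4, random_state=0)
cols = [f"f{i}" for i in range(X.shape[1])]
X = pd.DataFrame(X, columns=cols)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y,
                                          test_size=0.3, random_state=0)

model = RandomForestClassifier(n_estimators=20, random_state=0).fit(X_tr, y_tr)

# Shuffle one column at a time on the held-out set and record the drop in ROC AUC.
result = permutation_importance(model, X_te, y_te, scoring="roc_auc",
                                n_repeats=5, random_state=0)

pi = pd.DataFrame({"feature": cols,
                   "importance": result.importances_mean}
                  ).sort_values("importance", ascending=False)
print(pi.head())
```

If the two rankings broadly agree on the case data, that lends some support to the impurity-based table; large disagreements would be worth investigating.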