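Day 10: An End-to-End Machine Learning Workflow in Python

In machine learning practice, data preprocessing and model building are the critical steps. This post reviews the full preprocessing workflow, then builds and evaluates simple machine learning models on the processed data. Hyperparameter tuning is left for later.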

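I. Review of the Preprocessing Workflow

The success of a machine learning project depends heavily on data quality. The standard preprocessing workflow is:

  1. Import libraries: bring in the Python libraries needed for data processing, analysis, visualization, and modeling.
  2. Load and understand the data: read the dataset and use the info() and head() methods for a first look at its basic information and structure.
  3. Handle missing values: identify and treat the missing values in the data.
  4. Handle outliers: detect and treat anomalous data points.
  5. Handle discrete values: convert discrete (categorical) data into a format the model can work with.
  6. Feature engineering: feature scaling, deriving new features, and feature selection.
  7. Split the dataset: divide the data into a training set and a test set for model training and evaluation.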

1.1 Importing the Required Packages

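import pandas as pd  # data processing and analysis for tabular data
import numpy as np   # numerical computing with efficient array operations
import matplotlib.pyplot as plt  # plotting of all common chart types
import seaborn as sns  # higher-level plotting library built on matplotlib, for nicer statistical graphics

# Set a font that can render Chinese characters (fixes garbled Chinese labels in plots)
plt.rcParams['font.sans-serif'] = ['SimHei']  # SimHei is a common font on Windows
plt.rcParams['axes.unicode_minus'] = False    # render the minus sign correctly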

1.2 Inspecting the Data

data = pd.read_csv('data.csv')    # load the data
print("Basic data information:")
data.info()
print("\nPreview of the first 5 rows:")
print(data.head())

Basic data information

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 18 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   object 
 2   Annual Income                 5943 non-null   float64
 3   Years in current job          7129 non-null   object 
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  3419 non-null   float64
 10  Bankruptcies                  7486 non-null   float64
 11  Purpose                       7500 non-null   object 
 12  Term                          7500 non-null   object 
 13  Current Loan Amount           7500 non-null   float64
 14  Current Credit Balance        7500 non-null   float64
 15  Monthly Debt                  7500 non-null   float64
 16  Credit Score                  5943 non-null   float64
 17  Credit Default                7500 non-null   int64  
dtypes: float64(12), int64(2), object(4)
memory usage: 1.0+ MB

Preview of the first 5 rows

   Id Home Ownership  Annual Income Years in current job  Tax Liens  \
0   0       Own Home       482087.0                  NaN        0.0   
1   1       Own Home      1025487.0            10+ years        0.0   
2   2  Home Mortgage       751412.0              8 years        0.0   
3   3       Own Home       805068.0              6 years        0.0   
4   4           Rent       776264.0              8 years        0.0   

   Number of Open Accounts  Years of Credit History  Maximum Open Credit  \
0                     11.0                     26.3             685960.0   
1                     15.0                     15.3            1181730.0   
2                     11.0                     35.0            1182434.0   
3                      8.0                     22.5             147400.0   
4                     13.0                     13.6             385836.0   

   Number of Credit Problems  Months since last delinquent  Bankruptcies  \
0                        1.0                           NaN           1.0   
1                        0.0                           NaN           0.0   
2                        0.0                           NaN           0.0   
3                        1.0                           NaN           1.0   
4                        1.0                           NaN           0.0   

              Purpose        Term  Current Loan Amount  \
0  debt consolidation  Short Term           99999999.0   
1  debt consolidation   Long Term             264968.0   
2  debt consolidation  Short Term           99999999.0   
3  debt consolidation  Short Term             121396.0   
4  debt consolidation  Short Term             125840.0   

   Current Credit Balance  Monthly Debt  Credit Score  Credit Default  
0                 47386.0        7914.0         749.0               0  
1                394972.0       18373.0         737.0               1  
2                308389.0       13651.0         742.0               0  
3                 95855.0       11338.0         694.0               0  
4                 93309.0        7180.0         719.0               0  

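1.3 Handling Missing Values

  • Annual Income: 1557 missing values; these can be filled with the mean income of a related group, such as each "Home Ownership" category.
  • Years in current job: 371 missing values; convert the strings to numbers first, then fill with the mode or the median.
  • Months since last delinquent: the most missing values (4081 of 7500 rows, about 54%); depending on how strongly the feature affects the target, consider multiple imputation or dropping the affected rows, keeping in mind that dropping rows would discard more than half of the dataset.
  • Credit Score: 1557 missing values; handle it the same way as "Annual Income".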
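The group-based fill suggested for "Annual Income" can be sketched as follows. This is a minimal illustration on a made-up toy frame, not the real dataset; only the column names come from the data above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the real dataset (values are made up for illustration)
toy = pd.DataFrame({
    "Home Ownership": ["Rent", "Rent", "Own Home", "Own Home", "Rent"],
    "Annual Income": [400000.0, np.nan, 900000.0, np.nan, 600000.0],
})

# Mean income within each Home Ownership group, aligned to the original rows
group_mean = toy.groupby("Home Ownership")["Annual Income"].transform("mean")

# Fill only the missing entries with the mean of their own group
toy["Annual Income"] = toy["Annual Income"].fillna(group_mean)
print(toy)
```

Using transform keeps the result the same length as the original column, so it can be passed straight to fillna without a merge.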

1.4 Data Type Conversion

  • Years in current job: convert from string to numeric.
  • Home Ownership, Purpose, Term: choose one-hot encoding or label encoding, depending on the nature of each feature.
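One possible way to turn the "Years in current job" strings into numbers is to extract the digits with a regular expression. The sketch below is an alternative to a hand-written lookup table, and it assumes that "< 1 year" should become 0; that choice is an assumption made here for illustration.

```python
import numpy as np
import pandas as pd

# A few representative values, including a missing one
years = pd.Series(["< 1 year", "1 year", "8 years", "10+ years", np.nan])

# Extract the first run of digits; missing values stay missing
numeric_years = years.str.extract(r"(\d+)", expand=False).astype(float)

# Assumption: treat "< 1 year" as 0 so it differs from "1 year"
numeric_years.loc[years == "< 1 year"] = 0.0
print(numeric_years)
```

Note that "10+ years" becomes 10 here, so the open-ended top category is only a lower bound.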

1.5 Handling Outliers

For numeric features such as "Annual Income" and "Current Loan Amount", box plots can be used to detect outliers; whether to treat them is decided case by case.
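A box plot flags points that lie more than 1.5 times the interquartile range beyond the quartiles, and the same rule can be applied numerically. The sketch below uses a small toy series: four values are taken from the data preview, and 310000.0 is made up. The value 99999999.0 appears in the preview and may be a placeholder rather than a real loan amount, which is an assumption worth checking against the data source.

```python
import pandas as pd

# Toy sample of "Current Loan Amount" (310000.0 is made up for illustration)
loan = pd.Series(
    [121396.0, 125840.0, 264968.0, 310000.0, 99999999.0],
    name="Current Loan Amount",
)

# Quartiles and interquartile range
q1, q3 = loan.quantile(0.25), loan.quantile(0.75)
iqr = q3 - q1

# Same fences a default box plot draws its whiskers against
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = loan[(loan < lower) | (loan > upper)]
print(outliers)
```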

1.6 Feature Scaling

Apply Min-Max normalization or Z-score standardization to the numeric features so that they share a comparable range of values.
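Both scalings are short formulas: Min-Max maps values to the interval [0, 1], and Z-score subtracts the mean and divides by the standard deviation. A minimal sketch on the five "Credit Score" values from the preview:

```python
import pandas as pd

# Credit Score values from the five preview rows
score = pd.Series([694.0, 719.0, 737.0, 742.0, 749.0], name="Credit Score")

# Min-Max normalization: (x - min) / (max - min), result lies in [0, 1]
min_max = (score - score.min()) / (score.max() - score.min())

# Z-score standardization: (x - mean) / std, using the population std (ddof=0)
z_score = (score - score.mean()) / score.std(ddof=0)

print(min_max.round(3).tolist())
print(z_score.round(3).tolist())
```

scikit-learn offers the same transformations as MinMaxScaler and StandardScaler in sklearn.preprocessing, which is convenient when the scaling has to be fitted on the training set and reused on the test set.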

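1.7 Feature Engineering

  • Deriving new features: for example, computing the debt-to-income ratio.
  • Feature selection: use correlation analysis or similar methods to keep the features most strongly correlated with the target variable.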
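Both ideas can be sketched on the five preview rows. The ratio below annualizes "Monthly Debt" by multiplying by 12, which assumes both columns use the same currency unit; the column name "Debt to Income" is introduced here for illustration. With only five rows the correlations mean nothing statistically, so the sketch shows the mechanics only.

```python
import pandas as pd

# The five preview rows (income, monthly debt, target)
toy = pd.DataFrame({
    "Annual Income": [482087.0, 1025487.0, 751412.0, 805068.0, 776264.0],
    "Monthly Debt": [7914.0, 18373.0, 13651.0, 11338.0, 7180.0],
    "Credit Default": [0, 1, 0, 0, 0],
})

# Derived feature: yearly debt payments as a share of yearly income (assumed definition)
toy["Debt to Income"] = toy["Monthly Debt"] * 12 / toy["Annual Income"]

# Absolute correlation of every feature with the target, strongest first
corr = (
    toy.corr()["Credit Default"]
    .drop("Credit Default")
    .abs()
    .sort_values(ascending=False)
)
print(corr)
```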

II. Preprocessing in Practice

2.1 Handling object-Type Variables

# Select the string (object) columns
discrete_features = data.select_dtypes(include=['object']).columns.tolist()
print(discrete_features)

# Show the unique values of each string column and their counts
for feature in discrete_features:
    print(f"\nUnique values of {feature}:")
    print(data[feature].value_counts())

How each variable is handled

  • Home Ownership: label encoding

    mapping = {
    'Own Home': 1,
    'Rent': 2,
    'Have Mortgage': 3,
    'Home Mortgage': 4
    }

    data['Home Ownership']=data['Home Ownership'].map(mapping)
    data.head()

  • Years in current job: label encoding

    years_in_job_mapping = {
    '< 1 year': 1,
    '1 year': 2,
    '2 years': 3,
    '3 years': 4,
    '4 years': 5,
    '5 years': 6,
    '6 years': 7,
    '7 years': 8,
    '8 years': 9,
    '9 years': 10,
    '10+ years': 11
    }
    data['Years in current job'] = data['Years in current job'].map(years_in_job_mapping)

  • Purpose: one-hot encoding

    data = pd.get_dummies(data, columns=['Purpose'])

    Convert the bool columns produced by one-hot encoding to integers

    for col in data.columns:
        if 'Purpose' in col:
            data[col] = data[col].astype(int)

  • Term: 0/1 mapping

    term_mapping = {
    'Short Term': 0,
    'Long Term': 1
    }
    data['Term'] = data['Term'].map(term_mapping)
    data.rename(columns={'Term': 'Long Term'}, inplace=True)
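One thing to watch with map(): any value that is not a key in the dictionary silently becomes NaN. A quick check after each mapping can catch misspelled or unexpected categories. The sketch below uses a made-up toy series in which "short term" is a deliberately wrong spelling.

```python
import pandas as pd

# Toy column with one unexpected spelling and one genuinely missing value
raw = pd.Series(["Short Term", "Long Term", "short term", None], name="Term")
term_mapping = {"Short Term": 0, "Long Term": 1}

encoded = raw.map(term_mapping)

# Values that existed before mapping but turned into NaN afterwards
unmapped = raw[raw.notna() & encoded.isna()].unique().tolist()
print(unmapped)
```

An empty list means every non-missing category was covered by the mapping.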

2.2 Handling Numeric Variables

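# Select the numeric features
continuous_features = data.select_dtypes(include=['int64', 'float64']).columns.tolist()

# Fill missing values with each column's median
for feature in continuous_features:
    median_value = data[feature].median()
    # Assign the result back instead of calling fillna(inplace=True) on a column selection
    data[feature] = data[feature].fillna(median_value)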

Data information after processing:

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 7500 entries, 0 to 7499
Data columns (total 32 columns):
 #   Column                        Non-Null Count  Dtype  
---  ------                        --------------  -----  
 0   Id                            7500 non-null   int64  
 1   Home Ownership                7500 non-null   int64  
 2   Annual Income                 7500 non-null   float64
 3   Years in current job          7500 non-null   float64
 4   Tax Liens                     7500 non-null   float64
 5   Number of Open Accounts       7500 non-null   float64
 6   Years of Credit History       7500 non-null   float64
 7   Maximum Open Credit           7500 non-null   float64
 8   Number of Credit Problems     7500 non-null   float64
 9   Months since last delinquent  7500 non-null   float64
 10  Bankruptcies                  7500 non-null   float64
 11  Long Term                     7500 non-null   int64  
 12  Current Loan Amount           7500 non-null   float64
 13  Current Credit Balance        7500 non-null   float64
 14  Monthly Debt                  7500 non-null   float64
 15  Credit Score                  7500 non-null   float64
 16  Credit Default                7500 non-null   int64  
 17  Purpose_business loan         7500 non-null   int32  
 18  Purpose_buy a car             7500 non-null   int32  
 19  Purpose_buy house             7500 non-null   int32  
 20  Purpose_debt consolidation    7500 non-null   int32  
 21  Purpose_educational expenses  7500 non-null   int32  
 22  Purpose_home improvements     7500 non-null   int32  
 23  Purpose_major purchase        7500 non-null   int32  
 24  Purpose_medical bills         7500 non-null   int32  
 25  Purpose_moving                7500 non-null   int32  
 26  Purpose_other                 7500 non-null   int32  
 27  Purpose_renewable energy      7500 non-null   int32  
 28  Purpose_small business        7500 non-null   int32  
 29  Purpose_take a trip           7500 non-null   int32  
 30  Purpose_vacation              7500 non-null   int32  
 31  Purpose_wedding               7500 non-null   int32  
dtypes: float64(13), int32(15), int64(4)
memory usage: 1.4 MB

III. Model Building and Evaluation

3.1 Splitting the Data

from sklearn.model_selection import train_test_split
X = data.drop(['Credit Default'], axis=1)  # features (the Id column is still included)
y = data['Credit Default']  # label
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print(f"Training set shape: {X_train.shape}, test set shape: {X_test.shape}")

Result

Training set shape: (6000, 31), test set shape: (1500, 31)
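The walkthrough above fills missing values before splitting, so the medians are computed from all rows, including those that end up in the test set. A stricter variant splits first and learns the medians from the training set only. The sketch below shows that ordering with scikit-learn's SimpleImputer on made-up toy data; the names X_toy, y_toy, and the stratified split are choices made here for illustration, not part of the original workflow.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split

# Made-up toy data with some missing incomes
rng = np.random.default_rng(42)
X_toy = pd.DataFrame({
    "Annual Income": rng.normal(800000, 200000, 100),
    "Credit Score": rng.normal(720, 30, 100),
})
X_toy.loc[::10, "Annual Income"] = np.nan
y_toy = pd.Series(rng.integers(0, 2, 100))

# Split first; stratify keeps the class ratio similar in both parts
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=42, stratify=y_toy
)

# Learn the medians from the training rows only, then reuse them on the test rows
imputer = SimpleImputer(strategy="median")
X_tr_filled = pd.DataFrame(imputer.fit_transform(X_tr), columns=X_tr.columns, index=X_tr.index)
X_te_filled = pd.DataFrame(imputer.transform(X_te), columns=X_te.columns, index=X_te.index)

print(X_tr_filled.isna().sum().sum(), X_te_filled.isna().sum().sum())
```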

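3.2 Model Training and Evaluation

Several common classifiers are trained and evaluated: SVM, KNN, logistic regression, naive Bayes, decision tree, random forest, XGBoost, and LightGBM.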

from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
import xgboost as xgb
import lightgbm as lgb
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix
import warnings
warnings.filterwarnings("ignore")

# SVM model
svm_model = SVC(random_state=42)
svm_model.fit(X_train, y_train)
svm_pred = svm_model.predict(X_test)
print("\nSVM classification report:")
print(classification_report(y_test, svm_pred))
print("SVM confusion matrix:")
print(confusion_matrix(y_test, svm_pred))
print("SVM evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, svm_pred):.4f}")
print(f"Precision: {precision_score(y_test, svm_pred):.4f}")
print(f"Recall: {recall_score(y_test, svm_pred):.4f}")
print(f"F1 score: {f1_score(y_test, svm_pred):.4f}")

# KNN model
knn_model = KNeighborsClassifier()
knn_model.fit(X_train, y_train)
knn_pred = knn_model.predict(X_test)
print("\nKNN classification report:")
print(classification_report(y_test, knn_pred))
print("KNN confusion matrix:")
print(confusion_matrix(y_test, knn_pred))
print("KNN evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, knn_pred):.4f}")
print(f"Precision: {precision_score(y_test, knn_pred):.4f}")
print(f"Recall: {recall_score(y_test, knn_pred):.4f}")
print(f"F1 score: {f1_score(y_test, knn_pred):.4f}")

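# Logistic regression model
logreg_model = LogisticRegression(random_state=42)
logreg_model.fit(X_train, y_train)
logreg_pred = logreg_model.predict(X_test)
print("\nLogistic regression classification report:")
print(classification_report(y_test, logreg_pred))
print("Logistic regression confusion matrix:")
print(confusion_matrix(y_test, logreg_pred))
print("Logistic regression evaluation metrics:")
print(f"Accuracy: {accuracy_score(y_test, logreg_pred):.4f}")
print(f"Precision: {precision_score(y_test, logreg_pred):.4f}")
print(f"Recall: {recall_score(y_test, logreg_pred):.4f}")
print(f"F1 score: {f1_score(y_test, logreg_pred):.4f}")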
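The per-model blocks above repeat the same fit, predict, and score steps, so they could be folded into one helper and a loop. The sketch below is one way to do that. It runs on synthetic data from make_classification as a stand-in for the credit dataset, and the helper name evaluate and the variables Xd_train, Xd_test, yd_train, and yd_test are introduced here for illustration. Other classifiers with the same fit and predict interface could be added to the dictionary in the same way.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier


def evaluate(model, X_train, X_test, y_train, y_test):
    """Fit a classifier and return its test-set metrics as a dict."""
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    return {
        "accuracy": accuracy_score(y_test, pred),
        "precision": precision_score(y_test, pred, zero_division=0),
        "recall": recall_score(y_test, pred, zero_division=0),
        "f1": f1_score(y_test, pred, zero_division=0),
    }


# Synthetic binary classification data (a stand-in, not the credit dataset)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=42)
Xd_train, Xd_test, yd_train, yd_test = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)

models = {
    "Logistic regression": LogisticRegression(max_iter=1000, random_state=42),
    "Naive Bayes": GaussianNB(),
    "Decision tree": DecisionTreeClassifier(random_state=42),
}

# One pass over all models instead of one copy-pasted block per model
results = {
    name: evaluate(model, Xd_train, Xd_test, yd_train, yd_test)
    for name, model in models.items()
}
for name, scores in results.items():
    print(name, {metric: round(value, 4) for metric, value in scores.items()})
```

Collecting the scores in a dictionary also makes it easy to turn them into a DataFrame for a side-by-side comparison.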

@浙大疏锦行
