企业级实战项目：基于 pycaret 自动化预测公司是否破产

本文系数据挖掘实战系列文章，我跟大家分享一个数据挖掘实战，与以往的数据实战不同的是，用自动机器学习方法完成模型构建与调优部分工作，深入理解由此带来的便利与效果。

1. Introduction

本文是一篇数据挖掘实战案例，详细探索了从台湾经济杂志收集的1999年到2009年的数据，看看在数据探索过程中，可以洞察出哪些有用的信息，判断哪一个模型能够最准确地预测公司是否破产。

公司破产的定义是根据台湾证券交易所的商业规则而定的。

该建模将尝试使用自动机器学习库pycaret来构建机器学习模型，pycaret是一个用python编写的开源低代码机器学习库，它将机器学习工作流程自动化。如果你想探索这个库并更好地理解它的功能。推荐查看

设置环境并读取数据

python 复制代码

import pandas as pd  
import numpy as np  
import math  
import matplotlib.pyplot as plt  
import seaborn as sns  
  
bankruptcy_df = pd.read_csv("Bankruptcy.csv")    

bankruptcy_df.head()

技术交流&源码获取

技术要学会交流、分享，不建议闭门造车。一个人可以走的很快、一堆人可以走的更远。

好的文章离不开粉丝的分享、推荐，资料干货、资料分享、数据、技术交流提升，均可加交流群获取，群友已超过2000人，添加时最好的备注方式为：来源+兴趣方向，方便找到志同道合的朋友。

本文数据&源码，技术交流、按照如下方式获取：

方式①、添加微信号：dkl88194，备注：资料

方式②、微信搜索公众号：Python学习与数据挖掘，后台回复：资料

资料1

资料2

我们打造了《100个超强算法模型》，特点：从0到1轻松学习，原理、代码、案例应有尽有，所有的算法模型都是按照这样的节奏进行表述，所以是一套完完整整的案例库。

很多初学者是有这么一个痛点，就是案例，案例的完整性直接影响同学的兴致。因此，我整理了 100个最常见的算法模型，在你的学习路上助推一把！

2. 理解数据

python 复制代码

bankruptcy_df.info()

python 复制代码

<class 'pandas.core.frame.DataFrame'>  
RangeIndex: 6819 entries, 0 to 6818  
Data columns (total 96 columns):  
 #   Column                                                    Non-Null Count  Dtype    
  
---  ------                                                    --------------  -----  
  
 0   Bankrupt?                                                 6819 non-null   int64    
 1    ROA(C) before interest and depreciation before interest  6819 non-null   float64  
 2    ROA(A) before interest and % after tax                   6819 non-null   float64  
 3    ROA(B) before interest and depreciation after tax        6819 non-null   float64  
 4    Operating Gross Margin                                   6819 non-null   float64  
 5    Realized Sales Gross Margin                              6819 non-null   float64  
 6    Operating Profit Rate                                    6819 non-null   float64  
 7    Pre-tax net Interest Rate                                6819 non-null   float64  
 8    After-tax net Interest Rate                              6819 non-null   float64  
 9    Non-industry income and expenditure/revenue              6819 non-null   float64  
 10   Continuous interest rate (after tax)                     6819 non-null   float64  
 11   Operating Expense Rate                                   6819 non-null   float64  
 12   Research and development expense rate                    6819 non-null   float64  
 13   Cash flow rate                                           6819 non-null   float64  
 14   Interest-bearing debt interest rate                      6819 non-null   float64  
 15   Tax rate (A)                                             6819 non-null   float64  
 16   Net Value Per Share (B)                                  6819 non-null   float64  
 17   Net Value Per Share (A)                                  6819 non-null   float64  
 18   Net Value Per Share (C)                                  6819 non-null   float64  
 19   Persistent EPS in the Last Four Seasons                  6819 non-null   float64  
 20   Cash Flow Per Share                                      6819 non-null   float64  
 21   Revenue Per Share (Yuan ¥)                               6819 non-null   float64  
 22   Operating Profit Per Share (Yuan ¥)                      6819 non-null   float64  
 23   Per Share Net profit before tax (Yuan ¥)                 6819 non-null   float64  
 24   Realized Sales Gross Profit Growth Rate                  6819 non-null   float64  
 25   Operating Profit Growth Rate                             6819 non-null   float64  
 26   After-tax Net Profit Growth Rate                         6819 non-null   float64  
 27   Regular Net Profit Growth Rate                           6819 non-null   float64  
 28   Continuous Net Profit Growth Rate                        6819 non-null   float64  
 29   Total Asset Growth Rate                                  6819 non-null   float64  
 30   Net Value Growth Rate                                    6819 non-null   float64  
 31   Total Asset Return Growth Rate Ratio                     6819 non-null   float64  
 32   Cash Reinvestment %                                      6819 non-null   float64  
 33   Current Ratio                                            6819 non-null   float64  
 34   Quick Ratio                                              6819 non-null   float64  
 35   Interest Expense Ratio                                   6819 non-null   float64  
 36   Total debt/Total net worth                               6819 non-null   float64  
 37   Debt ratio %                                             6819 non-null   float64  
 38   Net worth/Assets                                         6819 non-null   float64  
 39   Long-term fund suitability ratio (A)                     6819 non-null   float64  
 40   Borrowing dependency                                     6819 non-null   float64  
 41   Contingent liabilities/Net worth                         6819 non-null   float64  
 42   Operating profit/Paid-in capital                         6819 non-null   float64  
 43   Net profit before tax/Paid-in capital                    6819 non-null   float64  
 44   Inventory and accounts receivable/Net value              6819 non-null   float64  
 45   Total Asset Turnover                                     6819 non-null   float64  
 46   Accounts Receivable Turnover                             6819 non-null   float64  
 47   Average Collection Days                                  6819 non-null   float64  
 48   Inventory Turnover Rate (times)                          6819 non-null   float64  
 49   Fixed Assets Turnover Frequency                          6819 non-null   float64  
 50   Net Worth Turnover Rate (times)                          6819 non-null   float64  
 51   Revenue per person                                       6819 non-null   float64  
 52   Operating profit per person                              6819 non-null   float64  
 53   Allocation rate per person                               6819 non-null   float64  
 54   Working Capital to Total Assets                          6819 non-null   float64  
 55   Quick Assets/Total Assets                                6819 non-null   float64  
 56   Current Assets/Total Assets                              6819 non-null   float64  
 57   Cash/Total Assets                                        6819 non-null   float64  
 58   Quick Assets/Current Liability                           6819 non-null   float64  
 59   Cash/Current Liability                                   6819 non-null   float64  
 60   Current Liability to Assets                              6819 non-null   float64  
 61   Operating Funds to Liability                             6819 non-null   float64  
 62   Inventory/Working Capital                                6819 non-null   float64  
 63   Inventory/Current Liability                              6819 non-null   float64  
 64   Current Liabilities/Liability                            6819 non-null   float64  
 65   Working Capital/Equity                                   6819 non-null   float64  
 66   Current Liabilities/Equity                               6819 non-null   float64  
 67   Long-term Liability to Current Assets                    6819 non-null   float64  
 68   Retained Earnings to Total Assets                        6819 non-null   float64  
 69   Total income/Total expense                               6819 non-null   float64  
 70   Total expense/Assets                                     6819 non-null   float64  
 71   Current Asset Turnover Rate                              6819 non-null   float64  
 72   Quick Asset Turnover Rate                                6819 non-null   float64  
 73   Working capitcal Turnover Rate                           6819 non-null   float64  
 74   Cash Turnover Rate                                       6819 non-null   float64  
 75   Cash Flow to Sales                                       6819 non-null   float64  
 76   Fixed Assets to Assets                                   6819 non-null   float64  
 77   Current Liability to Liability                           6819 non-null   float64  
 78   Current Liability to Equity                              6819 non-null   float64  
 79   Equity to Long-term Liability                            6819 non-null   float64  
 80   Cash Flow to Total Assets                                6819 non-null   float64  
 81   Cash Flow to Liability                                   6819 non-null   float64  
 82   CFO to Assets                                            6819 non-null   float64  
 83   Cash Flow to Equity                                      6819 non-null   float64  
 84   Current Liability to Current Assets                      6819 non-null   float64  
 85   Liability-Assets Flag                                    6819 non-null   int64    
 86   Net Income to Total Assets                               6819 non-null   float64  
 87   Total assets to GNP price                                6819 non-null   float64  
 88   No-credit Interval                                       6819 non-null   float64  
 89   Gross Profit to Sales                                    6819 non-null   float64  
 90   Net Income to Stockholder's Equity                       6819 non-null   float64  
 91   Liability to Equity                                      6819 non-null   float64  
 92   Degree of Financial Leverage (DFL)                       6819 non-null   float64  
 93   Interest Coverage Ratio (Interest expense to EBIT)       6819 non-null   float64  
 94   Net Income Flag                                          6819 non-null   int64    
 95   Equity to Liability                                      6819 non-null   float64  
dtypes: float64(93), int64(3)  
memory usage: 5.0 MB

python 复制代码

bankruptcy_df.shape

python 复制代码

(6819, 96)

python 复制代码

bankruptcy_df.describe()

3. 数据探索与清洗

3.1 缺失值处理

python 复制代码

bankruptcy_df.columns[bankruptcy_df.isna().any()]

python 复制代码

Index([], dtype='object')

从结果看，改数据集非常完整，没有缺失值！

.any() 指的是有没有(缺失值)，而与之对应的.all()指的是是否都是(缺失值)

调整数据列名

python 复制代码

def clean_col_names(col_name):  
    col_name = (  
        col_name.strip()  
        .replace("?", "_")  
        .replace("(", "_")  
        .replace(")", "_")  
        .replace(" ", "_")  
        .replace("/", "_")  
        .replace("-", "_")  
        .replace("__", "_")  
        .replace("'", "")  
        .lower()  
    )  
    return col_name  
  
bank_columns = list(bankruptcy_df.columns)  
bank_columns = [clean_col_names(col_name) for col_name in bank_columns]  
bankruptcy_df.columns = bank_columns  
display(bankruptcy_df.columns)

python 复制代码

Index(['bankrupt_', 'roa_c_before_interest_and_depreciation_before_interest',  
       'roa_a_before_interest_and_%_after_tax',  
       'roa_b_before_interest_and_depreciation_after_tax',  
       'operating_gross_margin', 'realized_sales_gross_margin',  
       'operating_profit_rate', 'pre_tax_net_interest_rate',  
       'after_tax_net_interest_rate',  
       'non_industry_income_and_expenditure_revenue',  
       'continuous_interest_rate_after_tax_', 'operating_expense_rate',  
       'research_and_development_expense_rate', 'cash_flow_rate',  
       'interest_bearing_debt_interest_rate', 'tax_rate_a_',  
       'net_value_per_share_b_', 'net_value_per_share_a_',  
       'net_value_per_share_c_', 'persistent_eps_in_the_last_four_seasons',  
       'cash_flow_per_share', 'revenue_per_share_yuan_¥_',  
       'operating_profit_per_share_yuan_¥_',  
       'per_share_net_profit_before_tax_yuan_¥_',  
       'realized_sales_gross_profit_growth_rate',  
       'operating_profit_growth_rate', 'after_tax_net_profit_growth_rate',  
       'regular_net_profit_growth_rate', 'continuous_net_profit_growth_rate',  
       'total_asset_growth_rate', 'net_value_growth_rate',  
       'total_asset_return_growth_rate_ratio', 'cash_reinvestment_%',  
       'current_ratio', 'quick_ratio', 'interest_expense_ratio',  
       'total_debt_total_net_worth', 'debt_ratio_%', 'net_worth_assets',  
       'long_term_fund_suitability_ratio_a_', 'borrowing_dependency',  
       'contingent_liabilities_net_worth', 'operating_profit_paid_in_capital',  
       'net_profit_before_tax_paid_in_capital',  
       'inventory_and_accounts_receivable_net_value', 'total_asset_turnover',  
       'accounts_receivable_turnover', 'average_collection_days',  
       'inventory_turnover_rate_times_', 'fixed_assets_turnover_frequency',  
       'net_worth_turnover_rate_times_', 'revenue_per_person',  
       'operating_profit_per_person', 'allocation_rate_per_person',  
       'working_capital_to_total_assets', 'quick_assets_total_assets',  
       'current_assets_total_assets', 'cash_total_assets',  
       'quick_assets_current_liability', 'cash_current_liability',  
       'current_liability_to_assets', 'operating_funds_to_liability',  
       'inventory_working_capital', 'inventory_current_liability',  
       'current_liabilities_liability', 'working_capital_equity',  
       'current_liabilities_equity', 'long_term_liability_to_current_assets',  
       'retained_earnings_to_total_assets', 'total_income_total_expense',  
       'total_expense_assets', 'current_asset_turnover_rate',  
       'quick_asset_turnover_rate', 'working_capitcal_turnover_rate',  
       'cash_turnover_rate', 'cash_flow_to_sales', 'fixed_assets_to_assets',  
       'current_liability_to_liability', 'current_liability_to_equity',  
       'equity_to_long_term_liability', 'cash_flow_to_total_assets',  
       'cash_flow_to_liability', 'cfo_to_assets', 'cash_flow_to_equity',  
       'current_liability_to_current_assets', 'liability_assets_flag',  
       'net_income_to_total_assets', 'total_assets_to_gnp_price',  
       'no_credit_interval', 'gross_profit_to_sales',  
       'net_income_to_stockholders_equity', 'liability_to_equity',  
       'degree_of_financial_leverage_dfl_',  
       'interest_coverage_ratio_interest_expense_to_ebit_', 'net_income_flag',  
       'equity_to_liability'],  
      dtype='object')

统计并绘制目标变量

该步骤的目的是查看目标变量是否平衡，如果不平衡，则需要针对性处理。

python 复制代码

class_bar=sns.countplot(data=bankruptcy_df,x="bankrupt_")  
ax = plt.gca()  
for p in ax.patches:  
        ax.annotate('{:.1f}'.format(p.get_height()), (p.get_x()+0.3, p.get_height()+500))  
class_bar

3.2 特征分布

检查偏态

python 复制代码

# Return true/false if skewed  
import scipy.stats  
skew_df = pd.DataFrame(bankruptcy_df.select_dtypes(np.number).columns, columns = ['Feature'])  
  
skew_df['Skew'] = skew_df['Feature'].apply(lambda feature: scipy.stats.skew(bankruptcy_df[feature]))   
  
skew_df['Absolute Skew'] = skew_df['Skew'].apply(abs)    
# 得到与方向无关的倾斜幅度  
skew_df['Skewed']= skew_df['Absolute Skew'].apply(lambda x: True if x>= 0.5 else False)  
with pd.option_context("display.max_rows", 1000):  
    display(skew_df)

可视化分布

python 复制代码

cols = list(bankruptcy_df.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_df[cols[i]], ax = ax[i // ncols, i % ncols])  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()

查看有偏态的特征

python 复制代码

query_skew=skew_df.query("Skewed == True")["Feature"]  
with pd.option_context("display.max_rows", 1000):  
    display(query_skew)

python 复制代码

0                                             bankrupt_  
2                 roa_a_before_interest_and_%_after_tax  
3      roa_b_before_interest_and_depreciation_after_tax  
4                                operating_gross_margin  
5                           realized_sales_gross_margin  
6                                 operating_profit_rate  
7                             pre_tax_net_interest_rate  
8                           after_tax_net_interest_rate  
9           non_industry_income_and_expenditure_revenue  
10                  continuous_interest_rate_after_tax_  
11                               operating_expense_rate  
12                research_and_development_expense_rate  
13                                       cash_flow_rate  
14                  interest_bearing_debt_interest_rate  
15                                          tax_rate_a_  
16                               net_value_per_share_b_  
17                               net_value_per_share_a_  
18                               net_value_per_share_c_  
19              persistent_eps_in_the_last_four_seasons  
20                                  cash_flow_per_share  
21                            revenue_per_share_yuan_¥_  
22                   operating_profit_per_share_yuan_¥_  
23              per_share_net_profit_before_tax_yuan_¥_  
24              realized_sales_gross_profit_growth_rate  
25                         operating_profit_growth_rate  
26                     after_tax_net_profit_growth_rate  
27                       regular_net_profit_growth_rate  
28                    continuous_net_profit_growth_rate  
29                              total_asset_growth_rate  
30                                net_value_growth_rate  
31                 total_asset_return_growth_rate_ratio  
32                                  cash_reinvestment_%  
33                                        current_ratio  
34                                          quick_ratio  
35                               interest_expense_ratio  
36                           total_debt_total_net_worth  
37                                         debt_ratio_%  
38                                     net_worth_assets  
39                  long_term_fund_suitability_ratio_a_  
40                                 borrowing_dependency  
41                     contingent_liabilities_net_worth  
42                     operating_profit_paid_in_capital  
43                net_profit_before_tax_paid_in_capital  
44          inventory_and_accounts_receivable_net_value  
45                                 total_asset_turnover  
46                         accounts_receivable_turnover  
47                              average_collection_days  
48                       inventory_turnover_rate_times_  
49                      fixed_assets_turnover_frequency  
50                       net_worth_turnover_rate_times_  
51                                   revenue_per_person  
52                          operating_profit_per_person  
53                           allocation_rate_per_person  
57                                    cash_total_assets  
58                       quick_assets_current_liability  
59                               cash_current_liability  
60                          current_liability_to_assets  
61                         operating_funds_to_liability  
62                            inventory_working_capital  
63                          inventory_current_liability  
64                        current_liabilities_liability  
65                               working_capital_equity  
66                           current_liabilities_equity  
67                long_term_liability_to_current_assets  
68                    retained_earnings_to_total_assets  
69                           total_income_total_expense  
70                                 total_expense_assets  
71                          current_asset_turnover_rate  
72                            quick_asset_turnover_rate  
73                       working_capitcal_turnover_rate  
74                                   cash_turnover_rate  
75                                   cash_flow_to_sales  
76                               fixed_assets_to_assets  
77                       current_liability_to_liability  
78                          current_liability_to_equity  
79                        equity_to_long_term_liability  
81                               cash_flow_to_liability  
83                                  cash_flow_to_equity  
84                  current_liability_to_current_assets  
85                                liability_assets_flag  
86                           net_income_to_total_assets  
87                            total_assets_to_gnp_price  
88                                   no_credit_interval  
89                                gross_profit_to_sales  
90                    net_income_to_stockholders_equity  
91                                  liability_to_equity  
92                    degree_of_financial_leverage_dfl_  
93    interest_coverage_ratio_interest_expense_to_ebit_  
95                                  equity_to_liability  
Name: Feature, dtype: object

进行下采样，直至样本集中的破产与非破产比例为50/50。完成之后再次对数据进行偏态检查，决定是否需要做log转换，另外进行相关矩阵分析。

3.3 下采样

首先对数据集进行下采样，目标比例为bankrupt vs non bankrupt = 50 vs 50。

python 复制代码

bankruptcy_df2 = bankruptcy_df.sample(frac=1) #Shuffle Bankruptcy df  
  
bankruptcy_df_b = bankruptcy_df2.loc[bankruptcy_df2["bankrupt_"] == 1]  
bankruptcy_df_nb = bankruptcy_df2.loc[bankruptcy_df2["bankrupt_"] == 0][:220]  
  
bankruptcy_subdf_comb = pd.concat([bankruptcy_df_b,bankruptcy_df_nb])  
bankruptcy_subdf = bankruptcy_subdf_comb.sample(frac=1,random_state=42)  
  
bankruptcy_subdf

再次绘图查看正负样本数。

python 复制代码

sns.countplot(bankruptcy_subdf["bankrupt_"])

随机选择220家非破产公司和220家破产公司。

4. 特征工程

python 复制代码

bankruptcy_subdf2 = bankruptcy_subdf.drop(["net_income_flag"],axis=1)  
bankruptcy_subdf2.shape

python 复制代码

(440, 95)

4.1 相关矩阵

python 复制代码

fig = plt.figure(figsize=(30,20))  
ax1 = fig.add_subplot(1,1,1)  
sns.heatmap(bankruptcy_subdf2.corr(),ax=ax1,cmap="coolwarm")

4.1.1 找出与破产相关的最高特征

根据对破产企业的基本认识，破产企业资产少、负债高、盈利能力低、现金流少。可以朝这个方向分析我们的数据集。

python 复制代码

corr=bankruptcy_subdf2[bankruptcy_subdf2.columns[:-1]].corr()['bankrupt_'][:]  
  
corr_df = pd.DataFrame(corr)  
  
print("Correlations to Bankruptcy:")  
for index, row in corr_df["bankrupt_"].iteritems():  
    if row!=1.0 and row>=0.5:  
        print(f'Positive Correlation: {index}')  
    elif row!=1.0 and row<=-0.5:  
        print(f'Negative Correlation: {index}')

python 复制代码

Correlations to Bankruptcy:  
Negative Correlation: roa_c_before_interest_and_depreciation_before_interest  
Negative Correlation: roa_b_before_interest_and_depreciation_after_tax  
Negative Correlation: net_value_per_share_b_  
Negative Correlation: net_value_per_share_a_  
Negative Correlation: net_value_per_share_c_  
Negative Correlation: persistent_eps_in_the_last_four_seasons  
Negative Correlation: per_share_net_profit_before_tax_yuan_¥_  
Positive Correlation: debt_ratio_%  
Negative Correlation: net_worth_assets  
Negative Correlation: net_profit_before_tax_paid_in_capital  
Negative Correlation: total_income_total_expense

这些特征代表什么

roa_c_before_interest_and_depreciation_before_interest息前资产收益率和息前折旧：总资产收益率--如果总资产收益率低，破产风险高
roa_a_before_interest_and_after_tax息前和税后利润：总资产回报率--如果总资产回报率较低，破产风险较高
roa_b_before_interest_and_depreciation_after_tax利润不计利息及税后折旧：总资产回报率--如果总资产回报率较低，破产风险较高
debt_ratio负债率：负债占总资产的比例--价值越高，负债占资产的比例越高，导致破产风险越高
net_worth_assets净资产：净资产越少，破产风险越高
retained_earnings_to_total_assets留存收益与总资产之比：留存收益越少，破产风险越高
total_income_total_expense总费用：收入与费用之比较低，破产风险较高
net_income_to_total_assets净收入与总资产之比：净收入越低，破产风险越高

从结果看，导致公司违约风险越高的特征，似乎与背景知识一致。

4.2 下采样后特征分布可视化

python 复制代码

# Visualisation of distributions after sub-sampling  
cols = list(bankruptcy_subdf2.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols])  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()

4.3 所有特征的箱线图

python 复制代码

plt.figure(figsize=(30,20))  
boxplot=sns.boxplot(data=bankruptcy_subdf2,orient="h")  
boxplot.set(xscale="log")  
plt.show()

4.4 异常值处理

python 复制代码

quartile1 = bankruptcy_subdf2.quantile(q=0.25,axis=0)  
# display(quartile1)  
quartile3 = bankruptcy_subdf2.quantile(q=0.75,axis=0)  
# display(quartile3)  
IQR = quartile3 -quartile1  
lower_limit = quartile1-1.5*IQR  
upper_limit = quartile3+1.5*IQR  
  
lower_limit = lower_limit.drop(["bankrupt_"])  
upper_limit = upper_limit.drop(["bankrupt_"])  
# print(lower_limit)  
# print(" ")  
# print(upper_limit)  
  
bankruptcy_subdf2_out = bankruptcy_subdf2[((bankruptcy_subdf2<lower_limit) | (bankruptcy_subdf2>upper_limit)).any(axis=1)]  
display(bankruptcy_subdf2_out.shape)  
display(bankruptcy_subdf2.shape)

python 复制代码

(423, 95)  
  
(440, 95)

额外复制一份表，供后续分析处理。

python 复制代码

bankruptcy_subdf3 = bankruptcy_subdf2_out.copy()  
bankruptcy_subdf3

下采样后且去除离群值后的分布可视化。

python 复制代码

# Visualisation of distributions after sub-sampling after outlier removal  
cols = list(bankruptcy_subdf3.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_subdf3[cols[i]], ax = ax[i // ncols, i % ncols],fill=True,color="red")  
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols],color="green")  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()

5 数据预处理

5.1 特征编码

所有类别在基础数据中都已编码完成，因此这里不需要再次编码列。在实际工作中，这一步大概率是必不可少的，编码技术也是尤其重要，需要好好掌握。如果你还不了解或不是很了解，推荐查看：

5.2 Log转换

这一步是为了去除数据中的偏态分布。

python 复制代码

# Log transform to remove skews  
target = bankruptcy_subdf3['bankrupt_']  
bankruptcy_subdf4 = bankruptcy_subdf3.drop(["bankrupt_"],axis=1)  
  
def log_trans(data):  
    for col in data:  
        skew = data[col].skew()  
        if skew>=0.5 or skew<=0.5:  
            data[col] = np.log1p(data[col])  
        else:  
            continue  
    return data  
  
bankruptcy_subdf4_log = log_trans(bankruptcy_subdf4)  
bankruptcy_subdf4_log.head()

5.2.1 Log转换数据的箱线图

python 复制代码

plt.figure(figsize=(30,20))  
boxplot=sns.boxplot(data=bankruptcy_subdf4_log,orient="h")  
boxplot.set(xscale="log")  
plt.show()

5.2.2 Log转换后的数据分布可视化

python 复制代码

# 在下采样后、去除离群值及log变换后的数据分布的可视化  
compare_subdf2 = bankruptcy_subdf2.drop(["bankrupt_"],axis=1)  
  
cols = list(bankruptcy_subdf4.columns)  
ncols = 8  
nrows = math.ceil(len(cols) / ncols)  
  
fig, ax = plt.subplots(nrows, ncols, figsize = (4.5 * ncols, 4 * nrows))  
for i in range(len(cols)):  
    sns.kdeplot(bankruptcy_subdf4_log[cols[i]], ax = ax[i // ncols, i % ncols],fill=True,color="red")  
    sns.kdeplot(bankruptcy_subdf2[cols[i]], ax = ax[i // ncols, i % ncols],color="green")  
    if i % ncols != 0:  
        ax[i // ncols, i % ncols].set_ylabel(" ")  
plt.tight_layout()  
plt.show()  
print("Red represents distributions after log transforms, green represents before log transform")

红色表示Log变换后的分布，绿色表示Log变换前的分布。（完整数据集：关注@公众号：数据STUDIO，联系云朵君获取）

6 使用Pycaret构建模型

本次模型构建使用的是自动机器学习框架pycaret，如果你还没有安装，可使用下述命令安装即可。

python 复制代码

pip install -U --ignore-installed --pre pycaret

在pycaret中自动完成训练及测试数据的切分工作。

python 复制代码

from pycaret.classification import *  
exp_name = setup(data = bankruptcy_subdf4,  target = bankruptcy_subdf3["bankrupt_"])

python 复制代码

compare_models()

Pycaret显示，3种模型的准确性最高的是

LightGBM分类器
梯度提升GBC分类器
XGBoost分类器

接下来将使用这5个模型进行超参数调优。

6.1 选定模型交叉验证

LightGBM

python 复制代码

print("LGBM Model")  
lgb_clf = create_model("lightgbm")  
lgb_clf_scoregrid = pull()

python 复制代码

LGBM Model

GBC

python 复制代码

print("GBC Model")  
gbc_clf = create_model("gbc")  
gbc_clf_scoregrid = pull()

复制代码

GBC Model

XGBoost

python 复制代码

print("XGB Model")  
xgb_clf = create_model("xgboost")  
xgb_clf_scoregrid = pull()

python 复制代码

XGB Model

7 使用Pycaret进行超参数调优

7.1 模型调优

LightGBM

python 复制代码

print("Before Tuning")  
print(lgb_clf_scoregrid.loc[["Mean","Std"]])  
print("")  
lgb_clf = tune_model(lgb_clf,choose_better=True)  
print(lgb_clf)

python 复制代码

Before Tuning  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8433  0.9233  0.8562  0.8497  0.8495  0.6866  0.6929  
Std     0.0524  0.0429  0.0802  0.0681  0.0506  0.1046  0.1048

GBC

python 复制代码

print("Before Tuning")  
print(gbc_clf_scoregrid.loc[["Mean","Std"]])  
print("")  
gbc_clf = tune_model(gbc_clf,choose_better=True)  
print(gbc_clf)

python 复制代码

Before Tuning  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                         
Mean    0.8329  0.9242  0.8558  0.8324  0.8419  0.6649  0.6691  
Std     0.0599  0.0403  0.0634  0.0750  0.0557  0.1204  0.1198

XGBoost

python 复制代码

print("Before Tuning")  
print(xgb_clf_scoregrid.loc[["Mean","Std"]])  
print("")  
xgb_clf = tune_model(xgb_clf,choose_better = True)  
print(xgb_clf)

python 复制代码

Before Tuning  
      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8400  0.9270  0.8562  0.8410  0.8460  0.6797  0.6852  
Std     0.0582  0.0382  0.0906  0.0586  0.0583  0.1161  0.1187

7.2 模型集成

Bagged & Boosting 方法
Blending
Stacking

LightGBM

python 复制代码

# Original  
print(lgb_clf_scoregrid.loc[['Mean', 'Std']])  
  
# Compare the original against bagged and boosted  
  
# Bagged  
lgb_clf = ensemble_model(lgb_clf,fold =5,choose_better = True)  
# Boosted  
lgb_clf = ensemble_model(lgb_clf,method="Boosting",choose_better = True)

python 复制代码

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8433  0.9233  0.8562  0.8497  0.8495  0.6866  0.6929  
Std     0.0524  0.0429  0.0802  0.0681  0.0506  0.1046  0.1048

GBC

python 复制代码

# Original  
print(gbc_clf_scoregrid.loc[['Mean', 'Std']])  
  
# Compare the original against bagged and boosted  
  
# Bagged  
gbc_clf = ensemble_model(gbc_clf,fold =5,choose_better = True)  
# Boosted  
gbc_clf = ensemble_model(gbc_clf,method="Boosting",choose_better = True)

复制代码

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8329  0.9242  0.8558  0.8324  0.8419  0.6649  0.6691  
Std     0.0599  0.0403  0.0634  0.0750  0.0557  0.1204  0.1198

XGBoost

python 复制代码

# Original  
print(xgb_clf_scoregrid.loc[['Mean', 'Std']])  
  
# Compare the original and boosted against bagged and boosted  
  
# Bagged  
xgb_clf = ensemble_model(xgb_clf,fold =5,choose_better = True)  
# Boosted  
xgb_clf = ensemble_model(xgb_clf,method="Boosting",choose_better = True)

python 复制代码

      Accuracy     AUC  Recall   Prec.      F1   Kappa     MCC  
Fold                                                            
Mean    0.8400  0.9270  0.8562  0.8410  0.8460  0.6797  0.6852  
Std     0.0582  0.0382  0.0906  0.0586  0.0583  0.1161  0.1187

7.3.1 Blend Models

python 复制代码

blend_models([lgb_clf, gbc_clf, xgb_clf],choose_better=True)

7.3.2 Stacking

python 复制代码

stacker = stack_models(lgb_clf,gbc_clf)  #remove xgb as some issues

python 复制代码

print(stacker)

8 模型评估

python 复制代码

# evaluate_model(lgb_clf)  
# evaluate_model(gbc_clf)  
# evaluate_model(xgb_clf)

8.1 ROC-AUC

python 复制代码

plot_model(stacker, plot = 'auc')     
# Stacked classifier from ensembling  
plot_model(lgb_clf, plot = 'auc')     
# lgb最适合Bagging集成并被选中  
plot_model(gbc_clf, plot = 'auc')     
# gbc最适合Boosting集成并被选中  
plot_model(xgb_clf, plot = 'auc')     
# 基本的xgb分类器在经过调优和集成后仍然表现最好，因此选择了它

8.2 混淆矩阵

python 复制代码

plot_model(stacker,   
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
plot_model(lgb_clf,   
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
plot_model(gbc_clf,   
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})  
plot_model(xgb_clf,  
           plot = 'confusion_matrix',   
           plot_kwargs = {'percent' : True})

8.3 学习曲线

python 复制代码

plot_model(stacker, plot = 'learning')

python 复制代码

plot_model(lgb_clf, plot = 'learning')

就到这里了！