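Bank Personal Loan Default Risk Prediction (Logistic Regression)

Business pain points: a city commercial bank disburses more than RMB 50 billion in personal loans per year. Manual review relies on experience-based judgment and has three major problems:

  • Low efficiency: a single review takes 2-3 days, with severe backlogs in peak season
  • High subjectivity: reviewers differ widely in how they judge indicators such as "income stability", which leads to inconsistent risk appetite
  • High miss rate: historical data show a first-year default rate of 3.2% even after manual review, with annual losses above RMB 50 million

Project goal: build a logistic regression default-risk model that automates the "application information → default probability" assessment, with the following targets:

  • Model performance: AUC-ROC ≥ 0.85 and KS ≥ 0.4 (separating defaulting from normal customers)
  • Business efficiency: review time cut from 3 days to 2 hours; manual intervention rate reduced by 50%
  • Risk control: first-year default rate below 2.8%; annual losses reduced by at least RMB 30 million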

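Development Environment and Toolchain

  • Language: Python 3.9
  • Data processing: Pandas 1.5+, NumPy 1.23+, Imbalanced-learn (SMOTE)
  • Model training: Scikit-learn 1.2+ (logistic regression), SHAP (feature-importance explanation)
  • Experiment tracking: MLflow (logs parameters/metrics/models)
  • Serving: Flask 2.3+, Gunicorn (WSGI server), Docker 24.0+
  • Version control: Git + DVC (data versioning)
  • Monitoring: Prometheus (metrics collection) + Grafana (visualization)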

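Data Preparation and Feature Transformation

(1) Raw data structure (examples)

① Loan application form (loan_applications.csv)

② Central bank credit-bureau data (credit_records.csv)

③ Historical loan database (historical_loans.csv, includes labels)

④ Third-party data (social security / housing provident fund, social_security.csv)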

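(2) Data cleaning and feature engineering

Missing values:

  • income_monthly (monthly income): for freelancers, fill with "social-security contribution base × industry average multiplier" (e.g., 1.2× for the internet industry)
  • overdue_times_1y (overdue count in the past year): fill with 0 for applicants with no credit record (treated as thin-file applicants)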
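
The two imputation rules can be sketched in plain Python as follows. This is a minimal illustration, not the project's implementation: the record layout, the `INDUSTRY_MULTIPLIER` table, and the function names are assumptions, and the fallback to the overall median for all other applicants follows the cleaning code shown later in this article.

```python
from statistics import median

# Hypothetical multiplier table; only the 1.2 figure comes from the text.
INDUSTRY_MULTIPLIER = {"自由职业": 1.2}


def impute_income(applicants, ss_base_by_id):
    """Fill missing income_monthly (None) in place and return the list.

    applicants: list of dicts with applicant_id, occupation, income_monthly
    ss_base_by_id: dict mapping applicant_id -> social-security contribution base
    """
    known = [a["income_monthly"] for a in applicants if a["income_monthly"] is not None]
    fallback = median(known)  # used when the base-times-multiplier rule does not apply
    for a in applicants:
        if a["income_monthly"] is not None:
            continue
        base = ss_base_by_id.get(a["applicant_id"])
        multiplier = INDUSTRY_MULTIPLIER.get(a["occupation"])
        if base is not None and multiplier is not None:
            a["income_monthly"] = base * multiplier
        else:
            a["income_monthly"] = fallback
    return applicants


def impute_overdue(overdue_times_1y):
    """No credit record (None) is treated as zero overdue events."""
    return 0 if overdue_times_1y is None else overdue_times_1y


rows = impute_income(
    [
        {"applicant_id": "A1", "occupation": "自由职业", "income_monthly": None},
        {"applicant_id": "A2", "occupation": "公务员", "income_monthly": 10000},
        {"applicant_id": "A3", "occupation": "互联网", "income_monthly": None},
        {"applicant_id": "A4", "occupation": "国企", "income_monthly": 14000},
    ],
    {"A1": 8000},
)
```

Here A1 is filled from the contribution base, while A3 has no matching rule and receives the median of the known incomes.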

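Outlier handling:

  • debt_ratio > 1 (liabilities exceed assets) is treated as invalid and replaced with the median for the same occupation
  • age < 22 or > 65 (outside the usual working-age range): flagged as high risk and placed in a separate group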
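
A minimal sketch of these two outlier rules, using only the standard library. The record layout and the `age_out_of_range` flag name are assumptions made for illustration; computing each occupation's median from valid values only, and falling back to the overall median when an occupation has no valid value, are also assumptions.

```python
from collections import defaultdict
from statistics import median


def clean_outliers(records):
    """Return cleaned copies of records (dicts with occupation, debt_ratio, age)."""
    # Collect valid debt ratios per occupation so invalid values do not skew the median.
    valid_by_occ = defaultdict(list)
    for r in records:
        if 0 <= r["debt_ratio"] <= 1:
            valid_by_occ[r["occupation"]].append(r["debt_ratio"])
    occ_median = {occ: median(vals) for occ, vals in valid_by_occ.items()}
    all_valid = [v for vals in valid_by_occ.values() for v in vals]
    overall = median(all_valid) if all_valid else None  # assumed fallback

    cleaned = []
    for r in records:
        row = dict(r)
        if row["debt_ratio"] > 1:
            row["debt_ratio"] = occ_median.get(row["occupation"], overall)
        # Out-of-range ages are flagged, not dropped, so they can be grouped separately.
        row["age_out_of_range"] = int(row["age"] < 22 or row["age"] > 65)
        cleaned.append(row)
    return cleaned


cleaned_rows = clean_outliers(
    [
        {"occupation": "民企", "debt_ratio": 0.2, "age": 30},
        {"occupation": "民企", "debt_ratio": 0.6, "age": 45},
        {"occupation": "民企", "debt_ratio": 1.5, "age": 70},
        {"occupation": "国企", "debt_ratio": 0.4, "age": 21},
    ]
)
```

The third record has its debt ratio replaced by the occupation median, and the last two records are flagged for age.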

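Feature extraction:

Feature encoding and standardization

  • Occupation categories (internet, civil servant, freelancer) use LabelEncoder
  • Loan purposes (home purchase/renovation, children's education, startup capital) use One-Hot encoding
  • Income and social-security years are standardized with StandardScaler
  • SMOTE oversampling synthesizes minority-class (default) samples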
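
To make the encoding and scaling steps concrete, the sketch below reimplements them in plain Python. It is illustrative only: `standardize` and `one_hot_drop_first` are hypothetical helper names, and the project itself relies on scikit-learn's StandardScaler and OneHotEncoder.

```python
from statistics import mean, pstdev


def standardize(values):
    """z-score using the population standard deviation (ddof=0), as StandardScaler does."""
    mu, sigma = mean(values), pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]  # a constant column carries no information
    return [(v - mu) / sigma for v in values]


def one_hot_drop_first(values):
    """One-hot encode, dropping the first (sorted) category as the baseline."""
    categories = sorted(set(values))
    kept = categories[1:]
    rows = [[int(v == c) for c in kept] for v in values]
    return kept, rows


income_z = standardize([8000, 12000, 16000])
kept_cols, purpose_rows = one_hot_drop_first(["a", "b", "c", "a"])
```

Dropping one category keeps the dummy columns from summing to a constant, which would otherwise be collinear with the intercept of the logistic regression.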

Processed data (feature matrix)

Code Structure

credit_risk_prediction/          # project root
├── data/                          # data storage
│   ├── raw/                       # raw data (tracked by DVC)
│   │   ├── loan_applications.csv
│   │   ├── credit_records.csv
│   │   ├── historical_loans.csv
│   │   └── social_security.csv
│   ├── processed/                 # processed data
│   │   └── features_train.parquet
│   └── external/                   # external data (occupation-stability map)
│       └── occupation_stability_map.json
├── src/                           # source code
│   ├── data_processing/           # data-processing module
│   │   ├── __init__.py
│   │   ├── clean_data.py          # data cleaning
│   │   └── feature_engineering.py # feature engineering
│   ├── model/                      # model module
│   │   ├── __init__.py
│   │   ├── train.py               # model training (logistic regression)
│   │   ├── evaluate.py            # model evaluation (AUC/KS)
│   │   ├── explain.py             # feature-importance explanation (SHAP)
│   │   └── predict.py             # inference
│   ├── api/                        # API service
│   │   ├── app.py                  # Flask entry point
│   │   └── schemas.py              # request/response schemas (Pydantic)
│   └── utils/                      # utilities
│       ├── logger.py               # logging configuration
│       ├── config.py               # configuration (paths/parameters)
│       └── metrics.py              # custom evaluation metrics (KS)
├── tests/                         # unit tests
│   ├── test_feature_engineering.py
│   └── test_model.py
├── docker/                        # Docker deployment files
│   ├── Dockerfile
│   └── requirements.txt
├── mlruns/                        # MLflow experiment tracking (ignored by Git)
├── README.md                      # project documentation
└── requirements.txt               # Python dependencies

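Data Cleaning and Feature Engineering (src/data_processing/feature_engineering.py)

Repayment capacity. Business meaning: measures the borrower's repayment burden; the higher the ratio, the higher the risk. The ratio is built in three steps:

  • loan_amount / (loan_term * 12): base monthly repayment (principal only)
  • × 1.1: adds 10% as an interest estimate
  • / income_monthly: yields the debt-to-income ratio for this loan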
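
A worked example of the three steps, using the figures from the sample API request at the end of this article (a 250,000 loan over 48 months with a monthly income of 18,000). The function and parameter names are illustrative, and the sketch assumes the term is expressed in years, which is what the multiplication by 12 implies.

```python
def debt_service_ratio(loan_amount, loan_term_years, income_monthly, interest_factor=1.1):
    """Estimated monthly repayment (principal plus 10% interest estimate) over monthly income."""
    monthly_principal = loan_amount / (loan_term_years * 12)
    return monthly_principal * interest_factor / income_monthly


# 250,000 over 4 years (48 months) with a monthly income of 18,000
ratio = debt_service_ratio(250_000, 4, 18_000)
print(f"{ratio:.4f}")  # 0.3183
```

Roughly 32% of this applicant's monthly income would go to servicing the loan.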

Willingness-to-repay features:

  • debt_ratio bins: 0-30% (0), 30-50% (1), 50-70% (2), 70-100% (3)
  • max_overdue_days bins: no overdue (0), 1-30 days (1), 31-90 days (2), more than 90 days (3)
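
The binning rules can be sketched with the standard-library `bisect` module. The helper names are illustrative; the sketch assumes right-closed intervals (a ratio of exactly 0.3 falls in bin 0), which is the default behaviour of the `pd.cut` calls used in the project code.

```python
from bisect import bisect_left

DEBT_RATIO_EDGES = [0.3, 0.5, 0.7]  # upper edges of bins 0, 1, 2; anything above is bin 3
OVERDUE_DAY_EDGES = [0, 30, 90]     # 0 days -> bin 0, 1-30 -> 1, 31-90 -> 2, >90 -> 3


def debt_ratio_bin(debt_ratio):
    """Map a debt ratio in [0, 1] to its bin index (intervals closed on the right)."""
    return bisect_left(DEBT_RATIO_EDGES, debt_ratio)


def max_overdue_bin(max_overdue_days):
    """Map the longest overdue period in days to its bin index."""
    return bisect_left(OVERDUE_DAY_EDGES, max_overdue_days)


print(debt_ratio_bin(0.55), max_overdue_bin(45))  # 2 2
```

`bisect_left` returns the position of the first edge that is greater than or equal to the value, which matches right-closed intervals without any explicit comparisons.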

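Stability features:

  • Map the occupation type to a stability score (occ_map is a predefined mapping dictionary)
  • Occupations missing from the map default to 0 (least stable)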
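
A small sketch of the lookup with its default. The scores reuse the mapping hard-coded in the API service later in this article; reading them from an inline JSON string stands in for loading occupation_stability_map.json, and the function name is hypothetical.

```python
import json

# Stand-in for the contents of occupation_stability_map.json (values taken from the API code).
OCC_MAP_JSON = '{"公务员": 3, "事业单位": 3, "国企": 2, "民企": 1, "自由职业": 0}'
occ_map = json.loads(OCC_MAP_JSON)


def occupation_stability(occupation, mapping=occ_map, default=0):
    """Return the stability score, or the default (least stable) for unmapped occupations."""
    return mapping.get(occupation, default)


scores = [occupation_stability(o) for o in ("公务员", "民企", "互联网")]
```

Note that an unmapped occupation such as "互联网" ends up with the same score as a freelancer, so the map should cover every occupation that appears in the application data.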

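Data merging:

apps_df (application information)
    ↓ merge ss_df (social security / employment information)
        ↓ merge credit_df (credit information)
            ↓ merge hist_df (historical performance, contains the label)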

import json

import numpy as np
import pandas as pd
from imblearn.over_sampling import SMOTE  # handles class imbalance
from sklearn.preprocessing import LabelEncoder, OneHotEncoder, StandardScaler

from src.utils.logger import get_logger

logger = get_logger(__name__)


def load_occupation_stability_map(path: str) -> dict:
    """Load the occupation-stability mapping table (external config)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


def feature_engineering(raw_data_dir: str, external_dir: str, output_path: str):
    """
    Main feature-engineering function:
    load raw data → clean → extract features → encode → standardize → oversample → save
    Args:
        raw_data_dir: raw data directory (contains the 4 CSV files)
        external_dir: external data directory (occupation-stability map)
        output_path: where to save the processed feature matrix (parquet)
    Returns:
        processed_df: processed feature matrix (with label)
        scaler: fitted StandardScaler (reused at prediction time)
        smote: SMOTE object (records the sampling parameters)
    """
    # 1. Load raw data
    apps_df = pd.read_csv(f"{raw_data_dir}/loan_applications.csv")
    credit_df = pd.read_csv(f"{raw_data_dir}/credit_records.csv")
    hist_df = pd.read_csv(f"{raw_data_dir}/historical_loans.csv")
    ss_df = pd.read_csv(f"{raw_data_dir}/social_security.csv")
    occ_map = load_occupation_stability_map(f"{external_dir}/occupation_stability_map.json")

    # 2. Data cleaning
    # 2.1 Fill missing monthly income: freelancers with a social-security record get
    #     contribution base * 1.2; all other cases get the overall median
    ss_base = ss_df.drop_duplicates("applicant_id").set_index("applicant_id")["social_security_base"]
    base = apps_df["applicant_id"].map(ss_base)
    use_base = (apps_df["occupation"] == "自由职业") & base.notna()
    income_fill = (base * 1.2).where(use_base, apps_df["income_monthly"].median())
    apps_df["income_monthly"] = apps_df["income_monthly"].fillna(income_fill)

    # 2.2 Debt-ratio outliers: values > 1 are invalid and replaced with the median
    #     of the valid values for the same occupation
    occupation = credit_df["applicant_id"].map(
        apps_df.drop_duplicates("applicant_id").set_index("applicant_id")["occupation"]
    )
    invalid = credit_df["debt_ratio"] > 1
    occ_median_debt = credit_df.loc[~invalid, "debt_ratio"].groupby(occupation[~invalid]).median()
    credit_df.loc[invalid, "debt_ratio"] = occupation[invalid].map(occ_median_debt)

    # 3. Feature extraction (core business logic)
    # 3.1 Repayment capacity: monthly repayment incl. estimated interest / monthly income
    apps_df["debt_service_ratio"] = (
        apps_df["loan_amount"] / (apps_df["loan_term"] * 12) * 1.1 / apps_df["income_monthly"]
    )
    # Bin the continuous debt_ratio; include_lowest keeps a ratio of exactly 0 in bin 0
    credit_df["debt_ratio_bin"] = pd.cut(
        credit_df["debt_ratio"], bins=[0, 0.3, 0.5, 0.7, 1], labels=False, include_lowest=True
    )

    # 3.2 Willingness-to-repay features
    credit_df["overdue_flag"] = (credit_df["overdue_times_1y"] > 0).astype(int)
    credit_df["max_overdue_bin"] = pd.cut(
        credit_df["max_overdue_days"], bins=[-1, 0, 30, 90, np.inf], labels=False
    )

    # 3.3 Stability features: unmapped occupations default to 0
    apps_df["occupation_stability"] = apps_df["occupation"].map(occ_map).fillna(0)
    merged_df = (
        apps_df.merge(ss_df, on="applicant_id", how="left")
        .merge(credit_df, on="applicant_id", how="left")
        .merge(hist_df[["applicant_id", "default_label"]], on="applicant_id", how="left")
    )
    # Rows without a label cannot be used for training
    merged_df = merged_df.dropna(subset=["default_label"]).reset_index(drop=True)
    # Applicants with no credit record are treated as having no overdue history
    merged_df["overdue_flag"] = merged_df["overdue_flag"].fillna(0).astype(int)

    # 4. Feature encoding and standardization
    # 4.1 Categorical features
    # Ordered category: occupation_stability (LabelEncoder re-ranks the sorted scores)
    le = LabelEncoder()
    merged_df["occupation_stability_enc"] = le.fit_transform(merged_df["occupation_stability"])
    # Unordered category: purpose (one-hot; drop one level to avoid multicollinearity)
    ohe = OneHotEncoder(sparse_output=False, drop="first")
    purpose_ohe = ohe.fit_transform(merged_df[["purpose"]])
    purpose_cols = list(ohe.get_feature_names_out(["purpose"]))  # excludes the dropped level
    purpose_df = pd.DataFrame(purpose_ohe, columns=purpose_cols, index=merged_df.index)
    merged_df = pd.concat([merged_df, purpose_df], axis=1)  # attach so they can be selected below

    # 4.2 Standardize continuous features
    continuous_features = ["income_monthly", "social_security_years"]
    scaler = StandardScaler()
    merged_df[continuous_features] = scaler.fit_transform(merged_df[continuous_features])
    merged_df.rename(
        columns={"income_monthly": "income_norm", "social_security_years": "social_security_years_norm"},
        inplace=True,
    )

    # 5. Class imbalance (SMOTE oversampling)
    # historical_default_flag is assumed to be present in one of the merged tables.
    # NOTE: SMOTE cannot handle NaN, so any remaining missing values must be imputed first.
    # NOTE: train.py splits this already-oversampled matrix, so synthetic samples reach
    #       the test set and test metrics will be optimistic.
    features = [
        "income_norm", "debt_service_ratio", "debt_ratio_bin", "overdue_flag",
        "occupation_stability_enc", "social_security_years_norm", "historical_default_flag",
    ] + purpose_cols
    X = merged_df[features]
    y = merged_df["default_label"].astype(int)  # label: 1 = default, 0 = normal
    smote = SMOTE(random_state=42, sampling_strategy=1.0)  # 1:1 positive/negative ratio
    X_resampled, y_resampled = smote.fit_resample(X, y)

    # 6. Build and save the final feature matrix
    processed_df = pd.concat(
        [X_resampled.reset_index(drop=True), y_resampled.reset_index(drop=True)], axis=1
    )
    processed_df.to_parquet(output_path, index=False)
    logger.info(
        f"Feature engineering done, saved to {output_path}; samples: {len(processed_df)} "
        f"(positives: {int(y.sum())} before, {int(y_resampled.sum())} after oversampling)"
    )

    return processed_df, scaler, smote
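
The core idea behind SMOTE can be sketched in a few lines of plain Python: each synthetic sample is a random point on the segment between a minority sample and one of its nearest minority neighbours. This is a simplified illustration with hypothetical names, not the imbalanced-learn implementation used above, and it assumes at least two minority samples.

```python
import random


def smote_sketch(minority, n_new, k=2, seed=42):
    """Generate n_new synthetic points by interpolating between minority neighbours."""
    rng = random.Random(seed)

    def dist2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base), key=lambda p: dist2(base, p)
        )[:k]
        neighbour = rng.choice(neighbours)
        lam = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + lam * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


minority_points = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_points = smote_sketch(minority_points, n_new=5)
```

Because synthetic points are interpolated from real ones, generating them before the train/test split places near-copies of training rows in the test set; oversampling only the training portion avoids that.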
	

Model Training and Evaluation (src/model/train.py & evaluate.py)

import os

import joblib
import mlflow
import mlflow.sklearn
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

from src.utils.logger import get_logger

logger = get_logger(__name__)


def train_logistic_regression(
    features_path: str,
    test_size: float = 0.2,
    random_state: int = 42,
    C: float = 1.0,
    scaler=None,
):
    """
    Train a logistic regression model (with L2 regularization).
    Args:
        features_path: path to the feature matrix (parquet)
        test_size: test-set fraction
        random_state: random seed
        C: inverse of regularization strength (smaller C = stronger regularization)
        scaler: fitted StandardScaler returned by feature engineering; saved next to
                the model when provided
    Returns:
        model: trained logistic regression model
        X_test, y_test: test-set features and labels
    """
    # Load the feature matrix
    df = pd.read_parquet(features_path)
    X = df.drop(columns=["default_label"])
    y = df["default_label"]

    # Train/test split; stratify keeps the class distribution
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=test_size, random_state=random_state, stratify=y
    )

    # Train the logistic regression model (L2 regularization, sigmoid link)
    model = LogisticRegression(
        penalty="l2",  # L2 regularization
        C=C,  # inverse of regularization strength
        solver="liblinear",  # suited to small datasets
        class_weight="balanced",  # balanced class weights (alternative to oversampling)
        random_state=random_state,
    )
    model.fit(X_train, y_train)
    train_acc = model.score(X_train, y_train)
    test_acc = model.score(X_test, y_test)

    # Log the experiment (MLflow)
    with mlflow.start_run():
        mlflow.log_param("model", "LogisticRegression")
        mlflow.log_param("penalty", "l2")
        mlflow.log_param("C", C)
        mlflow.log_metric("train_accuracy", train_acc)
        mlflow.log_metric("test_accuracy", test_acc)
        mlflow.sklearn.log_model(model, "logistic_regression_model")
        logger.info(f"Training finished, C={C}, test accuracy={test_acc:.4f}")

    # Save the model and, if provided, the scaler
    os.makedirs("model", exist_ok=True)
    joblib.dump(model, "model/logistic_regression_model.pkl")
    if scaler is not None:
        joblib.dump(scaler, "model/scaler.pkl")

    return model, X_test, y_test
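
What the fitted model computes can be sketched without scikit-learn. The sketch below shows the sigmoid mapping from a weighted feature sum to a default probability and an L2-penalized objective of the form 0.5·‖w‖² + C·(log-loss), which is the form documented for scikit-learn's logistic regression. The function names are illustrative; leaving the intercept out of the penalty is an assumption that does not hold for every solver.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def predict_default_probability(weights, bias, features):
    """p(default) = sigmoid(bias + w . x)."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(z)


def l2_objective(weights, bias, X, y, C=1.0):
    """0.5 * ||w||^2 + C * sum of per-sample log-loss (labels in {0, 1})."""
    penalty = 0.5 * sum(w * w for w in weights)
    log_loss = 0.0
    for row, label in zip(X, y):
        p = predict_default_probability(weights, bias, row)
        p = min(max(p, 1e-15), 1 - 1e-15)  # avoid log(0)
        log_loss -= label * math.log(p) + (1 - label) * math.log(1 - p)
    return penalty + C * log_loss


X_demo, y_demo = [[1.0], [2.0]], [1, 0]
baseline = l2_objective([0.0], 0.0, X_demo, y_demo, C=1.0)
```

A smaller C shrinks the weight of the data-fit term relative to the penalty, which is why lowering C pushes the coefficients toward zero.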

# src/model/evaluate.py
from sklearn.metrics import confusion_matrix, roc_auc_score

from src.utils.logger import get_logger
from src.utils.metrics import calculate_ks  # custom KS computation

logger = get_logger(__name__)


def evaluate_model(model, X_test, y_test):
    """
    Evaluate model performance (AUC-ROC / KS / confusion matrix).
    Args:
        model: trained model
        X_test: test-set features
        y_test: test-set labels
    Returns:
        metrics: dictionary of evaluation metrics

    Acceptance criteria: test-set AUC-ROC ≥ 0.85, KS ≥ 0.4, precision ≥ 0.75, recall ≥ 0.60
    """
    y_pred_proba = model.predict_proba(X_test)[:, 1]  # default probability (positive class)
    y_pred = model.predict(X_test)  # predicted class (threshold = 0.5)

    # Core metrics
    auc = roc_auc_score(y_test, y_pred_proba)
    ks = calculate_ks(y_test, y_pred_proba)
    tn, fp, fn, tp = confusion_matrix(y_test, y_pred, labels=[0, 1]).ravel()
    precision = tp / (tp + fp) if (tp + fp) > 0 else 0
    recall = tp / (tp + fn) if (tp + fn) > 0 else 0

    metrics = {
        "AUC-ROC": auc,
        "KS": ks,
        "Precision": precision,
        "Recall": recall,
        "ConfusionMatrix": {"TN": int(tn), "FP": int(fp), "FN": int(fn), "TP": int(tp)},
    }
    logger.info(f"Model evaluation results: {metrics}")
    return metrics


# src/utils/metrics.py (KS computation)
import pandas as pd


def calculate_ks(y_true, y_pred_proba):
    """Compute the KS statistic (ability to separate defaulting from normal customers)."""
    df = pd.DataFrame({"y_true": y_true, "y_pred_proba": y_pred_proba}).sort_values("y_pred_proba", ascending=False)
    df["cum_good"] = (1 - df["y_true"]).cumsum() / (1 - df["y_true"]).sum()
    df["cum_bad"] = df["y_true"].cumsum() / df["y_true"].sum()
    df["ks"] = df["cum_bad"] - df["cum_good"]
    return df["ks"].max()
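
To see how the two acceptance metrics behave on a concrete case, here is a dependency-free sketch with a six-row toy example. The function names and the toy scores are illustrative; the KS logic mirrors the cumulative-distribution approach above, and AUC is computed as the share of (defaulter, non-defaulter) pairs that the model ranks correctly, with ties counted as half.

```python
def ks_statistic(y_true, scores):
    """Max gap between cumulative bad and good rates, scanning scores from high to low."""
    pairs = sorted(zip(scores, y_true), key=lambda p: p[0], reverse=True)
    n_bad = sum(y_true)
    n_good = len(y_true) - n_bad
    cum_bad = cum_good = 0
    best = 0.0
    for _, label in pairs:
        if label == 1:
            cum_bad += 1
        else:
            cum_good += 1
        best = max(best, cum_bad / n_bad - cum_good / n_good)
    return best


def auc_pairwise(y_true, scores):
    """Probability that a random defaulter scores higher than a random non-defaulter."""
    bad = [s for s, y in zip(scores, y_true) if y == 1]
    good = [s for s, y in zip(scores, y_true) if y == 0]
    wins = sum(1.0 if b > g else 0.5 if b == g else 0.0 for b in bad for g in good)
    return wins / (len(bad) * len(good))


labels = [1, 1, 0, 1, 0, 0]
probas = [0.9, 0.8, 0.7, 0.6, 0.3, 0.1]
print(round(ks_statistic(labels, probas), 4), round(auc_pairwise(labels, probas), 4))
# 0.6667 0.8889
```

On this toy data one non-defaulter (0.7) outranks one defaulter (0.6), which costs one of the nine pairs in the AUC and caps the KS at two thirds.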

Model Serving (Flask API, src/api/app.py)

import joblib
import pandas as pd
from flask import Flask, jsonify, request
from pydantic import BaseModel, Field, ValidationError  # request validation

from src.utils.logger import get_logger

logger = get_logger(__name__)
app = Flask(__name__)

# Load the model and server-side resources once at startup
model = joblib.load("model/logistic_regression_model.pkl")
scaler = joblib.load("model/scaler.pkl")
occupation_stability_map = {"公务员": 3, "事业单位": 3, "国企": 2, "民企": 1, "自由职业": 0}


# Request validation schema
class PredictionRequest(BaseModel):
    applicant_id: str = Field(..., description="Applicant ID")
    age: int = Field(..., ge=22, le=65, description="Age (22-65)")
    occupation: str = Field(..., description="Occupation")
    income_monthly: float = Field(..., gt=0, description="Monthly income (RMB)")
    loan_amount: float = Field(..., gt=0, description="Loan amount (RMB)")
    loan_term: int = Field(..., ge=12, le=60, description="Loan term (months, 12-60)")
    purpose: str = Field(..., description="Loan purpose")
    overdue_times_1y: int = Field(default=0, ge=0, description="Overdue count in the past year")
    debt_ratio: float = Field(default=0.0, ge=0, lt=1, description="Debt ratio")
    social_security_years: float = Field(default=0.0, ge=0, description="Years of social-security contributions")
    historical_default_flag: int = Field(default=0, ge=0, le=1, description="Historical default flag (0/1)")


@app.route("/predict_risk", methods=["POST"])
def predict_risk():
    """
    Default-risk prediction API.
    Request: JSON matching PredictionRequest
    Response: JSON (default probability, risk level, key factors)
    """
    # 1. Validate and parse the request; malformed input is a client error (400)
    try:
        req = PredictionRequest(**(request.get_json(silent=True) or {}))
    except (ValidationError, TypeError) as e:
        return jsonify({"error": str(e)}), 400

    try:
        # 2. Build a single-row sample
        sample_df = pd.DataFrame([{
            "applicant_id": req.applicant_id,
            "age": req.age,
            "occupation": req.occupation,
            "income_monthly": req.income_monthly,
            "loan_amount": req.loan_amount,
            "loan_term": req.loan_term,
            "purpose": req.purpose,
            "overdue_times_1y": req.overdue_times_1y,
            "debt_ratio": req.debt_ratio,
            "social_security_years": req.social_security_years,
            "historical_default_flag": req.historical_default_flag,
        }])

        # 3. Feature engineering (simplified; must stay consistent with training)
        # 3.1 Repayment capacity. loan_term arrives in months here, so it is used directly;
        #     this assumes the training data stores loan_term in years, as the "* 12"
        #     in feature_engineering.py implies.
        sample_df["debt_service_ratio"] = (
            sample_df["loan_amount"] / sample_df["loan_term"] * 1.1 / sample_df["income_monthly"]
        )
        sample_df["debt_ratio_bin"] = pd.cut(
            sample_df["debt_ratio"], bins=[0, 0.3, 0.5, 0.7, 1], labels=False, include_lowest=True
        )
        sample_df["overdue_flag"] = (sample_df["overdue_times_1y"] > 0).astype(int)
        # 3.2 Stability (simplified: uses the raw score; production should reuse the fitted encoder)
        sample_df["occupation_stability_enc"] = sample_df["occupation"].map(occupation_stability_map).fillna(0)
        # 3.3 Standardize continuous features with the scaler fitted at training time
        scaled = scaler.transform(sample_df[["income_monthly", "social_security_years"]])
        sample_df["income_norm"] = scaled[:, 0]
        sample_df["social_security_years_norm"] = scaled[:, 1]
        # 3.4 One-hot purpose: set the matching dummy; the baseline or an unseen purpose stays all-zero
        sample_df[f"purpose_{req.purpose}"] = 1

        # 4. Select features in training order and predict
        #    (feature_names_in_ is available because the model was fitted on a DataFrame)
        X_sample = sample_df.reindex(columns=model.feature_names_in_, fill_value=0)
        default_prob = float(model.predict_proba(X_sample)[0][1])  # default probability

        # 5. Build the response (risk level and explanation)
        risk_level = "高风险" if default_prob > 0.6 else "中风险" if default_prob > 0.3 else "低风险"  # high / medium / low
        response = {
            "applicant_id": req.applicant_id,
            "default_probability": round(default_prob, 4),  # default probability (0-1)
            "risk_level": risk_level,
            "key_factors": [  # placeholder example values, not read from the fitted model
                {"factor": "负债比", "impact": "正向影响", "coefficient": 0.85},
                {"factor": "社保缴纳年限", "impact": "负向影响", "coefficient": -0.62},
            ],
            "timestamp": pd.Timestamp.now().isoformat(),
        }
        return jsonify(response), 200
    except Exception as e:
        logger.error(f"Prediction failed: {e}", exc_info=True)
        return jsonify({"error": str(e)}), 500


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000, debug=False)  # use Gunicorn in production
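
The response logic can be sketched as two small pure functions: one applies the 0.3 / 0.6 thresholds used above, the other ranks factors by the size of their contribution (coefficient × feature value) instead of returning fixed example values. The function names, the contribution-based ranking, and the sample coefficients and feature values are all assumptions made for illustration.

```python
def risk_level(default_prob):
    """Map a default probability to the three tiers used by the API."""
    if default_prob > 0.6:
        return "高风险"  # high risk
    if default_prob > 0.3:
        return "中风险"  # medium risk
    return "低风险"      # low risk


def key_factors(coefficients, feature_values, top_n=2):
    """Rank features by |coefficient * value| for a single applicant."""
    factors = []
    for name, coef in coefficients.items():
        contribution = coef * feature_values[name]
        factors.append({
            "factor": name,
            "coefficient": coef,
            "contribution": contribution,
            "impact": "raises risk" if contribution > 0 else "lowers risk",
        })
    factors.sort(key=lambda f: abs(f["contribution"]), reverse=True)
    return factors[:top_n]


# Hypothetical coefficients and standardized feature values for one applicant
demo_factors = key_factors(
    {"debt_ratio": 0.85, "social_security_years": -0.62, "overdue_flag": 0.10},
    {"debt_ratio": 0.55, "social_security_years": 1.0, "overdue_flag": 0.2},
)
```

With a fitted scikit-learn model, the coefficient dictionary could be built by pairing the feature names with `model.coef_[0]`, so that the explanation reflects the model actually being served.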

Containerized Deployment

FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY src/ ./src/
COPY model/ ./model/
EXPOSE 5000
# 4 worker processes (a trailing comment on the CMD line would break the exec form)
CMD ["gunicorn", "--workers", "4", "--bind", "0.0.0.0:5000", "src.api.app:app"]

Nginx Configuration (Load Balancing)

upstream credit_risk_api {
    server 10.0.0.1:5000;  # container 1
    server 10.0.0.2:5000;  # container 2
}
server {
    listen 80;
    location /predict_risk {
        proxy_pass http://credit_risk_api;
        proxy_set_header Host $host;
    }
}

Example API Call

curl -X POST http://credit-risk-api/predict_risk \
  -H "Content-Type: application/json" \
  -d '{
    "applicant_id": "APP005",
    "age": 32,
    "occupation": "互联网",
    "income_monthly": 18000,
    "loan_amount": 250000,
    "loan_term": 48,
    "purpose": "购房装修",
    "overdue_times_1y": 1,
    "debt_ratio": 0.55,
    "social_security_years": 6,
    "historical_default_flag": 0
  }'
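
The same request can be sent from Python using only the standard library. This is a sketch: the helper names are hypothetical, and the host `credit-risk-api` is the placeholder from the curl command above and has to resolve in your environment.

```python
import json
import urllib.request


def build_predict_request(base_url, payload):
    """Build the POST request for /predict_risk without sending it."""
    return urllib.request.Request(
        url=f"{base_url}/predict_risk",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def predict_risk(base_url, payload, timeout=10):
    """Send the request and return the decoded JSON response."""
    request = build_predict_request(base_url, payload)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


payload = {
    "applicant_id": "APP005",
    "age": 32,
    "occupation": "互联网",
    "income_monthly": 18000,
    "loan_amount": 250000,
    "loan_term": 48,
    "purpose": "购房装修",
    "overdue_times_1y": 1,
    "debt_ratio": 0.55,
    "social_security_years": 6,
    "historical_default_flag": 0,
}
prepared = build_predict_request("http://credit-risk-api", payload)
# result = predict_risk("http://credit-risk-api", payload)  # requires a running service
```

Splitting request construction from sending keeps the payload logic testable without a live service.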