Telecom Operator User Segmentation and Precision Operations (K-Means Clustering)

Business pain points: a provincial telecom operator with over 20 million subscribers faces three major problems:

  • Coarse-grained operations: a one-size-fits-all "data plan upgrade" campaign converts only 3% of high-value users (ARPU > 200 CNY) while driving an 8% complaint rate among low-value users (ARPU < 50 CNY)
  • Churn: 3 million users churn per year, 20% of them high-value, yet the only detection signal is "ARPU declining for 3 consecutive months", which misses 40% of churners
  • Wasted spend: annual marketing cost exceeds 200 million CNY, with 80% of the budget going to low-yield segments (e.g., no-contract, low-activity users)

Algorithm team: data cleaning, feature engineering (derivation/standardization), K-Means model training (K-Means++), feature storage (Feast), model registration (MLflow)

  • Languages: Python 3.9 (feature engineering / model training), Scala 2.13 (Spark data preprocessing)
  • Data processing: Pandas 2.0 (small data), PySpark 3.4 (large data), NumPy 1.24
  • Feature engineering: Scikit-learn 1.2 (standardization/PCA), Feature-engine 1.6 (feature derivation)
  • Clustering: Scikit-learn KMeans, Yellowbrick 1.5 (silhouette-score visualization)
  • Feature store: Feast 0.34 (entity: user ID; features: ARPU, data-usage rate, contract status, etc.)
  • Experiment tracking: MLflow 2.8 (logs K, silhouette score, model), Weights & Biases (cluster-distribution visualization)
  • Version control: git@github.com:telecom/algorithm-user-clustering.git

Business team: API gateway, user-segmentation service (calls the model + feature service), operations-platform integration, segmentation-effect monitoring

  • Languages: Go 1.20 (high-performance APIs), Java 17 (operations-platform integration)
  • Service frameworks: FastAPI 0.104 (lightweight Python API), gRPC (cross-language feature-service calls)
  • Service governance: Kong 3.4 (API gateway), Consul 1.16 (service discovery)
  • Monitoring: Prometheus 2.47 (metrics collection), Grafana 10.2 (segmentation-effect dashboards)
  • Version control: git@github.com:telecom/business-user-clustering.git

Infrastructure team: K8s cluster, MinIO storage, CI/CD toolchain

Algorithm Team

text
algorithm-user-clustering/
├── data_processing/                 # Data cleaning (Python/Pandas)
│   ├── data_cleaning.py             # Cleans CRM/billing/behavior data (full code below)
│   └── requirements.txt             # Deps: pandas, pyarrow, pyspark
├── feature_engineering/             # Feature engineering (modular)
│   ├── feature_derivation.py        # Feature derivation (code below)
│   ├── feature_scaler.py            # Feature standardization (code below)
│   ├── generate_feature_data.py     # Feature-data generation (documents feature_path/label_path)
│   └── requirements.txt             # Deps: scikit-learn, feature-engine
├── model_training/                  # Model training (K-Means)
│   ├── train_kmeans.py              # K-Means++ training (K selection / silhouette evaluation)
│   ├── evaluate_clustering.py       # Evaluation (silhouette / PSI / cluster distribution)
│   └── kmeans_params.yaml           # Tuning record (K=5, init='k-means++')
├── feature_store/                   # Feast feature store
│   ├── feature_repo/                # Feast feature repository
│   │   ├── features.py              # Defines the entity (user_id), feature views, online/offline features
│   │   └── feature_store.yaml       # Feast config (online: Redis, offline: Parquet)
│   └── deploy_feast.sh              # Deploys Feast to K8s
├── mlflow_tracking/                 # MLflow experiment tracking
│   ├── run_kmeans_experiment.py     # Logs K / silhouette score / model
│   └── runs/                        # Archived experiment runs
└── README.md                        # Feature dictionary / model I/O documentation

Pipeline: raw data → cleaning → derivation → standardization → feature matrix

Data Preparation and Feature Transformation

(1) Raw data structure (structured data from multiple systems)

① CRM system (user_profile, Hive table)
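A sample record, with fields inferred from the data_cleaning.py code below (values are hypothetical, shown as JSON for brevity):

json
// User base profile (fields inferred from the cleaning code; values hypothetical)
{ "user_id": "U1001", "age": 32, "arpu": 199.0, "contract_type": 2, "join_date": "2021-06-15" }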

② Billing system (billing_detail, Kafka topic)

json
// Monthly consumption detail (data / voice / SMS)
{ "user_id": "U1001", "month": "2023-10", "data_usage_gb": 28.5, "voice_minutes": 420, "sms_count": 15, "total_fee": 199.0 }

③ User behavior logs (user_behavior, collected via Flume)
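A sample record, again with fields inferred from data_cleaning.py (values are hypothetical):

json
// Monthly behavior aggregate (fields inferred from the cleaning code; values hypothetical)
{ "user_id": "U1001", "month": "2023-10", "app_active_days": 22, "customer_service_cnt": 1 }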

(2) Data cleaning (full code, owned by the algorithm team)

Handle missing values, outliers, and duplicates; output clean structured data.

Code file: data_processing/data_cleaning.py

python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def clean_crm_data(crm_path: str) -> pd.DataFrame:
    """Clean CRM base data: missing values, outliers, duplicates."""
    # Read raw data (Parquet, stored in the data lake)
    df = pd.read_parquet(crm_path)
    logger.info(f"Raw CRM rows: {len(df)}, missing values:\n{df.isnull().sum()}")

    # 1. Missing values
    # Missing arpu: fill with the mean of the same age group
    age_group_mean = df.groupby("age")["arpu"].transform("mean")
    df["arpu"] = df["arpu"].fillna(age_group_mean)
    # Extreme cases (e.g., age also missing): fill with the global mean
    global_arpu_mean = df["arpu"].mean()
    df["arpu"] = df["arpu"].fillna(global_arpu_mean)

    # 2. Outliers (IQR rule for arpu, enum check for contract_type)
    # arpu outliers: below Q1 - 1.5*IQR or above Q3 + 1.5*IQR
    Q1, Q3 = df["arpu"].quantile([0.25, 0.75])  # 25th and 75th percentiles
    IQR = Q3 - Q1                               # interquartile range
    lower_bound, upper_bound = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
    df["arpu"] = df["arpu"].clip(lower_bound, upper_bound)  # winsorize outliers
    # contract_type: keep only 0/1/2, map anything else to 0 (no contract)
    df["contract_type"] = df["contract_type"].apply(lambda x: x if x in [0, 1, 2] else 0)

    # 3. Deduplicate by user_id, keeping the record with the latest join_date
    df = df.sort_values("join_date").drop_duplicates(subset="user_id", keep="last")
    logger.info(f"Cleaned CRM rows: {len(df)}, missing values handled")
    return df

def clean_billing_data(billing_path: str) -> pd.DataFrame:
    """Clean billing data: aggregate monthly consumption, handle outliers."""
    df = pd.read_parquet(billing_path)

    # Aggregate monthly consumption (sum multiple records per user per month)
    df_agg = df.groupby(["user_id", "month"]).agg({
        "data_usage_gb": "sum", "voice_minutes": "sum",
        "sms_count": "sum", "total_fee": "sum"
    }).reset_index()
    # Outliers: data_usage_gb > 1000 GB (likely IoT SIM cards, exclude)
    df_agg = df_agg[df_agg["data_usage_gb"] <= 1000]
    logger.info(f"Cleaned billing rows: {len(df_agg)} (after monthly aggregation)")
    return df_agg

def clean_behavior_data(behavior_path: str) -> pd.DataFrame:
    """Clean user behavior logs: handle extreme values."""
    df = pd.read_parquet(behavior_path)
    # app_active_days > 31 is impossible -> cap at 31;
    # customer_service_cnt > 20 (frequent complaints) -> flag
    df["app_active_days"] = df["app_active_days"].clip(0, 31)
    df["high_cs_cnt"] = df["customer_service_cnt"].apply(lambda x: 1 if x > 20 else 0)
    logger.info(f"Cleaned behavior rows: {len(df)}, extreme values handled")
    return df

if __name__ == "__main__":
    # Path configuration (data-lake locations)
    crm_path = "s3://telecom-data-lake/raw/crm/user_profile.parquet"
    billing_path = "s3://telecom-data-lake/raw/billing/detail.parquet"
    behavior_path = "s3://telecom-data-lake/raw/behavior/log.parquet"

    # Run cleaning
    cleaned_crm = clean_crm_data(crm_path)
    cleaned_billing = clean_billing_data(billing_path)
    cleaned_behavior = clean_behavior_data(behavior_path)

    # Save cleaned data (consumed by feature engineering)
    cleaned_crm.to_parquet("s3://telecom-data-lake/cleaned/crm_cleaned.parquet", index=False)
    cleaned_billing.to_parquet("s3://telecom-data-lake/cleaned/billing_cleaned.parquet", index=False)
    cleaned_behavior.to_parquet("s3://telecom-data-lake/cleaned/behavior_cleaned.parquet", index=False)
    logger.info("Data cleaning finished; results saved under the cleaned/ directory")

(3) Feature engineering and feature-data generation (full code; feature_path/label_path made explicit)

  • The algorithm team writes the feature matrix to Feast (online/offline); feature_path points to the feature-matrix file (Parquet)
  • label_path points to the user-ID-to-cluster mapping file (CSV); as this is unsupervised learning there is no ground-truth label, so the "label" here is the cluster ID

① Feature derivation (feature_derivation.py)

python
import logging

import pandas as pd

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def derive_features(cleaned_crm: pd.DataFrame, cleaned_billing: pd.DataFrame,
                    cleaned_behavior: pd.DataFrame) -> pd.DataFrame:
    """Feature derivation: construct business-driven features."""
    # 1. Merge multi-source data (join on user_id; take the latest month)
    billing_recent = cleaned_billing[cleaned_billing["month"] == "2023-10"]
    behavior_recent = cleaned_behavior[cleaned_behavior["month"] == "2023-10"]
    merged_df = (cleaned_crm
                 .merge(billing_recent, on="user_id", how="left")
                 .merge(behavior_recent, on="user_id", how="left"))

    # 2. Base features (taken directly from the cleaned data)
    base_features = merged_df[[
        "user_id", "arpu", "age", "contract_type", "data_usage_gb",
        "voice_minutes", "app_active_days", "high_cs_cnt"
    ]].copy()

    # 3. Derived features (domain-knowledge driven)
    # Data-usage rate: assume a 30 GB plan; data_usage_gb / 30, capped at 1
    base_features["data_usage_rate"] = (base_features["data_usage_gb"] / 30).clip(0, 1)
    # Voice-cost ratio: assume 0.1 CNY/minute; voice_minutes * 0.1 / arpu
    base_features["voice_cost_ratio"] = (base_features["voice_minutes"] * 0.1) / base_features["arpu"]
    base_features["voice_cost_ratio"] = base_features["voice_cost_ratio"].fillna(0)  # 0 when arpu is 0
    # Activity score: app_active_days/31 * 0.6 + (1 - high_cs_cnt) * 0.4 (higher = more active)
    base_features["activity_score"] = (base_features["app_active_days"] / 31) * 0.6 \
        + (1 - base_features["high_cs_cnt"]) * 0.4
    # Contract value: 2-year contract -> 1.0, 1-year -> 0.6, no contract -> 0.2
    contract_map = {2: 1.0, 1: 0.6, 0: 0.2}
    base_features["contract_value"] = base_features["contract_type"].map(contract_map)

    # 4. Feature selection (final model inputs)
    final_features = base_features[[
        "user_id", "arpu", "data_usage_gb", "voice_minutes", "app_active_days",
        "data_usage_rate", "voice_cost_ratio", "activity_score", "contract_value"
    ]].dropna()  # drop rows with remaining missing values
    logger.info(f"Feature derivation done: {len(final_features)} users, "
                f"{final_features.shape[1] - 1} features")
    return final_features

② Feature standardization (feature_scaler.py)

python
import logging

import joblib
import pandas as pd
from sklearn.preprocessing import StandardScaler

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def scale_features(feature_df: pd.DataFrame, scaler_path: str) -> pd.DataFrame:
    """Feature standardization (required for K-Means): zero mean, unit variance."""
    # Separate the user ID from the feature columns
    user_ids = feature_df["user_id"]
    feature_cols = [col for col in feature_df.columns if col != "user_id"]
    # Fit on the DataFrame (not .values) so feature_names_in_ is recorded for online use
    X = feature_df[feature_cols]

    # Fit the scaler and persist it for online prediction
    scaler = StandardScaler()
    X_scaled = scaler.fit_transform(X)
    joblib.dump(scaler, scaler_path)  # save to the algorithm team's model directory
    logger.info(f"Standardization done; scaler saved to {scaler_path}")

    # Build the standardized feature DataFrame
    scaled_df = pd.DataFrame(X_scaled, columns=feature_cols)
    scaled_df.insert(0, "user_id", user_ids)  # restore user_id
    return scaled_df

③ Feature-data generation (generate_feature_data.py; feature_path/label_path made explicit)

python
import logging

import pandas as pd
from feature_derivation import derive_features
from feature_scaler import scale_features

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def generate_feature_data(cleaned_crm_path: str, cleaned_billing_path: str,
                          cleaned_behavior_path: str) -> tuple:
    """Generate the feature matrix (feature_path) and cluster-result file (label_path)."""
    # Load cleaned data
    cleaned_crm = pd.read_parquet(cleaned_crm_path)
    cleaned_billing = pd.read_parquet(cleaned_billing_path)
    cleaned_behavior = pd.read_parquet(cleaned_behavior_path)

    # 1. Feature derivation
    feature_df = derive_features(cleaned_crm, cleaned_billing, cleaned_behavior)

    # 2. Feature standardization
    scaled_feature_df = scale_features(feature_df, scaler_path="model/scaler.pkl")

    # 3. Define file paths (algorithm team's storage locations)
    # feature_path: standardized feature matrix (Parquet), used for model training
    feature_path = "s3://telecom-data-lake/processed/clustering_features.parquet"
    # label_path: cluster-result file (CSV) with user_id and cluster ID
    # (written after training; only the path is defined here)
    label_path = "s3://telecom-data-lake/processed/clustering_labels.csv"

    # 4. Save the feature matrix (the file feature_path points to)
    scaled_feature_df.to_parquet(feature_path, index=False)
    logger.info(f"Feature matrix saved to {feature_path}; sample:\n{scaled_feature_df.head(2)}")

    # Return paths and the matrix (consumed by model training)
    return feature_path, label_path, scaled_feature_df

Feature-data comparison (raw data → after cleaning → after feature engineering), summarized below:
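text
Stage                     | Columns / content                                                   | Key transformations
Raw data                  | CRM profile, monthly billing detail, behavior logs (multi-system)  | missing values, outliers, duplicates present
After cleaning            | same fields, one row per user / per user-month                     | arpu filled + IQR-clipped, contract_type in {0,1,2}, usage capped, deduplicated
After feature engineering | user_id + 8 features (arpu, data_usage_gb, voice_minutes,          | 4 derived features added, all features standardized
                          | app_active_days, data_usage_rate, voice_cost_ratio,                | to zero mean / unit variance
                          | activity_score, contract_value)                                    |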

Code Structure

(1) Algorithm team: K-Means model training (model_training/train_kmeans.py)

python
import logging

import joblib
import mlflow
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

from feature_engineering.generate_feature_data import generate_feature_data

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

def load_features(feature_path: str) -> tuple:
    """Load the feature_path file produced by feature engineering (standardized matrix)."""
    df = pd.read_parquet(feature_path)
    user_ids = df["user_id"].values
    feature_cols = [col for col in df.columns if col != "user_id"]
    X = df[feature_cols].values
    logger.info(f"Loaded feature matrix: {X.shape[0]} users, {X.shape[1]} features")
    return X, user_ids, feature_cols

def select_optimal_k(X: np.ndarray, k_range: range = range(2, 11)) -> int:
    """Select K via the silhouette score (higher is better; take the peak)."""
    silhouette_scores = []
    for k in k_range:
        kmeans = KMeans(n_clusters=k, init="k-means++", random_state=42, n_init=10)
        labels = kmeans.fit_predict(X)
        score = silhouette_score(X, labels)
        silhouette_scores.append(score)
        logger.info(f"K={k}, silhouette={score:.4f}")

    # Take the K with the highest silhouette score
    # (with multiple peaks, pick by business judgment, e.g. K=5)
    optimal_k = k_range[np.argmax(silhouette_scores)]
    logger.info(f"Optimal K: {optimal_k} (silhouette={max(silhouette_scores):.4f})")
    return optimal_k

def train_kmeans_plusplus(X: np.ndarray, k: int) -> KMeans:
    """Train a K-Means++ model (optimized initial centroids)."""
    kmeans = KMeans(
        n_clusters=k,
        init="k-means++",  # smarter centroid seeding than random init
        n_init=10,         # multiple initializations, keep the best
        max_iter=300,
        random_state=42,
        verbose=1
    )
    kmeans.fit(X)
    logger.info(f"K-Means++ training done, clusters: {k}, inertia (SSE): {kmeans.inertia_:.2f}")
    return kmeans

def save_cluster_labels(user_ids: np.ndarray, labels: np.ndarray, label_path: str):
    """Save clustering results to label_path (CSV, consumed by the business team)."""
    df = pd.DataFrame({"user_id": user_ids, "cluster_id": labels})
    df.to_csv(label_path, index=False)
    logger.info(f"Cluster labels saved to {label_path}, sample:\n{df.head(2)}")

if __name__ == "__main__":
    # 1. Generate feature data (calls the feature-engineering module; gets feature_path and label_path)
    feature_path, label_path, _ = generate_feature_data(
        cleaned_crm_path="s3://telecom-data-lake/cleaned/crm_cleaned.parquet",
        cleaned_billing_path="s3://telecom-data-lake/cleaned/billing_cleaned.parquet",
        cleaned_behavior_path="s3://telecom-data-lake/cleaned/behavior_cleaned.parquet"
    )

    # 2. Load the feature matrix (the file feature_path points to)
    X, user_ids, feature_cols = load_features(feature_path)

    # 3. Select K (silhouette method; business expects ~5 segments, so narrow the range)
    optimal_k = select_optimal_k(X, k_range=range(3, 8))

    # 4. Train the K-Means++ model
    kmeans_model = train_kmeans_plusplus(X, optimal_k)

    # 5. Save clustering results to label_path (CSV)
    save_cluster_labels(user_ids, kmeans_model.labels_, label_path)

    # 6. Save the model (consumed by the business team)
    joblib.dump(kmeans_model, "model/kmeans_clustering_model.pkl")

    # 7. Log the MLflow experiment
    with mlflow.start_run(run_name="kmeans_user_clustering"):
        mlflow.log_param("k", optimal_k)
        mlflow.log_param("init", "k-means++")
        mlflow.log_metric("silhouette_score", silhouette_score(X, kmeans_model.labels_))
        mlflow.log_metric("inertia", kmeans_model.inertia_)
        mlflow.sklearn.log_model(kmeans_model, "kmeans_model")
    logger.info("Training finished; model saved to model/kmeans_clustering_model.pkl")

(2) Algorithm team: Feast feature store (feature_store/feature_repo/features.py)

python
from datetime import timedelta

from feast import Entity, FeatureView, Field, FileSource, ValueType
from feast.types import Float32

# 1. Entity: user ID (unique identifier)
user_entity = Entity(
    name="user_id",
    join_keys=["user_id"],
    value_type=ValueType.STRING,
    description="Unique telecom-user identifier"
)

# 2. Offline source (the feature-matrix Parquet file, used for training)
offline_source = FileSource(
    path="s3://telecom-data-lake/processed/clustering_features.parquet",  # the feature_path file
    timestamp_field="event_time"  # synthetic timestamp column
)

# 3. Feature view (entity + features + source)
# The online store (Redis in production) is configured in feature_store.yaml;
# Feast serves these same features online after materialization.
user_features_view = FeatureView(
    name="user_clustering_features",
    entities=[user_entity],
    ttl=timedelta(days=30),  # keep features for 30 days
    schema=[  # must match the feature-engineering output columns
        Field(name="arpu", dtype=Float32),
        Field(name="data_usage_gb", dtype=Float32),
        Field(name="voice_minutes", dtype=Float32),
        Field(name="app_active_days", dtype=Float32),
        Field(name="data_usage_rate", dtype=Float32),
        Field(name="voice_cost_ratio", dtype=Float32),
        Field(name="activity_score", dtype=Float32),
        Field(name="contract_value", dtype=Float32)
    ],
    source=offline_source,
    online=True
)
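After `feast apply` registers this repo, features are materialized into the online store and fetched at serving time. A minimal sketch, assuming the repo path above and a configured online store:

python
from datetime import datetime

from feast import FeatureStore

store = FeatureStore(repo_path="feature_store/feature_repo")

# Push the latest offline features into the online store (Redis in production)
store.materialize_incremental(end_date=datetime.utcnow())

# Fetch online features for one user, exactly as the segmentation service does
features = store.get_online_features(
    features=[
        "user_clustering_features:arpu",
        "user_clustering_features:activity_score",
    ],
    entity_rows=[{"user_id": "U1001"}],
).to_dict()
print(features)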

(3) Business team: user-segmentation service (calls feature service + model)

python
import logging

import joblib
import pandas as pd
from feast import FeatureStore
from pydantic import BaseModel

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

class UserRequest(BaseModel):
    user_id: str
    arpu: float
    data_usage_gb: float
    voice_minutes: float
    app_active_days: int
    contract_type: int

class ClusterResponse(BaseModel):
    user_id: str
    cluster_id: int
    cluster_name: str          # e.g. "high-value stable user"
    feature_importance: dict   # key feature contributions

class UserCluster:
    def __init__(self, model_path: str, feast_repo_path: str, cluster_names: dict):
        self.model = joblib.load(model_path)  # the algorithm team's K-Means model
        self.feature_store = FeatureStore(repo_path=feast_repo_path)  # Feast connection
        self.cluster_names = cluster_names    # cluster-name map (e.g. 0: "high-value stable user")
        self.scaler = joblib.load("model/scaler.pkl")  # the algorithm team's scaler

    def get_features_from_feast(self, user_id: str) -> pd.DataFrame:
        """Fetch user features from Feast (online features maintained by the algorithm team)."""
        feature_refs = [f"user_clustering_features:{col}" for col in self.scaler.feature_names_in_]
        features = self.feature_store.get_online_features(
            entity_rows=[{"user_id": user_id}],
            features=feature_refs
        ).to_dict()
        return pd.DataFrame(features)

    def predict_cluster(self, request: UserRequest) -> ClusterResponse:
        # 1. Build the feature vector (request data for new users, Feast for existing users)
        if request.user_id.startswith("new_"):  # new user
            feature_dict = {
                "arpu": request.arpu, "data_usage_gb": request.data_usage_gb,
                "voice_minutes": request.voice_minutes, "app_active_days": request.app_active_days,
                "data_usage_rate": min(request.data_usage_gb / 30, 1.0),  # derived feature
                "voice_cost_ratio": (request.voice_minutes * 0.1) / request.arpu if request.arpu > 0 else 0,
                # proxy: complaint count is unknown for new users, so use contract type instead
                "activity_score": (request.app_active_days / 31) * 0.6
                                  + (1 if request.contract_type == 2 else 0.5) * 0.4,
                "contract_value": {2: 1.0, 1: 0.6, 0: 0.2}[request.contract_type]
            }
            X = pd.DataFrame([feature_dict])[self.scaler.feature_names_in_]
            X_scaled = self.scaler.transform(X)
        else:  # existing user: fetch from Feast (already standardized upstream)
            feat_df = self.get_features_from_feast(request.user_id)
            X_scaled = feat_df[self.scaler.feature_names_in_].values

        # 2. Predict the cluster ID
        cluster_id = int(self.model.predict(X_scaled)[0])
        # 3. Build the response
        return ClusterResponse(
            user_id=request.user_id,
            cluster_id=cluster_id,
            cluster_name=self.cluster_names[cluster_id],
            feature_importance={"arpu": 0.3, "contract_value": 0.25}  # illustrative values
        )
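The business stack names FastAPI 0.104 for the lightweight API layer; a minimal sketch wiring UserCluster into an HTTP endpoint (the cluster names and paths are illustrative assumptions, and the classes above are assumed importable):

python
from fastapi import FastAPI

app = FastAPI(title="user-segmentation-service")

# Illustrative cluster-name map; the real one comes from the operations team
CLUSTER_NAMES = {0: "high-value stable", 1: "high-potential", 2: "average",
                 3: "low-activity", 4: "churn-risk"}

clusterer = UserCluster(
    model_path="model/kmeans_clustering_model.pkl",
    feast_repo_path="feature_store/feature_repo",
    cluster_names=CLUSTER_NAMES
)

@app.post("/v1/cluster", response_model=ClusterResponse)
def cluster_user(request: UserRequest) -> ClusterResponse:
    # Delegate to the segmentation logic above; Kong fronts this service in production
    return clusterer.predict_cluster(request)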

Model performance: silhouette score ≥ 0.6, PSI < 0.1 (monthly data drift), within-cluster feature standard deviation < 0.5 (compactness)
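The PSI target above measures month-over-month drift of the cluster distribution via the standard formula PSI = sum_i (p_i - q_i) * ln(p_i / q_i). A minimal standalone sketch (illustrative; the team's actual implementation lives in evaluate_clustering.py):

python
import numpy as np

def population_stability_index(expected: np.ndarray, actual: np.ndarray,
                               n_clusters: int, eps: float = 1e-6) -> float:
    """PSI between last month's and this month's cluster-ID distributions."""
    p = np.bincount(expected, minlength=n_clusters) / len(expected)
    q = np.bincount(actual, minlength=n_clusters) / len(actual)
    p, q = p + eps, q + eps  # avoid log(0) for empty clusters
    return float(np.sum((p - q) * np.log(p / q)))

# Usage: PSI < 0.1 means the segmentation is stable month over month
# psi = population_stability_index(last_month_labels, this_month_labels, n_clusters=5)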

Business value: high-value-user retention up 20%, marketing conversion rate raised to 15%, annual cost savings of 50 million CNY

Service performance: API response time ≤ 100 ms (P99), sustained concurrency ≥ 5000 QPS
