Implementing the ABC_manage_channel Logic in PySpark

Problem

We need to derive an "ABC_manage_channel" column whose value updates to the current channel when a customer places two consecutive orders on the same channel, and otherwise keeps its previous value. The rules:

  • The initial value is the channel of the customer's first order
  • If two consecutive orders are on the same channel, the value updates to that channel
  • Otherwise the previous value is carried forward
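Before bringing in Spark, the rule is easy to sanity-check on a plain Python list. The sketch below applies the three rules to customer 1's channel sequence from the test data; the expected result was worked out by hand from the rules above:

```python
def abc_manage_channel(channels):
    """Apply the update rule to a date-ordered list of channels."""
    result = []
    current = None
    prev = None
    for i, ch in enumerate(channels):
        if i == 0 or ch == prev:
            # first order, or two consecutive orders on the same channel
            current = ch
        # otherwise: carry the previous value forward
        result.append(current)
        prev = ch
    return result

# Customer 1's orders in date order (from the test data below)
channels = ["TMALL", "TMALL", "TMALL", "douyin", "JD", "JD", "douyin", "douyin"]
print(abc_manage_channel(channels))
# → ['TMALL', 'TMALL', 'TMALL', 'TMALL', 'TMALL', 'JD', 'JD', 'douyin']
```

Note that the single douyin order on 2024-11-25 and the first douyin order on 2025-03-02 do not change the value; only the repeated douyin order on 2025-03-27 does.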

Data preparation

First, create the orders table and insert the test data:

```sql
CREATE OR REPLACE TABLE orders (
    customerid INTEGER,
    channel VARCHAR(20),
    order_date DATE
);

INSERT INTO orders (customerid, channel, order_date) VALUES
(1, 'TMALL', '2024-11-01'),
(1, 'TMALL', '2024-11-02'),
(1, 'TMALL', '2024-11-03'),
(1, 'douyin', '2024-11-25'),
(1, 'JD', '2025-01-13'),
(1, 'JD', '2025-01-14'),
(1, 'douyin', '2025-03-02'),
(1, 'douyin', '2025-03-27'),
(3, 'JD', '2024-04-23'),
(4, 'JD', '2025-02-15'),
(5, 'JD', '2024-08-30'),
(6, 'douyin', '2024-10-05'),
(7, 'JD', '2024-05-29'),
(7, 'douyin', '2024-09-15'),
(7, 'Wholesale', '2024-12-22'),
(7, 'JD', '2025-03-19'),
(8, 'douyin', '2024-08-01'),
(8, 'douyin', '2024-08-07'),
(8, 'douyin', '2024-11-15'),
(9, 'JD', '2025-03-19'),
(10, 'douyin', '2024-07-30'),
(10, 'douyin', '2024-12-27'),
(10, 'douyin', '2025-03-21'),
(10, 'douyin', '2025-03-23');
```

Solution

Method 1: Spark SQL (with a UDF)

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType, StructType, StructField, DateType

# Initialize the Spark session
spark = SparkSession.builder.appName("ABCManageChannel").getOrCreate()

# Create the table and insert the test data. spark.sql() executes one
# statement per call, so the statements are issued separately.
# (CREATE OR REPLACE TABLE requires a v2 catalog; DROP + CREATE is portable.)
spark.sql("DROP TABLE IF EXISTS orders")
spark.sql("""
CREATE TABLE orders (
    customerid INTEGER,
    channel VARCHAR(20),
    order_date DATE
) USING parquet
""")

spark.sql("""
INSERT INTO orders VALUES
(1, 'TMALL', '2024-11-01'),
(1, 'TMALL', '2024-11-02'),
(1, 'TMALL', '2024-11-03'),
(1, 'douyin', '2024-11-25'),
(1, 'JD', '2025-01-13'),
(1, 'JD', '2025-01-14'),
(1, 'douyin', '2025-03-02'),
(1, 'douyin', '2025-03-27'),
(3, 'JD', '2024-04-23'),
(4, 'JD', '2025-02-15'),
(5, 'JD', '2024-08-30'),
(6, 'douyin', '2024-10-05'),
(7, 'JD', '2024-05-29'),
(7, 'douyin', '2024-09-15'),
(7, 'Wholesale', '2024-12-22'),
(7, 'JD', '2025-03-19'),
(8, 'douyin', '2024-08-01'),
(8, 'douyin', '2024-08-07'),
(8, 'douyin', '2024-11-15'),
(9, 'JD', '2025-03-19'),
(10, 'douyin', '2024-07-30'),
(10, 'douyin', '2024-12-27'),
(10, 'douyin', '2025-03-21'),
(10, 'douyin', '2025-03-23')
""")

# Read the table back as a DataFrame
orders_df = spark.table("orders")

# UDF: walk a customer's date-ordered channel sequence and compute the
# ABC_manage_channel value for each position
def calculate_abc(channels):
    abc = []
    prev_channel = None
    current_abc = None
    for idx, c in enumerate(channels):
        if idx == 0:
            current_abc = c
        else:
            if c == prev_channel:
                current_abc = c
            # otherwise keep the previous current_abc
        abc.append(current_abc)
        prev_channel = c
    return abc

# Register the UDF under a name so it can be called from Spark SQL
spark.udf.register("udf_calculate_abc", calculate_abc, ArrayType(StringType()))

# Build the result in Spark SQL: collect each customer's orders sorted by
# date, run the UDF over the channel sequence, then explode back into rows
result_sql = spark.sql("""
WITH grouped AS (
    SELECT customerid,
           SORT_ARRAY(COLLECT_LIST(STRUCT(order_date, channel))) AS sorted_orders
    FROM orders
    GROUP BY customerid
),
with_abc AS (
    SELECT customerid,
           sorted_orders.order_date AS order_dates,
           sorted_orders.channel AS channels,
           udf_calculate_abc(sorted_orders.channel) AS abc_list
    FROM grouped
)
SELECT customerid,
       data.order_dates AS order_date,
       data.channels AS channel,
       data.abc_list AS ABC_manage_channel
FROM (
    SELECT customerid,
           EXPLODE(ARRAYS_ZIP(order_dates, channels, abc_list)) AS data
    FROM with_abc
)
""")

result_sql.show()
```

Method 2: DataFrame API (no Spark SQL)

```python
# Collect each customer's orders into one list; the UDF below sorts the
# list by order_date itself, so no window ordering is needed here
grouped_df = orders_df.groupBy("customerid") \
    .agg(F.collect_list(F.struct("order_date", "channel")).alias("orders"))

# UDF: sort a customer's orders by date and compute the ABC value for each
schema = ArrayType(StructType([
    StructField("order_date", DateType()),
    StructField("channel", StringType()),
    StructField("ABC_manage_channel", StringType())
]))

def process_orders(orders):
    abc_list = []
    prev_channel = None
    current_abc = None
    sorted_orders = sorted(orders, key=lambda x: x.order_date)
    for idx, order in enumerate(sorted_orders):
        if idx == 0:
            current_abc = order.channel
        else:
            if order.channel == prev_channel:
                current_abc = order.channel
        abc_list.append((order.order_date, order.channel, current_abc))
        prev_channel = order.channel
    return abc_list

udf_process_orders = F.udf(process_orders, schema)

# Apply the UDF and explode the per-customer result back into rows,
# keeping customerid in the output
result_df = grouped_df.withColumn("processed", udf_process_orders("orders")) \
    .select("customerid", F.explode("processed").alias("data")) \
    .select(
        "customerid",
        F.col("data.order_date").alias("order_date"),
        F.col("data.channel").alias("channel"),
        F.col("data.ABC_manage_channel").alias("ABC_manage_channel")
    )

result_df.show()
```

Explanation

  • Method 1 uses Spark SQL with a registered UDF: each customer's orders are collected and sorted by date, the UDF computes the ABC value for each position in the channel sequence, and ARRAYS_ZIP/EXPLODE turn the parallel arrays back into rows.
  • Method 2 uses the DataFrame API: groupBy/collect_list gathers each customer's orders, the UDF sorts and processes each group, and explode flattens the result back into one row per order.
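Both methods lean on a Python UDF, but the rule can also be expressed with window functions alone (e.g. Spark SQL's LAG plus LAST_VALUE with IGNORE NULLS, or F.lag and F.last(..., ignorenulls=True) in the DataFrame API): mark a row with its channel when it is the first order or repeats the previous channel, leave it NULL otherwise, then forward-fill the last non-null value. This is a sketch of that equivalence in plain Python, assuming the rows are already ordered by order_date within a customer:

```python
def abc_via_forward_fill(channels):
    # Step 1 (the LAG step): keep the channel on the first row and on rows
    # that repeat the previous channel; otherwise emit None
    flagged = [
        ch if i == 0 or ch == channels[i - 1] else None
        for i, ch in enumerate(channels)
    ]
    # Step 2 (the LAST_VALUE ... IGNORE NULLS step): forward-fill the
    # last non-None value over the sequence
    result, last = [], None
    for v in flagged:
        if v is not None:
            last = v
        result.append(last)
    return result

channels = ["TMALL", "TMALL", "TMALL", "douyin", "JD", "JD", "douyin", "douyin"]
print(abc_via_forward_fill(channels))
# → ['TMALL', 'TMALL', 'TMALL', 'TMALL', 'TMALL', 'JD', 'JD', 'douyin']
```

A windowed version avoids serializing rows through Python, which usually performs better on large tables; the UDF versions above are easier to read and debug.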