Food-Delivery Free-Meal ("霸王餐") User Profile Tag System: Hybrid Computation with Spark SQL Batch and Kafka Stream Processing

1. Business Background and System Goals

The "霸王餐" (free-meal) channel of the 吃喝不愁 APP generates order events at the scale of tens of millions per day. User profile tags have to be refreshed within minutes to support three scenarios: personalized coupon issuance ("a thousand faces for a thousand users"), risk control against coupon abuse, and merchant subsidy settlement. Tags fall into two categories:

  1. Offline T+1 full tags: spending power, category preference, sensitive price band.
  2. Real-time incremental tags: whether the user is currently in an order burst, whether an order looks anomalous, whether the daily free-order cap has been reached.

Technology choices: Spark SQL runs hourly batch jobs to backfill history, Kafka + Flink handles second-level stream processing to correct the latest state, and the merged results of both paths land in HBase, which serves millisecond-level queries.


2. Overall Architecture

                        ┌-------------------┐
                        │  DB order records │
                        └---------┬---------┘
                                  │ Canal incremental sync
                                  ▼
┌--------------------┐   Kafka    ┌----------------┐  Flink streaming agg  ┌----------------┐
│ Gateway event logs │---►order_raw---►Kafka Topic │---------------------► │ tag_rt_stream  │
└--------------------┘            └----------------┘                       └-------┬--------┘
        ▲                                                                          │
        │                            ┌------------------┐                          │
        │  Daily full sync at 00:00  │ Hive wide table   │  Spark SQL batch job     ▼
        └----------------------------┤  dws_order_d      │----► tag_offline_tbl   HBase
                                     └------------------┘

3. Offline Batch Processing: Building Base Tags with Spark SQL

3.1 Table Creation and Partitioning Strategy

sql
CREATE EXTERNAL TABLE dws_order_d (
    user_id              STRING,
    merchant_id          STRING,
    order_amt            DECIMAL(10,2),
    discount_amt         DECIMAL(10,2),
    cate                 STRING,
    pay_time             TIMESTAMP
) PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 'hdfs://cluster/dws/dws_order_d';
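
The external table only declares the layout; each day's partition still has to be registered after the 00:00 full sync lands its files. A minimal sketch, assuming the HDFS layout above and the ${bizdate} scheduling variable used in the queries below:

sql
-- Hypothetical daily step: register the new dt partition once the full sync finishes.
ALTER TABLE dws_order_d ADD IF NOT EXISTS
    PARTITION (dt='${bizdate}')
    LOCATION 'hdfs://cluster/dws/dws_order_d/dt=${bizdate}';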

3.2 Spending-Power Level Tag

sql
INSERT OVERWRITE TABLE tag_consume_level PARTITION(dt='${bizdate}')
SELECT user_id,
       CASE WHEN month_amt>=1500 THEN 'H'
            WHEN month_amt>=600  THEN 'M'
            ELSE 'L' END AS consume_level
FROM (
    SELECT user_id, SUM(order_amt) AS month_amt
    FROM dws_order_d
    WHERE dt BETWEEN add_months('${bizdate}',-1) AND '${bizdate}'
    GROUP BY user_id
) t;
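
The target table tag_consume_level is not shown in the original. A minimal DDL sketch, assuming it mirrors the source table's ORC format and dt partitioning; the location path is hypothetical:

sql
-- Hypothetical result table: one consume_level per user per business date.
CREATE EXTERNAL TABLE IF NOT EXISTS tag_consume_level (
    user_id          STRING,
    consume_level    STRING    -- 'H' / 'M' / 'L'
) PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 'hdfs://cluster/dws/tag_consume_level';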

3.3 Top-3 Category Preference

sql
INSERT OVERWRITE TABLE tag_cate_pref PARTITION(dt='${bizdate}')
SELECT user_id,
       concat_ws(',', collect_list(cate)) AS top3_cate
FROM (
    SELECT user_id, cate,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) rn
    FROM (
        SELECT user_id, cate, COUNT(*) cnt
        FROM dws_order_d
        WHERE dt BETWEEN date_sub('${bizdate}',30) AND '${bizdate}'
        GROUP BY user_id, cate
    ) t1
) t2
WHERE rn<=3
GROUP BY user_id;
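
One caveat: collect_list gives no ordering guarantee, so top3_cate may not list the three categories from most to least frequent. A sketch of an order-preserving variant, assuming Spark 2.4+ for struct sorting and the transform higher-order function:

sql
-- Carry rn into the aggregation, sort the collected structs by rn, then project cate back out.
INSERT OVERWRITE TABLE tag_cate_pref PARTITION(dt='${bizdate}')
SELECT user_id,
       concat_ws(',', transform(sort_array(collect_list(struct(rn, cate))), x -> x.cate)) AS top3_cate
FROM (
    SELECT user_id, cate,
           ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) rn
    FROM (
        SELECT user_id, cate, COUNT(*) cnt
        FROM dws_order_d
        WHERE dt BETWEEN date_sub('${bizdate}',30) AND '${bizdate}'
        GROUP BY user_id, cate
    ) t1
) t2
WHERE rn <= 3
GROUP BY user_id;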

4. Real-Time Stream Processing: Second-Level Corrections with Kafka + Flink

4.1 Order Event POJO

java
package cn.juwatech.food.model;

public class OrderEvent {
    public String orderId;
    public String userId;
    public long   orderAmt;    // order amount, in fen (0.01 CNY)
    public long   discountAmt; // discount amount, in fen
    public long   payTime;     // pay timestamp, epoch milliseconds (used as the event time)
    public String cate;        // category
}

4.2 Flink Consumes Kafka and Computes the Real-Time Free-Order Count

java
package cn.juwatech.food.job;

import cn.juwatech.food.model.OrderEvent;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.Properties;

public class FreeOrderCounterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "kafka001:9092");
        p.setProperty("group.id", "free-order-counter");

        FlinkKafkaConsumer<OrderEvent> source = new FlinkKafkaConsumer<>(
                "order_raw", new OrderEventDes(), p);
        source.assignTimestampsAndWatermarks(
                WatermarkStrategy.<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                                 .withTimestampAssigner((e, t) -> e.payTime));

        DataStream<OrderEvent> orderStream = env.addSource(source);

        // Broadcast config: the per-day free-order cap.
        MapStateDescriptor<String, Integer> broadcastDesc =
                new MapStateDescriptor<>("freeLimit", Types.STRING, Types.INT);
        BroadcastStream<Integer> broadcast = env
                .addSource(new FreeLimitSource())
                .broadcast(broadcastDesc);

        orderStream
                .keyBy(o -> o.userId)
                .connect(broadcast)
                .process(new KeyedBroadcastProcessFunction<String, OrderEvent, Integer, String>() {
                    private static final long serialVersionUID = 1L;

                    // Per-user counter of today's free (fully discounted) orders.
                    private transient ValueState<Integer> cntState;

                    @Override
                    public void open(Configuration parameters) {
                        // Create the state handle once instead of on every element.
                        cntState = getRuntimeContext().getState(
                                new ValueStateDescriptor<>("cnt", Types.INT));
                    }

                    @Override
                    public void processBroadcastElement(Integer value,
                                                        Context ctx,
                                                        Collector<String> out) throws Exception {
                        // The latest cap pushed by FreeLimitSource overwrites the broadcast state.
                        ctx.getBroadcastState(broadcastDesc).put("limit", value);
                    }

                    @Override
                    public void processElement(OrderEvent value,
                                               ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        ReadOnlyBroadcastState<String, Integer> state = ctx.getBroadcastState(broadcastDesc);
                        Integer limit = state.get("limit");
                        if (limit == null) {
                            return; // cap not broadcast yet; skip instead of unboxing a null
                        }
                        // Accumulated per-user count (null before the first free order).
                        Integer cnt = cntState.value();
                        int todayCnt = cnt == null ? 0 : cnt;
                        // A free order is one where the discount covers the full amount.
                        if (todayCnt < limit && value.discountAmt == value.orderAmt) {
                            todayCnt += 1;
                            cntState.update(todayCnt);
                            out.collect(value.userId + "\t" + todayCnt);
                        }
                    }
                })
                // Write f:free_cnt into the unified profile table that the online service queries.
                .addSink(new HBaseSink("user_profile", "free_cnt"));

        env.execute("FreeOrderCounterJob");
    }
}
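
The OrderEventDes deserializer referenced above is not shown in the original. A minimal sketch, assuming the order_raw topic carries JSON-encoded OrderEvent messages and Jackson is on the classpath:

java
package cn.juwatech.food.job;

import cn.juwatech.food.model.OrderEvent;
import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

import java.io.IOException;

// Hypothetical deserializer: maps one JSON-encoded Kafka record onto an OrderEvent.
public class OrderEventDes implements DeserializationSchema<OrderEvent> {
    private static final long serialVersionUID = 1L;
    private transient ObjectMapper mapper;

    @Override
    public OrderEvent deserialize(byte[] message) throws IOException {
        if (mapper == null) {
            mapper = new ObjectMapper();
        }
        return mapper.readValue(message, OrderEvent.class);
    }

    @Override
    public boolean isEndOfStream(OrderEvent nextElement) {
        return false; // unbounded stream
    }

    @Override
    public TypeInformation<OrderEvent> getProducedType() {
        return TypeInformation.of(OrderEvent.class);
    }
}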

5. Hybrid Computation: Merging Results with a Lambda Architecture

5.1 HBase Table Design

  • RowKey = reverse(user_id)
  • Column family f
    • f:consume_level (offline batch result)
    • f:free_cnt (real-time stream result)
    • f:update_time (update timestamp)
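
Section 7 notes that this table is created through Phoenix as a salted table, but the DDL is not shown in the original. A minimal sketch, assuming Phoenix owns the table (for a pre-existing HBase table a CREATE VIEW would be used instead) and that column encoding is disabled so the raw qualifiers written by the Spark and Flink jobs stay readable:

sql
-- Hypothetical Phoenix DDL for the unified profile table.
CREATE TABLE IF NOT EXISTS user_profile (
    pk VARCHAR PRIMARY KEY,            -- reverse(user_id)
    "f"."consume_level" VARCHAR,       -- offline batch result
    "f"."free_cnt"      UNSIGNED_INT,  -- real-time stream result; matches Bytes.toBytes(int) for non-negative counts
    "f"."update_time"   UNSIGNED_LONG  -- update timestamp, epoch millis
) SALT_BUCKETS = 15, COLUMN_ENCODED_BYTES = 0;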

5.2 Utility Class for Writing Spark SQL Results to HBase

java
package cn.juwatech.food.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

public class HBaseBulkWriter {
    // Writes an RDD of Puts into the given HBase table via TableOutputFormat.
    public static void write(JavaRDD<Put> putRDD, String tableNameStr) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(HConstants.ZOOKEEPER_QUORUM, "zk1,zk2,zk3");
        Job job = Job.getInstance(conf);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableNameStr);
        job.setOutputFormatClass(TableOutputFormat.class);

        // TableOutputFormat only looks at the Put value, so the key can stay empty.
        JavaPairRDD<ImmutableBytesWritable, Put> pairRDD = putRDD.mapToPair(
                (PairFunction<Put, ImmutableBytesWritable, Put>) put ->
                        new Tuple2<>(new ImmutableBytesWritable(), put));
        pairRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
    }
}

5.3 Daily Batch Job: Writing consume_level into HBase

java
// Read today's consume_level tags from the Hive result table.
Dataset<Row> levelDS = spark.sql(
        "SELECT user_id,consume_level FROM tag_consume_level WHERE dt='" + bizdate + "'");
// Build one Put per user: RowKey = reverse(user_id), column f:consume_level.
JavaRDD<Put> putRDD = levelDS.toJavaRDD().map(row -> {
    Put put = new Put(Bytes.toBytes(reverse(row.getString(0))));
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("consume_level"),
                  Bytes.toBytes(row.getString(1)));
    return put;
});
HBaseBulkWriter.write(putRDD, "user_profile");

6. Online Serving: Millisecond-Level Queries

Spring Boot + Phoenix JDBC, mapped directly onto HBase:

java
package cn.juwatech.food.api.service;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class UserProfileService {
    @Autowired
    private JdbcTemplate phoenixTemplate;

    public UserProfile getProfile(String userId) {
        String sql = "SELECT consume_level,free_cnt FROM user_profile WHERE pk=?";
        return phoenixTemplate.queryForObject(sql,
                (rs, rowNum) -> new UserProfile(
                        rs.getString("consume_level"),
                        rs.getInt("free_cnt")),
                reverse(userId));
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }
}
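
The UserProfile return type is not defined in the original; a minimal sketch of the DTO it presumably maps to:

java
package cn.juwatech.food.api.service;

// Hypothetical response object carrying the two tags read back from Phoenix.
public class UserProfile {
    private final String consumeLevel; // offline tag: 'H' / 'M' / 'L'
    private final int freeCnt;         // real-time tag: today's free-order count

    public UserProfile(String consumeLevel, int freeCnt) {
        this.consumeLevel = consumeLevel;
        this.freeCnt = freeCnt;
    }

    public String getConsumeLevel() { return consumeLevel; }

    public int getFreeCnt() { return freeCnt; }
}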

7. Performance and Tuning

  1. Spark batch: dws_order_d is bucketed by user_id into 128 buckets within each dt partition and written as ORC with ZSTD compression; 3 TB of data finishes in about 30 minutes.
  2. Flink streaming: parallelism = Kafka partition count = 48, object reuse enabled, checkpoints every 10 s to HDFS, end-to-end latency under 3 s (see the sketch below).
  3. HBase: 15 pre-split regions, salted table created through Phoenix, read P99 of 8 ms.
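
A sketch of how item 2 maps onto the environment set up in FreeOrderCounterJob, assuming a checkpoint directory of hdfs://cluster/flink/ckpt (the actual path is not given in the original):

java
// Hypothetical tuning snippet for FreeOrderCounterJob, using the figures from item 2.
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(48);                        // match the Kafka partition count
env.getConfig().enableObjectReuse();           // skip defensive copies between chained operators
env.enableCheckpointing(10_000);               // checkpoint every 10 s
env.getCheckpointConfig().setCheckpointStorage("hdfs://cluster/flink/ckpt");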

8. Canary Release and Rollback

  • Offline tags use a dual-table, dual-version scheme in Hive: write tag_consume_level_tmp first, then rename it over the serving table after the row-count and MD5 checks pass (a sketch of the swap follows below).
  • Real-time tags rely on the HBase column family keeping multiple versions; a rollback only requires bulk-deleting the cells with the anomalous timestamps.
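
A minimal sketch of the swap step from the first bullet, assuming the tmp table has already passed validation and the previous version is kept as a _bak table for rollback:

sql
-- Hypothetical check: compare the tmp row count against expectations before swapping
-- (the MD5 comparison is done outside SQL).
SELECT COUNT(*) FROM tag_consume_level_tmp WHERE dt='${bizdate}';

-- Swap: the old serving table becomes the rollback target.
ALTER TABLE tag_consume_level     RENAME TO tag_consume_level_bak;
ALTER TABLE tag_consume_level_tmp RENAME TO tag_consume_level;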

9. Future Evolution

  1. Replace Hive with Iceberg for unified stream/batch metadata.
  2. Move Flink state from the heap to RocksDB with incremental checkpoints to support hour-scale large windows.
  3. Write real-time features into an online Redis store so the recommendation engine can do sub-50 ms recall.

Copyright of this article belongs to the 吃喝不愁 APP development team. Please credit the source when reposting!
