A User-Profile Tagging System for Food-Delivery "Free Meal" Campaigns: Hybrid Computation with Spark SQL Batch and Kafka Stream Processing
一、Business Background and System Goals
The "free meal" (霸王餐) channel of the 吃喝不愁 app produces tens of millions of order events every day. User-profile tags must be refreshed within minutes to support three scenarios: personalized ("one feed per user") coupon issuance, risk control against coupon abuse, and merchant subsidy settlement. Tags fall into two categories:
- Offline T+1 full tags: spending power, category preference, sensitive price band.
- Real-time incremental tags: whether the user is currently in an order surge, whether ordering behavior is anomalous, and whether the daily free-order limit has been reached.
Technology choices: Spark SQL runs hourly batch jobs to backfill history, while Kafka + Flink handles second-level stream processing to correct the latest state. The results of both paths land in HBase, which serves external queries at millisecond latency.

二、Overall Architecture
```
DB order records ──(Canal incremental sync)──► Kafka topic order_raw
Gateway access logs ─────────────────────────► Kafka topic order_raw
Kafka topic order_raw ──► Flink job (per-minute aggregation) ──► tag_rt_stream ──► HBase
Daily 00:00 full sync ──► Hive wide table dws_order_d ──► Spark SQL batch job ──► tag_offline_tbl ──► HBase
```
三、Offline Batch Processing: Building Base Tags with Spark SQL
3.1 Table Creation and Partitioning Strategy
```sql
-- Daily-partitioned order wide table; dt is assumed to be in yyyy-MM-dd format
-- so that the string comparisons against add_months()/date_sub() in 3.2/3.3 work.
CREATE EXTERNAL TABLE dws_order_d (
  user_id      STRING,
  merchant_id  STRING,
  order_amt    DECIMAL(10,2),
  discount_amt DECIMAL(10,2),
  cate         STRING,
  pay_time     TIMESTAMP
) PARTITIONED BY (dt STRING)
STORED AS ORC
LOCATION 'hdfs://cluster/dws/dws_order_d';
```
3.2 Spending-Power Tier Tag
```sql
INSERT OVERWRITE TABLE tag_consume_level PARTITION (dt='${bizdate}')
SELECT user_id,
       CASE WHEN month_amt >= 1500 THEN 'H'
            WHEN month_amt >= 600  THEN 'M'
            ELSE 'L' END AS consume_level
FROM (
  SELECT user_id, SUM(order_amt) AS month_amt
  FROM dws_order_d
  WHERE dt BETWEEN add_months('${bizdate}', -1) AND '${bizdate}'
  GROUP BY user_id
) t;
```
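The statement above is parameterized with `${bizdate}`. Below is a minimal sketch of how a Java Spark driver could substitute the business date and submit it; the class name `ConsumeLevelBatchJob` and the way `bizdate` is passed in are assumptions, not the original implementation.
```java
package cn.juwatech.food.batch;

import org.apache.spark.sql.SparkSession;

// Hypothetical driver for the daily consume-level tag job.
public class ConsumeLevelBatchJob {
    public static void main(String[] args) {
        // bizdate is assumed to be passed by the scheduler, e.g. "2024-06-01"
        String bizdate = args.length > 0 ? args[0] : "2024-06-01";
        SparkSession spark = SparkSession.builder()
                .appName("ConsumeLevelBatchJob")
                .enableHiveSupport()   // read dws_order_d and write the Hive tag table
                .getOrCreate();
        String sql =
                "INSERT OVERWRITE TABLE tag_consume_level PARTITION (dt='" + bizdate + "') " +
                "SELECT user_id, " +
                "       CASE WHEN month_amt >= 1500 THEN 'H' " +
                "            WHEN month_amt >= 600  THEN 'M' " +
                "            ELSE 'L' END AS consume_level " +
                "FROM (SELECT user_id, SUM(order_amt) AS month_amt " +
                "      FROM dws_order_d " +
                "      WHERE dt BETWEEN add_months('" + bizdate + "', -1) AND '" + bizdate + "' " +
                "      GROUP BY user_id) t";
        spark.sql(sql);
        spark.stop();
    }
}
```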
3.3 Top-3 Category Preference
```sql
INSERT OVERWRITE TABLE tag_cate_pref PARTITION (dt='${bizdate}')
SELECT user_id,
       -- Note: collect_list does not guarantee element order; if the Top-3 must be
       -- ordered by frequency, sort on rn first (e.g. sort_array over struct(rn, cate)).
       concat_ws(',', collect_list(cate)) AS top3_cate
FROM (
  SELECT user_id, cate,
         ROW_NUMBER() OVER (PARTITION BY user_id ORDER BY cnt DESC) AS rn
  FROM (
    SELECT user_id, cate, COUNT(*) AS cnt
    FROM dws_order_d
    WHERE dt BETWEEN date_sub('${bizdate}', 30) AND '${bizdate}'
    GROUP BY user_id, cate
  ) t1
) t2
WHERE rn <= 3
GROUP BY user_id;
```
四、Real-Time Stream Processing: Second-Level Corrections with Kafka + Flink
4.1 Order-Event POJO
```java
package cn.juwatech.food.model;

// Order event consumed from the order_raw Kafka topic
public class OrderEvent {
    public String orderId;
    public String userId;
    public long orderAmt;     // order amount in cents (fen)
    public long discountAmt;  // discount amount in cents (fen)
    public long payTime;      // payment timestamp, epoch millis (used as event time)
    public String cate;       // category
}
```
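The Flink job in 4.2 references an `OrderEventDes` deserializer that the original text does not show. Below is a minimal sketch of what it could look like, assuming the topic carries JSON and using Jackson; the field names simply mirror `OrderEvent`.
```java
package cn.juwatech.food.model;

import com.fasterxml.jackson.databind.ObjectMapper;
import org.apache.flink.api.common.serialization.DeserializationSchema;
import org.apache.flink.api.common.typeinfo.TypeInformation;

// Hypothetical sketch of the OrderEventDes referenced in 4.2: turns the JSON
// payload of the order_raw topic into OrderEvent objects (JSON encoding assumed).
public class OrderEventDes implements DeserializationSchema<OrderEvent> {
    private static final ObjectMapper MAPPER = new ObjectMapper();

    @Override
    public OrderEvent deserialize(byte[] message) throws java.io.IOException {
        return MAPPER.readValue(message, OrderEvent.class);
    }

    @Override
    public boolean isEndOfStream(OrderEvent nextElement) {
        return false; // unbounded stream
    }

    @Override
    public TypeInformation<OrderEvent> getProducedType() {
        return TypeInformation.of(OrderEvent.class);
    }
}
```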
4.2 Flink Consumes Kafka and Counts Real-Time Free Orders
```java
package cn.juwatech.food.job;

import cn.juwatech.food.model.OrderEvent;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.api.common.state.ValueState;
import org.apache.flink.api.common.state.ValueStateDescriptor;
import org.apache.flink.api.common.typeinfo.Types;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.KeyedBroadcastProcessFunction;
import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.Properties;

public class FreeOrderCounterJob {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        Properties p = new Properties();
        p.setProperty("bootstrap.servers", "kafka001:9092");
        p.setProperty("group.id", "free-order-counter");

        FlinkKafkaConsumer<OrderEvent> source =
                new FlinkKafkaConsumer<>("order_raw", new OrderEventDes(), p);
        source.assignTimestampsAndWatermarks(
                WatermarkStrategy.<OrderEvent>forBoundedOutOfOrderness(Duration.ofSeconds(5))
                        .withTimestampAssigner((e, t) -> e.payTime));
        DataStream<OrderEvent> orderStream = env.addSource(source);

        // Broadcast configuration: the daily free-order limit
        MapStateDescriptor<String, Integer> broadcastDesc =
                new MapStateDescriptor<>("freeLimit", Types.STRING, Types.INT);
        BroadcastStream<Integer> broadcast = env
                .addSource(new FreeLimitSource())
                .broadcast(broadcastDesc);

        orderStream
                .keyBy(o -> o.userId)
                .connect(broadcast)
                .process(new KeyedBroadcastProcessFunction<String, OrderEvent, Integer, String>() {
                    private static final long serialVersionUID = 1L;
                    // Per-user counter of today's free orders (daily reset/TTL handled elsewhere)
                    private transient ValueState<Integer> cntState;

                    @Override
                    public void open(Configuration parameters) {
                        cntState = getRuntimeContext()
                                .getState(new ValueStateDescriptor<>("cnt", Types.INT));
                    }

                    @Override
                    public void processElement(OrderEvent value, ReadOnlyContext ctx,
                                               Collector<String> out) throws Exception {
                        ReadOnlyBroadcastState<String, Integer> limitState =
                                ctx.getBroadcastState(broadcastDesc);
                        Integer limit = limitState.get("limit");
                        if (limit == null) {
                            return; // limit not broadcast yet
                        }
                        Integer stored = cntState.value();
                        int todayCnt = stored == null ? 0 : stored;
                        // A fully discounted order counts as a free order
                        if (todayCnt < limit && value.discountAmt == value.orderAmt) {
                            todayCnt += 1;
                            cntState.update(todayCnt);
                            out.collect(value.userId + "\t" + todayCnt);
                        }
                    }

                    @Override
                    public void processBroadcastElement(Integer value, Context ctx,
                                                        Collector<String> out) throws Exception {
                        // Store the latest daily limit pushed by FreeLimitSource
                        ctx.getBroadcastState(broadcastDesc).put("limit", value);
                    }
                })
                .addSink(new HBaseSink("user_realtime", "free_cnt"));

        env.execute("FreeOrderCounterJob");
    }
}
```
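The `HBaseSink` used above is also not shown in the original text. A minimal sketch, assuming the sink receives the `userId\tcount` strings emitted by the process function and writes them into column family `f` of the given table; the constructor arguments and connection details are illustrative.
```java
package cn.juwatech.food.sink;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical sketch of the HBaseSink referenced in 4.2: writes "userId\tcount"
// records into column f:<qualifier> of the target table.
public class HBaseSink extends RichSinkFunction<String> {
    private final String tableName;
    private final String qualifier;
    private transient Connection connection;
    private transient Table table;

    public HBaseSink(String tableName, String qualifier) {
        this.tableName = tableName;
        this.qualifier = qualifier;
    }

    @Override
    public void open(Configuration parameters) throws Exception {
        org.apache.hadoop.conf.Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // assumed quorum, same as 5.2
        connection = ConnectionFactory.createConnection(conf);
        table = connection.getTable(TableName.valueOf(tableName));
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        String[] parts = value.split("\t");
        // RowKey is the reversed user_id, consistent with the design in 5.1
        String rowKey = new StringBuilder(parts[0]).reverse().toString();
        Put put = new Put(Bytes.toBytes(rowKey));
        put.addColumn(Bytes.toBytes("f"), Bytes.toBytes(qualifier), Bytes.toBytes(parts[1]));
        table.put(put);
    }

    @Override
    public void close() throws Exception {
        if (table != null) table.close();
        if (connection != null) connection.close();
    }
}
```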
五、Hybrid Computation: Merging Results in a Lambda Architecture
5.1 HBase Table Design
- RowKey = reverse(user_id), reversed to spread monotonically increasing ids across regions.
- Column family `f`:
  - `f:consume_level` — offline batch result
  - `f:free_cnt` — real-time stream result
  - `f:update_time` — update timestamp
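To make the layout concrete, here is a small illustrative helper that builds one row according to this schema; the real writers (the sink in 4.2 and the batch snippet in 5.3) construct their Puts inline, and the plain string-byte encoding matches those snippets.
```java
package cn.juwatech.food.utils;

import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.util.Bytes;

// Illustrative helper mirroring the user_profile row layout described in 5.1.
public final class UserProfileRow {

    private static final byte[] CF = Bytes.toBytes("f");

    // Reversed user_id spreads sequential ids across regions (hotspot avoidance).
    public static byte[] rowKey(String userId) {
        return Bytes.toBytes(new StringBuilder(userId).reverse().toString());
    }

    public static Put toPut(String userId, String consumeLevel, int freeCnt) {
        Put put = new Put(rowKey(userId));
        put.addColumn(CF, Bytes.toBytes("consume_level"), Bytes.toBytes(consumeLevel));          // offline batch result
        put.addColumn(CF, Bytes.toBytes("free_cnt"), Bytes.toBytes(String.valueOf(freeCnt)));    // real-time stream result
        put.addColumn(CF, Bytes.toBytes("update_time"),
                Bytes.toBytes(String.valueOf(System.currentTimeMillis())));                      // update timestamp
        return put;
    }
}
```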
5.2 Utility Class for Writing Spark SQL Results to HBase
```java
package cn.juwatech.food.utils;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.HConstants;
import org.apache.hadoop.hbase.client.Put;
import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
import org.apache.hadoop.hbase.mapreduce.TableOutputFormat;
import org.apache.hadoop.mapreduce.Job;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.JavaRDD;
import org.apache.spark.api.java.function.PairFunction;
import scala.Tuple2;

// Writes an RDD of HBase Puts to the given table via TableOutputFormat.
public class HBaseBulkWriter {
    public static void write(JavaRDD<Put> putRDD, String tableNameStr) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set(HConstants.ZOOKEEPER_QUORUM, "zk1,zk2,zk3");
        Job job = Job.getInstance(conf);
        job.getConfiguration().set(TableOutputFormat.OUTPUT_TABLE, tableNameStr);
        job.setOutputFormatClass(TableOutputFormat.class);
        // TableOutputFormat ignores the key, so an empty ImmutableBytesWritable is enough
        JavaPairRDD<ImmutableBytesWritable, Put> pairRDD = putRDD.mapToPair(
                (PairFunction<Put, ImmutableBytesWritable, Put>) put ->
                        new Tuple2<>(new ImmutableBytesWritable(), put));
        pairRDD.saveAsNewAPIHadoopDataset(job.getConfiguration());
    }
}
```
5.3 Daily Batch Task Writing consume_level into HBase
```java
// Snippet from the daily driver: reverse(user_id) must match the rowkey
// reversal used on the query side (see section 六).
Dataset<Row> levelDS = spark.sql(
    "SELECT user_id, consume_level FROM tag_consume_level WHERE dt='" + bizdate + "'");
JavaRDD<Put> putRDD = levelDS.toJavaRDD().map(row -> {
    Put put = new Put(Bytes.toBytes(reverse(row.getString(0))));
    put.addColumn(Bytes.toBytes("f"), Bytes.toBytes("consume_level"),
        Bytes.toBytes(row.getString(1)));
    return put;
});
HBaseBulkWriter.write(putRDD, "user_profile");
```
六、Online Service: Millisecond-Level Queries
Spring Boot + Phoenix JDBC, mapped directly onto HBase:
```java
package cn.juwatech.food.api.service;

import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.stereotype.Service;

@Service
public class UserProfileService {

    @Autowired
    private JdbcTemplate phoenixTemplate;

    public UserProfile getProfile(String userId) {
        // pk is the reversed user_id, matching the rowkey design in 5.1
        String sql = "SELECT consume_level, free_cnt FROM user_profile WHERE pk = ?";
        return phoenixTemplate.queryForObject(sql,
                (rs, rowNum) -> new UserProfile(
                        rs.getString("consume_level"),
                        rs.getInt("free_cnt")),
                reverse(userId));
    }

    private static String reverse(String s) {
        return new StringBuilder(s).reverse().toString();
    }
}
```
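The `phoenixTemplate` bean is not shown in the original text. A minimal sketch of wiring a `JdbcTemplate` against the Phoenix thick driver; the ZooKeeper quorum in the URL and the configuration class name are assumptions.
```java
package cn.juwatech.food.api.config;

import javax.sql.DataSource;
import org.springframework.context.annotation.Bean;
import org.springframework.context.annotation.Configuration;
import org.springframework.jdbc.core.JdbcTemplate;
import org.springframework.jdbc.datasource.DriverManagerDataSource;

// Hypothetical configuration for the phoenixTemplate used by UserProfileService.
@Configuration
public class PhoenixConfig {

    @Bean
    public DataSource phoenixDataSource() {
        DriverManagerDataSource ds = new DriverManagerDataSource();
        ds.setDriverClassName("org.apache.phoenix.jdbc.PhoenixDriver");
        // Phoenix thick-driver URL format: jdbc:phoenix:<zookeeper quorum> (quorum assumed)
        ds.setUrl("jdbc:phoenix:zk1,zk2,zk3");
        return ds;
    }

    @Bean
    public JdbcTemplate phoenixTemplate(DataSource phoenixDataSource) {
        return new JdbcTemplate(phoenixDataSource);
    }
}
```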
七、Performance and Tuning
- Spark batch: dws_order_d is bucketed into 128 buckets by dt + user_id and written as ORC with ZSTD compression; a 3 TB dataset finishes in roughly 30 minutes.
- Flink streaming: parallelism = number of Kafka partitions = 48, object reuse enabled, 10 s checkpoints to HDFS, end-to-end latency under 3 s (see the sketch after this list).
- HBase: pre-split into 15 regions, salted table created in Phoenix, read P99 of 8 ms.
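A minimal sketch of how the Flink settings listed above might be applied to the environment in FreeOrderCounterJob; the helper class name and the HDFS checkpoint path are assumptions.
```java
package cn.juwatech.food.job;

import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical helper applying the tuning values from section 七 to the job environment.
public final class StreamTuning {
    public static void apply(StreamExecutionEnvironment env) {
        env.setParallelism(48);               // match the 48 Kafka partitions
        env.getConfig().enableObjectReuse();  // skip defensive copies between chained operators
        env.enableCheckpointing(10_000);      // checkpoint every 10 s
        env.getCheckpointConfig().setCheckpointStorage(
                "hdfs://cluster/flink/checkpoints/free-order-counter"); // assumed path
    }
}
```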
八、Canary Release and Rollback
- Offline tags use a dual-table, dual-version scheme in Hive: write to tag_consume_level_tmp first, verify row counts and MD5 checksums, then rename to overwrite the live table.
- Real-time tags rely on multi-version columns in the HBase column family; rolling back just means bulk-deleting the cells with the abnormal timestamps (a sketch follows this list).
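A minimal sketch of that rollback, assuming the bad writes fall inside a known time window: scan the affected column in that window and issue version-specific deletes so the previous versions become visible again. Table and column names follow 5.1; the class name and connection details are illustrative.
```java
package cn.juwatech.food.utils;

import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;
import org.apache.hadoop.hbase.client.Delete;
import org.apache.hadoop.hbase.client.Result;
import org.apache.hadoop.hbase.client.Scan;
import org.apache.hadoop.hbase.client.Table;
import org.apache.hadoop.hbase.util.Bytes;

// Hypothetical rollback helper: removes the free_cnt versions written during a bad
// time window, relying on the multi-version column family described in section 八.
public class RealtimeTagRollback {
    public static void rollbackFreeCnt(long badStart, long badEnd) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        conf.set("hbase.zookeeper.quorum", "zk1,zk2,zk3"); // assumed quorum, same as 5.2
        byte[] cf = Bytes.toBytes("f");
        byte[] qualifier = Bytes.toBytes("free_cnt");
        try (Connection conn = ConnectionFactory.createConnection(conf);
             Table table = conn.getTable(TableName.valueOf("user_realtime"))) {
            Scan scan = new Scan().addColumn(cf, qualifier).setTimeRange(badStart, badEnd);
            List<Delete> deletes = new ArrayList<>();
            for (Result r : table.getScanner(scan)) {
                Cell cell = r.getColumnLatestCell(cf, qualifier);
                // Delete exactly this version; earlier versions are exposed again on read
                deletes.add(new Delete(r.getRow()).addColumn(cf, qualifier, cell.getTimestamp()));
            }
            table.delete(deletes);
        }
    }
}
```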
九、Future Evolution
- Introduce Iceberg to replace Hive and unify batch/stream metadata.
- Move Flink state from the heap to RocksDB with incremental checkpoints to support large hour-level windows (see the sketch after this list).
- Write real-time features into online Redis so the recommendation engine can recall within 50 ms.
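A minimal sketch of that state-backend switch, assuming Flink 1.13+ and the flink-statebackend-rocksdb dependency; the helper class is illustrative and would be called from FreeOrderCounterJob.
```java
package cn.juwatech.food.job;

import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical switch of keyed state from heap to RocksDB with incremental checkpoints.
public final class RocksDbStateConfig {
    public static void apply(StreamExecutionEnvironment env) {
        // true = incremental checkpoints: only changed SST files are uploaded per checkpoint
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
    }
}
```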