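This article is based on Flink 1.17, Kafka 3.2, Debezium 1.9, and MySQL 5.7. It records the full process of building a real-time data warehouse from scratch, with hands-on examples of four window types, TopN, and four JOIN types, plus a list of pitfalls.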
1. Environment setup
1.1 Start the Flink SQL Client
```bash
cd /opt/module/flink-1.17.0
./bin/sql-client.sh
```
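1.2 Set the display mode and session options
```sql
-- Tableau style (recommended, easier to read)
SET 'sql-client.execution.result-mode' = 'tableau';
-- Alternative: changelog mode (shows +I / -U / +U).
-- A later SET overrides an earlier one, so enable only one of the two.
-- SET 'sql-client.execution.result-mode' = 'changelog';
-- Session time zone
SET 'table.local-time-zone' = 'Asia/Shanghai';
-- Default parallelism
SET 'parallelism.default' = '2';
-- State TTL (keeps state from growing without bound)
SET 'table.exec.state.ttl' = '1h';
```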
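1.3 Comparison of the three display modes
| Mode | Command | Characteristics |
|---|---|---|
| table | `SET 'sql-client.execution.result-mode' = 'table';` | Default; shows results as a paginated table |
| tableau | `SET 'sql-client.execution.result-mode' = 'tableau';` | More compact; prints rows straight to the console, good for quick browsing |
| changelog | `SET 'sql-client.execution.result-mode' = 'changelog';` | Shows +I, -U, +U so the update process can be observed |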
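2. Table templates
2.1 Kafka source table (streaming table)
```sql
CREATE TABLE orders_source (
  order_id INT,
  product_id INT,
  quantity INT,
  order_time STRING,
  status STRING,
  -- Note: debezium-json maps columns to the row image (before/after), so `op`
  -- most likely stays NULL unless the MySQL table itself has such a column
  `op` STRING,
  -- UTC -> Beijing time. UNIX_TIMESTAMP parses the string in the session time zone
  -- (Asia/Shanghai here), which lands 8 hours before the true UTC instant;
  -- adding 28800 seconds corrects that. Under a UTC session this offset would be wrong.
  order_time_ts AS TO_TIMESTAMP_LTZ(
    (UNIX_TIMESTAMP(order_time, 'yyyy-MM-dd''T''HH:mm:ss''Z''') + 28800) * 1000,
    3
  ),
  -- Processing-time attribute, needed by lookup joins (FOR SYSTEM_TIME AS OF)
  proc_time AS PROCTIME(),
  WATERMARK FOR order_time_ts AS order_time_ts - INTERVAL '1' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'retail_db.retail_db.orders',
  'properties.bootstrap.servers' = 'hadoop102:9092,hadoop103:9092,hadoop104:9092',
  'properties.group.id' = 'flink-orders-group',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);
```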
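Key parameters:
| Parameter | Description |
|---|---|
| scan.startup.mode | earliest-offset: consume from the beginning; latest-offset: start from the newest records |
| format | debezium-json: parses the Debezium CDC format |
| WATERMARK | Declares the event-time attribute and the tolerated out-of-orderness (1 second here) |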
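2.2 MySQL dimension table (lookup table)
```sql
CREATE TABLE products_dim (
  product_id INT,
  product_name STRING,
  category STRING,
  price DECIMAL(10, 2),
  stock INT,
  PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop102:3306/retail_db',
  'table-name' = 'products',
  'username' = 'root',
  'password' = '123456',
  -- Lookup cache: at most 10000 rows, each kept for 10 minutes.
  -- Newer JDBC connector releases document these under 'lookup.partial-cache.*';
  -- check the option names for your connector version.
  'lookup.cache.max-rows' = '10000',
  'lookup.cache.ttl' = '10min'
);
```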
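The two cache options trade freshness for fewer database round trips. The sketch below is a toy model of that behavior, assuming least-recently-used eviction and a TTL counted from load time; `LookupCache` and `load_product` are hypothetical names for illustration, not Flink or JDBC connector code.

```python
import time
from collections import OrderedDict

class LookupCache:
    """Toy cache with a row limit and a per-entry TTL (illustration only)."""

    def __init__(self, max_rows, ttl_seconds, clock=time.monotonic):
        self.max_rows = max_rows
        self.ttl = ttl_seconds
        self.clock = clock
        self.rows = OrderedDict()  # key -> (value, load_time)

    def get(self, key, load):
        now = self.clock()
        hit = self.rows.get(key)
        if hit is not None and now - hit[1] < self.ttl:
            self.rows.move_to_end(key)       # mark as recently used
            return hit[0]
        value = load(key)                    # miss or expired: query the database
        self.rows[key] = (value, now)
        self.rows.move_to_end(key)
        if len(self.rows) > self.max_rows:
            self.rows.popitem(last=False)    # evict the least recently used row
        return value

calls = []                                   # records every simulated database query

def load_product(product_id):
    calls.append(product_id)
    return {"product_id": product_id, "price": 6999.00}

now = [0.0]                                  # fake clock so the example is deterministic
cache = LookupCache(max_rows=2, ttl_seconds=600, clock=lambda: now[0])
cache.get(1, load_product)                   # miss: loads from the "database"
cache.get(1, load_product)                   # hit: served from the cache
now[0] = 601.0
cache.get(1, load_product)                   # entry is older than 10 minutes: reloaded
cache.get(2, load_product)
cache.get(3, load_product)                   # third key exceeds max_rows, oldest key is evicted
print(calls, list(cache.rows))
```

A price change in MySQL therefore becomes visible to the join only after the cached row expires or is evicted, which is why the TTL should match how quickly the dimension data changes.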
2.3 JDBC sink table (writing results back)
```sql
CREATE TABLE category_sales_sink (
  window_start TIMESTAMP(3),
  category STRING,
  total_sales DECIMAL(10, 2),
  PRIMARY KEY (window_start, category) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop102:3306/retail_db',
  'table-name' = 'minute_category_sales',
  'username' = 'root',
  'password' = '123456',
  -- Flush after every row: handy for demos, costly under load
  'sink.buffer-flush.max-rows' = '1',
  'sink.buffer-flush.interval' = '1s'
);
```
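2.4 MySQL target table
```sql
CREATE TABLE IF NOT EXISTS minute_category_sales (
  window_start DATETIME(3) NOT NULL,
  category VARCHAR(50) NOT NULL,
  total_sales DECIMAL(10,2),
  PRIMARY KEY (window_start, category)
);
```
Note: MySQL converts `TIMESTAMP` values between the session time zone and UTC on write and read, which makes them prone to time zone surprises; `DATETIME(3)` is the safer choice here.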
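3. Four window types in practice
3.1 Tumbling window (TUMBLE): independent statistics per minute
Characteristics: each window is computed independently and windows do not overlap.
Trigger: a window fires once the watermark passes its end; windows without data emit nothing.
```sql
-- Lookup join (FOR SYSTEM_TIME AS OF) keeps order_time_ts usable as the window's time attribute.
-- Assumes orders_source declares `proc_time AS PROCTIME()`.
INSERT INTO category_sales_sink
SELECT
  TUMBLE_START(o.order_time_ts, INTERVAL '1' MINUTE) AS window_start,
  p.category,
  SUM(o.quantity * p.price) AS total_sales
FROM orders_source AS o
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
GROUP BY TUMBLE(o.order_time_ts, INTERVAL '1' MINUTE), p.category;
```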
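To make the window boundaries and the trigger rule concrete, here is a small sketch of the arithmetic. It is a simplified model using naive timestamps, not Flink's implementation; `tumble_window` and `window_fires` are hypothetical helper names, and the 1-second delay mirrors the watermark definition of `orders_source`.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def tumble_window(ts, size):
    """Return (start, end) of the tumbling window that contains ts; end is exclusive."""
    start = ts - (ts - EPOCH) % size   # round down to a multiple of `size`
    return start, start + size

def window_fires(window_end, max_event_time, delay=timedelta(seconds=1)):
    """Simplified trigger: the watermark must reach the window's last millisecond."""
    watermark = max_event_time - delay
    return watermark >= window_end - timedelta(milliseconds=1)

start, end = tumble_window(datetime(2026, 7, 2, 22, 30, 45), timedelta(minutes=1))
print(start, end)
print(window_fires(end, datetime(2026, 7, 2, 22, 31, 0)))   # watermark is 22:30:59
print(window_fires(end, datetime(2026, 7, 2, 22, 31, 1)))   # watermark is 22:31:00
```

The last two calls show why a window stays open until a later record arrives: the result for 22:30 to 22:31 appears only after an event pushes the watermark past the window end.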
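3.2 Sliding window (HOP): rolling statistics over the last 5 minutes
Characteristics: windows overlap, so one record can belong to several windows (five here, since the size is 5 minutes and the slide is 1 minute).
Trigger: each window fires once the watermark passes its end and it contains data, so a result is emitted every minute.
```sql
-- Lookup join; assumes orders_source declares `proc_time AS PROCTIME()`.
-- HOP arguments: time attribute, slide (1 minute), size (5 minutes).
INSERT INTO category_sales_sink
SELECT
  HOP_START(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_start,
  p.category,
  SUM(o.quantity * p.price) AS total_sales
FROM orders_source AS o
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
GROUP BY HOP(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE), p.category;
```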
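The overlap is easiest to see by listing the windows a single record falls into. The following is a minimal sketch of that assignment with naive timestamps; `hop_windows` is a hypothetical helper, not a Flink API.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def hop_windows(ts, slide, size):
    """List every (start, end) sliding window that contains ts; end is exclusive."""
    start = ts - (ts - EPOCH) % slide      # latest window start that is <= ts
    windows = []
    while start > ts - size:               # the window [start, start + size) still covers ts
        windows.append((start, start + size))
        start -= slide
    return sorted(windows)

wins = hop_windows(
    datetime(2026, 7, 2, 22, 30, 45),
    slide=timedelta(minutes=1),
    size=timedelta(minutes=5),
)
for w_start, w_end in wins:
    print(w_start.time(), "to", w_end.time())
```

An order placed at 22:30:45 is counted in five windows, from 22:26 to 22:31 through 22:30 to 22:35, so the same order contributes to five consecutive output rows.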
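3.3 Cumulative window (CUMULATE): running total for today
Characteristics: accumulates from 00:00 of the current day until the day ends. Once the first record has arrived, every one-minute step produces a result, because each record also counts toward all later steps of the same day.
Trigger: each step fires when the watermark passes the step end. It behaves like a timer, but it is driven by event time, so output stalls if the watermark stops advancing.
```sql
-- CUMULATE is only available as a window table-valued function (there is no CUMULATE_START
-- group window), so the lookup-joined stream goes into a view and the TVF is applied on top.
-- Assumes orders_source declares `proc_time AS PROCTIME()`.
CREATE TEMPORARY VIEW orders_enriched AS
SELECT o.order_time_ts, p.category, o.quantity * p.price AS amount
FROM orders_source AS o
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
WHERE o.order_time_ts >= TIMESTAMP '2026-07-02 00:00:00';  -- example cutoff date

INSERT INTO category_sales_sink
SELECT window_start, category, SUM(amount) AS total_sales
FROM TABLE(
  CUMULATE(TABLE orders_enriched, DESCRIPTOR(order_time_ts), INTERVAL '1' MINUTE, INTERVAL '1' DAY)
)
GROUP BY window_start, window_end, category;
```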
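A cumulative window can be pictured as a family of windows that share one start and grow by one step at a time. The sketch below lists the windows a single record belongs to, assuming naive timestamps aligned to midnight; `cumulate_windows` is a hypothetical helper for illustration.

```python
from datetime import datetime, timedelta

EPOCH = datetime(1970, 1, 1)

def cumulate_windows(ts, step, max_size):
    """Return the shared window start and every window end that covers ts."""
    start = ts - (ts - EPOCH) % max_size                    # all windows share this start
    first_end = start + ((ts - start) // step + 1) * step   # first step boundary after ts
    ends = []
    end = first_end
    while end <= start + max_size:
        ends.append(end)
        end += step
    return start, ends

day_start, ends = cumulate_windows(
    datetime(2026, 7, 2, 22, 30, 45),
    step=timedelta(minutes=1),
    max_size=timedelta(days=1),
)
print(day_start, ends[0], ends[-1], len(ends))
```

Every window reports the same `window_start` (midnight), so with the sink's primary key of `(window_start, category)` each new step overwrites the previous row and the table always holds the latest running total.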
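3.4 Session window (SESSION): grouping by activity gaps
Characteristics: a session closes only after more than 5 minutes pass without new data; as long as data keeps arriving, the window keeps extending.
Trigger: event-driven. The window fires once the idle time reaches the gap (and the watermark confirms it); continuous data means no output.
```sql
-- Lookup join; assumes orders_source declares `proc_time AS PROCTIME()`.
SELECT
  SESSION_START(o.order_time_ts, INTERVAL '5' MINUTE) AS session_start,
  SESSION_END(o.order_time_ts, INTERVAL '5' MINUTE) AS session_end,
  p.category,
  COUNT(*) AS order_count,
  SUM(o.quantity * p.price) AS total_sales
FROM orders_source AS o
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
GROUP BY SESSION(o.order_time_ts, INTERVAL '5' MINUTE), p.category;
```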
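Session boundaries depend on the data rather than on the clock. The following sketch groups a handful of made-up event times into sessions; it is a simplified model (exact behavior when a pause equals the gap is glossed over), and `session_windows` is a hypothetical helper.

```python
from datetime import datetime, timedelta

def session_windows(timestamps, gap):
    """Group event times into sessions; a pause longer than `gap` starts a new session."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] <= gap:
            sessions[-1].append(ts)        # still inside the running session
        else:
            sessions.append([ts])          # idle for too long: open a new session
    # Each session spans from its first event to its last event plus the gap
    return [(s[0], s[-1] + gap, len(s)) for s in sessions]

events = [
    datetime(2026, 7, 2, 22, 30),
    datetime(2026, 7, 2, 22, 32),
    datetime(2026, 7, 2, 22, 33),
    datetime(2026, 7, 2, 22, 40),          # 7 minutes after the previous event
]
sessions = session_windows(events, timedelta(minutes=5))
for s_start, s_end, count in sessions:
    print(s_start.time(), "to", s_end.time(), "events:", count)
```

The first three events merge into one session ending at 22:38, and the event at 22:40 opens a second one, which matches the `session_start` and `session_end` columns of the query.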
4. Grouped TopN: a real-time leaderboard
Scenario: within each category, the Top 2 products by sales over the past 5 minutes.
```sql
-- top_category_products_sink is assumed to be defined like the JDBC sink in 2.3.
-- Lookup join; assumes orders_source declares `proc_time AS PROCTIME()`.
INSERT INTO top_category_products_sink
SELECT
  window_start,
  window_end,
  category,
  product_name,
  total_sales,
  rank_num
FROM (
  SELECT
    window_start,
    window_end,
    category,
    product_name,
    total_sales,
    -- Rank products inside each (window, category) group
    ROW_NUMBER() OVER (
      PARTITION BY window_start, window_end, category
      ORDER BY total_sales DESC
    ) AS rank_num
  FROM (
    SELECT
      HOP_START(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_start,
      HOP_END(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_end,
      p.category,
      p.product_name,
      SUM(o.quantity * p.price) AS total_sales
    FROM orders_source AS o
    JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
      ON o.product_id = p.product_id
    GROUP BY HOP(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE), p.category, p.product_name
  ) t1
  WHERE total_sales > 0
) t2
WHERE rank_num <= 2;
```
⚠️ `QUALIFY` is not available in Flink 1.17, so use the subquery form.
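The subquery pattern boils down to "number the rows inside each group, then keep the first N". Here is a plain Python sketch of that logic on made-up rows; `top_n` and the sample values are hypothetical and only illustrate the `ROW_NUMBER()` plus `WHERE rank_num <= 2` combination.

```python
from collections import defaultdict

def top_n(rows, n):
    """rows: (window_start, category, product_name, total_sales) tuples."""
    groups = defaultdict(list)
    for window_start, category, product, sales in rows:
        if sales > 0:                                   # WHERE total_sales > 0
            groups[(window_start, category)].append((product, sales))
    result = []
    for (window_start, category), items in sorted(groups.items()):
        ranked = sorted(items, key=lambda item: item[1], reverse=True)
        for rank_num, (product, sales) in enumerate(ranked[:n], start=1):
            result.append((window_start, category, product, sales, rank_num))
    return result

rows = [
    ("22:30", "phone", "A", 100.0),
    ("22:30", "phone", "B", 300.0),
    ("22:30", "phone", "C", 200.0),
    ("22:30", "laptop", "D", 50.0),
]
ranking = top_n(rows, 2)
for row in ranking:
    print(row)
```

Product A drops out because it ranks third in its group, while the laptop group keeps its single product at rank 1.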
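5. Four JOIN types in practice
5.1 Regular Join (stream-stream join)
Scenario: join two streams in real time on arbitrary conditions (for example, an order stream joined with a payment stream).
Characteristics:
- The join itself needs no watermark, but any downstream event-time operator sees the minimum watermark of the two inputs, so both streams should define one
- State keeps all historical rows from both sides (a TTL is required)
- Supports updates and deletes from CDC streams (`-U`/`+U`/`-D`)
Sample test data:
```sql
WITH orders AS (
  SELECT 1 AS order_id, 1 AS product_id, 2 AS quantity, TIMESTAMP '2026-07-02 22:30:00' AS order_time
  UNION ALL
  SELECT 2, 1, 1, TIMESTAMP '2026-07-02 22:32:00'
),
payments AS (
  SELECT 1 AS payment_id, 1 AS order_id, 6999.00 AS amount, TIMESTAMP '2026-07-02 22:31:00' AS pay_time
  UNION ALL
  SELECT 2, 2, 12999.00, TIMESTAMP '2026-07-02 22:33:00'
)
SELECT
  o.order_id,
  o.quantity,
  p.payment_id,
  p.amount AS pay_amount
FROM orders o
JOIN payments p ON o.order_id = p.order_id;
```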
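Production notes:
```sql
-- Always set a state TTL so state cannot grow without bound
-- (rows whose counterpart has already expired will no longer join)
SET 'table.exec.state.ttl' = '1h';
-- Define a watermark on both streams if anything downstream relies on event time
-- Suitable for real-time joins on arbitrary conditions
SELECT ...
FROM orders_source o
JOIN payments_source p
  ON o.order_id = p.order_id;
```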
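Why the state grows is easier to see in a toy model: each side must remember every row it has seen, because a matching row may still arrive on the other side later. The sketch below is a simplified inner equi-join, not Flink's operator; `RegularJoin` and the row labels are hypothetical.

```python
from collections import defaultdict

class RegularJoin:
    """Toy inner equi-join over two streams; both sides are kept in state."""

    def __init__(self):
        self.left = defaultdict(list)    # order rows by join key
        self.right = defaultdict(list)   # payment rows by join key

    def on_left(self, key, row):
        self.left[key].append(row)
        return [(row, r) for r in self.right.get(key, [])]

    def on_right(self, key, row):
        self.right[key].append(row)
        return [(l, row) for l in self.left.get(key, [])]

    def state_size(self):
        rows = list(self.left.values()) + list(self.right.values())
        return sum(len(r) for r in rows)

join = RegularJoin()
out1 = join.on_left(1, "order-1")     # no payment yet: nothing emitted, row is stored
out2 = join.on_right(1, "pay-1")      # matches the stored order
out3 = join.on_left(2, "order-2")     # stored, still waiting for its payment
print(out1, out2, out3, join.state_size())
```

Without a TTL, `order-2` would wait in state forever if its payment never arrives, which is the growth that `table.exec.state.ttl` caps.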
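5.2 Lookup Join (dimension table join)
Scenario: enrich a stream with an external, mostly static dimension table (for example, the MySQL product table).
Characteristics:
- Needs a processing-time attribute on the stream side and a connector that supports lookups (JDBC here); the dimension table is normally looked up by its primary key
- Does not hold back the watermark
- Suited to static or slowly changing dimension data
```sql
-- Assumes orders_source declares `proc_time AS PROCTIME()`. Without FOR SYSTEM_TIME AS OF
-- the planner would treat this as a regular join against a scan of the JDBC table.
SELECT
  o.order_id,
  p.product_name,
  p.category,
  o.quantity * p.price AS total_amount
FROM orders_source AS o
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id;
```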
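5.3 Interval Join (time-range join)
Scenario: join two streams within a given time range (for example, matching orders and payments that occur within 5 minutes of each other).
Characteristics:
- Requires append-only input streams (CDC streams are not supported)
- State is bounded (only rows inside the time range are kept)
- Suited to joining two log streams
```sql
-- Static rows only demonstrate the matching rule; on these the planner runs a regular join.
-- A true interval join needs time attributes (with watermarks) on both inputs.
WITH orders_test AS (
  SELECT 1 AS order_id, TIMESTAMP '2026-07-02 22:30:00' AS order_time
  UNION ALL
  SELECT 2, TIMESTAMP '2026-07-02 22:32:00'
  UNION ALL
  SELECT 3, TIMESTAMP '2026-07-02 22:33:00'
  UNION ALL
  SELECT 4, TIMESTAMP '2026-07-02 22:38:00'
)
SELECT
  o1.order_id AS left_order_id,
  o2.order_id AS right_order_id,
  o1.order_time AS left_time,
  o2.order_time AS right_time
FROM orders_test o1
JOIN orders_test o2
  ON o1.order_id < o2.order_id
  AND o1.order_time BETWEEN o2.order_time - INTERVAL '5' MINUTE
                        AND o2.order_time + INTERVAL '5' MINUTE;
```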
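To check which pairs the query above should produce, the same predicate can be evaluated by hand. This sketch replays the four sample rows in plain Python; `interval_self_join` is a hypothetical helper that mirrors the `BETWEEN` condition, both bounds inclusive.

```python
from datetime import datetime, timedelta

orders = [
    (1, datetime(2026, 7, 2, 22, 30)),
    (2, datetime(2026, 7, 2, 22, 32)),
    (3, datetime(2026, 7, 2, 22, 33)),
    (4, datetime(2026, 7, 2, 22, 38)),
]

def interval_self_join(rows, bound):
    """Pair rows whose ids are ordered and whose times lie within `bound` of each other."""
    return [
        (left_id, right_id)
        for left_id, left_ts in rows
        for right_id, right_ts in rows
        if left_id < right_id and right_ts - bound <= left_ts <= right_ts + bound
    ]

pairs = interval_self_join(orders, timedelta(minutes=5))
print(pairs)
```

Orders 3 and 4 are exactly 5 minutes apart and still match because `BETWEEN` is inclusive, while orders 2 and 4 (6 minutes apart) do not. The bound is also what lets a streaming interval join discard rows older than the range.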
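5.4 Temporal Join (versioned table join)
Scenario: look up the product price that was valid when the order was placed (a historical version).
Characteristics:
- The versioned table must have a primary key and a time attribute; an event-time temporal join also needs watermarks on both sides
- State stays bounded: versions that the watermark has passed can be dropped, so roughly the latest version per key is kept
- Suited to continuously updated CDC streams
```sql
-- Static rows emulate the version lookup with a correlated subquery.
-- On real streams the event-time temporal join is written as:
--   JOIN product_price_history FOR SYSTEM_TIME AS OF o.order_time AS p ON o.product_id = p.product_id
WITH product_price_history AS (
  SELECT 1 AS product_id, 6999.00 AS price, TIMESTAMP '2026-07-01 00:00:00' AS valid_from
  UNION ALL
  SELECT 1, 7999.00, TIMESTAMP '2026-07-02 00:00:00'
  UNION ALL
  SELECT 2, 12999.00, TIMESTAMP '2026-07-01 00:00:00'
  UNION ALL
  SELECT 2, 13999.00, TIMESTAMP '2026-07-02 00:00:00'
),
orders AS (
  SELECT 1 AS order_id, 1 AS product_id, TIMESTAMP '2026-07-01 12:00:00' AS order_time
  UNION ALL
  SELECT 2, 1, TIMESTAMP '2026-07-02 14:00:00'
)
SELECT
  o.order_id,
  o.product_id,
  o.order_time,
  p.price AS price_at_order_time
FROM orders o
JOIN product_price_history p
  ON o.product_id = p.product_id
  AND p.valid_from <= o.order_time
-- Keep only the newest version that was already valid at order time
WHERE p.valid_from = (
  SELECT MAX(valid_from)
  FROM product_price_history p2
  WHERE p2.product_id = p.product_id
    AND p2.valid_from <= o.order_time
);
```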
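The "price as of order time" lookup is a search for the newest version whose `valid_from` is not later than the order. A minimal sketch on the same sample rows follows; `history` and `price_as_of` are hypothetical names, and the binary search is just one convenient way to express the rule.

```python
from bisect import bisect_right
from datetime import datetime

# product_id -> versions sorted by valid_from (same sample rows as the SQL example)
history = {
    1: [(datetime(2026, 7, 1), 6999.00), (datetime(2026, 7, 2), 7999.00)],
    2: [(datetime(2026, 7, 1), 12999.00), (datetime(2026, 7, 2), 13999.00)],
}

def price_as_of(product_id, order_time):
    """Return the price of the newest version with valid_from <= order_time, or None."""
    versions = history.get(product_id, [])
    starts = [valid_from for valid_from, _ in versions]
    idx = bisect_right(starts, order_time)   # number of versions already valid
    return versions[idx - 1][1] if idx else None

price_order_1 = price_as_of(1, datetime(2026, 7, 1, 12, 0))
price_order_2 = price_as_of(1, datetime(2026, 7, 2, 14, 0))
price_too_early = price_as_of(1, datetime(2026, 6, 30, 9, 0))
print(price_order_1, price_order_2, price_too_early)
```

Order 1 gets the price that applied on July 1 and order 2 the price that applied from July 2, even though both refer to the same product; an order placed before the first version finds no price at all.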
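6. Comparison of the four JOIN types
| JOIN type | Use case | State size | Watermark requirement | CDC stream support | Trigger |
|---|---|---|---|---|---|
| Regular Join | Stream-stream join on arbitrary conditions | Large (full history) | Not needed by the join itself; both streams need one if downstream operators use event time | ✅ Supported | Data-driven |
| Lookup Join | Stream plus external dimension table | Bounded (cache) | Only the stream side | ✅ Supported | Query-driven |
| Interval Join | Stream-stream join within a time range | Bounded (within the range) | Both streams | ❌ Not supported | Time-driven |
| Temporal Join | Join against historical versions of a dimension | Bounded (version snapshot) | Both sides for an event-time temporal join | ✅ Supported | Watermark-driven |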
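7. Pitfalls
Pitfall 1: a stream-stream JOIN holds back the watermark
Symptom: windows never fire.
Cause: after a join of two streaming tables, the watermark is the minimum of the two inputs. If one table has no watermark, that input stays at Long.MIN_VALUE and the combined watermark never advances.
Fix: turn the dimension table into a Lookup Join, or give the dimension table a time attribute and a watermark as well.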
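The effect can be reproduced with a few lines of arithmetic. This is a simplified model of how a two-input operator combines watermarks, with made-up millisecond values; `combined_watermark` and `window_can_fire` are hypothetical helpers.

```python
LONG_MIN = -(2 ** 63)   # Long.MIN_VALUE: the input has not produced any watermark yet

def combined_watermark(input_watermarks):
    """A multi-input operator forwards the minimum watermark of its inputs."""
    return min(input_watermarks)

def window_can_fire(window_end_ms, watermark_ms):
    """Simplified trigger: the watermark must reach the window's last millisecond."""
    return watermark_ms >= window_end_ms - 1

window_end = 60_000                      # a one-minute window ending at 00:01:00
orders_wm = 61_000                       # the order stream is already past the window end

stuck = combined_watermark([orders_wm, LONG_MIN])    # second input has no watermark
healthy = combined_watermark([orders_wm, 75_000])    # both inputs define one

print(window_can_fire(window_end, stuck), window_can_fire(window_end, healthy))
```

With a lookup join there is only one streaming input, so the order stream's watermark passes through unchanged and the window fires.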
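Pitfall 2: time zone issues
Symptom: Flink time and MySQL time differ by 8 hours.
Fix:
```sql
-- UNIX_TIMESTAMP parses the string in the session time zone (Asia/Shanghai here),
-- so the UTC string is read 8 hours too early; + 28800 seconds restores the true instant.
order_time_ts AS TO_TIMESTAMP_LTZ(
  (UNIX_TIMESTAMP(order_time, 'yyyy-MM-dd''T''HH:mm:ss''Z''') + 28800) * 1000,
  3
)
```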
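The 28800-second correction can be verified outside Flink. The sketch below replays the arithmetic, assuming (as the Flink function reference describes for `UNIX_TIMESTAMP`) that the string is parsed in the session time zone; the sample timestamp is made up.

```python
from datetime import datetime, timedelta, timezone

SHANGHAI = timezone(timedelta(hours=8))
raw = "2026-07-02T14:30:00Z"             # UTC string as emitted for a MySQL TIMESTAMP column

naive = datetime.strptime(raw, "%Y-%m-%dT%H:%M:%SZ")

# Assumed behavior under a Shanghai session: the digits are read as local time
epoch_as_local = int(naive.replace(tzinfo=SHANGHAI).timestamp())
# The instant the string really denotes
true_epoch = int(naive.replace(tzinfo=timezone.utc).timestamp())

corrected = epoch_as_local + 28800       # the fix used in the column definition
shanghai_wall_clock = datetime.fromtimestamp(corrected, SHANGHAI)

print(true_epoch - epoch_as_local, corrected == true_epoch, shanghai_wall_clock)
```

The misread value is exactly 28800 seconds early, and after the correction the instant displays as 22:30 Beijing time. The same expression would overshoot by 8 hours if the session time zone were UTC, so the fix is tied to `'table.local-time-zone' = 'Asia/Shanghai'`.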
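Pitfall 3: QUALIFY syntax is not supported (Flink 1.17)
Fix: use a subquery with `WHERE rn <= N` instead.
Pitfall 4: Interval Join does not support CDC streams
Fix: use a Temporal Join or Lookup Join instead, or read plain JSON data as an append-only stream. Window aggregations in Flink 1.17 generally expect append-only input as well, so the same workaround applies if the planner reports that a node "doesn't support consuming update and delete changes".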
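Pitfall 5: insufficient resources (NoResourceAvailableException)
Fix: free task slots by cancelling jobs that are no longer needed, or lower the default parallelism.
```bash
# List running jobs, then cancel the ones that are holding task slots
./bin/flink list -r
./bin/flink cancel <JobID>
# Then, inside the SQL Client (this is SQL, not a shell command), lower the parallelism:
#   SET 'parallelism.default' = '1';
```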