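Flink SQL from Basics to Practice: 4 Window Types + TopN + 4 JOIN Types Explained

This article is based on Flink 1.17 + Kafka 3.2 + Debezium 1.9 + MySQL 5.7. It records the full process of building a real-time data warehouse from scratch, with hands-on examples of the 4 window types, TopN, and the 4 JOIN types, plus the pitfalls encountered along the way.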


1. Environment Setup

1.1 Start the SQL Client

```bash
cd /opt/module/flink-1.17.0
./bin/sql-client.sh
```

1.2 Set the Display Mode

```sql
-- Tableau style (recommended, easier to read)
SET 'sql-client.execution.result-mode' = 'tableau';

-- Or changelog mode (shows +I / -U / +U)
SET 'sql-client.execution.result-mode' = 'changelog';

-- Set the session time zone
SET 'table.local-time-zone' = 'Asia/Shanghai';

-- Set the default parallelism
SET 'parallelism.default' = '2';

-- Set the state TTL (keeps state from growing without bound)
SET 'table.exec.state.ttl' = '1h';
```

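1.3 Comparison of the Three Display Modes

| Mode | Command | Characteristics |
| --- | --- | --- |
| table | `SET 'sql-client.execution.result-mode' = 'table';` | Default; displays results in table form |
| tableau | `SET 'sql-client.execution.result-mode' = 'tableau';` | More compact; suited to quick inspection |
| changelog | `SET 'sql-client.execution.result-mode' = 'changelog';` | Shows +I / -U / +U markers, so you can follow how rows are updated |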

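2. Table Templates

2.1 Kafka Source Table (Stream Table)

```sql
CREATE TABLE orders_source (
  order_id INT,
  product_id INT,
  quantity INT,
  order_time STRING,
  status STRING,
  `op` STRING,
  -- UNIX_TIMESTAMP parses order_time in the session time zone and treats the
  -- trailing 'Z' as a literal, so adding 28800 s restores the true UTC instant.
  -- This assumes table.local-time-zone = 'Asia/Shanghai' (section 1.2).
  order_time_ts AS TO_TIMESTAMP_LTZ(
    (UNIX_TIMESTAMP(order_time, 'yyyy-MM-dd''T''HH:mm:ss''Z''') + 28800) * 1000,
    3
  ),
  -- Processing-time attribute, needed for Lookup Join syntax (FOR SYSTEM_TIME AS OF)
  proc_time AS PROCTIME(),
  WATERMARK FOR order_time_ts AS order_time_ts - INTERVAL '1' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'retail_db.retail_db.orders',
  'properties.bootstrap.servers' = 'hadoop102:9092,hadoop103:9092,hadoop104:9092',
  'properties.group.id' = 'flink-orders-group',
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'debezium-json'
);
```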

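Key parameters:

| Parameter | Description |
| --- | --- |
| scan.startup.mode | `earliest-offset`: consume from the beginning; `latest-offset`: consume from the latest offset |
| format | `debezium-json`: parses the Debezium CDC format into a changelog stream (inserts, updates, deletes) |
| WATERMARK | Defines the event-time column and the tolerated out-of-orderness |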
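To try a different startup position without editing the DDL, a dynamic table options hint can override `scan.startup.mode` for a single query. The following is a minimal sketch, assuming dynamic table options are enabled (the default in recent Flink releases) and reusing `orders_source` from section 2.1.

```sql
-- Ad-hoc query that only reads records arriving after the job starts;
-- the table definition itself is left untouched.
SELECT order_id, product_id, quantity, order_time_ts
FROM orders_source /*+ OPTIONS('scan.startup.mode' = 'latest-offset') */;
```

This is convenient while debugging, because the same table can be replayed from the beginning in one query and tailed from the latest offset in another.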

2.2 MySQL Dimension Table (Lookup Table)

```sql
CREATE TABLE products_dim (
  product_id INT,
  product_name STRING,
  category STRING,
  price DECIMAL(10, 2),
  stock INT,
  PRIMARY KEY (product_id) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop102:3306/retail_db',
  'table-name' = 'products',
  'username' = 'root',
  'password' = '123456',
  'lookup.cache.max-rows' = '10000',
  'lookup.cache.ttl' = '10min'
);
```

2.3 JDBC Sink Table (Writing Results Back)

```sql
CREATE TABLE category_sales_sink (
  window_start TIMESTAMP(3),
  category STRING,
  total_sales DECIMAL(10, 2),
  PRIMARY KEY (window_start, category) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop102:3306/retail_db',
  'table-name' = 'minute_category_sales',
  'username' = 'root',
  'password' = '123456',
  'sink.buffer-flush.max-rows' = '1',
  'sink.buffer-flush.interval' = '1s'
);
```

2.4 MySQL Target Table

```sql
CREATE TABLE IF NOT EXISTS minute_category_sales (
  window_start DATETIME(3) NOT NULL,
  category VARCHAR(50) NOT NULL,
  total_sales DECIMAL(10,2),
  PRIMARY KEY (window_start, category)
);
```

Note: MySQL converts TIMESTAMP values according to the session time zone, which can shift the stored window times. DATETIME(3) is recommended instead.


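3. The Four Window Types in Practice

3.1 Tumbling Window (TUMBLE): independent per-minute statistics

Characteristics: each window is computed independently, and windows do not overlap.

Trigger: a window fires once the watermark passes its end time, provided the window contains data.

```sql
INSERT INTO category_sales_sink
SELECT
  TUMBLE_START(o.order_time_ts, INTERVAL '1' MINUTE) AS window_start,
  p.category,
  SUM(o.quantity * p.price) AS total_sales
FROM orders_source AS o
-- Lookup Join syntax keeps the event-time attribute intact;
-- assumes orders_source declares proc_time AS PROCTIME()
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
GROUP BY TUMBLE(o.order_time_ts, INTERVAL '1' MINUTE), p.category;
```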

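3.2 Sliding Window (HOP): rolling statistics over the last 5 minutes

Characteristics: windows overlap, so one record may belong to several windows.

Trigger: each window fires once the watermark passes its end time, provided the window contains data.

```sql
INSERT INTO category_sales_sink
SELECT
  -- HOP(time_attr, slide, size): a 5-minute window that advances every minute
  HOP_START(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_start,
  p.category,
  SUM(o.quantity * p.price) AS total_sales
FROM orders_source AS o
-- Lookup Join syntax; assumes orders_source declares proc_time AS PROCTIME()
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
GROUP BY HOP(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE), p.category;
```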
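To see the overlap directly, the HOP windowing table-valued function can list every window a row is assigned to. This is only a sketch: it reuses `orders_source` from section 2.1 and assumes the input is append-only, because window operators in Flink 1.17 do not accept update or delete changes. With a 1-minute slide and a 5-minute size, each order should show up in five windows.

```sql
-- One output row per (order, window) pair.
SELECT
  order_id,
  order_time_ts,
  window_start,
  window_end
FROM TABLE(
  HOP(TABLE orders_source, DESCRIPTOR(order_time_ts), INTERVAL '1' MINUTE, INTERVAL '5' MINUTE)
);
```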

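3.3 Cumulative Window (CUMULATE): running total for today

Characteristics: accumulates from 00:00 of the current day until the day ends, emitting an updated running total for each one-minute step.

Trigger: once the first record has opened the day's window, a result is emitted each time the watermark passes a one-minute step boundary. Because the watermark only advances with incoming data, output pauses while the source is idle.

```sql
-- CUMULATE is only available as a windowing table-valued function (TVF),
-- so the enriched stream is defined as a view first.
CREATE TEMPORARY VIEW orders_enriched_today AS
SELECT o.order_time_ts, p.category, o.quantity * p.price AS amount
FROM orders_source AS o
-- Lookup Join syntax; assumes orders_source declares proc_time AS PROCTIME()
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
WHERE o.order_time_ts >= TIMESTAMP '2026-07-02 00:00:00';

INSERT INTO category_sales_sink
SELECT window_start, category, SUM(amount) AS total_sales
FROM TABLE(
  CUMULATE(TABLE orders_enriched_today, DESCRIPTOR(order_time_ts), INTERVAL '1' MINUTE, INTERVAL '1' DAY)
)
GROUP BY window_start, window_end, category;
```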

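3.4 Session Window (SESSION): grouping by activity gaps

Characteristics: a session stays open as long as records keep arriving less than 5 minutes apart; a gap longer than 5 minutes closes it.

Trigger: the session fires once the watermark passes the last record's timestamp plus the 5-minute gap (event-driven), so continuous traffic keeps postponing the result.

```sql
SELECT
  SESSION_START(o.order_time_ts, INTERVAL '5' MINUTE) AS session_start,
  SESSION_END(o.order_time_ts, INTERVAL '5' MINUTE) AS session_end,
  p.category,
  COUNT(*) AS order_count,
  SUM(o.quantity * p.price) AS total_sales
FROM orders_source AS o
-- Lookup Join syntax; assumes orders_source declares proc_time AS PROCTIME()
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id
GROUP BY SESSION(o.order_time_ts, INTERVAL '5' MINUTE), p.category;
```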

4. Grouped TopN: a Real-Time Leaderboard

Scenario: for each category, the Top 2 products by sales amount over the past 5 minutes.

```sql
INSERT INTO top_category_products_sink
SELECT
  window_start,
  window_end,
  category,
  product_name,
  total_sales,
  rank_num
FROM (
  SELECT
    window_start,
    window_end,
    category,
    product_name,
    total_sales,
    -- Rank products inside each (window, category) group
    ROW_NUMBER() OVER (
      PARTITION BY window_start, window_end, category
      ORDER BY total_sales DESC
    ) AS rank_num
  FROM (
    SELECT
      HOP_START(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_start,
      HOP_END(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE) AS window_end,
      p.category,
      p.product_name,
      SUM(o.quantity * p.price) AS total_sales
    FROM orders_source AS o
    -- Lookup Join syntax; assumes orders_source declares proc_time AS PROCTIME()
    JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
      ON o.product_id = p.product_id
    GROUP BY HOP(o.order_time_ts, INTERVAL '1' MINUTE, INTERVAL '5' MINUTE), p.category, p.product_name
  ) t1
  WHERE total_sales > 0
) t2
WHERE rank_num <= 2;
```
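The query writes to `top_category_products_sink`, which is not defined elsewhere in this article. A possible definition is sketched below; the MySQL table name `top_category_products` and the primary key are assumptions, and the column types follow what the query produces (`SUM` over a decimal yields `DECIMAL(38, 2)`, and `ROW_NUMBER()` yields `BIGINT`).

```sql
-- Hypothetical sink for the Top 2 ranking; names and key are assumptions.
CREATE TABLE top_category_products_sink (
  window_start TIMESTAMP(3),
  window_end   TIMESTAMP(3),
  category     STRING,
  product_name STRING,
  total_sales  DECIMAL(38, 2),
  rank_num     BIGINT,
  -- One row per window, category, and rank position; later updates overwrite it.
  PRIMARY KEY (window_start, window_end, category, rank_num) NOT ENFORCED
) WITH (
  'connector' = 'jdbc',
  'url' = 'jdbc:mysql://hadoop102:3306/retail_db',
  'table-name' = 'top_category_products',  -- assumed MySQL table name
  'username' = 'root',
  'password' = '123456'
);
```

Keying on the rank position rather than the product name means that when a product drops out of the Top 2, its slot is simply overwritten by the product that replaces it.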

⚠️ The QUALIFY syntax requires Flink 1.18+; on 1.17, use the subquery form shown here.


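5. The Four JOIN Types in Practice

5.1 Regular Join (Stream-to-Stream Join)

Scenario: join two streams in real time on an arbitrary condition (e.g., the order stream JOIN the payment stream).

Characteristics

  • The join itself does not rely on watermarks: a match is emitted as soon as a row arrives on either side
  • State keeps all historical rows from both inputs (so a TTL must be set)
  • Supports updates and deletes from CDC streams (-U / +U / -D)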

Test data example:

```sql
-- Static test data: each order matches the payment with the same order_id
WITH orders AS (
  SELECT 1 AS order_id, 1 AS product_id, 2 AS quantity, TIMESTAMP '2026-07-02 22:30:00' AS order_time
  UNION ALL
  SELECT 2, 1, 1, TIMESTAMP '2026-07-02 22:32:00'
),
payments AS (
  SELECT 1 AS payment_id, 1 AS order_id, 6999.00 AS amount, TIMESTAMP '2026-07-02 22:31:00' AS pay_time
  UNION ALL
  SELECT 2, 2, 12999.00, TIMESTAMP '2026-07-02 22:33:00'
)
SELECT
  o.order_id,
  o.quantity,
  p.payment_id,
  p.amount AS pay_amount
FROM orders o
JOIN payments p ON o.order_id = p.order_id;
```

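Production notes:

```sql
-- A state TTL is a must; otherwise state grows without bound
SET 'table.exec.state.ttl' = '1h';

-- Suited to real-time joins on arbitrary conditions.
-- Rows that stay idle longer than the TTL are dropped from state,
-- so a match arriving after that is lost.
SELECT ...
FROM orders_source o
JOIN payments_source p
ON o.order_id = p.order_id;
```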

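5.2 Lookup Join (Dimension Table Join)

Scenario: enrich a stream with an external, static dimension table (e.g., the MySQL products table).

Characteristics

  • The dimension table is looked up by the join key, typically its primary key
  • Does not block the watermark: the dimension side is queried on demand and contributes no watermark
  • Suited to static or slowly changing dimension data

```sql
SELECT
  o.order_id,
  p.product_name,
  p.category,
  o.quantity * p.price AS total_amount
FROM orders_source AS o
-- FOR SYSTEM_TIME AS OF <processing-time attribute> is what makes this a Lookup Join;
-- without it the planner treats the query as a Regular Join.
-- Assumes orders_source declares proc_time AS PROCTIME().
JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id;
```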
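An inner Lookup Join silently drops orders whose product is missing from the dimension table. Where those orders should be kept, a LEFT JOIN variant can be used. The sketch below assumes `orders_source` declares a processing-time column named `proc_time` (`proc_time AS PROCTIME()`); the `'UNKNOWN'` placeholder is an arbitrary choice.

```sql
-- Keep every order; dimension columns are NULL when the lookup finds nothing.
SELECT
  o.order_id,
  COALESCE(p.product_name, 'UNKNOWN') AS product_name,
  p.category,
  o.quantity * p.price AS total_amount  -- NULL when the product is missing
FROM orders_source AS o
LEFT JOIN products_dim FOR SYSTEM_TIME AS OF o.proc_time AS p
  ON o.product_id = p.product_id;
```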

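5.3 Interval Join (Time-Range Join)

Scenario: join two streams within a bounded time range (e.g., matching an order with a payment made within 5 minutes).

Characteristics

  • Requires append-only inputs (CDC streams are not supported)
  • State is bounded (only rows inside the interval are kept)
  • Suited to joining two log streams

```sql
-- Static test data: the literal timestamps are not time attributes,
-- so this query only illustrates the shape of the join predicate.
-- Pairs at most 5 minutes apart: (1,2), (1,3), (2,3), (3,4)
WITH orders_test AS (
  SELECT 1 AS order_id, TIMESTAMP '2026-07-02 22:30:00' AS order_time
  UNION ALL
  SELECT 2, TIMESTAMP '2026-07-02 22:32:00'
  UNION ALL
  SELECT 3, TIMESTAMP '2026-07-02 22:33:00'
  UNION ALL
  SELECT 4, TIMESTAMP '2026-07-02 22:38:00'
)
SELECT
  o1.order_id AS left_order_id,
  o2.order_id AS right_order_id,
  o1.order_time AS left_time,
  o2.order_time AS right_time
FROM orders_test o1
JOIN orders_test o2
ON o1.order_id < o2.order_id
AND o1.order_time BETWEEN o2.order_time - INTERVAL '5' MINUTE
                      AND o2.order_time + INTERVAL '5' MINUTE;
```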
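For comparison, an interval join on real streams needs an event-time attribute with a watermark on both inputs. The sketch below uses the built-in `datagen` connector so it can be tried without Kafka; the table names `orders_log` and `payments_log`, the id range, and the row rate are all made up for illustration.

```sql
-- Two append-only demo streams generated in-process.
CREATE TEMPORARY TABLE orders_log (
  order_id INT,
  order_time AS LOCALTIMESTAMP,
  WATERMARK FOR order_time AS order_time - INTERVAL '1' SECOND
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5',
  'fields.order_id.min' = '1',
  'fields.order_id.max' = '20'
);

CREATE TEMPORARY TABLE payments_log (
  order_id INT,
  pay_time AS LOCALTIMESTAMP,
  WATERMARK FOR pay_time AS pay_time - INTERVAL '1' SECOND
) WITH (
  'connector' = 'datagen',
  'rows-per-second' = '5',
  'fields.order_id.min' = '1',
  'fields.order_id.max' = '20'
);

-- Match each order with payments made within 5 minutes after it.
SELECT o.order_id, o.order_time, p.pay_time
FROM orders_log AS o
JOIN payments_log AS p
  ON o.order_id = p.order_id
 AND p.pay_time BETWEEN o.order_time AND o.order_time + INTERVAL '5' MINUTE;
```

Because the time bound sits on two time attributes, Flink can discard rows from state once the watermark shows they can no longer match.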

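5.4 Temporal Join (Versioned Table Join)

Scenario: look up the product price that was valid when an order was placed (a historical version).

Characteristics

  • The versioned table must have a primary key and a time attribute
  • State holds the row versions still needed for lookups; in an event-time join, older versions are dropped as the watermark advances
  • Suited to continuously updated CDC streams

```sql
-- Static test data that emulates a versioned lookup with plain SQL:
-- each order picks the latest price whose valid_from is not after order_time.
-- Expected prices: order 1 gets 6999.00, order 2 gets 7999.00.
WITH product_price_history AS (
  SELECT 1 AS product_id, 6999.00 AS price, TIMESTAMP '2026-07-01 00:00:00' AS valid_from
  UNION ALL
  SELECT 1, 7999.00, TIMESTAMP '2026-07-02 00:00:00'
  UNION ALL
  SELECT 2, 12999.00, TIMESTAMP '2026-07-01 00:00:00'
  UNION ALL
  SELECT 2, 13999.00, TIMESTAMP '2026-07-02 00:00:00'
),
orders AS (
  SELECT 1 AS order_id, 1 AS product_id, TIMESTAMP '2026-07-01 12:00:00' AS order_time
  UNION ALL
  SELECT 2, 1, TIMESTAMP '2026-07-02 14:00:00'
)
SELECT
  o.order_id,
  o.product_id,
  o.order_time,
  p.price AS price_at_order_time
FROM orders o
JOIN product_price_history p
ON o.product_id = p.product_id
AND p.valid_from <= o.order_time
WHERE p.valid_from = (
  SELECT MAX(valid_from)
  FROM product_price_history p2
  WHERE p2.product_id = p.product_id
  AND p2.valid_from <= o.order_time
);
```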
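On real streams, the same lookup is written with `FOR SYSTEM_TIME AS OF` on an event-time attribute against a versioned table. The following is a hedged sketch rather than the setup used in this article: the table `product_price_versions` and the topic `retail_db.retail_db.products` are assumptions, and the version timestamp is taken from Debezium's source timestamp metadata.

```sql
-- Hypothetical versioned table built from a Debezium changelog of the products table.
CREATE TABLE product_price_versions (
  product_id INT,
  price DECIMAL(10, 2),
  -- Time at which the change happened in MySQL, exposed by the debezium-json format
  update_time TIMESTAMP_LTZ(3) METADATA FROM 'value.source.timestamp' VIRTUAL,
  PRIMARY KEY (product_id) NOT ENFORCED,
  WATERMARK FOR update_time AS update_time
) WITH (
  'connector' = 'kafka',
  'topic' = 'retail_db.retail_db.products',  -- assumed topic name
  'properties.bootstrap.servers' = 'hadoop102:9092,hadoop103:9092,hadoop104:9092',
  'scan.startup.mode' = 'earliest-offset',
  'value.format' = 'debezium-json'
);

-- Each order is joined with the price version valid at its event time.
SELECT
  o.order_id,
  o.product_id,
  o.order_time_ts,
  v.price AS price_at_order_time
FROM orders_source AS o
JOIN product_price_versions FOR SYSTEM_TIME AS OF o.order_time_ts AS v
  ON o.product_id = v.product_id;
```

Both time attributes are declared as `TIMESTAMP_LTZ(3)` here so that their types match, and results are emitted only as the watermarks of both sides advance.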

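6. Comparison of the Four JOIN Types

| JOIN type | Use case | State size | Watermark requirement | CDC stream support | Trigger |
| --- | --- | --- | --- | --- | --- |
| Regular Join | Two streams, arbitrary join condition | Large (full history) | None for the join itself | ✅ Supported | Data-driven |
| Lookup Join | Stream + external dimension table | Bounded (cache) | None; the stream side needs a processing-time attribute | ✅ Supported | Lookup-driven |
| Interval Join | Two streams within a time range | Bounded (rows inside the interval) | Required on both streams | ❌ Not supported | Time-driven |
| Temporal Join | Join against historical versions of a dimension table | Bounded (needed versions) | Required on both sides (event time) | ✅ Supported | Watermark-driven |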

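7. Pitfalls

Pitfall 1: A two-stream JOIN blocks the watermark

Symptom: windows never fire.

Cause: a two-input operator forwards the minimum of its inputs' watermarks. If one of the two joined tables has no watermark, that minimum stays at Long.MIN_VALUE, so downstream event-time windows never close.

Solution: turn the dimension table into a Lookup Join, or give the dimension table a time attribute and a watermark as well.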

Pitfall 2: Time zone mismatch

Symptom: times in Flink and MySQL differ by 8 hours.

Solution:

```sql
order_time_ts AS TO_TIMESTAMP_LTZ(
  (UNIX_TIMESTAMP(order_time, 'yyyy-MM-dd''T''HH:mm:ss''Z''') + 28800) * 1000,
  3
)
```
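To confirm that the conversion produces the expected local time, the raw string, the converted value, and the current watermark can be inspected side by side. A small sketch, assuming `orders_source` from section 2.1 and the `Asia/Shanghai` session time zone:

```sql
SELECT
  order_id,
  order_time,       -- raw UTC string from Debezium
  order_time_ts,    -- converted value, displayed in the session time zone
  CURRENT_WATERMARK(order_time_ts) AS current_wm  -- NULL until a watermark has been emitted
FROM orders_source;
```

If `order_time_ts` matches the time shown in MySQL, the offset is correct. A `current_wm` that stays NULL or stops advancing usually explains windows that never fire.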

Pitfall 3: QUALIFY is not available on Flink 1.17

Solution: use a subquery with WHERE rn <= N instead.

Pitfall 4: Interval Join does not support CDC streams

Solution: use a Temporal Join or Lookup Join instead, or read the data as plain JSON to simulate an append-only stream.
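One way to simulate an append-only stream is to read the same Debezium topic with the plain `json` format and keep only insert events. The sketch below rests on several assumptions: the messages carry top-level `before`, `after`, and `op` fields (no `schema`/`payload` wrapper), the table name `orders_append` and the consumer group id are made up, and the topic contains no delete events (a NULL `after` would leave the time attribute NULL).

```sql
CREATE TABLE orders_append (
  -- Row image after the change, as emitted by Debezium
  `after` ROW<order_id INT, product_id INT, quantity INT, order_time STRING, status STRING>,
  `op` STRING,
  order_time_ts AS TO_TIMESTAMP_LTZ(
    (UNIX_TIMESTAMP(`after`.order_time, 'yyyy-MM-dd''T''HH:mm:ss''Z''') + 28800) * 1000,
    3
  ),
  WATERMARK FOR order_time_ts AS order_time_ts - INTERVAL '1' SECOND
) WITH (
  'connector' = 'kafka',
  'topic' = 'retail_db.retail_db.orders',
  'properties.bootstrap.servers' = 'hadoop102:9092,hadoop103:9092,hadoop104:9092',
  'properties.group.id' = 'flink-orders-append-group',  -- hypothetical group id
  'scan.startup.mode' = 'earliest-offset',
  'format' = 'json',
  'json.ignore-parse-errors' = 'true'
);

-- 'c' marks newly created rows and 'r' marks snapshot reads in Debezium.
SELECT `after`.order_id, `after`.product_id, `after`.quantity, order_time_ts
FROM orders_append
WHERE `op` IN ('c', 'r');
```

Because the `json` format emits insert-only rows, the planner treats this table as append-only, which is what interval joins and window aggregations require.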

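Pitfall 5: Insufficient resources (NoResourceAvailableException)

Solution:

```bash
# List running jobs, then cancel the ones no longer needed to free task slots
./bin/flink list -r
./bin/flink cancel <JobID>

# Then, inside the SQL client (not the shell), lower the default parallelism:
#   SET 'parallelism.default' = '1';
```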