1. Environment Setup
1.1 Prepare Directories
Replace each {...} below with a real path on your machine.
```bash
# 1) File catalog store: holds catalog information
mkdir -p {catalog_store_path}
# 2) test-filesystem catalog: holds table metadata & table data
mkdir -p {catalog_path}
mkdir -p {catalog_path}/mydb
# 3) Checkpoints & savepoints
mkdir -p {checkpoints_path}
mkdir -p {savepoints_path}
```
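For reference, a concrete layout on a local machine could look like this (the /tmp paths are purely illustrative; any writable directories work):
```bash
# Hypothetical example paths, for illustration only
mkdir -p /tmp/flink-mt/catalog-store   # {catalog_store_path}
mkdir -p /tmp/flink-mt/catalog/mydb    # {catalog_path} plus its mydb database
mkdir -p /tmp/flink-mt/checkpoints     # {checkpoints_path}
mkdir -p /tmp/flink-mt/savepoints      # {savepoints_path}
```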
1.2 Prepare Resources
1.2.1 Download and Extract Flink
Flink runs on UNIX-like environments such as Linux, macOS, and Cygwin (Windows).
```bash
tar -xzf flink-*.tgz
```
1.2.2 Install the test-filesystem Connector
Copy flink-table-filesystem-test-utils-{VERSION}.jar into Flink's lib/ directory:
```bash
cp flink-table-filesystem-test-utils-{VERSION}.jar flink-*/lib/
```
1.3 Prepare Configuration (config.yaml)
Edit config.yaml and append (or verify) the following settings:
```yaml
execution:
  checkpoints:
    dir: file://{checkpoints_path}

# Configure file catalog store
table:
  catalog-store:
    kind: file
    file:
      path: {catalog_store_path}

# Configure embedded scheduler
workflow-scheduler:
  type: embedded

# Configure SQL gateway address and port
sql-gateway:
  endpoint:
    rest:
      address: 127.0.0.1
      port: 8083
```
Notes:
- table.catalog-store persists catalog properties, so catalogs can still be initialized after a restart.
- workflow-scheduler: embedded drives the periodic refresh scheduling used by FULL mode (currently only the embedded scheduler is supported).
2. Start the Components (Locally)
2.1 Start the Flink Cluster
From the extracted Flink directory, run:
```bash
./bin/start-cluster.sh
```
Open the Web UI:
http://localhost:8081
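If the UI does not come up, check that both cluster processes are alive; for a standard Flink standalone cluster these are the JobManager and TaskManager JVMs:
```bash
# Expect one StandaloneSessionClusterEntrypoint (JobManager)
# and at least one TaskManagerRunner (TaskManager)
jps | grep -E 'StandaloneSessionClusterEntrypoint|TaskManagerRunner'
```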
2.2 Start the SQL Gateway
```bash
./bin/sql-gateway.sh start
```
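To confirm the gateway is listening on the address and port configured in section 1.3, you can hit its REST info endpoint:
```bash
# Should return a small JSON payload with the product name and Flink version
curl http://127.0.0.1:8083/v1/info
```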
2.3 Start the SQL Client and Connect to the Gateway
```bash
./bin/sql-client.sh gateway --endpoint http://127.0.0.1:8083
```
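Once connected, a trivial statement confirms the client-to-gateway round trip works:
```sql
-- Should list at least the built-in default_catalog
SHOW CATALOGS;
```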
3. Create the Catalog and Source Table (test-filesystem)
3.1 Create the test-filesystem Catalog
Run the following in the SQL Client:
```sql
CREATE CATALOG mt_cat WITH (
  'type' = 'test-filesystem',
  'path' = '{catalog_path}',
  'default-database' = 'mydb'
);

USE CATALOG mt_cat;
```
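Before creating tables, you can verify the catalog switch took effect:
```sql
-- Expect 'mt_cat' and 'mydb' respectively
SHOW CURRENT CATALOG;
SHOW CURRENT DATABASE;
```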
3.2 Create the Source Table json_source and Insert Test Data
```sql
-- 1) Create the source table (json format)
CREATE TABLE json_source (
  order_id BIGINT,
  user_id BIGINT,
  user_name STRING,
  order_created_at STRING,
  payment_amount_cents BIGINT
) WITH (
  'format' = 'json',
  'source.monitor-interval' = '10s'
);

-- 2) Insert test data
INSERT INTO json_source VALUES
  (1001, 1, 'user1', '2024-06-19', 10),
  (1002, 2, 'user2', '2024-06-19', 20),
  (1003, 3, 'user3', '2024-06-19', 30),
  (1004, 4, 'user4', '2024-06-19', 40),
  (1005, 1, 'user1', '2024-06-20', 10),
  (1006, 2, 'user2', '2024-06-20', 20),
  (1007, 3, 'user3', '2024-06-20', 30),
  (1008, 4, 'user4', '2024-06-20', 40);
```
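Once the INSERT job finishes, a quick count verifies the data landed. Note that with 'source.monitor-interval' set, the source is unbounded in streaming mode, so this query keeps running; the count should reach 8:
```sql
-- Expect row_cnt to reach 8 for the test data above
SELECT COUNT(*) AS row_cnt FROM json_source;
```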
4. CONTINUOUS Mode: Continuously Refreshing the Materialized Table (Streaming)
4.1 Create a CONTINUOUS Materialized Table (Freshness = 30s)
Create a materialized table partitioned by ds with a freshness of 30 seconds (which means the streaming job's checkpoint interval ≈ 30s).
```sql
CREATE MATERIALIZED TABLE continuous_users_shops
PARTITIONED BY (ds)
WITH (
  'format' = 'debezium-json',
  'sink.rolling-policy.rollover-interval' = '10s',
  'sink.rolling-policy.check-interval' = '10s'
)
FRESHNESS = INTERVAL '30' SECOND
AS SELECT
  user_id,
  ds,
  SUM(payment_amount_cents) AS payed_buy_fee_sum,
  SUM(1) AS PV
FROM (
  SELECT user_id, order_created_at AS ds, payment_amount_cents
  FROM json_source
) AS tmp
GROUP BY user_id, ds;
```
Verification:
- Open http://localhost:8081
- You should see a streaming job running (the continuous refresh pipeline)
4.2 Suspend the CONTINUOUS Materialized Table (SUSPEND)
Set the savepoint directory before suspending:
```sql
SET 'execution.checkpointing.savepoint-dir' = 'file://{savepoints_path}';

ALTER MATERIALIZED TABLE continuous_users_shops SUSPEND;
```
Verification:
- In the Web UI, the streaming job transitions to FINISHED (stopped with a savepoint); you can also check the savepoint directory, as shown below
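If the stop-with-savepoint succeeded, a savepoint should have been written under the configured path:
```bash
# Expect one or more savepoint-* directories
ls {savepoints_path}
```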
4.3 Query the Materialized Table
```sql
SELECT * FROM continuous_users_shops;
```
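With the test data from section 3.2 (4 users × 2 days, one order per user per day), the table should converge to 8 rows with PV = 1 everywhere once a checkpoint has committed the results; a quick check:
```sql
-- Expect row_cnt = 8 and max_pv = 1 for the test data from section 3.2
SELECT COUNT(*) AS row_cnt, MAX(PV) AS max_pv FROM continuous_users_shops;
```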
4.4 Resume the CONTINUOUS Materialized Table (RESUME)
```sql
ALTER MATERIALIZED TABLE continuous_users_shops RESUME;
```
Verification:
- A new streaming job appears in the Web UI
- It attempts to restore from the savepoint path you configured (depending on implementation/configuration)
4.5 Drop the CONTINUOUS Materialized Table (DROP)
```sql
DROP MATERIALIZED TABLE continuous_users_shops;
```
Verification:
- In the Web UI, the corresponding job enters CANCELED (or the equivalent terminal state)
5. FULL Mode: Periodically Refreshing the Materialized Table (Batch)
5.1 Create a FULL Materialized Table (Freshness = 1 minute)
To make testing easier, set the freshness to 1 minute and explicitly specify REFRESH_MODE = FULL.
```sql
CREATE MATERIALIZED TABLE full_users_shops
PARTITIONED BY (ds)
WITH (
  'format' = 'json',
  'partition.fields.ds.date-formatter' = 'yyyy-MM-dd'
)
FRESHNESS = INTERVAL '1' MINUTE
REFRESH_MODE = FULL
AS SELECT
  user_id,
  ds,
  SUM(payment_amount_cents) AS payed_buy_fee_sum,
  SUM(1) AS PV
FROM (
  SELECT user_id, order_created_at AS ds, payment_amount_cents
  FROM json_source
) AS tmp
GROUP BY user_id, ds;
```
Verification:
- In the Web UI (http://localhost:8081) you should see a periodic batch job, scheduled once per minute; conceptually each run behaves like the sketch below
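A minimal sketch of what each scheduled FULL refresh is conceptually equivalent to (you do not run this yourself; the embedded scheduler submits a batch job per freshness interval, and with the date-formatter option it targets only the latest partition):
```sql
-- Conceptual equivalent of one scheduled FULL refresh, for illustration only
INSERT OVERWRITE full_users_shops
SELECT
  user_id,
  ds,
  SUM(payment_amount_cents) AS payed_buy_fee_sum,
  SUM(1) AS PV
FROM (
  SELECT user_id, order_created_at AS ds, payment_amount_cents
  FROM json_source
) AS tmp
GROUP BY user_id, ds;
```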
5.2 Insert Data for Today's Partition and Watch the Automatic Refresh
Insert current-date data into the source table:
```sql
INSERT INTO json_source VALUES
  (1001, 1, 'user1', CAST(CURRENT_DATE AS STRING), 10),
  (1002, 2, 'user2', CAST(CURRENT_DATE AS STRING), 20),
  (1003, 3, 'user3', CAST(CURRENT_DATE AS STRING), 30),
  (1004, 4, 'user4', CAST(CURRENT_DATE AS STRING), 40);
```
Wait at least 1 minute, then query:
```sql
SELECT * FROM full_users_shops;
```
You should observe:
- With FULL mode, a partitioned table, and a correctly configured partition.fields.ds.date-formatter, the periodic refresh typically refreshes only the latest partition (today's ds); see the spot check below
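A spot check on the partition boundary (assuming the automatic refresh has run at least once since the insert in 5.2):
```sql
-- Only today's partition should reflect the rows inserted in section 5.2;
-- historical partitions stay as of their last refresh
SELECT * FROM full_users_shops
WHERE ds = CAST(CURRENT_DATE AS STRING);
```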
5.3 Manually Refresh a Historical Partition (REFRESH PARTITION)
Refresh the historical partition ds='2024-06-20':
```sql
ALTER MATERIALIZED TABLE full_users_shops REFRESH PARTITION (ds='2024-06-20');

SELECT * FROM full_users_shops;
```
Verification:
- A batch job for this manual refresh appears in the Web UI
5.4 Suspend and Resume the FULL Materialized Table (SUSPEND / RESUME)
While suspended, no periodic refreshes are scheduled; after resuming, periodic scheduling starts again.
```sql
ALTER MATERIALIZED TABLE full_users_shops SUSPEND;

ALTER MATERIALIZED TABLE full_users_shops RESUME;
```
5.5 Drop the FULL Materialized Table (DROP)
```sql
DROP MATERIALIZED TABLE full_users_shops;
```
Verification:
- The Web UI no longer shows new periodic refresh batch jobs for this materialized table
6. Common Checks (Look Here First If Something Fails)
6.1 SQL Gateway missing workflow-scheduler: embedded
- FULL-mode periodic refresh may not trigger or may be unavailable
6.2 table.catalog-store not configured
- After a restart, catalog properties may not be restored automatically, which affects materialized-table operations
6.3 Savepoint dir not set before suspending a CONTINUOUS table
- SUSPEND may fail or may not stop-with-savepoint as expected
6.4 FULL refresh expected per partition, but date-formatter not configured
- The refresh may fall back to overwriting the whole table, which is far more expensive
- Also note that partition.fields.ds.date-formatter takes effect only in FULL mode, and the partition field must be of type STRING