Flink Keyed State 详解之二

1. 基本概念

Operator State(算子状态)是 Flink 中另一种重要的状态类型。与 Keyed State 不同,Operator State 与特定的算子实例相关联,而不是与特定的 key 相关联。Operator State 的特点是:

  • 作用域限定在算子实例上
  • 每个并行算子实例都有其独立的状态副本
  • 状态会根据算子的并行度自动重新分布

2. 适用场景

Operator State 适用于以下场景:

  1. Kafka 消费者偏移量管理
  2. 需要维护算子级别状态的场景
  3. 广播状态模式
  4. 需要在算子实例间共享状态的场景

3. Operator State 类型

Flink 提供了多种 Operator State 类型:

3.1 ListState

ListState 是 Operator State 中最常用的状态类型,用于存储元素列表。

java 复制代码
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.SourceFunction;

/**
 * Operator ListState 使用示例
 * 模拟一个带有状态的 SourceFunction
 */
public class StatefulSourceFunction implements SourceFunction<String>, CheckpointedFunction {
    private volatile boolean isRunning = true;
    private ListState<Long> checkpointedCount;
    private long count = 0L;

    @Override
    public void run(SourceContext<String> ctx) throws Exception {
        while (isRunning && count < 1000) {
            synchronized (ctx.getCheckpointLock()) {
                ctx.collect("Element: " + count);
                count++;
                Thread.sleep(100); // 模拟处理时间
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedCount.clear();
        checkpointedCount.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Long> descriptor = new ListStateDescriptor<>(
            "count-state",
            Long.class
        );
        
        checkpointedCount = context.getOperatorStateStore().getListState(descriptor);
        
        if (context.isRestored()) {
            for (Long countValue : checkpointedCount.get()) {
                count = countValue;
            }
        }
    }
}

3.2 UnionListState

UnionListState 与 ListState 类似,但在恢复时会将所有并行实例的状态合并。

java 复制代码
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.RichSinkFunction;
import java.util.ArrayList;
import java.util.List;

/**
 * UnionListState 使用示例
 * 维护所有并行实例处理的元素列表
 */
public class UnionListStateSink extends RichSinkFunction<String> implements CheckpointedFunction {
    private List<String> processedElements;
    private ListState<String> checkpointedState;

    @Override
    public void open(Configuration parameters) throws Exception {
        processedElements = new ArrayList<>();
    }

    @Override
    public void invoke(String value, Context context) throws Exception {
        processedElements.add(value);
        System.out.println("Processed: " + value + " by subtask " + getRuntimeContext().getIndexOfThisSubtask());
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedState.clear();
        for (String element : processedElements) {
            checkpointedState.add(element);
        }
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<String> descriptor = new ListStateDescriptor<>(
            "union-list-state",
            String.class
        );

        // 获取 UnionListState
        checkpointedState = context.getOperatorStateStore().getUnionListState(descriptor);

        if (context.isRestored()) {
            processedElements.clear();
            for (String element : checkpointedState.get()) {
                processedElements.add(element);
            }
        }
    }
}

3.3 BroadcastState<K, V>

BroadcastState 用于广播状态模式,允许将状态从一个流广播到所有并行实例。

java 复制代码
import org.apache.flink.api.common.state.MapStateDescriptor;
import org.apache.flink.api.common.state.BroadcastState;
import org.apache.flink.api.common.state.ReadOnlyBroadcastState;
import org.apache.flink.streaming.api.datastream.BroadcastStream;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.co.BroadcastProcessFunction;
import org.apache.flink.util.Collector;

/**
 * BroadcastState 使用示例
 * 实现规则广播和数据处理
 */
public class BroadcastStateExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        
        // 创建规则流(需要广播)
        DataStream<Rule> rulesStream = env.fromElements(
            new Rule("rule1", "pattern1"),
            new Rule("rule2", "pattern2")
        );
        
        // 创建数据流
        DataStream<Event> dataStream = env.fromElements(
            new Event("event1", "pattern1"),
            new Event("event2", "pattern2"),
            new Event("event3", "pattern1")
        );
        
        // 创建广播状态描述符
        MapStateDescriptor<String, Rule> ruleStateDescriptor = new MapStateDescriptor<>(
            "RulesBroadcastState",
            String.class,
            Rule.class
        );
        
        // 将规则流转换为广播流
        BroadcastStream<Rule> broadcastRules = rulesStream.broadcast(ruleStateDescriptor);
        
        // 连接数据流和广播流进行处理
        DataStream<String> output = dataStream
            .connect(broadcastRules)
            .process(new RuleBroadcastProcessFunction());
            
        output.print();
        
        env.execute("Broadcast State Example");
    }
    
    /**
     * 广播状态处理函数
     */
    public static class RuleBroadcastProcessFunction extends BroadcastProcessFunction<Event, Rule, String> {
        private final MapStateDescriptor<String, Rule> ruleStateDescriptor;
        
        public RuleBroadcastProcessFunction() {
            this.ruleStateDescriptor = new MapStateDescriptor<>(
                "RulesBroadcastState",
                String.class,
                Rule.class
            );
        }
        
        // 处理广播的规则元素
        @Override
        public void processBroadcastElement(Rule rule, Context ctx, Collector<String> out) throws Exception {
            BroadcastState<String, Rule> broadcastState = ctx.getBroadcastState(ruleStateDescriptor);
            // 将规则存储到广播状态中
            broadcastState.put(rule.getName(), rule);
        }
        
        // 处理数据流元素
        @Override
        public void processElement(Event event, ReadOnlyContext ctx, Collector<String> out) throws Exception {
            // 从广播状态中获取规则
            ReadOnlyBroadcastState<String, Rule> broadcastState = ctx.getBroadcastState(ruleStateDescriptor);
            
            // 遍历所有规则并匹配事件
            for (Map.Entry<String, Rule> entry : broadcastState.immutableEntries()) {
                Rule rule = entry.getValue();
                if (rule.getPattern().equals(event.getPattern())) {
                    out.collect("Event " + event.getName() + " matches rule " + rule.getName());
                }
            }
        }
    }
    
    // 规则类
    public static class Rule {
        private String name;
        private String pattern;
        
        public Rule() {}
        
        public Rule(String name, String pattern) {
            this.name = name;
            this.pattern = pattern;
        }
        
        public String getName() { return name; }
        public String getPattern() { return pattern; }
    }
    
    // 事件类
    public static class Event {
        private String name;
        private String pattern;
        
        public Event() {}
        
        public Event(String name, String pattern) {
            this.name = name;
            this.pattern = pattern;
        }
        
        public String getName() { return name; }
        public String getPattern() { return pattern; }
    }
}

4. 配置方法

4.1 状态描述符

Operator State 使用相应的描述符:

java 复制代码
// ListStateDescriptor
ListStateDescriptor<String> listDescriptor = new ListStateDescriptor<>(
    "operator-list-state",
    String.class
);

// BroadcastStateDescriptor(通过 MapStateDescriptor 创建)
MapStateDescriptor<String, String> broadcastDescriptor = new MapStateDescriptor<>(
    "broadcast-state",
    String.class,
    String.class
);

4.2 状态后端配置

Operator State 的存储依赖于状态后端:

java 复制代码
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.runtime.state.hashmap.HashMapStateBackend;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// 配置 HashMap 状态后端(默认)
StreamExecutionEnvironment env1 = StreamExecutionEnvironment.getExecutionEnvironment();
env1.setStateBackend(new HashMapStateBackend());

// 配置 RocksDB 状态后端(适用于大状态)
StreamExecutionEnvironment env2 = StreamExecutionEnvironment.getExecutionEnvironment();
env2.setStateBackend(new EmbeddedRocksDBStateBackend());

5. CheckpointedFunction 接口

实现 Operator State 的主要方式是实现 CheckpointedFunction 接口:

java 复制代码
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;

public class MyCheckpointedFunction implements CheckpointedFunction {
    private ListState<Integer> checkpointedState;
    private int count = 0;

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        // 快照状态时调用
        checkpointedState.clear();
        checkpointedState.add(count);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        // 初始化状态时调用
        ListStateDescriptor<Integer> descriptor = new ListStateDescriptor<>(
            "checkpointed-count",
            Integer.class
        );
        
        checkpointedState = context.getOperatorStateStore().getListState(descriptor);
        
        if (context.isRestored()) {
            // 从检查点恢复状态
            for (Integer countValue : checkpointedState.get()) {
                count = countValue;
            }
        }
    }
}

6. 完整使用示例

java 复制代码
import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

/**
 * Operator State 完整使用示例
 * 实现一个带有状态的并行数据源
 */
public class OperatorStateExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.enableCheckpointing(5000); // 每5秒进行一次检查点
        
        DataStream<Integer> numbers = env.addSource(new ParallelNumberSource(100));
        numbers.print();
        
        env.execute("Operator State Example");
    }
    
    /**
     * 并行数字源函数
     * 每个并行实例维护自己的状态
     */
    public static class ParallelNumberSource extends RichParallelSourceFunction<Integer> implements CheckpointedFunction {
        private final int maxCount;
        private volatile boolean isRunning = true;
        private int currentCount = 0;
        private ListState<Integer> checkpointedCount;
        private Random random = new Random();
        
        public ParallelNumberSource(int maxCount) {
            this.maxCount = maxCount;
        }
        
        @Override
        public void run(SourceContext<Integer> ctx) throws Exception {
            while (isRunning && currentCount < maxCount) {
                synchronized (ctx.getCheckpointLock()) {
                    ctx.collect(currentCount);
                    currentCount++;
                    Thread.sleep(100 + random.nextInt(200)); // 随机延迟
                }
            }
        }
        
        @Override
        public void cancel() {
            isRunning = false;
        }
        
        @Override
        public void snapshotState(FunctionSnapshotContext context) throws Exception {
            checkpointedCount.clear();
            checkpointedCount.add(currentCount);
        }
        
        @Override
        public void initializeState(FunctionInitializationContext context) throws Exception {
            ListStateDescriptor<Integer> descriptor = new ListStateDescriptor<>(
                "number-source-count",
                Integer.class
            );
            
            checkpointedCount = context.getOperatorStateStore().getListState(descriptor);
            
            if (context.isRestored()) {
                // 从检查点恢复状态
                for (Integer count : checkpointedCount.get()) {
                    currentCount = count;
                }
                System.out.println("Subtask " + getRuntimeContext().getIndexOfThisSubtask() + 
                    " restored count to " + currentCount);
            } else {
                System.out.println("Subtask " + getRuntimeContext().getIndexOfThisSubtask() + 
                    " starting with count " + currentCount);
            }
        }
    }
}

7. 最佳实践建议

  1. 合理选择状态类型

    • 一般状态管理使用 ListState
    • 需要合并状态时使用 UnionListState
    • 广播场景使用 BroadcastState
  2. 状态命名规范

    • 使用有意义的名称
    • 避免重复名称
    • 建议使用小写字母和连字符
  3. 状态管理

    • 在 snapshotState 中正确保存状态
    • 在 initializeState 中正确恢复状态
    • 处理并行度变化时的状态重分布
  4. 性能优化

    • 避免在状态中存储大量数据
    • 合理设置检查点间隔
    • 考虑使用 RocksDB 状态后端处理大状态
  5. 容错处理

    • 确保状态操作的幂等性
    • 处理状态恢复时的异常情况
    • 定期检查点以保证状态一致性
  6. 广播状态注意事项

    • 广播状态只允许在 processBroadcastElement 中修改
    • 在 processElement 中只能读取广播状态
    • 注意广播状态的内存使用

通过合理使用 Operator State,可以有效地在 Flink 应用程序中维护和处理与特定算子实例相关联的状态信息,实现复杂的状态管理和计算逻辑。

相关推荐
夏洛克信徒23 分钟前
从 “工具” 到 “代理”:Gemini 3.0 重构 AI 能力边界,开启智能协作新纪元
大数据·人工智能·神经网络
Caitlin_lee_1 小时前
SCAU数据科学与大数据技术导论期末整理(1-3章)
大数据
weixin_456904274 小时前
Git大文件管理与版本回退
大数据·git·elasticsearch
TDengine (老段)5 小时前
TDengine 转换函数 CAST 用户手册
java·大数据·数据库·物联网·时序数据库·tdengine·涛思数据
成长之路5145 小时前
【实证分析】地级市人口集聚度数据集-含代码(2000-2024年)
大数据
专注数据的痴汉5 小时前
「数据获取」《中国商务年鉴》(2004-2024)
大数据·人工智能·信息可视化
一只小青团6 小时前
Hadoop之javaAPI写HDFS的shell命令
大数据·hadoop·分布式
天下无敌笨笨熊6 小时前
ES作为向量库研究
大数据·python·elasticsearch