13. Flink Operator State in Detail

1. Operator State

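Operator state (also called non-keyed state) is state bound to a single parallel operator instance. For example, each parallel instance of the Kafka consumer maintains a map of topic partitions and offsets as its operator state.

When the parallelism changes, operator state can be redistributed among the parallel operator instances. Several schemes exist for performing this redistribution.

Operator state is a special type of state. It is used to implement sources and sinks, and in scenarios where there is no key by which the state could be partitioned.

Note: the Python DataStream API does not yet support operator state.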

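2. Using Operator State

A user function can use operator state by implementing the CheckpointedFunction interface.

a) CheckpointedFunction overview

The CheckpointedFunction interface provides access to non-keyed state. It requires the following two methods to be implemented: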


void snapshotState(FunctionSnapshotContext context) throws Exception;

void initializeState(FunctionInitializationContext context) throws Exception;

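snapshotState() is called whenever a checkpoint is taken. initializeState() is called whenever the user-defined function is initialized, both when the function is initialized for the first time and when it is restored from an earlier checkpoint. initializeState() is therefore not only the place where the different types of state are initialized, but also the place where the state recovery logic belongs.

Currently, operator state exists in list form: a List of serializable objects that are independent of each other, which makes them easy to redistribute when the parallelism changes. Depending on how the state is accessed, the following redistribution schemes are available:

Even-split redistribution: each operator instance holds a list of state elements, and the whole state is the concatenation of all these lists. When the job is restored or the state is redistributed, the whole list is divided evenly among the parallel operator instances. For example, if operator A has parallelism 1 and its state contains element1 and element2, then after the parallelism is increased to 2, element1 ends up on parallel instance 0 and element2 on parallel instance 1.
Union redistribution: each operator instance holds a list of state elements, and the whole state is the concatenation of all these lists. When the job is restored or the state is redistributed, every operator instance receives the complete list of state elements. Not recommended, particularly when the list may be large, since every instance then receives the full state.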
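The difference between the two schemes can be sketched in plain Java, without any Flink dependency. This is only an illustration, not Flink's internal code: the class RedistributionSketch and its methods are hypothetical names, and cutting the concatenated list into contiguous chunks is an assumption chosen so that the result matches the element1/element2 example above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java simulation of the two redistribution schemes (hypothetical helper, not Flink code). */
final class RedistributionSketch {

    /** Even-split: concatenate all old lists, then cut the result into contiguous, near-equal chunks. */
    static <T> List<List<T>> evenSplit(List<List<T>> oldSubtaskStates, int newParallelism) {
        List<T> all = concat(oldSubtaskStates);
        int base = all.size() / newParallelism;
        int remainder = all.size() % newParallelism;
        List<List<T>> result = new ArrayList<>();
        int start = 0;
        for (int subtask = 0; subtask < newParallelism; subtask++) {
            // The first `remainder` subtasks receive one extra element.
            int size = base + (subtask < remainder ? 1 : 0);
            result.add(new ArrayList<>(all.subList(start, start + size)));
            start += size;
        }
        return result;
    }

    /** Union: every new subtask receives the complete concatenated list. */
    static <T> List<List<T>> union(List<List<T>> oldSubtaskStates, int newParallelism) {
        List<T> all = concat(oldSubtaskStates);
        List<List<T>> result = new ArrayList<>();
        for (int subtask = 0; subtask < newParallelism; subtask++) {
            result.add(new ArrayList<>(all));
        }
        return result;
    }

    private static <T> List<T> concat(List<List<T>> lists) {
        List<T> all = new ArrayList<>();
        for (List<T> list : lists) {
            all.addAll(list);
        }
        return all;
    }

    public static void main(String[] args) {
        // Operator A ran with parallelism 1 and held two elements.
        List<List<String>> before =
                Collections.singletonList(Arrays.asList("element1", "element2"));
        System.out.println(evenSplit(before, 2)); // [[element1], [element2]]
        System.out.println(union(before, 2));     // [[element1, element2], [element1, element2]]
    }
}
```

With union redistribution the total amount of restored state grows with the parallelism, which is one way to see why it is discouraged for large lists.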

b) SinkFunction with buffering

Example: a SinkFunction that implements CheckpointedFunction to buffer elements and then send them downstream in one batch. It demonstrates even-split redistribution of list state.


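import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeinfo.TypeHint;
import org.apache.flink.api.common.typeinfo.TypeInformation;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;

import java.util.ArrayList;
import java.util.List;

public class BufferingSink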
        implements SinkFunction<Tuple2<String, Integer>>,
                   CheckpointedFunction {

    private final int threshold;

    private transient ListState<Tuple2<String, Integer>> checkpointedState;

    private List<Tuple2<String, Integer>> bufferedElements;

    public BufferingSink(int threshold) {
        this.threshold = threshold;
        this.bufferedElements = new ArrayList<>();
    }

    @Override
    public void invoke(Tuple2<String, Integer> value, Context context) throws Exception {
        bufferedElements.add(value);
        if (bufferedElements.size() >= threshold) {
            for (Tuple2<String, Integer> element: bufferedElements) {
                // send it to the sink
            }
            bufferedElements.clear();
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        checkpointedState.update(bufferedElements);
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        ListStateDescriptor<Tuple2<String, Integer>> descriptor =
            new ListStateDescriptor<>(
                "buffered-elements",
                TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));

        checkpointedState = context.getOperatorStateStore().getListState(descriptor);

        if (context.isRestored()) {
            for (Tuple2<String, Integer> element : checkpointedState.get()) {
                bufferedElements.add(element);
            }
        }
    }
}

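The initializeState method receives a FunctionInitializationContext, which is used to initialize the "container" for non-keyed state: a ListState in which the non-keyed state objects are stored when a checkpoint is taken. As with keyed state, the StateDescriptor contains the state name and information about the type of the state.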


ListStateDescriptor<Tuple2<String, Integer>> descriptor =
    new ListStateDescriptor<>(
        "buffered-elements",
        TypeInformation.of(new TypeHint<Tuple2<String, Integer>>() {}));

checkpointedState = context.getOperatorStateStore().getListState(descriptor);

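The method used to obtain the state object determines the redistribution scheme: getUnionListState(descriptor) uses union redistribution, while getListState(descriptor) uses even-split redistribution.

After the state object has been initialized, isRestored() indicates whether the function is being restored from an earlier snapshot, for example after a failure. If it returns true, the restore logic runs: in BufferingSink, all elements of the restored ListState are copied into the bufferedElements field. On the next snapshotState() call, checkpointedState.update(bufferedElements) replaces the previous contents of the ListState with the current buffer, so the checkpoint contains exactly the elements that have not yet been sent.

Keyed state can also be initialized in initializeState(), through the same FunctionInitializationContext.

c) Stateful Source Function

To make state updates and record emission atomic (which exactly-once semantics require), a stateful source must acquire the checkpoint lock from its SourceContext before emitting data.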
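The buffer, snapshot, and restore cycle of the buffering sink can be traced in a small plain-Java simulation. This is a sketch under stated assumptions, not Flink code: BufferingSinkSketch is a hypothetical class, a plain List stands in for the ListState, the snapshot is returned to the caller instead of being written to a state backend, and the `sent` list stands in for the downstream system.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Hypothetical stand-in for the buffering sink; plain Java, no Flink dependency. */
final class BufferingSinkSketch {
    private final int threshold;
    private final List<String> bufferedElements = new ArrayList<>();

    /** Records that were flushed downstream (stand-in for the external system). */
    final List<String> sent = new ArrayList<>();

    BufferingSinkSketch(int threshold) {
        this.threshold = threshold;
    }

    /** Buffers the record and flushes the whole buffer once the threshold is reached. */
    void invoke(String value) {
        bufferedElements.add(value);
        if (bufferedElements.size() >= threshold) {
            sent.addAll(bufferedElements);
            bufferedElements.clear();
        }
    }

    /** Mirrors snapshotState(): the checkpoint captures a copy of the not-yet-sent elements. */
    List<String> snapshotState() {
        return new ArrayList<>(bufferedElements);
    }

    /** Mirrors initializeState(): on restore, refill the buffer from the checkpointed list. */
    void initializeState(boolean isRestored, List<String> checkpointedState) {
        if (isRestored) {
            bufferedElements.addAll(checkpointedState);
        }
    }

    public static void main(String[] args) {
        BufferingSinkSketch first = new BufferingSinkSketch(3);
        first.initializeState(false, new ArrayList<>()); // fresh start, nothing to restore
        first.invoke("a");
        first.invoke("b");
        List<String> checkpoint = first.snapshotState();
        System.out.println(checkpoint); // [a, b]

        // Simulated failure: a new instance is created and restored from the checkpoint.
        BufferingSinkSketch restored = new BufferingSinkSketch(3);
        restored.initializeState(true, checkpoint);
        restored.invoke("c");
        System.out.println(restored.sent); // [a, b, c]
    }
}
```

Because the snapshot is a copy of the buffer, elements that were buffered but not yet sent at checkpoint time are not lost when the instance is replaced.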


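import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.common.typeutils.base.LongSerializer;
import org.apache.flink.runtime.state.FunctionInitializationContext;
import org.apache.flink.runtime.state.FunctionSnapshotContext;
import org.apache.flink.streaming.api.checkpoint.CheckpointedFunction;
import org.apache.flink.streaming.api.functions.source.RichParallelSourceFunction;

import java.util.Collections;

public static class CounterSource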
        extends RichParallelSourceFunction<Long>
        implements CheckpointedFunction {

    /**  current offset for exactly once semantics */
    private Long offset = 0L;

    /** flag for job cancellation */
    private volatile boolean isRunning = true;
    
    /** Holds the operator state. */
    private ListState<Long> state;
     
    @Override
    public void run(SourceContext<Long> ctx) {
        final Object lock = ctx.getCheckpointLock();

        while (isRunning) {
            // output and state update are atomic
            synchronized (lock) {
                ctx.collect(offset);
                offset += 1;
            }
        }
    }

    @Override
    public void cancel() {
        isRunning = false;
    }

    @Override
    public void initializeState(FunctionInitializationContext context) throws Exception {
        state = context.getOperatorStateStore().getListState(new ListStateDescriptor<>(
            "state",
            LongSerializer.INSTANCE));
            
        // Restore the offset from the saved state into memory; this method is also called when the job is restored.
        for (Long l : state.get()) {
            offset = l;
        }
    }

    @Override
    public void snapshotState(FunctionSnapshotContext context) throws Exception {
        state.update(Collections.singletonList(offset));
    }
}
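Why the lock matters can be shown with a plain-Java simulation in which one thread emits records while another takes snapshots. This is a sketch under stated assumptions rather than Flink code: CheckpointLockSketch and its methods are hypothetical, and an in-memory list stands in for the emitted output. Since emitting a record and advancing the offset happen inside the same synchronized block, a snapshot taken under the same lock can never observe an offset that disagrees with the number of records emitted.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical simulation of a source that emits and updates its offset under one lock. */
final class CheckpointLockSketch {
    private final Object lock = new Object();
    private long offset = 0L;                              // the "state"
    private final List<Long> emitted = new ArrayList<>();  // the "output"

    /** Emits the current offset and advances it atomically with respect to snapshots. */
    void emitNext() {
        synchronized (lock) {
            emitted.add(offset);
            offset += 1;
        }
    }

    /** Takes a snapshot under the same lock: {offset, number of emitted records}. */
    long[] snapshot() {
        synchronized (lock) {
            return new long[] {offset, emitted.size()};
        }
    }

    /** Runs a producer thread and a snapshot thread concurrently and returns all snapshots. */
    static List<long[]> run(int records, int snapshots) {
        CheckpointLockSketch source = new CheckpointLockSketch();
        List<long[]> observed = new ArrayList<>();
        Thread producer = new Thread(() -> {
            for (int i = 0; i < records; i++) {
                source.emitNext();
            }
        });
        Thread checkpointer = new Thread(() -> {
            for (int i = 0; i < snapshots; i++) {
                observed.add(source.snapshot());
            }
        });
        producer.start();
        checkpointer.start();
        try {
            producer.join();
            checkpointer.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        observed.add(source.snapshot()); // final snapshot after both threads finished
        return observed;
    }

    /** True if every snapshot saw an offset equal to the number of emitted records. */
    static boolean allConsistent(List<long[]> snapshots) {
        for (long[] s : snapshots) {
            if (s[0] != s[1]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(allConsistent(run(100_000, 1_000))); // true
    }
}
```

Without the shared lock, a snapshot could fall between the emission and the offset update, and a restore would then either repeat or skip a record.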

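An operator that needs to be notified when a checkpoint has completed successfully can implement the org.apache.flink.api.common.state.CheckpointListener interface: notifyCheckpointComplete() is called back once the checkpoint is complete.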
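One common use of this callback is to hold back side effects until the checkpoint that covers them is confirmed. The following plain-Java sketch illustrates that idea only; it is an assumption about how such a listener might be used, not something prescribed by the text above. CommitOnCompleteSketch is a hypothetical class, its snapshotState(long) and notifyCheckpointComplete(long) methods merely mirror the Flink callbacks of the same names, and the `committed` list stands in for an external system.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch: records are committed only after their checkpoint is confirmed. */
final class CommitOnCompleteSketch {
    private final List<String> current = new ArrayList<>();
    private final TreeMap<Long, List<String>> pending = new TreeMap<>();

    /** Records made visible to the outside world (stand-in for an external system). */
    final List<String> committed = new ArrayList<>();

    void write(String record) {
        current.add(record);
    }

    /** Stand-in for snapshotState(): park the records written so far under the checkpoint id. */
    void snapshotState(long checkpointId) {
        pending.put(checkpointId, new ArrayList<>(current));
        current.clear();
    }

    /** Stand-in for notifyCheckpointComplete(): commit everything up to this checkpoint id. */
    void notifyCheckpointComplete(long checkpointId) {
        // headMap(..., true) is a live view, so clearing it removes the entries from `pending`.
        Map<Long, List<String>> ready = pending.headMap(checkpointId, true);
        for (List<String> batch : ready.values()) {
            committed.addAll(batch);
        }
        ready.clear();
    }

    public static void main(String[] args) {
        CommitOnCompleteSketch sink = new CommitOnCompleteSketch();
        sink.write("a");
        sink.snapshotState(1L);
        sink.write("b");
        sink.snapshotState(2L);
        System.out.println(sink.committed); // []
        sink.notifyCheckpointComplete(2L);
        System.out.println(sink.committed); // [a, b]
    }
}
```

Committing everything up to and including the notified id is a design choice in this sketch, so that a notification for a later checkpoint also covers earlier batches that were never notified individually.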