Flink 实战：如何计算实时热门合约

本文将通过使用 Flink 框架实现 实时热门合约 需求。实际业务过程中，如何判断合约是否属于热门合约，可以从以下几个方面进行分析，比如：

交易数量：合约被调用的次数可以作为其热门程度的指标之一。
交易金额：合约处理的资金量也是评判热门程度的重要指标。
活跃用户数量：调用合约的用户数量可以反映合约的受欢迎程度。
交易频率：合约的调用频率可以反映其热门程度和使用情况。

但我们本次目的主要是关于学习 Flink API 的一些使用，以及在生产过程中，我们应该如何一步一步改进，所以本次我们主要以 交易数量 作为热门合约的评判标准。

通过本文你将学到：

如何基于 EventTime 处理，如何指定 Watermark
如何使用 Flink 灵活的 Window API
何时需要用到 State，以及如何使用
如何使用 ProcessFunction 实现 TopN 功能
如何使用 Flink DataStream API 读取 kafka 数据源
如何将计算结果 Sink 到 Kafka 存储

实战案例介绍

要实现一个 实时热门合约 的需求，我们首先拆解成以下思路：

基本需求

每隔 5 分钟输出最近一小时交易量最多的前N个合约
过滤出属于合约的交易数量

解决思路

抽取出业务时间戳，告诉 Flink 框架基于业务时间做窗口
在所有交易行为数据中，过滤出合约行为进行统计
构建滑动窗口，窗口长度为1小时，滑动距离为 5 分钟
将KeyedStream中的元素存储到ListState中，当水位线超过窗口结束时间时，排序输出
按每个窗口聚合，输出每个窗口中交易量前N名的合约

数据准备

这里我们采用已经同步好在 kafka 的真实的 链上数据 ，数据结构如下：

json 复制代码

{
    "hash":"0xf20f572847c23be6055f5373691c16b002cd573a16314ca2509c7c13805719c1",
    "blockHash":"0x7785b54d5e82bab42a0b1a3ef015ab1f0b3dce78fe188f0838993d360e26289a",
    "blockNumber":19168715,
    "from":"0xf20f572847c23be6055f5373691c16b002cd573a16314ca2509c7c13805719c1", //交易发起地址
    "to":"0xf20f572847c23be6055f5373691c16b002cd573a16314ca2509c7c13805719c1", //交易接收地址
    "value":0,
    "timestamp":1707216599,
    "transactionsType":1  //0:普通账户交易 1:合约账户交易
}

编写程序

首先获取环境

ini 复制代码

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

创建kafka数据源

我们已经将链上数据同步管道开启，在实时同步数据到 kafka。

我们先创建一个 Transactions 的 POJO 类，所有成员变量声明成public。

typescript 复制代码

/**
 * 交易行为数据结构
 */
public class Transactions {
    public String hash;
    public String blockHash;
    public BigInteger blockNumber;
    public String from;
    public String to;
    public BigInteger value;
    public long timestamp;
    public Integer transactionsType;

    public Transactions(){}

    public Transactions(String hash, String blockHash, BigInteger blockNumber, String from, String to, BigInteger value, long timestamp, Integer transactionsType) {
        this.hash = hash;
        this.blockHash = blockHash;
        this.blockNumber = blockNumber;
        this.from = from;
        this.to = to;
        this.value = value;
        this.timestamp = timestamp;
        this.transactionsType = transactionsType;
    }

    public String getHash() {
        return hash;
    }

    public void setHash(String hash) {
        this.hash = hash;
    }

    public String getBlockHash() {
        return blockHash;
    }

    public void setBlockHash(String blockHash) {
        this.blockHash = blockHash;
    }

    public BigInteger getBlockNumber() {
        return blockNumber;
    }

    public void setBlockNumber(BigInteger blockNumber) {
        this.blockNumber = blockNumber;
    }

    public String getFrom() {
        return from;
    }

    public void setFrom(String from) {
        this.from = from;
    }

    public String getTo() {
        return to;
    }

    public void setTo(String to) {
        this.to = to;
    }

    public BigInteger getValue() {
        return value;
    }

    public void setValue(BigInteger value) {
        this.value = value;
    }

    public long getTimestamp() {
        return timestamp;
    }

    public void setTimestamp(long timestamp) {
        this.timestamp = timestamp;
    }

    public Integer getTransactionsType() {
        return transactionsType;
    }

    public void setTransactionsType(Integer transactionsType) {
        this.transactionsType = transactionsType;
    }

    @Override
    public String toString() {
        return "Transactions{" +
                "hash='" + hash + ''' +
                ", blockHash='" + blockHash + ''' +
                ", blockNumber=" + blockNumber +
                ", from='" + from + ''' +
                ", to='" + to + ''' +
                ", value=" + value +
                ", timestamp=" + timestamp +
                ", transactionsType=" + transactionsType +
                '}';
    }
}

接下来我们就可以创建一个一个读取 kafka 数据的数据源。

arduino 复制代码

        Properties properties = new Properties();
        properties.setProperty("bootstrap.servers", "localhost:9092"); //kafka地址
        properties.setProperty("group.id", "ods_transactions");  //消费组
        properties.setProperty("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");  //key反序列化
        properties.setProperty("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer"); //value反序列化
        properties.setProperty("auto.offset.reset", "latest"); //消费偏移从最新开始
        DataStreamSource<String> datasource = env.addSource(new FlinkKafkaConsumer<String>("ods_transactions", new SimpleStringSchema(), properties));

下一步我们将数据源转为 Transactions 对象类型

java 复制代码

DataStream<Transactions> kafkaStream = datasource.map(new MapFunction<String, Transactions>() {
            @Override
            public Transactions map(String message) throws Exception {
                Gson gson = new Gson();
                return gson.fromJson(message, Transactions.class);
            }
        });

这就创建了一个 Transactions 类型的 DataStream。

EventTime 与 Watermark

在本案例中，我们需要统计业务时间上的每小时的合约交易量，所以要基于 EventTime 来处理。

将每条数据的业务时间就当做 Watermark，这里我们用 AscendingTimestampExtractor 来实现时间戳的抽取和 Watermark 的生成。

java 复制代码

 DataStream<Transactions> timedStream = kafkaStream
                .assignTimestampsAndWatermarks(new AscendingTimestampExtractor<Transactions>() {
                    @Override
                    public long extractAscendingTimestamp(Transactions transactions) {
                        // 原始数据单位秒，将其转成毫秒
                        return transactions.timestamp * 1000;
                    }
                });

这样我们就得到了一个带有时间标记的数据流了，后面就能做一些窗口的操作。

过滤出合约交易数据

由于原始数据中存在普通交易和合约行为的数据，但是我们只需要统计合约数据，所以先使用 FilterFunction 将合约行为数据过滤出来。

java 复制代码

  DataStream<Transactions> contractStream = timedStream
                .filter(new FilterFunction<Transactions>() {
                    @Override
                    public boolean filter(Transactions transactions) throws Exception {
                        // 过滤出只有合约交易的数据
                        return transactions.transactionsType.equals(TransactionEnum.CONTRACT_ADDRESS.getCode());
                    }
                });

窗口统计合约交易量

由于要每隔5分钟统计一次最近一小时每个合约的交易量，所以窗口大小是一小时，每隔5分钟滑动一次。即分别要统计 [09:00, 10:00), [09:05, 10:05), [09:10, 10:10)... 等窗口的合约交易量。是一个常见的滑动窗口需求（Sliding Window）。

scss 复制代码

DataStream<ContractCount> windowedStream = contractStream
                .keyBy("to")
                .timeWindow(Time.minutes(60), Time.minutes(5))
                .aggregate(new CountAgg(), new WindowResultFunction());

这里使用.keyBy("to")对合约进行分组，使用.timeWindow(Time size, Time slide)对每个合约做滑动窗口（1小时窗口，5分钟滑动一次）。使用 .aggregate(AggregateFunction af, WindowFunction wf) 做增量的聚合操作，它能使用AggregateFunction提前聚合掉数据，减少 state 的存储压力。

这里的CountAgg实现了AggregateFunction接口，功能是统计窗口中的条数，即遇到一条数据就加一。

kotlin 复制代码

 /** COUNT 统计的聚合函数实现，每出现一条记录加一 */
    public static class CountAgg implements AggregateFunction<Transactions, Long, Long> {

        @Override
        public Long createAccumulator() {
            return 0L;
        }

        @Override
        public Long add(Transactions transactions, Long acc) {
            return acc + 1;
        }

        @Override
        public Long getResult(Long acc) {
            return acc;
        }

        @Override
        public Long merge(Long acc1, Long acc2) {
            return acc1 + acc2;
        }
    }

这里实现的WindowResultFunction将主键合约地址，窗口，交易量封装成了ContractCount进行输出。

vbnet 复制代码

 /** 用于输出窗口的结果 */
    public static class WindowResultFunction implements WindowFunction<Long, ContractCount, Tuple, TimeWindow> {

        @Override
        public void apply(
                Tuple key,  // 窗口的主键，即 to 合约地址
                TimeWindow window,  // 窗口
                Iterable<Long> aggregateResult, // 聚合函数的结果，即 count 值
                Collector<ContractCount> collector  // 输出类型为 ContractCount
        ) throws Exception {
            String to = ((Tuple1<String>) key).f0;
            Long count = aggregateResult.iterator().next();
            collector.collect(ContractCount.of(to, window.getEnd(), count));
        }
    }

typescript 复制代码

/** 合约交易量(窗口操作的输出类型) */
public class ContractCount {
    public String to;     // 合约地址
    public Long windowEnd;  // 窗口结束时间戳
    public Long count; // 合约交易量

    public ContractCount(){}

    public ContractCount(String to, Long windowEnd, Long count) {
        this.to = to;
        this.windowEnd = windowEnd;
        this.count = count;
    }

    public String getTo() {
        return to;
    }

    public void setTo(String to) {
        this.to = to;
    }

    public Long getWindowEnd() {
        return windowEnd;
    }

    public void setWindowEnd(Long windowEnd) {
        this.windowEnd = windowEnd;
    }

    public Long getCount() {
        return count;
    }

    public void setCount(Long count) {
        this.count = count;
    }

    @Override
    public String toString() {
        return "ContractCount{" +
                "to='" + to + ''' +
                ", windowEnd=" + windowEnd +
                ", count=" + count +
                '}';
    }
}

现在我们得到了每个合约在每个窗口的交易量的数据流。

TopN 计算最热门合约

为了统计每个窗口下最热门的合约，我们需要再次按窗口进行分组，这里根据ContractCount中的windowEnd进行keyBy()操作。然后使用 ProcessFunction 实现一个自定义的 TopN 函数 TopNHotContracts 来计算交易量排名前5名的合约，并将排名结果sink到 kafka。

arduino 复制代码

DataStream<String> topContracts = windowedStream
                .keyBy("windowEnd")
                .process(new TopNHotContracts(5));  // 求交易量前5的合约

arduino 复制代码

 topContracts.addSink(new FlinkKafkaProducer<String>(
                "hot_contract",
                new SimpleStringSchema(),
                properties
        ));

java 复制代码

/** 求某个窗口中前 N 名的热门合约，key 为窗口时间戳 */
    public static class TopNHotContracts extends KeyedProcessFunction<Tuple, ContractCount, String> {

        private final int topSize;

        public TopNHotContracts(int topSize) {
            this.topSize = topSize;
        }

        // 用于存储合约与交易量的状态，待收齐同一个窗口的数据后，再触发 TopN 计算
        private ListState<ContractCount> contractState;

        @Override
        public void open(Configuration parameters) throws Exception {
            super.open(parameters);
            // 状态的注册
            ListStateDescriptor<ContractCount> contractStateDesc = new ListStateDescriptor<>(
                    "contractState-state",
                    ContractCount.class);
            contractState = getRuntimeContext().getListState(contractStateDesc);
        }

        @Override
        public void processElement(
                ContractCount input,
                Context context,
                Collector<String> collector) throws Exception {

            // 每条数据都保存到状态中
            contractState.add(input);
            // 注册 windowEnd+1 的 EventTime Timer, 当触发时，说明收齐了属于windowEnd窗口的所有合约数据
            context.timerService().registerEventTimeTimer(input.windowEnd + 1);
        }

        @Override
        public void onTimer(
                long timestamp, OnTimerContext ctx, Collector<String> out) throws Exception {
            // 获取收到的所有合约交易两
            List<ContractCount> allContract = new ArrayList<>();
            for (ContractCount contract : contractState.get()) {
                allContract.add(contract);
            }
            // 提前清除状态中的数据，释放空间
            contractState.clear();
            // 按照交易量从大到小排序
            allContract.sort(new Comparator<ContractCount>() {
                @Override
                public int compare(ContractCount o1, ContractCount o2) {
                    return (int) (o2.count - o1.count);
                }
            });
            // 将排名信息放在 List ，sink 到kafka
            List<HotContract> list = new ArrayList<>();
            for (int i=0;i<topSize;i++) {
                ContractCount currentItem = allContract.get(i);
                HotContract hotContract = new HotContract();
                hotContract.setAddress(currentItem.to);
                hotContract.setCount(currentItem.count);
                list.add(hotContract);
            }
            out.collect(GsonUtil.toJson(list));
        }
    }

打印输出

最后一步我们将结果打印输出到控制台，并调用env.execute执行任务。

lua 复制代码

topContracts.print();
env.execute("Hot contracts Job");

总结

我们来回顾下整个计算的流程，以及转换的原理。

1）首先，我们通过读取kafka数据源，得到 datasourceStream

2）通过将数据源数据 map 转化为 POJO 对象，得到 kafkaStream

3）通过 filter 算子，过滤出属于合约的交易，得到 contractStream

4）按照合约地址进行 keyBy("to") 分区

5）然后进行开窗

6）进行聚合统计

这只是一个基本的流程，实际生产过程中，我们还要考虑多个方面，比如：

1）一些基本的配置信息从环境变量获取，变成参数传入，从配置文件读取等

2）开启 Checkpoint，配置 CheckpointingMode.EXACTLY_ONCE，保证端到端的Exactly-Once一致性

3）并行度的设置

4）Flink 业务日志的配置，进行监控业务日志

5）程序的容错性等等

6）在读取 kafka 数据时，在读取数据反序列化时就转为对象，而不是通过 map 进行转换