54、Flink 使用 CoGroup 实现 left/right Join 代码示例

1、概述

1)left join 实现

bash 复制代码
for (Tuple3<String, String, Long> leftTuple : left) {
                            boolean isJoin = false;

                            for (Tuple3<String, String, Long> rightTuple : right) {
                                if (Objects.equals(leftTuple.f2, rightTuple.f2)) {
                                    isJoin = true;
                                    collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, rightTuple.f1, leftTuple.f2, rightTuple.f2));
                                }
                            }

                            if (!isJoin) {
                                collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, "", leftTuple.f2, 0L));
                            }
                        }

2、right join 实现

bash 复制代码
                        for (Tuple3<String, String, Long> rightTuple : right) {
                            boolean isJoin = false;

                            for (Tuple3<String, String, Long> leftTuple : left) {
                                if (Objects.equals(leftTuple.f2, rightTuple.f2)) {
                                    isJoin = true;
                                    collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, rightTuple.f1, leftTuple.f2, rightTuple.f2));
                                }
                            }

                            if (!isJoin) {
                                collector.collect(new Tuple5<>(rightTuple.f0, "", rightTuple.f1, 0L, rightTuple.f2));
                            }
                        }

3、Inner Join 实现

bash 复制代码
                       // inner join
                        for (Tuple3<String, String, Long> leftTuple : left) {
                            for (Tuple3<String, String, Long> rightTuple : right) {
                                if (Objects.equals(leftTuple.f2, rightTuple.f2)) {
                                    collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, rightTuple.f1, leftTuple.f2, rightTuple.f2));
                                }
                            }
                        }

2、完整代码示例

bash 复制代码
import org.apache.flink.api.common.eventtime.SerializableTimestampAssigner;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.functions.RichCoGroupFunction;
import org.apache.flink.api.java.tuple.Tuple3;
import org.apache.flink.api.java.tuple.Tuple5;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows;
import org.apache.flink.util.Collector;

import java.time.Duration;
import java.util.Objects;

public class _05_CoGroupInnerOuterJoin {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // 测试时限制了分区数,生产中需要设置空闲数据源
        env.setParallelism(2);
        env.disableOperatorChaining();

        DataStreamSource<String> inputLeft = env.socketTextStream("localhost", 8888);

        // 事件时间需要设置水位线策略和时间戳
        SingleOutputStreamOperator<Tuple3<String, String, Long>> mapLeft = inputLeft.map(new MapFunction<String, Tuple3<String, String, Long>>() {
            @Override
            public Tuple3<String, String, Long> map(String input) throws Exception {
                String[] fields = input.split(",");
                return new Tuple3<>(fields[0], fields[1], Long.parseLong(fields[2]));
            }
        });

        SingleOutputStreamOperator<Tuple3<String, String, Long>> watermarkLeft = mapLeft.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(0))
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> input, long l) {
                        return input.f2;
                    }
                }));

        DataStreamSource<String> inputRight = env.socketTextStream("localhost", 9999);

        // 事件时间需要设置水位线策略和时间戳
        SingleOutputStreamOperator<Tuple3<String, String, Long>> mapRight = inputRight.map(new MapFunction<String, Tuple3<String, String, Long>>() {
            @Override
            public Tuple3<String, String, Long> map(String input) throws Exception {
                String[] fields = input.split(",");
                return new Tuple3<>(fields[0], fields[1], Long.parseLong(fields[2]));
            }
        });

        SingleOutputStreamOperator<Tuple3<String, String, Long>> watermarkRight = mapRight.assignTimestampsAndWatermarks(WatermarkStrategy.<Tuple3<String, String, Long>>forBoundedOutOfOrderness(Duration.ofSeconds(0))
                .withTimestampAssigner(new SerializableTimestampAssigner<Tuple3<String, String, Long>>() {
                    @Override
                    public long extractTimestamp(Tuple3<String, String, Long> input, long l) {
                        return input.f2;
                    }
                }));

        /**
         * left-join 测试数据
         *
         * left-1
         *
         * a,1,1718089200000
         * b,2,1718089200000
         * c,3,1718089200000
         *
         * right-2
         *
         * a,1,1718089200000
         * b,2,1718089200000
         * c,3,1718089200000
         *
         * left-3
         *
         * a,4,1718089202000
         * b,5,1718089202000
         * c,6,1718089202000
         *
         * right-4
         *
         * a,1,1718089202000
         * b,2,1718089202000
         * c,3,1718089202000
         *
         * left-right-5
         *
         * a,1,1718089205001
         * b,2,1718089205001
         * c,3,1718089205001
         *
         * 1> (a,1,1,1718089200000,1718089200000)
         * 1> (b,2,2,1718089200000,1718089200000)
         * 1> (c,3,3,1718089200000,1718089200000)
         * 1> (a,4,,1718089202000,0)
         * 2> (b,5,,1718089202000,0)
         * 1> (c,6,,1718089202000,0)
         */
        watermarkLeft.keyBy(e -> e.f0)
                .coGroup(watermarkRight.keyBy(e -> e.f0))
                .where(e -> e.f1)
                .equalTo(e -> e.f1)
                .window(TumblingEventTimeWindows.of(Duration.ofSeconds(5)))
                .apply(new RichCoGroupFunction<Tuple3<String, String, Long>, Tuple3<String, String, Long>, Tuple5<String, String, String, Long, Long>>() {
                    @Override
                    public void coGroup(Iterable<Tuple3<String, String, Long>> left, Iterable<Tuple3<String, String, Long>> right, Collector<Tuple5<String, String, String, Long, Long>> collector) throws Exception {
                        // left join
                        for (Tuple3<String, String, Long> leftTuple : left) {
                            boolean isJoin = false;

                            for (Tuple3<String, String, Long> rightTuple : right) {
                                if (Objects.equals(leftTuple.f2, rightTuple.f2)) {
                                    isJoin = true;
                                    collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, rightTuple.f1, leftTuple.f2, rightTuple.f2));
                                }
                            }

                            if (!isJoin) {
                                collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, "", leftTuple.f2, 0L));
                            }
                        }

                        // right join
//                        for (Tuple3<String, String, Long> rightTuple : right) {
//                            boolean isJoin = false;
//
//                            for (Tuple3<String, String, Long> leftTuple : left) {
//                                if (Objects.equals(leftTuple.f2, rightTuple.f2)) {
//                                    isJoin = true;
//                                    collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, rightTuple.f1, leftTuple.f2, rightTuple.f2));
//                                }
//                            }
//
//                            if (!isJoin) {
//                                collector.collect(new Tuple5<>(rightTuple.f0, "", rightTuple.f1, 0L, rightTuple.f2));
//                            }
//                        }

//                        // inner join
//                        for (Tuple3<String, String, Long> leftTuple : left) {
//                            for (Tuple3<String, String, Long> rightTuple : right) {
//                                if (Objects.equals(leftTuple.f2, rightTuple.f2)) {
//                                    collector.collect(new Tuple5<>(leftTuple.f0, leftTuple.f1, rightTuple.f1, leftTuple.f2, rightTuple.f2));
//                                }
//                            }
//                        }
                    }
                }).print();

        env.execute();
    }
}

3、测试用例

bash 复制代码
left-join 测试数据
         
left-1

a,1,1718089200000
b,2,1718089200000
c,3,1718089200000

right-2

a,1,1718089200000
b,2,1718089200000
c,3,1718089200000

left-3

a,4,1718089202000
b,5,1718089202000
c,6,1718089202000

right-4

a,1,1718089202000
b,2,1718089202000
c,3,1718089202000

left-right-5

a,1,1718089205001
b,2,1718089205001
c,3,1718089205001

1> (a,1,1,1718089200000,1718089200000)
1> (b,2,2,1718089200000,1718089200000)
1> (c,3,3,1718089200000,1718089200000)
1> (a,4,,1718089202000,0)
2> (b,5,,1718089202000,0)
1> (c,6,,1718089202000,0)
相关推荐
lose and dream_1118 分钟前
【 2024!深入了解 大语言模型(LLM)微调方法(总结)】
大数据·人工智能·opencv·机器学习·语言模型·自然语言处理·架构
我非夏日29 分钟前
基于Hadoop平台的电信客服数据的处理与分析③项目开发:搭建基于Hadoop的全分布式集群---任务7:格式化并启动Hadoop集群
大数据·hadoop·分布式
2401_8576100329 分钟前
强强联合:Apache Kylin与Impala的集成之道
大数据·apache·kylin
2401_857636391 小时前
Scala中的尾递归优化:深入探索与实践
大数据·算法·scala
知识分享小能手1 小时前
从新手到高手:Scala函数式编程完全指南,Scala 访问修饰符(6)
大数据·开发语言·后端·python·数据分析·scala·函数式编程
猫猫爱吃小鱼粮1 小时前
54、Flink 测试工具测试 Flink 作业详解
flink
KompasAI1 小时前
客户服务的智能升级:Kompas.ai如何改善客户体验
大数据·人工智能·aigc·语音识别·ai写作
乐财业-财税赋能平台1 小时前
从手工到智能:乐财业鹦鹉系统引领财税管理新纪元
大数据·人工智能
东少子鹏2 小时前
Spark
大数据·分布式·spark
我非夏日2 小时前
基于Hadoop平台的电信客服数据的处理与分析④项目实现:任务16:数据采集/消费/存储
大数据·hadoop·大数据技术开发