Contents
(1) Main program entry
(2) Window Join
(3) Interval Join
(4) Window CoGroup
(5) Executing the job
(1) Timestamp assignment
(2) Window size
(3) Output format
(4) Parallelism

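This code demonstrates three stream join operations in Apache Flink: the window join (Window Join), the interval join (Interval Join), and the window co-group join (CoGroup). A detailed walkthrough follows.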
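Code structure
- Package declaration: `package transformplus` defines the package the code lives in.
- Imports: the Flink classes the program uses, including the streaming API, window assigners, and time utilities.
- `WindowJoin` object: the program entry point, which contains the implementations of all three join operations.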
Scala
package transformplus
import java.lang
import org.apache.flink.api.common.functions.CoGroupFunction
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
import source.Event
/**
*
* @PROJECT_NAME: flink1.13
* @PACKAGE_NAME: transformplus
* @author: 赵嘉盟-HONOR
* @data: 2023-12-05 12:05
* @DESCRIPTION
*
*/
object WindowJoin {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
//TODO Window join (join)
val stream1 = env.fromElements(
("a", 1000L),
("b", 1000L),
("a", 2000L),
("b", 6000L)
).assignAscendingTimestamps(_._2)
val stream2 = env.fromElements(
("a", 3000L),
("b", 3000L),
("a", 4000L),
("b", 8000L)
).assignAscendingTimestamps(_._2)
stream1.join(stream2)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply((e1,e2)=>e1+"->"+e2)
.print("Join")
//TODO Interval join: correlating user behavior events (intervalJoin)
// Order event stream
val orderStream: DataStream[(String, String, Long)] = env
.fromElements(
("Mary", "order-1", 5000L),
("Alice", "order-2", 5000L),
("Bob", "order-3", 20000L),
("Alice", "order-4", 20000L),
("Cary", "order-5", 51000L)
).assignAscendingTimestamps(_._3)
// Click event stream
val pvStream: DataStream[Event] = env
.fromElements(
Event("Bob", "./cart", 2000L),
Event("Alice", "./prod?id=100", 3000L),
Event("Alice", "./prod?id=200", 3500L),
Event("Bob", "./prod?id=2", 2500L),
Event("Alice", "./prod?id=300", 36000L),
Event("Bob", "./home", 30000L),
Event("Bob", "./prod?id=1", 23000L),
Event("Bob", "./prod?id=3", 33000L)
).assignAscendingTimestamps(_.timestamp)
orderStream.keyBy(_._1)
.intervalJoin(pvStream.keyBy(_.user))
.between(Time.seconds(-5),Time.seconds(10))
.process(new ProcessJoinFunction[(String,String,Long),Event,String] {
override def processElement(in1: (String, String, Long), in2: Event, context: ProcessJoinFunction[(String, String, Long), Event, String]#Context, collector: Collector[String]): Unit = {
collector.collect(in1+"=>"+in2)
}
}).print("intervalJoin")
//TODO Window co-group join: coGroup (receives iterables)
stream1.coGroup(stream2)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new CoGroupFunction[(String,Long),(String,Long),String] {
override def coGroup(iterable: lang.Iterable[(String, Long)], iterable1: lang.Iterable[(String, Long)], collector: Collector[String]): Unit = {
collector.collect(iterable+"=>"+iterable1)
}
}).print("coGroup")
env.execute("windowJoin")
}
}
Code walkthrough
(1) Main program entry
Scala
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)
- Creates the Flink stream execution environment, `StreamExecutionEnvironment`, and sets the parallelism to 1.
(2) Window Join
Scala
val stream1 = env.fromElements(
("a", 1000L),
("b", 1000L),
("a", 2000L),
("b", 6000L)
).assignAscendingTimestamps(_._2)
val stream2 = env.fromElements(
("a", 3000L),
("b", 3000L),
("a", 4000L),
("b", 8000L)
).assignAscendingTimestamps(_._2)
stream1.join(stream2)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply((e1, e2) => e1 + "->" + e2)
.print("Join")
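- Data streams: two streams, `stream1` and `stream2`, each carrying `(String, Long)` key-value pairs.
- Timestamp assignment: `assignAscendingTimestamps` takes the second tuple field as the event timestamp.
- Window join:
  - `join` joins the two streams on their key (`_._1`).
  - `TumblingEventTimeWindows` defines 5-second tumbling windows.
  - `apply` concatenates each matched pair of events into a string, which is then printed.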
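To make the matching rule concrete, here is a minimal plain-Scala sketch (no Flink dependency) that reproduces which pairs the window join above should produce. It assumes windows aligned at offset 0 and ignores watermark timing; `WindowJoinSketch` and `windowStart` are names invented for this illustration, not part of the Flink API.

```scala
// Plain-Scala simulation of the 5-second tumbling-window join (illustrative only).
object WindowJoinSketch {
  type Rec = (String, Long)

  val windowSizeMs = 5000L

  // Start of the tumbling window containing the timestamp (assumes offset 0).
  def windowStart(ts: Long): Long = ts - (ts % windowSizeMs)

  val stream1: Seq[Rec] = Seq(("a", 1000L), ("b", 1000L), ("a", 2000L), ("b", 6000L))
  val stream2: Seq[Rec] = Seq(("a", 3000L), ("b", 3000L), ("a", 4000L), ("b", 8000L))

  // An inner join pairs elements that share both the key and the window.
  val pairs: Seq[String] = for {
    e1 <- stream1
    e2 <- stream2
    if e1._1 == e2._1 && windowStart(e1._2) == windowStart(e2._2)
  } yield s"$e1->$e2"

  def main(args: Array[String]): Unit = pairs.foreach(println)
}
```

The sketch yields six pairs: five from window [0, 5000) and one, `(b,6000)->(b,8000)`, from window [5000, 10000). Elements with the same key in different windows, such as `(b,1000)` and `(b,8000)`, are never paired. The order in which a real Flink job prints the pairs may differ.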
(3) Interval Join
Scala
val orderStream: DataStream[(String, String, Long)] = env
.fromElements(
("Mary", "order-1", 5000L),
("Alice", "order-2", 5000L),
("Bob", "order-3", 20000L),
("Alice", "order-4", 20000L),
("Cary", "order-5", 51000L)
).assignAscendingTimestamps(_._3)
val pvStream: DataStream[Event] = env
.fromElements(
Event("Bob", "./cart", 2000L),
Event("Alice", "./prod?id=100", 3000L),
Event("Alice", "./prod?id=200", 3500L),
Event("Bob", "./prod?id=2", 2500L),
Event("Alice", "./prod?id=300", 36000L),
Event("Bob", "./home", 30000L),
Event("Bob", "./prod?id=1", 23000L),
Event("Bob", "./prod?id=3", 33000L)
).assignAscendingTimestamps(_.timestamp)
orderStream.keyBy(_._1)
.intervalJoin(pvStream.keyBy(_.user))
.between(Time.seconds(-5), Time.seconds(10))
.process(new ProcessJoinFunction[(String, String, Long), Event, String] {
override def processElement(in1: (String, String, Long), in2: Event, context: ProcessJoinFunction[(String, String, Long), Event, String]#Context, collector: Collector[String]): Unit = {
collector.collect(in1 + "=>" + in2)
}
}).print("intervalJoin")
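- Data streams: two streams, `orderStream` (order events) and `pvStream` (click events).
- Timestamp assignment: each event is assigned a timestamp from its own timestamp field.
- Interval join:
  - `intervalJoin` joins the two keyed streams on their keys (`_._1` and `user`).
  - `between` defines the time interval relative to each order: from 5 seconds before to 10 seconds after the order's timestamp.
  - `process` concatenates each matched pair of events into a string, which is then printed.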
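The interval condition can be checked by hand with a small plain-Scala sketch. It assumes both bounds are inclusive (Flink's default for `between`) and ignores watermark timing and late-data handling. The `Event` case class below is a hypothetical stand-in for `source.Event`, whose definition is not shown in this post; its field names are taken from how the code above uses it.

```scala
// Plain-Scala simulation of the interval join condition (illustrative only).
object IntervalJoinSketch {
  // Hypothetical stand-in for source.Event (definition not shown in the post).
  case class Event(user: String, url: String, timestamp: Long)

  val lowerMs = -5000L // between(Time.seconds(-5), ...)
  val upperMs = 10000L // between(..., Time.seconds(10))

  val orders: Seq[(String, String, Long)] = Seq(
    ("Mary", "order-1", 5000L), ("Alice", "order-2", 5000L),
    ("Bob", "order-3", 20000L), ("Alice", "order-4", 20000L),
    ("Cary", "order-5", 51000L))

  val clicks: Seq[Event] = Seq(
    Event("Bob", "./cart", 2000L), Event("Alice", "./prod?id=100", 3000L),
    Event("Alice", "./prod?id=200", 3500L), Event("Bob", "./prod?id=2", 2500L),
    Event("Alice", "./prod?id=300", 36000L), Event("Bob", "./home", 30000L),
    Event("Bob", "./prod?id=1", 23000L), Event("Bob", "./prod?id=3", 33000L))

  // A click matches an order of the same user when
  // order.ts + lower <= click.ts <= order.ts + upper.
  val matches: Seq[(String, String)] = for {
    o <- orders
    c <- clicks
    if o._1 == c.user
    if c.timestamp >= o._3 + lowerMs && c.timestamp <= o._3 + upperMs
  } yield (o._2, c.url)

  def main(args: Array[String]): Unit = matches.foreach(println)
}
```

Under these assumptions only `order-2` (two Alice clicks at 3000 and 3500) and `order-3` (Bob's clicks at 23000 and 30000) find partners. The click at 30000 matches only because the upper bound is inclusive.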
(4) Window CoGroup
Scala
stream1.coGroup(stream2)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new CoGroupFunction[(String, Long), (String, Long), String] {
override def coGroup(iterable: lang.Iterable[(String, Long)], iterable1: lang.Iterable[(String, Long)], collector: Collector[String]): Unit = {
collector.collect(iterable + "=>" + iterable1)
}
}).print("coGroup")
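- Window co-group join:
  - `coGroup` groups the two streams by their key (`_._1`).
  - `TumblingEventTimeWindows` defines 5-second tumbling windows.
  - `apply` receives, for each key and window, the full collection of elements from each stream, concatenates the two collections into a string, and emits it.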
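The practical difference from `join` is that the co-group function is invoked once per key and window with both collections, even when one side is empty, which is what makes outer joins possible. The plain-Scala sketch below illustrates this grouping; the element `("c", 7000L)` is a hypothetical addition (not in the original data) chosen so that one group has an empty left side, and `CoGroupSketch` is a name invented here.

```scala
// Plain-Scala simulation of window coGroup grouping (illustrative only).
object CoGroupSketch {
  type Rec = (String, Long)

  val windowSizeMs = 5000L

  // Start of the tumbling window containing the timestamp (assumes offset 0).
  def windowStart(ts: Long): Long = ts - (ts % windowSizeMs)

  val stream1: Seq[Rec] = Seq(("a", 1000L), ("b", 1000L), ("a", 2000L), ("b", 6000L))
  // ("c", 7000L) is a hypothetical extra element with no partner in stream1.
  val stream2: Seq[Rec] =
    Seq(("a", 3000L), ("b", 3000L), ("a", 4000L), ("b", 8000L), ("c", 7000L))

  // Group one stream by (window start, key).
  private def group(s: Seq[Rec]): Map[(Long, String), Seq[Rec]] =
    s.groupBy(e => (windowStart(e._2), e._1))

  private val left = group(stream1)
  private val right = group(stream2)

  // One entry per (window, key) present on either side; a missing side is empty.
  val groups: Seq[((Long, String), Seq[Rec], Seq[Rec])] =
    (left.keySet ++ right.keySet).toSeq.sorted.map { k =>
      (k, left.getOrElse(k, Seq.empty), right.getOrElse(k, Seq.empty))
    }

  def main(args: Array[String]): Unit =
    groups.foreach { case (k, l, r) => println(s"$k: $l => $r") }
}
```

The sketch produces four groups. The last one, for key `c` in window [5000, 10000), has an empty left collection; a `join` would silently produce nothing for that key.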
(5) Executing the job
Scala
env.execute("windowJoin")
- Starts the Flink streaming job under the job name `windowJoin`.
Code improvements
(1) Timestamp assignment
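- `assignAscendingTimestamps` assumes that event timestamps never decrease within a stream. That does not hold for `pvStream` in this program (for example, 2500L arrives after 3500L). Where timestamps can arrive out of order, use `assignTimestampsAndWatermarks` with a bounded out-of-orderness strategy instead. In Scala the element type has to be given explicitly on `forBoundedOutOfOrderness`, since it cannot be inferred from the chained call:
Scala
stream1.assignTimestampsAndWatermarks(
  WatermarkStrategy.forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(5))
    .withTimestampAssigner((event: (String, Long), timestamp: Long) => event._2)
)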
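How large the out-of-orderness bound needs to be can be estimated with a small plain-Scala sketch. It models the watermark as the maximum timestamp seen so far, minus the bound, minus 1 ms, and assumes the worst case in which a watermark is emitted after every element; real Flink generators emit watermarks periodically, so actual behavior depends on timing. `WatermarkSketch` and `lateElements` are names invented for this illustration.

```scala
// Worst-case lateness check for a bounded-out-of-orderness watermark (illustrative only).
object WatermarkSketch {
  // Returns the timestamps that would already be at or behind the watermark
  // on arrival, if a watermark were emitted after every single element.
  def lateElements(timestamps: Seq[Long], boundMs: Long): Seq[Long] = {
    var maxTs = Long.MinValue
    var watermark = Long.MinValue
    val late = Seq.newBuilder[Long]
    for (ts <- timestamps) {
      if (ts <= watermark) late += ts
      maxTs = math.max(maxTs, ts)
      watermark = maxTs - boundMs - 1 // max timestamp - bound - 1 ms
    }
    late.result()
  }

  // Timestamps of pvStream in arrival order.
  val pvTimestamps: Seq[Long] =
    Seq(2000L, 3000L, 3500L, 2500L, 36000L, 30000L, 23000L, 33000L)

  def main(args: Array[String]): Unit = {
    println(lateElements(pvTimestamps, 0L))     // models ascending timestamps
    println(lateElements(pvTimestamps, 5000L))  // the 5-second bound used above
    println(lateElements(pvTimestamps, 13000L)) // a bound covering 36000 -> 23000
  }
}
```

With a bound of 0 (the ascending-timestamp assumption) four clicks are late in this model. With 5 seconds, the clicks at 30000 and 23000 are still late because they follow the click at 36000. A bound of 13 seconds covers all of them. The 5-second value should therefore be read as a placeholder to tune against the data, not as a safe default.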
(2) Window size
- A 5-second window will not suit every scenario; choose the window size according to the actual requirements.
(3) Output format
- The output format is rather bare and can be made more readable:
Scala
collector.collect(s"Order: ${in1._2}, Click: ${in2.url}")
(4) Parallelism
- A parallelism of 1 may limit throughput. Adjust it to the resources available in the cluster:
Scala
env.setParallelism(4)
Improved code
The complete code with these changes applied:
Scala
package transformplus
import java.lang
import java.time.Duration
import org.apache.flink.api.common.eventtime.WatermarkStrategy
import org.apache.flink.api.common.functions.CoGroupFunction
import org.apache.flink.streaming.api.functions.co.ProcessJoinFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.windowing.assigners.TumblingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.util.Collector
import source.Event
object WindowJoin {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(4)
// Window join
val stream1 = env.fromElements(
("a", 1000L),
("b", 1000L),
("a", 2000L),
("b", 6000L)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(5))
.withTimestampAssigner((event: (String, Long), timestamp: Long) => event._2)
)
val stream2 = env.fromElements(
("a", 3000L),
("b", 3000L),
("a", 4000L),
("b", 8000L)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[(String, Long)](Duration.ofSeconds(5))
.withTimestampAssigner((event: (String, Long), timestamp: Long) => event._2)
)
stream1.join(stream2)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply((e1, e2) => s"${e1._1} (${e1._2}) -> ${e2._1} (${e2._2})")
.print("Join")
// Interval join
val orderStream: DataStream[(String, String, Long)] = env
.fromElements(
("Mary", "order-1", 5000L),
("Alice", "order-2", 5000L),
("Bob", "order-3", 20000L),
("Alice", "order-4", 20000L),
("Cary", "order-5", 51000L)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[(String, String, Long)](Duration.ofSeconds(5))
.withTimestampAssigner((event: (String, String, Long), timestamp: Long) => event._3)
)
val pvStream: DataStream[Event] = env
.fromElements(
Event("Bob", "./cart", 2000L),
Event("Alice", "./prod?id=100", 3000L),
Event("Alice", "./prod?id=200", 3500L),
Event("Bob", "./prod?id=2", 2500L),
Event("Alice", "./prod?id=300", 36000L),
Event("Bob", "./home", 30000L),
Event("Bob", "./prod?id=1", 23000L),
Event("Bob", "./prod?id=3", 33000L)
).assignTimestampsAndWatermarks(
WatermarkStrategy.forBoundedOutOfOrderness[Event](Duration.ofSeconds(5))
.withTimestampAssigner((event: Event, timestamp: Long) => event.timestamp)
)
orderStream.keyBy(_._1)
.intervalJoin(pvStream.keyBy(_.user))
.between(Time.seconds(-5), Time.seconds(10))
.process(new ProcessJoinFunction[(String, String, Long), Event, String] {
override def processElement(in1: (String, String, Long), in2: Event, context: ProcessJoinFunction[(String, String, Long), Event, String]#Context, collector: Collector[String]): Unit = {
collector.collect(s"Order: ${in1._2}, Click: ${in2.url}")
}
}).print("intervalJoin")
// Window co-group join
stream1.coGroup(stream2)
.where(_._1)
.equalTo(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new CoGroupFunction[(String, Long), (String, Long), String] {
override def coGroup(iterable: lang.Iterable[(String, Long)], iterable1: lang.Iterable[(String, Long)], collector: Collector[String]): Unit = {
collector.collect(s"Stream1: ${iterable.toString}, Stream2: ${iterable1.toString}")
}
}).print("coGroup")
env.execute("windowJoin")
}
}