Window API
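The Window Assigners section showed the basic structure of Flink windows on keyed and non-keyed streams. The only difference is that a keyed stream calls keyBy(...) and then window(...), whereas a non-keyed stream calls windowAll(...) directly. Once window/windowAll has defined the window, further methods can be chained to process the window's data. The window API consists of the Window Trigger, the Evictor, the allowed lateness, the output tag, and the Window Function. The Window Function must be specified for every window operator; the other parts are specified only when needed.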
- Keyed Window
stream
.keyBy(...) <- required for keyed windows only
.window(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
- Non-Keyed Window
stream
.windowAll(...) <- required: "assigner"
[.trigger(...)] <- optional: "trigger" (else default trigger)
[.evictor(...)] <- optional: "evictor" (else no evictor)
[.allowedLateness(...)] <- optional: "lateness" (else zero)
[.sideOutputLateData(...)] <- optional: "output tag" (else no side output for late data)
.reduce/aggregate/apply() <- required: "function"
[.getSideOutput(...)] <- optional: "output tag"
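In the Window API above, calls in square brackets [...] are optional. The parts are:
- Window Assigner: specifies the window type and defines how elements of the stream are assigned to one or more windows.
- Window Trigger: specifies when a window fires, that is, which condition must be met before the computation runs.
- Evictor: removes elements from the window.
- allowedLateness: specifies how late an element may be and still be added to its window; depending on the trigger, such a late element can make the window fire again.
- Output Tag: tags late data passed to sideOutputLateData so that it can be read from the result stream through getSideOutput.
- Window Function: defines the processing logic applied to the window's data, for example a sum.
Flink lets you combine these window APIs to meet different business requirements. The following sections introduce trigger, evictor, allowedLateness, sideOutputLateData, and getSideOutput.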
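Window Trigger
A Trigger decides when a window is handed to the window function. Every window assigner has a default trigger: event-time windows use EventTimeTrigger, processing-time windows use ProcessingTimeTrigger, and count windows created with countWindow(...) use a CountTrigger (GlobalWindows on its own defaults to NeverTrigger, which never fires). In most cases the default trigger is sufficient; if it does not meet your needs, you can write a custom trigger.
A custom trigger extends the abstract class Trigger and implements the four abstract methods shown below: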
//use the custom trigger
DataStream.window(WindowAssigner...)
//custom trigger
.trigger(new MyTrigger())
//processing logic
.process(...)
//custom trigger
class MyTrigger extends Trigger<EventType,WindowType>{
@Override
public TriggerResult onElement(EventType element, long timestamp, WindowType window, TriggerContext ctx) throws Exception {
return ...;
}
@Override
public TriggerResult onProcessingTime(long time, WindowType window, TriggerContext ctx) throws Exception {
return ...;
}
@Override
public TriggerResult onEventTime(long time, WindowType window, TriggerContext ctx) throws Exception {
return ...;
}
@Override
public void clear(WindowType window, TriggerContext ctx) throws Exception {
...
}
}
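The abstract methods of the custom trigger are:
- onElement(): called for every element that is added to a window.
- onProcessingTime(): called when a registered processing-time timer fires.
- onEventTime(): called when a registered event-time timer fires.
- clear(): called when the corresponding window is removed, typically to clean up state defined by the trigger. It is only called when a time window ends; a GlobalWindow never ends, so clear() is not called for it.
Each of these methods receives a TriggerContext, through which you can register event-time or processing-time timers. When a timer fires, the matching onEventTime()/onProcessingTime() method is called. The first three methods return a TriggerResult, which decides what happens to the window:
- TriggerResult.CONTINUE: do nothing.
- TriggerResult.FIRE: fire the window computation and run the downstream window logic; the window contents are kept.
- TriggerResult.PURGE: clear the elements in the window without computing; window metadata and trigger state are kept.
- TriggerResult.FIRE_AND_PURGE: fire the window computation, then clear the elements in the window.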
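The four results combine two independent decisions: whether the window function runs, and whether the buffered elements are dropped. The sketch below models this in plain Java without Flink on the classpath. The `Outcome` enum and the `apply` helper are hypothetical stand-ins written for illustration, not Flink classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of trigger results; an illustration, not Flink's TriggerResult.
final class TriggerOutcomeDemo {

    enum Outcome {
        CONTINUE(false, false),
        FIRE(true, false),
        PURGE(false, true),
        FIRE_AND_PURGE(true, true);

        final boolean fire;
        final boolean purge;

        Outcome(boolean fire, boolean purge) {
            this.fire = fire;
            this.purge = purge;
        }
    }

    // Applies an outcome to a window buffer.
    // Returns the emitted sum, or null when the window did not fire.
    static Long apply(Outcome outcome, List<Long> buffer) {
        Long emitted = null;
        if (outcome.fire) {
            long sum = 0L;
            for (long v : buffer) {
                sum += v;
            }
            emitted = sum;
        }
        if (outcome.purge) {
            buffer.clear();
        }
        return emitted;
    }

    public static void main(String[] args) {
        for (Outcome o : Outcome.values()) {
            List<Long> buffer = new ArrayList<>(Arrays.asList(10L, 30L));
            Long emitted = apply(o, buffer);
            System.out.println(o + " -> emitted=" + emitted + ", remaining=" + buffer.size());
        }
    }
}
```

Firing without purging keeps the elements, so a later firing sees them again; this is why the examples in this section return FIRE_AND_PURGE when each result should cover only new data.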
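As shown above, a custom trigger decides when a window fires and when its contents are purged. In Flink, trigger(...) can be called on both time windows and global windows. Specifying a trigger overrides the default trigger of the current WindowAssigner. For example, if you set a CountTrigger on TumblingEventTimeWindows, the window no longer fires based on time but on the number of elements. The following examples show a custom trigger on a time window and on a global window.
Custom trigger on a time window
Example: read base-station log data and specify a trigger manually so that every record of every base station fires the window.
- Java code: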
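/**
* Flink Window API - custom trigger on a time window
* Example: read base-station log data and specify a trigger manually so that every record of every base station fires the window.
*/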
public class TimeWindowTriggerTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//custom trigger: fires the window on every incoming element
.trigger(new MyTimeTrigger2())
//统计每个基站所有主叫通话总时长
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
}).print();
env.execute();
}
}
//MyTimeTrigger2: fires the window computation once for every incoming element
class MyTimeTrigger2 extends Trigger<StationLog, TimeWindow> {
@Override
public TriggerResult onElement(StationLog element, long timestamp, TimeWindow window, TriggerContext ctx) throws Exception {
System.out.println("onElement方法执行...");
//只要来一条数据就触发一次窗口
return TriggerResult.FIRE_AND_PURGE;
}
@Override
public TriggerResult onProcessingTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
System.out.println("onProcessingTime方法执行,窗口触发计算...");
return TriggerResult.CONTINUE;
}
@Override
public TriggerResult onEventTime(long time, TimeWindow window, TriggerContext ctx) throws Exception {
System.out.println("onEventTime方法执行,窗口触发计算...");
return TriggerResult.CONTINUE;
}
@Override
public void clear(TimeWindow window, TriggerContext ctx) throws Exception {
System.out.println("clear方法执行,窗口销毁...");
//这里没有状态,不需要清空
}
}
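- Scala code:
/**
* Flink Window API - custom trigger on a time window
* Example: read base-station log data and specify a trigger manually so that every record of every base station fires the window.
*/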
object TimeWindowTriggerTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.trigger(new MyTimeTrigger2())
.process(new ProcessWindowFunction[StationLog,String,String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = if(context.window.getStart <0) 0 else context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
}
}
//MyTimeTrigger2: fires the window computation once for every incoming element
class MyTimeTrigger2 extends Trigger[StationLog, TimeWindow] {
override def onElement(element: StationLog, timestamp: Long, window: TimeWindow, ctx: TriggerContext): TriggerResult = {
println("onElement方法执行...")
//只要来一条数据就触发一次窗口
TriggerResult.FIRE_AND_PURGE
}
override def onProcessingTime(time: Long, window: TimeWindow, ctx: TriggerContext): TriggerResult = {
println("onProcessingTime方法执行,窗口触发计算...")
TriggerResult.CONTINUE
}
override def onEventTime(time: Long, window: TimeWindow, ctx: TriggerContext): TriggerResult = {
println("onEventTime方法执行,窗口触发计算...")
TriggerResult.CONTINUE
}
override def clear(window: TimeWindow, ctx: TriggerContext): Unit = {
println("clear方法执行,窗口销毁...")
//这里没有状态,不需要清空
}
}
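After the code is written and running, enter the following data into the socket:
#data entered into the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#this record advances the watermark to 4999, so the [0~5000) tumbling windows reach their end and clear() is called
001,181,182,busy,7000,10
With this input, every record fires its window according to the rule of the custom trigger and a result is printed; the windows no longer fire at the time implied by the window assigner. When the window end defined by the assigner is reached, however, the window is still destroyed automatically and the custom trigger's clear() method is called. clear() is called only when a time window is destroyed; it is not called for other windows such as a GlobalWindow. The final output is: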
窗口范围:[0~5000),基站:001,所有主叫通话总时长:10
窗口范围:[0~5000),基站:002,所有主叫通话总时长:20
窗口范围:[0~5000),基站:001,所有主叫通话总时长:30
窗口范围:[5000~10000),基站:002,所有主叫通话总时长:40
窗口范围:[5000~10000),基站:003,所有主叫通话总时长:50
窗口范围:[5000~10000),基站:001,所有主叫通话总时长:10
clear方法执行,窗口销毁...
clear方法执行,窗口销毁...
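The window ranges in this output, and the moment at which clear() runs, follow from simple arithmetic. The sketch below reproduces it in plain Java. It assumes non-negative timestamps, a window offset of 0, an allowed lateness of 0, and a bounded-out-of-orderness watermark of (max timestamp - 2000 - 1); the class and method names are hypothetical helpers, not Flink API.

```java
// Plain-Java arithmetic for 5 s tumbling event-time windows; an illustration only.
final class TumblingWindowMath {
    static final long WINDOW_SIZE_MS = 5000L;
    static final long OUT_OF_ORDERNESS_MS = 2000L;

    // Start of the tumbling window that contains ts (assumes ts >= 0 and offset 0).
    static long windowStart(long ts) {
        return ts - (ts % WINDOW_SIZE_MS);
    }

    // Exclusive end of the window that contains ts.
    static long windowEnd(long ts) {
        return windowStart(ts) + WINDOW_SIZE_MS;
    }

    // Watermark after the largest timestamp seen so far is maxTs.
    static long watermarkAfter(long maxTs) {
        return maxTs - OUT_OF_ORDERNESS_MS - 1;
    }

    // With allowed lateness 0, a window is cleaned up once the watermark
    // reaches the last timestamp that belongs to it (windowEnd - 1).
    static boolean cleanedUp(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        long[] callTimes = {1000L, 3000L, 2000L, 6000L, 5000L, 7000L};
        long maxTs = Long.MIN_VALUE;
        for (long ts : callTimes) {
            maxTs = Math.max(maxTs, ts);
            long wm = watermarkAfter(maxTs);
            System.out.println("ts=" + ts + " window=[" + windowStart(ts) + "~" + windowEnd(ts)
                    + ") watermark=" + wm + " firstWindowCleanedUp=" + cleanedUp(5000L, wm));
        }
    }
}
```

Only the record with call time 7000 pushes the watermark to 4999, which is why the two clear() lines appear after it: one for base station 001 and one for 002 in window [0~5000).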
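Custom trigger on a global window
Example: read base-station log data and specify a trigger manually so that each base station's data is collected and computed every 5 seconds.
- Java code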
public class GlobalWindowTriggerTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,8000,50
* 001,181,182,busy,7000,10
* 001,181,184,busy,1000,10
* 001,182,185,busy,2000,20
* 001,183,186,busy,3000,30
* 003,181,187,busy,14000,10
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
keyedStream.window(GlobalWindows.create())
//自定义触发器,每个事件5秒后触发一次计算
.trigger(new MyTimeTrigger1())
//自定义窗口函数,统计每个基站所有主叫通话总时长
.process(new ProcessWindowFunction<StationLog, String, String, GlobalWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, GlobalWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
out.collect("基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
}).print();
env.execute();
}
}
//MyTimeTrigger1: fires 5 seconds (event time) after a timer is registered for a key
class MyTimeTrigger1 extends Trigger<StationLog, GlobalWindow> {
//创建状态描述符,该状态标记当前key是否有对应的定时器
private ValueStateDescriptor<Boolean> timerStateDescriptor = new ValueStateDescriptor<>("timer-state", Boolean.class);
//每来一条数据,都会调用一次
@Override
public TriggerResult onElement(StationLog element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("onElement >>>>>>>>>>>>>>>> 方法调用了,当前事件时间"+timestamp+",当前水位线"+ctx.getCurrentWatermark());
//获取当前窗口中定时器是否存在的状态
Boolean isExist = ctx.getPartitionedState(timerStateDescriptor).value();
if(isExist == null || !isExist){
System.out.println("注册定时器,触发时间:" + (timestamp + 4999));
//注册一个基于事件时间的定时器,延迟5秒触发
ctx.registerEventTimeTimer(timestamp + 4999L);
//更新状态
ctx.getPartitionedState(timerStateDescriptor).update(true);
}
return TriggerResult.CONTINUE;
}
//Callback for processing-time timers: if onElement registered a processing-time timer, onProcessingTime is called when it fires
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("onProcessingTime >>>>>>>>>>>>>>>> 方法调用了");
//不使用处理时间,这里直接返回CONTINUE
return TriggerResult.CONTINUE;
}
/**
* Callback for event-time timers: if onElement registered an event-time timer, onEventTime is called when it fires
* @param time the timestamp at which the timer fires
*/
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("onEventTime >>>>>>>>>>>>>>>> 方法调用了,触发器执行,触发时间:" + time);
//更新状态为false
ctx.getPartitionedState(timerStateDescriptor).update(false);
return TriggerResult.FIRE_AND_PURGE;
}
//clear() 方法处理在对应窗口被移除时所需的逻辑。
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("clear >>>>>>>>>>>>>>>> 方法调用了,清空状态");
ctx.getPartitionedState(timerStateDescriptor).clear();
}
}
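- Scala code
/**
* Flink Window API - custom trigger on a GlobalWindow
* Example: read base-station log data and specify a trigger manually so that each base station's data is collected and computed every 5 seconds.
*/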
object GlobalWindowTriggerTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark
.keyBy(_.sid)
.window(GlobalWindows.create())
.trigger(new MyTimeTrigger1())
.process(new ProcessWindowFunction[StationLog, String, String, GlobalWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
}
}
//MyTimeTrigger1: fires 5 seconds (event time) after a timer is registered for a key
class MyTimeTrigger1 extends Trigger[StationLog, GlobalWindow] {
//创建状态描述符,该状态标记当前key是否有对应的定时器
val timerStateDescriptor = new ValueStateDescriptor[Boolean]("timer-state", classOf[Boolean])
//每来一条数据,都会调用一次
override def onElement(element: StationLog,
timestamp: Long,
window: GlobalWindow,
ctx: Trigger.TriggerContext): TriggerResult = {
//获取当前窗口中定时器是否存在的状态
val isExist: Boolean = ctx.getPartitionedState(timerStateDescriptor).value()
if (isExist == null || !isExist) {
//注册一个基于事件时间的定时器,延迟5秒触发
ctx.registerEventTimeTimer(timestamp + 4999L)
//更新状态
ctx.getPartitionedState(timerStateDescriptor).update(true)
}
TriggerResult.CONTINUE
}
//注册处理时间定时器。如果基于ProcessTime处理,在onElement方法中注册了定时器,当定时器触发时,会调用onProcessingTime方法
override def onProcessingTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
//注册事件时间定时器。如果基于EventTime处理,在onElement方法中注册了定时器,当定时器触发时,会调用onEventTime方法
override def onEventTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
//更新状态为false
ctx.getPartitionedState(timerStateDescriptor).update(false);
TriggerResult.FIRE_AND_PURGE
}
//clear() 方法处理在对应窗口被移除时所需的逻辑。
override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
ctx.getPartitionedState(timerStateDescriptor).clear()
}
}
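In the code above, the trigger fires only once the watermark reaches the timestamp of the registered timer. The timer is registered with a delay of 4999 ms rather than 5000 ms because the watermark computed from the incoming events is already reduced by 1 ms (max event time - out-of-orderness - 1 ms), so 4999 ms is used when registering the timer.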
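The effect of the 1 ms difference can be checked with a few lines of arithmetic. The sketch below is plain Java and assumes the 2 s out-of-orderness used in this example and the rule that an event-time timer fires once the watermark is greater than or equal to its timestamp. The class and method names are hypothetical.

```java
// Arithmetic behind the 4999 ms timer delay; an illustration, not Flink code.
final class WatermarkMath {
    static final long OUT_OF_ORDERNESS_MS = 2000L;

    // Watermark derived from the largest event timestamp seen so far.
    static long watermark(long maxTimestampMs) {
        return maxTimestampMs - OUT_OF_ORDERNESS_MS - 1;
    }

    // Timestamp of a timer registered delayMs after the given event time.
    static long timerTime(long eventTimeMs, long delayMs) {
        return eventTimeMs + delayMs;
    }

    // An event-time timer fires once the watermark has reached its timestamp.
    static boolean fires(long timerTimeMs, long watermarkMs) {
        return watermarkMs >= timerTimeMs;
    }

    public static void main(String[] args) {
        long wm = watermark(8000L); // watermark after the record with call time 8000
        System.out.println("watermark = " + wm);
        System.out.println("delay 4999 fires: " + fires(timerTime(1000L, 4999L), wm));
        System.out.println("delay 5000 fires: " + fires(timerTime(1000L, 5000L), wm));
    }
}
```

With a delay of 5000 ms, the timer registered by the record at 1000 would sit at 6000 and would need a record with a call time of at least 8001 before it could fire.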
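After the code is written and running, enter the following data into the socket line by line to verify it:
#data entered into the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
#this record makes the timer that SID 001 registered 5 seconds earlier fire
003,181,183,busy,8000,50
#further SID 001 data registers a new timer, which fires 5 seconds later
001,181,182,busy,7000,10
#the watermark is now 5999; late data that arrives is still included in the next firing
001,181,184,busy,1000,10
001,182,185,busy,2000,20
001,183,186,busy,3000,30
#after this record the watermark is 11999; the SID 002 window fires, and so does the SID 001 window
003,181,187,busy,14000,10
After the record "003,181,183,busy,8000,50" is entered, the watermark is 5999. The SID 001 records entered afterwards, "001,181,184,busy,1000,10", "001,182,185,busy,2000,20", and "001,183,186,busy,3000,30", lie before the watermark, so all three are late data. They are nevertheless included in the result the next time the SID 001 window fires. The results are: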
#result of the first firing
基站:001,所有主叫通话总时长:40
#result of the second firing; the base station 001 result includes the 3 late records
基站:001,所有主叫通话总时长:70
基站:002,所有主叫通话总时长:60
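The totals 40, 60, and 70 can be reproduced with a small plain-Java walk-through of the same timer logic. This is a simplified model under stated assumptions, not Flink code: one pending timer per base station, a watermark of (max call time - 2000 - 1) recomputed after every record, and due timers fired in timestamp order. In a real job with parallelism greater than 1, the order of the printed lines may differ. The class name and `SAMPLE` array are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Plain-Java model of the per-key timer trigger on a global window; an illustration only.
final class GlobalWindowTimerSim {
    static final long OUT_OF_ORDERNESS_MS = 2000L;
    static final long TIMER_DELAY_MS = 4999L;

    // The sample input as {sid, callTime, duration}.
    static final String[][] SAMPLE = {
        {"001", "1000", "10"}, {"002", "3000", "20"}, {"001", "2000", "30"},
        {"002", "6000", "40"}, {"003", "8000", "50"}, {"001", "7000", "10"},
        {"001", "1000", "10"}, {"001", "2000", "20"}, {"001", "3000", "30"},
        {"003", "14000", "10"}
    };

    // Returns one "sid:totalDuration" entry per firing, in firing order.
    static List<String> run(String[][] events) {
        Map<String, List<Long>> buffers = new HashMap<>(); // window contents per key
        Map<String, Long> timers = new HashMap<>();        // pending timer per key
        List<String> fired = new ArrayList<>();
        long maxTs = Long.MIN_VALUE;
        for (String[] e : events) {
            String sid = e[0];
            long ts = Long.parseLong(e[1]);
            long duration = Long.parseLong(e[2]);
            // onElement: buffer the element and register a timer if none is pending.
            buffers.computeIfAbsent(sid, k -> new ArrayList<>()).add(duration);
            timers.putIfAbsent(sid, ts + TIMER_DELAY_MS);
            // Advance the watermark after the element has been processed.
            maxTs = Math.max(maxTs, ts);
            long watermark = maxTs - OUT_OF_ORDERNESS_MS - 1;
            // Collect the keys whose timer is due, ordered by timer timestamp.
            List<String> due = new ArrayList<>();
            for (Map.Entry<String, Long> t : timers.entrySet()) {
                if (t.getValue() <= watermark) {
                    due.add(t.getKey());
                }
            }
            due.sort((a, b) -> Long.compare(timers.get(a), timers.get(b)));
            // onEventTime: fire, then purge the buffer and reset the timer flag.
            for (String key : due) {
                long sum = 0L;
                for (long d : buffers.get(key)) {
                    sum += d;
                }
                fired.add(key + ":" + sum);
                buffers.remove(key);
                timers.remove(key);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        System.out.println(run(SAMPLE));
    }
}
```

Base station 003 never fires in this run because its timer sits at 12999 and the watermark only reaches 11999.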
If late data should not be included when the window fires later, an evictor can remove it; see the Evictor section for a concrete example.
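Evictor
The evictor in the Flink Window API can remove elements from a window after the trigger fires, before and/or after the window function is applied. Flink ships three built-in evictors, listed below. By default, all built-in evictors apply their logic before the window function is called.
- CountEvictor: keeps up to a user-specified number of elements in the window. Once the window holds more than that, the surplus elements are removed from the beginning of the window buffer. Flink gives no guarantee about the order of elements within a window, so an element removed from the beginning of the buffer is not necessarily the one that arrived first.
- DeltaEvictor: takes a DeltaFunction and a threshold, computes the delta between the last element and every element in the window buffer, and removes the elements whose delta is greater than or equal to the threshold.
- TimeEvictor: takes an interval in milliseconds, finds the maximum timestamp max_ts among the elements in the window, and removes all elements with a timestamp smaller than max_ts - interval.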
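As an illustration of the CountEvictor rule described above, the sketch below keeps at most maxCount elements and drops the surplus from the beginning of the buffer. It is a plain-Java model with hypothetical names, written from the description in this section rather than taken from Flink's source.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Plain-Java model of count-based eviction; an illustration only.
final class CountEvictorSketch {

    // Returns a copy of the buffer with the surplus elements removed from the front.
    // Assumes maxCount >= 0.
    static <T> List<T> evict(List<T> elements, int maxCount) {
        List<T> kept = new ArrayList<>(elements);
        int toRemove = kept.size() - maxCount;
        Iterator<T> it = kept.iterator();
        while (toRemove > 0 && it.hasNext()) {
            it.next();
            it.remove();
            toRemove--;
        }
        return kept;
    }

    public static void main(String[] args) {
        System.out.println(evict(Arrays.asList(1, 2, 3, 4, 5), 3));
        System.out.println(evict(Arrays.asList(1, 2), 3));
    }
}
```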
If the built-in evictors do not fit your business case, you can define your own eviction rule by implementing the Evictor interface, which has the following two methods:
class MyEvictor implements Evictor<EventType,WindowType>{
@Override
public void evictBefore(Iterable<TimestampedValue<EventType>> elements, int size, WindowType window, EvictorContext evictorContext) {
}
@Override
public void evictAfter(Iterable<TimestampedValue<EventType>> elements, int size, WindowType window, EvictorContext evictorContext) {
}
}
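The two methods of the Evictor interface are:
- evictBefore: removes elements from the window before the window function is called. A custom evictor usually removes elements in this method.
- evictAfter: removes elements from the window after the window function is called (rarely used).
The following examples show a built-in evictor and a custom evictor in the Flink Window API.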
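Built-in Evictor example
This example uses DeltaEvictor to show how a built-in evictor is used. DeltaEvictor takes a DeltaFunction and a threshold, computes the delta between the last element and every element in the window buffer, and removes the elements whose delta is greater than or equal to the threshold.
Example: read base-station log data from a socket and use a 5 s tumbling window to compute the total call duration per base station. (Any element whose call duration differs from that of the last element in the window by 5 or more is removed.)
- Java code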
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,2000,13
* 001,181,182,busy,1000,10
* 002,182,183,fail,2000,40
* 002,184,185,busy,3000,20
* 001,181,183,busy,1000,17
* 001,183,184,busy,4000,12
* 003,181,182,busy,7000,10
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
//每隔5s统计每个基站所有主叫通话总时长,使用事件时间
SingleOutputStreamOperator<String> result = keyedStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//指定evictor,如果两个事件通话时长相差5秒以上就移除数据
.evictor(DeltaEvictor.of(5, new DeltaFunction<StationLog>() {
//获取两个数据点的差值
@Override
public double getDelta(StationLog oldDataPoint, StationLog newDataPoint) {
return Math.abs(oldDataPoint.duration - newDataPoint.duration);
}
}))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,2000,13
* 001,181,182,busy,1000,10
* 002,182,183,fail,2000,40
* 002,184,185,busy,3000,20
* 001,181,183,busy,1000,17
* 001,183,184,busy,4000,12
* #窗口触发
* 003,181,182,busy,7000,10
*
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//设置窗口移除策略
.evictor(DeltaEvictor.of[StationLog,TimeWindow](5, new DeltaFunction[StationLog] {
//获取两个数据点的差值
override def getDelta(oldDataPoint: StationLog, newDataPoint: StationLog): Double = {
Math.abs(newDataPoint.duration - oldDataPoint.duration)
}
}))
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
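The code above sets the threshold between the last element in the window and every other element to 5: any event whose call duration differs from that of the last event in the window by 5 or more is removed. After starting the job, enter the following data into the socket:
#data entered into the socket
001,181,182,busy,2000,13
001,181,182,busy,4000,10
002,182,183,fail,2000,40
002,184,185,busy,3000,20
001,181,183,busy,1000,17
001,183,184,busy,1000,12
#the window fires
003,181,182,busy,7000,10
With this input, when the [0~5000) window fires, the last event of base station 001 has a duration of 12, so events in that window whose duration differs from 12 by 5 or more are evicted and excluded from the result. For base station 002, the last event in [0~5000) has a duration of 20, so events whose duration differs from 20 by 5 or more are evicted as well. The output after the window fires is: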
窗口范围:[0~5000),基站:001,所有主叫通话总时长:35
窗口范围:[0~5000),基站:002,所有主叫通话总时长:20
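The totals 35 and 20 can be checked by applying the delta rule by hand. The sketch below does so in plain Java on the durations in arrival order; it is a model of the rule as described in this section, with hypothetical class and method names, not Flink's DeltaEvictor.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java model of delta-based eviction on call durations; an illustration only.
final class DeltaEvictorSketch {

    // Keeps the durations whose absolute difference to the last one is below the threshold.
    static List<Long> evict(List<Long> durations, double threshold) {
        List<Long> kept = new ArrayList<>();
        if (durations.isEmpty()) {
            return kept;
        }
        long last = durations.get(durations.size() - 1);
        for (long d : durations) {
            if (Math.abs(d - last) < threshold) {
                kept.add(d);
            }
        }
        return kept;
    }

    // Sum of the durations that survive eviction.
    static long sumAfterEviction(List<Long> durations, double threshold) {
        long sum = 0L;
        for (long d : evict(durations, threshold)) {
            sum += d;
        }
        return sum;
    }

    public static void main(String[] args) {
        // Durations per base station in window [0~5000), in arrival order.
        System.out.println("001: " + sumAfterEviction(Arrays.asList(13L, 10L, 17L, 12L), 5));
        System.out.println("002: " + sumAfterEviction(Arrays.asList(40L, 20L), 5));
    }
}
```

For base station 001 only the duration 17 is dropped, since |17 - 12| = 5 reaches the threshold; for 002 the duration 40 is dropped.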
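Custom Evictor example
Example: read base-station log data and specify a trigger manually so that each base station's data is collected and computed every 5 seconds. (Events that are late with respect to the watermark are no longer included in the window.)
This example is essentially the same as the "custom trigger on a global window" example in the trigger section. In that example, if the stream is out of order, late events (events whose event time is smaller than the watermark) are still counted in the window. A custom Evictor can remove such late events.
- Java code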
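/**
* User-defined Evictor
* Example: read base-station log data and specify a trigger manually so that each base station's data is collected and computed every 5 seconds.
* (Events that are late with respect to the watermark are no longer included in the window.)
*/
public class CustomEvictorTest {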
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* #该条数据输入SID 001 对应5秒后窗口触发
* 003,181,183,busy,8000,50
* #继续输入SID 001 数据,重新注册定时器,5秒后触发
* 001,181,182,busy,7000,10
* #此刻wm为5999,如果有超时数据过来,数据会被剔除
* 001,181,184,busy,1000,10
* 001,182,185,busy,2000,20
* 001,183,186,busy,3000,30
* #此条数据输入后,wm为11999,SID 002 对应窗口会触发,另外SID 001窗口也会触发
* 003,181,187,busy,14000,10
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
keyedStream.window(GlobalWindows.create())
//自定义触发器,每个事件5秒后触发一次计算
.trigger(new MyTimeTriggerCls())
.evictor(new Evictor<StationLog, GlobalWindow>() {
@Override
public void evictBefore(Iterable<TimestampedValue<StationLog>> elements, int size, GlobalWindow window, EvictorContext evictorContext) {
Iterator<TimestampedValue<StationLog>> iter = elements.iterator();
//如果数据的 callType 标记为"迟到数据",则移除该数据
while (iter.hasNext()) {
TimestampedValue<StationLog> next = iter.next();
if (next.getValue().callType.equals("迟到数据")) {
System.out.println("移除了迟到数据:" + next.getValue());
//移除迟到数据,删除当前指针所指向的元素
iter.remove();
}
}
}
@Override
public void evictAfter(Iterable<TimestampedValue<StationLog>> elements, int size, GlobalWindow window, EvictorContext evictorContext) {
}
})
//自定义窗口函数,统计每个基站所有主叫通话总时长
.process(new ProcessWindowFunction<StationLog, String, String, GlobalWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, GlobalWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
out.collect("基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
}).print();
env.execute();
}
}
//MyTimeTriggerCls: fires 5 seconds (event time) after a timer is registered for a key, and marks late elements
class MyTimeTriggerCls extends Trigger<StationLog, GlobalWindow> {
//创建状态描述符,该状态标记当前key是否有对应的定时器
private ValueStateDescriptor<Boolean> timerStateDescriptor = new ValueStateDescriptor<>("timer-state", Boolean.class);
//每来一条数据,都会调用一次
@Override
public TriggerResult onElement(StationLog element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("onElement >>>>>>>>>>>>>>>> 方法调用了,当前事件时间"+timestamp+",当前水位线"+ctx.getCurrentWatermark());
//获取当前窗口中定时器是否存在的状态
Boolean isExist = ctx.getPartitionedState(timerStateDescriptor).value();
if(isExist == null || !isExist){
System.out.println("注册定时器,触发时间:" + (timestamp + 4999));
//注册一个基于事件时间的定时器,延迟5秒触发
ctx.registerEventTimeTimer(timestamp + 4999L);
//更新状态
ctx.getPartitionedState(timerStateDescriptor).update(true);
}
//如果事件时间小于了watermark,说明迟到了,给该数据做个标记
if(timestamp < ctx.getCurrentWatermark()){
element.setCallType("迟到数据");
}
return TriggerResult.CONTINUE;
}
//注册处理时间定时器。如果基于ProcessTime处理,在onElement方法中注册了定时器,当定时器触发时,会调用onProcessingTime方法
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("onProcessingTime >>>>>>>>>>>>>>>> 方法调用了");
//不使用处理时间,这里直接返回CONTINUE
return TriggerResult.CONTINUE;
}
/**
* 注册事件时间定时器。如果基于EventTime处理,在onElement方法中注册了定时器,当定时器触发时,会调用onEventTime方法
* @param time 定时器触发时间
*/
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("onEventTime >>>>>>>>>>>>>>>> 方法调用了,触发器执行,触发时间:" + time);
//更新状态为false
ctx.getPartitionedState(timerStateDescriptor).update(false);
return TriggerResult.FIRE_AND_PURGE;
}
//clear() 方法处理在对应窗口被移除时所需的逻辑。
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
System.out.println("clear >>>>>>>>>>>>>>>> 方法调用了,清空状态");
ctx.getPartitionedState(timerStateDescriptor).clear();
}
}
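- Scala code
/**
* User-defined Evictor
* Example: read base-station log data and specify a trigger manually so that each base station's data is collected and computed every 5 seconds.
* (Events that are late with respect to the watermark are no longer included in the window.)
*/
object CustomEvictorTest {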
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* #该条数据输入SID 001 对应5秒后窗口触发
* 003,181,183,busy,8000,50
* #继续输入SID 001 数据,重新注册定时器,5秒后触发
* 001,181,182,busy,7000,10
* #此刻wm为5999,如果有超时数据过来,数据会被剔除
* 001,181,184,busy,1000,10
* 001,182,185,busy,2000,20
* 001,183,186,busy,3000,30
* #此条数据输入后,wm为11999,SID 002 对应窗口会触发,另外SID 001窗口也会触发
* 003,181,187,busy,14000,10
*
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark
.keyBy(_.sid)
.window(GlobalWindows.create())
.trigger(new MyTimeTriggerCls())
.evictor(new Evictor[StationLog, GlobalWindow] {
override def evictBefore(elements: lang.Iterable[TimestampedValue[StationLog]],
size: Int,
window: GlobalWindow,
evictorContext: Evictor.EvictorContext): Unit = {
val iter: util.Iterator[TimestampedValue[StationLog]] = elements.iterator
//如果数据的 callType 标记为"迟到数据",则移除该数据
while (iter.hasNext) {
val next: TimestampedValue[StationLog] = iter.next
if (next.getValue.callType == "迟到数据") {
System.out.println("移除了迟到数据:" + next.getValue)
//移除迟到数据,删除当前指针所指向的元素
iter.remove()
}
}
}
override def evictAfter(elements: lang.Iterable[TimestampedValue[StationLog]],
size: Int,
window: GlobalWindow,
evictorContext: Evictor.EvictorContext): Unit = {
}
})
.process(new ProcessWindowFunction[StationLog, String, String, GlobalWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
}
}
//MyTimeTriggerCls: fires 5 seconds (event time) after a timer is registered for a key, and marks late elements
class MyTimeTriggerCls extends Trigger[StationLog, GlobalWindow] {
//创建状态描述符,该状态标记当前key是否有对应的定时器
val timerStateDescriptor = new ValueStateDescriptor[Boolean]("timer-state", classOf[Boolean])
//每来一条数据,都会调用一次
override def onElement(element: StationLog,
timestamp: Long,
window: GlobalWindow,
ctx: Trigger.TriggerContext): TriggerResult = {
//获取当前窗口中定时器是否存在的状态
val isExist: Boolean = ctx.getPartitionedState(timerStateDescriptor).value()
if (isExist == null || !isExist) {
//注册一个基于事件时间的定时器,延迟5秒触发
ctx.registerEventTimeTimer(timestamp + 4999L)
//更新状态
ctx.getPartitionedState(timerStateDescriptor).update(true)
}
//如果事件时间小于了watermark,说明迟到了,给该数据做个标记
if (timestamp < ctx.getCurrentWatermark) {
//需要修改StationLog对象中CallType类型为var
element.callType = "迟到数据"
}
TriggerResult.CONTINUE
}
//注册处理时间定时器。如果基于ProcessTime处理,在onElement方法中注册了定时器,当定时器触发时,会调用onProcessingTime方法
override def onProcessingTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
//注册事件时间定时器。如果基于EventTime处理,在onElement方法中注册了定时器,当定时器触发时,会调用onEventTime方法
override def onEventTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
//更新状态为false
ctx.getPartitionedState(timerStateDescriptor).update(false);
TriggerResult.FIRE_AND_PURGE
}
//clear() 方法处理在对应窗口被移除时所需的逻辑。
override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
ctx.getPartitionedState(timerStateDescriptor).clear()
}
}
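In the code above, the custom trigger's onElement method marks late events by setting the callType field of the StationLog object to "迟到数据" (late data). The custom evictor's evictBefore method then only has to check callType and remove the marked events, so they are excluded from the computation when the window fires. Note that in the Scala API the callType field of the StationLog case class must be declared as var for this assignment to compile.
After the code is written and started, enter the following data on the socket:
#Data entered on the socket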
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
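#The next record advances the watermark to 5999, which fires the timer of SID 001 (registered at 1000 + 4999)
003,181,183,busy,8000,50
#Another SID 001 record registers a new timer (7000 + 4999 = 11999)
001,181,182,busy,7000,10
#The watermark is now 5999; records with an earlier event time are marked late and evicted
001,181,184,busy,1000,10
001,182,185,busy,2000,20
001,183,186,busy,3000,30
#After the next record the watermark is 11999: the SID 002 window fires, and so does the SID 001 window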
003,181,187,busy,14000,10
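After entering the data above, the following output appears:
#Result of the first window firing
基站:001,所有主叫通话总时长:40
#Results of the second firing; the late records have been evicted from the total for station 001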
基站:002,所有主叫通话总时长:60
移除了迟到数据:StationLog(001,181,184,迟到数据,1000,10)
移除了迟到数据:StationLog(001,182,185,迟到数据,2000,20)
移除了迟到数据:StationLog(001,183,186,迟到数据,3000,30)
基站:001,所有主叫通话总时长:10
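The watermark values quoted in the walkthrough (5999 and 11999) follow from simple arithmetic. The sketch below is plain Java with no Flink dependency; the class and method names are hypothetical, and it assumes the usual bounded-out-of-orderness rule (watermark = largest timestamp seen - allowed delay - 1 ms) and that a timer fires once the watermark reaches its timestamp. It illustrates the numbers only and is not Flink's implementation.

```java
/** Plain-Java illustration of the watermark and timer arithmetic; all names are hypothetical. */
final class WatermarkTimerSketch {

    /** Assumed bounded-out-of-orderness rule: max timestamp seen - delay - 1 ms. */
    static long watermark(long maxTimestamp, long delayMs) {
        return maxTimestamp - delayMs - 1;
    }

    /** An event-time timer fires once the watermark reaches or passes its timestamp. */
    static boolean timerFires(long timerTimestamp, long currentWatermark) {
        return currentWatermark >= timerTimestamp;
    }

    /** Same check as the custom trigger: an element below the watermark is treated as late. */
    static boolean isLate(long elementTimestamp, long currentWatermark) {
        return elementTimestamp < currentWatermark;
    }

    public static void main(String[] args) {
        long delay = 2000L;
        long timer001 = 1000L + 4999L;        // timer registered by the first record of station 001
        long wm = watermark(8000L, delay);    // the record with callTime 8000 arrives
        System.out.println("watermark=" + wm + ", timer of 001 fires=" + timerFires(timer001, wm));
        System.out.println("callTime 3000 is late=" + isLate(3000L, wm));
    }
}
```

With a 2 s delay, the record at 8000 yields a watermark of 5999, exactly the timer timestamp of station 001, which is why that record causes the first firing.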
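Window Functions
Calling window/windowAll on a Flink DataStream defines the windows and returns a WindowedStream (an AllWindowedStream for windowAll). When a window fires, its data is processed by the window function you specify. Flink provides three kinds of window functions: ReduceFunction, AggregateFunction, and ProcessWindowFunction. There is also a legacy function, WindowFunction, which ProcessWindowFunction can fully replace; it still works in current Flink versions but may be removed in a future release.
These functions fall into two groups: incremental aggregation functions and full-window functions. ReduceFunction and AggregateFunction are incremental: each arriving element is folded into an intermediate state, so only that intermediate result is kept and the elements themselves are not buffered, which makes them more efficient. ProcessWindowFunction and WindowFunction are full-window functions: Flink buffers all elements of the window and hands them over as an Iterable when the window fires, so the state grows with the number of elements and execution is less efficient than with incremental functions.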
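To make the difference in buffered state concrete, here is a minimal plain-Java sketch (no Flink dependency). Both classes are hypothetical stand-ins rather than Flink types: one keeps a single running value, the other buffers every element until the window fires.

```java
import java.util.ArrayList;
import java.util.List;

/** Contrasts the state kept by incremental and full-window aggregation; names are hypothetical. */
final class IncrementalVsFullSketch {

    /** Incremental style: the only window state is one running sum. */
    static final class IncrementalSum {
        private long sum = 0L;
        void add(long value) { sum += value; }      // called once per arriving element
        long result() { return sum; }               // called when the window fires
        int bufferedElements() { return 0; }        // no element is retained
    }

    /** Full-window style: every element is buffered and summed only when the window fires. */
    static final class FullWindowSum {
        private final List<Long> buffer = new ArrayList<>();
        void add(long value) { buffer.add(value); }
        long result() {
            long sum = 0L;
            for (long v : buffer) {
                sum += v;
            }
            return sum;
        }
        int bufferedElements() { return buffer.size(); }
    }

    public static void main(String[] args) {
        IncrementalSum inc = new IncrementalSum();
        FullWindowSum full = new FullWindowSum();
        for (long duration : new long[] {10L, 30L}) {
            inc.add(duration);
            full.add(duration);
        }
        System.out.println("incremental: sum=" + inc.result() + ", buffered=" + inc.bufferedElements());
        System.out.println("full-window: sum=" + full.result() + ", buffered=" + full.bufferedElements());
    }
}
```

Both variants produce the same sum; only the amount of retained state differs, which is the trade-off described above.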
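The following sections introduce each type of window function with an example.
reduce-ReduceFunction
On a WindowedStream, a ReduceFunction defines incrementally how window data is processed: it specifies how two input elements are combined into one output element, and the input and output types must be the same.
Call reduce on the WindowedStream and pass in a ReduceFunction, as follows: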
DataStream<Tuple2<String, Long>> input = ...;
input.keyBy(<key selector>)
.window(<window assigner>)
.reduce(new ReduceFunction<Tuple2<String, Long>>() {
public Tuple2<String, Long> reduce(Tuple2<String, Long> v1, Tuple2<String, Long> v2) {
return new Tuple2<>(v1.f0, v1.f1 + v2.f1);
}
});
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 seconds.
- Java code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<Tuple2<String, Long>, String> keyedDStream = dsWithWatermark.map(new MapFunction<StationLog, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(StationLog stationLog) throws Exception {
return Tuple2.of(stationLog.sid, stationLog.duration);
}
}).keyBy(t -> t.f0);
keyedDStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.reduce(new ReduceFunction<Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> reduce(Tuple2<String, Long> value1, Tuple2<String, Long> value2) throws Exception {
return Tuple2.of(value1.f0, value1.f1 + value2.f1);
}
})
// .reduce((t1,t2) -> Tuple2.of(t1.f0,t1.f1+t2.f1))
.print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark.map(stationLog => (stationLog.sid, stationLog.duration))
.keyBy(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.reduce((t1, t2) => (t1._1, t1._2 + t2._2))
.print()
env.execute()
After the code is written and started, enter the following data on the socket:
#Data entered on the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#The next record fires window [0~5000)
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record fires window [5000~10000)
003,181,183,busy,12000,50
The results are as follows:
7> (002,20)
1> (001,40)
7> (002,100)
1> (001,10)
8> (003,50)
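The totals above can be checked by hand by assigning each record to its 5-second tumbling window. The plain-Java sketch below does that for the sample input. It assumes non-negative timestamps and a window offset of 0, in which case a window start is simply timestamp - (timestamp % size); the class name and bucket label format are hypothetical, and watermarks and lateness are ignored because no record in this input arrives after its window has fired.

```java
import java.util.Map;
import java.util.TreeMap;

/** Assigns the sample records to 5-second tumbling windows and sums durations; names are hypothetical. */
final class TumblingWindowSketch {

    /** The sample input reduced to {sid, callTime, duration}. */
    static final String[][] SAMPLE = {
        {"001", "1000", "10"}, {"002", "3000", "20"}, {"001", "2000", "30"},
        {"002", "6000", "40"}, {"003", "5000", "50"}, {"001", "7000", "10"},
        {"002", "9000", "20"}, {"001", "11000", "30"}, {"002", "6000", "40"},
        {"003", "12000", "50"}
    };

    /** Window start for a non-negative timestamp and an offset of 0 (assumption of this sketch). */
    static long windowStart(long timestamp, long sizeMs) {
        return timestamp - (timestamp % sizeMs);
    }

    /** Sums the durations per "sid@[start,end)" bucket. */
    static Map<String, Long> sumPerWindow(String[][] records, long sizeMs) {
        Map<String, Long> sums = new TreeMap<>();
        for (String[] r : records) {
            long start = windowStart(Long.parseLong(r[1]), sizeMs);
            String bucket = r[0] + "@[" + start + "," + (start + sizeMs) + ")";
            sums.merge(bucket, Long.parseLong(r[2]), Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        for (Map.Entry<String, Long> e : sumPerWindow(SAMPLE, 5000L).entrySet()) {
            System.out.println(e.getKey() + " -> " + e.getValue());
        }
    }
}
```

The buckets for windows [0,5000) and [5000,10000) match the printed results; the records at 11000 and 12000 land in [10000,15000), which has not fired yet and therefore produces no output.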
aggregate-AggregateFunction
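Like ReduceFunction, AggregateFunction is an incremental function that computes on an intermediate state. It is more general for window computations, though: the interface is more flexible than ReduceFunction, at the cost of a somewhat more complex implementation.
Call aggregate on the WindowedStream and pass in an AggregateFunction, as follows: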
private static class AverageAggregate
implements AggregateFunction<Tuple2<String, Long>, Tuple2<Long, Long>, Double> {
@Override
public Tuple2<Long, Long> createAccumulator() {
return new Tuple2<>(0L, 0L);
}
@Override
public Tuple2<Long, Long> add(Tuple2<String, Long> value, Tuple2<Long, Long> accumulator) {
return new Tuple2<>(accumulator.f0 + value.f1, accumulator.f1 + 1L);
}
@Override
public Double getResult(Tuple2<Long, Long> accumulator) {
return ((double) accumulator.f0) / accumulator.f1;
}
@Override
public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
return new Tuple2<>(a.f0 + b.f0, a.f1 + b.f1);
}}
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(new AverageAggregate());
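Using the AggregateFunction interface requires three types: the input type (IN), the accumulator type (ACC), and the output type (OUT). IN and OUT are the types of the function's input and output; ACC is the intermediate state used during incremental aggregation. The interface defines four methods to implement:
- createAccumulator(): creates and initializes the accumulator, which is the intermediate state for the whole incremental computation.
- add(Event, Acc): called once for every element added to the window; defines how the incoming element is folded into the accumulator.
- getResult(Acc): returns the final result from the accumulator state.
- merge(Acc, Acc): in window-merging scenarios, merges the accumulators of two windows into one and returns it. With session windows, for example, each element initially gets its own window; when the time difference between two elements does not exceed the gap, their windows are merged automatically and merge() is called.
When an AggregateFunction runs, createAccumulator is called first to initialize the accumulator that takes part in the incremental computation. Each element arriving in the window is folded into the accumulator by add. If windows are merged along the way, merge combines their accumulators. Finally, getResult turns the accumulated data into the result. Note also that a WindowedStream offers min/minBy/max/maxBy/sum directly as built-in incremental aggregations; in the Java API these are implemented internally on top of ReduceFunction (through the AggregationFunction class) rather than AggregateFunction.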
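The call order described above (createAccumulator, then add per element, optionally merge, finally getResult) can be traced in a small plain-Java sketch. The `Agg` interface below is a hypothetical stand-in that only mirrors the four method names; it is not Flink's AggregateFunction, and the `long[]{sum, count}` accumulator is an assumption chosen for brevity.

```java
import java.util.Arrays;
import java.util.List;

/** Traces the accumulator lifecycle with a hypothetical stand-in interface (not a Flink type). */
final class AggregateLifecycleSketch {

    interface Agg<IN, ACC, OUT> {
        ACC createAccumulator();
        ACC add(IN value, ACC accumulator);
        OUT getResult(ACC accumulator);
        ACC merge(ACC a, ACC b);
    }

    /** Average call duration; the accumulator is long[]{sum, count}. */
    static final class Average implements Agg<Long, long[], Double> {
        public long[] createAccumulator() { return new long[] {0L, 0L}; }
        public long[] add(Long value, long[] acc) { return new long[] {acc[0] + value, acc[1] + 1}; }
        public Double getResult(long[] acc) { return (double) acc[0] / acc[1]; }
        public long[] merge(long[] a, long[] b) { return new long[] {a[0] + b[0], a[1] + b[1]}; }
    }

    /** What happens inside one window: create the accumulator once, then add every element. */
    static <IN, ACC, OUT> ACC accumulate(Agg<IN, ACC, OUT> agg, List<IN> values) {
        ACC acc = agg.createAccumulator();
        for (IN value : values) {
            acc = agg.add(value, acc);
        }
        return acc;
    }

    public static void main(String[] args) {
        Average avg = new Average();
        long[] first = accumulate(avg, Arrays.asList(10L, 30L));   // one session window
        long[] second = accumulate(avg, Arrays.asList(50L));       // a second session window
        System.out.println(avg.getResult(first));                  // 20.0
        System.out.println(avg.getResult(avg.merge(first, second))); // 30.0 after the windows merge
    }
}
```

Merging sums and counts rather than finished averages is what keeps the result correct when two windows are combined.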
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 seconds.
- Java code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
//每隔5s统计每个基站所有主叫通话总时长,使用事件时间
keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(new AggregateFunction<StationLog, Tuple2<String,Long>, String>() {
//创建累加器
@Override
public Tuple2<String, Long> createAccumulator() {
return Tuple2.of("",0L);
}
//累加器的累加逻辑
@Override
public Tuple2<String, Long> add(StationLog value, Tuple2<String, Long> accumulator) {
return Tuple2.of(value.sid,accumulator.f1 + value.duration);
}
//获取结果
@Override
public String getResult(Tuple2<String, Long> accumulator) {
return "基站:"+accumulator.f0 + ",通话时长:" + accumulator.f1;
}
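//Merge two accumulators (only called when windows are merged, e.g. session windows)
@Override
public Tuple2<String, Long> merge(Tuple2<String, Long> a, Tuple2<String, Long> b) {
return Tuple2.of(a.f0.isEmpty() ? b.f0 : a.f0, a.f1 + b.f1);
}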
}).print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(new AggregateFunction[StationLog,(String,Long),String] {
//创建累加器
override def createAccumulator(): (String, Long) = ("", 0L)
//累加器累加
override def add(value: StationLog, accumulator: (String, Long)): (String, Long) =
(value.sid, accumulator._2 + value.duration)
//获取结果
override def getResult(accumulator: (String, Long)): String =
"基站:" + accumulator._1 + ",通话时长:" + accumulator._2
//合并累加器
override def merge(a: (String, Long), b: (String, Long)): (String, Long) = (a._1, a._2 + b._2)
}).print()
env.execute()
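The logic above is the same as in the ReduceFunction example. Because tumbling time windows are never merged, the merge method of the AggregateFunction is not called here, so its logic does not affect the result. After starting the program, enter the following data on the socket:
#Data entered on the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#The next record fires window [0~5000)
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record fires window [5000~10000)
003,181,183,busy,12000,50
The output is as follows: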
1> 基站:001,通话时长:40
7> 基站:002,通话时长:20
7> 基站:002,通话时长:100
1> 基站:001,通话时长:10
8> 基站:003,通话时长:50
process-ProcessWindowFunction
ProcessWindowFunction is a full-window function. It receives an Iterable with all elements of the window, together with the window, the key, and a Context object. Because it is a full-window function, all elements of the window have to be buffered until the window fires.
To use a ProcessWindowFunction, call process on the WindowedStream and pass it in. We introduced the process method in the Flink API chapter: it is an important part of Flink's low-level API, and many stream types accept a corresponding ProcessFunction implementation through process to carry out custom logic, as follows:
DataStream.process(new ProcessFunction...)
KeyedStream.process(new KeyedProcessFunction...)
ConnectedStreams.process(new CoProcessFunction...)
BroadcastConnectedStream.process(new BroadcastProcessFunction.../new KeyedBroadcastProcessFunction...)
WindowedStream.process(new ProcessWindowFunction...)
AllWindowedStream.process(new ProcessAllWindowFunction...)
#process on the result of an IntervalJoin, covered in a later chapter
KeyedStream.IntervalJoined.process(new ProcessJoinFunction...)
Accordingly, a ProcessWindowFunction is used as follows:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(t -> t.f0)
.window(TumblingEventTimeWindows.of(Time.minutes(5)))
.process(new MyProcessWindowFunction());
/* ... */
public class MyProcessWindowFunction
extends ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
@Override
public void process(String key, Context context, Iterable<Tuple2<String, Long>> input, Collector<String> out) {
long count = 0;
for (Tuple2<String, Long> in: input) {
count++;
}
out.collect("Window: " + context.window() + "count: " + count);
}
}
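ProcessWindowFunction is an abstract class with four type parameters, <IN, OUT, KEY, W extends Window>: the input type, the output type, the key type, and the window type. Its process method receives an Iterable over all elements of the window, and the context object gives access to the window, the current processing time, the current watermark, state, and related information. See the example below for details.
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 seconds.
- Java code: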
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
//每隔5s统计每个基站所有主叫通话总时长,使用事件时间
keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//Get the current watermark
long watermark = context.currentWatermark();
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime + ",此刻watermark:" + watermark);
}
}).print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String,
context: Context,
elements: Iterable[StationLog],
out: Collector[String]): Unit = {
// 获取当前的watermark
val watermark: Long = context.currentWatermark
// 统计每个基站所有主叫通话总时长
var sumCallTime: Long = 0L
for (element <- elements) {
sumCallTime += element.duration
}
// 获取窗口起始时间
val start: Long = if (context.window.getStart < 0) 0 else context.window.getStart
val end: Long = context.window.getEnd
out.collect(
s"窗口范围:[$start~$end),基站:$key,所有主叫通话总时长:$sumCallTime,此刻watermark:$watermark"
)
}
}).print()
env.execute()
After starting the program, enter the following data on the socket:
#Data entered on the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#The next record fires window [0~5000)
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record fires window [5000~10000)
003,181,183,busy,12000,50
The results are as follows:
1> 窗口范围:[0~5000),基站:001,所有主叫通话总时长:40,此刻watermark:4999
7> 窗口范围:[0~5000),基站:002,所有主叫通话总时长:20,此刻watermark:4999
1> 窗口范围:[5000~10000),基站:001,所有主叫通话总时长:10,此刻watermark:9999
8> 窗口范围:[5000~10000),基站:003,所有主叫通话总时长:50,此刻watermark:9999
7> 窗口范围:[5000~10000),基站:002,所有主叫通话总时长:100,此刻watermark:9999
apply-WindowFunction (legacy)
WindowFunction is also a full-window function, similar to ProcessWindowFunction but without a Context object, so it is less capable and can be fully replaced by ProcessWindowFunction. As of Flink 1.17 its use is discouraged, and it may be deprecated in a future Flink version.
Call apply on the WindowedStream and pass in a WindowFunction, as follows:
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.apply(new MyWindowFunction());
public class MyWindowFunction
implements WindowFunction<Tuple2<String, Long>, String, String, TimeWindow> {
@Override
public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) {
... ...
}
}
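Implementing the WindowFunction interface requires implementing its apply method. The window parameter gives access to window information, and the Iterable holds all elements of the window.
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 seconds.
- Java code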
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<Tuple2<String, Long>, String> keyedDStream = dsWithWatermark.map(new MapFunction<StationLog, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(StationLog stationLog) throws Exception {
return Tuple2.of(stationLog.sid, stationLog.duration);
}
}).keyBy(t -> t.f0);
keyedDStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new WindowFunction<Tuple2<String, Long>, String, String, TimeWindow>() {
@Override
public void apply(String key, TimeWindow window, Iterable<Tuple2<String, Long>> input, Collector<String> out) throws Exception {
long sum = 0L;
for (Tuple2<String, Long> tuple2 : input) {
sum += tuple2.f1;
}
out.collect("基站ID:" + key + "," +
"窗口范围:[" + window.getStart() + "," + window.getEnd() + "), 通话时长:" + sum);
}
}).print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//按照基站id进行分组,每隔5s窗口统计每个基站所有主叫通话总时长
dsWithWatermark.map(stationLog => (stationLog.sid, stationLog.duration))
.keyBy(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.apply(new WindowFunction[(String,Long),String,String,TimeWindow] {
override def apply(key: String, window: TimeWindow, input: Iterable[(String, Long)], out: Collector[String]): Unit = {
//获取窗口开始时间
val startTime: Long = window.getStart
//获取窗口结束时间
val endTime: Long = window.getEnd
//获取窗口中的数据
val datas: Iterator[(String, Long)] = input.iterator
//定义一个变量,用于统计通话总时长
var sumDuration = 0L
//遍历数据,统计通话总时长
for (data <- datas) {
sumDuration += data._2
}
//输出结果
out.collect(s"窗口起始时间:[${startTime}~${endTime}),基站id:${key},通话总时长:${sumDuration}")
}
}).print()
env.execute()
After the code is written and started, enter the following data on the socket:
#Data entered on the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#The next record fires window [0~5000)
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record fires window [5000~10000)
003,181,183,busy,12000,50
The output is as follows:
1> 窗口起始时间:[0~5000),基站id:001,通话总时长:40
7> 窗口起始时间:[0~5000),基站id:002,通话总时长:20
1> 窗口起始时间:[5000~10000),基站id:001,通话总时长:10
7> 窗口起始时间:[5000~10000),基站id:002,通话总时长:100
8> 窗口起始时间:[5000~10000),基站id:003,通话总时长:50
Combining Incremental and Full-Window Functions
The incremental functions ReduceFunction and AggregateFunction can also be combined with a full-window function (ProcessWindowFunction or WindowFunction), which gives access to additional information such as the window, the key, and the context. Usage is as follows:
#ReduceFunction combined with ProcessWindowFunction
DataStream<SensorReading> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.reduce(new ReduceFunction(...), new ProcessWindowFunction(...));
#AggregateFunction combined with ProcessWindowFunction
DataStream<Tuple2<String, Long>> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.aggregate(new AggregateFunction(...), new ProcessWindowFunction(...));
Here reduce and aggregate receive a ProcessWindowFunction in addition to the incremental ReduceFunction or AggregateFunction. When the window fires, the incrementally computed result is passed directly to the full-window function. The following examples demonstrate both combinations.
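The hand-off between the two functions can be sketched in plain Java without Flink. In the sketch below, `WindowState` and `WindowProcessor` are hypothetical stand-ins: the state keeps a single pre-aggregated value per window, and when the window "fires" the processing step receives an Iterable that contains only that one value, together with the key and window bounds.

```java
import java.util.Collections;
import java.util.function.BinaryOperator;

/** Pre-aggregate incrementally, then hand one value to a full-window step; names are hypothetical. */
final class PreAggregateThenProcessSketch {

    /** Stand-in for the full-window step: sees the key, the window bounds and the elements. */
    interface WindowProcessor<T> {
        String process(String key, long start, long end, Iterable<T> elements);
    }

    /** Per-window state: one reduced value instead of a buffer of all elements. */
    static final class WindowState<T> {
        private final BinaryOperator<T> reducer;
        private T current;   // null until the first element arrives

        WindowState(BinaryOperator<T> reducer) {
            this.reducer = reducer;
        }

        void add(T value) {
            current = (current == null) ? value : reducer.apply(current, value);
        }

        /** Firing passes a single-element Iterable to the processor. */
        String fire(String key, long start, long end, WindowProcessor<T> processor) {
            return processor.process(key, start, end, Collections.singletonList(current));
        }
    }

    /** Example processor: reports the value it received and how many elements the Iterable held. */
    static String describeMax(String key, long start, long end, Iterable<Long> elements) {
        int count = 0;
        long max = Long.MIN_VALUE;
        for (Long e : elements) {
            count++;
            max = Math.max(max, e);
        }
        return key + " [" + start + "~" + end + ") max=" + max + " elements=" + count;
    }

    public static void main(String[] args) {
        WindowState<Long> state = new WindowState<Long>(Long::max);
        for (long duration : new long[] {40L, 20L, 40L}) {
            state.add(duration);
        }
        System.out.println(state.fire("002", 5000L, 10000L, PreAggregateThenProcessSketch::describeMax));
    }
}
```

The processor sees exactly one element no matter how many records entered the window, which keeps the state small while still exposing the window metadata.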
1) ReduceFunction + ProcessWindowFunction
Example: read base-station log data from a socket and compute the maximum call duration of each base station every 5 seconds.
- Java code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//Key by base-station ID and compute each station's maximum call duration every 5s
KeyedStream<Tuple2<String, Long>, String> keyedDStream = dsWithWatermark.map(new MapFunction<StationLog, Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> map(StationLog stationLog) throws Exception {
return Tuple2.of(stationLog.sid, stationLog.duration);
}
}).keyBy(t -> t.f0);
keyedDStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.reduce(new ReduceFunction<Tuple2<String, Long>>() {
@Override
public Tuple2<String, Long> reduce(Tuple2<String, Long> value1, Tuple2<String, Long> value2) throws Exception {
//判断窗口中数据哪个通话时长长?
return value1.f1 > value2.f1 ? value1 : value2;
}
}, new ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<Tuple2<String, Long>, String, String, TimeWindow>.Context context,
Iterable<Tuple2<String, Long>> elements,
Collector<String> out) throws Exception {
//获取窗口信息
TimeWindow window = context.window();
//获取窗口开始时间
long start = window.getStart() <0 ? 0 : window.getStart();
//获取窗口结束时间
long end = window.getEnd();
//获取窗口中数据
Long maxDuration = elements.iterator().next().f1;
//输出结果
out.collect("窗口起始时间:[" + start + "~" + end + "),基站ID:" + key + ",通话最大时长:" +maxDuration);
}
})
.print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//Import implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Input format on the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a 2s out-of-orderness bound
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//Event-time extractor
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//Idle timeout so that idle parallel instances do not hold back the watermark
.withIdleness(Duration.ofSeconds(5))
)
//Key by base-station id and compute each station's maximum call duration every 5s
dsWithWatermark.map(stationLog => (stationLog.sid, stationLog.duration))
.keyBy(_._1)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.reduce(new ReduceFunction[(String, Long)] {
//Keep whichever element has the longer call duration
override def reduce(value1: (String, Long), value2: (String, Long)): (String, Long) =
if (value1._2 > value2._2) value1 else value2
}, new ProcessWindowFunction[(String, Long), String, String, TimeWindow] {
override def process(key: String,
context: Context,
elements: Iterable[(String, Long)],
out: Collector[String]): Unit = {
//Window start and end
val window: TimeWindow = context.window
val start: Long = if (window.getStart < 0) 0 else window.getStart
val end: Long = window.getEnd
//The Iterable holds only the single pre-aggregated element
val maxDuration: Long = elements.iterator.next()._2
out.collect(s"窗口起始时间:[$start~$end),基站ID:$key,通话最大时长:$maxDuration")
}
}).print()
env.execute()
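In the code above, the ReduceFunction incrementally aggregates the window down to a single element; when the window fires, that element is passed to the ProcessWindowFunction, so its Iterable contains exactly one element. After starting the program, enter the following data on the socket:
#Data entered on the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#The next record fires window [0~5000)
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record fires window [5000~10000)
003,181,183,busy,12000,50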
Given the code and input above, the expected results are as follows (subtask prefixes and line order may vary):
1> 窗口起始时间:[0~5000),基站ID:001,通话最大时长:30
7> 窗口起始时间:[0~5000),基站ID:002,通话最大时长:20
1> 窗口起始时间:[5000~10000),基站ID:001,通话最大时长:10
7> 窗口起始时间:[5000~10000),基站ID:002,通话最大时长:40
8> 窗口起始时间:[5000~10000),基站ID:003,通话最大时长:50
2) AggregateFunction + ProcessWindowFunction
Example: read base-station log data from a socket and compute the average call duration of each base station every 5 seconds.
- Java code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
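// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark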
.withIdleness(Duration.ofSeconds(5))
);
// Key the stream by base-station ID
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
// Compute the average call duration per base station every 5 s, using event time
keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(new AggregateFunction<StationLog, Tuple2<Long, Long>, Tuple2<Long, Long>>() {
// Create the accumulator: (sum of durations, record count)
@Override
public Tuple2<Long, Long> createAccumulator() {
return Tuple2.of(0L, 0L);
}
// Add one record to the accumulator
@Override
public Tuple2<Long, Long> add(StationLog value, Tuple2<Long, Long> accumulator) {
return Tuple2.of(accumulator.f0 + value.duration, accumulator.f1 + 1L);
}
// Return the accumulated (sum, count) pair
@Override
public Tuple2<Long, Long> getResult(Tuple2<Long, Long> accumulator) {
return accumulator;
}
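// Merge two accumulators (called when windows are merged, e.g. session windows)
@Override
public Tuple2<Long, Long> merge(Tuple2<Long, Long> a, Tuple2<Long, Long> b) {
return Tuple2.of(a.f0 + b.f0, a.f1 + b.f1);
}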
}, new ProcessWindowFunction<Tuple2<Long, Long>, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<Tuple2<Long, Long>, String, String, TimeWindow>.Context context,
Iterable<Tuple2<Long, Long>> elements,
Collector<String> out) throws Exception {
// Read the (sum, count) pair; note that f0 / f1 below is long division, so the average is truncated
Tuple2<Long, Long> tuple2 = elements.iterator().next();
out.collect("基站ID:" + key + "," +
"窗口范围:[" + context.window().getStart() + "," + context.window().getEnd() + ")," +
"平均通话时长:" + Double.valueOf(tuple2.f0 / tuple2.f1));
}
}).print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Import implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station ID and compute the average call duration per base station every 5 s
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.aggregate(new AggregateFunction[StationLog,(Long,Long),(Long,Long)] {
// Create the accumulator: (sum of durations, record count)
override def createAccumulator(): (Long, Long) = (0L,0L)
// Add one record to the accumulator
override def add(value: StationLog, accumulator: (Long, Long)): (Long, Long) =
(accumulator._1 + value.duration,accumulator._2 + 1L)
// Return the accumulated (sum, count) pair
override def getResult(accumulator: (Long, Long)): (Long, Long) = accumulator
// Merge two accumulators
override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) = (a._1 + b._1,a._2 + b._2)
},new ProcessWindowFunction[(Long,Long),String,String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[(Long, Long)], out: Collector[String]): Unit = {
val avgDuration: Double = (elements.head._1 / elements.head._2).toDouble
out.collect("基站ID:" + key + "," + "窗口范围:[" + context.window.getStart + "," + context.window.getEnd + ")," + "平均通话时长:" + avgDuration)
}
}).print()
env.execute()
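In the code above, AggregateFunction aggregates incrementally as records arrive, so when the window fires only a single pre-aggregated record is handed to ProcessWindowFunction; its Iterable therefore contains exactly one element. After starting the job, enter the following data into the socket:
# Data entered into the socket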
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
# The next record fires window [0~5000)
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
# The next record fires window [5000~10000)
003,181,183,busy,12000,50
The output is:
7> 基站ID:002,窗口范围:[0,5000),平均通话时长:20.0
1> 基站ID:001,窗口范围:[0,5000),平均通话时长:20.0
1> 基站ID:001,窗口范围:[5000,10000),平均通话时长:10.0
8> 基站ID:003,窗口范围:[5000,10000),平均通话时长:50.0
7> 基站ID:002,窗口范围:[5000,10000),平均通话时长:33.0
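The (sum, count) accumulator used in this example can be sketched in plain Java without any Flink dependency. The class name AvgAccumulatorSketch and its static methods are hypothetical names chosen for illustration; they only mirror the callbacks of AggregateFunction and are not Flink code. The sketch also shows why base station 002 is reported as 33.0 above: dividing two long values truncates before the conversion to double.

```java
// Illustrative sketch only: a plain-Java stand-in for the (sum, count) accumulator.
// The class and method names are hypothetical and not part of Flink.
final class AvgAccumulatorSketch {
    // Accumulator layout: acc[0] = sum of durations, acc[1] = number of records.
    static long[] createAccumulator() {
        return new long[] {0L, 0L};
    }

    // Fold one duration into the accumulator.
    static long[] add(long duration, long[] acc) {
        return new long[] {acc[0] + duration, acc[1] + 1L};
    }

    // Combine two partial accumulators into one.
    static long[] merge(long[] a, long[] b) {
        return new long[] {a[0] + b[0], a[1] + b[1]};
    }

    // Truncating average: long / long first, then widened to double.
    static double truncatedAverage(long[] acc) {
        return (double) (acc[0] / acc[1]);
    }

    // Exact average: convert to double before dividing.
    static double exactAverage(long[] acc) {
        return (double) acc[0] / acc[1];
    }

    public static void main(String[] args) {
        // Durations of base station 002 in window [5000,10000): 40, 20, 40.
        long[] acc = createAccumulator();
        for (long duration : new long[] {40L, 20L, 40L}) {
            acc = add(duration, acc);
        }
        System.out.println("truncated: " + truncatedAverage(acc));
        System.out.println("exact:     " + exactAverage(acc));
    }
}
```

If a fractional average is wanted, the exact variant is the one to adopt; the truncated variant is kept here only because it reproduces the output shown above.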
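Allowed Lateness
Earlier chapters covered Flink's watermark mechanism: a watermark measures the progress of event time, and once a window is defined on an out-of-order stream, the window fires according to the watermark. In most cases the watermark delay is chosen to cover the out-of-orderness of the bulk of the stream, but some records may be delayed so badly that even the watermark delay cannot wait for all of them to enter the window before it is processed. By default, when a record belonging to a window arrives after that window has fired, Flink discards it. The following example illustrates the problem.
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 s.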
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
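// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark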
.withIdleness(Duration.ofSeconds(5))
);
// Key the stream by base-station ID
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
// Compute the total call duration per base station every 5 s, using event time
SingleOutputStreamOperator<String> result = keyedStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
// Sum the call durations of this base station
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
// Window start time (clamped to 0) and end time
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
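The code above derives the watermark from the maximum event time minus a 2-second delay, and a window fires every 5 seconds. After starting the job, enter the following data into the socket:
# Data entered into the socket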
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
# The next record fires window [0~5000)
001,181,182,busy,7000,10
# Severely late records for window [0~5000) arrive and are dropped
002,182,183,fail,1000,20
001,183,184,busy,2000,30
# The next record fires window [5000~10000)
003,181,183,busy,12000,50
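This example shows that once window [0~5000) has fired, it is evaluated and then destroyed; records that belong to it but arrive afterwards are dropped by Flink by default. Sometimes, however, users want late records to be processed normally and reflected in the output. In that case the Allowed Lateness mechanism can be used to give late records additional handling.
Flink's Allowed Lateness mechanism lets you specify a lateness interval with the allowedLateness method. After a window fires, it is not destroyed during that interval, and late records that belong to it still trigger the window computation. Only when the watermark advances to "window end time + allowed lateness" is the window destroyed; records that arrive even later are dropped by default. Allowed Lateness is used as follows: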
DataStream<T> input = ...;
input
.keyBy(<key selector>)
.window(<window assigner>)
.allowedLateness(<time>)
.<windowed transformation>(<window function>);
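Suppose we use 10-minute tumbling windows with a watermark delay of 2 minutes and, to make sure late records are still counted in the right window, an allowed lateness of 3 minutes (allowedLateness(Time.minutes(3))). Consider the window [08:00:00~08:10:00). When the watermark reaches 08:10:00 (the maximum event time seen by Flink is then 08:12:00), the window fires. Because allowed lateness is set, the window is not destroyed after firing but kept for another 3 minutes. Only when the watermark reaches 08:13:00 (maximum event time 08:15:00) is the window actually destroyed. In between, every record that arrives for this window fires the window again.
This scenario shows that the watermark delay (2 minutes) is sized for the out-of-orderness of the vast majority of records and addresses the out-of-order problem (itself a form of lateness), whereas allowedLateness waits a little longer, on top of that, for records that are "even later", so that they are still counted in the correct window. The following example shows how to use and test allowedLateness.
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 s.
As before, the watermark delay is 2 seconds; in addition, allowedLateness is set to 2 seconds.
- Java code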
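The timeline of the 10-minute scenario can be checked with a short calculation. The sketch below is a simplified model under stated assumptions, not Flink's implementation: the class LatenessTimelineSketch and its helpers are hypothetical names, the watermark is modelled as "maximum event time - delay - 1 ms", a window is assumed to fire once the watermark reaches its last timestamp (end - 1 ms), and its state is assumed to be removed once the watermark reaches that timestamp plus the allowed lateness.

```java
// Illustrative model of window firing and cleanup times; all names are hypothetical.
final class LatenessTimelineSketch {
    // Watermark under bounded out-of-orderness: max event time - delay - 1 ms.
    static long watermark(long maxEventTime, long delay) {
        return maxEventTime - delay - 1;
    }

    // Smallest max event time at which the window first fires
    // (watermark >= windowEnd - 1).
    static long fireAt(long windowEnd, long delay) {
        return windowEnd + delay;
    }

    // Smallest max event time at which the window is cleaned up
    // (watermark >= windowEnd - 1 + lateness).
    static long cleanupAt(long windowEnd, long delay, long lateness) {
        return windowEnd + delay + lateness;
    }

    // Format milliseconds since midnight as HH:mm:ss.
    static String clock(long millisOfDay) {
        long totalSeconds = millisOfDay / 1000;
        return String.format("%02d:%02d:%02d",
                totalSeconds / 3600, (totalSeconds % 3600) / 60, totalSeconds % 60);
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long windowEnd = (8 * 60 + 10) * minute; // 08:10:00
        long delay = 2 * minute;                 // watermark delay
        long lateness = 3 * minute;              // allowed lateness
        System.out.println("fires at max event time    " + clock(fireAt(windowEnd, delay)));
        System.out.println("cleaned at max event time  " + clock(cleanupAt(windowEnd, delay, lateness)));
    }
}
```

Under these assumptions the 1 ms offsets cancel out, which is why the firing and cleanup points land on whole minutes.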
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
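// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark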
.withIdleness(Duration.ofSeconds(5))
);
// Key the stream by base-station ID
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
// Compute the total call duration per base station every 5 s, using event time
SingleOutputStreamOperator<String> result = keyedStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
// Allow 2 s of lateness on top of the watermark delay
.allowedLateness(Time.seconds(2))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
// Sum the call durations of this base station
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
// Window start time (clamped to 0) and end time
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Import implicit conversions
import org.apache.flink.streaming.api.scala._
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station ID and compute the total call duration per base station in 5 s windows
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
// Keep each window for 2 s beyond the watermark so late records can still fire it
.allowedLateness(Time.seconds(2))
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
// Window start and end times
val startTime: Long = context.window.getStart
val endTime: Long = context.window.getEnd
// Sum the call durations of this base station
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
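The code above sets allowedLateness to 2 seconds: after the watermark fires a window, the window is kept for 2 more seconds before it is destroyed. Any record that arrives for the window during that time fires it again, until the watermark reaches "window firing time + 2 seconds", at which point the window is destroyed.
After starting the job, enter the following data into the socket:
# Data entered into the socket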
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
# The next record fires window [0~5000)
001,181,182,busy,7000,10
# Late records for window [0~5000) arrive; each one fires the window again
002,182,183,fail,1000,20
002,182,183,fail,3000,20
001,183,184,busy,2000,30
# After the next record the watermark is 9000 - 2000 - 1 = 6999, the cleanup time of window [0~5000) (4999 + 2000)
003,181,183,busy,9000,50
# Further records for window [0~5000) no longer fire the window and are dropped
002,182,183,fail,1000,20
001,183,184,busy,2000,30
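The fate of each record in the input above can be replayed with a small plain-Java model. This is a sketch under assumptions rather than Flink's implementation: AllowedLatenessSketch, Fate, classify, and replay are hypothetical names; the watermark is modelled as "maximum event time - delay - 1 ms" and is updated immediately after each record; a record is treated as dropped once the watermark has reached the last timestamp of its window plus the allowed lateness.

```java
// Illustrative replay of tumbling windows with allowed lateness; names are hypothetical.
final class AllowedLatenessSketch {
    enum Fate { ON_TIME, LATE_BUT_ACCEPTED, DROPPED }

    // Classify a record with timestamp ts (>= 0) against the current watermark.
    static Fate classify(long ts, long size, long lateness, long watermark) {
        long lastTimestamp = ts - (ts % size) + size - 1; // last timestamp of the record's window
        if (lastTimestamp + lateness <= watermark) {
            return Fate.DROPPED;            // window state already cleaned up
        }
        if (lastTimestamp <= watermark) {
            return Fate.LATE_BUT_ACCEPTED;  // window already fired, but is still kept
        }
        return Fate.ON_TIME;
    }

    // Replay timestamps in arrival order and classify each one.
    static Fate[] replay(long[] timestamps, long size, long delay, long lateness) {
        Fate[] fates = new Fate[timestamps.length];
        long watermark = Long.MIN_VALUE; // no watermark seen yet
        for (int i = 0; i < timestamps.length; i++) {
            fates[i] = classify(timestamps[i], size, lateness, watermark);
            watermark = Math.max(watermark, timestamps[i] - delay - 1);
        }
        return fates;
    }

    public static void main(String[] args) {
        // callTime column of the socket input above, in arrival order.
        long[] callTimes = {1000, 3000, 2000, 6000, 5000, 7000, 1000, 3000, 2000, 9000, 1000, 2000};
        Fate[] fates = replay(callTimes, 5000, 2000, 2000);
        for (int i = 0; i < callTimes.length; i++) {
            System.out.println(callTimes[i] + " -> " + fates[i]);
        }
    }
}
```

Calling replay with a lateness of 0 models the default behaviour, in which every record that arrives after its window has fired is dropped.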
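Side output for late data (sideOutputLateData)
The Allowed Lateness section showed that even with allowed lateness configured, records that arrive for a window after the watermark has reached "window firing time + allowed lateness" are still dropped. In other words, allowedLateness cannot guarantee that Flink handles extremely late events correctly. If we do not want to drop such events but keep them separately for later processing, we need Flink's side output mechanism.
Side outputs were covered in an earlier chapter, and they can also be used to handle extremely late events. On a windowed stream, call sideOutputLateData(lateOutputTag) to tag the extremely late records, then call getSideOutput(lateOutputTag) on the result to obtain them as a separate DataStream for further processing. This requires creating an OutputTag (named late-data here), which is then used to pull the late records out of the stream: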
final OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};
DataStream<T> input = ...;
SingleOutputStreamOperator<T> result = input
.keyBy(<key selector>)
.window(<window assigner>)
.allowedLateness(<time>)
.sideOutputLateData(lateOutputTag)
.<windowed transformation>(<window function>);
DataStream<T> lateStream = result.getSideOutput(lateOutputTag);
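The Allowed Lateness example is reused below to demonstrate sideOutputLateData.
Example: read base-station log data from a socket and compute the total call duration of each base station every 5 s.
As before, the watermark delay is 2 seconds and allowedLateness is 2 seconds; extremely late records are collected through a side output.
- Java code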
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
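// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark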
.withIdleness(Duration.ofSeconds(5))
);
// Key the stream by base-station ID
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
// Define the side-output tag; the trailing {} creates an anonymous subclass so the generic type survives erasure
OutputTag<StationLog> lateOutputTag = new OutputTag<StationLog>("late-data"){};
// Compute the total call duration per base station every 5 s, using event time
SingleOutputStreamOperator<String> result = keyedStream
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
// Allow 2 s of lateness on top of the watermark delay
.allowedLateness(Time.seconds(2))
// Collect records that are too late through the side output
.sideOutputLateData(lateOutputTag)
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
// Sum the call durations of this base station
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
// Window start time (clamped to 0) and end time
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print("正常窗口数据");
// Read the late records from the side output
result.getSideOutput(lateOutputTag).print("迟到数据");
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Import implicit conversions
import org.apache.flink.streaming.api.scala._
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = element.callTime
})
// Treat a subtask as idle after 5 s without records so it does not hold back the watermark
.withIdleness(Duration.ofSeconds(5))
)
val lateOutputTag = new OutputTag[StationLog]("late-data")
// Key by base-station ID and compute the total call duration per base station in 5 s windows
val result: DataStream[String] = dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
// Keep each window for 2 s beyond the watermark so late records can still fire it
.allowedLateness(Time.seconds(2))
// Collect records that are too late through the side output
.sideOutputLateData(lateOutputTag)
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
// Window start and end times
val startTime: Long = context.window.getStart
val endTime: Long = context.window.getEnd
// Sum the call durations of this base station
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
})
// Print the regular window results
result.print("正常窗口数据")
// Read the late records from the side output
result.getSideOutput(lateOutputTag).print("迟到的数据")
env.execute()
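Note that in the Java API the OutputTag must be created as "OutputTag<T> lateOutputTag = new OutputTag<T>("late-data"){};", with the trailing {}, to avoid type erasure. After starting the job, enter the following data into the socket:
# Data entered into the socket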
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
# The next record fires window [0~5000)
001,181,182,busy,7000,10
# Late records for window [0~5000) arrive; each one fires the window again
002,182,183,fail,1000,20
002,182,183,fail,3000,20
001,183,184,busy,2000,30
# After the next record the watermark is 9000 - 2000 - 1 = 6999, the cleanup time of window [0~5000) (4999 + 2000)
003,181,183,busy,9000,50
# Further records for window [0~5000) no longer fire the window and go to the side output
002,182,183,fail,1000,20
001,183,184,busy,2000,30
The output is:
正常窗口数据:7> 窗口范围:[0~5000),基站:002,所有主叫通话总时长:20
正常窗口数据:1> 窗口范围:[0~5000),基站:001,所有主叫通话总时长:40
正常窗口数据:7> 窗口范围:[0~5000),基站:002,所有主叫通话总时长:40
正常窗口数据:7> 窗口范围:[0~5000),基站:002,所有主叫通话总时长:60
正常窗口数据:1> 窗口范围:[0~5000),基站:001,所有主叫通话总时长:70
迟到数据:7> StationLog{sid='002', callOut='182', callIn='183', callType='fail', callTime=1000, duration=20}
迟到数据:1> StationLog{sid='001', callOut='183', callIn='184', callType='busy', callTime=2000, duration=30}
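Combining streams under event time
Earlier chapters introduced the Union and Connect operators, which merge multiple streams in a simple way. Once event time and watermarks are involved, merging streams becomes more complex and challenging, because event-time processing has to account for event time and the watermark mechanism, for example when an event-time trigger fires on an out-of-order stream.
With windows, multiple streams can also be joined and windowed to implement higher-level operations such as Window Join, Interval Join, and Window CoGroup. Here, too, the effect of event time and watermarks needs particular attention. The examples below illustrate combining streams under event time and its characteristics.
Union
Union merges data streams of the same type into a new DataStream that contains all elements of the two or more input streams, with the data type unchanged. Once event time and watermarks are introduced, each input stream has its own watermark. As described earlier for watermark propagation, after the merge Flink uses the smaller of the input watermarks as the watermark.
Example: read two streams from sockets, union them, apply a window, and count the calls of each base station every 5 seconds.
- Java code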
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Set parallelism to 1 to make testing easier
env.setParallelism(1);
/**
* Read the socket stream as stream A and assign watermarks to it
*/
SingleOutputStreamOperator<StationLog> ADS = env.socketTextStream("node5", 8888)
.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> AdsWithWatermark = ADS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
);
/**
* Read the socket stream as stream B and assign watermarks to it
*/
SingleOutputStreamOperator<StationLog> BDS = env.socketTextStream("node5", 9999)
.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> BdsWithWatermark = BDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
);
// Union the two streams
AdsWithWatermark.union(BdsWithWatermark)
.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
})
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
System.out.println("window-watermark:" + context.currentWatermark());
// Window start time (clamped to 0)
long start = context.window().getStart()<0?0:context.window().getStart();
// Window end time
long end = context.window().getEnd();
// Count the calls in the window
int count = 0;
for (StationLog element : elements) {
count++;
}
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",通话总次数:" + count);
}
}).print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Set parallelism to 1 so the effect is easy to observe
env.setParallelism(1)
// Import implicit conversions
import org.apache.flink.streaming.api.scala._
// Read two streams from sockets
val ADS: DataStream[StationLog] = env.socketTextStream("node5", 8888).map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stream A
val adsWithWatermark: DataStream[StationLog] = ADS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
)
val BDS: DataStream[StationLog] = env.socketTextStream("node5", 9999).map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stream B
val bdsWithWatermark: DataStream[StationLog] = BDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
)
// Union the two streams
adsWithWatermark.union(bdsWithWatermark)
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[StationLog,String,String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
println("window-watermark:" + context.currentWatermark)
// Window start time (clamped to 0)
val startTime: Long = if(context.window.getStart<0) 0 else context.window.getStart
// Window end time
val endTime: Long = context.window.getEnd
// Count the calls in the window
val count: Int = elements.size
// Emit the result
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",通话总次数:" + count)
}
}).print()
env.execute()
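Note that the code above no longer sets withIdleness(Duration.ofSeconds(5)) and instead sets the overall parallelism to 1 to make testing easier; with the idleness timeout in place, an input that stays silent for 5 seconds would be marked idle and the watermark would advance on its own. Before testing, start the socket-8888 and socket-9999 services and enter data as follows:
# Input for socket-8888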
001,181,182,busy,1000,10
002,182,183,fail,2000,20
002,182,183,fail,3000,20
# Input for socket-9999
001,183,184,busy,3000,30
002,184,185,busy,4000,40
001,181,183,busy,5000,50
# Input for socket-8888; the window does not fire
001,181,182,busy,7000,10
# Input for socket-9999; the window fires
001,181,182,busy,7000,10
Once "002,182,183,fail,3000,20" has been entered into socket-8888 and "001,181,183,busy,5000,50" into socket-9999, the watermark is the smaller of the two stream watermarks: 3000-2000-1 = 999. Entering "001,181,182,busy,7000,10" into socket-8888 raises the watermark to 5000-2000-1 = 2999, which is now bounded by the socket-9999 stream. Entering "001,181,182,busy,7000,10" into socket-9999 raises it to 7000-2000-1 = 4999, and window [0~5000) fires. The result is:
window-watermark:4999
窗口范围:[0~5000),基站:001,通话总次数:2
window-watermark:4999
窗口范围:[0~5000),基站:002,通话总次数:3
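The "smaller watermark wins" rule can be replayed on the input above with a minimal plain-Java model. The class MinWatermarkSketch and its methods are hypothetical names for illustration and do not reproduce Flink internals; each input's watermark is modelled as "maximum event time - delay - 1 ms", and the merged watermark as the minimum of the two.

```java
// Illustrative model of watermark merging for two inputs; names are hypothetical.
final class MinWatermarkSketch {
    private final long delay;
    private long watermarkA = Long.MIN_VALUE; // no watermark from stream A yet
    private long watermarkB = Long.MIN_VALUE; // no watermark from stream B yet

    MinWatermarkSketch(long delay) {
        this.delay = delay;
    }

    // A record with event time ts arrives on stream A.
    void onA(long ts) {
        watermarkA = Math.max(watermarkA, ts - delay - 1);
    }

    // A record with event time ts arrives on stream B.
    void onB(long ts) {
        watermarkB = Math.max(watermarkB, ts - delay - 1);
    }

    // The merged stream can only be as far along as its slowest input.
    long combined() {
        return Math.min(watermarkA, watermarkB);
    }

    public static void main(String[] args) {
        MinWatermarkSketch wm = new MinWatermarkSketch(2000);
        wm.onA(1000); wm.onA(2000); wm.onA(3000);
        wm.onB(3000); wm.onB(4000); wm.onB(5000);
        System.out.println("after first batches: " + wm.combined());
        wm.onA(7000);
        System.out.println("after A sees 7000:   " + wm.combined());
        wm.onB(7000);
        System.out.println("after B sees 7000:   " + wm.combined());
    }
}
```

In this model window [0~5000) can fire only after the last step, because only then does the merged watermark reach 4999, the last timestamp of the window.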
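Connect
The connect operator takes two DataStreams, which may have different data types, and joins them into a ConnectedStreams object. Unlike union, which simply merges streams of the same type, connect can combine DataStreams of different types, but it can only connect two streams.
As with Union, once event time and watermarks are introduced each stream has its own watermark, and after the streams are connected Flink uses the smaller of the input watermarks as the watermark.
Example 1: read two streams from sockets, connect them, and observe the watermark.
- Java code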
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Set parallelism to 1 to make testing easier
env.setParallelism(1);
/**
* Read the socket stream as stream A and assign watermarks to it
* Format: 001,181,182,busy,1000,10
*/
SingleOutputStreamOperator<String> ADS = env.socketTextStream("node5", 8888)
.map(new MapFunction<String, String>() {
@Override
public String map(String s) throws Exception {
String[] arr = s.split(",");
// Return the trimmed fields joined back into one string
return arr[0].trim() + "," +
arr[1].trim() + "," +
arr[2].trim() + "," +
arr[3].trim() + "," +
arr[4].trim() + "," +
arr[5].trim();
}
});
// Assign timestamps and watermarks
SingleOutputStreamOperator<String> AdsWithWatermark = ADS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use the fifth field as the event timestamp
.withTimestampAssigner((str, timestamp) -> Long.valueOf(str.split(",")[4]))
);
/**
* Read the socket stream as stream B and assign watermarks to it
* Format: 1,3000
*/
SingleOutputStreamOperator<String> BDS = env.socketTextStream("node5", 9999);
// Assign timestamps and watermarks
SingleOutputStreamOperator<String> BdsWithWatermark = BDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use the second field as the event timestamp
.withTimestampAssigner((str, timestamp) -> Long.valueOf(str.split(",")[1]))
);
// Connect the two streams
AdsWithWatermark.connect(BdsWithWatermark)
.process(new CoProcessFunction<String, String, String>() {
@Override
public void processElement1(String value,
CoProcessFunction<String, String, String>.Context ctx,
Collector<String> out) throws Exception {
out.collect("A流数据:" + value + ",当前watermark:" + ctx.timerService().currentWatermark());
}
@Override
public void processElement2(String value,
CoProcessFunction<String, String, String>.Context ctx,
Collector<String> out) throws Exception {
out.collect("B流数据:" + value + ",当前watermark:" + ctx.timerService().currentWatermark());
}
}).print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Set parallelism to 1 so the effect is easy to observe
env.setParallelism(1)
// Import implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Read the socket stream as stream A and assign watermarks to it
* Format: 001,181,182,busy,1000,10
*/
val ADS: DataStream[String] = env.socketTextStream("node5", 8888)
.map(line=>{
val arr = line.split(",")
// Return the trimmed fields joined back into one string
arr(0).trim() + "," +
arr(1).trim() + "," +
arr(2).trim() + "," +
arr(3).trim() + "," +
arr(4).trim() + "," +
arr(5).trim()
})
// Assign timestamps and watermarks
val AdsWithWatermark: DataStream[String] = ADS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
// Use the fifth field as the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long = {
str.split(",")(4).toLong
}
})
)
/**
* Read the socket stream as stream B and assign watermarks to it
* Format: 1,3000
*/
val BDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Assign timestamps and watermarks
val BdsWithWatermark: DataStream[String] = BDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
// Use the second field as the event timestamp
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long = {
str.split(",")(1).toLong
}
})
)
// Connect the two streams
AdsWithWatermark.connect(BdsWithWatermark)
.process(new CoProcessFunction[String, String, String]() {
override def processElement1(value: String,
ctx: CoProcessFunction[String, String, String]#Context,
out: Collector[String]): Unit = {
out.collect("A流数据:" + value + ",当前watermark:" + ctx.timerService().currentWatermark())
}
override def processElement2(value: String,
ctx: CoProcessFunction[String, String, String]#Context,
out: Collector[String]): Unit = {
out.collect("B流数据:" + value + ",当前watermark:" + ctx.timerService().currentWatermark())
}
}).print()
env.execute()
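Note that the code above no longer sets withIdleness(Duration.ofSeconds(5)) and instead sets the overall parallelism to 1 to make testing easier; with the idleness timeout in place, an input that stays silent for 5 seconds would be marked idle and the watermark would advance on its own. Before testing, start the socket-8888 and socket-9999 services and enter data as follows:
# Input for socket-8888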
001,181,182,busy,1000,10
002,182,183,fail,2000,20
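# Input for socket-9999, watermark becomes -1 (enter the line twice to see the effect)
001,3000
# Input for socket-8888, watermark becomes 999 (enter the line twice to see the effect)
002,182,183,fail,3000,20
# Input for socket-9999, watermark does not change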
002,4000
001,5000
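After the first two records are entered into socket-8888, the watermark does not change: it stays at Long.MIN_VALUE because the socket-9999 stream has not produced a watermark yet. When "001,3000" is entered into socket-9999, the watermark is the smaller of the two stream watermarks, that is 2000-2000-1 = -1. The line is entered twice because the watermark printed with a record is the one that was in effect before that record was processed. When "002,182,183,fail,3000,20" is then entered into socket-8888, both streams have a watermark of 3000-2000-1 = 999. Further input into socket-9999 leaves the watermark at 999, because the socket-8888 stream receives no new data. The output is: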
A流数据:001,181,182,busy,1000,10,当前watermark:-9223372036854775808
A流数据:002,182,183,fail,2000,20,当前watermark:-9223372036854775808
B流数据:001,3000,当前watermark:-9223372036854775808
B流数据:001,3000,当前watermark:-1
A流数据:002,182,183,fail,3000,20,当前watermark:-1
A流数据:002,182,183,fail,3000,20,当前watermark:999
B流数据:002,4000,当前watermark:999
B流数据:001,5000,当前watermark:999
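Example 2: read an order stream and a payment stream, and raise an alert when an order is not paid within a given time.
In this example, socket-8888 and socket-9999 provide an order stream and a payment stream, which are connected. If an order arrives and no matching payment arrives within 5 s, an alert is emitted. Because events can arrive in either order, a payment may also arrive before its order; if the order does not follow within 5 s, an alert is emitted as well. Implementing this requires keyed state and timers, and the example shows that, with event time and watermarks, the moment a timer fires is determined by the watermark.
- Java code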
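The statement that the watermark decides when a timer fires can be illustrated with a small plain-Java model. This is only a sketch: EventTimeTimerSketch is a hypothetical class, it ignores keys (real timers are scoped to a key), and it assumes that an event-time timer fires as soon as the watermark is greater than or equal to the timer's timestamp.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeSet;

// Illustrative model of event-time timers driven by a watermark; names are hypothetical.
final class EventTimeTimerSketch {
    // Registered timer timestamps, kept sorted and de-duplicated.
    private final TreeSet<Long> timers = new TreeSet<>();

    // Register a timer for the given event time.
    void register(long time) {
        timers.add(time);
    }

    // Delete a previously registered timer (no-op if it does not exist).
    void delete(long time) {
        timers.remove(time);
    }

    // Advance the watermark and return the timers that fire, in timestamp order.
    List<Long> advanceWatermark(long watermark) {
        List<Long> fired = new ArrayList<>();
        while (!timers.isEmpty() && timers.first() <= watermark) {
            fired.add(timers.pollFirst());
        }
        return fired;
    }

    public static void main(String[] args) {
        EventTimeTimerSketch service = new EventTimeTimerSketch();
        // An order with timestamp 1000 registers an alert timer 5 s later.
        service.register(1000 + 5 * 1000L);
        System.out.println(service.advanceWatermark(5999)); // nothing fires yet
        System.out.println(service.advanceWatermark(6000)); // the alert timer fires
    }
}
```

With a 2-second watermark delay, the watermark reaches 6000 only after both streams have seen an event time of at least 8001, so in this model the alert for an order stamped 1000 appears later than "5 seconds after the order" in terms of input timestamps.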
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// Set parallelism to 1 to make testing easier
env.setParallelism(1);
/**
* Read the order stream from a socket and assign watermarks to it
* Order record format: order ID, user ID, order amount, timestamp
* order1,user_1,10,1000
* order2,user_2,20,2000
* order3,user_3,30,3000
*/
SingleOutputStreamOperator<String> orderDS = env.socketTextStream("node5", 8888);
// Assign timestamps and watermarks
SingleOutputStreamOperator<String> orderDSWithWatermark = orderDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use the fourth field as the event timestamp
.withTimestampAssigner((orderInfo, timestamp) -> Long.valueOf(orderInfo.split(",")[3]))
);
/**
* Read the payment stream from a socket and assign watermarks to it
* Payment record format: order ID, payment amount, timestamp
* order1,10,1000
* order2,20,2000
* order3,30,3000
*/
SingleOutputStreamOperator<String> payDS = env.socketTextStream("node5", 9999);
// Assign timestamps and watermarks
SingleOutputStreamOperator<String> payDSWithWatermark = payDS.assignTimestampsAndWatermarks(
// Bounded out-of-orderness watermark with a 2 s delay
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Use the third field as the event timestamp
.withTimestampAssigner((payInfo, timestamp) -> Long.valueOf(payInfo.split(",")[2]))
);
// Key both streams by order ID and connect them
orderDSWithWatermark.keyBy(orderInfo -> orderInfo.split(",")[0])
.connect(payDSWithWatermark.keyBy(payInfo -> payInfo.split(",")[0]))
.process(new KeyedCoProcessFunction<String, String, String, String>() {
// Order state: holds the order record
private ValueState<String> orderState=null;
// Payment state: holds the payment record
private ValueState<String> payState=null;
@Override
public void open(Configuration parameters) throws Exception {
// Define two states, one for the order record and one for the payment record
ValueStateDescriptor<String> orderStateDescriptor = new ValueStateDescriptor<>("order-state", String.class);
ValueStateDescriptor<String> payStateDescriptor = new ValueStateDescriptor<>("pay-state", String.class);
orderState = getRuntimeContext().getState(orderStateDescriptor);
payState = getRuntimeContext().getState(payStateDescriptor);
}
// Handle the order stream
@Override
public void processElement1(String orderInfo,
KeyedCoProcessFunction<String, String, String, String>.Context ctx,
Collector<String> out) throws Exception {
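// On an order record: if no payment is stored, the order is still unpaid, so register a timer that alerts after 5 s
if (payState.value() == null) {
// Read the order timestamp
long orderTimestamp = Long.valueOf(orderInfo.split(",")[3]);
// Register an event-time timer 5 s after the order timestamp
ctx.timerService().registerEventTimeTimer(orderTimestamp + 5 * 1000L);
// Store the order record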
orderState.update(orderInfo);
}else{
//如果支付状态不为空,说明订单已经支付,删除定时器
//获取定时器触发的时间
Long triggerTime = Long.valueOf(payState.value().split(",")[2])+5*1000L;
//删除定时器
ctx.timerService().deleteEventTimeTimer(triggerTime);
//删除完支付状态中的定时器后,清空支付状态
payState.clear();
}
}
//处理支付流
@Override
public void processElement2(String payInfo,
KeyedCoProcessFunction<String, String, String, String>.Context ctx,
Collector<String> out) throws Exception {
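//When a payment record arrives, check whether the order state is empty; if it is, the order record has not arrived yet, so register a timer that fires 5 seconds later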
if (orderState.value() == null) {
//获取支付时间戳
long payTimestamp = Long.valueOf(payInfo.split(",")[2]);
//注册定时器,设置定时器触发时间延后5s触发
ctx.timerService().registerEventTimeTimer(payTimestamp + 5 * 1000L);
//更新当前支付状态
payState.update(payInfo);
}else{
//如果订单状态不为空,说明订单已经支付,删除定时器
//获取定时器触发的时间
Long triggerTime = Long.valueOf(orderState.value().split(",")[3])+5*1000L;
//删除定时器
ctx.timerService().deleteEventTimeTimer(triggerTime);
//删除完订单状态中的定时器后,清空订单状态
orderState.clear();
}
}
//定时器触发后,执行的方法
@Override
public void onTimer(long timestamp,
KeyedCoProcessFunction<String, String, String, String>.OnTimerContext ctx,
Collector<String> out) throws Exception {
//判断订单状态是否为空,如果不为空,说明订单没有支付
if (orderState.value() != null) {
//输出提示信息
out.collect("订单ID:" + orderState.value().split(",")[0] + "已经超过5s没有支付,请尽快支付!");
//清空订单状态
orderState.clear();
}
//判断支付状态是否为空,如果不为空,说明订单已经支付,但没有订单信息!
if (payState.value() != null) {
//输出提示信息
out.collect("订单ID:" + payState.value().split(",")[0] + "有异常,有支付信息没有订单信息!");
//清空支付状态
payState.clear();
}
}
}).print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//为了方便看出效果,这里设置并行度为1
env.setParallelism(1)
//设置隐式转换
import org.apache.flink.streaming.api.scala._
/**
* 读取socket中订单流,并对订单流设置watermark
* 订单流数据格式:订单ID,用户ID,订单金额,时间戳
* order1,user_1,10,1000
* order2,user_2,20,2000
* order3,user_3,30,3000
*/
val orderDS: DataStream[String] = env.socketTextStream("node5", 8888)
//设置水位线
val orderDSWithWatermark: DataStream[String] = orderDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long = {
str.split(",")(3).toLong
}
})
)
/**
* 读取socket中支付流,并对支付流设置watermark
* 支付流数据格式:订单ID,支付金额,时间戳
* order1,10,1000
* order2,20,2000
* order3,30,3000
*/
val payDS: DataStream[String] = env.socketTextStream("node5", 9999)
//设置水位线
val payDSWithWatermark: DataStream[String] = payDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long = {
str.split(",")(2).toLong
}
})
)
//两流进行connect操作
orderDSWithWatermark.keyBy(_.split(",")(0))
.connect(payDSWithWatermark.keyBy(_.split(",")(0)))
.process(new KeyedCoProcessFunction[String,String,String,String] {
//订单状态,存储订单信息
var orderState:ValueState[String] = _
//支付状态,存储支付信息
var payState:ValueState[String] = _
override def open(parameters: Configuration): Unit = {
//定义两个状态,一个用来存放订单信息,一个用来存放支付信息
val orderStateDescriptor: ValueStateDescriptor[String] = new ValueStateDescriptor[String]("order-state", classOf[String])
val payStateDescriptor: ValueStateDescriptor[String] = new ValueStateDescriptor[String]("pay-state", classOf[String])
orderState = getRuntimeContext.getState(orderStateDescriptor)
payState = getRuntimeContext.getState(payStateDescriptor)
}
//处理订单流
override def processElement1(orderInfo: String,
ctx: KeyedCoProcessFunction[String, String, String, String]#Context,
out: Collector[String]): Unit = {
//当来一条订单数据后,判断支付状态是否为空,如果为空,说明订单没有支付,注册定时器,5秒后提示
if (payState.value() == null) {
//获取订单时间戳
val orderTimestamp: Long = orderInfo.split(",")(3).toLong
//注册定时器,设置定时器触发时间延后5s触发
ctx.timerService().registerEventTimeTimer(orderTimestamp + 5*1000L)
//更新订单状态
orderState.update(orderInfo)
}else{
//如果支付状态不为空,说明订单已经支付,删除定时器
ctx.timerService().deleteEventTimeTimer(payState.value().split(",")(2).toLong + 5*1000L)
//Clear the payment state
payState.clear()
}
}
//处理支付流
override def processElement2(payInfo: String,
ctx: KeyedCoProcessFunction[String, String, String, String]#Context,
out: Collector[String]): Unit = {
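//When a payment record arrives, check whether the order state is empty; if it is, the order record has not arrived yet, so register a timer that fires 5 seconds later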
if (orderState.value() == null) {
//注册定时器,设置定时器触发时间延后5s触发
ctx.timerService().registerEventTimeTimer(payInfo.split(",")(2).toLong + 5*1000L)
//更新支付状态
payState.update(payInfo)
}else{
//如果订单状态不为空,说明订单已经支付,删除定时器
ctx.timerService().deleteEventTimeTimer(orderState.value().split(",")(3).toLong + 5*1000L)
//清空订单状态
orderState.clear()
}
}
//定时器触发后,执行的方法
override def onTimer(timestamp: Long,
ctx: KeyedCoProcessFunction[String, String, String, String]#OnTimerContext,
out: Collector[String]): Unit = {
//判断订单状态是否为空,如果不为空,说明订单没有支付
if (orderState.value() != null) {
//输出提示信息
out.collect("订单ID:" + orderState.value().split(",")(0) + "已经超过5s没有支付,请尽快支付!")
//清空订单状态
orderState.clear()
}
//判断支付状态是否为空,如果不为空,说明订单已经支付,但是没有订单信息!
if (payState.value() != null) {
//输出提示信息
out.collect("订单ID:" + payState.value().split(",")(0) + "有异常,有支付信息没有订单信息!")
//清空支付状态
payState.clear()
}
}
}).print()
env.execute()
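After writing the code above, start the socket-8888 and socket-9999 services before launching the program. Once the program is running, enter the following data into the sockets in order:
#Data for socket-8888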
order1,user_1,10,1000
order2,user_2,20,2000
order4,user_3,30,3000
order6,user_4,40,7000
order7,user_4,40,9000
#Data for socket-9999
order1,10,1000
order3,20,2000
order6,30,7000
order7,50,9000
#Data for socket-8888
order8,user_4,40,9001
#Data for socket-9999
order8,50,9001
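In the data above, order2 in socket-8888 has order information but no payment information, and order3 in socket-9999 has payment information but no order information. order4 also has no payment, but its timer is registered for 8000 and the watermark only reaches 7000 (9001 - 2000 - 1) after the last records are entered, so that timer has not fired yet. The final output is: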
订单ID:order2已经超过5s没有支付,请尽快支付!
订单ID:order3有异常,有支付信息没有订单信息!
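The timing behind this output can be checked by hand. The following is a minimal plain-Java sketch, not Flink code: the class and method names are hypothetical, and it assumes the bounded-out-of-orderness watermark is computed as the largest timestamp seen minus the delay minus 1 ms, and that a two-input operator follows the smaller of its two input watermarks.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Plain-Java sketch (no Flink dependency); all names here are hypothetical.
class TimerWatermarkSketch {
    // Assumed watermark rule: max timestamp seen - allowed delay - 1 ms.
    static long watermark(long maxTimestamp, long delayMs) {
        return maxTimestamp - delayMs - 1;
    }

    // A two-input operator advances its event-time clock to the smaller input watermark.
    static long combined(long orderWatermark, long payWatermark) {
        return Math.min(orderWatermark, payWatermark);
    }

    // An event-time timer fires once the watermark reaches its timestamp.
    static boolean fires(long timerTimestamp, long watermark) {
        return watermark >= timerTimestamp;
    }

    public static void main(String[] args) {
        // Both sockets end with a record at 9001 and use a 2 s delay.
        long wm = combined(watermark(9001L, 2000L), watermark(9001L, 2000L));
        Map<String, Long> timers = new LinkedHashMap<>();
        timers.put("order2", 2000L + 5000L); // order without payment
        timers.put("order3", 2000L + 5000L); // payment without order
        timers.put("order4", 3000L + 5000L); // order without payment, timer still pending
        System.out.println("watermark = " + wm);
        timers.forEach((id, ts) ->
                System.out.println(id + " timer at " + ts + " fires: " + fires(ts, wm)));
    }
}
```

With a last record at 9000 instead of 9001 the watermark would stop at 6999, and neither timer at 7000 would fire, which is why the test data ends with 9001.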
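Window Join
In Flink, a Window Join is a stream operation that groups the elements of two streams by window and joins them. A Window Join can run on event time or processing time, and the window type (tumbling, sliding, or session window) controls the time range of the join.
The basic idea of a Window Join is to group the elements of both streams by the join key and join the elements that fall into the same window, which gives the effect of an inner join between two tables in SQL (Select a.id,a.name,a.age,b.score from a join b on a.id = b.id). It is used as follows: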
dataStream.join(otherStream)
.where(<key selector>).equalTo(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.apply (new JoinFunction () {...});
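The code above first joins the two streams with the join method to form a JoinedStreams object, then sets the join condition with where and equalTo: where selects the join key from the first stream and equalTo selects the join key from the second stream. The window method then assigns the window, so that elements with equal keys are joined only when they fall into the same window. Finally, apply takes a JoinFunction that processes each joined pair.
The Window Join workflow is as follows: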
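- Read data from the two input streams and extract the join key from each.
- Set an appropriate window type (tumbling, sliding, or session window), define the window size and, for sliding windows, the slide interval, and group the data by the window's time range.
- In each window, process the joined elements with the user-defined join logic (JoinFunction) to produce the final result.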
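The steps above can be sketched without Flink. The following plain-Java example is only an illustration under stated assumptions: the class and method names are hypothetical, records use the order and payment formats from this section, timestamps are non-negative, and watermarks and triggering are ignored so that only the "same key, same tumbling window" matching rule is shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Plain-Java sketch of the matching rule of a tumbling-window inner join; names are hypothetical.
class WindowJoinSketch {
    // Start of the tumbling window (size in ms, no offset) that contains the timestamp.
    static long windowStart(long timestamp, long sizeMs) {
        return timestamp - (timestamp % sizeMs);
    }

    // Pairs every order with every payment that has the same order ID and the same window.
    static List<String> join(List<String> orders, List<String> pays, long sizeMs) {
        List<String> out = new ArrayList<>();
        for (String order : orders) {
            String[] o = order.split(",");   // orderId,userId,amount,timestamp
            for (String pay : pays) {
                String[] p = pay.split(","); // orderId,amount,timestamp
                boolean sameKey = o[0].equals(p[0]);
                boolean sameWindow = windowStart(Long.parseLong(o[3]), sizeMs)
                        == windowStart(Long.parseLong(p[2]), sizeMs);
                if (sameKey && sameWindow) {
                    out.add(order + " - " + pay);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> orders = Arrays.asList("order1,user_1,10,1000", "order4,user_4,40,4000");
        List<String> pays = Arrays.asList("order1,10,1000", "order5,50,4000");
        // Only order1 appears on both sides, so order4 and order5 produce no output.
        join(orders, pays, 5000L).forEach(System.out::println);
    }
}
```

Records without a partner on the other side are dropped, which is the inner-join behavior described above.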
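Flink's API lets a Window Join be customized through the window type, trigger, and time semantics. A Window Join can also handle late data, to cope with delayed or out-of-order arrival.
When a Window Join uses event time, Flink triggers windows based on the smaller of the two input streams' watermarks. The following case shows how to use a Window Join.
Case: read an order stream and a payment stream, join them, and output the joined data.
- Java code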
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//方便测试,并行度设置为1
env.setParallelism(1);
/**
* 读取socket中订单流,并对订单流设置watermark
* 订单流数据格式:订单ID,用户ID,订单金额,时间戳
* order1,user_1,10,1000
* order2,user_2,20,2000
* order3,user_3,30,3000
*/
SingleOutputStreamOperator<String> orderDS = env.socketTextStream("node5", 8888);
//设置水位线
SingleOutputStreamOperator<String> orderDSWithWatermark = orderDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((orderInfo, timestamp) -> Long.valueOf(orderInfo.split(",")[3]))
);
/**
* 读取socket中支付流,并对支付流设置watermark
* 支付流数据格式:订单ID,支付金额,时间戳
* order1,10,1000
* order2,20,2000
* order3,30,3000
*/
SingleOutputStreamOperator<String> payDS = env.socketTextStream("node5", 9999);
//设置水位线
SingleOutputStreamOperator<String> payDSWithWatermark = payDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((payInfo, timestamp) -> Long.valueOf(payInfo.split(",")[2]))
);
//将订单流和支付流进行关联,并设置窗口
DataStream<String> result = orderDSWithWatermark.join(payDSWithWatermark)
//设置关联条件,where是订单流,equalTo是支付流
.where(new KeySelector<String, String>() {
@Override
public String getKey(String orderInfo) throws Exception {
return orderInfo.split(",")[0];
}
})
.equalTo(new KeySelector<String, String>() {
@Override
public String getKey(String payInfo) throws Exception {
return payInfo.split(",")[0];
}
})
//设置窗口
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//关联后的数据处理
.apply(new JoinFunction<String, String, String>() {
@Override
public String join(String orderInfo, String payInfo) throws Exception {
return "订单信息:" + orderInfo + " - 支付信息:" + payInfo;
}
});
result.print();
env.execute();
- Scala code
val env = StreamExecutionEnvironment.getExecutionEnvironment
//方便测试,并行度设置为1
env.setParallelism(1)
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* 读取socket中订单流,并对订单流设置watermark
* 订单流数据格式:订单ID,用户ID,订单金额,时间戳
* order1,user_1,10,1000
* order2,user_2,20,2000
* order3,user_3,30,3000
*/
val orderDS: DataStream[String] = env.socketTextStream("node5", 8888)
//设置水位线
val orderDSWithWatermark: DataStream[String] = orderDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long =
str.split(",")(3).toLong
})
)
/**
* 读取socket中支付流,并对支付流设置watermark
* 支付流数据格式:订单ID,支付金额,时间戳
* order1,10,1000
* order2,20,2000
* order3,30,3000
*/
val payDS: DataStream[String] = env.socketTextStream("node5", 9999)
//设置水位线
val payDSWithWatermark: DataStream[String] = payDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long =
str.split(",")(2).toLong
})
)
//将订单流和支付流进行关联,并设置窗口
val result: DataStream[String] = orderDSWithWatermark.join(payDSWithWatermark)
//设置关联条件,where是订单流,equalTo是支付流
.where(value=>value.split(",")(0))
.equalTo(value=>value.split(",")(0))
//设置窗口
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//关联后的数据处理
.apply(new JoinFunction[String, String, String] {
override def join(orderInfo: String, payInfo: String): String =
s"订单信息:$orderInfo - 支付信息:$payInfo"
})
result.print()
env.execute()
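The code above sets the parallelism to 1 and gives both streams a watermark delay of 2 seconds. Start the sockets before running the code, then enter the following data:
#Data for socket-8888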
order1,user_1,10,1000
order2,user_2,20,2000
order3,user_3,30,3000
order4,user_4,40,4000
#Data for socket-9999
order1,10,1000
order2,20,2000
order3,30,3000
order5,50,4000
#Data for socket-8888
order10,user_1,10,7000
#Data for socket-9999, triggers the window
order10,90,7000
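Enter the data into socket-8888 and socket-9999 in the order shown. When the watermark reaches the window's firing time (that is, when "order10,90,7000" is entered into socket-9999), the window fires and outputs the following: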
订单信息:order1,user_1,10,1000 - 支付信息:order1,10,1000
订单信息:order2,user_2,20,2000 - 支付信息:order2,20,2000
订单信息:order3,user_3,30,3000 - 支付信息:order3,30,3000
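Why the window fires only after the last payment record can be worked out with a few lines of arithmetic. This is a plain-Java sketch with hypothetical names, assuming the watermark is the largest timestamp seen minus the 2-second delay minus 1 ms, and that an event-time window [start, start + size) fires once the smaller of the two input watermarks reaches start + size - 1.

```java
// Plain-Java sketch of the firing condition of an event-time tumbling window; names are hypothetical.
class WindowFiringSketch {
    // Assumed watermark rule: max timestamp seen - allowed delay - 1 ms.
    static long watermark(long maxTimestamp, long delayMs) {
        return maxTimestamp - delayMs - 1;
    }

    // Largest timestamp that still belongs to the window [start, start + size).
    static long maxTimestamp(long windowStart, long sizeMs) {
        return windowStart + sizeMs - 1;
    }

    // The join follows the smaller of the two input watermarks.
    static boolean fires(long windowStart, long sizeMs, long orderWatermark, long payWatermark) {
        return Math.min(orderWatermark, payWatermark) >= maxTimestamp(windowStart, sizeMs);
    }

    public static void main(String[] args) {
        long orderWm = watermark(7000L, 2000L);     // after order10,user_1,10,7000
        long payWmBefore = watermark(4000L, 2000L); // payment stream still at 4000
        long payWmAfter = watermark(7000L, 2000L);  // after order10,90,7000
        System.out.println("before last payment: " + fires(0L, 5000L, orderWm, payWmBefore));
        System.out.println("after last payment:  " + fires(0L, 5000L, orderWm, payWmAfter));
    }
}
```

The order10 records themselves fall into the window [5000, 10000), which has not fired yet, so they do not appear in the output.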
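Interval Join
In Flink, an Interval Join joins the elements of two input streams within a specified time interval. Unlike a Window Join, an Interval Join is based on a time interval relative to each element rather than on fixed windows, which makes it better suited to scenarios that need flexible, time-range-based joins.
In an Interval Join, each element of one stream defines a time interval around its own timestamp, and it is joined with the elements of the other stream (with the same key) whose timestamps fall inside that interval. For example, given two streams A and B, an element of B is joined with an element of A as long as its timestamp lies within the interval defined for that A element. The interval is set in code with a lower bound and an upper bound, and the lower bound must be less than or equal to the upper bound.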

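When stream A is joined with stream B, the join relationship can be expressed as: a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound. That is, an element b of stream B is joined with event a of stream A whenever its event time lies in [a.timestamp + lowerBound, a.timestamp + upperBound]. Seen from stream B, event b can be joined with the events of stream A in the range b.timestamp - upperBound <= a.timestamp <= b.timestamp - lowerBound.
For example, if A and B are joined with an Interval Join whose lower bound is -2 seconds and upper bound is 1 second, an A event with event time 2 can be joined with B events whose event time is in [0,3], and a B event with event time 3 can be joined with A events whose event time is in [2,5].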
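The two views of the same condition can be verified with a small calculation. This is a plain-Java sketch with hypothetical names; timestamps and bounds are in seconds to match the example, and both bounds are treated as inclusive.

```java
// Plain-Java sketch of the interval-join condition; names are hypothetical.
class IntervalBoundsSketch {
    // True when a.timestamp + lowerBound <= b.timestamp <= a.timestamp + upperBound.
    static boolean matches(long aTs, long bTs, long lowerBound, long upperBound) {
        return aTs + lowerBound <= bTs && bTs <= aTs + upperBound;
    }

    // Range of B timestamps that a given A event can join with.
    static long[] rangeForA(long aTs, long lowerBound, long upperBound) {
        return new long[] {aTs + lowerBound, aTs + upperBound};
    }

    // Range of A timestamps that a given B event can join with.
    static long[] rangeForB(long bTs, long lowerBound, long upperBound) {
        return new long[] {bTs - upperBound, bTs - lowerBound};
    }

    public static void main(String[] args) {
        long lower = -2, upper = 1;
        long[] forA = rangeForA(2, lower, upper);
        long[] forB = rangeForB(3, lower, upper);
        System.out.println("A event at 2 joins B events in [" + forA[0] + "," + forA[1] + "]");
        System.out.println("B event at 3 joins A events in [" + forB[0] + "," + forB[1] + "]");
    }
}
```

Note that the bounds swap roles and signs when the condition is rewritten from B's point of view, which is easy to get wrong when reasoning about state retention on each side.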
An Interval Join is written as follows:
keyedStream.intervalJoin(otherKeyedStream)
.between(Time.milliseconds(-2), Time.milliseconds(2)) // lower and upper bound
.upperBoundExclusive() // optional
.lowerBoundExclusive() // optional
.process(new ProcessJoinFunction() {...});
In Flink, Interval Join supports only event time and can only be used on a KeyedStream; only elements of the two streams with the same key that satisfy the time-range relationship are joined. The first argument of between is the lower bound and the second is the upper bound. By default the join includes elements exactly on the lower and upper bounds; to exclude them, call upperBoundExclusive() and lowerBoundExclusive(), after which elements on the boundary are no longer joined.
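The effect of excluding a boundary can be illustrated as follows. This is a plain-Java sketch with hypothetical names, not Flink's implementation; it assumes millisecond timestamps, so excluding a bound is modeled as moving it inward by 1 ms.

```java
// Plain-Java sketch of inclusive vs. exclusive interval bounds; names are hypothetical.
class ExclusiveBoundsSketch {
    static boolean matches(long aTs, long bTs, long lowerBound, long upperBound,
                           boolean lowerExclusive, boolean upperExclusive) {
        // With millisecond timestamps, excluding a bound shifts it inward by 1 ms.
        long low = lowerExclusive ? aTs + lowerBound + 1 : aTs + lowerBound;
        long high = upperExclusive ? aTs + upperBound - 1 : aTs + upperBound;
        return low <= bTs && bTs <= high;
    }

    public static void main(String[] args) {
        // Login at 6000 ms, click at 4000 ms, bounds of -2 s and +2 s.
        System.out.println("inclusive:       " + matches(6000L, 4000L, -2000L, 2000L, false, false));
        System.out.println("lower exclusive: " + matches(6000L, 4000L, -2000L, 2000L, true, false));
    }
}
```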
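The following case shows how to use an Interval Join. Suppose a website pushes a large number of ads whenever a user logs in, and target users click ads shortly before or after logging in. User logins produce a login stream containing the user ID and login time, and ad clicks produce an ad-click stream containing the user ID, ad ID, and click time. To analyze which ads a user clicked within a time range around the login, we can use an Interval Join.
Case: read the user login stream and the ad-click stream, and analyze users' ad clicks with an Interval Join.
- Java code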
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//方便测试,并行度设置为1
env.setParallelism(1);
/**
* 读取socket中用户登录流,并对用户登录流设置watermark
* 用户登录流数据格式:用户ID,登录时间
* user_1,1000
*/
SingleOutputStreamOperator<String> loginDS = env.socketTextStream("node5", 8888);
//设置水位线
SingleOutputStreamOperator<String> loginDSWithWatermark = loginDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((loginInfo, timestamp) -> Long.valueOf(loginInfo.split(",")[1]))
);
/**
* 读取socket中广告点击流,并对广告点击流设置watermark
* 广告点击流数据格式:用户ID,广告ID,点击时间
* user_1,product_1,1000
*/
SingleOutputStreamOperator<String> clickDS = env.socketTextStream("node5", 9999);
//设置水位线
SingleOutputStreamOperator<String> clickDSWithWatermark = clickDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((clickInfo, timestamp) -> Long.valueOf(clickInfo.split(",")[2]))
);
//Interval Join
loginDSWithWatermark.keyBy(loginInfo -> loginInfo.split(",")[0])
.intervalJoin(clickDSWithWatermark.keyBy(clickInfo -> clickInfo.split(",")[0]))
//设置时间范围
.between(Time.seconds(-2), Time.seconds(2))
//设置处理函数
.process(new ProcessJoinFunction<String, String, String>() {
@Override
public void processElement(String left, String right, ProcessJoinFunction<String, String, String>.Context ctx, Collector<String> out) throws Exception {
//获取用户ID
String userId = left.split(",")[0];
out.collect("用户ID为:" + userId + "的用户点击了广告:" + right);
}
})
.print();
env.execute();
- Scala code
val env = StreamExecutionEnvironment.getExecutionEnvironment
// 设置并行度为1,方便测试
env.setParallelism(1)
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* 读取socket中用户登录流,并对用户登录流设置watermark
* 用户登录流数据格式: 用户ID,登录时间
* user_1,1000
*/
val loginDS: DataStream[String] = env.socketTextStream("node5", 8888)
// 设置水位线
val loginDSWithWatermark: DataStream[String] = loginDS.assignTimestampsAndWatermarks(
// 设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(element: String, recordTimestamp: Long): Long = {
element.split(",")(1).toLong
}
})
)
/**
* 读取socket中广告点击流,并对广告点击流设置watermark
* 广告点击流数据格式: 用户ID,广告ID,点击时间
* user_1,product_1,1000
*/
val clickDS: DataStream[String] = env.socketTextStream("node5", 9999)
// 设置水位线
val clickDSWithWatermark: DataStream[String] = clickDS.assignTimestampsAndWatermarks(
// 设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(element: String, recordTimestamp: Long): Long = {
element.split(",")(2).toLong
}
})
)
// Interval Join
loginDSWithWatermark.keyBy(loginInfo => loginInfo.split(",")(0))
.intervalJoin(clickDSWithWatermark.keyBy(clickInfo => clickInfo.split(",")(0)))
// 设置时间范围
.between(Time.seconds(-2), Time.seconds(2))
// 设置处理函数
.process(new ProcessJoinFunction[String, String, String] {
override def processElement(left: String,
right: String,
ctx: ProcessJoinFunction[String, String, String]#Context,
out: Collector[String]): Unit = {
// 获取用户ID
val userId: String = left.split(",")(0)
out.collect(s"用户ID为:$userId 的用户点击了广告:$right")
}
})
.print()
env.execute()
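The code above joins the two streams with an Interval Join whose lower bound is -2 seconds and upper bound is 2 seconds, with both bounds inclusive. Start socket-8888 and socket-9999 before launching the code, then enter the following data into the sockets:
#Data for socket-9999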
user_1,product_1,3000
user_1,product_2,4000
user_1,product_3,5000
user_1,product_4,6000
user_1,product_5,7000
user_1,product_6,8000
user_1,product_7,9000
#Data for socket-8888
user_1,6000
#Data for socket-8888
user_2,9000
#Data for socket-9999
user_2,product_11,6000
user_2,product_12,7000
user_2,product_13,8000
user_2,product_14,9000
user_2,product_15,10000
user_2,product_16,11000
user_2,product_17,12000
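After the ad-click records for user_1 with timestamps 3000~9000 are entered into socket-9999, entering "user_1,6000" into socket-8888 matches the clicks within 2 seconds before and after the login. Likewise, entering the user_2 data into socket-8888 and socket-9999 produces the corresponding matches: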
用户ID为:user_1 的用户点击了广告:user_1,product_2,4000
用户ID为:user_1 的用户点击了广告:user_1,product_4,6000
用户ID为:user_1 的用户点击了广告:user_1,product_6,8000
用户ID为:user_1 的用户点击了广告:user_1,product_3,5000
用户ID为:user_1 的用户点击了广告:user_1,product_5,7000
用户ID为:user_2 的用户点击了广告:user_2,product_12,7000
用户ID为:user_2 的用户点击了广告:user_2,product_13,8000
用户ID为:user_2 的用户点击了广告:user_2,product_14,9000
用户ID为:user_2 的用户点击了广告:user_2,product_15,10000
用户ID为:user_2 的用户点击了广告:user_2,product_16,11000
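Window CoGroup
Flink's Window CoGroup is a flexible stream operation similar to Window Join: it groups the elements of two input streams by the join key within a window and produces output according to user-defined logic. To use it, call .coGroup() instead of .join(), as follows: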
dataStream.coGroup(otherStream)
.where(<key selector>).equalTo(<key selector>)
.window(TumblingEventTimeWindows.of(Time.seconds(3)))
.apply (new CoGroupFunction () {...});
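As with Window Join, where and equalTo specify the join keys of the two streams and window sets the window. Finally, apply takes a CoGroupFunction whose coGroup method processes the data of the two streams in the same window. The CoGroupFunction interface looks like this: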
new CoGroupFunction<IN1, IN2, OUT>() {
@Override
public void coGroup(Iterable<IN1> first, Iterable<IN2> second, Collector<OUT> out) throws Exception {
out.collect(first + "=====" + second);
}
}
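Unlike Window Join, the parameters of coGroup are two iterable collections. As shown above, first holds all elements of the first stream that have a given key in a given window, and second holds all elements of the second stream with the same key in the same window. coGroup is called once per key and window even when only one stream has data for that key, in which case the other collection is empty. You can therefore implement joins similar to SQL's inner join, left outer join, right outer join, and full outer join as needed. Window Join itself is implemented on top of Window CoGroup, and Window CoGroup suits more complex business scenarios than Window Join.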
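As an illustration of how the two collections allow outer joins, here is a minimal plain-Java sketch of left-outer-join logic that could form the body of a coGroup method. The class and method names are hypothetical, the two Iterables stand in for the first and second arguments of one key and one window, and the "null" placeholder for a missing payment is an arbitrary choice made for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Plain-Java sketch of left-outer-join logic over the two groups of one key and window; names are hypothetical.
class CoGroupOuterJoinSketch {
    static List<String> leftOuterJoin(Iterable<String> orders, Iterable<String> pays) {
        List<String> out = new ArrayList<>();
        for (String order : orders) {
            boolean matched = false;
            for (String pay : pays) {
                out.add(order + " - " + pay);
                matched = true;
            }
            // Keep the order even when no payment shares its key in this window.
            if (!matched) {
                out.add(order + " - null");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Key order1: both sides present.
        System.out.println(leftOuterJoin(
                Arrays.asList("order1,user_1,10,1000"), Arrays.asList("order1,10,1000")));
        // Key order4: only the order side has data.
        System.out.println(leftOuterJoin(
                Arrays.asList("order4,user_4,40,4000"), Collections.<String>emptyList()));
    }
}
```

Swapping the roles of the two collections gives a right outer join, and emitting unmatched elements from both sides gives a full outer join.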
Case: read an order stream and a payment stream, join them, and output the joined data.
- Java code
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
//方便测试,并行度设置为1
env.setParallelism(1);
/**
* 读取socket中订单流,并对订单流设置watermark
* 订单流数据格式:订单ID,用户ID,订单金额,时间戳
* order1,user_1,10,1000
* order2,user_2,20,2000
* order3,user_3,30,3000
*/
SingleOutputStreamOperator<String> orderDS = env.socketTextStream("node5", 8888);
//设置水位线
SingleOutputStreamOperator<String> orderDSWithWatermark = orderDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((orderInfo, timestamp) -> Long.valueOf(orderInfo.split(",")[3]))
);
/**
* 读取socket中支付流,并对支付流设置watermark
* 支付流数据格式:订单ID,支付金额,时间戳
* order1,10,1000
* order2,20,2000
* order3,30,3000
*/
SingleOutputStreamOperator<String> payDS = env.socketTextStream("node5", 9999);
//设置水位线
SingleOutputStreamOperator<String> payDSWithWatermark = payDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<String>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((payInfo, timestamp) -> Long.valueOf(payInfo.split(",")[2]))
);
//将订单流和支付流进行关联,并设置窗口
DataStream<String> result = orderDSWithWatermark.coGroup(payDSWithWatermark)
//设置关联条件,where是订单流,equalTo是支付流
.where(new KeySelector<String, String>() {
@Override
public String getKey(String orderInfo) throws Exception {
return orderInfo.split(",")[0];
}
})
.equalTo(new KeySelector<String, String>() {
@Override
public String getKey(String payInfo) throws Exception {
return payInfo.split(",")[0];
}
})
//设置窗口
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//关联后的数据处理
.apply(new CoGroupFunction<String, String, String>() {
@Override
public void coGroup(Iterable<String> first, Iterable<String> second, Collector<String> out) throws Exception {
out.collect(first + "=====" + second);
}
});
result.print();
env.execute();
- Scala code
val env = StreamExecutionEnvironment.getExecutionEnvironment
//方便测试,并行度设置为1
env.setParallelism(1)
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* 读取socket中订单流,并对订单流设置watermark
* 订单流数据格式:订单ID,用户ID,订单金额,时间戳
* order1,user_1,10,1000
* order2,user_2,20,2000
* order3,user_3,30,3000
*/
val orderDS: DataStream[String] = env.socketTextStream("node5", 8888)
//设置水位线
val orderDSWithWatermark: DataStream[String] = orderDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long =
str.split(",")(3).toLong
})
)
/**
* 读取socket中支付流,并对支付流设置watermark
* 支付流数据格式:订单ID,支付金额,时间戳
* order1,10,1000
* order2,20,2000
* order3,30,3000
*/
val payDS: DataStream[String] = env.socketTextStream("node5", 9999)
//设置水位线
val payDSWithWatermark: DataStream[String] = payDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy
.forBoundedOutOfOrderness[String](Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner(new SerializableTimestampAssigner[String] {
override def extractTimestamp(str: String, recordTimestamp: Long): Long =
str.split(",")(2).toLong
})
)
//将订单流和支付流进行关联,并设置窗口
val result: DataStream[String] = orderDSWithWatermark.coGroup(payDSWithWatermark)
//设置关联条件,where是订单流,equalTo是支付流
.where(value=>value.split(",")(0))
.equalTo(value=>value.split(",")(0))
//设置窗口
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
//关联后的数据处理
.apply(new CoGroupFunction[String,String,String] {
override def coGroup(first: lang.Iterable[String], second: lang.Iterable[String], out: Collector[String]): Unit = {
out.collect(first+"====="+second)
}
})
result.print()
env.execute()
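This case uses the same data as the Window Join case. After writing the code and before launching it, start socket-8888 and socket-9999, then enter the following data:
#Data for socket-8888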
order1,user_1,10,1000
order2,user_2,20,2000
order3,user_3,30,3000
order4,user_4,40,4000
#Data for socket-9999
order1,10,1000
order2,20,2000
order3,30,3000
order5,50,4000
#Data for socket-8888
order10,user_1,10,7000
#Data for socket-9999, triggers the window
order10,90,7000
After entering the data in the order above, the following output appears:
[order1,user_1,10,1000]=====[order1,10,1000]
[order2,user_2,20,2000]=====[order2,20,2000]
[]=====[order5,50,4000]
[order4,user_4,40,4000]=====[]
[order3,user_3,30,3000]=====[order3,30,3000]