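Using watermarks
In Flink, watermarks are generated by calling the assignTimestampsAndWatermarks method on a DataStream. Based on the event time and the allowed delay that the user specifies, the method emits watermarks according to a fixed rule.
assignTimestampsAndWatermarks must be called before keyBy and the window function. It accepts a WatermarkStrategy object, which provides the common watermark strategies: watermarks for in-order streams, watermarks for out-of-order streams, and custom watermark generation. Each is described below.
Setting watermarks on an in-order stream
Generating watermarks for an in-order stream is straightforward. Events reach Flink in event-time order, so there is no need to wait for late data, and the largest event time seen so far can serve directly as the watermark (the built-in strategy actually emits that value minus 1 ms). Use the "WatermarkStrategy.forMonotonousTimestamps()" strategy to declare the stream as in-order, and call "withTimestampAssigner" to pick the event-time field from which watermarks are derived. Note that the selected event-time field must be in milliseconds.
The following example shows how to set watermarks on an in-order stream. It uses window programming, whose details are covered in later sections.
Example: read base-station log data from a socket and, in 5-second windows, compute the total call duration of all outgoing calls for each base station.
- Java code: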
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// env.setParallelism(1);
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 002,182,183,fail,2000,20
* 003,183,184,busy,3000,30
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
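// Assign timestamps and watermarks to stationLogDS; the commented-out lines show a lambda-based variant without withIdleness
// stationLogDS.assignTimestampsAndWatermarks(
// WatermarkStrategy.<StationLog>forMonotonousTimestamps()
// .withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime));
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
WatermarkStrategy.<StationLog>forMonotonousTimestamps()
.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
// stationLog is the input record; timestamp is the record's current timestamp, which is Long.MIN_VALUE (-9223372036854775808) if none has been assigned yet
@Override
public long extractTimestamp(StationLog stationLog, long timestamp) {
return stationLog.callTime;
}
})
/**
* With multiple parallel subtasks, a subtask that receives no data for a long time keeps the watermark from advancing.
* Setting a maximum idle time lets the watermark advance once a subtask has been silent for longer than that.
* The idle time is measured against the system clock of the current machine.
*/
.withIdleness(Duration.ofSeconds(5))
);
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station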
dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
})
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print();
env.execute();
- Scala code:
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Import the implicit conversions
import org.apache.flink.streaming.api.scala._
// env.setParallelism(1)
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 002,182,183,fail,2000,20
* 003,183,184,busy,3000,30
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Watermark strategy for an in-order stream
WatermarkStrategy.forMonotonousTimestamps[StationLog]()
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print()
env.execute()
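With multiple parallel subtasks, Flink's global watermark is the minimum of the watermarks across all subtasks, and each subtask's watermark starts at Long.MIN_VALUE, that is, -9223372036854775808. When the code above runs locally, the default parallelism equals the number of hardware threads of the machine. Because the data comes from a socket, some subtasks never receive any data, so the global watermark never advances. For testing it is therefore convenient to set the parallelism to 1: with a single subtask, the watermark is updated periodically (every 200 ms by default) as events arrive.
In addition, to handle subtasks that receive no input for a long time and hold the watermark back, starting with Flink 1.13 you can also call "withIdleness(Duration.ofSeconds(...))" at the end of the WatermarkStrategy. A subtask that stays silent for longer than the given duration is marked idle; its watermark is then ignored and the watermark keeps advancing based on the subtasks that do have data. Note that the duration is measured in system time.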
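The minimum-over-subtasks rule and the effect of idleness can be illustrated with a small, self-contained model. This is only a sketch: GlobalWatermarkSketch and combine are hypothetical names rather than Flink classes, and returning Long.MIN_VALUE when every subtask is idle is a choice made for this sketch, not a statement about Flink's internals.

```java
/** Illustrative model only; the class and method names are hypothetical, not Flink API. */
final class GlobalWatermarkSketch {

    /**
     * Combines per-subtask watermarks as the text describes: idle subtasks
     * are ignored and the minimum of the remaining watermarks is used.
     * Returns Long.MIN_VALUE when every subtask is idle (sketch-only choice).
     */
    static long combine(long[] watermarks, boolean[] idle) {
        long min = Long.MAX_VALUE;
        boolean anyActive = false;
        for (int i = 0; i < watermarks.length; i++) {
            if (!idle[i]) {
                anyActive = true;
                min = Math.min(min, watermarks[i]);
            }
        }
        return anyActive ? min : Long.MIN_VALUE;
    }

    public static void main(String[] args) {
        // One subtask has seen callTime 5000; the other two never received data.
        long[] watermarks = {4999L, Long.MIN_VALUE, Long.MIN_VALUE};
        // No idleness: the silent subtasks hold the global watermark at Long.MIN_VALUE.
        System.out.println(combine(watermarks, new boolean[] {false, false, false}));
        // Silent subtasks marked idle: the global watermark advances to 4999.
        System.out.println(combine(watermarks, new boolean[] {false, true, true}));
    }
}
```

With the silent subtasks ignored, the global watermark reaches 4999, which is enough to fire the window [0, 5000) in the example.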
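After starting the code above, enter the following data in the socket. If the parallelism is not set to 1, no result is emitted even after the record that closes the window arrives; results appear only if the parallelism is set to 1 or withIdleness is configured on the strategy.
# Socket input, entered in event-time order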
001,181,182,busy,1000,10
002,182,183,fail,2000,20
003,183,184,busy,3000,30
004,184,185,busy,4000,40
# Once this record is entered, a window result is emitted
005,181,183,busy,5000,50
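Setting watermarks on an out-of-order stream
In an out-of-order stream, data does not reach Flink in event-time order, so generating watermarks from event time requires an extra delay that "waits for late" data. The delay should be long enough that, as far as possible, all late data reaches Flink. Use the "WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(...))" strategy to declare the stream as out-of-order and set the delay, then call "withTimestampAssigner" to pick the event-time field from which watermarks are derived. Note that the selected event-time field must be in milliseconds. The same example is used below to show how to set watermarks on an out-of-order stream.
Example: read base-station log data from a socket and, in 5-second windows, compute the total call duration of all outgoing calls for each base station.
- Java code: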
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// env.setParallelism(1);
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
* 002,182,183,fail,2000,20
* 001,181,185,success,6000,60
* 003,182,184,busy,7000,30
* 003,183,181,busy,3000,30
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
// Assign timestamps and watermarks to stationLogDS
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Watermark strategy with a maximum delay of 2 s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Field that carries the event time
.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
@Override
public long extractTimestamp(StationLog element, long recordTimestamp) {
return element.callTime;
}
}).withIdleness(Duration.ofSeconds(5))
);
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station
dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
})
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Import the implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
* 002,182,183,fail,2000,20
* 001,181,185,success,6000,60
* 003,182,184,busy,7000,30
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Watermark strategy with a maximum delay of 2 s
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print()
env.execute()
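The code above sets the maximum delay to 2 s. By the rule for out-of-order streams, the watermark is the largest event time Flink has received so far minus 2 seconds, and the actual watermark is reduced by a further 1 ms. The reason is as follows: let t be the largest event time minus the delay. A watermark of t would claim that every event with event time less than or equal to t has arrived, yet another event with event time exactly t may still arrive later. Emitting t minus 1 ms only claims that all events with event time strictly less than t have reached Flink. For example, if the largest event time received so far is 5000 ms and the maximum delay is 2 s, the actual watermark is 5000 ms - 2000 ms - 1 ms = 2999 ms. The source code behind forBoundedOutOfOrderness confirms this: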
public class BoundedOutOfOrdernessWatermarks<T> implements WatermarkGenerator<T> {
/** The maximum timestamp encountered so far. */
private long maxTimestamp;
/** The maximum out-of-orderness that this watermark generator assumes. */
private final long outOfOrdernessMillis;
/**
* Creates a new watermark generator with the given out-of-orderness bound.
*
* @param maxOutOfOrderness The bound for the out-of-orderness of the event timestamps.
*/
public BoundedOutOfOrdernessWatermarks(Duration maxOutOfOrderness) {
checkNotNull(maxOutOfOrderness, "maxOutOfOrderness");
checkArgument(!maxOutOfOrderness.isNegative(), "maxOutOfOrderness cannot be negative");
this.outOfOrdernessMillis = maxOutOfOrderness.toMillis();
// start so that our lowest watermark would be Long.MIN_VALUE.
this.maxTimestamp = Long.MIN_VALUE + outOfOrdernessMillis + 1;
}
// ------------------------------------------------------------------------
@Override
public void onEvent(T event, long eventTimestamp, WatermarkOutput output) {
maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
}
@Override
public void onPeriodicEmit(WatermarkOutput output) {
output.emitWatermark(new Watermark(maxTimestamp - outOfOrdernessMillis - 1));
}
}
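The delay and watermark rule for out-of-order streams closely resembles the rule for in-order streams. In fact, an in-order stream can also use the "WatermarkStrategy.forBoundedOutOfOrderness(Duration.ofSeconds(...))" strategy with the delay set to 0, meaning the data has no delay; the effect is exactly the same as using "WatermarkStrategy.forMonotonousTimestamps()" on an in-order stream.
Once the code above is written it can be started directly. Enter the following data in the socket; when the watermark reaches the end of the window (its end time minus 1 ms), the window fires and emits its result.
# Socket input; event times are out of order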
001,181,182,busy,1000,10
004,184,185,busy,4000,40
005,181,183,busy,5000,50
002,182,183,fail,2000,20
001,181,185,success,6000,60
# Once this record is entered, the window fires
003,182,184,busy,7000,30
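To see why only the last record fires the window, the rule "watermark = largest timestamp - delay - 1" can be traced over the callTime values entered above. The following is a minimal sketch, not Flink code: BoundedWatermarkTrace is a hypothetical class that mirrors the source shown earlier, and it assumes an event-time window [start, end) fires once the watermark reaches end - 1.

```java
/** Illustrative trace only; the class is hypothetical and merely mirrors the rule described in the text. */
final class BoundedWatermarkTrace {
    private final long delayMillis;
    private long maxTimestamp;

    BoundedWatermarkTrace(long delayMillis) {
        this.delayMillis = delayMillis;
        // Same starting point as the source above, so the first watermark is Long.MIN_VALUE.
        this.maxTimestamp = Long.MIN_VALUE + delayMillis + 1;
    }

    /** Records an event timestamp and returns the watermark the next periodic emit would produce. */
    long onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
        return maxTimestamp - delayMillis - 1;
    }

    /** Assumption of this sketch: a window [start, end) fires once the watermark reaches end - 1. */
    static boolean fires(long windowEnd, long watermark) {
        return watermark >= windowEnd - 1;
    }

    public static void main(String[] args) {
        BoundedWatermarkTrace generator = new BoundedWatermarkTrace(2000L);
        long[] callTimes = {1000L, 4000L, 5000L, 2000L, 6000L, 7000L};
        for (long callTime : callTimes) {
            long watermark = generator.onEvent(callTime);
            System.out.println("callTime=" + callTime + " watermark=" + watermark
                    + " window [0,5000) fires=" + fires(5000L, watermark));
        }
    }
}
```

The out-of-order record with callTime 2000 leaves the watermark at 2999, and only the record with callTime 7000 lifts it to 4999, the value at which the window [0, 5000) fires.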
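Custom watermark generation
The in-order and out-of-order strategies above cover the vast majority of Flink use cases, but sometimes you want to control how watermarks are generated yourself. Flink supports custom strategies for this: call "WatermarkStrategy.forGenerator(new WatermarkGeneratorSupplier(...))" to plug in your own watermark generator, then call "withTimestampAssigner" to pick the event-time field, so that watermarks are generated by your own rule. Note that the selected event-time field must be in milliseconds.
To define a custom watermark, implement the createWatermarkGenerator method of the WatermarkGeneratorSupplier interface, which returns a WatermarkGenerator object. You must implement the WatermarkGenerator interface yourself, including its onEvent and onPeriodicEmit methods. A custom WatermarkGenerator implementation looks like this: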
public class CustomWatermarkGenerator implements WatermarkGenerator<MyEvent> {
// Maximum delay tolerated for events, in milliseconds
private final long maxOutOfOrderness = 5000;
// Largest event timestamp seen so far
private long currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness + 1;
// Called once per event; can inspect or record the event's timestamp, or emit a watermark based on the event itself
@Override
public void onEvent(MyEvent event, long eventTimestamp, WatermarkOutput output) {
currentMaxTimestamp = Math.max(currentMaxTimestamp, eventTimestamp);
}
// Called periodically; may or may not emit a new watermark
@Override
public void onPeriodicEmit(WatermarkOutput output) {
// Emitted watermark = largest timestamp so far - maximum out-of-orderness - 1
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
}
}
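The attributes and methods in the code above are explained below:
- maxOutOfOrderness: optional. The maximum delay tolerated for events; choose a value that fits your data.
- currentMaxTimestamp: optional. The largest event time Flink has received so far, from which the watermark is computed. Its initial value is Long.MIN_VALUE + maximum delay + 1 so that the first watermark equals Long.MIN_VALUE: since watermark = currentMaxTimestamp - maxOutOfOrderness - 1, it follows that currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness + 1.
- onEvent: required. Called once for every event; it can inspect or record the event's timestamp, or generate a watermark based on the event itself.
- onPeriodicEmit: required, although the body may be empty. Called periodically; calling output.emitWatermark inside it emits a watermark on each period, or it can emit nothing. The period is set with env.getConfig().setAutoWatermarkInterval(100) and defaults to 200 ms.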
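The choice of initial value for currentMaxTimestamp can be checked with plain long arithmetic. The sketch below is illustrative only (InitialTimestampSketch is a hypothetical name); it shows that the initial value from the text yields Long.MIN_VALUE as the first watermark, whereas starting from Long.MIN_VALUE itself would make the subtraction wrap around to a large positive number.

```java
/** Illustrative arithmetic only; the class name is hypothetical. */
final class InitialTimestampSketch {

    /** Watermark formula from the text: largest timestamp - maximum delay - 1. */
    static long watermark(long currentMaxTimestamp, long maxOutOfOrderness) {
        return currentMaxTimestamp - maxOutOfOrderness - 1;
    }

    public static void main(String[] args) {
        long delay = 5000L;
        // Initial value used in the text: the first watermark is exactly Long.MIN_VALUE.
        System.out.println(watermark(Long.MIN_VALUE + delay + 1, delay) == Long.MIN_VALUE);
        // Starting from Long.MIN_VALUE instead: long subtraction overflows and wraps to a positive value.
        System.out.println(watermark(Long.MIN_VALUE, delay) > 0);
    }
}
```

Both lines print true, which is why the generator starts from Long.MIN_VALUE + maxOutOfOrderness + 1 rather than from Long.MIN_VALUE.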
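Custom watermark generation through the WatermarkGenerator interface covers two scenarios: periodic generation (Periodic WatermarkGenerator) and punctuated generation (Punctuated WatermarkGenerator). Both are described below.
Periodic watermark generation
Periodic generation means implementing the onPeriodicEmit method of the WatermarkGenerator interface, which emits a watermark every 200 ms by default.
Example: read base-station log data from a socket and, in 5-second windows, compute the total call duration of all outgoing calls for each base station (with a custom watermark generator).
- Java code: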
public class PeriodicWatermarkGeneratorTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// // Set the watermark emission interval to 100 ms
// env.getConfig().setAutoWatermarkInterval(100);
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
* 002,182,183,fail,2000,20
* 001,181,185,success,6000,60
* 003,182,184,busy,7000,30
* 003,183,181,busy,3000,30
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
// Assign timestamps and watermarks to stationLogDS using a custom generator
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Custom periodic watermark generator
WatermarkStrategy.forGenerator(new WatermarkGeneratorSupplier<StationLog>() {
@Override
public WatermarkGenerator<StationLog> createWatermarkGenerator(Context context) {
return new CustomPeriodicWatermark();
}
})
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
@Override
public long extractTimestamp(StationLog stationLog, long l) {
return stationLog.callTime;
}
})
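// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
);
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station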
dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
})
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print();
env.execute();
}
}
class CustomPeriodicWatermark implements WatermarkGenerator<StationLog> {
// Maximum delay in milliseconds
long maxOutOfOrderness = 2000;
// Largest timestamp seen so far. It starts at Long.MIN_VALUE + maxOutOfOrderness + 1 so that the first watermark, currentMaxTimestamp - maxOutOfOrderness - 1, equals Long.MIN_VALUE
long currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness + 1;
/**
* Called once per record; updates the largest timestamp
*/
@Override
public void onEvent(StationLog stationLog, long eventTimestamp, WatermarkOutput output) {
// Update the largest timestamp
currentMaxTimestamp = Math.max(currentMaxTimestamp, stationLog.callTime);
}
/**
* Called periodically to emit a new watermark.
* The interval is set with env.getConfig().setAutoWatermarkInterval(100) and defaults to 200 ms
*/
@Override
public void onPeriodicEmit(WatermarkOutput output) {
// Why subtract 1? A watermark of t means all records with timestamp <= t have arrived, yet a record with timestamp exactly t may still come; subtracting 1 only claims that all records with timestamp < t have arrived
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1));
}
}
- Scala code:
object PeriodicWatermarkGeneratorTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Import the implicit conversions
import org.apache.flink.streaming.api.scala._
// env.setParallelism(1)
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
* 002,182,183,fail,2000,20
* 001,181,185,success,6000,60
* 003,182,184,busy,7000,30
* 003,183,181,busy,3000,30
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Custom periodic watermark generator
WatermarkStrategy.forGenerator[StationLog](new WatermarkGeneratorSupplier[StationLog]{
override def createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context): WatermarkGenerator[StationLog] = {
new CustomPeriodicWatermark()
}
} )
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
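// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station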
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print()
env.execute()
}
}
class CustomPeriodicWatermark extends WatermarkGenerator[StationLog] {
// Maximum out-of-orderness allowed, in milliseconds
val maxOutOfOrderness = 2000L
// Largest timestamp seen so far
var currentMaxTimestamp = Long.MinValue + maxOutOfOrderness + 1
// Called once per event
override def onEvent(stationLog: StationLog, eventTimestamp: Long, output: WatermarkOutput): Unit = {
// Update the largest timestamp
currentMaxTimestamp = Math.max(currentMaxTimestamp, stationLog.callTime)
}
// Called periodically to emit a watermark
override def onPeriodicEmit(output: WatermarkOutput): Unit = {
// Emit the watermark
output.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1))
}
}
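The logic above is in fact how "WatermarkStrategy.forBoundedOutOfOrderness(...)" works; the code sets the maximum event delay to 2 s. After writing and starting it, enter the following data in the socket to test:
# Socket input; event times are out of order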
001,181,182,busy,1000,10
004,184,185,busy,4000,40
005,181,183,busy,5000,50
002,182,183,fail,2000,20
001,181,185,success,6000,60
# Once this record is entered, the corresponding window fires
003,182,184,busy,7000,30
003,183,181,busy,3000,30
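Punctuated watermark generation
Punctuated generation usually means that certain events trigger the creation and emission of a watermark. For example, suppose that in our base-station data the callTime of one base station always arrives out of order while the times of the other stations are normal; we then need to generate watermarks specifically from that station.
Example: read base-station log data from a socket and, in 5-second windows, compute the total call duration of all outgoing calls for each base station (with a custom generator that derives watermarks from the event time of base station 001).
- Java code: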
public class PunctuatedWatermarkGeneratorTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
// // Set the watermark emission interval to 100 ms
// env.getConfig().setAutoWatermarkInterval(100);
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
* 002,182,183,fail,2000,20
* 001,181,185,success,6000,60
* 003,182,184,busy,7000,30
* 001,183,184,busy,7000,30
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
// Assign timestamps and watermarks to stationLogDS using a custom generator
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Custom punctuated watermark generator
WatermarkStrategy.forGenerator(new WatermarkGeneratorSupplier<StationLog>() {
@Override
public WatermarkGenerator<StationLog> createWatermarkGenerator(Context context) {
return new CustomPunctuatedWatermark();
}
})
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner<StationLog>() {
@Override
public long extractTimestamp(StationLog stationLog, long l) {
return stationLog.callTime;
}
})
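// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
);
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station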
dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
})
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print();
env.execute();
}
}
class CustomPunctuatedWatermark implements WatermarkGenerator<StationLog> {
// Maximum delay in milliseconds
long maxOutOfOrderness = 2000;
long currentMaxTimestamp = Long.MIN_VALUE + maxOutOfOrderness + 1L;
@Override
public void onEvent(StationLog stationLog, long l, WatermarkOutput watermarkOutput) {
// Emit a watermark only for records from base station 001
if ("001".equals(stationLog.sid)) {
// Update currentMaxTimestamp from the event
currentMaxTimestamp = Math.max(currentMaxTimestamp, stationLog.callTime);
watermarkOutput.emitWatermark(new Watermark(currentMaxTimestamp - maxOutOfOrderness - 1L));
}
}
@Override
public void onPeriodicEmit(WatermarkOutput watermarkOutput) {
// Intentionally empty: watermarks are emitted per event in onEvent, not periodically
}
}
- Scala code:
object PunctuatedWatermarkGeneratorTest {
def main(args: Array[String]): Unit = {
val env = StreamExecutionEnvironment.getExecutionEnvironment
// Import the implicit conversions
import org.apache.flink.streaming.api.scala._
// env.setParallelism(1)
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 004,184,185,busy,4000,40
* 005,181,183,busy,5000,50
* 002,182,183,fail,2000,20
* 001,181,185,success,6000,60
* 003,182,184,busy,7000,30
* 001,183,184,busy,7000,30
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Custom punctuated watermark generator
WatermarkStrategy.forGenerator[StationLog](new WatermarkGeneratorSupplier[StationLog]{
override def createWatermarkGenerator(context: WatermarkGeneratorSupplier.Context): WatermarkGenerator[StationLog] = {
new CustomPunctuatedWatermark()
}
} )
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
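// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station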
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.sum("duration")
.print()
env.execute()
}
}
// Custom punctuated watermark generator
class CustomPunctuatedWatermark extends WatermarkGenerator[StationLog] {
// Maximum delay in milliseconds
val maxDelayTime = 2000L
var currentMaxTimeStamp = Long.MinValue + maxDelayTime + 1L
// Called once per event
override def onEvent(stationLog: StationLog, eventTimestamp: Long, watermarkOutput: WatermarkOutput): Unit = {
// Emit a watermark only for events from base station 001
if ("001".equals(stationLog.sid) ) {
currentMaxTimeStamp = Math.max(currentMaxTimeStamp,stationLog.callTime)
// Emit the watermark
watermarkOutput.emitWatermark(new Watermark(currentMaxTimeStamp - maxDelayTime - 1L))
}
}
// Called periodically
override def onPeriodicEmit(output: WatermarkOutput): Unit = {
// Intentionally empty: watermarks are emitted per event in onEvent, not periodically
}
}
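In the example above, the punctuated WatermarkGenerator leaves onPeriodicEmit without any logic, because watermarks derived from the event time of base station "001" can be checked for and emitted directly in onEvent.
After starting the code, enter the following data in the socket to test:
# Socket input; event times are out of order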
001,181,182,busy,1000,10
004,184,185,busy,4000,40
005,181,183,busy,5000,50
002,182,183,fail,2000,20
001,181,185,success,6000,60
# This record does not trigger the window because its sid is 003
003,182,184,busy,7000,30
# Once this record is entered, the window fires
001,183,184,busy,7000,30
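The effect of the punctuated generator on this input can be traced without Flink. The sketch below is a simplified model under stated assumptions: PunctuatedTrace is a hypothetical class that keeps only the sid check and the "largest timestamp - 2000 - 1" rule from the example, and it returns an empty value where the real generator would emit nothing.

```java
import java.util.OptionalLong;

/** Illustrative trace only; the class is hypothetical and models the onEvent logic of the example. */
final class PunctuatedTrace {
    private final long delayMillis = 2000L;
    private long maxTimestamp = Long.MIN_VALUE + delayMillis + 1;

    /** Returns the watermark emitted for this event, or empty when the event emits none. */
    OptionalLong onEvent(String sid, long callTime) {
        if (!"001".equals(sid)) {
            return OptionalLong.empty();
        }
        maxTimestamp = Math.max(maxTimestamp, callTime);
        return OptionalLong.of(maxTimestamp - delayMillis - 1);
    }

    public static void main(String[] args) {
        String[] sids = {"001", "004", "005", "002", "001", "003", "001"};
        long[] callTimes = {1000L, 4000L, 5000L, 2000L, 6000L, 7000L, 7000L};
        PunctuatedTrace generator = new PunctuatedTrace();
        for (int i = 0; i < sids.length; i++) {
            System.out.println(sids[i] + "@" + callTimes[i] + " -> "
                    + generator.onEvent(sids[i], callTimes[i]));
        }
    }
}
```

Only the three records from station 001 produce watermarks (-1001, 3999, and 4999), so the window [0, 5000) can fire only after the last record, even though station 003 already reported callTime 7000.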
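Windows and their classification
Window computation is a very important way of processing data in Flink. An unbounded real-time stream can be cut into windows by a fixed time span or a fixed number of elements, and each window's data is then aggregated to produce a result, which makes windows the core of processing unbounded streams. For example, when reading base-station logs as a real-time stream, we can count the outgoing calls of a base station every 5 seconds; the unbounded stream is then divided into a sequence of windows, each of which is a bounded stream, as shown below.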

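The stream in the figure above can be processed by Flink based on either processing time (ProcessTime) or event time (EventTime). If the stream contains late, out-of-order data, we can set a watermark; when the watermark reaches a window's end, that window's computation is triggered.
Window boundaries in Flink are aligned to UTC by default: windows are laid out at fixed positions starting from 1970-01-01 00:00:00. With 5-second windows, each window's start is therefore fixed: the first window covers [1970-01-01 00:00:00 ~ 1970-01-01 00:00:05), the second covers [1970-01-01 00:00:05 ~ 1970-01-01 00:00:10), and so on. The windows are closed on the left and open on the right: the first window does not contain data produced at 1970-01-01 00:00:05, which falls into the next window, and the next window treats its data the same way. This half-open convention corresponds to the 1 ms that is subtracted when computing the watermark, as described in the watermark section.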
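The fixed alignment and the half-open boundaries can be checked with a few lines of arithmetic. The sketch below assumes windows aligned to the epoch with no offset; TumblingWindowSketch and its methods are hypothetical names used for illustration, not Flink API.

```java
/** Illustrative arithmetic only; the class and method names are hypothetical, not Flink API. */
final class TumblingWindowSketch {

    /** Inclusive start of the epoch-aligned window that contains the timestamp. */
    static long windowStart(long timestampMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis, sizeMillis);
    }

    /** Exclusive end of that window. */
    static long windowEnd(long timestampMillis, long sizeMillis) {
        return windowStart(timestampMillis, sizeMillis) + sizeMillis;
    }

    public static void main(String[] args) {
        long size = 5000L;
        for (long ts : new long[] {0L, 4999L, 5000L, 9999L}) {
            System.out.println(ts + " -> [" + windowStart(ts, size) + "~" + windowEnd(ts, size) + ")");
        }
    }
}
```

A timestamp of 4999 still belongs to [0~5000), while 5000 already belongs to [5000~10000), which is the closed-left, open-right behaviour described above.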
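Keyed and Non-Keyed Windows
In Flink, windows can be applied to a KeyedStream or to a non-keyed stream, giving Keyed Windows and Non-Keyed Windows respectively.
Keyed Window
When a stream passes through the keyBy operator it becomes a KeyedStream, and applying a window to it yields a Keyed Window. A Keyed Window sets up windows separately for each key: records with the same key are handled by the same parallel task, windows are divided per key, and a result is emitted per key.
In Flink code a Keyed Window is used as follows; later sections demonstrate it with examples.
stream
.keyBy(...)
.window(...)/countWindow(...)
Non-Keyed Window
A window can also be applied directly to a stream that has not gone through keyBy; this is a Non-Keyed Window. All data is then assigned to the same window and processed by a single task, producing window statistics over the whole stream. Note that a Non-Keyed Window is not computed in parallel, regardless of the parallelism configured for the Flink program, so it is used less often in practice.
In code a Non-Keyed Window is used as follows; later sections demonstrate it with examples.
stream.windowAll(...)/countWindowAll(...)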
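Flink window types
Flink supports two kinds of windows: time-based windows and count-based windows. A time-based window is bounded by its start and end timestamps, whereas a count-based window fires once a fixed number of events has entered it, for example one window per 100 records.
Both Keyed Windows and Non-Keyed Windows support time windows and count windows. Time windows are assigned through Window Assigners, which determine the type of time window; count windows are created by calling countWindow (countWindowAll on a non-keyed stream) with the number of events per window.
Note also that Window Assigners support a global window, which needs custom trigger logic to fire; the trigger can be based on time or on the number of events. Later sections cover time windows and count windows in detail.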
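The idea behind a count-based window can be sketched in plain Java: buffer incoming values and emit a result each time a fixed number has arrived. This is a simplified, non-keyed model with hypothetical names (CountWindowSketch, add); it is not how Flink implements countWindow internally.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.OptionalLong;

/** Illustrative model of a count-based window; names are hypothetical, not Flink API. */
final class CountWindowSketch {
    private final int windowSize;
    private final List<Long> buffer = new ArrayList<>();

    CountWindowSketch(int windowSize) {
        this.windowSize = windowSize;
    }

    /** Adds a value; returns the window's sum once windowSize values have arrived, otherwise empty. */
    OptionalLong add(long value) {
        buffer.add(value);
        if (buffer.size() < windowSize) {
            return OptionalLong.empty();
        }
        long sum = 0L;
        for (long v : buffer) {
            sum += v;
        }
        buffer.clear(); // start collecting the next window
        return OptionalLong.of(sum);
    }

    public static void main(String[] args) {
        CountWindowSketch window = new CountWindowSketch(3);
        for (long duration : new long[] {10L, 20L, 30L, 40L}) {
            System.out.println(duration + " -> " + window.add(duration));
        }
    }
}
```

Here the third value completes a window and yields the sum 60; the fourth value starts a new window and produces no result yet.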
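Window Assigners
In Flink, Window Assigners distribute data to different time windows; a window function (such as process or aggregate) then follows to implement the business logic. Window Assigners support four window types: tumbling windows, sliding windows, session windows, and global windows. They are introduced in turn below.
Tumbling Window
A tumbling window slices the stream by a fixed time span. The window size is fixed, windows follow one another end to end, and their time ranges do not overlap. For example, with a tumbling window size of 5 seconds, a window is created and computed every 5 seconds, as shown in the figure below: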

Creating a window every fixed 5 seconds as above means the tumbling window has a length of 5 s. That length is the only parameter a tumbling window requires in Flink code. A Tumbling Window is specified as follows (using a KeyedStream as the example):
// Based on processing time
keyedDs.window(TumblingProcessingTimeWindows.of(Time.seconds(5)))
// Based on event time
keyedDs.window(TumblingEventTimeWindows.of(Time.seconds(5)))
Besides of(Time), TumblingEventTimeWindows has an overload of(Time, offset) that also takes an offset, for example TumblingEventTimeWindows.of(Time.seconds(5), Time.hours(-8)). The offset shifts window alignment. Without an offset, one-hour windows are 1:00:00.000 - 1:59:59.999, 2:00:00.000 - 2:59:59.999, and so on. With an offset of 15 minutes (TumblingEventTimeWindows.of(Time.hours(1), Time.minutes(15))), the windows become 1:15:00.000 - 2:14:59.999, 2:15:00.000 - 3:14:59.999, and so on. An important use of this parameter is adjusting windows for the time difference from UTC. Flink lays out windows from the standard epoch 1970-01-01 00:00:00 UTC. China is in UTC+8, where the epoch corresponds to 1970-01-01 08:00:00 local time, so a one-day window without an offset runs from 08:00 to 08:00 the next day in China. To make each daily window start at local midnight, set the offset to Time.hours(-8).
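The effect of the offset can be verified with the same kind of arithmetic. The sketch below is an assumption-laden illustration rather than Flink's implementation: WindowOffsetSketch and windowStart are hypothetical names, and the window start is computed as the timestamp minus the floor-modulo of (timestamp - offset) by the window size.

```java
import java.time.Duration;

/** Illustrative arithmetic only; the class and method names are hypothetical, not Flink API. */
final class WindowOffsetSketch {

    /** Inclusive start of the window containing the timestamp, with windows shifted by offsetMillis. */
    static long windowStart(long timestampMillis, long offsetMillis, long sizeMillis) {
        return timestampMillis - Math.floorMod(timestampMillis - offsetMillis, sizeMillis);
    }

    public static void main(String[] args) {
        long day = Duration.ofDays(1).toMillis();
        long minusEightHours = Duration.ofHours(-8).toMillis();
        // 1970-01-01 02:30 UTC, which is 10:30 local time in UTC+8.
        long ts = Duration.ofMinutes(150).toMillis();
        // Without an offset the daily window starts at 00:00 UTC, i.e. 08:00 local time.
        System.out.println(windowStart(ts, 0L, day));
        // With an offset of -8 h it starts at 16:00 UTC the previous day, i.e. 00:00 local time.
        System.out.println(windowStart(ts, minusEightHours, day));
    }
}
```

The second call returns -28800000 ms, that is 1969-12-31 16:00 UTC or 1970-01-01 00:00 in UTC+8, so the daily window now begins at local midnight.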
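The following demonstrates Tumbling Window code and tests on both a KeyedStream and a Non-KeyedStream. Event time is used as the time semantics; processing time is not demonstrated again.
KeyedStream
Example: read base-station log data and, in 5-second windows, compute the total call duration of all outgoing calls for each base station.
- Java code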
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Watermark strategy with a delay of 2 s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Field that carries the event time
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
);
// Key the stream by base-station id
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
// In 5-second event-time windows, sum the call duration of all outgoing calls per station
SingleOutputStreamOperator<String> result = keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
// Sum the call duration of all outgoing calls for this station
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
// Read the window's start and end timestamps
long start = context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
- Scala code
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Import the implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
// Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
// Assign timestamps and watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
// Watermark strategy with a maximum delay of 2 s
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
// Extract the event time from each record
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
)
// Key by base-station id and, in 5-second windows, sum the call duration of all outgoing calls per station
dsWithWatermark
.keyBy(_.sid)
.window(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessWindowFunction[StationLog,String,String,TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
// Read the window's start and end timestamps
val startTime: Long = context.window.getStart
val endTime: Long = context.window.getEnd
// Sum the call duration of all outgoing calls for this station
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
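The code above uses the process function, in which the TimeWindow object gives access to the window's start and end times and the window's data can be processed. After the code is written and started, enter the following data in the socket; a new tumbling window is produced for every 5 seconds of event time.
# Socket input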
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
# Once this record is entered, the first window's results are emitted
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
# Once this record is entered, the second window's results are emitted
003,181,183,busy,12000,50
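The output is as follows; each line shows the window range (窗口范围), the base station (基站), and the total call duration of all outgoing calls (所有主叫通话总时长):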
窗口范围:[0~5000),基站:001,所有主叫通话总时长:40
窗口范围:[0~5000),基站:002,所有主叫通话总时长:20
窗口范围:[5000~10000),基站:003,所有主叫通话总时长:50
窗口范围:[5000~10000),基站:002,所有主叫通话总时长:100
窗口范围:[5000~10000),基站:001,所有主叫通话总时长:10
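These totals can be reproduced by hand or with a small model that assigns each record to its 5-second window and sums duration per station. The sketch below is plain Java, not a Flink job: KeyedTumblingSumSketch and its "start|sid" map key are hypothetical, and it ignores watermarks and lateness, which is safe here only because no record in this input arrives after its window has fired.

```java
import java.util.Map;
import java.util.TreeMap;

/** Illustrative model only; names are hypothetical and watermarks/lateness are deliberately ignored. */
final class KeyedTumblingSumSketch {

    /** Sums duration per (window start, station id); the map key has the form "start|sid". */
    static Map<String, Long> sumPerWindowAndKey(String[] lines, long sizeMillis) {
        Map<String, Long> sums = new TreeMap<>();
        for (String line : lines) {
            String[] fields = line.split(",");
            String sid = fields[0].trim();
            long callTime = Long.parseLong(fields[4].trim());
            long duration = Long.parseLong(fields[5].trim());
            long start = callTime - Math.floorMod(callTime, sizeMillis);
            sums.merge(start + "|" + sid, duration, Long::sum);
        }
        return sums;
    }

    public static void main(String[] args) {
        String[] lines = {
            "001,181,182,busy,1000,10", "002,182,183,fail,3000,20",
            "001,183,184,busy,2000,30", "002,184,185,busy,6000,40",
            "003,181,183,busy,5000,50", "001,181,182,busy,7000,10",
            "002,182,183,fail,9000,20", "001,183,184,busy,11000,30",
            "002,184,185,busy,6000,40", "003,181,183,busy,12000,50"
        };
        sumPerWindowAndKey(lines, 5000L).forEach((key, sum) -> System.out.println(key + " -> " + sum));
    }
}
```

The sketch also reports sums for the window starting at 10000; the Flink job has not emitted those yet because the watermark (12000 - 2000 - 1 = 9999) has not reached that window's end.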
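Non-KeyedStream
Example: read base-station log data and, in 5-second windows, compute the total call duration of all outgoing calls across all base stations.
In this example the window aggregates the total duration over all input data. Although the business logic resembles the KeyedStream example, there is no need to key by base-station id, and the window method differs as well: on a Non-KeyedStream, call windowAll directly.
- Java code: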
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
// Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
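// Assign timestamps and watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
// Watermark strategy with a delay of 2 s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
// Field that carries the event time
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
// Mark a subtask idle after 5 s without data so the watermark can keep advancing
.withIdleness(Duration.ofSeconds(5))
);
// In 5-second event-time windows, sum the call duration of all outgoing calls across all stations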
dsWithWatermark.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessAllWindowFunction<StationLog, String, TimeWindow>() {
@Override
public void process(ProcessAllWindowFunction<StationLog, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
// Sum the call duration of all outgoing calls across all stations
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
// Read the window's start and end timestamps
long start = context.window().getStart();
long end =context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),所有基站主叫通话总时长:" + sumCallTime);
}
}).print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
// Import the implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample socket input:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//No grouping here: windowAll sums the call duration of all base stations in each 5s window
dsWithWatermark
.windowAll(TumblingEventTimeWindows.of(Time.seconds(5)))
.process(new ProcessAllWindowFunction[StationLog,String,TimeWindow] {
override def process(context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),所有基站主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
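After the code above is written and started, enter the following data in the socket; a new tumbling window is produced for every 5 seconds of event time:
#Data entered in the socket: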
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
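#The next record (callTime 7000) raises the watermark to 4999 (7000 - 2000 - 1), which fires the first window [0~5000)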
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record (callTime 12000) raises the watermark to 9999, which fires the second window [5000~10000)
003,181,183,busy,12000,50
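The firing points in the sample input above can be reproduced without a cluster. The following is a minimal plain-Java sketch, not Flink code: it assumes the watermark equals the largest callTime seen so far minus the 2s delay minus 1 ms, and that a window fires once the watermark reaches the window's last millisecond (end - 1). The class and method names (TumblingFiringSketch, replay) are hypothetical and exist only for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

public class TumblingFiringSketch {

    // Watermark of a bounded-out-of-orderness strategy: max timestamp seen - delay - 1 ms.
    static long watermark(long maxTs, long delayMs) {
        return maxTs - delayMs - 1;
    }

    // Replays {callTime, duration} pairs and returns one line per fired window, in firing order.
    static List<String> replay(long[][] events, long sizeMs, long delayMs) {
        TreeMap<Long, Long> sums = new TreeMap<>();  // window start -> summed duration
        List<String> fired = new ArrayList<>();
        long maxTs = Long.MIN_VALUE + delayMs + 1;    // makes the initial watermark Long.MIN_VALUE
        for (long[] e : events) {
            long ts = e[0];
            long start = ts - Math.floorMod(ts, sizeMs);
            // An element whose window is already closed by the watermark is dropped as late.
            if (start + sizeMs - 1 > watermark(maxTs, delayMs)) {
                sums.merge(start, e[1], Long::sum);
            }
            maxTs = Math.max(maxTs, ts);
            long wm = watermark(maxTs, delayMs);
            // Fire every window whose last millisecond (end - 1) is covered by the watermark.
            while (!sums.isEmpty() && sums.firstKey() + sizeMs - 1 <= wm) {
                long s = sums.firstKey();
                fired.add("[" + s + "~" + (s + sizeMs) + ") sum=" + sums.remove(s) + " firedBy=" + ts);
            }
        }
        return fired;
    }

    // {callTime, duration} of the ten sample records, in input order.
    static final long[][] SAMPLE = {
        {1000, 10}, {3000, 20}, {2000, 30}, {6000, 40}, {5000, 50},
        {7000, 10}, {9000, 20}, {11000, 30}, {6000, 40}, {12000, 50}
    };

    public static void main(String[] args) {
        // 5s tumbling windows, 2s out-of-orderness delay
        replay(SAMPLE, 5000, 2000).forEach(System.out::println);
    }
}
```

With the sample data this reports that [0~5000) is fired by the record with callTime 7000 and [5000~10000) by the record with callTime 12000. The second 6000 record is still counted, because its window has not fired when it arrives.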
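Sliding Window
A sliding window, like a tumbling window, has a fixed length. You specify one time parameter for the window length (window size) and, because the window slides, a second parameter for the slide step (window slide), which controls how often a new window starts. For example, with a window size of 10 seconds and a window slide of 5 seconds, every 5 seconds a window covering the most recent 10 seconds is produced and evaluated, as shown in the figure below: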

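As the figure shows, when the window slide is smaller than the window size, windows overlap, and one event can be assigned to several windows. When the window slide equals the window size, a sliding window behaves exactly like a tumbling window, so a tumbling window can be seen as a special case of a sliding window; the sliding window definition is simply more flexible. A window slide larger than the window size is also allowed, but then some data belongs to no window at all and the statistics become inaccurate. In practice the window slide is therefore usually set less than or equal to the window size, preferably with the size an integer multiple of the slide.
In Flink code, a Sliding Window is specified as follows (using a KeyedStream as an example):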
#Based on processing time
keyedDs.window(SlidingProcessingTimeWindows.of(Time.seconds(10), Time.seconds(5)))
#Based on event time
keyedDs.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
In the code above, the of method of SlidingProcessingTimeWindows takes two arguments: the first is the window size and the second is the window slide. As with tumbling windows, the of method can also take an offset argument for aligning windows.
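To make the overlap concrete, the sketch below computes which sliding windows a single timestamp falls into. It is plain Java that mirrors the arithmetic of a sliding assigner under the stated size, slide, and offset; it is an illustration, not Flink's source, and the names SlidingAssignSketch and assign are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

public class SlidingAssignSketch {

    // Returns the {start, end} pair of every sliding window [start, end) that contains ts.
    static List<long[]> assign(long ts, long sizeMs, long slideMs, long offsetMs) {
        List<long[]> windows = new ArrayList<>();
        // Start of the latest window that begins at or before ts.
        long lastStart = ts - Math.floorMod(ts - offsetMs, slideMs);
        // Walk backwards one slide at a time while the window still covers ts.
        for (long start = lastStart; start > ts - sizeMs; start -= slideMs) {
            windows.add(new long[] {start, start + sizeMs});
        }
        return windows;
    }

    public static void main(String[] args) {
        // size 10s, slide 5s, no offset: every timestamp belongs to size / slide = 2 windows
        for (long ts : new long[] {1000, 6000}) {
            for (long[] w : assign(ts, 10_000, 5_000, 0)) {
                System.out.println("ts=" + ts + " -> [" + w[0] + "~" + w[1] + ")");
            }
        }
    }
}
```

For a timestamp of 1000 the result is [0~10000) and [-5000~5000). A window that starts before 0 is why code that prints window ranges may want to clamp a negative start to 0 for display.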
The following sections show how to write and test Sliding Window code on a KeyedStream and on a Non-KeyedStream. Event time is used as the time semantics; processing time is not demonstrated again.
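KeyedStream
Case: read base station log data and, every 5s, compute the total outgoing-call duration of each base station over the most recent 10s.
- Java code: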
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
//Every 5s, sum each base station's call duration over the most recent 10s, using event time
SingleOutputStreamOperator<String> result = keyedStream
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
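//Window start; the earliest sliding window begins before 0 (for example [-5000~5000)), so a negative start is clamped to 0 for display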
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//Group by base station id; every 5s, sum each base station's call duration over the most recent 10s
dsWithWatermark
.keyBy(_.sid)
.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = if(context.window.getStart <0) 0 else context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
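After the code above is written and started, enter the following data in the socket; every 5 seconds of event time, a window over the most recent 10 seconds is produced and aggregated:
#Data entered in the socket: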
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
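#The next record (watermark 4999) fires window [0~5000), which is really [-5000~5000) with its start clamped to 0 by the code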
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record (watermark 9999) fires window [0~10000)
003,181,183,busy,12000,50
#The next record (watermark 14999) fires window [5000~15000)
003,181,183,busy,17000,50
The output is as follows:
窗口范围:[0~5000),基站:001,所有主叫通话总时长:40
窗口范围:[0~5000),基站:002,所有主叫通话总时长:20
窗口范围:[0~10000),基站:001,所有主叫通话总时长:50
窗口范围:[0~10000),基站:002,所有主叫通话总时长:120
窗口范围:[0~10000),基站:003,所有主叫通话总时长:50
窗口范围:[5000~15000),基站:002,所有主叫通话总时长:100
窗口范围:[5000~15000),基站:003,所有主叫通话总时长:100
窗口范围:[5000~15000),基站:001,所有主叫通话总时长:40
Non-KeyedStream
Case: read base station log data and, every 5s, compute the total call duration of all records in the most recent 10s.
- Java code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//Every 5s, sum the call duration of all records in the most recent 10s (non-keyed windowAll), using event time
dsWithWatermark.windowAll(SlidingEventTimeWindows.of(Time.seconds(10),Time.seconds(5)))
.process(new ProcessAllWindowFunction<StationLog, String, TimeWindow>() {
@Override
public void process(ProcessAllWindowFunction<StationLog, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end =context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),所有基站主叫通话总时长:" + sumCallTime);
}
}).print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//No grouping here: every 5s, windowAll sums the call duration of all records in the most recent 10s
dsWithWatermark
.windowAll(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
.process(new ProcessAllWindowFunction[StationLog, String, TimeWindow] {
override def process(context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = if(context.window.getStart < 0) 0 else context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),所有基站主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
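After the code above is written and started, enter the following data in the socket; every 5 seconds of event time, a window over all data of the most recent 10 seconds is produced and aggregated:
#Data entered in the socket: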
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
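#The next record (watermark 4999) fires window [0~5000), which is really [-5000~5000) with its start clamped to 0 by the code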
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#The next record (watermark 9999) fires window [0~10000)
003,181,183,busy,12000,50
#The next record (watermark 14999) fires window [5000~15000)
003,181,183,busy,17000,50
The output is as follows:
窗口范围:[0~5000),所有基站主叫通话总时长:60
窗口范围:[0~10000),所有基站主叫通话总时长:220
窗口范围:[5000~15000),所有基站主叫通话总时长:240
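Session Window
A session window groups the data of one period of continuous activity into a single window. The window is delimited by the session gap: if no data arrives within the configured gap, the window is considered finished and its computation is triggered. Unlike tumbling and sliding windows, a session window needs no fixed window size or slide time; you only define the session gap, which is the upper bound on inactivity. Session windows also do not overlap each other. They suit non-continuous data and data that is produced in periodic bursts.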

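As the figure above shows, a session window has no fixed start or end time. Internally, Flink creates one window per record and then merges the windows whose distance does not exceed the configured gap; the merged result is one session window, which fires once the gap has passed without new data.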
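The merge step can be sketched in plain Java. This is an illustration under simplifying assumptions, not Flink code: each record opens a window [timestamp, timestamp + gap), the records are sorted first, watermarks and late data are ignored, and windows that overlap or touch are merged. The names SessionMergeSketch and sessions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SessionMergeSketch {

    // Builds one window [ts, ts + gap) per timestamp and merges windows that overlap or touch.
    static List<long[]> sessions(long[] timestamps, long gapMs) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<long[]> merged = new ArrayList<>();
        for (long ts : sorted) {
            long[] last = merged.isEmpty() ? null : merged.get(merged.size() - 1);
            if (last != null && last[1] >= ts) {
                // The new window starts no later than the current session ends: extend the session.
                last[1] = Math.max(last[1], ts + gapMs);
            } else {
                // The gap has passed without data: a new session begins.
                merged.add(new long[] {ts, ts + gapMs});
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Hypothetical timestamps in ms, with a 3s gap
        long[] callTimes = {1000, 3000, 2000, 6000, 5000, 11000};
        for (long[] s : sessions(callTimes, 3000)) {
            System.out.println("[" + s[0] + "~" + s[1] + ")");
        }
    }
}
```

With these timestamps the first five records collapse into one session [1000~9000), while 11000 arrives after the gap and opens a separate session [11000~14000).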
In Flink code, a Session Window is specified as follows (using a KeyedStream as an example):
#Based on processing time
keyedDs.window(ProcessingTimeSessionWindows.withGap(Time.seconds(3)))
or
keyedDs.window(ProcessingTimeSessionWindows.withDynamicGap(...))
#Based on event time
keyedDs.window(EventTimeSessionWindows.withGap(Time.seconds(3)))
or
keyedDs.window(EventTimeSessionWindows.withDynamicGap(...))
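In the code above, the withGap method of ProcessingTimeSessionWindows takes a duration: the timeout after which, if no data has arrived, the session window is closed. Besides withGap, ProcessingTimeSessionWindows also has a withDynamicGap method, which lets the timeout be decided dynamically per record.
The following sections show how to write and test Session Window code on a KeyedStream and on a Non-KeyedStream. Event time is used as the time semantics; processing time is not demonstrated again.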
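KeyedStream
Case: read base station log data; when a base station ID has no call record for 3s, close its session window and compute that base station's total call duration within the session.
- Java code: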
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
//Per base station, sum the call duration of each session window (3s gap), using event time
SingleOutputStreamOperator<String> result = keyedStream
.window(EventTimeSessionWindows.withGap(Time.seconds(3)))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//Group by base station id and sum the call duration of each session window (3s gap)
dsWithWatermark
.keyBy(_.sid)
.window(EventTimeSessionWindows.withGap(Time.seconds(3)))
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = if(context.window.getStart <0) 0 else context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
After the code above is written and started, enter the following data in the socket:
#Data entered in the socket:
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
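#The next record raises the watermark to about 5000 (precisely 4999), which reaches the gap timeout of sid 001, so the result for sid 001 is output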
003,181,183,busy,7000,50
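#The next record raises the watermark to about 9000 (precisely 8999), which reaches the gap timeout of sid 002, so the result for sid 002 is output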
003,181,183,busy,11000,50
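#The next record raises the watermark to about 10000 (precisely 9999), which reaches the gap timeout of sid 003, so the result for sid 003 is output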
004,181,183,busy,12000,50
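When "003,181,183,busy,7000,50" is entered, the watermark becomes about 5000 (7000 minus the 2s delay). That reaches the event time of base station 001's last record, "001,183,184,busy,2000,30", plus the 3-second gap, so the statistics for base station 001 are output. When "003,181,183,busy,11000,50" is entered, the watermark becomes about 9000, which reaches the event time of "002,184,185,busy,6000,40" plus the 3-second gap, so the result for base station 002 is printed, and so on. The final output is as follows: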
窗口范围:[1000~5000),基站:001,所有主叫通话总时长:40
窗口范围:[3000~9000),基站:002,所有主叫通话总时长:60
窗口范围:[5000~10000),基站:003,所有主叫通话总时长:100
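Besides giving every key the same session timeout with EventTimeSessionWindows.withGap(Time.seconds(3)), you can use EventTimeSessionWindows.withDynamicGap(...) to give different base stations different session timeouts.
Case: read base station log data, assign different session timeouts to different base stations (3s for base station 001, 4s for all others), and compute the total call duration of each session window.
- Java code: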
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//按照基站ID进行分组,并每隔5s统计每个基站所有主叫通话总时长
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
//Per base station, sum the call duration of each session window (dynamic gap: 3s for 001, 4s otherwise), using event time
SingleOutputStreamOperator<String> result = keyedStream
.window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor<StationLog>() {
@Override
public long extract(StationLog element) {
if ("001".equals(element.sid)) {
return 3000;
}else{
return 4000;
}
}
}))
.process(new ProcessWindowFunction<StationLog, String, String, TimeWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, TimeWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime);
}
});
result.print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//Group by base station id and sum the call duration of each session window (dynamic gap: 3s for 001, 4s otherwise)
dsWithWatermark
.keyBy(_.sid)
.window(EventTimeSessionWindows.withDynamicGap(new SessionWindowTimeGapExtractor[StationLog] {
override def extract(element: StationLog): Long = {
if ("001".equals(element.sid)) {
3000L
} else {
4000L
}
}
}))
.process(new ProcessWindowFunction[StationLog, String, String, TimeWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = if(context.window.getStart <0) 0 else context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),基站:" + key + ",所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
After the code above is written and started, enter the following data in the socket:
#Data entered in the socket:
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
#The next record raises the watermark to about 5000, which reaches the 3s gap timeout of sid 001, so the result for sid 001 is output
003,181,183,busy,7000,50
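#The next record raises the watermark to about 9000, which does not yet reach the 4s gap timeout of sid 002 (its session ends at 10000)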
003,181,183,busy,11000,50
#The next record raises the watermark to about 10000, which reaches the gap timeout of sid 002, so the result for sid 002 is output
004,181,183,busy,12000,50
The final output is as follows:
窗口范围:[1000~5000),基站:001,所有主叫通话总时长:40
窗口范围:[3000~10000),基站:002,所有主叫通话总时长:60
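The two session ranges above can be checked by hand with a small plain-Java sketch. It is not Flink code: it assumes the gap rule of this case (3s for base station 001, 4s otherwise), merges each key's windows [callTime, callTime + gap) when they overlap or touch, and ignores watermarks, so it lists every session as if the stream had ended. The names DynamicGapSketch, Event, gapFor, and sessions are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class DynamicGapSketch {

    static final class Event {
        final String sid;
        final long callTime;
        final long duration;
        Event(String sid, long callTime, long duration) {
            this.sid = sid;
            this.callTime = callTime;
            this.duration = duration;
        }
    }

    // Gap rule of the case: 3s for base station 001, 4s for every other base station.
    static long gapFor(Event e) {
        return "001".equals(e.sid) ? 3000L : 4000L;
    }

    // Returns, per sid, one "[start~end) sum=..." line for each merged session.
    static Map<String, List<String>> sessions(List<Event> events) {
        Map<String, List<Event>> byKey = new TreeMap<>();
        for (Event e : events) {
            byKey.computeIfAbsent(e.sid, k -> new ArrayList<>()).add(e);
        }
        Map<String, List<String>> result = new TreeMap<>();
        for (Map.Entry<String, List<Event>> entry : byKey.entrySet()) {
            List<Event> sorted = new ArrayList<>(entry.getValue());
            sorted.sort(Comparator.comparingLong((Event e) -> e.callTime));
            List<String> lines = new ArrayList<>();
            long start = 0, end = 0, sum = 0;
            boolean open = false;
            for (Event e : sorted) {
                if (open && e.callTime <= end) {
                    // Still inside the current session: extend its end and add the duration.
                    end = Math.max(end, e.callTime + gapFor(e));
                    sum += e.duration;
                } else {
                    if (open) {
                        lines.add("[" + start + "~" + end + ") sum=" + sum);
                    }
                    start = e.callTime;
                    end = e.callTime + gapFor(e);
                    sum = e.duration;
                    open = true;
                }
            }
            if (open) {
                lines.add("[" + start + "~" + end + ") sum=" + sum);
            }
            result.put(entry.getKey(), lines);
        }
        return result;
    }

    // The eight records entered in the socket for this case.
    static List<Event> sample() {
        return Arrays.asList(
            new Event("001", 1000, 10), new Event("002", 3000, 20),
            new Event("001", 2000, 30), new Event("002", 6000, 40),
            new Event("003", 5000, 50), new Event("003", 7000, 50),
            new Event("003", 11000, 50), new Event("004", 12000, 50));
    }

    public static void main(String[] args) {
        sessions(sample()).forEach((sid, lines) -> System.out.println(sid + " -> " + lines));
    }
}
```

The sketch yields [1000~5000) with sum 40 for 001 and [3000~10000) with sum 60 for 002, matching the output above. It also lists sessions for 003 and 004, which do not appear in the Flink output because the watermark has not yet passed their session ends.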
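Non-KeyedStream
Case: read base station log data; when the whole stream has no call record for 3s, close the session window and compute the total call duration within it.
- Java code: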
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//Sum the call duration of all records in each session window (3s gap, non-keyed windowAll), using event time
dsWithWatermark
.windowAll(EventTimeSessionWindows.withGap(Time.seconds(3)))
.process(new ProcessAllWindowFunction<StationLog, String, TimeWindow>() {
@Override
public void process(ProcessAllWindowFunction<StationLog, String, TimeWindow>.Context context, Iterable<StationLog> elements, Collector<String> out) throws Exception {
//统计每个基站所有主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
//获取窗口起始时间
long start = context.window().getStart() <0 ? 0 : context.window().getStart();
long end = context.window().getEnd();
out.collect("窗口范围:[" + start + "~" + end + "),所有主叫通话总时长:" + sumCallTime);
}
}).print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//将数据转换成StationLog对象
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//给 stationLogDS 设置水位线
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//设置水位线策略
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//设置事件时间抽取器
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
)
//No grouping here: windowAll sums the call duration of all records in each session window (3s gap)
dsWithWatermark
.windowAll(EventTimeSessionWindows.withGap(Time.seconds(3)))
.process(new ProcessAllWindowFunction[StationLog,String,TimeWindow] {
override def process(context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//获取窗口起始时间
val startTime: Long = if(context.window.getStart <0) 0 else context.window.getStart
val endTime: Long = context.window.getEnd
//统计每个基站所有主叫通话总时长
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("窗口范围:[" + startTime + "~" + endTime + "),所有主叫通话总时长:" + sumCallTime)
}
}).print()
env.execute()
After the code above is written and started, enter the following data in the socket:
#Data entered in the socket:
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
003,181,183,busy,5000,50
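#The next record raises the watermark to about 9000 (11000 minus the 2s delay), which reaches the largest event time entered so far (6000) plus the 3s gap, so the result is output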
003,181,183,busy,11000,50
The final output is as follows:
窗口范围:[1000~9000),所有主叫通话总时长:150
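Global Window
In Flink, the global window is a special kind of window. On a KeyedStream it collects all records with the same key into one global window; on a Non-KeyedStream it collects all records into a single global window. The window only fires through a trigger that we define ourselves. In that trigger we can fire the global window based on time, on the number of records, or on our own rules. If no trigger is defined, the global window never fires.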

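When the window types that Flink provides cannot meet production needs, or the business requires more complex firing logic, we have to define the window firing logic ourselves. In Flink code, a Global Window is specified as follows (using a KeyedStream as an example):
#A trigger must be set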
... ...
keyedDs.window(GlobalWindows.create())
//Custom trigger
.trigger(new MyTrigger())
... ...
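A global window is likewise created by calling the window method on the stream and passing GlobalWindows.create(), but the call must be followed by a trigger call that passes the trigger which fires the window. The trigger can be a custom one that extends the Trigger abstract class; custom triggers are covered in a later subsection.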
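What a count-based trigger on a global window amounts to can be sketched in plain Java. This is an illustration of the behavior only, not the Flink Trigger API: each key has a buffer that stands in for its global window, and when the buffer reaches the threshold the sum is emitted and the buffer is emptied, which corresponds to firing and purging the window. The names CountTriggerSketch and run, and the sample input, are hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class CountTriggerSketch {

    // events: {sid, duration} pairs in arrival order; emits "sid:sum" every `threshold` records per sid.
    static List<String> run(String[][] events, int threshold) {
        Map<String, List<Long>> buffers = new HashMap<>();  // stands in for one global window per key
        List<String> out = new ArrayList<>();
        for (String[] e : events) {
            List<Long> buffer = buffers.computeIfAbsent(e[0], k -> new ArrayList<>());
            buffer.add(Long.parseLong(e[1]));
            if (buffer.size() == threshold) {
                long sum = 0L;
                for (long d : buffer) {
                    sum += d;
                }
                out.add(e[0] + ":" + sum);  // fire: evaluate the window contents
                buffer.clear();             // purge: drop the contents so counting restarts
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical input: base station id and call duration
        String[][] events = {
            {"001", "10"}, {"002", "20"}, {"001", "30"},
            {"002", "40"}, {"002", "50"}, {"001", "10"}
        };
        run(events, 3).forEach(System.out::println);
    }
}
```

Here base station 002 reaches three records first and emits 110, then 001 emits 50. Without the purge step, each later firing would also include the records that had already been counted.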
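The following sections show how to write and test GlobalWindow code on a KeyedStream and on a Non-KeyedStream. Event time is used as the time semantics; processing time is not demonstrated again.
KeyedStream
Case: read base station log data and, with a manually specified trigger, fire one computation for every 3 records of each base station ID.
- Java code: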
/**
* Flink GlobalWindow test based on EventTime
* Case: read base station log data and, with a manually specified trigger, fire one computation for every 3 records of each base station ID.
*/
public class GlobalWindowWithKeyTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 003,181,183,busy,12000,50
* 003,181,183,busy,17000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//将数据转换成StationLog对象
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//设置水位线
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//设置watermark ,延迟时间为2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//设置时间戳列信息
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//设置并行度空闲时间,方便推进水位线
.withIdleness(Duration.ofSeconds(5))
);
//Group by base station ID
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
keyedStream.window(GlobalWindows.create())
//自定义触发器,每3条数据触发一次计算
.trigger(new MyCountTrigger())
//自定义窗口函数,统计每个基站所有主叫通话总时长
.process(new ProcessWindowFunction<StationLog, String, String, GlobalWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, GlobalWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//统计每个基站近3个主叫通话总时长
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
out.collect("基站:" + key + ",近3条通话总时长:" + sumCallTime);
}
}).print();
env.execute();
}
}
//MyCountTrigger: fires one computation for every 3 records of a key
class MyCountTrigger extends Trigger<StationLog, GlobalWindow> {
//设置 ValueStateDescriptor描述器,用于存储计数器
private ValueStateDescriptor<Long> eventCountDescriptor = new ValueStateDescriptor<Long>("event-count", Long.class);
//每来一条数据,都会调用一次
@Override
public TriggerResult onElement(StationLog element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
//获取状态计数器的值
ValueState<Long> eventState = ctx.getPartitionedState(eventCountDescriptor);
//每来一条数据,状态值加1,初始状态值为null,直接返回1即可
Long count = eventState.value() == null ? 1L :eventState.value()+1L;
//将计数器的值存入状态中
eventState.update(count);
//如果计数器的值等于3,触发计算,并清空计数器
if (count == 3L) {
//清空状态计数
eventState.clear();
//触发计算
return TriggerResult.FIRE_AND_PURGE;
}
//如果状态计数器的值不等于3,不触发计算
return TriggerResult.CONTINUE;
}
//Called when a processing-time timer fires. Such a timer would have to be registered in onElement; this count trigger registers none, so it simply continues
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
//Called when an event-time timer fires. Such a timer would have to be registered in onElement; this count trigger registers none, so it simply continues
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
//clear() 方法处理在对应窗口被移除时所需的逻辑。
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
ctx.getPartitionedState(eventCountDescriptor).clear();
}
}
- Scala code:
/**
* Flink GlobalWindow test based on EventTime
* Case: read base station log data and, with a manually specified trigger, fire one computation for every 3 records of each base station ID.
*/
object GlobalWindowWithKeyTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//导入隐式转换
import org.apache.flink.streaming.api.scala._
/**
* Socket中输入数据格式如下:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 002,181,183,busy,5000,50
* 001,181,182,busy,7000,10
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//Assign watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//Timestamp assigner that extracts the event time
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
)
//Key by station ID and fire a computation for every 3 records per station
dsWithWatermark
.keyBy(_.sid)
.window(GlobalWindows.create())
.trigger(new MyCountTrigger())
.process(new ProcessWindowFunction[StationLog, String, String, GlobalWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//Sum the call duration of the 3 records in this station's window
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("基站:" + key + ",近3条通话总时长:" + sumCallTime)
}
}).print()
env.execute()
}
}
//MyCountTrigger: fires the window once every 3 records (per key)
class MyCountTrigger extends Trigger[StationLog, GlobalWindow] {
//ValueStateDescriptor used to store the per-key record counter
private val eventCountDescriptor = new ValueStateDescriptor[Long]("event-count", classOf[Long])
//Called once for every incoming record
override def onElement(element: StationLog,
timestamp: Long,
window: GlobalWindow,
ctx: Trigger.TriggerContext): TriggerResult = {
//Get the counter state
val eventState = ctx.getPartitionedState(eventCountDescriptor)
//Increment the counter for each record; an empty state counts as 0, so the first record yields 1
val count = Option(eventState.value()).getOrElse(0L) + 1L
//Store the updated counter in state
eventState.update(count)
//When the counter reaches 3, clear it and fire the window
if (count == 3L) {
//Clear the counter state
eventState.clear()
//Fire the window and purge its contents
TriggerResult.FIRE_AND_PURGE
} else {
//The counter has not reached 3 yet: do not fire
TriggerResult.CONTINUE
}
}
//Called when a processing-time timer registered in onElement fires; no timer is registered here, so nothing happens
override def onProcessingTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
//Called when an event-time timer registered in onElement fires; no timer is registered here, so nothing happens
override def onEventTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
//clear() holds the cleanup logic that runs when the corresponding window is removed
override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
ctx.getPartitionedState(eventCountDescriptor).clear()
}
}
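The code above passes a custom trigger to the trigger method to fire the global window. In the custom trigger, onElement is called once for every record. Here a ValueState counts the records of each key; when the count for a key reaches 3, onElement returns "TriggerResult.FIRE_AND_PURGE", which fires the window and runs the downstream process logic. Besides counting, a custom trigger can also register ProcessTime or EventTime timers in onElement; when such a timer fires, "onProcessingTime" or "onEventTime" is called, and these methods can return "TriggerResult.FIRE_AND_PURGE" to fire the window. For firing a global window with timers, see the later section on window triggers (trigger).
The business logic above fires the global window of a key whenever 3 records have arrived for that key. This is independent of event time and processing time, and therefore also independent of out-of-order events. Under this logic the window behaves as a CountWindow (the global window is in fact the underlying implementation of the Count Window). After writing and running the code, enter the following data into the socket:
#Data entered into the socket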
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
#Entering this record brings SID 002 to 3 records and fires its window
002,181,183,busy,5000,50
#Entering this record brings SID 001 to 3 records and fires its window
001,181,183,busy,7000,50
The final output is:
基站:002,近3条通话总时长:110
基站:001,近3条通话总时长:90
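The per-key counting described above can be checked without a Flink cluster. The following is a minimal plain-Java sketch, not Flink code: the class name KeyedCountTriggerSketch and its run method are hypothetical, and the two maps only stand in for the window contents and the trigger's ValueState. It assumes the input format used in this case (station ID in field 0, call duration in field 5).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-alone simulation of a keyed global window with a count trigger.
class KeyedCountTriggerSketch {

    // Returns one "sid=sum" entry per firing, in firing order.
    static List<String> run(List<String> lines, int threshold) {
        Map<String, List<Long>> windowContents = new HashMap<>(); // stands in for the per-key window buffer
        Map<String, Long> counters = new HashMap<>();             // stands in for the per-key counter state
        List<String> fired = new ArrayList<>();
        for (String line : lines) {
            String[] arr = line.split(",");
            String sid = arr[0].trim();
            long duration = Long.parseLong(arr[5].trim());
            windowContents.computeIfAbsent(sid, k -> new ArrayList<>()).add(duration);
            long count = counters.merge(sid, 1L, Long::sum);
            if (count == threshold) {
                long sum = 0L;
                for (long d : windowContents.get(sid)) {
                    sum += d;
                }
                fired.add(sid + "=" + sum);
                // Mimics FIRE_AND_PURGE: drop the buffered records and reset the counter.
                windowContents.remove(sid);
                counters.remove(sid);
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "001,181,182,busy,1000,10",
                "002,182,183,fail,3000,20",
                "001,183,184,busy,2000,30",
                "002,184,185,busy,6000,40",
                "002,181,183,busy,5000,50",
                "001,181,183,busy,7000,50");
        System.out.println(run(input, 3)); // prints [002=110, 001=90]
    }
}
```

Station 002 fires first because its third record arrives before the third record of station 001, which matches the order of the output above.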
Non-KeyedStream
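Case: read base-station log data and use a manually specified trigger to fire a computation for every 3 records across all stations.
- Java code:
/**
* Flink GlobalWindow test based on EventTime
* Case: read base-station log data and use a manually specified trigger to fire a computation for every 3 records across all stations.
*/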
public class GlobalWindowWithoutKeyTest {
public static void main(String[] args) throws Exception {
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 003,181,183,busy,5000,50
* 001,181,182,busy,7000,10
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
//Assign watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
);
dsWithWatermark.windowAll(GlobalWindows.create())
//Custom trigger: fire a computation once every 3 records
.trigger(new MyGlobalCountTrigger())
//Custom window function: sum the call duration of the records in the window
.process(new ProcessAllWindowFunction<StationLog, String, GlobalWindow>() {
@Override
public void process(ProcessAllWindowFunction<StationLog, String, GlobalWindow>.Context context, Iterable<StationLog> elements, Collector<String> out) throws Exception {
//Sum the call duration of the 3 most recent records across all stations
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
out.collect("全局窗口触发,近3条通话总时长:" + sumCallTime);
}
}).print();
env.execute();
}
}
//MyGlobalCountTrigger: fires the window once every 3 records
class MyGlobalCountTrigger extends Trigger<StationLog, GlobalWindow> {
//ValueStateDescriptor used to store the record counter
private ValueStateDescriptor<Long> eventCountDescriptor = new ValueStateDescriptor<Long>("event-count", Long.class);
//Called once for every incoming record
@Override
public TriggerResult onElement(StationLog element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
//Get the counter state
ValueState<Long> eventState = ctx.getPartitionedState(eventCountDescriptor);
//Increment the counter for each record; the state is initially null, so the first record yields 1
Long count = eventState.value() == null ? 1L : eventState.value() + 1L;
//Store the updated counter in state
eventState.update(count);
//When the counter reaches 3, clear it and fire the window
if (count == 3L) {
//Clear the counter state
eventState.clear();
//Fire the window and purge its contents
return TriggerResult.FIRE_AND_PURGE;
}
//The counter has not reached 3 yet: do not fire
return TriggerResult.CONTINUE;
}
//Called when a processing-time timer registered in onElement fires; no timer is registered here, so nothing happens
@Override
public TriggerResult onProcessingTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
//Called when an event-time timer registered in onElement fires; no timer is registered here, so nothing happens
@Override
public TriggerResult onEventTime(long time, GlobalWindow window, TriggerContext ctx) throws Exception {
return TriggerResult.CONTINUE;
}
//clear() holds the cleanup logic that runs when the corresponding window is removed
@Override
public void clear(GlobalWindow window, TriggerContext ctx) throws Exception {
}
}
- Scala code:
/**
* Flink GlobalWindow test based on EventTime
* Case: read base-station log data and use a manually specified trigger to fire a computation for every 3 records across all stations.
*/
object GlobalWindowWithoutKeyTest {
def main(args: Array[String]): Unit = {
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//Import implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 002,181,183,busy,5000,50
* 001,181,182,busy,7000,10
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//Assign watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//Timestamp assigner that extracts the event time
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
)
//No keyBy: all records share one global window, which fires once every 3 records
dsWithWatermark
.windowAll(GlobalWindows.create())
.trigger(new MyGlobalCountTrigger())
.process(new ProcessAllWindowFunction[StationLog, String, GlobalWindow] {
override def process(context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//Sum the call duration of the 3 most recent records across all stations
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("全局窗口触发,近3条通话总时长:" + sumCallTime)
}
}).print()
env.execute()
}
}
//MyGlobalCountTrigger: fires the window once every 3 records
class MyGlobalCountTrigger extends Trigger[StationLog, GlobalWindow] {
//ValueStateDescriptor used to store the record counter
private val eventCountDescriptor = new ValueStateDescriptor[Long]("event-count", classOf[Long])
//Called once for every incoming record
override def onElement(element: StationLog,
timestamp: Long,
window: GlobalWindow,
ctx: Trigger.TriggerContext): TriggerResult = {
//Get the counter state
val eventState = ctx.getPartitionedState(eventCountDescriptor)
//Increment the counter for each record; an empty state counts as 0, so the first record yields 1
val count = Option(eventState.value()).getOrElse(0L) + 1L
//Store the updated counter in state
eventState.update(count)
//When the counter reaches 3, clear it and fire the window
if (count == 3L) {
//Clear the counter state
eventState.clear()
//Fire the window and purge its contents
TriggerResult.FIRE_AND_PURGE
} else {
//The counter has not reached 3 yet: do not fire
TriggerResult.CONTINUE
}
}
//Called when a processing-time timer registered in onElement fires; no timer is registered here, so nothing happens
override def onProcessingTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
//Called when an event-time timer registered in onElement fires; no timer is registered here, so nothing happens
override def onEventTime(time: Long, window: GlobalWindow, ctx: Trigger.TriggerContext): TriggerResult = {
TriggerResult.CONTINUE
}
//clear() holds the cleanup logic that runs when the corresponding window is removed
override def clear(window: GlobalWindow, ctx: Trigger.TriggerContext): Unit = {
ctx.getPartitionedState(eventCountDescriptor).clear()
}
}
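The global window above covers all records and does not group them by key. After writing and running the code, enter the following data into the socket:
#Data entered into the socket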
001,181,182,busy,1000,10
002,182,183,fail,3000,20
#Entering this record fires the global window
001,183,184,busy,2000,30
002,184,185,busy,6000,40
002,181,183,busy,5000,50
#Entering this record fires the global window
001,181,183,busy,7000,50
The final output is:
全局窗口触发,近3条通话总时长:60
全局窗口触发,近3条通话总时长:140
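The two results above depend on the trigger returning FIRE_AND_PURGE rather than FIRE. As a rough illustration, the plain-Java sketch below (not Flink code; the class FireVersusPurgeSketch and its purge flag are hypothetical names) models the documented difference: FIRE evaluates the window and keeps its contents, while FIRE_AND_PURGE evaluates the window and then discards them. Because a global window never ends by itself, skipping the purge would make every later firing include all earlier records.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical simulation of one non-keyed global window that fires every `threshold` records.
class FireVersusPurgeSketch {

    // purge = true models FIRE_AND_PURGE, purge = false models FIRE.
    static List<Long> run(long[] durations, int threshold, boolean purge) {
        List<Long> buffer = new ArrayList<>();  // stands in for the window contents
        List<Long> results = new ArrayList<>(); // one sum per firing
        int count = 0;                          // stands in for the trigger's counter state
        for (long d : durations) {
            buffer.add(d);
            count++;
            if (count == threshold) {
                long sum = 0L;
                for (long x : buffer) {
                    sum += x;
                }
                results.add(sum);
                count = 0;          // the counter is reset in both cases
                if (purge) {
                    buffer.clear(); // only FIRE_AND_PURGE drops the buffered records
                }
            }
        }
        return results;
    }

    public static void main(String[] args) {
        long[] durations = {10, 20, 30, 40, 50, 50};
        System.out.println(run(durations, 3, true));  // prints [60, 140]
        System.out.println(run(durations, 3, false)); // prints [60, 200]
    }
}
```

With purging, the sums match the output of this case (60 and 140). Without purging, the second firing would report 200, the sum of all six records.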
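Count Window
A count window (CountWindow) defines the window size by a fixed number of events and is unrelated to time: the window fires once the specified number of events has arrived. Count windows are implemented on top of the global window (Global Window); the Global Window cases in the previous section are in effect CountWindow implementations.
In Flink code, a Count Window is specified as follows:
#For KeyedStream
keyedStream.countWindow(3)
or
keyedStream.countWindow(5,2)
#For Non-KeyedStream
dataStream.countWindowAll(3)
or
dataStream.countWindowAll(5,2)
As with other windows, Flink supports CountWindow on both KeyedStream and Non-KeyedStream. Calling countWindow(size)/countWindowAll(size) on the stream with the number of events per window creates a "tumbling count window", which fires each time size events have arrived. CountWindow also supports a "sliding count window" through countWindow(size,slide)/countWindowAll(size,slide), where size is the window size and slide is the slide step: every slide events, a window over the most recent size events is evaluated.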
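To make the size and slide parameters concrete, here is a small plain-Java sketch of the semantics just described. It is not Flink code: the class CountWindowSketch and its windows method are hypothetical, and the sketch only assumes that a window is evaluated every slide events over at most the last size events. Passing the same value for size and slide reproduces the tumbling case.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical simulation of count-window contents for a single stream (or a single key).
class CountWindowSketch {

    // Returns the contents of every evaluated window, in firing order.
    static List<List<Integer>> windows(int[] events, int size, int slide) {
        Deque<Integer> buffer = new ArrayDeque<>(); // keeps at most the last `size` events
        List<List<Integer>> fired = new ArrayList<>();
        int sinceLastFire = 0;
        for (int e : events) {
            buffer.addLast(e);
            if (buffer.size() > size) {
                buffer.removeFirst(); // drop the oldest event beyond the window size
            }
            sinceLastFire++;
            if (sinceLastFire == slide) {
                fired.add(new ArrayList<>(buffer)); // snapshot of the current window
                sinceLastFire = 0;
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        int[] events = {1, 2, 3, 4, 5, 6, 7};
        // Tumbling: size == slide
        System.out.println(windows(events, 3, 3)); // prints [[1, 2, 3], [4, 5, 6]]
        // Sliding: every 2 events, look at the last 5
        System.out.println(windows(events, 5, 2)); // prints [[1, 2], [1, 2, 3, 4], [2, 3, 4, 5, 6]]
    }
}
```

In the sliding case the first windows hold fewer than size events, because fewer than size events have arrived when they are evaluated.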
The following cases demonstrate how to write and test Count Window code on KeyedStream and Non-KeyedStream real-time data streams.
KeyedStream
Case: read base-station log data and fire a computation for every 5 records per station ID.
- Java code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 001,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 002,181,183,busy,12000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
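//Assign watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
);
//Key by station ID; the call duration is summed for every 5 records per station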
KeyedStream<StationLog, String> keyedStream = dsWithWatermark.keyBy(new KeySelector<StationLog, String>() {
@Override
public String getKey(StationLog stationLog) throws Exception {
return stationLog.sid;
}
});
keyedStream.countWindow(5)
//Custom window function: sum the call duration of the records in each station's window
.process(new ProcessWindowFunction<StationLog, String, String, GlobalWindow>() {
@Override
public void process(String key,
ProcessWindowFunction<StationLog, String, String, GlobalWindow>.Context context,
Iterable<StationLog> elements,
Collector<String> out) throws Exception {
//Sum the call duration of the 5 most recent records of this station
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
out.collect("基站:" + key + ",近5条通话总时长:" + sumCallTime);
}
}).print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//Import implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 001,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 002,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//Assign watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//Timestamp assigner that extracts the event time
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
)
//Key by station ID; the call duration is summed for every 5 records per station
dsWithWatermark
.keyBy(_.sid)
.countWindow(5)
.process(new ProcessWindowFunction[StationLog, String, String, GlobalWindow] {
override def process(key: String, context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//Sum the call duration of the 5 most recent records of this station
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("基站:" + key + ",近5条通话总时长:" + sumCallTime)
}
}).print()
env.execute()
After writing and running the Flink code above, enter the following data into the socket:
#Data entered into the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
001,181,183,busy,5000,50
001,181,182,busy,7000,10
002,182,183,fail,9000,20
#Entering this record brings SID 001 to 5 records and fires its window
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#Entering this record brings SID 002 to 5 records and fires its window
002,181,183,busy,12000,50
The final output is:
基站:001,近5条通话总时长:130
基站:002,近5条通话总时长:170
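Besides the tumbling count window above, a sliding count window can also be used by passing the slide step to the countWindow method, as in the following case.
Case: read base-station log data and, for every 2 records received per station ID, compute the total call duration of that station's most recent 5 records.
The Java and Scala code is largely the same as the tumbling count window code above; only the following line needs to change:
...
keyedStream.countWindow(5,2)
...
After this change, enter the corresponding data: for every 2 records of a key, a window over the most recent 5 records of that key is evaluated and its result is output. Until a key has received 5 records, its window contains only the records seen so far.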
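Since no output is listed for the sliding case, the following plain-Java sketch estimates what countWindow(5,2) would report for the ten sample records used in the tumbling case. It is a simulation under stated assumptions, not Flink code: the class KeyedSlidingCountSketch is hypothetical, and it assumes each key is evaluated every 2 records over at most its last 5 records, with station ID in field 0 and call duration in field 5.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical per-key simulation of a sliding count window.
class KeyedSlidingCountSketch {

    // Returns one "sid=sum" entry per firing, in firing order.
    static List<String> run(List<String> lines, int size, int slide) {
        Map<String, Deque<Long>> buffers = new HashMap<>();     // last `size` durations per key
        Map<String, Integer> sinceLastFire = new HashMap<>();   // records seen per key since its last firing
        List<String> fired = new ArrayList<>();
        for (String line : lines) {
            String[] arr = line.split(",");
            String sid = arr[0].trim();
            long duration = Long.parseLong(arr[5].trim());
            Deque<Long> buffer = buffers.computeIfAbsent(sid, k -> new ArrayDeque<>());
            buffer.addLast(duration);
            if (buffer.size() > size) {
                buffer.removeFirst(); // keep only the most recent `size` records
            }
            int n = sinceLastFire.merge(sid, 1, Integer::sum);
            if (n == slide) {
                long sum = 0L;
                for (long d : buffer) {
                    sum += d;
                }
                fired.add(sid + "=" + sum);
                sinceLastFire.put(sid, 0); // contents are kept, only the slide counter restarts
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "001,181,182,busy,1000,10",
                "002,182,183,fail,3000,20",
                "001,183,184,busy,2000,30",
                "002,184,185,busy,6000,40",
                "001,181,183,busy,5000,50",
                "001,181,182,busy,7000,10",
                "002,182,183,fail,9000,20",
                "001,183,184,busy,11000,30",
                "002,184,185,busy,6000,40",
                "002,181,183,busy,12000,50");
        System.out.println(run(input, 5, 2)); // prints [001=40, 002=60, 001=100, 002=120]
    }
}
```

Each key fires after its 2nd and 4th record; the 5th record of each key does not fire a window on its own because only one record has arrived since the previous firing.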
Non-KeyedStream
Case: read base-station log data and fire a computation for every 5 records across all station IDs.
- Java code:
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 001,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 002,181,183,busy,12000,50
*/
DataStreamSource<String> sourceDS = env.socketTextStream("node5", 9999);
//Convert each line into a StationLog object
SingleOutputStreamOperator<StationLog> stationLogDS = sourceDS.map(new MapFunction<String, StationLog>() {
@Override
public StationLog map(String s) throws Exception {
String[] arr = s.split(",");
return new StationLog(arr[0].trim(),
arr[1].trim(),
arr[2].trim(),
arr[3].trim(),
Long.valueOf(arr[4]),
Long.valueOf(arr[5]));
}
});
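//Assign watermarks
SingleOutputStreamOperator<StationLog> dsWithWatermark = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.<StationLog>forBoundedOutOfOrderness(Duration.ofSeconds(2))
//Use callTime as the event timestamp
.withTimestampAssigner((stationLog, timestamp) -> stationLog.callTime)
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
);
dsWithWatermark.countWindowAll(5)
//Custom window function: sum the call duration of the records in the window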
.process(new ProcessAllWindowFunction<StationLog, String, GlobalWindow>() {
@Override
public void process(ProcessAllWindowFunction<StationLog, String, GlobalWindow>.Context context, Iterable<StationLog> elements, Collector<String> out) throws Exception {
//Sum the call duration of the 5 most recent records across all stations
long sumCallTime = 0L;
for (StationLog element : elements) {
sumCallTime += element.duration;
}
out.collect("所有基站,近5条通话总时长:" + sumCallTime);
}
}).print();
env.execute();
- Scala code:
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
//Import implicit conversions
import org.apache.flink.streaming.api.scala._
/**
* Sample data entered into the socket:
* 001,181,182,busy,1000,10
* 002,182,183,fail,3000,20
* 001,183,184,busy,2000,30
* 002,184,185,busy,6000,40
* 001,181,183,busy,5000,50
* 001,181,182,busy,7000,10
* 002,182,183,fail,9000,20
* 001,183,184,busy,11000,30
* 002,184,185,busy,6000,40
* 002,181,183,busy,12000,50
*/
val sourceDS: DataStream[String] = env.socketTextStream("node5", 9999)
//Convert each line into a StationLog object
val stationLogDS: DataStream[StationLog] = sourceDS.map(line => {
val arr = line.split(",")
StationLog(arr(0), arr(1), arr(2), arr(3), arr(4).toLong, arr(5).toLong)
})
//Assign watermarks to stationLogDS
val dsWithWatermark: DataStream[StationLog] = stationLogDS.assignTimestampsAndWatermarks(
//Watermark strategy with a maximum out-of-orderness of 2s
WatermarkStrategy.forBoundedOutOfOrderness[StationLog](Duration.ofSeconds(2))
//Timestamp assigner that extracts the event time
.withTimestampAssigner(new SerializableTimestampAssigner[StationLog] {
override def extractTimestamp(element: StationLog, recordTimestamp: Long): Long = {
element.callTime
}
})
//Idleness timeout: a parallel subtask without data for 5s is marked idle so the watermark can still advance
.withIdleness(Duration.ofSeconds(5))
)
//No keyBy: the call duration is summed for every 5 records across all stations
dsWithWatermark
.countWindowAll(5)
.process(new ProcessAllWindowFunction[StationLog, String, GlobalWindow] {
override def process(context: Context, elements: Iterable[StationLog], out: Collector[String]): Unit = {
//Sum the call duration of the 5 most recent records across all stations
var sumCallTime = 0L
for (elem <- elements) {
sumCallTime += elem.duration
}
out.collect("所有基站,近5条通话总时长:" + sumCallTime)
}
}).print()
env.execute()
After writing and running the Flink code above, enter the following data into the socket:
#Data entered into the socket
001,181,182,busy,1000,10
002,182,183,fail,3000,20
001,183,184,busy,2000,30
002,184,185,busy,6000,40
#Entering this record brings the event count to 5 and fires the window
001,181,183,busy,5000,50
001,181,182,busy,7000,10
002,182,183,fail,9000,20
001,183,184,busy,11000,30
002,184,185,busy,6000,40
#Entering this record brings the event count to 5 again and fires the window
002,181,183,busy,12000,50
The final output is:
所有基站,近5条通话总时长:150
所有基站,近5条通话总时长:150