5 Time and Window Operations in Flink

1 Time Semantics and Watermarks in Flink

1.1 Time semantics

Flink divides a stream into windows based on time, and it distinguishes three notions of time:

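event-time: the time at which the event was produced, i.e. the timestamp carried by the record itself.
ingestion-time: the time at which the event enters Flink, i.e. the system time at the moment the source operator handles it.
processing-time: the system time of the operator at the moment it processes the record. It is usually used together with windows, so in practice it means the system time when the window operator handles the record.
Note: since Flink 1.12 the default is event-time; before that it was processing-time.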

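1.2 Introduction to watermarks

A watermark is essentially a timestamp that is periodically inserted into the event stream. It delays the firing of event-time windows and is meant to absorb the slight lateness caused by network jitter and similar issues. It is computed as follows:
Watermark = largest event time seen so far - configured delay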

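Watermark propagation: Flink jobs normally run with a parallelism greater than one, so the parallel instances that feed a downstream operator can report different watermarks. The downstream operator takes the minimum of the watermarks received from all its inputs as its own watermark and then broadcasts it to the operators further downstream.
Watermark alignment: because the effective watermark is the minimum across all parallel instances, a large difference in speed between them hurts overall performance, since the fast instances run far ahead while the windows wait for the slowest one. Flink offers withWatermarkAlignment() on the watermark strategy, which throttles reading in the instances that are too far ahead so that they stay balanced.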
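The "minimum across inputs" rule can be sketched in a few lines of plain Java. This is only an illustration without any Flink dependency; the class and method names (`MinWatermarkSketch`, `combine`) are made up for this example and are not Flink APIs.

```java
import java.util.Arrays;

/** Plain-Java sketch of how a downstream task could combine the watermarks of its inputs. */
class MinWatermarkSketch {

    /** Hypothetical helper: the task's watermark is the smallest watermark across its input channels. */
    static long combine(long[] channelWatermarks) {
        return Arrays.stream(channelWatermarks).min().orElse(Long.MIN_VALUE);
    }

    public static void main(String[] args) {
        long[] channels = {7_000L, 4_000L, 9_000L};
        // The slowest input (4000) holds the task's watermark back.
        System.out.println(combine(channels));
        // Once the slowest input advances, the combined watermark advances too.
        channels[1] = 8_000L;
        System.out.println(combine(channels));
    }
}
```

The second call shows why one slow input stalls every event-time window downstream: the combined value only moves when the minimum moves.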

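1.3 Using watermarks

Flink ships two predefined watermark strategies: forMonotonousTimestamps for in-order streams and forBoundedOutOfOrderness for out-of-order streams. You can also define your own watermark generator. Their usage is described below.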

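forMonotonousTimestamps: for in-order streams. Before keyBy and the window operation, call assignTimestampsAndWatermarks on the DataStream and pass WatermarkStrategy.forMonotonousTimestamps().
forBoundedOutOfOrderness: for out-of-order streams. It is used the same way as forMonotonousTimestamps. Note that with a delay of 0 this strategy behaves the same as forMonotonousTimestamps.
Custom watermark: implement the WatermarkGenerator interface and its onEvent and onPeriodicEmit methods.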
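To make the onEvent / onPeriodicEmit split more concrete, here is a minimal plain-Java sketch of a bounded-out-of-orderness generator. It only mirrors the shape of the two callbacks; `BoundedDelayWatermarkSketch` is a hypothetical class, not Flink's WatermarkGenerator, and it follows the formula from section 1.2 literally. As far as I know, Flink's built-in generator additionally subtracts 1 ms, which is left out here.

```java
/** Plain-Java sketch of a bounded-out-of-orderness watermark generator (not Flink's class). */
class BoundedDelayWatermarkSketch {
    private final long maxDelayMs;
    private long maxTimestamp;

    BoundedDelayWatermarkSketch(long maxDelayMs) {
        this.maxDelayMs = maxDelayMs;
        // Start high enough that the subtraction in onPeriodicEmit cannot underflow.
        this.maxTimestamp = Long.MIN_VALUE + maxDelayMs;
    }

    /** Called for every event: only remember the largest event time seen so far. */
    void onEvent(long eventTimestamp) {
        maxTimestamp = Math.max(maxTimestamp, eventTimestamp);
    }

    /** Called periodically: watermark = largest event time - configured delay. */
    long onPeriodicEmit() {
        return maxTimestamp - maxDelayMs;
    }

    public static void main(String[] args) {
        BoundedDelayWatermarkSketch generator = new BoundedDelayWatermarkSketch(2_000L);
        for (long ts : new long[] {5_000L, 3_000L, 9_000L}) {
            generator.onEvent(ts); // the out-of-order 3000 does not move the watermark back
        }
        System.out.println(generator.onPeriodicEmit()); // 9000 - 2000 = 7000
    }
}
```

With a delay of 0 the emitted value simply follows the largest timestamp, which is why the zero-delay strategy behaves like the in-order one.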

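2 Windows and Their Types

By window type, Flink offers tumbling windows, sliding windows, session windows, count windows and global windows. Windows can be applied to both keyed streams (KeyedStream) and non-keyed streams.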

Tumbling window: keyedStream.window(TumblingEventTimeWindows.of(Time.seconds(5))) for a keyed stream, or dataStream.windowAll(TumblingEventTimeWindows.of(Time.seconds(5))) for a non-keyed stream. A new, non-overlapping window starts every 5 s.
Sliding window: keyedDs.window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))) creates windows that are 10 s long and slide forward every 5 s. For a non-keyed stream use windowAll() in the same way.
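Which window(s) a record lands in is plain arithmetic on its timestamp. The sketch below shows that arithmetic in plain Java for non-negative timestamps and no window offset; the class and method names are made up for illustration and are not Flink APIs.

```java
import java.util.ArrayList;
import java.util.List;

/** Plain-Java sketch of window assignment by timestamp (no offset, non-negative timestamps). */
class WindowAssignSketch {

    /** Start of the single tumbling window of the given size that contains the timestamp. */
    static long tumblingStart(long timestamp, long size) {
        return timestamp - (timestamp % size);
    }

    /** Starts of all sliding windows (size, slide) that contain the timestamp, newest first. */
    static List<Long> slidingStarts(long timestamp, long size, long slide) {
        List<Long> starts = new ArrayList<>();
        long lastStart = timestamp - (timestamp % slide);
        for (long start = lastStart; start > timestamp - size; start -= slide) {
            starts.add(start);
        }
        return starts;
    }

    public static void main(String[] args) {
        // A record at 12 s falls into the 5 s tumbling window [10000, 15000).
        System.out.println(tumblingStart(12_000L, 5_000L)); // 10000
        // With size 10 s and slide 5 s it belongs to [10000, 20000) and [5000, 15000).
        System.out.println(slidingStarts(12_000L, 10_000L, 5_000L)); // [10000, 5000]
    }
}
```

A record belongs to exactly one tumbling window but to size / slide sliding windows, which is why sliding windows cost more state for the same data.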

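Session windows:

keyedDs.window(EventTimeSessionWindows.withGap(Time.seconds(3))) closes the session and triggers the window computation once no data has arrived for 3 s.
With a custom extractor the gap can be chosen dynamically per element, so different data triggers windows after different gaps: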

keyedStream.window(EventTimeSessionWindows.withDynamicGap(
    new SessionWindowTimeGapExtractor<StationLog>() {
        @Override
        public long extract(StationLog element) {
            if ("001".equals(element.sid)) {
                // key 001: session gap of 3 s
                return 3000;
            } else if ("002".equals(element.sid)) {
                // key 002: session gap of 4 s
                return 4000;
            } else {
                // every other key: session gap of 5 s
                return 5000;
            }
        }
    }))
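The effect of a session gap can be illustrated without Flink: sort the timestamps of one key and start a new session whenever the pause between two neighbours exceeds the gap. This is a simplified sketch with a fixed gap; `SessionGapSketch` is a hypothetical name, and treating a pause of exactly the gap as "same session" is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch: group the event timestamps of one key into sessions by a fixed gap. */
class SessionGapSketch {

    /** A pause longer than gapMs between neighbouring events starts a new session. */
    static List<List<Long>> sessions(long[] timestamps, long gapMs) {
        long[] sorted = timestamps.clone();
        Arrays.sort(sorted);
        List<List<Long>> result = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        for (long ts : sorted) {
            if (!current.isEmpty() && ts - current.get(current.size() - 1) > gapMs) {
                result.add(current); // the previous session is closed
                current = new ArrayList<>();
            }
            current.add(ts);
        }
        if (!current.isEmpty()) {
            result.add(current);
        }
        return result;
    }

    public static void main(String[] args) {
        long[] events = {1_000L, 2_000L, 6_000L, 7_500L};
        // The 4 s pause between 2000 and 6000 exceeds the 3 s gap, so two sessions result.
        System.out.println(sessions(events, 3_000L)); // [[1000, 2000], [6000, 7500]]
    }
}
```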

Count windows:

// Tumbling count window: fires once every 3 events
keyedStream.countWindow(3)
// Sliding count window: fires every 2 events over the most recent 5 events
keyedStream.countWindow(5, 2)
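What the two variants emit for one key can be simulated in plain Java. The sketch below is only an approximation of the behaviour described above (fire every `slide` events over the last `size` events); `CountWindowSketch` and `fire` are hypothetical names, not Flink APIs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch of what count windows emit for the events of a single key. */
class CountWindowSketch {

    /** Fires after every `slide` events and hands over at most the last `size` events. */
    static List<List<Integer>> fire(List<Integer> events, int size, int slide) {
        List<List<Integer>> fired = new ArrayList<>();
        for (int seen = 1; seen <= events.size(); seen++) {
            if (seen % slide == 0) {
                fired.add(new ArrayList<>(events.subList(Math.max(0, seen - size), seen)));
            }
        }
        return fired;
    }

    public static void main(String[] args) {
        List<Integer> events = Arrays.asList(1, 2, 3, 4, 5, 6);
        // Tumbling, like countWindow(3): size and slide are equal.
        System.out.println(fire(events, 3, 3)); // [[1, 2, 3], [4, 5, 6]]
        // Sliding, like countWindow(5, 2): early firings contain fewer than 5 events.
        System.out.println(fire(events, 5, 2)); // [[1, 2], [1, 2, 3, 4], [2, 3, 4, 5, 6]]
    }
}
```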

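Global windows: a global window does not split the stream into separate windows; all elements of a key go into one window that never fires on its own, so you have to write a trigger to fire it. It can therefore be seen as a way to build custom windows. The core pieces are GlobalWindows.create() and a custom trigger.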

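// A trigger must be set ... ...
// The custom window trigger is the essential part
keyedDs.window(GlobalWindows.create()).trigger(new MyCountTrigger()).process(...)

class MyCountTrigger extends Trigger<StationLog, GlobalWindow> {
    // Called for every element; the core trigger logic goes here
    @Override
    public TriggerResult onElement(StationLog element, long timestamp, GlobalWindow window, TriggerContext ctx) throws Exception {
        ....
    }
    // onProcessingTime, onEventTime and clear must be overridden as well (see the trigger part of section 2.1)
}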

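2.1 Using the window API

The window API consists of the following building blocks: triggers, evictors, window aggregation functions, the allowed-lateness mechanism and side outputs. Only the window function is mandatory, whether the window is opened with window() on a keyed stream or with windowAll() on a non-keyed stream. Each building block is described below.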

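Trigger: lets you customize when a window fires. If you do not set one, Flink uses a default trigger (for example the event-time or processing-time trigger of a time window). A custom trigger looks like this: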

// Using a custom trigger
keyedStream.window(WindowAssigner...)
        // custom trigger
        .trigger(new MyTrigger())
        // window function logic
        .process(...)
// A custom trigger extends the abstract class Trigger
class MyTrigger extends Trigger<EventType, WindowType> {
    // Called for every element; the main trigger logic goes here
    @Override
    public TriggerResult onElement(EventType element, long timestamp, WindowType window, TriggerContext ctx) throws Exception {
        return ...;
    }
    // Called when a registered processing-time timer fires
    @Override
    public TriggerResult onProcessingTime(long time, WindowType window, TriggerContext ctx) throws Exception {
        return ...;
    }
    // Called when a registered event-time timer fires
    @Override
    public TriggerResult onEventTime(long time, WindowType window, TriggerContext ctx) throws Exception {
        return ...;
    }
    // Called when the window is removed
    @Override
    public void clear(WindowType window, TriggerContext ctx) throws Exception {
        ...
    }
}

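Evictor: as the name suggests, an evictor removes unwanted elements from a window according to custom logic. It runs after the trigger fires, either before or after the window function is applied. Flink provides built-in evictors, used as follows: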

dsWithWatermark
  .keyBy(_.sid)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  // set the eviction policy for the window
  .evictor(DeltaEvictor.of[StationLog,TimeWindow](5, new DeltaFunction[StationLog] {
    // Delta between two data points; elements whose delta reaches the threshold (5) are evicted
    override def getDelta(oldDataPoint: StationLog, newDataPoint: StationLog): Double = {
      Math.abs(newDataPoint.duration - oldDataPoint.duration).toDouble
    }
  }))
  .process(...)
  .print()

A custom evictor is used as follows:

// assumes: import java.{lang, util}
dsWithWatermark
      .keyBy(_.sid)
      .window(GlobalWindows.create())
      .trigger(new MyTimeTriggerCls())
      .evictor(new Evictor[StationLog, GlobalWindow] {
        // Runs after the trigger fires, before the window function is applied
        override def evictBefore(elements: lang.Iterable[TimestampedValue[StationLog]],
                                 size: Int,
                                 window: GlobalWindow,
                                 evictorContext: Evictor.EvictorContext): Unit = {
          val iter: util.Iterator[TimestampedValue[StationLog]] = elements.iterator
          // Remove every element whose callType is marked as "late data"
          while (iter.hasNext) {
            val next: TimestampedValue[StationLog] = iter.next
            if (next.getValue.callType == "late data") {
              println("Removed late data: " + next.getValue)
              // remove the element the iterator currently points at
              iter.remove()
            }
          }
        }
        // Runs after the window function has been applied
        override def evictAfter(elements: lang.Iterable[TimestampedValue[StationLog]],
                                size: Int,
                                window: GlobalWindow,
                                evictorContext: Evictor.EvictorContext): Unit = {

        }
      })
      .process(...)
      .print()

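Window aggregation functions: these fall into incremental aggregation and full-window aggregation. ReduceFunction and AggregateFunction are incremental, while ProcessWindowFunction is a full-window function. The main difference is that an incremental function applies the custom aggregation logic as each element arrives and keeps only the aggregated state, whereas a full-window function keeps every event of the window until it fires. An aggregate implementation is shown below; reduce is used the same way as the regular reduce operator, and process has already been covered several times.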

dsWithWatermark
  .keyBy(_.sid)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  .aggregate(new AggregateFunction[StationLog,(String,Long),String] {
    // create the accumulator: (station id, total duration)
    override def createAccumulator(): (String, Long) = ("", 0L)
    // add one element to the accumulator
    override def add(value: StationLog, accumulator: (String, Long)): (String, Long) =
      (value.sid, accumulator._2 + value.duration)
    // produce the result from the accumulator
    override def getResult(accumulator: (String, Long)): String =
      "Station: " + accumulator._1 + ", call duration: " + accumulator._2
    // merge two accumulators
    override def merge(a: (String, Long), b: (String, Long)): (String, Long) = (a._1, a._2 + b._2)
  })
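The difference in state size between the two styles is easy to see in a plain-Java sketch. The names below (`AggregationStateSketch`, `addIncremental`, `averageFull`) are hypothetical and only illustrate the idea; they are not the Flink interfaces.

```java
import java.util.Arrays;
import java.util.List;

/** Plain-Java sketch contrasting incremental and full-window aggregation state. */
class AggregationStateSketch {

    /** Incremental: the state is a fixed-size accumulator {sum, count}, updated per element. */
    static long[] addIncremental(long[] accumulator, long duration) {
        return new long[] {accumulator[0] + duration, accumulator[1] + 1};
    }

    /** Full-window: all elements are buffered and the result is computed when the window fires. */
    static double averageFull(List<Long> bufferedDurations) {
        long sum = 0;
        for (long duration : bufferedDurations) {
            sum += duration;
        }
        return bufferedDurations.isEmpty() ? 0.0 : (double) sum / bufferedDurations.size();
    }

    public static void main(String[] args) {
        List<Long> durations = Arrays.asList(10L, 20L, 33L);

        long[] accumulator = {0L, 0L};
        for (long duration : durations) {
            accumulator = addIncremental(accumulator, duration); // state never grows
        }
        System.out.println((double) accumulator[0] / accumulator[1]); // 21.0

        System.out.println(averageFull(durations)); // 21.0, but needed all three elements in state
    }
}
```

Both paths give the same average; the incremental one keeps two numbers regardless of how many events the window receives.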

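Combining incremental and full-window APIs: you can first aggregate the window incrementally and then pass the aggregated result to a ProcessWindowFunction for further processing. This takes advantage of the fact that process has access to the window context, the key and other details. An example: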

dsWithWatermark
  .keyBy(_.sid)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  .aggregate(new AggregateFunction[StationLog,(Long,Long),(Long,Long)] {
    // create the accumulator: (total duration, number of calls)
    override def createAccumulator(): (Long, Long) = (0L,0L)
    // add one element to the accumulator
    override def add(value: StationLog, accumulator: (Long, Long)): (Long, Long) =
      (accumulator._1 + value.duration,accumulator._2 + 1L)
    // produce the result from the accumulator
    override def getResult(accumulator: (Long, Long)): (Long, Long) = accumulator
    // merge two accumulators
    override def merge(a: (Long, Long), b: (Long, Long)): (Long, Long) = (a._1 + b._1,a._2 + b._2)
  },new ProcessWindowFunction[(Long,Long),String,String,TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[(Long, Long)], out: Collector[String]): Unit = {
      // context and key allow further, lower-level processing here
      // convert before dividing, otherwise the Long division truncates the average
      val avgDuration: Double = elements.head._1.toDouble / elements.head._2
      out.collect("Station ID: " + key + ", window: [" + context.window.getStart + "," + context.window.getEnd + "), average call duration: " + avgDuration)
    }
  }).print()

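Allowed lateness: the allowed-lateness mechanism handles late data, but it differs from watermarks. A watermark is essentially a timestamp, and it copes with late data by delaying the moment the window fires. Allowed lateness instead keeps the window alive for a while after it has fired; when a late record arrives within that period, it is added to the window and the window fires again to produce an updated result. The advantage is that the first result is still emitted promptly; the obvious drawback is that more intermediate state has to be kept. An example:
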
```
dsWithWatermark
  .keyBy(_.sid)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  // keep the window for 2 s after the watermark passes its end; late records arriving in that period re-fire it
  .allowedLateness(Time.seconds(2))
  .process(...)
  .print()
```

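Side output (sideOutputLateData): here the side output is used for severely late data, although side outputs are not limited to late data. As described above, even with watermarks and allowed lateness some records will still arrive too late. By default Flink drops them. If you do not want to lose them, you can capture them in a side output: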

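val lateOutputTag = new OutputTag[StationLog]("late-data")
// Group by station id and compute the total call duration per station in 5 s windows
val result: DataStream[String] = dsWithWatermark
  .keyBy(_.sid)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  // keep the window for 2 s after the watermark passes its end; late records arriving in that period re-fire it
  .allowedLateness(Time.seconds(2))
  // records that are later than that are collected in the side output
  .sideOutputLateData(lateOutputTag)
  .process(...)
// regular window results
result.print("window results")
// records from the side output
result.getSideOutput(lateOutputTag).print("late data")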
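How watermark, allowed lateness and side output divide the work can be summarised as a small decision function. This is a plain-Java sketch under my reading of the event-time behaviour (a window [start, end) fires when the watermark reaches end - 1 and is discarded once the watermark reaches end - 1 + allowed lateness); `LatenessSketch` and its enum are hypothetical names, not Flink types.

```java
/** Plain-Java sketch: what happens to a record arriving for a window, given the current watermark. */
class LatenessSketch {

    enum Outcome { BUFFERED, LATE_REFIRE, SIDE_OUTPUT }

    static Outcome classify(long windowEnd, long allowedLatenessMs, long currentWatermark) {
        long maxTimestamp = windowEnd - 1; // last timestamp that still belongs to [start, end)
        if (currentWatermark < maxTimestamp) {
            return Outcome.BUFFERED;       // window has not fired yet, record is simply added
        }
        if (currentWatermark < maxTimestamp + allowedLatenessMs) {
            return Outcome.LATE_REFIRE;    // window already fired but is still kept: fire again
        }
        return Outcome.SIDE_OUTPUT;        // window state is gone: drop, or send to the side output
    }

    public static void main(String[] args) {
        long windowEnd = 5_000L;
        long lateness = 2_000L;
        System.out.println(classify(windowEnd, lateness, 3_000L)); // BUFFERED
        System.out.println(classify(windowEnd, lateness, 5_500L)); // LATE_REFIRE
        System.out.println(classify(windowEnd, lateness, 7_500L)); // SIDE_OUTPUT
    }
}
```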

3 Joining Streams in Event Time

Flink offers several ways to combine streams: Union, Connect, Window Join, Interval Join and Window CoGroup. Each is described below.

Union: similar to UNION in SQL. It merges streams into one and requires them to have the same type. Example:

// merge two streams of the same type into one stream
adsWithWatermark.union(bdsWithWatermark)
  .keyBy(_.sid)
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  // process the merged stream
  .process(...).print()

Connect: similar to Union, but it can combine two streams of different types. Example:

// connect the two streams
adsWithWatermark.connect(bdsWithWatermark)
  .process(new CoProcessFunction[String, String, String]() {
    // each stream gets its own processing method
    // handles elements of the first stream
    override def processElement1(value: String,
                                 ctx: CoProcessFunction[String, String, String]#Context,
                                 out: Collector[String]): Unit = {
      out.collect("Stream A: " + value + ", current watermark: " + ctx.timerService().currentWatermark())
    }
    // handles elements of the second stream
    override def processElement2(value: String,
                                 ctx: CoProcessFunction[String, String, String]#Context,
                                 out: Collector[String]): Unit = {
      out.collect("Stream B: " + value + ", current watermark: " + ctx.timerService().currentWatermark())
    }
  }).print()

Window Join: a window-based join, similar to JOIN in SQL. Example:

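// Join the order stream with the payment stream inside a window
val result: DataStream[String] = orderDSWithWatermark.join(payDSWithWatermark)
  // select * from order a join pay b on a.key = b.key
  // Join condition: where extracts the key of the first stream (a.key), equalTo the key of the second (b.key)
  .where(value=>value.split(",")(0))
  .equalTo(value=>value.split(",")(0))
  // window within which elements can be joined
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  // handle the joined data
  .apply(new JoinFunction[String, String, String] {
    // called for every matching pair from the two streams
    override def join(orderInfo: String, payInfo: String): String =
      s"Order: $orderInfo - Payment: $payInfo"
  })
// print separately: print() returns a sink, not a DataStream[String]
result.print()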

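Interval Join: joins two streams within a relative time range. For example, a small gaming site shows ads and wants to know which ads a user clicked after logging in. There is a login stream and an ad-click stream. Assuming each ad is displayed for 10 s, the login stream can be interval-joined with the click stream using a lower bound of 0 and an upper bound of 10 s relative to the login time, so a click matches if login time <= click time <= login time + 10 s (both bounds are inclusive by default). Example: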

loginDSWithWatermark.keyBy(loginInfo => loginInfo.split(",")(0))
  .intervalJoin(clickDSWithWatermark.keyBy(clickInfo => clickInfo.split(",")(0)))
  // time range relative to the left ("main") stream
  .between(Time.seconds(0), Time.seconds(10))
  // function applied to every matching pair
  .process(new ProcessJoinFunction[String, String, String] {
    override def processElement(left: String,
                                right: String,
                                ctx: ProcessJoinFunction[String, String, String]#Context,
                                out: Collector[String]): Unit = {
      // extract the user id
      val userId: String = left.split(",")(0)
      out.collect(s"User $userId clicked ad: $right")
    }
  })
  .print()
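The matching rule behind between(lower, upper) is a simple range check on the two timestamps. The plain-Java sketch below spells it out with inclusive bounds; `IntervalJoinSketch` and `matches` are hypothetical names used only for illustration.

```java
/** Plain-Java sketch of the interval-join condition with inclusive bounds. */
class IntervalJoinSketch {

    /** True if the right record lies within [leftTs + lowerMs, leftTs + upperMs]. */
    static boolean matches(long leftTs, long rightTs, long lowerMs, long upperMs) {
        return rightTs >= leftTs + lowerMs && rightTs <= leftTs + upperMs;
    }

    public static void main(String[] args) {
        long login = 1_000L;
        // between(0 s, 10 s): clicks from the login time up to 10 s later match.
        System.out.println(matches(login, 1_000L, 0L, 10_000L));  // true  (lower bound inclusive)
        System.out.println(matches(login, 11_000L, 0L, 10_000L)); // true  (upper bound inclusive)
        System.out.println(matches(login, 11_001L, 0L, 10_000L)); // false (too late)
        System.out.println(matches(login, 999L, 0L, 10_000L));    // false (click before login)
    }
}
```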

Window CoGroup: a powerful and flexible operation that can implement SQL inner join as well as left/right/full outer join. It is used the same way as Window Join. Example:

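// Co-group the order stream with the payment stream inside a window
// assumes: import java.lang
val result: DataStream[String] = orderDSWithWatermark.coGroup(payDSWithWatermark)
  // the meaning of each step is the same as for Window Join
  // Join condition: where refers to the order stream, equalTo to the payment stream
  .where(value=>value.split(",")(0))
  .equalTo(value=>value.split(",")(0))
  // set the window
  .window(TumblingEventTimeWindows.of(Time.seconds(5)))
  // handle the grouped data
  .apply(new CoGroupFunction[String,String,String] {
    // first/second hold all elements of this key and window from each stream; either may be empty
    override def coGroup(first: lang.Iterable[String], second: lang.Iterable[String], out: Collector[String]): Unit = {
      out.collect(first+"====="+second)
    }
  })
// print separately: print() returns a sink, not a DataStream[String]
result.print()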
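Because coGroup hands over both groups even when one of them is empty, outer joins become a matter of how the empty side is handled. The plain-Java sketch below shows a left outer join over the two groups of one key and window; `CoGroupLeftJoinSketch` and the " - null" output format are assumptions of this example, not part of Flink.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Plain-Java sketch: left outer join over the two groups a coGroup call would receive. */
class CoGroupLeftJoinSketch {

    static List<String> leftJoin(List<String> first, List<String> second) {
        List<String> out = new ArrayList<>();
        for (String left : first) {
            if (second.isEmpty()) {
                out.add(left + " - null"); // keep left rows that have no match
            } else {
                for (String right : second) {
                    out.add(left + " - " + right);
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // An order with two payments in the same window yields two joined rows.
        System.out.println(leftJoin(Arrays.asList("o1"), Arrays.asList("p1", "p2"))); // [o1 - p1, o1 - p2]
        // An order without a payment is still emitted, which an inner join would not do.
        System.out.println(leftJoin(Arrays.asList("o2"), Collections.<String>emptyList())); // [o2 - null]
    }
}
```

Swapping the roles of the two lists gives a right outer join, and handling both empty cases gives a full outer join.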