Top-N URL Page-View Statistics with Flink

package processfunction

import org.apache.flink.api.common.functions.AggregateFunction
import org.apache.flink.api.common.state.{ListState, ListStateDescriptor}
import org.apache.flink.configuration.Configuration
import org.apache.flink.streaming.api.functions.KeyedProcessFunction
import org.apache.flink.streaming.api.scala._
import org.apache.flink.streaming.api.scala.function.ProcessWindowFunction
import org.apache.flink.streaming.api.windowing.assigners.SlidingEventTimeWindows
import org.apache.flink.streaming.api.windowing.time.Time
import org.apache.flink.streaming.api.windowing.windows.TimeWindow
import org.apache.flink.util.Collector
import source.{ClickSource, Event}

import scala.collection.convert.ImplicitConversions.`iterable AsScalaIterable`

case class UrlViewCount(url:String,count:Long,windowStart:Long,windowEnd:Long)
/**
 *
 * @PROJECT_NAME: flink1.13
 * @PACKAGE_NAME: processfunction
 * @author: 赵嘉盟-HONOR
 * @date: 2023-11-24 21:55
 * @DESCRIPTION
 *
 */
object TopNKeyedProcessFunction {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    env.setParallelism(1)

    val data = env.addSource(new ClickSource).assignAscendingTimestamps(_.timestamp)

    val urlCountStream = data.keyBy(_.url)
      .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
      .aggregate(new UrlViewCountAgg, new UrlViewCountResult)

    urlCountStream.keyBy(_.windowEnd).process(new TopN(5)).print()

    env.execute("TopNDemo2")
  }
  class UrlViewCountAgg extends AggregateFunction[source.Event,Long,Long] {
    override def createAccumulator(): Long = 0L
    override def add(in: Event, acc: Long): Long = acc+1
    override def getResult(acc: Long): Long = acc
    override def merge(acc: Long, acc1: Long): Long = acc + acc1
  }
  class UrlViewCountResult extends ProcessWindowFunction[Long,UrlViewCount,String,TimeWindow] {
    override def process(key: String, context: Context, elements: Iterable[Long], out: Collector[UrlViewCount]): Unit = {
      out.collect(UrlViewCount(
        key,elements.iterator.next(),context.window.getStart,context.window.getEnd
      ))
    }
  }
  class TopN(topN:Int) extends KeyedProcessFunction[Long,UrlViewCount,String] {
    var urlViewCountListState:ListState[UrlViewCount]=_
    override def open(parameters: Configuration): Unit = {
      urlViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[UrlViewCount]("list-state", classOf[UrlViewCount]))
    }
    override def processElement(i: UrlViewCount, context: KeyedProcessFunction[Long, UrlViewCount, String]#Context, collector: Collector[String]): Unit = {
      urlViewCountListState.add(i)
      context.timerService().registerEventTimeTimer(i.windowEnd+1)
    }
    override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, UrlViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
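      // Sort the buffered counts of this window in descending order and keep the top N.
      val topNList = urlViewCountListState.get().toList.sortBy(-_.count).take(topN)
      // The state for this windowEnd key is no longer needed; clear it so it does not accumulate.
      urlViewCountListState.clear()
      val builder = new StringBuilder()
      // The timer fires at windowEnd + 1; the window start assumes the 10-second window size used in main.
      builder.append(s"======== Window: ${timestamp-1-10000} ~ ${timestamp-1} ======= \n")
      for (i <- topNList.indices){
        val urlViewCount = topNList(i)
        builder.append(
          s"Top ${i+1} " +
          s"url: ${urlViewCount.url} " +
          s"views: ${urlViewCount.count} \n")
      }
      out.collect(builder.toString())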
    }
  }
}

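This code shows how to implement Top-N statistics with Apache Flink: for each time window, it finds the N URLs with the highest page-view counts. A detailed walkthrough of the code and some background follow.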

Code walkthrough

1. Environment setup

val env = StreamExecutionEnvironment.getExecutionEnvironment
env.setParallelism(1)

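StreamExecutionEnvironment.getExecutionEnvironment: obtains the stream execution environment.
env.setParallelism(1): sets the parallelism to 1, which makes debugging and reading the printed output easier.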

2. Data source and timestamp assignment

val data = env.addSource(new ClickSource).assignAscendingTimestamps(_.timestamp)

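addSource(new ClickSource): reads data from the custom source ClickSource.
assignAscendingTimestamps(_.timestamp): uses each event's timestamp field as its event time and generates watermarks on the assumption that timestamps never decrease.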
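
To make the watermark behaviour concrete, here is a minimal Flink-free sketch. It assumes that, for ascending timestamps, the watermark trails the largest timestamp seen so far by 1 ms. The object and method names are hypothetical, and the sketch computes a value after every event, whereas Flink emits watermarks periodically.

```scala
// Flink-free sketch (hypothetical names): how a watermark advances for ascending timestamps.
object AscendingWatermarks {
  /** Watermark after each event: the largest timestamp seen so far, minus 1 ms. */
  def watermarks(timestamps: Seq[Long]): Seq[Long] =
    timestamps
      .scanLeft(Long.MinValue)((maxTs, ts) => math.max(maxTs, ts)) // running maximum
      .tail                                                         // drop the initial seed
      .map(_ - 1)

  def main(args: Array[String]): Unit =
    println(watermarks(Seq(1000L, 2000L, 2000L, 3500L)))
}
```

A window ending at time T can only fire once the watermark reaches T - 1, which is why the timestamps must not go backwards for this strategy to be safe.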

3. Windowed counting

val urlCountStream = data.keyBy(_.url)
  .window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5)))
  .aggregate(new UrlViewCountAgg, new UrlViewCountResult)

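keyBy(_.url): groups the stream by URL.
window(SlidingEventTimeWindows.of(Time.seconds(10), Time.seconds(5))): defines a sliding event-time window with a size of 10 seconds and a slide of 5 seconds, so every event falls into two overlapping windows.
aggregate(new UrlViewCountAgg, new UrlViewCountResult): counts the views of each URL incrementally with the aggregate function UrlViewCountAgg, then passes the result to the window function UrlViewCountResult, which attaches the window information.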
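
The overlap of sliding windows can be checked with a little arithmetic. The following is a plain-Scala sketch, not Flink's own code; it assumes non-negative timestamps and a zero window offset, and the names are hypothetical.

```scala
// Flink-free sketch (hypothetical names): which sliding windows contain a given timestamp.
object SlidingWindowSketch {
  /** Start times of all windows of `size` ms, advancing by `slide` ms, that contain `ts`. */
  def windowStarts(ts: Long, size: Long, slide: Long): List[Long] = {
    val lastStart = ts - (ts % slide) // latest window start that is <= ts
    Iterator.iterate(lastStart)(_ - slide)
      .takeWhile(_ > ts - size)       // a window [start, start + size) contains ts only if start > ts - size
      .toList
  }

  def main(args: Array[String]): Unit =
    println(windowStarts(12000L, 10000L, 5000L))
}
```

With a 10-second size and a 5-second slide, an event at 12 000 ms belongs to the windows starting at 10 000 ms and 5 000 ms, so it is counted twice, once per window.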

4. Aggregate function

class UrlViewCountAgg extends AggregateFunction[source.Event, Long, Long] {
  override def createAccumulator(): Long = 0L
  override def add(in: Event, acc: Long): Long = acc + 1
  override def getResult(acc: Long): Long = acc
  override def merge(acc: Long, acc1: Long): Long = acc + acc1
}

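createAccumulator: creates the initial accumulator (a count of 0).
add: increments the accumulator by one for every incoming event.
getResult: returns the accumulated count.
merge: combines two accumulators. In the DataStream window API it is invoked when windows merge, as with session windows, so this sliding-window job never calls it.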
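
The contract of the four methods can be exercised without a Flink cluster. The sketch below uses a hypothetical trait, SimpleAggregate, that only mirrors the method names of AggregateFunction; it is an illustration, not the Flink interface itself.

```scala
// Flink-free sketch: SimpleAggregate is a hypothetical stand-in for AggregateFunction.
object CountAggSketch {
  trait SimpleAggregate[IN, ACC, OUT] {
    def createAccumulator(): ACC
    def add(in: IN, acc: ACC): ACC
    def getResult(acc: ACC): OUT
    def merge(a: ACC, b: ACC): ACC
  }

  // Counts elements; merging two partial counts is simply their sum.
  object CountAgg extends SimpleAggregate[String, Long, Long] {
    def createAccumulator(): Long = 0L
    def add(in: String, acc: Long): Long = acc + 1
    def getResult(acc: Long): Long = acc
    def merge(a: Long, b: Long): Long = a + b
  }

  /** Feeds the events one by one, the way an incremental aggregation would. */
  def accumulate(events: Seq[String]): Long =
    events.foldLeft(CountAgg.createAccumulator())((acc, e) => CountAgg.add(e, acc))

  def main(args: Array[String]): Unit =
    println(CountAgg.getResult(accumulate(Seq("./home", "./cart", "./home"))))
}
```

Only one Long is kept per key and window, regardless of how many events arrive. Merging the accumulators of two halves of the input gives the same count as accumulating the whole input, which is the property a merge implementation has to preserve.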

5. Window function

class UrlViewCountResult extends ProcessWindowFunction[Long, UrlViewCount, String, TimeWindow] {
  override def process(key: String, context: Context, elements: Iterable[Long], out: Collector[UrlViewCount]): Unit = {
    out.collect(UrlViewCount(key, elements.iterator.next(), context.window.getStart, context.window.getEnd))
  }
}

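process: wraps the aggregated result in a UrlViewCount object containing the URL, the view count, the window start time, and the window end time. Because the counts were pre-aggregated, elements holds exactly one value.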

6. Top-N statistics

urlCountStream.keyBy(_.windowEnd).process(new TopN(5)).print()

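keyBy(_.windowEnd): groups by window end time, so all URL counts that belong to the same window share one key.
process(new TopN(5)): uses the TopN function to pick the 5 most-viewed URLs of each window.
print: prints the result.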
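
The same group-then-rank logic can be tried on an in-memory collection. This is a plain-Scala sketch with made-up sample data and hypothetical names (UrlCount, topNPerWindow); it ignores streaming concerns such as when a window is complete.

```scala
// Flink-free sketch (hypothetical names and sample data): top N URLs per window end.
object TopNSketch {
  final case class UrlCount(url: String, count: Long, windowEnd: Long)

  /** Groups counts by window end, sorts each group by descending count, keeps the first n URLs. */
  def topNPerWindow(counts: Seq[UrlCount], n: Int): Map[Long, List[String]] =
    counts.groupBy(_.windowEnd).map { case (end, rows) =>
      end -> rows.sortBy(-_.count).take(n).map(_.url).toList
    }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      UrlCount("./home", 3, 10000L),
      UrlCount("./cart", 5, 10000L),
      UrlCount("./fav", 1, 10000L),
      UrlCount("./home", 2, 15000L))
    println(topNPerWindow(sample, 2))
  }
}
```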

7. The TopN function

class TopN(topN: Int) extends KeyedProcessFunction[Long, UrlViewCount, String] {
  var urlViewCountListState: ListState[UrlViewCount] = _
  override def open(parameters: Configuration): Unit = {
    urlViewCountListState = getRuntimeContext.getListState(new ListStateDescriptor[UrlViewCount]("list-state", classOf[UrlViewCount]))
  }
  override def processElement(i: UrlViewCount, context: KeyedProcessFunction[Long, UrlViewCount, String]#Context, collector: Collector[String]): Unit = {
    urlViewCountListState.add(i)
    context.timerService().registerEventTimeTimer(i.windowEnd + 1)
  }
  override def onTimer(timestamp: Long, ctx: KeyedProcessFunction[Long, UrlViewCount, String]#OnTimerContext, out: Collector[String]): Unit = {
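    // Sort the buffered counts of this window in descending order and keep the top N.
    val topNList = urlViewCountListState.get().toList.sortBy(-_.count).take(topN)
    // The state for this windowEnd key is no longer needed; clear it so it does not accumulate.
    urlViewCountListState.clear()
    val builder = new StringBuilder()
    // The timer fires at windowEnd + 1; the window start assumes the 10-second window size used in main.
    builder.append(s"======== Window: ${timestamp - 1 - 10000} ~ ${timestamp - 1} ======= \n")
    for (i <- topNList.indices) {
      val urlViewCount = topNList(i)
      builder.append(
        s"Top ${i + 1} " +
        s"url: ${urlViewCount.url} " +
        s"views: ${urlViewCount.count} \n")
    }
    out.collect(builder.toString())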
  }
}

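open: initializes the ListState that buffers the URL view counts of each window.
processElement: appends each URL's view count to the state and registers an event-time timer at windowEnd + 1. Registering the same timestamp repeatedly for one key results in a single timer.
onTimer: when the timer fires, reads the buffered counts from the state, sorts them in descending order, takes the first N, and emits the result as a formatted string. The window start shown in the header is derived from a hard-coded window size of 10 000 ms.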
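
The interplay of buffering, timers, and state cleanup can be simulated in a few lines. The sketch below is a simplified, Flink-free model under stated assumptions: a mutable map stands in for keyed state, advancing a watermark by hand stands in for the timer service, and all names are hypothetical.

```scala
import scala.collection.mutable

// Flink-free sketch (hypothetical names): buffer rows per window, emit once the watermark passes windowEnd + 1.
object TimerSketch {
  final case class UrlCount(url: String, count: Long, windowEnd: Long)

  class TopNBuffer(n: Int) {
    // Stand-in for keyed ListState: windowEnd -> rows buffered for that window.
    private val buffered = mutable.Map.empty[Long, List[UrlCount]]

    def add(row: UrlCount): Unit =
      buffered(row.windowEnd) = row :: buffered.getOrElse(row.windowEnd, Nil)

    /** Fires every pending "timer" (windowEnd + 1) that is <= watermark, in window order. */
    def advanceWatermark(watermark: Long): List[(Long, List[String])] = {
      val due = buffered.keys.filter(_ + 1 <= watermark).toList.sorted
      due.map { end =>
        val rows = buffered.remove(end).get // removing the entry models clearing the state
        end -> rows.sortBy(-_.count).take(n).map(_.url)
      }
    }
  }
}
```

Once a window has been emitted, its buffered rows are dropped, so a later watermark does not emit it again and the buffer does not grow with the number of windows processed.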

8. Job execution

env.execute("TopNDemo2")

Submits and starts the Flink job.

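Background

1. Window types

Sliding window: fixed size and fixed slide; when the slide is smaller than the size, consecutive windows overlap.
Tumbling window: fixed size with a slide equal to the size, so windows do not overlap.
Session window: windows are delimited dynamically by gaps of inactivity between events.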
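
Session windows are the least intuitive of the three, so here is a small plain-Scala sketch of the idea. It assumes a new session starts when the distance to the previous event exceeds the gap; the exact boundary handling is an assumption of this sketch, and the names are hypothetical.

```scala
// Flink-free sketch (hypothetical names): split event timestamps into sessions by an inactivity gap.
object SessionSketch {
  /** Events whose distance to the previous event is at most `gap` ms stay in the same session. */
  def sessions(timestamps: Seq[Long], gap: Long): List[List[Long]] =
    timestamps.sorted.foldLeft(List.empty[List[Long]]) {
      // Close enough to the most recent event: extend the current session.
      case (current :: done, ts) if ts - current.head <= gap => (ts :: current) :: done
      // Otherwise (or for the very first event): open a new session.
      case (acc, ts) => List(ts) :: acc
    }.map(_.reverse).reverse

  def main(args: Array[String]): Unit =
    println(sessions(Seq(1000L, 2000L, 9000L, 9500L, 20000L), 3000L))
}
```

Unlike tumbling and sliding windows, the number and length of sessions depend entirely on the data, which is why session windows need to merge as new events arrive.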

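2. Aggregate functions and window functions

AggregateFunction: performs incremental aggregation, keeping only an accumulator per window, which makes it efficient.
ProcessWindowFunction: processes the full contents of a window and has access to window metadata such as start and end times, which suits more complex computations. When combined with an AggregateFunction, it receives only the pre-aggregated result.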

3. State management

ListState: stores a list of values.
ValueState: stores a single value.
MapState: stores key-value pairs.

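4. Timers

Event-time timer: fires when the watermark reaches the timer's timestamp.
Processing-time timer: fires when the machine's system clock reaches the timer's timestamp.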

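5. Time semantics in Flink

Event time: the time at which the event actually occurred.
Processing time: the system time at which the event is processed.
Ingestion time: the time at which the event enters the Flink system.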

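6. Top-N statistics

Use cases: finding the most-visited URLs, the most active users, and so on.
Implementation: buffer per-window results in keyed state and emit the ranking from a timer.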

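7. Fault tolerance in Flink

Checkpoint: a periodic, automatic snapshot of state used for failure recovery.
Savepoint: a manually triggered snapshot of state used for version upgrades or job migration.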
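
The example job in this article does not enable checkpointing. As a rough sketch of how it could be switched on, the fragment below uses the Scala DataStream API; the interval and pause values are arbitrary assumptions, and the tiny pipeline exists only so that there is something to execute.

```scala
import org.apache.flink.streaming.api.CheckpointingMode
import org.apache.flink.streaming.api.scala._

// Sketch only: the 10 s interval and 5 s minimum pause are illustrative values, not recommendations.
object CheckpointConfigSketch {
  def main(args: Array[String]): Unit = {
    val env = StreamExecutionEnvironment.getExecutionEnvironment
    // Take a checkpoint every 10 seconds with exactly-once state consistency.
    env.enableCheckpointing(10000L, CheckpointingMode.EXACTLY_ONCE)
    // Leave at least 5 seconds between the end of one checkpoint and the start of the next.
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(5000L)

    env.fromElements(1, 2, 3).print()
    env.execute("CheckpointConfigSketch")
  }
}
```

With checkpointing enabled, keyed state such as the ListState used by TopN is included in each snapshot and restored after a failure.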

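Further reading

Flink official documentation: https://flink.apache.org/docs/stable/
Window tutorial: https://ci.apache.org/projects/flink/flink-docs-stable/dev/stream/operators/windows.html
State management and fault tolerance: study Flink's state management and fault-tolerance mechanisms.

Working through this code shows how to combine window computation, state management, and Top-N ranking in Flink, and introduces core concepts such as time semantics, window types, and timers.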