Flink Notes

1. Introduction

Apache Flink is an open-source distributed computing framework for processing both batch and streaming data. Typical use cases include:

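  • Real-time data processing: Flink can process data as it arrives, for example from sensors, logs, or social media.

  • Data analytics: Flink can analyze large-scale data, for example for machine learning or statistical analysis.

  • Data pipelines: Flink can be used to build data pipelines that move data from one system to another.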

2. Basic Concepts

This section introduces Flink's core concepts, such as DataStream, DataSet, Transformation, and Window.

2.1 Architecture

Source: Flink Architecture | Apache Flink (official documentation)

2.1.1 The Client

The Client is not part of the runtime and program execution, but is used to prepare and send a dataflow to the JobManager. After that, the client can disconnect (detached mode), or stay connected to receive progress reports (attached mode). The client runs either as part of the Java/Scala program that triggers the execution, or in the command line process ./bin/flink run ....

2.1.2 The JobManager and TaskManagers

The JobManager and TaskManagers can be started in various ways: directly on the machines as a standalone cluster, in containers, or managed by resource frameworks like YARN. TaskManagers connect to JobManagers, announcing themselves as available, and are assigned work.

2.1.2.1 JobManager

The JobManager has a number of responsibilities related to coordinating the distributed execution of Flink Applications: it decides when to schedule the next task (or set of tasks), reacts to finished tasks or execution failures, coordinates checkpoints, and coordinates recovery on failures, among others. This process consists of three different components:

  • ResourceManager: The ResourceManager is responsible for resource de-/allocation and provisioning in a Flink cluster. It manages task slots, which are the unit of resource scheduling in a Flink cluster (see TaskManagers). Flink implements multiple ResourceManagers for different environments and resource providers such as YARN, Kubernetes and standalone deployments. In a standalone setup, the ResourceManager can only distribute the slots of available TaskManagers and cannot start new TaskManagers on its own.

  • Dispatcher: The Dispatcher provides a REST interface to submit Flink applications for execution and starts a new JobMaster for each submitted job. It also runs the Flink WebUI to provide information about job executions.

  • JobMaster: A JobMaster is responsible for managing the execution of a single JobGraph. Multiple jobs can run simultaneously in a Flink cluster, each having its own JobMaster.

There is always at least one JobManager. A high-availability setup might have multiple JobManagers, one of which is always the leader, and the others are standby (see High Availability (HA)).

2.1.2.2 TaskManagers

The TaskManagers (also called workers) execute the tasks of a dataflow, and buffer and exchange the data streams. There must always be at least one TaskManager. The smallest unit of resource scheduling in a TaskManager is a task slot. The number of task slots in a TaskManager indicates the number of concurrent processing tasks. Note that multiple operators may execute in a task slot (see Tasks and Operator Chains).
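As a rough illustration of slots as the unit of scheduling, the sketch below models a standalone cluster in plain Python. It is not Flink code: the function name `allocate_slots` and the TaskManager names are hypothetical, and the only behaviour it models is that requests beyond the available slots stay unfulfilled, because a standalone ResourceManager cannot start new TaskManagers.

```python
from typing import Dict, List, Tuple


def allocate_slots(slots_per_tm: Dict[str, int], requested: int) -> Tuple[List[Tuple[str, int]], int]:
    """Grant up to `requested` slots from the registered TaskManagers.

    Returns the granted (task_manager, slot_index) pairs and the number of
    requests that could not be served. Hypothetical model, not Flink's API.
    """
    # Enumerate every free slot offered by every TaskManager.
    free = [(tm, i) for tm, n in slots_per_tm.items() for i in range(n)]
    granted = free[:requested]
    unfulfilled = max(0, requested - len(free))
    return granted, unfulfilled


# Two TaskManagers with two slots each can run at most four concurrent tasks.
granted, unfulfilled = allocate_slots({"tm-1": 2, "tm-2": 2}, requested=5)
print(granted)      # four (task_manager, slot_index) pairs
print(unfulfilled)  # 1
```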

2.2 Types of Time Windows

Source: Windows | Apache Flink (official documentation)

2.2.1 Tumbling Windows

A tumbling windows assigner assigns each element to a window of a specified window size. Tumbling windows have a fixed size and do not overlap. For example, if you specify a tumbling window with a size of 5 minutes, the current window will be evaluated and a new window will be started every five minutes as illustrated by the following figure.
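The assignment rule can be sketched with simple modular arithmetic. The snippet below is a plain-Python illustration, not the Flink API; the function name `tumbling_window` and the millisecond timestamps are assumptions made for the example.

```python
from typing import Tuple


def tumbling_window(timestamp_ms: int, size_ms: int) -> Tuple[int, int]:
    """Return the half-open [start, end) window that contains timestamp_ms."""
    # Round the timestamp down to the nearest multiple of the window size.
    start = timestamp_ms - timestamp_ms % size_ms
    return start, start + size_ms


FIVE_MINUTES = 5 * 60 * 1000

# Each timestamp falls into exactly one window; windows never overlap.
for ts in (0, 299_999, 300_000, 720_000):
    print(ts, tumbling_window(ts, FIVE_MINUTES))
```

Because the start is obtained by rounding down, every timestamp maps to exactly one window, which is what "fixed size and do not overlap" means in practice.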

2.2.2 Sliding Windows

The sliding windows assigner assigns elements to windows of fixed length. Similar to a tumbling windows assigner, the size of the windows is configured by the window size parameter. An additional window slide parameter controls how frequently a sliding window is started. Hence, sliding windows can be overlapping if the slide is smaller than the window size. In this case elements are assigned to multiple windows.

For example, you could have windows of size 10 minutes that slide by 5 minutes. With this you get every 5 minutes a window that contains the events that arrived during the last 10 minutes as depicted by the following figure.
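To make the overlap concrete, here is a minimal plain-Python sketch of which windows a single element lands in. It is an illustration under simple assumptions (integer timestamps in minutes, windows aligned to multiples of the slide), and `sliding_windows` is a hypothetical helper, not a Flink function.

```python
from typing import List, Tuple


def sliding_windows(timestamp: int, size: int, slide: int) -> List[Tuple[int, int]]:
    """Return every half-open [start, end) window that contains timestamp."""
    # Latest window start that is not after the timestamp.
    start = timestamp - timestamp % slide
    windows = []
    # Walk backwards one slide at a time while the window still covers the timestamp.
    while start > timestamp - size:
        windows.append((start, start + size))
        start -= slide
    return windows


# Size 10 minutes, slide 5 minutes: an element at minute 7 belongs to two windows.
print(sliding_windows(7, size=10, slide=5))   # [(5, 15), (0, 10)]
# With slide == size the result degenerates to a single tumbling window.
print(sliding_windows(7, size=10, slide=10))  # [(0, 10)]
```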

2.2.3 Session Windows

The session windows assigner groups elements by sessions of activity. Session windows do not overlap and do not have a fixed start and end time, in contrast to tumbling windows and sliding windows. Instead a session window closes when it does not receive elements for a certain period of time, i.e., when a gap of inactivity occurred. A session window assigner can be configured with either a static session gap or with a session gap extractor function which defines how long the period of inactivity is. When this period expires, the current session closes and subsequent elements are assigned to a new session window.
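The gap-based grouping can be sketched as follows for a static session gap. This is a simplified plain-Python model over an already collected list of timestamps for one key, not Flink's streaming implementation; `session_windows` is a hypothetical name, and how an element landing exactly on a session's end is treated is an assumption of this sketch.

```python
from typing import Iterable, List, Tuple


def session_windows(timestamps: Iterable[int], gap: int) -> List[Tuple[int, int]]:
    """Group timestamps into sessions separated by at least `gap` of inactivity.

    Each session is returned as (start, end), where end = last element + gap.
    """
    sessions: List[Tuple[int, int]] = []
    for ts in sorted(timestamps):
        if sessions and ts <= sessions[-1][1]:
            # Element arrived before the gap expired: extend the current session.
            start, _ = sessions[-1]
            sessions[-1] = (start, ts + gap)
        else:
            # Gap of inactivity elapsed: open a new session.
            sessions.append((ts, ts + gap))
    return sessions


print(session_windows([1, 3, 4, 20, 22, 40], gap=10))
# [(1, 14), (20, 32), (40, 50)]
```

Unlike tumbling and sliding windows, the session boundaries here depend entirely on the data: the same gap produces sessions of different lengths.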

2.2.4 Global Windows

A global windows assigner assigns all elements with the same key to the same single global window. This windowing scheme is only useful if you also specify a custom trigger. Otherwise, no computation will be performed, as the global window does not have a natural end at which we could process the aggregated elements.
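The role of the trigger can be illustrated with a small plain-Python model. It assumes a hypothetical count-based trigger that fires and purges the buffer every `count` elements per key; the function name and the sum aggregation are choices made for this example, not Flink's API.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def global_window_with_count_trigger(
    events: Iterable[Tuple[str, int]], count: int
) -> Tuple[List[Tuple[str, int]], Dict[str, List[int]]]:
    """Buffer values per key in one never-ending window; emit a sum every `count` elements."""
    buffers: Dict[str, List[int]] = defaultdict(list)
    fired: List[Tuple[str, int]] = []
    for key, value in events:
        buffers[key].append(value)
        if len(buffers[key]) == count:
            # The trigger fires: emit the aggregate, then purge the buffer.
            fired.append((key, sum(buffers[key])))
            buffers[key].clear()
    # Anything left in the buffers was never emitted, since the window has no natural end.
    return fired, dict(buffers)


events = [("a", 1), ("b", 10), ("a", 2), ("a", 3), ("b", 20)]
fired, pending = global_window_with_count_trigger(events, count=2)
print(fired)    # [('a', 3), ('b', 30)]
print(pending)  # {'a': [3], 'b': []}
```

The leftover value for key "a" shows why a trigger is required: without one, nothing would ever be emitted from the global window.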

2.3 Flink's Three Time Semantics

2.3.1 Event Time

Event time is the timestamp carried by each element or event in the data stream, usually the time at which the event actually occurred.

2.3.2 Processing Time

Processing time is the system clock time of the machine that executes the operation on the event.

2.3.3 Ingestion Time

Ingestion time is the time at which an event arrives at the Flink source.
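The practical difference between the semantics shows up when a delayed event is windowed. The plain-Python sketch below is an illustration only: it assumes one event with three timestamps in minutes (when it occurred, when it reached the source, and when an operator handled it, the last being processing time) and 5-minute tumbling windows. The helper names are hypothetical.

```python
from typing import Dict


def window_start(timestamp: int, size: int) -> int:
    """Start of the tumbling window of the given size that contains timestamp."""
    return timestamp - timestamp % size


def windows_by_semantics(event_time: int, ingestion_time: int, processing_time: int, size: int) -> Dict[str, int]:
    """Show which window start the same event gets under each time semantic."""
    return {
        "event_time": window_start(event_time, size),
        "ingestion_time": window_start(ingestion_time, size),
        "processing_time": window_start(processing_time, size),
    }


# The event occurred at minute 4, reached the source at minute 6,
# and was handled by the operator at minute 11.
print(windows_by_semantics(event_time=4, ingestion_time=6, processing_time=11, size=5))
# {'event_time': 0, 'ingestion_time': 5, 'processing_time': 10}
```

Only the event-time result is independent of transport and processing delays, since that timestamp travels with the event itself.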

3. Practical Applications

3.1 Window Aggregation

3.2 Data Pipelines

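4. Performance Optimization

This section describes how to optimize Flink performance, for example through operator parallelism and state management.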

5. Frequently Asked Questions

Answers to common questions about Flink.

6. Summary

A summary of Flink's key concepts, with learning resources and directions for further study.

References

[1] Official documentation: Windows | Apache Flink
