Apache Flink

"Apache Flink is the opensource stream processing framework for distributed, high-performance, ready-to-use, and accurate stream processing applications."

Apache Flink is a framework and distributed processing engine for stateful computation of unbounded and bounded data streams. Flink is designed to run in all common clustered environments, performing computations at in-memory execution speeds and at any scale.

The first is the Checkpoint mechanism, which is one of Flink's most important features. Flink implements a distributed snapshot of consistency based on Chandy-Lamport algorithm, thus providing consistency semantics. The Chandy-Lamport algorithm was actually proposed in 1985, but it was not widely used, and Flink took it to the next level.

Spark has recently implemented Continue streaming, the purpose of which is to reduce the latency of its processing, and it also needs to provide such consistent semantics, and finally adopts the Chandy-Lamport algorithm. It shows that Chandy-Lamport algorithm has gained certain recognition in the industry.

After providing consistent semantics, Flink also provides a set of very simple and clear State apis, including ValueState, ListState, MapState, etc., in order to make it easier for users to manage state during programming. With the recent addition of BroadcastState, you can automatically enjoy this consistent semantics using the State API.

In addition, Flink also implements the Watermark mechanism, which can support event-based time processing, or processing based on system time, and can tolerate data delay, data lateness, and out-of-order data.

In addition, in stream calculation, Windows are generally opened before the flow data is operated, that is, based on what kind of window to do this calculation. Flink provides a variety of Windows out of the box, such as sliding Windows, scrolling Windows, session Windows and very flexible custom Windows.

Batch and stream processing

The characteristics of batch processing are bounded, persistent, and large, and batch processing is very suitable for computing work that requires access to a full set of records, and is generally used for offline statistics. Stream processing is characterized by unbounded and real-time, and it does not need to perform operations on the entire data set, but performs operations on each data item transmitted through the system, which is generally used for real-time statistics.

In the Spark ecosystem, different technical frameworks are adopted for batch processing and stream processing. Batch processing is realized by SparkSQL, while stream processing is realized by Spark Streaming, which is also the strategy adopted by most frameworks. Independent processors are used to realize batch processing and stream processing. Flink can do both batch and stream processing.

The core computing architecture of Flink is the Flink Runtime execution engine shown below, which is a distributed system capable of accepting data flow programs and executing them in a fault-tolerant manner on one or more machines.

The Flink Runtime execution engine can run on a cluster as an application of YARN (Yet Another Resource Negotiator) or on a Mesos cluster. It can also be run on a single machine (which is very useful for debugging Flink applications).

The above figure shows the core components of the Flink technology stack, and it is worth mentioning that Flink provides streaming processing interfaces (DataStream API) and batch processing interfaces (DataSet API). As a result, Flink can do both stream and batch processing. Flink supports extensions to machine learning (FlinkML), complex event processing (CEP), and graph Computing (Gelly), as well as Table apis for stream and batch processing, respectively.

Flink's distributed nature is reflected in its ability to run on hundreds or thousands of machines, which divides large computing tasks into many smaller parts, with each machine performing a portion. Flink automatically ensures that calculations continue in the event of a machine failure or other error, or that they are scheduled to be performed again after a bug is fixed or a version upgrade is made. This capability eliminates the need for developers to worry about running failures. Flink essentially uses fault-tolerant data streaming, which allows developers to analyze data that is continuously generated and never ends (i.e., stream processing).

The basic building blocks of a Flink program are flow and transformation. Conceptually, a stream is a stream of data records, while a transformation is the operation of one or more streams as one or more streams. Input and produce one or more output streams.

When the Flink program is executed, it is mapped to a Streaming Dataflow, which consists of a set of streams and Transformation operators. It starts with one or more Source operators and ends with one or more Sink operators.

Flink programs are parallel and distributed in nature; during execution, a stream contains one or more stream partitions, and each operator contains one or more operator subtasks. Operation subtasks are independent of each other and execute in different threads, even on different machines or containers. The number of operator subtasks is the degree of parallelism for this particular operator. Different operators in the same program have different levels of parallelism.

A Stream can be divided into multiple Stream partitions, called Stream partitions. An Operator can also be divided into multiple Operator subtasks. In the figure above, Source is divided into Source1 and Source2, which are the Operator subtasks of Source, respectively. Each Operator Subtask is executed independently in a different thread. The parallelism of an Operator is equal to the number of Operator subtasks. The parallelism of Source in the figure above is 2. The parallelism of a Stream is equal to the parallelism of the Operator it generates.

There are two modes when data is passed between the two operators:

One to One mode: When two operators are passed in this mode, the number of partitions and the ordering of data are kept. For example, Source1 to Map1 in the figure above, it retains the partitioning nature of Source and the orderliness of the partition element processing.

Redistributing mode: This mode changes the number of partitions of data. Each operator subtask sends data to different target subtasks based on the selection of transformation. For example, keyBy() repartitions using hashcode,broadcast() and broadcast() The rebalance() method repartitions randomly.

相关推荐
方向研究8 小时前
汽油生产
大数据
码农小白AI8 小时前
IACheck AI报告文档审核:高端制造合规新助力,保障标准引用报告质量
大数据·人工智能·制造
泰迪智能科技9 小时前
分享|高校必备三大实训管理平台,助力高校人工智能、大数据、商务数据分析人才培养
大数据·人工智能·数据分析
GJGCY10 小时前
2026企业级AI智能体架构对比:RPA+大模型融合在财务场景的表现
大数据·人工智能·ai·rpa·智能体
无心水10 小时前
【OpenClaw:应用与协同】23、OpenClaw生产环境安全指南——Token管理/沙箱隔离/权限最小化
大数据·人工智能·安全·ai·性能优化·openclaw
思码逸研发效能10 小时前
代码度量分析入门:从0到1掌握核心指标
大数据·人工智能·研发效能·研发管理
云境筑桃源哇10 小时前
亿迈跨境分销商城启航
大数据·人工智能
Sylvia33.11 小时前
OpenClaw + 数眼智能:Windows/Mac 双系统部署与特价模型接入实战指南
大数据·人工智能
瑞通软件源头厂家11 小时前
瑞通酒店管理系统:开启酒店成本控制智能新篇
大数据·人工智能
搭贝12 小时前
长沙韶光芯材|精准管控工时,夯实高端制造数字化管理根基
大数据·人工智能·低代码·自动化·sass