Apache Flink

"Apache Flink is an open-source stream processing framework for distributed, high-performance, ready-to-use, and accurate stream processing applications."

Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. Flink is designed to run in all common cluster environments and to perform computations at in-memory speed and at any scale.

One of Flink's most important features is its checkpoint mechanism. Flink implements consistent distributed snapshots based on the Chandy-Lamport algorithm, and these snapshots are what back its consistency semantics. The Chandy-Lamport algorithm was proposed in 1985 but saw little practical use for a long time; Flink adapted it for stream processing and brought it into wide use.
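The idea behind checkpoint-based recovery can be sketched in a few lines of plain Python. This is a deliberately simplified illustration, not Flink's implementation: there is a single operator, the checkpoint barrier is reduced to "every N records", and the names `Snapshot`, `run_job`, `checkpoint_interval`, and `fail_after` are hypothetical. The point it shows is that the source position and the operator state are saved together, so replaying from the snapshot yields the same result as a failure-free run.

```python
from dataclasses import dataclass


@dataclass
class Snapshot:
    source_offset: int  # index of the next record the source would emit
    counts: dict        # operator state as of that source position


def run_job(records, checkpoint_interval, start=None, fail_after=None):
    """Count words from a replayable source, snapshotting periodically.

    Returns (counts, last_snapshot, failed). All names here are
    illustrative assumptions, not part of any Flink API.
    """
    offset = start.source_offset if start else 0
    counts = dict(start.counts) if start else {}
    last = start
    processed = 0
    while offset < len(records):
        if fail_after is not None and processed == fail_after:
            # Simulated crash: the in-memory counts can no longer be trusted.
            return counts, last, True
        word = records[offset]
        counts[word] = counts.get(word, 0) + 1
        offset += 1
        processed += 1
        if offset % checkpoint_interval == 0:
            # A "barrier" follows this record: save state and offset together.
            last = Snapshot(offset, dict(counts))
    return counts, last, False


records = ["a", "b", "a", "c", "a", "b", "b"]
# First attempt crashes after five records; only the snapshot survives.
_, snapshot, failed = run_job(records, checkpoint_interval=3, fail_after=5)
# Recovery restores the state and replays the source from the saved offset.
recovered, _, _ = run_job(records, checkpoint_interval=3, start=snapshot)
```

Records 4 and 5 are processed twice overall, but because the restored state predates them, each one affects the final counts exactly once.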

Spark has more recently added a continuous processing mode whose purpose is to reduce processing latency. It needs to provide the same kind of consistency semantics and also ended up adopting the Chandy-Lamport algorithm, which shows that the algorithm has gained a degree of recognition in the industry.

On top of these consistency semantics, Flink provides a simple and clear set of State APIs, including ValueState, ListState, and MapState, which make it easier for users to manage state in their programs. BroadcastState was added more recently. State managed through the State API is automatically covered by the same consistency semantics.
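To make the idea of keyed state more concrete, here is a toy stand-in written in plain Python. It is only a sketch of the concept: `KeyedValueState` and `running_sum` are hypothetical names, and the real ValueState is a Java interface managed by the Flink runtime. What the sketch shows is that a value state holds one value per key, and the operator only ever reads or updates the value for the key of the record it is currently processing.

```python
class KeyedValueState:
    """Toy model of a keyed value state: one value per key (illustrative only)."""

    def __init__(self, default=None):
        self._values = {}
        self._default = default
        self.current_key = None  # in a real runtime this is set by the framework

    def value(self):
        # Read the value scoped to the current key.
        return self._values.get(self.current_key, self._default)

    def update(self, new_value):
        # Write the value scoped to the current key.
        self._values[self.current_key] = new_value


def running_sum(events):
    """Emit a per-key running total for a stream of (key, amount) pairs."""
    state = KeyedValueState(default=0)
    out = []
    for key, amount in events:
        state.current_key = key
        total = state.value() + amount
        state.update(total)
        out.append((key, total))
    return out


totals = running_sum([("a", 1), ("b", 10), ("a", 2)])
```

Because all state lives in one well-defined place, a runtime can include it in a snapshot without the user writing any persistence code.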

Flink also implements a watermark mechanism. It supports processing based on event time as well as processing based on system time, and it can tolerate delayed, late, and out-of-order data.
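A common way to generate watermarks is to assume that events arrive at most a fixed amount of time out of order. The following plain-Python sketch illustrates that idea under simplifying assumptions: the watermark is recomputed on every event, timestamps are plain integers, and the names `assign_watermarks` and `max_out_of_orderness` are made up for this example. Details of Flink's own watermark strategies, such as periodic emission, are left out.

```python
def assign_watermarks(timestamps, max_out_of_orderness):
    """Return (timestamp, watermark_after_event, is_late) for each event.

    The watermark trails the largest timestamp seen so far by
    max_out_of_orderness. An event counts as late when its timestamp is
    already below the watermark at the moment it arrives.
    """
    max_seen = float("-inf")
    result = []
    for ts in timestamps:
        watermark_before = max_seen - max_out_of_orderness
        is_late = ts < watermark_before
        max_seen = max(max_seen, ts)
        result.append((ts, max_seen - max_out_of_orderness, is_late))
    return result


events = assign_watermarks([10, 12, 11, 20, 15, 19], max_out_of_orderness=3)
```

In this run, the event with timestamp 11 is out of order but still within the tolerated bound, while the event with timestamp 15 arrives after the watermark has advanced to 17 and is flagged as late.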

In stream computation, a window is usually defined before the data is operated on; that is, the window determines which slice of the stream each calculation covers. Flink provides a variety of windows out of the box, such as sliding windows, tumbling windows, and session windows, as well as very flexible custom windows.
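The three built-in window types differ mainly in how a timestamp is mapped to one or more windows. The sketch below shows one plausible way to express that mapping in plain Python. The function names are hypothetical, windows are half-open intervals `(start, end)`, and the session rule used here (events closer together than `gap` share a session) is this sketch's own convention rather than a statement about Flink's exact boundary behavior.

```python
def tumbling_window(ts, size):
    """Each timestamp falls into exactly one fixed-size window."""
    start = ts - ts % size
    return (start, start + size)


def sliding_windows(ts, size, slide):
    """A timestamp can fall into several overlapping windows."""
    windows = []
    start = ts - ts % slide  # latest window start that still covers ts
    while start > ts - size:
        windows.append((start, start + size))
        start -= slide
    return windows


def session_windows(timestamps, gap):
    """Group timestamps into sessions separated by at least `gap` of inactivity."""
    sessions = []
    for ts in sorted(timestamps):
        if sessions and ts - sessions[-1][-1] < gap:
            sessions[-1].append(ts)
        else:
            sessions.append([ts])
    return sessions


one = tumbling_window(17, size=10)
many = sliding_windows(17, size=10, slide=5)
sessions = session_windows([1, 2, 3, 10, 11, 30], gap=5)
```

With a size of 10 and a slide of 5, the timestamp 17 belongs to two sliding windows, whereas it belongs to a single tumbling window of the same size.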

Batch and stream processing

Batch data is bounded, persistent, and large. Batch processing suits computations that need access to a full set of records and is generally used for offline statistics. Stream data is unbounded and arrives in real time. Stream processing does not operate on the entire data set; it operates on each data item as it passes through the system, and it is generally used for real-time statistics.

In the Spark ecosystem, batch processing and stream processing use different technical frameworks: batch processing is handled by Spark SQL and stream processing by Spark Streaming. Most frameworks adopt this strategy of using separate processors for batch and stream workloads. Flink, by contrast, can do both batch and stream processing.

The core of Flink's computing architecture is the Flink Runtime execution engine shown below, a distributed system that accepts dataflow programs and executes them in a fault-tolerant manner on one or more machines.

The Flink Runtime execution engine can run on a cluster as a YARN (Yet Another Resource Negotiator) application or on a Mesos cluster. It can also run on a single machine, which is very useful for debugging Flink applications.

The above figure shows the core components of the Flink technology stack. Notably, Flink provides both a stream processing interface (the DataStream API) and a batch processing interface (the DataSet API), so it can do both stream and batch processing. Flink also supports extension libraries for machine learning (FlinkML), complex event processing (CEP), and graph computing (Gelly), as well as Table APIs for stream and batch processing.

Flink's distributed nature shows in its ability to run on hundreds or thousands of machines: a large computing task is divided into many smaller parts, and each machine performs a portion. Flink automatically keeps computations running when a machine fails or another error occurs, and it can schedule them to run again after a bug is fixed or a version is upgraded. This frees developers from much of the worry about runtime failures. At its core, Flink uses fault-tolerant data streaming, which lets developers analyze data that is generated continuously and never ends (that is, stream processing).

The basic building blocks of a Flink program are streams and transformations. Conceptually, a stream is a flow of data records, and a transformation is an operation that takes one or more streams as input and produces one or more output streams.

When a Flink program is executed, it is mapped to a streaming dataflow, which consists of a set of streams and transformation operators. Each dataflow starts with one or more source operators and ends with one or more sink operators.
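The source, transformation, and sink structure can be imitated with Python generators. This is only an analogy to show the shape of a dataflow; the function names are invented for the example and none of them correspond to Flink API calls. Each transformation consumes one stream and produces another, and nothing is computed until the sink pulls records through the chain.

```python
def source(lines):
    """Source operator: emits one record per input line."""
    for line in lines:
        yield line


def split_words(stream):
    """Transformation: turns a stream of lines into a stream of lowercase words."""
    for line in stream:
        for word in line.split():
            yield word.lower()


def keep_long(stream, min_len):
    """Transformation: drops words shorter than min_len."""
    for word in stream:
        if len(word) >= min_len:
            yield word


def sink(stream):
    """Sink operator: collects the final stream (here, into a list)."""
    return list(stream)


lines = ["Flink runs streams", "to a sink"]
result = sink(keep_long(split_words(source(lines)), min_len=4))
```

The nesting of the calls mirrors the dataflow graph: one source, two chained transformations, and one sink.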

Flink programs are parallel and distributed by nature. During execution, a stream has one or more stream partitions, and each operator has one or more operator subtasks. Operator subtasks are independent of each other and execute in different threads, possibly on different machines or containers. The number of operator subtasks is the parallelism of that particular operator. Different operators in the same program may have different levels of parallelism.

A stream can be divided into multiple stream partitions, and an operator can be divided into multiple operator subtasks. In the figure above, Source is divided into Source1 and Source2, which are the operator subtasks of Source. Each operator subtask executes independently in its own thread. The parallelism of an operator equals its number of operator subtasks, so the parallelism of Source in the figure above is 2. The parallelism of a stream equals the parallelism of the operator that produces it.
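The relationship between partitions, subtasks, and parallelism can be illustrated with a thread pool. This is a rough analogy in plain Python, not how Flink schedules work: `run_operator` is a hypothetical helper that starts one subtask per stream partition, so the operator's parallelism is simply the number of partitions it is given (assumed to be at least one).

```python
from concurrent.futures import ThreadPoolExecutor


def run_operator(partitions, fn):
    """Apply fn to every record, using one subtask (thread) per partition."""

    def subtask(partition):
        # Each subtask only ever sees the records of its own partition.
        return [fn(record) for record in partition]

    parallelism = len(partitions)
    with ThreadPoolExecutor(max_workers=parallelism) as pool:
        # map() returns results in partition order, one output list per subtask.
        return list(pool.map(subtask, partitions))


# Two partitions, so the operator runs with a parallelism of 2.
output = run_operator([[1, 2], [3]], lambda x: x * 10)
```

The output again consists of two partitions, which matches the statement that a stream has the parallelism of the operator that produces it.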

Data can be passed between two operators in one of two modes:

One-to-one mode: The partitioning and the ordering of the data are preserved between the two operators. For example, between Source1 and Map1 in the figure above, Map1 sees the same elements in the same order as Source1 produced them.

Redistributing mode: This mode changes the partitioning of the data. Each operator subtask sends data to different target subtasks depending on the selected transformation. For example, keyBy() repartitions by hashing the key, broadcast() sends every record to all target subtasks, and rebalance() redistributes records evenly in round-robin fashion.
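The redistribution strategies named above can be sketched as small partitioning functions. The code is a plain-Python illustration with several assumptions: the functions only borrow their names from the Flink methods, `zlib.crc32` stands in for whatever hash function a real runtime uses, and rebalancing is modelled as round-robin distribution. Each function returns one list of records per target subtask.

```python
import zlib


def key_by(records, key_fn, parallelism):
    """Hash partitioning: all records with the same key reach the same subtask."""
    targets = [[] for _ in range(parallelism)]
    for record in records:
        key_bytes = str(key_fn(record)).encode("utf-8")
        targets[zlib.crc32(key_bytes) % parallelism].append(record)
    return targets


def rebalance(records, parallelism):
    """Round-robin partitioning: records are spread evenly over the subtasks."""
    targets = [[] for _ in range(parallelism)]
    for i, record in enumerate(records):
        targets[i % parallelism].append(record)
    return targets


def broadcast(records, parallelism):
    """Broadcast: every target subtask receives every record."""
    return [list(records) for _ in range(parallelism)]


clicks = [("alice", 1), ("bob", 2), ("alice", 3), ("carol", 4)]
by_user = key_by(clicks, key_fn=lambda r: r[0], parallelism=2)
even = rebalance([1, 2, 3, 4, 5], parallelism=2)
copies = broadcast(["rule"], parallelism=3)
```

Within each target, `key_by` keeps the records of one key in their original relative order, which is why per-key ordering survives hash repartitioning even though the global order does not.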
