Apache Flink

"Apache Flink is the opensource stream processing framework for distributed, high-performance, ready-to-use, and accurate stream processing applications."

Apache Flink is a framework and distributed processing engine for stateful computation of unbounded and bounded data streams. Flink is designed to run in all common clustered environments, performing computations at in-memory execution speeds and at any scale.

The first is the Checkpoint mechanism, which is one of Flink's most important features. Flink implements a distributed snapshot of consistency based on Chandy-Lamport algorithm, thus providing consistency semantics. The Chandy-Lamport algorithm was actually proposed in 1985, but it was not widely used, and Flink took it to the next level.

Spark has recently implemented Continue streaming, the purpose of which is to reduce the latency of its processing, and it also needs to provide such consistent semantics, and finally adopts the Chandy-Lamport algorithm. It shows that Chandy-Lamport algorithm has gained certain recognition in the industry.

After providing consistent semantics, Flink also provides a set of very simple and clear State apis, including ValueState, ListState, MapState, etc., in order to make it easier for users to manage state during programming. With the recent addition of BroadcastState, you can automatically enjoy this consistent semantics using the State API.

In addition, Flink also implements the Watermark mechanism, which can support event-based time processing, or processing based on system time, and can tolerate data delay, data lateness, and out-of-order data.

In addition, in stream calculation, Windows are generally opened before the flow data is operated, that is, based on what kind of window to do this calculation. Flink provides a variety of Windows out of the box, such as sliding Windows, scrolling Windows, session Windows and very flexible custom Windows.

Batch and stream processing

The characteristics of batch processing are bounded, persistent, and large, and batch processing is very suitable for computing work that requires access to a full set of records, and is generally used for offline statistics. Stream processing is characterized by unbounded and real-time, and it does not need to perform operations on the entire data set, but performs operations on each data item transmitted through the system, which is generally used for real-time statistics.

In the Spark ecosystem, different technical frameworks are adopted for batch processing and stream processing. Batch processing is realized by SparkSQL, while stream processing is realized by Spark Streaming, which is also the strategy adopted by most frameworks. Independent processors are used to realize batch processing and stream processing. Flink can do both batch and stream processing.

The core computing architecture of Flink is the Flink Runtime execution engine shown below, which is a distributed system capable of accepting data flow programs and executing them in a fault-tolerant manner on one or more machines.

The Flink Runtime execution engine can run on a cluster as an application of YARN (Yet Another Resource Negotiator) or on a Mesos cluster. It can also be run on a single machine (which is very useful for debugging Flink applications).

The above figure shows the core components of the Flink technology stack, and it is worth mentioning that Flink provides streaming processing interfaces (DataStream API) and batch processing interfaces (DataSet API). As a result, Flink can do both stream and batch processing. Flink supports extensions to machine learning (FlinkML), complex event processing (CEP), and graph Computing (Gelly), as well as Table apis for stream and batch processing, respectively.

Flink's distributed nature is reflected in its ability to run on hundreds or thousands of machines, which divides large computing tasks into many smaller parts, with each machine performing a portion. Flink automatically ensures that calculations continue in the event of a machine failure or other error, or that they are scheduled to be performed again after a bug is fixed or a version upgrade is made. This capability eliminates the need for developers to worry about running failures. Flink essentially uses fault-tolerant data streaming, which allows developers to analyze data that is continuously generated and never ends (i.e., stream processing).

The basic building blocks of a Flink program are flow and transformation. Conceptually, a stream is a stream of data records, while a transformation is the operation of one or more streams as one or more streams. Input and produce one or more output streams.

When the Flink program is executed, it is mapped to a Streaming Dataflow, which consists of a set of streams and Transformation operators. It starts with one or more Source operators and ends with one or more Sink operators.

Flink programs are parallel and distributed in nature; during execution, a stream contains one or more stream partitions, and each operator contains one or more operator subtasks. Operation subtasks are independent of each other and execute in different threads, even on different machines or containers. The number of operator subtasks is the degree of parallelism for this particular operator. Different operators in the same program have different levels of parallelism.

A Stream can be divided into multiple Stream partitions, called Stream partitions. An Operator can also be divided into multiple Operator subtasks. In the figure above, Source is divided into Source1 and Source2, which are the Operator subtasks of Source, respectively. Each Operator Subtask is executed independently in a different thread. The parallelism of an Operator is equal to the number of Operator subtasks. The parallelism of Source in the figure above is 2. The parallelism of a Stream is equal to the parallelism of the Operator it generates.

There are two modes when data is passed between the two operators:

One to One mode: When two operators are passed in this mode, the number of partitions and the ordering of data are kept. For example, Source1 to Map1 in the figure above, it retains the partitioning nature of Source and the orderliness of the partition element processing.

Redistributing mode: This mode changes the number of partitions of data. Each operator subtask sends data to different target subtasks based on the selection of transformation. For example, keyBy() repartitions using hashcode,broadcast() and broadcast() The rebalance() method repartitions randomly.

相关推荐
元拓数智1 小时前
智能分析落地卡壳?先补好「数据关系+语义治理」这层技术基建
大数据·分布式·ai·spark·数据关系·语义治理
TDengine (老段)2 小时前
TDengine Tag 设计哲学与 Schema 变更机制
大数据·数据库·物联网·时序数据库·iot·tdengine·涛思数据
sxgzzn3 小时前
新能源场站数智化转型:基于数字孪生与AI的智慧运维管理平台解析
大数据·运维·人工智能
清平乐的技术专栏4 小时前
【Flink学习】(二)Flink 本地环境搭建,运行第一个入门程序
大数据·flink
这是程序猿4 小时前
Spring Boot自动配置详解
java·大数据·前端
ws2019074 小时前
AUTO TECH China 2026广州汽车零部件展:从整机集成迈向核心部件的产业跃升
大数据·人工智能·科技·汽车
humors2214 小时前
从数据到决策:汽车使用成本的精细计算指南
大数据·程序人生
大大大大晴天4 小时前
Flink技术实践:RocksDB 状态后端技术解密
大数据·flink
189228048615 小时前
NY382固态MT29F32T08GSLBHL8-24QM:B
大数据·服务器·人工智能·科技·缓存
liu_sir_5 小时前
升级谷歌webview
大数据·elasticsearch·搜索引擎