Apache Flink

"Apache Flink is the opensource stream processing framework for distributed, high-performance, ready-to-use, and accurate stream processing applications."

Apache Flink is a framework and distributed processing engine for stateful computation of unbounded and bounded data streams. Flink is designed to run in all common clustered environments, performing computations at in-memory execution speeds and at any scale.

The first is the Checkpoint mechanism, which is one of Flink's most important features. Flink implements a distributed snapshot of consistency based on Chandy-Lamport algorithm, thus providing consistency semantics. The Chandy-Lamport algorithm was actually proposed in 1985, but it was not widely used, and Flink took it to the next level.

Spark has recently implemented Continue streaming, the purpose of which is to reduce the latency of its processing, and it also needs to provide such consistent semantics, and finally adopts the Chandy-Lamport algorithm. It shows that Chandy-Lamport algorithm has gained certain recognition in the industry.

After providing consistent semantics, Flink also provides a set of very simple and clear State apis, including ValueState, ListState, MapState, etc., in order to make it easier for users to manage state during programming. With the recent addition of BroadcastState, you can automatically enjoy this consistent semantics using the State API.

In addition, Flink also implements the Watermark mechanism, which can support event-based time processing, or processing based on system time, and can tolerate data delay, data lateness, and out-of-order data.

In addition, in stream calculation, Windows are generally opened before the flow data is operated, that is, based on what kind of window to do this calculation. Flink provides a variety of Windows out of the box, such as sliding Windows, scrolling Windows, session Windows and very flexible custom Windows.

Batch and stream processing

The characteristics of batch processing are bounded, persistent, and large, and batch processing is very suitable for computing work that requires access to a full set of records, and is generally used for offline statistics. Stream processing is characterized by unbounded and real-time, and it does not need to perform operations on the entire data set, but performs operations on each data item transmitted through the system, which is generally used for real-time statistics.

In the Spark ecosystem, different technical frameworks are adopted for batch processing and stream processing. Batch processing is realized by SparkSQL, while stream processing is realized by Spark Streaming, which is also the strategy adopted by most frameworks. Independent processors are used to realize batch processing and stream processing. Flink can do both batch and stream processing.

The core computing architecture of Flink is the Flink Runtime execution engine shown below, which is a distributed system capable of accepting data flow programs and executing them in a fault-tolerant manner on one or more machines.

The Flink Runtime execution engine can run on a cluster as an application of YARN (Yet Another Resource Negotiator) or on a Mesos cluster. It can also be run on a single machine (which is very useful for debugging Flink applications).

The above figure shows the core components of the Flink technology stack, and it is worth mentioning that Flink provides streaming processing interfaces (DataStream API) and batch processing interfaces (DataSet API). As a result, Flink can do both stream and batch processing. Flink supports extensions to machine learning (FlinkML), complex event processing (CEP), and graph Computing (Gelly), as well as Table apis for stream and batch processing, respectively.

Flink's distributed nature is reflected in its ability to run on hundreds or thousands of machines, which divides large computing tasks into many smaller parts, with each machine performing a portion. Flink automatically ensures that calculations continue in the event of a machine failure or other error, or that they are scheduled to be performed again after a bug is fixed or a version upgrade is made. This capability eliminates the need for developers to worry about running failures. Flink essentially uses fault-tolerant data streaming, which allows developers to analyze data that is continuously generated and never ends (i.e., stream processing).

The basic building blocks of a Flink program are flow and transformation. Conceptually, a stream is a stream of data records, while a transformation is the operation of one or more streams as one or more streams. Input and produce one or more output streams.

When the Flink program is executed, it is mapped to a Streaming Dataflow, which consists of a set of streams and Transformation operators. It starts with one or more Source operators and ends with one or more Sink operators.

Flink programs are parallel and distributed in nature; during execution, a stream contains one or more stream partitions, and each operator contains one or more operator subtasks. Operation subtasks are independent of each other and execute in different threads, even on different machines or containers. The number of operator subtasks is the degree of parallelism for this particular operator. Different operators in the same program have different levels of parallelism.

A Stream can be divided into multiple Stream partitions, called Stream partitions. An Operator can also be divided into multiple Operator subtasks. In the figure above, Source is divided into Source1 and Source2, which are the Operator subtasks of Source, respectively. Each Operator Subtask is executed independently in a different thread. The parallelism of an Operator is equal to the number of Operator subtasks. The parallelism of Source in the figure above is 2. The parallelism of a Stream is equal to the parallelism of the Operator it generates.

There are two modes when data is passed between the two operators:

One to One mode: When two operators are passed in this mode, the number of partitions and the ordering of data are kept. For example, Source1 to Map1 in the figure above, it retains the partitioning nature of Source and the orderliness of the partition element processing.

Redistributing mode: This mode changes the number of partitions of data. Each operator subtask sends data to different target subtasks based on the selection of transformation. For example, keyBy() repartitions using hashcode,broadcast() and broadcast() The rebalance() method repartitions randomly.

相关推荐
狮歌~资深攻城狮6 小时前
HBase性能优化秘籍:让数据处理飞起来
大数据·hbase
Elastic 中国社区官方博客7 小时前
Elasticsearch Open Inference API 增加了对 Jina AI 嵌入和 Rerank 模型的支持
大数据·人工智能·elasticsearch·搜索引擎·ai·全文检索·jina
workflower7 小时前
Prompt Engineering的重要性
大数据·人工智能·设计模式·prompt·软件工程·需求分析·ai编程
API_technology9 小时前
电商搜索API的Elasticsearch优化策略
大数据·elasticsearch·搜索引擎
黄雪超9 小时前
大数据SQL调优专题——引擎优化
大数据·数据库·sql
The god of big data9 小时前
MapReduce 第二部:深入分析与实践
大数据·mapreduce
G***技10 小时前
杰和科技GAM-AI视觉识别管理系统,让AI走进零售营销
大数据·人工智能·系统架构
天天爱吃肉821811 小时前
碳化硅(SiC)功率器件:新能源汽车的“心脏”革命与技术突围
大数据·人工智能
Java资深爱好者12 小时前
在Spark中,如何使用DataFrame进行高效的数据处理
大数据·分布式·spark
跨境卫士小树13 小时前
店铺矩阵崩塌前夜:跨境多账号运营的3个生死线
大数据·线性代数·矩阵