Chapter 11: Stream Processing (Designing Data-Intensive Applications)

Stream Processing

      • Chapter 11: Stream Processing
        • 1. Core Concepts and Architecture
        • 2. Transporting Event Streams
        • 3. Stream Processing Use Cases
        • 4. Time Handling in Streams
        • 5. Stream Joins and State Management
        • 6. Fault Tolerance and Exactly-Once Semantics
        • 7. Challenges and Advanced Topics
        • 8. Real-World Examples
        • 9. Key Takeaways
        • 10. Common Pitfalls
      • Multiple-Choice Questions
      • Answers & Explanations

Chapter 11: Stream Processing

1. Core Concepts and Architecture
  • Stream Processing Defined:

    • Continuous processing of unbounded data streams in real time, in contrast to batch processing of finite datasets.
    • Core components: event producers, streaming transport (e.g., message brokers), processing engines, and sinks (output systems).
    • Example systems: Apache Flink, Apache Kafka Streams, Apache Samza.
  • Event Streams vs. Batch Data:

    • Unbounded vs. Bounded: Streams have no predefined end, requiring perpetual processing; batches are finite.
    • Low-Latency vs. High-Throughput: Streams prioritize immediate insights, while batches focus on large-scale computations.
2. Transporting Event Streams
  • Messaging Systems:

    • Traditional Message Brokers (e.g., RabbitMQ): Low latency, but messages are deleted once acknowledged, so the stream cannot be replayed; durability depends on broker persistence settings.
    • Log-Based Systems (e.g., Apache Kafka):
      • Events stored in partitioned, append-only logs.
      • Durability: Replication ensures fault tolerance.
      • Ordering: Guaranteed per-partition ordering.
      • Consumer Offsets: Each consumer group tracks its progress per partition so it can resume after failures (a minimal sketch follows at the end of this section).
  • Integration with Databases:

    • Change Data Capture (CDC): Stream database changes (e.g., row updates) as events.
    • Event Sourcing: Model system state as an immutable log of events, enabling time-travel queries and auditability.
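
The log-based model above can be illustrated in a few lines. The sketch below is a minimal, single-process stand-in (illustrative names only, not Kafka's API): events are appended to per-partition lists, a hash of the key picks the partition so per-key ordering is preserved, and each consumer group commits an offset per partition so it can resume after a failure.

```python
import hashlib

class PartitionedLog:
    """Toy append-only log: one list per partition, committed offsets per consumer group.
    A real broker adds replication, retention, and durable storage."""

    def __init__(self, num_partitions=4):
        self.partitions = [[] for _ in range(num_partitions)]
        self.offsets = {}  # (group, partition) -> next offset to read

    def append(self, key, value):
        # Hashing the key picks the partition, so per-key ordering is preserved.
        p = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(self.partitions)
        self.partitions[p].append((key, value))
        return p, len(self.partitions[p]) - 1  # partition and offset of the new event

    def poll(self, group, partition, max_records=10):
        # Read from the group's committed offset onward; the caller commits explicitly.
        start = self.offsets.get((group, partition), 0)
        return start, self.partitions[partition][start:start + max_records]

    def commit(self, group, partition, next_offset):
        # The committed offset is what lets a consumer resume after a crash.
        self.offsets[(group, partition)] = next_offset


log = PartitionedLog()
p, _ = log.append("user-42", {"event": "page_view"})
log.append("user-42", {"event": "add_to_cart"})  # same key -> same partition, ordered after page_view

start, records = log.poll("analytics", p)
# ... process records, write results ...
log.commit("analytics", p, start + len(records))
```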
3. Stream Processing Use Cases
  • Real-Time Analytics:

    • Metrics aggregation (e.g., dashboards for user activity, server health).
    • Fraud detection (e.g., identifying suspicious transactions instantly).
  • Data Synchronization:

    • Keep derived data systems (caches, search indexes) in sync with source-of-truth databases.
  • Complex Event Processing (CEP):

    • Detect patterns in streams (e.g., "user clicked 3 ads in 10 seconds" for ad fraud).
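
As a concrete illustration of the CEP pattern above, the sketch below (hypothetical names, standard library only) flags a user who clicks three or more ads within a 10-second span by keeping a small per-user buffer of recent click timestamps.

```python
from collections import defaultdict, deque

WINDOW_SECONDS = 10
THRESHOLD = 3

recent_clicks = defaultdict(deque)  # user_id -> timestamps of recent ad clicks

def on_ad_click(user_id, event_time):
    """Return True if this click completes a '3 clicks in 10 seconds' pattern."""
    clicks = recent_clicks[user_id]
    clicks.append(event_time)
    # Drop clicks that have fallen out of the 10-second window.
    while clicks and event_time - clicks[0] > WINDOW_SECONDS:
        clicks.popleft()
    return len(clicks) >= THRESHOLD

# Example: the third click at t=9 triggers the pattern.
assert not on_ad_click("u1", 1)
assert not on_ad_click("u1", 5)
assert on_ad_click("u1", 9)
```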
4. Time Handling in Streams
  • Event Time vs. Processing Time:

    • Event Time: When the event occurred (embedded in data).
    • Processing Time: When the event is processed by the system.
    • Challenge: Events may arrive out-of-order due to network delays (e.g., mobile app events).
  • Windowing Mechanisms:

    • Tumbling Windows: Fixed, non-overlapping intervals (e.g., 1-minute aggregates).
    • Sliding Windows: Overlapping intervals for smoother trends.
    • Session Windows: Activity-based windows (e.g., user sessions with inactivity gaps).
  • Watermarks:

    • Heuristic markers indicating when all events for a time window are expected to have arrived. Used to trigger computations despite possible late arrivals.
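
A minimal sketch of how event time, tumbling windows, and watermarks interact, assuming the simple heuristic "watermark = maximum event time seen so far minus a fixed allowed lateness" (real engines use more sophisticated watermark generators). Windows whose end falls at or below the watermark are emitted and closed; events that arrive for an already-closed window are set aside as late.

```python
from collections import defaultdict

WINDOW = 60           # 1-minute tumbling windows (seconds)
ALLOWED_LATENESS = 5  # watermark heuristic: max event time seen minus 5 seconds

windows = defaultdict(int)  # window start -> running count
late_events = []
max_event_time = 0

def on_event(event_time):
    """Count events per tumbling window; return windows closed by the new watermark."""
    global max_event_time
    max_event_time = max(max_event_time, event_time)
    watermark = max_event_time - ALLOWED_LATENESS

    window_start = (event_time // WINDOW) * WINDOW
    if window_start + WINDOW <= watermark:
        late_events.append(event_time)  # its window was already considered complete
        return []
    windows[window_start] += 1

    # Emit (and drop) every window that now lies entirely below the watermark.
    return [(s, windows.pop(s)) for s in sorted(windows) if s + WINDOW <= watermark]

on_event(10); on_event(42)        # both land in window [0, 60)
print(on_event(130))              # watermark = 125 -> closes window [0, 60): [(0, 2)]
print(on_event(20), late_events)  # arrives after its window closed -> [] [20]
```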
5. Stream Joins and State Management
  • Join Types:

    • Stream-Stream Joins: Combine two streams (e.g., clickstream and purchase events) using windowing.
    • Stream-Table Joins: Enrich stream events with data from a table (e.g., user profiles), typically a local copy kept up to date via change data capture (a minimal sketch follows at the end of this section).
    • Table-Table Joins: Maintain dynamic tables updated by streams (e.g., real-time inventory).
  • Stateful Processing:

    • Local State: Stored in processing nodes (e.g., in-memory hash tables) for fast access.
    • Fault Tolerance: Periodic checkpoints (e.g., Flink's distributed snapshots) to recover state after failures.
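
A minimal sketch combining a stream-table join with local state (illustrative names, not any particular framework's API): a CDC-style stream of profile changes keeps an in-memory table current, and each click event is enriched by a lookup in that table. Periodically checkpointing `profiles` to durable storage (not shown) is what would allow the node to recover its state after a crash.

```python
# Local state: the 'table' side of the join, kept up to date by a CDC stream.
profiles = {}

def on_profile_change(user_id, profile):
    """CDC event from the user database: upsert into local state."""
    profiles[user_id] = profile

def on_click(event):
    """Stream event: enrich with the latest known profile (stream-table join)."""
    profile = profiles.get(event["user_id"], {})
    return {**event, "country": profile.get("country", "unknown")}

on_profile_change("u1", {"country": "DE"})
print(on_click({"user_id": "u1", "page": "/home"}))  # enriched with country 'DE'
print(on_click({"user_id": "u2", "page": "/cart"}))  # no profile yet -> 'unknown'
```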
6. Fault Tolerance and Exactly-Once Semantics
  • At-Least-Once vs. Exactly-Once:

    • At-Least-Once: Each event is processed one or more times, so duplicates are possible.
    • Exactly-Once: Each event's effect appears exactly once in the output; achieved via idempotent operations or transactional protocols.
  • Techniques:

    • Checkpointing: Save state snapshots to durable storage; replay from last checkpoint on failure.
    • Transactional Updates:
      • Use two-phase commit (2PC) for atomic writes to sinks (e.g., databases).
      • Example: Kafka's transactions for atomic writes across partitions.
    • Idempotent Operations: Ensure repeated processing has the same effect (e.g., appending with unique IDs).
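
A minimal sketch of the idempotence idea above: the sink remembers which event IDs it has already applied, so redelivered events (from at-least-once transport, or a replay after failure) have no additional effect. In a real sink the processed-ID set and the data would be updated in the same transaction; this single-process sketch uses illustrative names only.

```python
processed_ids = set()  # in a real sink, stored alongside the data and updated transactionally
account_balance = 0

def apply_deposit(event_id, amount):
    """Idempotent write: applying the same event twice changes nothing."""
    global account_balance
    if event_id in processed_ids:
        return account_balance  # duplicate delivery: ignore
    processed_ids.add(event_id)
    account_balance += amount
    return account_balance

apply_deposit("evt-1", 100)
apply_deposit("evt-1", 100)  # redelivered after a failure: no double credit
print(account_balance)       # 100
```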
7. Challenges and Advanced Topics
  • Handling Late Data:

    • Route late events to side outputs (e.g., the Dataflow model's triggers and accumulation modes).
    • Adjusting watermarks dynamically based on observed delays.
  • Scalability:

    • Horizontal scaling via partitioning (e.g., Kafka's partition keys).
    • State redistribution during rebalancing (e.g., Flink's key-group reshuffling).
  • Resource Management:

    • Backpressure mechanisms to handle load spikes (e.g., Kafka's consumer fetch limits); a minimal sketch follows this list.
    • Autoscaling based on throughput/latency metrics.
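
The backpressure idea above can be reduced to a bounded buffer between producer and consumer: when the buffer is full, `put()` blocks, throttling the producer to the consumer's pace instead of letting memory grow without bound. A minimal illustration using Python's standard library, not any broker's actual flow-control protocol.

```python
import queue
import threading
import time

buffer = queue.Queue(maxsize=100)  # bounded: a full buffer blocks the producer

def producer():
    for i in range(1000):
        buffer.put(i)   # blocks whenever the consumer falls behind -> backpressure
    buffer.put(None)    # sentinel: no more events

def consumer():
    while True:
        item = buffer.get()
        if item is None:
            break
        time.sleep(0.001)  # simulate per-event processing cost

t1 = threading.Thread(target=producer)
t2 = threading.Thread(target=consumer)
t1.start(); t2.start()
t1.join(); t2.join()
print("done: the producer was throttled to the consumer's pace")
```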
8. Real-World Examples
  • Netflix:

    • Uses Apache Flink for real-time anomaly detection in video streaming quality.
    • Employs Kafka for event sourcing microservices.
  • Uber:

    • Apache Samza for real-time ETA calculations and surge pricing.
  • Apache Kafka:

    • Kafka Streams library enables stateful stream processing with exactly-once semantics via transactional logs.
9. Key Takeaways
  • Event Time is Critical: Correct handling of out-of-order data ensures accurate results.
  • State Management: Efficient state storage and fault tolerance are foundational for complex workflows.
  • Trade-offs: Latency vs. accuracy, throughput vs. resource usage, and simplicity vs. fault tolerance.
10. Common Pitfalls
  • Ignoring Event Time: Using processing time for windowing can lead to incorrect aggregations.
  • Overloading State: Storing excessive state in memory without checkpointing risks data loss.
  • Neglecting Backpressure: Unmanaged load can cascade into system failures.

Multiple-Choice Questions


Question 1: Fault Tolerance in Stream Processing

Which techniques are valid for achieving fault tolerance in distributed stream processing systems?

A. Idempotent writes

B. Checkpointing operator state periodically

C. Using exactly-once message delivery guarantees

D. Recomputing lost data from immutable source-of-truth logs


Question 2: Event Time vs. Processing Time

Which statements are true about event time and processing time?

A. Event time is determined by the system clock when an event is processed.

B. Watermarks are used to handle late-arriving events in event time processing.

C. Processing time guarantees no late data but may misrepresent event causality.

D. Event time requires all events to arrive in strict chronological order.


Question 3: Stream-Table Joins

Which scenarios require stream-table joins?

A. Enriching clickstream events with static user profile data.

B. Detecting patterns across two infinite event streams with temporal constraints.

C. Aggregating sensor readings into hourly summaries.

D. Updating a leaderboard based on real-time game scores.


Question 4: Change Data Capture (CDC)

Which are valid use cases for Change Data Capture?

A. Migrating data from a legacy monolithic database to a data warehouse.

B. Creating materialized views in a caching layer.

C. Triggering business workflows in response to database updates.

D. Performing ad-hoc analytical queries on historical data.


Question 5: Exactly-Once Semantics

Which mechanisms enable exactly-once processing in stream pipelines?

A. Transactional writes to sinks using two-phase commit.

B. At-least-once message delivery with deduplication.

C. Batch-style micro-batching of records.

D. Stateful operators with deterministic processing.


Question 6: State Management Challenges

What are key challenges in managing stateful stream processing?

A. Ensuring low-latency access to distributed state.

B. Handling out-of-order event delivery.

C. Maintaining consistency during failover.

D. Scaling state storage independently of processing throughput.


Question 7: Stream Processing Backpressure

Which strategies mitigate backpressure in high-throughput stream systems?

A. Dropping events selectively during overload.

B. Dynamic resource scaling based on queue lengths.

C. Switching to batch processing temporarily.

D. Implementing flow control protocols between nodes.


Question 8: Time Windowing

Which window types are supported in event-time processing?

A. Sliding windows with overlapping intervals.

B. Tumbling windows aligned to processing time.

C. Session windows based on event inactivity gaps.

D. Global windows encompassing all historical data.


Question 9: Stream-Stream Joins

What are valid approaches for joining two infinite event streams?

A. Windowed joins with temporal constraints.

B. Joining on a common key with no time bounds.

C. Using a distributed hash table for all historical matches.

D. Correlating events via shared transaction IDs.


Question 10: Immutable Logs

Which properties make immutable logs suitable for stream processing?

A. Support for random-access updates.

B. Ability to replay events for reprocessing.

C. Horizontal scalability through partitioning.

D. Built-in compression of historical data.


Answers & Explanations


Answer 1: A, B, D

  • A: Idempotent writes ensure that reprocessing the same event has no additional effect.
  • B: Checkpointing saves state snapshots for recovery.
  • D: Recomputing from source logs (e.g., Kafka) enables recovery.
  • C: Exactly-once delivery is a semantic guarantee, not a standalone technique.

Answer 2: B, C

  • B: Watermarks estimate event time progress to handle late data.
  • C: Processing time never has to wait for late data, but it can misrepresent when events actually happened.
  • A: Incorrect: the system clock at processing determines processing time; event time is embedded in the event itself.
  • D: Incorrect: event-time processing tolerates out-of-order arrival.

Answer 3: A

  • A: A reference table (user profiles) joined with a stream (clicks) is the canonical stream-table join.
  • B: Detecting patterns across two infinite streams with temporal constraints calls for a stream-stream join, not a stream-table join.
  • C: Aggregating sensor readings uses windowing, not joins.
  • D: Leaderboards use stateful aggregation, not joins.

Answer 4: A, B, C

  • A: CDC replicates database changes to other systems (e.g., data warehouses).
  • B: Materialized views can be updated via CDC.
  • C: CDC events trigger downstream workflows.
  • D: Ad-hoc queries relate to batch analytics, not CDC.

Answer 5: A, B, D

  • A: Transactions ensure atomic writes to sinks.
  • B: Deduplication with idempotency achieves exactly-once.
  • D: Determinism ensures reprocessed data matches initial results.
  • C: Micro-batching alone doesn't guarantee exactly-once.

Answer 6: A, B, C

  • A: State access latency impacts processing speed.
  • B: Out-of-order events complicate state updates.
  • C: Failover requires consistent state restoration.
  • D: State storage scaling is orthogonal to processing.

Answer 7: A, B, D

  • A: Selective dropping trades data loss for stability.
  • B: Scaling adjusts resources to handle load.
  • D: Flow control (e.g., TCP-like) regulates data flow.
  • C: Batch switching disrupts real-time processing.

Answer 8: A, C

  • A: Sliding windows allow overlapping intervals (e.g., every 5 mins for 10-min windows).
  • C: Session windows group events separated by inactivity.
  • B: Tumbling windows in processing time misalign with event time.
  • D: Global windows are unbounded and require triggers.

Answer 9: A, D

  • A: Windowed joins limit matches to temporal ranges.
  • D: Transaction IDs correlate related events.
  • B: Unbounded joins risk infinite state growth.
  • C: Hash tables for all history are impractical.

Answer 10: B, C

  • B: Replayability is critical for reprocessing.
  • C: Partitioning enables scalability.
  • A: Incorrect: logs are append-only and do not support random-access updates.
  • D: Compression is optional, not a core property.

These questions test deep understanding of stream processing concepts, trade-offs, and implementation strategies from Chapter 11.
