Big Data 20: Flume Dual-Write Collection + Monitoring a Directory into HDFS (Agent, MemoryChannel, Source)

Chapter Contents

The previous section covered:

Writing the Agent conf file
Collecting Hive data
Aggregating it into HDFS
Testing the result

Background

The environment is three public cloud servers, used to build a Hadoop cluster for my own learning:

2C4G, node h121
2C4G, node h122
2C2G, node h123

Recommended Documentation

Besides the official documentation, there is a well-written Chinese-language guide at flume.liyifeng.org/

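Component Overview

A brief introduction to the components used here.

Source (data source)

Commonly used sources:

taildir-source: efficiently tails files (recommended)
exec-source: runs a command and reads its output; not recommended for continuous collection because it gives no delivery guarantee
spooldir-source: watches a directory for new files

Channel (buffer)

"Dual write" relies on the Replicating Channel Selector (Flume's default), which copies every event to several channels, each drained by its own sink:

memory channel: fast buffering for the real-time branch
file channel: used for the HDFS branch, so buffered events are persisted to disk

Sink (destination)

Sink1: KafkaSink (pushes data into Kafka for real-time consumers)
Sink2: HDFSEventSink (writes to HDFS under time-based paths)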

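Monitoring a Directory

Business Requirement

Watch a given directory, collect the files that arrive in it, and upload them to HDFS.

Source

spooldir is chosen because it does not lose data and resumes where it left off after a restart. The trade-off is higher latency: it cannot follow a file in real time.

Channel

memory

Sink

HDFS

Notes

A file must not be opened or edited once it has been copied into the spool directory
Changes inside subdirectories are not monitored
The monitored directory is scanned for file changes every 500 ms
spooldir suits syncing newly arrived files; it does not suit files that are still being appended to, such as live logs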
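
The first and last notes suggest a simple hand-off pattern: finish writing the file somewhere else, then move it into the spool directory in one step. Below is a minimal sketch of that idea. The helper name `spool_put` is hypothetical (it is not part of Flume), and throwaway temporary directories stand in for the real spool directory. It also assumes the source skips `.tmp` files, as the ignorePattern in the configuration later in this article does.

```shell
#!/bin/sh
# spool_put SRC SPOOL_DIR: hand a finished file to a spooldir source.
# Hypothetical helper for illustration, not part of Flume.
spool_put() {
    src=$1
    spool_dir=$2
    name=$(basename "$src")
    # Copy under a .tmp name first (assumed to be skipped by ignorePattern)...
    cp "$src" "$spool_dir/$name.tmp" || return 1
    # ...then rename it; a rename inside one directory is a single step,
    # so the source never sees a half-written file under the final name.
    mv "$spool_dir/$name.tmp" "$spool_dir/$name"
}

# Demo with throwaway directories instead of the real spool directory.
work_dir=$(mktemp -d)
spool_dir=$(mktemp -d)
printf 'hello flume\n' > "$work_dir/1.txt"
spool_put "$work_dir/1.txt" "$spool_dir"
ls "$spool_dir"
```

In real use the second argument would be the spoolDir configured on the source.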

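Production Environment

In production, the basic logic is:

One copy of the collected data goes in real time to a message queue such as Kafka, or to a real-time processing system (such as Spark or Flink)
A second copy is written reliably to HDFS for later offline analysis or long-term storage

This "dual write" mechanism gives a combined real-time + offline architecture that serves both needs:

Real-time processing: second-level alerting and monitoring systems
Offline analysis: daily reports, monthly reports, the data warehouse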

         [Log sources: Nginx, apps, etc.]
                     ↓
             Flume Source (taildir / exec / spooldir)
                     ↓
               Flume Channel (Memory + File)
                     ↓
          ┌────────────┬────────────┐
          ↓                         ↓
  Flume Sink1 (Kafka)       Flume Sink2 (HDFS)

Sample Configuration

Here is a demo:

# Components of this agent (required, otherwise nothing below is loaded)
agent.sources = source1
agent.channels = memoryChannel fileChannel
agent.sinks = kafkaSink hdfsSink

# Source
agent.sources.source1.type = TAILDIR
agent.sources.source1.filegroups = f1
agent.sources.source1.filegroups.f1 = /data/logs/nginx/access.log

# Channels (one memory channel, one file channel)
agent.channels.memoryChannel.type = memory
agent.channels.memoryChannel.capacity = 10000

agent.channels.fileChannel.type = file
agent.channels.fileChannel.checkpointDir = /var/log/flume/checkpoint
agent.channels.fileChannel.dataDirs = /var/log/flume/data

# Sink1: Kafka (property names for Flume 1.7+; older releases used topic and brokerList)
agent.sinks.kafkaSink.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.kafkaSink.kafka.topic = nginx-logs
agent.sinks.kafkaSink.kafka.bootstrap.servers = localhost:9092

# Sink2: HDFS
agent.sinks.hdfsSink.type = hdfs
agent.sinks.hdfsSink.hdfs.path = hdfs://namenode:8020/logs/nginx/%Y-%m-%d/
agent.sinks.hdfsSink.hdfs.fileType = DataStream
agent.sinks.hdfsSink.hdfs.writeFormat = Text
# The time escapes in hdfs.path need a timestamp; use the agent's local time
agent.sinks.hdfsSink.hdfs.useLocalTimeStamp = true

# Channel bindings
agent.sources.source1.channels = memoryChannel fileChannel
agent.sinks.kafkaSink.channel = memoryChannel
agent.sinks.hdfsSink.channel = fileChannel
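
A Flume agent only uses the channels listed in `<agent>.channels`, so a source or sink bound to an unlisted channel is an easy mistake to make in a multi-channel config like this one. The sketch below is a rough lint for that one mistake. `check_channels` is a hypothetical helper (not a Flume tool), and it only understands plain `key = value` lines.

```shell
#!/bin/sh
# check_channels CONF AGENT
# Hypothetical helper (not shipped with Flume): reports channels that a
# source or sink binds to but that are missing from "<agent>.channels".
check_channels() {
    awk -v agent="$2" '
        /^[ \t]*#/ || !/=/ { next }          # skip comments and non-properties
        {
            eq  = index($0, "=")
            key = substr($0, 1, eq - 1)
            val = substr($0, eq + 1)
            gsub(/^[ \t]+|[ \t]+$/, "", key)
            gsub(/^[ \t]+|[ \t]+$/, "", val)
            n = split(key, part, ".")
            if (part[1] != agent) next
            if (n == 2 && part[2] == "channels") {
                m = split(val, names, /[ \t]+/)
                for (i = 1; i <= m; i++) declared[names[i]] = 1
            } else if (n == 4 && (part[2] == "sources" || part[2] == "sinks") &&
                       (part[4] == "channels" || part[4] == "channel")) {
                m = split(val, names, /[ \t]+/)
                for (i = 1; i <= m; i++) used[names[i]] = key
            }
        }
        END {
            for (c in used)
                if (!(c in declared)) {
                    print "undeclared channel: " c " (used by " used[c] ")"
                    bad = 1
                }
            exit bad + 0
        }
    ' "$1"
}

# Demo: fileChannel is bound but never declared.
conf=$(mktemp)
cat > "$conf" <<'EOF'
agent.sources = source1
agent.channels = memoryChannel
agent.sinks = kafkaSink hdfsSink
agent.sources.source1.channels = memoryChannel fileChannel
agent.sinks.kafkaSink.channel = memoryChannel
agent.sinks.hdfsSink.channel = fileChannel
EOF
check_channels "$conf" agent || echo "fix the channel list before starting the agent"
```

The function exits non-zero when it finds a problem, so it can gate a start script.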

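Points to note:

Multi-channel binding: a Source can be bound to several Channels, while each Sink is bound to one Channel of its own
Data consistency: Flume itself gives no strong consistency guarantee, but the file channel is more reliable than the memory channel
Fault tolerance: the HDFS Sink creates directories automatically and rolls files; when a write fails, the transaction is rolled back and the events stay in the channel to be retried
Performance tuning: batchSize and rollInterval on the sink, together with the channel's capacity and transactionCapacity, can be tuned for throughput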

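Recommendations

Always land important data in HDFS as well, so that a Kafka failure does not cause data loss
Use a memory channel for the Kafka branch; it is faster
For the HDFS branch, rollInterval=60 and batchSize=1000 are a reasonable starting point for high throughput

Partition the output path by day or hour, for example:

hdfs.path = /logs/nginx/%Y/%m/%d/%H/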
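
The %Y, %m, %d and %H escapes carry the same meaning as the strftime fields of the same name, so `date` can preview which directory an event written right now would land in. This is only a sketch: `preview_path` is a hypothetical helper, and Flume itself resolves the escapes from the event's timestamp header (or from local time when hdfs.useLocalTimeStamp is true), not from this command.

```shell
#!/bin/sh
# preview_path PATTERN: expand the time escapes of an hdfs.path value
# using the current local time. Hypothetical helper for illustration.
preview_path() {
    date +"$1"
}

path_now=$(preview_path '/logs/nginx/%Y/%m/%d/%H/')
echo "$path_now"
```

One directory per hour keeps each directory small and makes it easy to reprocess a single hour.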

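Common problems include:

Data reaches HDFS late: batchSize is too large or rollInterval is too long. Lower batchSize or shorten rollInterval
Data is lost on the Kafka branch: the memory channel loses its buffered events when the agent crashes. Switch to a file channel, or enable Kafka's ACK mechanism
HDFS writes fail: usually permissions, network, or file system faults. Turn on Flume debug logging and check the path and permissions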

Configuration File

cd /opt/wzk/flume_test
vim flume-spooldir-hdfs.conf

Write the following into it:

# Name the components on this agent
a3.sources = r3
a3.channels = c3
a3.sinks = k3
# Describe/configure the source
a3.sources.r3.type = spooldir
# Change this to your own directory
a3.sources.r3.spoolDir = /opt/wzk/upload
a3.sources.r3.fileSuffix = .COMPLETED
a3.sources.r3.fileHeader = true

# Skip files ending in .tmp; they are not uploaded
a3.sources.r3.ignorePattern = ([^ ]*\.tmp)
# Use a channel which buffers events in memory
a3.channels.c3.type = memory
a3.channels.c3.capacity = 10000
a3.channels.c3.transactionCapacity = 500
# Describe the sink
a3.sinks.k3.type = hdfs
# Change this to your own NameNode address
a3.sinks.k3.hdfs.path = hdfs://h121.wzk.icu:9000/flume/upload/%Y%m%d/%H%M

# Prefix for uploaded files
a3.sinks.k3.hdfs.filePrefix = upload-
# Use the local timestamp for the time escapes in hdfs.path
a3.sinks.k3.hdfs.useLocalTimeStamp = true
# Flush to HDFS once 500 events have accumulated
a3.sinks.k3.hdfs.batchSize = 500
# File type
a3.sinks.k3.hdfs.fileType = DataStream
# Roll the file every 60 seconds
a3.sinks.k3.hdfs.rollInterval = 60
# Roll the file at about 128 MB
a3.sinks.k3.hdfs.rollSize = 134217700
# Do not roll based on event count
a3.sinks.k3.hdfs.rollCount = 0
# Minimum block replicas
a3.sinks.k3.hdfs.minBlockReplicas = 1

# Bind the source and sink to the channel
a3.sources.r3.channels = c3
a3.sinks.k3.channel = c3
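
With these settings a file is rolled when it is 60 seconds old or when it reaches rollSize, whichever comes first; rollCount = 0 turns off rolling by event count. The rollSize value is not exactly 128 MiB. The arithmetic below shows the gap, presumably left so that a rolled file stays within one HDFS block; that reading assumes the common 128 MiB dfs.blocksize, which this article does not state.

```shell
#!/bin/sh
# Compare the configured rollSize with an exact 128 MiB.
block_bytes=$((128 * 1024 * 1024))
roll_size=134217700   # a3.sinks.k3.hdfs.rollSize from the config above

echo "128 MiB  = $block_bytes bytes"
echo "rollSize = $roll_size bytes"
echo "headroom = $((block_bytes - roll_size)) bytes"
```

The headroom works out to 28 bytes.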

Starting the Agent

$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf $FLUME_HOME/conf \
--conf-file flume-spooldir-hdfs.conf \
-Dflume.root.logger=INFO,console

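Testing

Flume

As noted earlier, a file must not be edited inside the spool directory, so create it elsewhere and move it in:

echo "hello flume" > /tmp/1.txt
mv /tmp/1.txt /opt/wzk/upload/

Once the file lands, the Flume console shows that it has been picked up.

HDFS

Checking HDFS shows that the content has arrived as well.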

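Dual-Write Collection

The business requirement here:

Flume writes the data to the local file system
Flume writes the data to HDFS

Implementation Analysis

Several Agents are chained together
Source: taildir
Channel: memory
Final Sinks: HDFS and file_roll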

Configuration File 1

This configuration contains:

1 taildir source
2 memory channels
2 avro sinks

Create the file:

vim flume-taildir-avro.conf

Write the following into it:

# Name the components on this agent
a1.sources = r1
a1.sinks = k1 k2
a1.channels = c1 c2
# Replicate the event stream to every channel
a1.sources.r1.selector.type = replicating
# source
a1.sources.r1.type = taildir
# Records the latest read position of each file
a1.sources.r1.positionFile = /root/flume/taildir_position.json
a1.sources.r1.filegroups = f1
# Note: .*log is a regular expression; writing *.log here would be wrong
a1.sources.r1.filegroups.f1 = /tmp/root/.*log
# sink (linux123 is the host running the downstream agents; use your own hostname)
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = linux123
a1.sinks.k1.port = 9091
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = linux123
a1.sinks.k2.port = 9092
# channel
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 500
a1.channels.c2.type = memory
a1.channels.c2.capacity = 10000
a1.channels.c2.transactionCapacity = 500
# Bind the source and sink to the channel
a1.sources.r1.channels = c1 c2
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
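
The filegroup pattern is worth trying out before starting the agent. The sketch below uses `grep -E` as a stand-in for Flume's Java regex matching, which is close enough for a pattern this simple, and it assumes the pattern has to match the entire file name. `matches_filegroup` and the sample file names are made up for illustration.

```shell
#!/bin/sh
# matches_filegroup NAME: test a file name against the pattern .*log,
# anchored at both ends on the assumption that the whole name must match.
matches_filegroup() {
    printf '%s\n' "$1" | grep -Eq '^.*log$'
}

for name in hive.log hive.log.2025-06-16 syslog app.txt; do
    if matches_filegroup "$name"; then
        echo "$name: tailed"
    else
        echo "$name: ignored"
    fi
done
```

Under that assumption hive.log and syslog are tailed, while hive.log.2025-06-16 and app.txt are ignored. Because `.*log` accepts any name ending in "log", a stricter pattern such as `.*\.log` may be preferable when the directory holds unrelated files.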

Configuration File 2

This configuration contains:

1 avro source
1 memory channel
1 hdfs sink

Create the configuration file:

vim flume-avro-hdfs.conf

Write the following into it:

# Name the components on this agent
a2.sources = r1
a2.sinks = k1
a2.channels = c1
# Describe/configure the source
a2.sources.r1.type = avro
a2.sources.r1.bind = linux123
a2.sources.r1.port = 9091
# Describe the channel
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 500
# Describe the sink (change the address to your own NameNode)
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://linux121:8020/flume2/%Y%m%d/%H
# Prefix for uploaded files
a2.sinks.k1.hdfs.filePrefix = flume2-
# Use the local timestamp for the time escapes in hdfs.path
a2.sinks.k1.hdfs.useLocalTimeStamp = true
# Flush to HDFS only once 500 events have accumulated
a2.sinks.k1.hdfs.batchSize = 500
# File type; compressed types are also supported
a2.sinks.k1.hdfs.fileType = DataStream
# Start a new file every 60 seconds
a2.sinks.k1.hdfs.rollInterval = 60
# Do not roll based on file size or event count
a2.sinks.k1.hdfs.rollSize = 0
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.minBlockReplicas = 1
# Bind the source and sink to the channel
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
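
The three size settings in this file depend on each other: the sink takes up to batchSize events in one transaction, so batchSize must not exceed the channel's transactionCapacity, which in turn must not exceed capacity. A minimal sketch of that check, with the values copied from the a2 agent above (`sizes_ok` is a hypothetical helper):

```shell
#!/bin/sh
# Values copied from the a2 agent above.
batch_size=500             # a2.sinks.k1.hdfs.batchSize
transaction_capacity=500   # a2.channels.c1.transactionCapacity
capacity=10000             # a2.channels.c1.capacity

# sizes_ok BATCH_SIZE TRANSACTION_CAPACITY CAPACITY
sizes_ok() {
    [ "$1" -le "$2" ] && [ "$2" -le "$3" ]
}

if sizes_ok "$batch_size" "$transaction_capacity" "$capacity"; then
    echo "sizes are consistent"
else
    echo "batchSize must be <= transactionCapacity <= capacity"
fi
```

Raising batchSize to 1000 for more throughput would therefore also require raising transactionCapacity to at least 1000.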

Configuration File 3

This configuration contains:

1 avro source
1 memory channel
1 file_roll sink

Create the configuration file:

vim flume-avro-file.conf

Write the following into it:

# Name the components on this agent
a3.sources = r1
a3.sinks = k1
a3.channels = c2
# Describe/configure the source
a3.sources.r1.type = avro
a3.sources.r1.bind = linux123
a3.sources.r1.port = 9092
# Describe the sink
a3.sinks.k1.type = file_roll
# The directory must be created in advance
a3.sinks.k1.sink.directory = /root/flume/output
# Describe the channel
a3.channels.c2.type = memory
a3.channels.c2.capacity = 10000
a3.channels.c2.transactionCapacity = 500
# Bind the source and sink to the channel
a3.sources.r1.channels = c2
a3.sinks.k1.channel = c2

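Start Agent 1 (a3: avro source to file_roll sink)

The two downstream agents, a3 and a2, are started first so that their avro sources are already listening when a1 connects. The commands assume the three conf files were saved under ~/conf.

$FLUME_HOME/bin/flume-ng agent --name a3 \
--conf $FLUME_HOME/conf \
--conf-file ~/conf/flume-avro-file.conf \
-Dflume.root.logger=INFO,console &

Start Agent 2 (a2: avro source to HDFS sink)

$FLUME_HOME/bin/flume-ng agent --name a2 \
--conf $FLUME_HOME/conf \
--conf-file ~/conf/flume-avro-hdfs.conf \
-Dflume.root.logger=INFO,console &

Start Agent 3 (a1: taildir source to avro sinks)

$FLUME_HOME/bin/flume-ng agent --name a1 \
--conf $FLUME_HOME/conf \
--conf-file ~/conf/flume-taildir-avro.conf \
-Dflume.root.logger=INFO,console &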

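Hive Test

Run any Hive command to generate log output. When run as root, Hive writes its log to /tmp/root/hive.log by default, which matches the filegroup tailed by a1:

hive -e "show databases;"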
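
If Hive is not installed on the node, any file whose name matches the filegroup pattern can serve as test input. The sketch below appends a few lines to a log file. `LOG_DIR` is a hypothetical variable introduced here: left unset, the script writes to a throwaway directory so it is safe to try; setting LOG_DIR=/tmp/root would feed the taildir source configured above.

```shell
#!/bin/sh
# Append a handful of test events to a file that matches .*log.
# LOG_DIR is a hypothetical override; the default is a throwaway directory.
log_dir=${LOG_DIR:-$(mktemp -d)}
mkdir -p "$log_dir"

i=1
while [ "$i" -le 5 ]; do
    printf 'test event %d\n' "$i" >> "$log_dir/demo.log"
    i=$((i + 1))
done

echo "wrote 5 lines to $log_dir/demo.log"
```

The same lines should then appear both under /root/flume/output and in the HDFS path configured for a2.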