Flume Log Collection System: Complete Reference and Example Configurations
一、Flume概述
1.1 什么是Flume
Apache Flume 是一个分布式、高可靠、高可用的日志采集、聚合和传输系统。它可以从多种数据源收集数据,将数据传输到集中式数据存储(如HDFS、HBase、Kafka等)。
1.2 Flume的核心特性
| 特性 | 说明 |
|---|---|
| 可靠性 | 事务机制保证数据不丢失,支持端到端的可靠传输 |
| 可扩展性 | Agent可以水平扩展,支持多级串联 |
| 可恢复性 | Channel提供缓存能力,故障恢复后可继续传输 |
| 实时性 | 近实时的数据采集,延迟在秒级 |
| 灵活性 | 支持多种Source、Channel、Sink的自由组合 |
1.3 Flume的核心组件
数据流模型:
Source --> Channel --> Sink
(采集) (缓存) (输出)
- Event:Flume传输数据的基本单元,由Header(头信息)和Body(字节数组)组成
- Source:数据源,负责接收或拉取数据并封装为Event
- Channel:管道/缓冲区,连接Source和Sink,提供事务和缓存
- Sink:数据目的地,从Channel取出Event并发送到目标存储
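The component roles above can be pictured with a small model. This is a minimal sketch, not Flume's API: the `Event` class and the `source` and `run_flow` functions are hypothetical names chosen for illustration, and a plain in-memory queue stands in for the Channel.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical stand-in for a Flume Event: string headers plus a byte-array body."""
    body: bytes
    headers: dict = field(default_factory=dict)


def source(lines):
    """Source role: wrap each raw line in an Event."""
    return [Event(body=line.encode("utf-8")) for line in lines]


def run_flow(lines):
    """Move lines through Source -> Channel -> Sink and return what the Sink delivered."""
    channel = deque()                 # Channel role: buffer between Source and Sink
    channel.extend(source(lines))
    delivered = []
    while channel:                    # Sink role: drain the Channel in arrival order
        delivered.append(channel.popleft().body.decode("utf-8"))
    return delivered
```

The body is kept as bytes on purpose: Flume does not interpret the payload, and only the Sink (or a serializer) decides how to render it.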
二、Flume日志采集系统结构
2.1 Agent结构
一个Flume Agent是一个独立的JVM进程,包含三个核心组件:
┌─────────────────────────────────────┐
│ Flume Agent │
│ │
│ ┌────────┐ ┌─────────┐ ┌──────┐ │
│ │ Source │→│ Channel │→│ Sink │ │
│ └────────┘ └─────────┘ └──────┘ │
│ │
└─────────────────────────────────────┘
2.2 多Agent级联结构
Agent1 (Source→Channel→Sink)
↓
Agent2 (Source→Channel→Sink)
↓
Agent3 (Source→Channel→Sink)
↓
HDFS / HBase / Kafka
2.3 多路复用结构(1个Source → 多个Channel)
┌→ Channel1 → Sink1 → HDFS
Source → 通道选择器 ┼→ Channel2 → Sink2 → HBase
└→ Channel3 → Sink3 → Kafka
三、Flume的部署
3.1 前置条件
bash
# 1. 确保已安装JDK 1.8或以上版本
java -version
# 2. 确保Hadoop集群已启动(如果需要写入HDFS)
start-dfs.sh
start-yarn.sh
3.2 下载与解压
bash
# 下载Flume安装包(以1.9.0版本为例)
wget https://archive.apache.org/dist/flume/1.9.0/apache-flume-1.9.0-bin.tar.gz
# 解压到指定目录
tar -zxvf apache-flume-1.9.0-bin.tar.gz -C /opt/module/
# 重命名目录(便于管理)
mv /opt/module/apache-flume-1.9.0-bin /opt/module/flume-1.9.0
3.3 配置环境变量
bash
# 编辑系统环境变量文件
sudo vim /etc/profile.d/flume.sh
# 添加以下内容:
export FLUME_HOME=/opt/module/flume-1.9.0 # Flume安装根目录
export PATH=$PATH:$FLUME_HOME/bin # 将Flume的bin目录加入系统PATH
# 使环境变量立即生效
source /etc/profile.d/flume.sh
3.4 修改Flume配置文件
bash
# 进入Flume的conf目录
cd /opt/module/flume-1.9.0/conf/
# 复制模板配置文件
cp flume-env.sh.template flume-env.sh
# 编辑flume-env.sh,指定JDK路径
vim flume-env.sh
在 flume-env.sh 中修改:
bash
# 指定Java安装路径
export JAVA_HOME=/opt/module/jdk1.8.0_212
# 可选:设置JVM内存参数
export JAVA_OPTS="-Xms100m -Xmx200m"
3.5 验证安装
bash
# 查看Flume版本信息
flume-ng version
# 应输出类似:Flume 1.9.0
四、Flume的基本使用
4.1 Flume Agent启动命令格式
bash
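# --name      : Agent name; must match the agent name used in the configuration file
# --conf      : path to Flume's conf directory
# --conf-file : path to your own agent configuration file
# -Dflume.root.logger=INFO,console : log level and output target
# Note: nothing may follow a trailing backslash, so the comments are kept on their own lines
flume-ng agent \
  --name <agent-name> \
  --conf <conf-dir> \
  --conf-file <config-file> \
  -Dflume.root.logger=INFO,console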
4.2 First Flume Example: Netcat Source + Memory Channel + Logger Sink
需求:监听本机44444端口,将接收到的数据打印到控制台日志中。
配置文件 netcat-logger.conf:
properties
# ============================================================
# netcat-logger.conf
# 功能:监听44444端口,将数据打印到控制台日志
# ============================================================
# ---------- 1. 定义组件名称 ----------
# 给这个Agent起名为 a1
# 定义Source组件的名称为 r1
a1.sources = r1
# 定义Channel组件的名称为 c1
a1.channels = c1
# 定义Sink组件的名称为 k1
a1.sinks = k1
# ---------- 2. 配置Source ----------
# 使用 netcat 类型的Source(监听指定端口的TCP数据)
a1.sources.r1.type = netcat
# 绑定监听的主机地址,0.0.0.0表示监听所有网络接口
a1.sources.r1.bind = 0.0.0.0
# 监听的端口号
a1.sources.r1.port = 44444
# ---------- 3. 配置Channel ----------
# 使用 memory 类型的Channel(数据缓存在内存中,速度快但不持久)
a1.channels.c1.type = memory
# Channel中最多缓存的Event数量
a1.channels.c1.capacity = 1000
# Channel每次事务中最多传输的Event数量
a1.channels.c1.transactionCapacity = 100
# ---------- 4. 配置Sink ----------
# 使用 logger 类型的Sink(将数据以日志形式输出到控制台)
a1.sinks.k1.type = logger
# ---------- 5. 绑定组件(将Source、Channel、Sink组装起来) ----------
# 将Source r1连接到Channel c1(一个Source可以连接多个Channel,用空格分隔)
a1.sources.r1.channels = c1
# 将Sink k1连接到Channel c1(一个Sink只能连接一个Channel)
a1.sinks.k1.channel = c1
启动Agent:
bash
# 启动Flume Agent
# --name a1 : Agent名称与配置文件中的a1对应
# --conf $FLUME_HOME/conf : 指定Flume的配置目录
# --conf-file : 指定自定义配置文件的绝对路径
# -Dflume.root.logger=INFO,console : 日志输出到控制台
flume-ng agent \
--name a1 \
--conf $FLUME_HOME/conf \
--conf-file /opt/module/flume-1.9.0/conf/netcat-logger.conf \
-Dflume.root.logger=INFO,console
测试:
bash
# 通过telnet连接到44444端口发送数据
# 需要先安装telnet:yum install -y telnet
telnet localhost 44444
# 连接后输入任意内容并回车,如:
hello flume
# 在Flume控制台日志中可以看到接收到的数据
五、Flume的采集方案
5.1 常见采集方案组合
| 方案 | Source | Channel | Sink | 用途 |
|---|---|---|---|---|
| 方案1 | Exec | Memory | HDFS | 监控单个日志文件写入HDFS |
| 方案2 | Spooling Directory | Memory/File | HDFS | 监控目录中新文件写入HDFS |
| 方案3 | Taildir | Memory/File | HDFS/Kafka | 监控多个文件的追加写入 |
| 方案4 | Netcat | Memory | Logger | 端口测试/调试 |
| 方案5 | Avro | Memory | Avro | Agent级联传输 |
| 方案6 | Kafka | Memory | HDFS | 消费Kafka数据写入HDFS |
六、Flume Sources(详细知识点)
6.1 Avro Source
功能:监听Avro协议的RPC请求,常用于Agent之间的级联传输。
properties
# ============================================================
# avro-source.conf
# 功能:Avro Source接收来自其他Agent的Avro数据
# ============================================================
# 定义Agent组件
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ---------- Avro Source配置 ----------
# Source类型为avro
a1.sources.r1.type = avro
# 绑定监听地址
a1.sources.r1.bind = 0.0.0.0
# 监听端口号
a1.sources.r1.port = 4141
# 可选:设置线程数(处理请求的最大线程数)
a1.sources.r1.threads = 5
# Optional: compression of the incoming stream, "none" or "deflate"; it must match the compression-type of the sending Avro Sink
a1.sources.r1.compression-type = none
# ---------- Channel配置 ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# ---------- Sink配置 ----------
a1.sinks.k1.type = logger
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
发送Avro数据测试:
bash
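# Send a file to the Avro Source with Flume's built-in avro-client tool
# --host: target host, --port: target port, --filename: file to send
# (comments are kept off the continued lines, since text after a trailing backslash breaks the command)
flume-ng avro-client \
  --host localhost \
  --port 4141 \
  --filename /opt/test.txt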
6.2 Exec Source
功能 :执行一条Linux命令,将命令的标准输出作为数据源。常用于tail -f监控日志文件。
properties
# ============================================================
# exec-hdfs.conf
# 功能:使用Exec Source监控日志文件,写入HDFS
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ---------- Exec Source配置 ----------
# Source类型为exec,执行Linux命令
a1.sources.r1.type = exec
# 要执行的命令:-F表示当文件被截断或轮转时重新打开文件
# /opt/module/data/app.log 是要监控的日志文件路径
a1.sources.r1.command = tail -F /opt/module/data/app.log
# 设置Shell类型
a1.sources.r1.shell = /bin/bash -c
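# Optional: restart the command if it exits (restart defaults to false, in which case restartThrottle has no effect)
a1.sources.r1.restart = true
# Optional: time to wait, in milliseconds, before the restart is attempted
a1.sources.r1.restartThrottle = 10000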
# 可选:是否记录命令的stderr输出
a1.sources.r1.logStdErr = true
# ---------- File Channel配置 ----------
# 使用文件类型的Channel,数据持久化到磁盘,可靠性更高
a1.channels.c1.type = file
# 检查点文件存放目录
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/checkpoint
# 数据文件存放目录
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/data
# Channel最大容量
a1.channels.c1.capacity = 1000000
# 每次事务最大传输量
a1.channels.c1.transactionCapacity = 1000
# ---------- HDFS Sink配置 ----------
a1.sinks.k1.type = hdfs
# HDFS上的目标路径,%Y%m%d/%H 是时间占位符
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/logs/%Y%m%d/%H
# 文件前缀名
a1.sinks.k1.hdfs.filePrefix = app-log-
# 文件后缀名
a1.sinks.k1.hdfs.fileSuffix = .log
# 使用本地时间戳(而非Event中的时间戳)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# 文件滚动方式:按时间滚动,每30秒生成一个新文件
a1.sinks.k1.hdfs.rollInterval = 30
# 文件滚动方式:按大小滚动,每128MB生成一个新文件
a1.sinks.k1.hdfs.rollSize = 134217728
# 文件滚动方式:按Event数量滚动,每1000个Event生成一个新文件
a1.sinks.k1.hdfs.rollCount = 1000
# 文件格式:DataStream表示不压缩的普通文本
a1.sinks.k1.hdfs.fileType = DataStream
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
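The three roll settings in this configuration work independently: a file is closed as soon as any enabled threshold is reached. The sketch below models that rule under the assumption that a value of 0 disables a threshold; `should_roll` is a hypothetical helper, not part of Flume.

```python
def should_roll(elapsed_s, bytes_written, events_written,
                roll_interval=30, roll_size=134217728, roll_count=1000):
    """Return the names of the thresholds that have been reached; 0 disables a threshold."""
    reasons = []
    if roll_interval > 0 and elapsed_s >= roll_interval:
        reasons.append("rollInterval")
    if roll_size > 0 and bytes_written >= roll_size:
        reasons.append("rollSize")
    if roll_count > 0 and events_written >= roll_count:
        reasons.append("rollCount")
    return reasons
```

With the values used above, a busy log reaches 1000 Events long before 30 seconds or 128 MB, so rollCount is the setting most likely to produce many small HDFS files. Setting it to 0 leaves only time and size in control.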
6.3 Spooling Directory Source
功能 :监控指定目录下的新文件 ,将文件内容逐行读取为Event。文件读取完成后会被标记为 .COMPLETED。
properties
# ============================================================
# spool-hdfs.conf
# 功能:监控目录中新增文件,写入HDFS
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ---------- Spooling Directory Source配置 ----------
# Source类型为spooldir
a1.sources.r1.type = spooldir
# 要监控的目录路径(注意:不能监控嵌套的子目录)
a1.sources.r1.spoolDir = /opt/module/data/spooldir
# 文件后缀:读取完成后的文件会被加上此后缀
a1.sources.r1.fileSuffix = .COMPLETED
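# Regex of file names to skip: empty names, already-completed files, and .tmp files still being written
a1.sources.r1.ignorePattern = (^$|.*\\.COMPLETED$|.*\\.tmp$)
# Deserializer that turns file content into Events; LINE emits one Event per line
a1.sources.r1.deserializer = LINE
# Maximum number of characters per Event; a longer line is truncated and the remainder continues in the next Event
a1.sources.r1.deserializer.maxLineLength = 5120
# Add the absolute path of the source file to the Event Header under the key "file"
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file
# Optional: add the base name of the source file to the Event Header under the key "basename"
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = basename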
# ---------- Memory Channel配置 ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
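# Maximum total bytes of Event data allowed in the Channel
a1.channels.c1.byteCapacity = 800000
# Percentage of byteCapacity held back as a buffer to account for Header data
a1.channels.c1.byteCapacityBufferPercentage = 20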
# ---------- HDFS Sink配置 ----------
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/spooldir/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = events-
a1.sinks.k1.hdfs.fileSuffix = .txt
a1.sinks.k1.hdfs.useLocalTimeStamp = true
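# Round the Event timestamp down before the time escapes in hdfs.path are expanded
a1.sinks.k1.hdfs.round = true
# Rounding granularity: 5 minutes. This only changes escapes finer than the unit (such as %H%M); the %Y%m%d path above is unaffected
a1.sinks.k1.hdfs.roundValue = 5
a1.sinks.k1.hdfs.roundUnit = minute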
# 滚动设置
a1.sinks.k1.hdfs.rollInterval = 60
a1.sinks.k1.hdfs.rollSize = 0
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
# 写入格式为Text
a1.sinks.k1.hdfs.writeFormat = Text
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
测试命令:
bash
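# Create the monitored directory first
mkdir -p /opt/module/data/spooldir
# After the Agent starts, copy the file in under a .tmp name (skipped by ignorePattern),
# then rename it; a file must never change once the Source can see it
cp /opt/module/data/access.log /opt/module/data/spooldir/access.log.tmp
mv /opt/module/data/spooldir/access.log.tmp /opt/module/data/spooldir/access.log
# Note: file names in the directory must be unique and must not be reused later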
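The rule that a file must be complete before the Source sees it can be enforced by the producer. A minimal sketch, assuming the ignorePattern shown earlier (which skips `*.tmp`) and the `.COMPLETED` suffix; `deliver` is a hypothetical helper name.

```python
import os
import shutil


def deliver(src_path, spool_dir):
    """Copy src_path into spool_dir under a .tmp name, then rename it into place."""
    final_path = os.path.join(spool_dir, os.path.basename(src_path))
    # Refuse names that were already used, including files Flume has finished with.
    if os.path.exists(final_path) or os.path.exists(final_path + ".COMPLETED"):
        raise FileExistsError(f"name already used in spool directory: {final_path}")
    tmp_path = final_path + ".tmp"
    shutil.copyfile(src_path, tmp_path)   # still skipped: *.tmp matches ignorePattern
    os.rename(tmp_path, final_path)       # same directory, so the rename is atomic on POSIX
    return final_path
```

Copying straight to the final name would let the Source start reading a half-written file, which the Spooling Directory Source treats as an error.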
6.4 Taildir Source(推荐使用)
功能 :监控多个目录中的多个文件,支持断点续传(通过JSON文件记录读取位置)。当文件有新内容追加时自动采集。
properties
# ============================================================
# taildir-hdfs.conf
# 功能:Taildir Source监控多个文件,断点续传写入HDFS
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ---------- Taildir Source配置 ----------
# Source类型为taildir
a1.sources.r1.type = taildir
# 定义文件组(可以定义多个组,每组监控不同的目录/文件)
a1.sources.r1.filegroups = f1 f2
# 文件组f1:监控 /opt/module/data/logs/app1/ 目录下以.log结尾的文件
a1.sources.r1.filegroups.f1 = /opt/module/data/logs/app1/.*\\.log
# 文件组f2:监控 /opt/module/data/logs/app2/ 目录下以.txt结尾的文件
a1.sources.r1.filegroups.f2 = /opt/module/data/logs/app2/.*\\.txt
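# Custom Header per file group: Events from f1 carry type=app1, Events from f2 carry type=app2
a1.sources.r1.headers.f1.type = app1
a1.sources.r1.headers.f2.type = app2
# Position file (JSON) recording the inode, offset, and path of each tailed file, used to resume after a restart
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/taildir_position.json
# For files not yet listed in the position file: true starts at the end of the file, false (the default) reads from the beginning
a1.sources.r1.skipToLast = true
# Maximum number of lines read and sent to the Channel per batch
a1.sources.r1.batchSize = 100
# Close a file after it has received no new lines for this many milliseconds; it is reopened when new lines appear
a1.sources.r1.idleTimeout = 3000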
# ---------- Memory Channel ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# ---------- HDFS Sink ----------
a1.sinks.k1.type = hdfs
# 使用Header中的type字段作为HDFS路径中的子目录
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/taildir/%{type}/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.useLocalTimeStamp = true
a1.sinks.k1.hdfs.rollInterval = 30
a1.sinks.k1.hdfs.rollSize = 134217728
a1.sinks.k1.hdfs.rollCount = 0
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.writeFormat = Text
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
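The position file is what makes resuming possible, and it is also handy for monitoring. The sketch below assumes the file is a JSON array of objects with `inode`, `pos`, and `file` keys; `tail_lag` is a hypothetical helper that reports how many bytes of each file have not been read yet.

```python
import json
import os


def tail_lag(position_file):
    """Map each tracked file to its unread byte count (None if the file no longer exists)."""
    with open(position_file, encoding="utf-8") as fh:
        entries = json.load(fh)   # assumed shape: [{"inode": ..., "pos": ..., "file": ...}, ...]
    lag = {}
    for entry in entries:
        path = entry["file"]
        if os.path.exists(path):
            lag[path] = os.path.getsize(path) - entry["pos"]
        else:
            lag[path] = None
    return lag
```

A lag that keeps growing suggests the Channel or Sink is not keeping up with the rate at which the log is written.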
6.5 Kafka Source
功能:作为Kafka的消费者,从Kafka Topic中拉取消息。
properties
# ============================================================
# kafka-source.conf
# 功能:从Kafka消费数据,输出到Logger
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ---------- Kafka Source配置 ----------
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource
# Kafka Broker列表
a1.sources.r1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
# 消费的Topic名称
a1.sources.r1.kafka.topics = flume-topic
# 消费者组ID
a1.sources.r1.kafka.consumer.group.id = flume-consumer-group
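# The Source already reads keys as String and values as byte arrays, so these consumer settings are normally left unset.
# Overriding value.deserializer with StringDeserializer may break Body handling, which expects byte arrays.
# a1.sources.r1.kafka.consumer.key.deserializer = org.apache.kafka.common.serialization.StringDeserializer
# a1.sources.r1.kafka.consumer.value.deserializer = org.apache.kafka.common.serialization.ByteArrayDeserializer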
# 每批拉取的最大消息数
a1.sources.r1.batchSize = 100
# 每批最大等待时间(毫秒)
a1.sources.r1.batchDurationMillis = 1000
# ---------- Memory Channel ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# ---------- Logger Sink ----------
a1.sinks.k1.type = logger
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
七、Flume Channels(详细知识点)
7.1 Memory Channel
特点:数据存储在内存中,速度快,但Agent重启后数据会丢失。
properties
# ---------- Memory Channel 完整配置 ----------
a1.channels.c1.type = memory
# Channel中可存储的最大Event数量(容量)
a1.channels.c1.capacity = 10000
# 每次事务中可传输的最大Event数量
a1.channels.c1.transactionCapacity = 1000
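# Maximum total bytes of Event bodies allowed in the Channel (defaults to 80% of the JVM's maximum heap).
# Setting 0 falls back to a hard internal limit of about 200 GB, so in practice only capacity applies
a1.channels.c1.byteCapacity = 0
# Percentage of byteCapacity reserved as a buffer for Header data (usable bytes = byteCapacity * (1 - percentage / 100))
a1.channels.c1.byteCapacityBufferPercentage = 20
# Timeout in seconds for adding or removing an Event when the Channel is full or empty
a1.channels.c1.keep-alive = 3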
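The byte-capacity settings are easier to reason about with numbers. A minimal sketch of the arithmetic, using hypothetical helper names and integer math only:

```python
def default_byte_capacity(max_heap_bytes):
    """Default byteCapacity: 80% of the JVM's maximum heap."""
    return max_heap_bytes * 80 // 100


def usable_body_bytes(byte_capacity, buffer_percentage=20):
    """Bytes left for Event bodies after the Header buffer is set aside."""
    return byte_capacity * (100 - buffer_percentage) // 100
```

For the `-Xmx200m` heap set in flume-env.sh earlier, the default works out to 167772160 bytes, of which 134217728 remain for Event bodies with the 20% buffer. An explicit `byteCapacity = 800000` leaves 640000.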
7.2 File Channel
特点:数据存储在磁盘上,可靠性高,但速度较Memory Channel慢。
properties
# ---------- File Channel 完整配置 ----------
a1.channels.c1.type = file
# 检查点文件目录(记录Channel的状态信息)
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/checkpoint
# 数据文件目录(存储实际的Event数据,可以配置多个目录,用逗号分隔)
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/data1,/opt/module/flume-1.9.0/data2
# Channel最大容量
a1.channels.c1.capacity = 1000000
# 每次事务最大传输Event数量
a1.channels.c1.transactionCapacity = 1000
# 检查点写入间隔(毫秒)
a1.channels.c1.checkpointInterval = 30000
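# Whether to keep a backup copy of the checkpoint; if true, backupCheckpointDir must be set and must differ from checkpointDir
a1.channels.c1.useDualCheckpoints = false
a1.channels.c1.backupCheckpointDir = /opt/module/flume-1.9.0/backup_checkpoint
# Maximum size of a single data file, in bytes
a1.channels.c1.maxFileSize = 2146435071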
7.3 Kafka Channel
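Characteristics: Events are stored in a Kafka topic, so Kafka serves as both the Channel and the message broker. The Source writes to Kafka and the Sink reads from it. The Channel can also be used with only a Source (Events land in Kafka for other consumers) or with only a Sink (Events come from external Kafka producers).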
properties
# ---------- Kafka Channel 完整配置 ----------
a1.channels.c1.type = org.apache.flume.channel.kafka.KafkaChannel
# Kafka Broker列表
a1.channels.c1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092
# Channel对应的Kafka Topic
a1.channels.c1.kafka.topic = flume-channel-topic
# 消费者组
a1.channels.c1.kafka.consumer.group.id = flume-channel-group
# true: Events are stored in Kafka as serialized Flume Events (Headers and Body); false: only the raw Body is stored
a1.channels.c1.parseAsFlumeEvent = true
# No ZooKeeper address is needed in Flume 1.9.0; the Channel connects through kafka.bootstrap.servers
7.4 JDBC Channel
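Characteristics: Events are stored in a database (embedded Derby is the only supported option). Storage is durable, but throughput is generally lower than that of the Memory Channel or File Channel.
properties
# ---------- JDBC Channel ----------
a1.channels.c1.type = jdbc
# Database vendor; DERBY is the only supported value
a1.channels.c1.db.type = DERBY
# JDBC driver class and connection URL
a1.channels.c1.driver.class = org.apache.derby.jdbc.EmbeddedDriver
a1.channels.c1.driver.url = jdbc:derby:flume_jdbc_channel;create=true
# Maximum number of Events in the Channel (0 means unlimited)
a1.channels.c1.maximum.capacity = 10000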
八、Flume Sinks(详细知识点)
8.1 HDFS Sink
功能:将Event写入HDFS,支持多种文件格式和滚动策略。
properties
# ============================================================
# hdfs-sink-complete.conf
# 功能:HDFS Sink的完整配置示例
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/access.log
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# ---------- HDFS Sink 完整配置 ----------
a1.sinks.k1.type = hdfs
# HDFS目标路径(支持时间转义序列:%Y-年、%m-月、%d-日、%H-时、%M-分)
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/data/%Y/%m/%d/%H
# 文件前缀
a1.sinks.k1.hdfs.filePrefix = access-
# 文件后缀
a1.sinks.k1.hdfs.fileSuffix = .log
# 使用本地时间(使用Agent所在机器的时间替换路径中的时间占位符)
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# ============= 文件滚动策略 =============
# 按时间滚动:每隔30秒滚动一次(0表示禁用按时间滚动)
a1.sinks.k1.hdfs.rollInterval = 30
# 按大小滚动:每128MB滚动一次(0表示禁用按大小滚动)
a1.sinks.k1.hdfs.rollSize = 134217728
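# Roll by Event count: 0 disables count-based rolling (for example, 1000 would roll after every 1000 Events)
a1.sinks.k1.hdfs.rollCount = 0
# Timestamp rounding: rollInterval, rollSize, and rollCount decide when a file is closed, whereas round decides which time-based directory an Event is written to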
a1.sinks.k1.hdfs.round = true
a1.sinks.k1.hdfs.roundValue = 1
a1.sinks.k1.hdfs.roundUnit = hour
# ============= 文件格式 =============
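# DataStream: uncompressed output (no compression codec may be set)
# SequenceFile: Hadoop SequenceFile format (the default)
# CompressedStream: compressed output; requires hdfs.codeC
a1.sinks.k1.hdfs.fileType = DataStream
# Record format for SequenceFile output: Text or Writable
a1.sinks.k1.hdfs.writeFormat = Text
# ============= Compression =============
# gzip compression (only valid with fileType = CompressedStream)
# a1.sinks.k1.hdfs.codeC = gzip
# ============= Serializer =============
# TEXT (BodyTextEventSerializer) is the default; note that this property has no "hdfs." prefix
a1.sinks.k1.serializer = TEXT
# Append a newline after each Event
a1.sinks.k1.serializer.appendNewline = true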
# ============= 调优参数 =============
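# Maximum number of simultaneously open files; when exceeded, the oldest file is closed
a1.sinks.k1.hdfs.maxOpenFiles = 5000
# Minimum number of replicas per HDFS block
a1.sinks.k1.hdfs.minBlockReplicas = 1
# Number of Events written before the file is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 100
# Timeout in milliseconds for HDFS operations such as open, write, flush, and close
a1.sinks.k1.hdfs.callTimeout = 10000
# Number of attempts to close (rename) a file once a close has been initiated; 0 means retry until it succeeds
a1.sinks.k1.hdfs.closeTries = 3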
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
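How the time escapes and the round settings combine is easiest to see on a concrete timestamp. The sketch below is a simplified model, not Flume's implementation: `bucket_path` is a hypothetical helper, it works in UTC for reproducibility (Flume uses the local time zone unless hdfs.timeZone is set), and it rounds in whole minutes only.

```python
from datetime import datetime, timezone


def bucket_path(template, timestamp_ms, round_minutes=0):
    """Expand %Y %m %d %H %M in template, optionally rounding the timestamp down first."""
    if round_minutes > 0:
        step_ms = round_minutes * 60 * 1000
        timestamp_ms -= timestamp_ms % step_ms   # round down to the bucket start
    moment = datetime.fromtimestamp(timestamp_ms / 1000, tz=timezone.utc)
    return moment.strftime(template)
```

For an Event stamped 2024-01-15 10:37:42 UTC, the path `/flume/data/%Y/%m/%d/%H` expands to `/flume/data/2024/01/15/10`, and a template ending in `%H%M` with 5-minute rounding yields `1035` rather than `1037`. Rounding therefore only matters when the path contains an escape finer than the rounding unit.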
8.2 Logger Sink
功能 :将Event内容输出到Flume日志中(通常输出到控制台),主要用于测试和调试。
properties
# ---------- Logger Sink configuration ----------
# Events are logged at INFO level, so the logger must be set to INFO or lower to see them
# Add to the start command: -Dflume.root.logger=INFO,console
a1.sinks.k1.type = logger
# Maximum number of Body bytes logged per Event (default 16)
a1.sinks.k1.maxBytesToLog = 1024
8.3 Avro Sink
功能 :将Event通过Avro RPC协议发送到下一跳Agent的Avro Source。常用于Agent级联。
properties
# ============================================================
# avro-sink.conf
# 功能:将数据通过Avro发送到下一跳Agent
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# ---------- Avro Sink配置 ----------
a1.sinks.k1.type = avro
# 下一跳Agent的主机名
a1.sinks.k1.hostname = hadoop102
# 下一跳Agent的Avro Source端口号
a1.sinks.k1.port = 4141
# 批量发送大小
a1.sinks.k1.batch-size = 100
# 连接超时时间(毫秒)
a1.sinks.k1.connect-timeout = 20000
# 请求超时时间(毫秒)
a1.sinks.k1.request-timeout = 20000
# 压缩方式(none或deflate)
a1.sinks.k1.compression-type = none
# SSL加密
a1.sinks.k1.ssl = false
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
8.4 Kafka Sink
功能:将Event作为消息发送到Kafka Topic。
properties
# ============================================================
# kafka-sink.conf
# 功能:将采集到的数据发送到Kafka Topic
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = taildir
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/data/logs/.*\\.log
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/taildir_position.json
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# ---------- Kafka Sink配置 ----------
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
# Kafka Broker列表
a1.sinks.k1.kafka.bootstrap.servers = hadoop101:9092,hadoop102:9092,hadoop103:9092
# 目标Topic名称
a1.sinks.k1.kafka.topic = flume-to-kafka
# Producer acknowledgements: 0 = none, 1 = leader only, -1 = all in-sync replicas
a1.sinks.k1.kafka.producer.acks = 1
# Kafka producer batch size, in bytes
a1.sinks.k1.kafka.producer.batch.size = 16384
# Compression type
a1.sinks.k1.kafka.producer.compression.type = snappy
# The Sink sends the Event Body as the message value; it has no serializer property to configure
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
8.5 File Roll Sink
功能:将Event写入本地文件系统。
properties
# ============================================================
# file-roll.conf
# 功能:将数据保存到本地文件
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = netcat
a1.sources.r1.bind = 0.0.0.0
a1.sources.r1.port = 44444
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# ---------- File Roll Sink ----------
a1.sinks.k1.type = file_roll
# 输出文件的本地目录
a1.sinks.k1.sink.directory = /opt/module/flume-1.9.0/output
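# Roll to a new file every 60 seconds (0 disables rolling)
a1.sinks.k1.sink.rollInterval = 60
# Event serializer; TEXT writes the Event Body
a1.sinks.k1.sink.serializer = TEXT
# Append a newline after each Event
a1.sinks.k1.sink.serializer.appendNewline = true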
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
8.6 HBase Sink
功能:将Event写入HBase表。
properties
# ============================================================
# hbase-sink.conf
# 功能:将数据写入HBase
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
# ---------- HBase Sink ----------
a1.sinks.k1.type = hbase
# HBase表名
a1.sinks.k1.table = flume_hbase_table
# HBase的ZooKeeper地址
a1.sinks.k1.zookeeperQuorum = hadoop101:2181,hadoop102:2181,hadoop103:2181
# 列族名
a1.sinks.k1.columnFamily = cf
# Event serializer
a1.sinks.k1.serializer = org.apache.flume.sink.hbase.SimpleHbaseEventSerializer
# Column qualifier (within columnFamily) that receives the Event Body
a1.sinks.k1.serializer.payloadColumn = data
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
九、Flume拦截器(Interceptors)
9.1 拦截器概述
拦截器是Source端的数据处理链 ,可以在Event被发送到Channel之前对Event进行修改、过滤、路由等操作。多个拦截器可以链式组合。
9.2 Timestamp Interceptor(时间戳拦截器)
功能:在Event的Header中添加时间戳信息(timestamp=当前时间戳)。
properties
# ============================================================
# timestamp-interceptor.conf
# 功能:为Event添加时间戳Header
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
# ---------- Timestamp Interceptor ----------
# 配置拦截器列表(可以有多个,用空格分隔)
a1.sources.r1.interceptors = i1
# 拦截器类型为timestamp
a1.sources.r1.interceptors.i1.type = timestamp
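# If the Event already has a timestamp Header: true keeps the existing value, false overwrites it
a1.sources.r1.interceptors.i1.preserveExisting = true
# Name of the Header that receives the timestamp (milliseconds since the epoch)
a1.sources.r1.interceptors.i1.headerName = timestamp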
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hdfs
# 使用Header中的时间戳来构建HDFS路径
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/timestamp/%Y%m%d/%H
a1.sinks.k1.hdfs.filePrefix = log-
# 设为false表示使用Event Header中的timestamp
a1.sinks.k1.hdfs.useLocalTimeStamp = false
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 60
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
9.3 Host Interceptor(主机拦截器)
功能:在Event的Header中添加主机名或IP地址。
properties
# ---------- Host Interceptor ----------
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = host
# 使用主机名还是IP地址(false=主机名,true=IP地址)
a1.sources.r1.interceptors.i1.useIP = false
# Header的key名称
a1.sources.r1.interceptors.i1.hostHeader = host
9.4 Static Interceptor(静态拦截器)
功能:在Event的Header中添加一个固定的键值对。
properties
# ---------- Static Interceptor ----------
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = static
# 要添加的key
a1.sources.r1.interceptors.i1.key = datacenter
# 要添加的value
a1.sources.r1.interceptors.i1.value = Beijing_Rack01
# If the Header already exists: true keeps the existing value, false overwrites it
a1.sources.r1.interceptors.i1.preserveExisting = true
9.5 Regex Filtering Interceptor(正则过滤拦截器)
功能 :根据正则表达式过滤Event,匹配则保留或排除。
properties
# ============================================================
# regex-filter.conf
# 功能:使用正则表达式过滤Event
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/access.log
# ---------- Regex Filtering Interceptor ----------
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_filter
# 用正则表达式匹配Event Body的内容
# 例如:过滤掉包含 "ERROR" 的日志行
a1.sources.r1.interceptors.i1.regex = ^.*ERROR.*$
# true=匹配的Event被排除,false=匹配的Event被保留
a1.sources.r1.interceptors.i1.excludeEvents = true
# Matching is case-sensitive; for case-insensitive matching, start the regex with the inline flag (?i)
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = logger
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
9.6 Regex Extractor Interceptor(正则提取拦截器)
功能 :使用正则表达式从Event Body中提取信息,并将提取结果存入Header。
properties
# ---------- Regex Extractor Interceptor ----------
a1.sources.r1.interceptors = i1
a1.sources.r1.interceptors.i1.type = regex_extractor
# 正则表达式:例如从日志 "2024-01-15 10:30:00 ERROR ModuleA xxx" 中提取日期和级别
a1.sources.r1.interceptors.i1.regex = ^(\\d{4}-\\d{2}-\\d{2})\\s+(\\d{2}:\\d{2}:\\d{2})\\s+(\\w+).*
# 序列化器:将匹配到的分组映射到Header的key
a1.sources.r1.interceptors.i1.serializers = s1 s2 s3
# s1对应第1个分组(日期)
a1.sources.r1.interceptors.i1.serializers.s1.name = logdate
# s2对应第2个分组(时间)
a1.sources.r1.interceptors.i1.serializers.s2.name = logtime
# s3对应第3个分组(日志级别)
a1.sources.r1.interceptors.i1.serializers.s3.name = loglevel
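The mapping from capture groups to Header names can be checked outside Flume before deploying. A minimal sketch using Python's `re` module on the sample log line from the comment above; the pattern is the same one after properties-file unescaping, and `extract_headers` is a hypothetical helper name.

```python
import re

# Same pattern as interceptors.i1.regex, with the doubled backslashes reduced to single ones.
PATTERN = re.compile(r"^(\d{4}-\d{2}-\d{2})\s+(\d{2}:\d{2}:\d{2})\s+(\w+).*")
HEADER_NAMES = ["logdate", "logtime", "loglevel"]   # serializers s1, s2, s3 in order


def extract_headers(body):
    """Return the Headers the interceptor would add, or an empty dict if nothing matches."""
    match = PATTERN.search(body)
    if match is None:
        return {}
    return dict(zip(HEADER_NAMES, match.groups()))
```

Lines that do not match leave the Event without these Headers, which matters when a later channel selector routes on `loglevel`.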
9.7 Multiplexing Channel Selector(多路复用通道选择器)
功能 :根据Event的Header值,将Event路由到不同的Channel。这是一种条件路由机制。
properties
# ============================================================
# multiplexing.conf
# 功能:根据Header中的日志级别将数据路由到不同目的地
# ============================================================
a1.sources = r1
# Three Channels (a properties value runs to the end of the line, so comments must sit on their own lines)
a1.channels = c1 c2 c3
# Three Sinks
a1.sinks = k1 k2 k3
# 使用Regex Extractor提取日志级别到Header
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
a1.sources.r1.interceptors = i1 i2
# 时间戳拦截器
a1.sources.r1.interceptors.i1.type = timestamp
# 正则提取拦截器:提取日志级别
a1.sources.r1.interceptors.i2.type = regex_extractor
a1.sources.r1.interceptors.i2.regex = ^\\S+\\s+\\S+\\s+(\\w+)\\s+.*
a1.sources.r1.interceptors.i2.serializers = s1
a1.sources.r1.interceptors.i2.serializers.s1.name = loglevel
# ---------- 多路复用通道选择器 ----------
# 选择器类型为multiplexing
a1.sources.r1.selector.type = multiplexing
# 根据Header中"loglevel"的值来路由
a1.sources.r1.selector.header = loglevel
# 当loglevel=ERROR时,路由到Channel c1
a1.sources.r1.selector.mapping.ERROR = c1
# 当loglevel=WARN时,路由到Channel c2
a1.sources.r1.selector.mapping.WARN = c2
# 当loglevel=INFO时,路由到Channel c3
a1.sources.r1.selector.mapping.INFO = c3
# 默认路由(不匹配任何规则时)
a1.sources.r1.selector.default = c3
# ---------- 三个Channel ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.channels.c2.type = memory
a1.channels.c2.capacity = 1000
a1.channels.c2.transactionCapacity = 100
a1.channels.c3.type = memory
a1.channels.c3.capacity = 1000
a1.channels.c3.transactionCapacity = 100
# ---------- 三个Sink(分别写入不同HDFS路径) ----------
# Sink1:ERROR日志写入专门目录
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/logs/error/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = error-
a1.sinks.k1.hdfs.fileType = DataStream
# Sink2:WARN日志
a1.sinks.k2.type = hdfs
a1.sinks.k2.hdfs.path = hdfs://hadoop101:8020/flume/logs/warn/%Y%m%d
a1.sinks.k2.hdfs.filePrefix = warn-
a1.sinks.k2.hdfs.fileType = DataStream
# Sink3:INFO日志
a1.sinks.k3.type = hdfs
a1.sinks.k3.hdfs.path = hdfs://hadoop101:8020/flume/logs/info/%Y%m%d
a1.sinks.k3.hdfs.filePrefix = info-
a1.sinks.k3.hdfs.fileType = DataStream
# ---------- 绑定(每个Sink绑定各自的Channel) ----------
a1.sources.r1.channels = c1 c2 c3
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c2
a1.sinks.k3.channel = c3
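The routing rule in this configuration reduces to a lookup with a fallback. A minimal sketch, with the mapping copied from the selector settings above; `select_channels` is a hypothetical helper, not Flume's selector class.

```python
MAPPING = {"ERROR": ["c1"], "WARN": ["c2"], "INFO": ["c3"]}   # selector.mapping.*
DEFAULT = ["c3"]                                              # selector.default


def select_channels(headers, header_name="loglevel"):
    """Return the Channels an Event is written to, based on one Header value."""
    return MAPPING.get(headers.get(header_name), DEFAULT)
```

An Event whose `loglevel` is DEBUG, or whose line never matched the extractor regex, falls through to the default Channel c3 and ends up under the info path.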
9.8 多拦截器链式配置示例
properties
# ============================================================
# multi-interceptor.conf
# 功能:多个拦截器串联使用
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
# ---------- 拦截器链:先加时间戳,再加主机名,最后过滤 ----------
# 注意:拦截器按顺序执行(i1 → i2 → i3)
a1.sources.r1.interceptors = i1 i2 i3
# 第1个拦截器:添加时间戳
a1.sources.r1.interceptors.i1.type = timestamp
# 第2个拦截器:添加主机IP
a1.sources.r1.interceptors.i2.type = host
a1.sources.r1.interceptors.i2.useIP = true
a1.sources.r1.interceptors.i2.hostHeader = agent_ip
# 第3个拦截器:过滤掉DEBUG日志
a1.sources.r1.interceptors.i3.type = regex_filter
a1.sources.r1.interceptors.i3.regex = ^.*DEBUG.*$
a1.sources.r1.interceptors.i3.excludeEvents = true
a1.channels.c1.type = memory
a1.channels.c1.capacity = 1000
a1.channels.c1.transactionCapacity = 100
a1.sinks.k1.type = hdfs
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/logs/%Y%m%d
a1.sinks.k1.hdfs.filePrefix = log-
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 60
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
十、Flume的可靠性保证
10.1 端到端可靠性
Flume的可靠性基于事务机制:
Source → [put事务] → Channel → [take事务] → Sink
- Source → Channel(put事务):Source将数据放入Channel时,先开启put事务。放入成功则提交事务,失败则回滚重试。
- Channel → Sink(take事务):Sink从Channel取出数据时,先开启take事务。写入目标成功后才提交事务,否则回滚重放。
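The two transactions above can be modeled in a few lines. This is a simplified sketch, not Flume's Channel API: the `Channel` class and its method names are hypothetical, and a batch stands in for one transaction.

```python
from collections import deque


class Channel:
    """Toy Channel with all-or-nothing put and take batches."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.queue = deque()

    def put_batch(self, events):
        """Put transaction: the whole batch is committed, or none of it."""
        if len(self.queue) + len(events) > self.capacity:
            return False            # rollback: the Source keeps the batch and retries later
        self.queue.extend(events)   # commit
        return True

    def take_batch(self, max_events, deliver):
        """Take transaction: Events leave the Channel only if deliver() succeeds."""
        count = min(max_events, len(self.queue))
        batch = [self.queue.popleft() for _ in range(count)]
        try:
            deliver(batch)
        except Exception:
            self.queue.extendleft(reversed(batch))   # rollback: restore the original order
            return False
        return True                 # commit
```

Because a failed delivery puts the batch back, a Sink that partially wrote the batch before failing will write those Events again on retry. The guarantee is therefore at-least-once, and downstream consumers may see duplicates.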
properties
# ============================================================
# reliability.conf
# 功能:演示事务机制配置(使用File Channel保证数据持久化)
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinks = k1
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
# 使用File Channel确保Channel中的数据持久化到磁盘
# 即使Agent崩溃重启,Channel中的数据不会丢失
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 1000
# HDFS Sink
a1.sinks.k1.type = hdfs
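a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/reliable/%Y%m%d
# Required for the time escapes above, because this Agent has no timestamp interceptor
a1.sinks.k1.hdfs.useLocalTimeStamp = true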
a1.sinks.k1.hdfs.filePrefix = reliable-
a1.sinks.k1.hdfs.fileType = DataStream
a1.sinks.k1.hdfs.rollInterval = 30
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
10.2 负载均衡(Load Balancing)
功能:将Sink分组,按照负载均衡策略将Event分发到组内的不同Sink,避免单个目标节点过载。
properties
# ============================================================
# load-balance.conf
# 功能:Sink组负载均衡,将数据均匀分发到多个目标Agent
# ============================================================
a1.sources = r1
a1.channels = c1
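# Sink group (comments must sit on their own lines, because a properties value runs to the end of the line)
a1.sinkgroups = g1
# Two Sinks in the group
a1.sinks = k1 k2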
# ---------- Source ----------
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
# ---------- Channel ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# ---------- Sink 1(发送到hadoop102上的Agent) ----------
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
# ---------- Sink 2(发送到hadoop103上的Agent) ----------
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 4141
# ---------- Sink组配置 ----------
# 将k1和k2加入Sink组g1
a1.sinkgroups.g1.sinks = k1 k2
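# Sink processor type: load balancing
a1.sinkgroups.g1.processor.type = load_balance
# Selection strategy: round_robin or random
a1.sinkgroups.g1.processor.selector = round_robin
# Whether a failed Sink is temporarily excluded, with an exponentially growing back-off
a1.sinkgroups.g1.processor.backoff = true
# Upper bound on the back-off time, in milliseconds; after it expires the Sink is tried again
a1.sinkgroups.g1.processor.selector.maxTimeOut = 30000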
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
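Round-robin selection with back-off can be sketched as a rotating list from which temporarily failed Sinks are filtered out. This is a simplified model with hypothetical names; in particular, the penalty passed to `mark_failed` is chosen by the caller here, whereas Flume grows it automatically.

```python
class RoundRobinSelector:
    """Toy round-robin Sink selector with a capped back-off."""

    def __init__(self, sinks, max_timeout_ms=30000):
        self.sinks = list(sinks)
        self.max_timeout_ms = max_timeout_ms
        self.blocked_until = {name: 0 for name in self.sinks}
        self.next_index = 0

    def order(self, now_ms):
        """Sinks to try for the next batch, skipping any that are still backed off."""
        n = len(self.sinks)
        rotated = [self.sinks[(self.next_index + i) % n] for i in range(n)]
        self.next_index = (self.next_index + 1) % n
        return [s for s in rotated if self.blocked_until[s] <= now_ms]

    def mark_failed(self, sink, now_ms, penalty_ms=1000):
        """Exclude a Sink for penalty_ms, capped at max_timeout_ms."""
        self.blocked_until[sink] = now_ms + min(penalty_ms, self.max_timeout_ms)
```

The processor tries the Sinks in the returned order and stops at the first one that accepts the batch, so a failed k1 costs one extra attempt rather than a lost batch.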
10.3 故障恢复(Failover)
功能 :Sink组中的Sink有优先级,优先使用高优先级的Sink。当高优先级Sink失败时,自动切换到备用Sink。
properties
# ============================================================
# failover.conf
# 功能:Sink组故障转移,主Sink故障后自动切换到备用Sink
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
# Three Sinks (the comment is on its own line, because a properties value runs to the end of the line)
a1.sinks = k1 k2 k3
# ---------- Source ----------
a1.sources.r1.type = exec
a1.sources.r1.command = tail -F /opt/module/data/app.log
# ---------- Channel ----------
a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 1000
# ---------- Sink 1(主Sink,优先级最高 = 5) ----------
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
# ---------- Sink 2(备用Sink,优先级次高 = 3) ----------
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 4141
# ---------- Sink 3(备用Sink,优先级最低 = 1) ----------
a1.sinks.k3.type = avro
a1.sinks.k3.hostname = hadoop104
a1.sinks.k3.port = 4141
# ---------- Sink组故障转移配置 ----------
a1.sinkgroups.g1.sinks = k1 k2 k3
# 故障转移策略类型
a1.sinkgroups.g1.processor.type = failover
# k1的优先级为5(数字越大优先级越高)
a1.sinkgroups.g1.processor.priority.k1 = 5
# k2的优先级为3
a1.sinkgroups.g1.processor.priority.k2 = 3
# k3的优先级为1
a1.sinkgroups.g1.processor.priority.k3 = 1
# 故障Sink的冷却时间(毫秒),超过此时间后会再次尝试
a1.sinkgroups.g1.processor.maxpenalty = 30000
# ---------- 绑定 ----------
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
a1.sinks.k3.channel = c1
故障转移工作流程:
正常状态: k1(priority=5) ← 主要使用
k2(priority=3) ← 备用
k3(priority=1) ← 备用
k1故障后: k1标记为失效(冷却30秒)
k2(priority=3) ← 自动切换为主
k2也故障后:k1,k2都标记为失效
k3(priority=1) ← 自动切换为主
k1冷却到期后:k1重新尝试连接
如果成功,k1重新成为主
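The workflow above can be sketched as "pick the highest-priority Sink that is not cooling down". This is a simplified model with hypothetical names: it applies a fixed cool-down equal to maxpenalty, whereas Flume increases the penalty step by step up to that limit.

```python
class FailoverSelector:
    """Toy failover selector: highest priority wins unless the Sink is cooling down."""

    def __init__(self, priorities, max_penalty_ms=30000):
        self.priorities = dict(priorities)
        self.max_penalty_ms = max_penalty_ms
        self.failed_until = {}

    def active(self, now_ms):
        """Return the Sink to use now, or None if every Sink is cooling down."""
        live = [s for s in self.priorities if self.failed_until.get(s, 0) <= now_ms]
        return max(live, key=self.priorities.get) if live else None

    def mark_failed(self, sink, now_ms):
        """Put a Sink into cool-down for max_penalty_ms."""
        self.failed_until[sink] = now_ms + self.max_penalty_ms
```

With priorities 5, 3, and 1, traffic returns to k1 as soon as its cool-down expires, even if k2 is healthy, because selection is always by priority.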
十一、采集案例
11.1 案例一:将目录采集到HDFS中
需求 :监控 /opt/module/data/flume-input/ 目录下的新增文件,将文件内容写入HDFS。
步骤1:准备配置文件 spool-dir-to-hdfs.conf
properties
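# ============================================================
# spool-dir-to-hdfs.conf
# Purpose: watch a directory for new files and write their contents to HDFS
# ============================================================
# ============ 1. Define the agent components ============
# The agent is named a1
# It has one source, one channel, and one sink
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ============ 2. Configure the Spooling Directory Source ============
# Source type: spooldir (picks up files newly added to a directory)
a1.sources.r1.type = spooldir
# Directory to watch
# Note: create it beforehand and give the Flume process read and write permission.
# A file must be complete and must not change after it appears in this directory.
a1.sources.r1.spoolDir = /opt/module/data/flume-input
# Suffix appended to a file name once the file has been fully read
a1.sources.r1.fileSuffix = .COMPLETED
# Ignore rules: skip completed files and temporary files
# (^$): empty file name
# (.*\\.COMPLETED$): files that have already been ingested
# (.*\\.tmp$): temporary files that are still being written
# (\\ is the properties-file escape for one backslash, so the regex engine sees \.)
a1.sources.r1.ignorePattern = (^$|.*\\.COMPLETED$|.*\\.tmp$)
# Line deserializer: each line of the file becomes one Event
a1.sources.r1.deserializer = LINE
# Maximum number of characters per line; a longer line is truncated and
# the remainder is emitted as a subsequent Event
a1.sources.r1.deserializer.maxLineLength = 5120
# Add the absolute file path to the Event header
a1.sources.r1.fileHeader = true
a1.sources.r1.fileHeaderKey = file
# Add the file name (basename) to the Event header
a1.sources.r1.basenameHeader = true
a1.sources.r1.basenameHeaderKey = basename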
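# ============ 3. Configure the Channel ============
# Memory channel: fast and suited to many small files, but Events still in
# the channel are lost if the agent stops or crashes
a1.channels.c1.type = memory
# Maximum number of Events held in the channel
a1.channels.c1.capacity = 10000
# Maximum number of Events per transaction
a1.channels.c1.transactionCapacity = 1000
# Percentage of byteCapacity held back as a buffer for Event headers.
# byteCapacity itself limits the total size of Event bodies and defaults
# to 80% of the maximum JVM heap.
a1.channels.c1.byteCapacityBufferPercentage = 20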
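# ============ 4. Configure the HDFS Sink ============
a1.sinks.k1.type = hdfs
# HDFS target path
# %Y%m%d partitions the output by date (year, month, day)
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/spool-input/%Y%m%d
# File name prefix on HDFS
a1.sinks.k1.hdfs.filePrefix = spool-
# File name suffix on HDFS
a1.sinks.k1.hdfs.fileSuffix = .txt
# Resolve the time escapes in the path from the agent's local time
# instead of a timestamp header on the Event
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# ---- File rolling policy ----
# By time: roll to a new file every 60 seconds
a1.sinks.k1.hdfs.rollInterval = 60
# By size: roll at 128 MB (128 * 1024 * 1024 = 134217728 bytes)
a1.sinks.k1.hdfs.rollSize = 134217728
# By count: 0 disables rolling by number of Events
a1.sinks.k1.hdfs.rollCount = 0
# File type: DataStream writes plain, uncompressed output
a1.sinks.k1.hdfs.fileType = DataStream
# Write format: Text
a1.sinks.k1.hdfs.writeFormat = Text
# Append a newline after each Event. This is an option of the text serializer,
# so the key is serializer.appendNewline rather than hdfs.appendNewline.
a1.sinks.k1.serializer.appendNewline = true
# Maximum number of files open at the same time
a1.sinks.k1.hdfs.maxOpenFiles = 5000
# Timeout (ms) for HDFS operations such as open, write, flush, and close
a1.sinks.k1.hdfs.callTimeout = 30000
# ============ 5. Bind the components ============
# Connect source r1 to channel c1
a1.sources.r1.channels = c1
# Connect sink k1 to channel c1
a1.sinks.k1.channel = c1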
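To see which file names the ignorePattern above skips, the pattern can be tried outside Flume. The sketch below uses `grep -E -x` as a stand-in for Java's regex engine (an assumption: the two engines agree on this simple pattern, and `-x` imitates a whole-name match). The `is_ignored` helper is hypothetical.

```bash
# The pattern as the regex engine sees it: the properties file turns \\. into \.
IGNORE_PATTERN='^$|.*\.COMPLETED$|.*\.tmp$'

# is_ignored NAME: succeeds when the whole NAME matches the ignore pattern.
is_ignored() {
  printf '%s\n' "$1" | grep -Eqx -- "$IGNORE_PATTERN"
}

for f in test1.txt test1.txt.COMPLETED upload.tmp; do
  if is_ignored "$f"; then
    echo "$f: ignored"
  else
    echo "$f: picked up"
  fi
done
```

Only `test1.txt` is reported as picked up; the other two names match one of the alternatives in the pattern.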
Step 2: Start the agent and test
bash
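# Create the watched directory
mkdir -p /opt/module/data/flume-input
# Start the Flume agent
flume-ng agent \
  --name a1 \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/spool-dir-to-hdfs.conf \
  -Dflume.root.logger=INFO,console
# In a new terminal, create a test file and place it in the watched directory
echo -e "Hello Flume\nThis is line 2\nThis is line 3" > /tmp/test1.txt
# Copy under a .tmp name first (skipped by ignorePattern), then rename,
# so that Flume never sees a half-written file
cp /tmp/test1.txt /opt/module/data/flume-input/test1.txt.tmp
mv /opt/module/data/flume-input/test1.txt.tmp /opt/module/data/flume-input/test1.txt
# After Flume has ingested the file, check the output on HDFS
hdfs dfs -ls -R /flume/spool-input/
# Files are named <prefix>.<timestamp><suffix>; a file still being written ends in .tmp
hdfs dfs -cat "/flume/spool-input/$(date +%Y%m%d)/spool-*.txt"
# List the files that were marked as completed
ls -la /opt/module/data/flume-input/
# You should see test1.txt.COMPLETED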
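Because the configuration in this case ignores names ending in `.tmp`, a file can be delivered safely by copying it under a temporary name and renaming it once the copy is complete. The helper below is a minimal sketch of that idea; `deliver_to_spool` and the throwaway directory created by `mktemp` are hypothetical and stand in for the real spool directory.

```bash
# deliver_to_spool SRC SPOOL_DIR
# Copy SRC into SPOOL_DIR under a .tmp name, then rename it to its final name.
# The rename stays inside one directory, so the final name appears in one step.
deliver_to_spool() {
  src=$1
  spool_dir=$2
  name=$(basename "$src")
  cp "$src" "$spool_dir/$name.tmp" &&
    mv "$spool_dir/$name.tmp" "$spool_dir/$name"
}

# Demo against a throwaway directory instead of /opt/module/data/flume-input
work=$(mktemp -d)
mkdir "$work/flume-input"
printf 'Hello Flume\nThis is line 2\n' > "$work/test2.txt"
deliver_to_spool "$work/test2.txt" "$work/flume-input"
ls "$work/flume-input"
```

Writing the file elsewhere on the same file system and moving it in with `mv` achieves the same effect without the temporary suffix.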
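11.2 Case 2: Collecting a File into HDFS
Requirement: Monitor the log file /opt/module/data/app-server/access.log in real time and write newly appended lines to HDFS, partitioned by hour.
Step 1: Prepare the configuration file taildir-file-to-hdfs.conf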
properties
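# ============================================================
# taildir-file-to-hdfs.conf
# Purpose: tail log files in real time with the Taildir Source and write to HDFS (partitioned by hour)
# ============================================================
# ============ 1. Define the agent components ============
a1.sources = r1
a1.channels = c1
a1.sinks = k1
# ============ 2. Configure the Taildir Source ============
# Source type: taildir (tails appended content and resumes from the last recorded position)
a1.sources.r1.type = taildir
# Define the file groups (several groups can watch files in different directories)
# Only one group, f1, is defined here
a1.sources.r1.filegroups = f1
# Match rule for file group f1: <directory>/<file name regex>
# The regex applies to the file name only, not to the directory part
# Matches every file ending in .log under /opt/module/data/app-server/, including access.log
a1.sources.r1.filegroups.f1 = /opt/module/data/app-server/.*\\.log
# Add a custom header to every Event from file group f1
# All Events from f1 carry the header type=access_log
a1.sources.r1.headers.f1.type = access_log
# Position file (JSON) used to resume after a restart
# Records the inode, the byte offset (pos), and the path of each tailed file
# After a restart the agent continues from the recorded offsets instead of re-reading the files
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/taildir_position.json
# Idle timeout (ms): close the file handle when no new content arrives within this time;
# the file is reopened automatically when new lines are appended
a1.sources.r1.idleTimeout = 3000
# Maximum number of Events read per batch
a1.sources.r1.batchSize = 100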
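# ============ 3. Configure the Channel ============
# Memory channel
a1.channels.c1.type = memory
# Maximum number of buffered Events
a1.channels.c1.capacity = 100000
# Maximum number of Events per transaction
a1.channels.c1.transactionCapacity = 5000
# Limit on the total bytes held in the channel (500 MB = 500 * 1024 * 1024).
# Size the JVM heap (-Xmx in flume-env.sh) well above this value, otherwise the limit cannot be reached.
a1.channels.c1.byteCapacity = 524288000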
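# ============ 4. Configure the HDFS Sink ============
a1.sinks.k1.type = hdfs
# HDFS target path (partitioned by hour)
# %{type} inserts the value of the Event header "type" as a subdirectory
# %Y%m%d/%H adds a date directory with an hour directory below it
a1.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/file-input/%{type}/%Y%m%d/%H
# File name prefix
a1.sinks.k1.hdfs.filePrefix = access-
# File name suffix
a1.sinks.k1.hdfs.fileSuffix = .log
# Resolve the time escapes from the agent's local time
a1.sinks.k1.hdfs.useLocalTimeStamp = true
# ---- File rolling policy ----
# Roll every 30 seconds
a1.sinks.k1.hdfs.rollInterval = 30
# Roll at 128 MB
a1.sinks.k1.hdfs.rollSize = 134217728
# Do not roll by Event count
a1.sinks.k1.hdfs.rollCount = 0
# File type
a1.sinks.k1.hdfs.fileType = DataStream
# Write format
a1.sinks.k1.hdfs.writeFormat = Text
# Append a newline after each Event. This is an option of the text serializer,
# so the key is serializer.appendNewline rather than hdfs.appendNewline.
a1.sinks.k1.serializer.appendNewline = true
# Maximum number of files open at the same time
a1.sinks.k1.hdfs.maxOpenFiles = 5000
# Number of Events written to a file before it is flushed to HDFS
a1.sinks.k1.hdfs.batchSize = 1000
# Timeout (ms) for HDFS operations
a1.sinks.k1.hdfs.callTimeout = 30000
# ============ 5. Bind the components ============
# Connect source r1 to channel c1
a1.sources.r1.channels = c1
# Connect sink k1 to channel c1
a1.sinks.k1.channel = c1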
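The way the path escapes turn into a directory can be illustrated without a running agent. The sketch below fills in `%{type}/%Y%m%d/%H` for one header value and one timestamp. It assumes GNU `date` (the `-d @epoch` form) and forces UTC so the result is reproducible, whereas the agent would use its own local time; `expand_hdfs_path` is a hypothetical helper.

```bash
# expand_hdfs_path TYPE EPOCH_SECONDS
# Imitates how /flume/file-input/%{type}/%Y%m%d/%H would be filled in for one Event.
expand_hdfs_path() {
  type_header=$1
  epoch=$2
  printf '/flume/file-input/%s/%s\n' "$type_header" \
    "$(date -u -d "@$epoch" +%Y%m%d/%H)"
}

expand_hdfs_path access_log 1705312800   # 2024-01-15 10:00:00 UTC
expand_hdfs_path access_log 1705316400   # one hour later: a new hour directory
```

Each new hour therefore starts a new directory, in addition to the files rolled inside an hour by rollInterval and rollSize.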
Step 2: Start the agent and test
bash
# Create the required directories
mkdir -p /opt/module/data/app-server
mkdir -p /opt/module/flume-1.9.0/position
# Start the Flume agent
flume-ng agent \
  --name a1 \
  --conf $FLUME_HOME/conf \
  --conf-file $FLUME_HOME/conf/taildir-file-to-hdfs.conf \
  -Dflume.root.logger=INFO,console
# In a new terminal, simulate an application writing to the log file
# Option 1: append lines by hand
echo "2024-01-15 10:00:01 INFO User login success, userId=10001" >> /opt/module/data/app-server/access.log
echo "2024-01-15 10:00:02 ERROR NullPointerException at ModuleA:45" >> /opt/module/data/app-server/access.log
echo "2024-01-15 10:00:03 INFO Query executed, duration=150ms" >> /opt/module/data/app-server/access.log
# Option 2: append continuously (simulates real-time log generation)
for i in $(seq 1 100); do
  echo "$(date '+%Y-%m-%d %H:%M:%S') INFO Request processed, requestId=req-$i" >> /opt/module/data/app-server/access.log
  sleep 0.1
done
# Check the collected output on HDFS
hdfs dfs -ls -R /flume/file-input/
# View the collected data. The date and hour directories come from the agent's local
# time at write time (useLocalTimeStamp = true), not from the timestamps inside the log lines.
hdfs dfs -cat "/flume/file-input/access_log/$(date +%Y%m%d/%H)/access-*.log"
# Inspect the position file used for resuming
cat /opt/module/flume-1.9.0/position/taildir_position.json
# Output similar to:
# [{"inode":262200,"pos":1024,"file":"/opt/module/data/app-server/access.log"}]
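The `pos` value in the position file is a byte offset, so comparing it with the current file size shows how far the source lags behind the writer. The sketch below is a quick check under assumptions: it expects the single-line JSON layout shown above, a path free of regex metacharacters other than dots, and it uses `grep` and `sed` rather than a real JSON parser. `tail_lag` is a hypothetical helper, and the demo writes its own sample files.

```bash
# tail_lag POSITION_FILE LOG_FILE
# Prints how many bytes of LOG_FILE lie beyond the offset recorded in POSITION_FILE.
tail_lag() {
  pos=$(grep -o "\"pos\":[0-9]*,\"file\":\"$2\"" "$1" |
        sed 's/^"pos":\([0-9]*\),.*/\1/')
  size=$(wc -c < "$2")
  echo $(( size - pos ))
}

# Demo with sample files in a throwaway directory
work=$(mktemp -d)
log="$work/access.log"
printf 'line one\nline two\n' > "$log"     # two lines of 9 bytes each
printf '[{"inode":262200,"pos":9,"file":"%s"}]\n' "$log" > "$work/taildir_position.json"
tail_lag "$work/taildir_position.json" "$log"
```

A lag that stays at 0 while the application is writing usually means the position file being inspected is not the one the agent is using.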
12. Advanced Case: Agent Chaining + Load Balancing + Failover
12.1 Architecture
                                        ┌→ hadoop102 Agent2 (sink k1, priority=5) → HDFS
hadoop101 Agent1 → SinkGroup (failover) ┼→ hadoop103 Agent3 (sink k2, priority=3) → HDFS
                                        └→ hadoop104 Agent4 (sink k3, priority=1) → HDFS
Agent1 (hadoop101): collects data and sends it downstream with failover
properties
# ============================================================
# agent1-failover.conf (hadoop101)
# Purpose: collect logs and send them to the downstream agents with failover
# ============================================================
a1.sources = r1
a1.channels = c1
a1.sinkgroups = g1
a1.sinks = k1 k2 k3
# ---- Source ----
a1.sources.r1.type = taildir
a1.sources.r1.filegroups = f1
a1.sources.r1.filegroups.f1 = /opt/module/data/logs/.*\\.log
a1.sources.r1.positionFile = /opt/module/flume-1.9.0/position/taildir_position.json
a1.sources.r1.batchSize = 100
# ---- Channel ----
a1.channels.c1.type = file
a1.channels.c1.checkpointDir = /opt/module/flume-1.9.0/checkpoint
a1.channels.c1.dataDirs = /opt/module/flume-1.9.0/data
a1.channels.c1.capacity = 1000000
a1.channels.c1.transactionCapacity = 1000
# ---- Sink 1 → hadoop102 (primary sink, priority=5) ----
a1.sinks.k1.type = avro
a1.sinks.k1.hostname = hadoop102
a1.sinks.k1.port = 4141
# ---- Sink 2 → hadoop103 (standby sink, priority=3) ----
a1.sinks.k2.type = avro
a1.sinks.k2.hostname = hadoop103
a1.sinks.k2.port = 4141
# ---- Sink 3 → hadoop104 (standby sink, priority=1) ----
a1.sinks.k3.type = avro
a1.sinks.k3.hostname = hadoop104
a1.sinks.k3.port = 4141
# ---- SinkGroup failover ----
a1.sinkgroups.g1.sinks = k1 k2 k3
a1.sinkgroups.g1.processor.type = failover
a1.sinkgroups.g1.processor.priority.k1 = 5
a1.sinkgroups.g1.processor.priority.k2 = 3
a1.sinkgroups.g1.processor.priority.k3 = 1
a1.sinkgroups.g1.processor.maxpenalty = 30000
# ---- Bindings ----
a1.sources.r1.channels = c1
a1.sinks.k1.channel = c1
a1.sinks.k2.channel = c1
a1.sinks.k3.channel = c1
Agent2 (hadoop102): receives data and writes it to HDFS
properties
# ============================================================
# agent2-avro-to-hdfs.conf (hadoop102)
# Purpose: receive data from the upstream agent and write it to HDFS
# ============================================================
a2.sources = r1
a2.channels = c1
a2.sinks = k1
# ---- Avro Source (listens for connections from the upstream agent) ----
a2.sources.r1.type = avro
a2.sources.r1.bind = 0.0.0.0
a2.sources.r1.port = 4141
a2.sources.r1.threads = 5
# ---- Channel ----
a2.channels.c1.type = memory
a2.channels.c1.capacity = 10000
a2.channels.c1.transactionCapacity = 1000
# ---- HDFS Sink ----
a2.sinks.k1.type = hdfs
a2.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/failover/hadoop102/%Y%m%d/%H
a2.sinks.k1.hdfs.filePrefix = h102-
a2.sinks.k1.hdfs.useLocalTimeStamp = true
a2.sinks.k1.hdfs.rollInterval = 30
a2.sinks.k1.hdfs.rollSize = 134217728
a2.sinks.k1.hdfs.rollCount = 0
a2.sinks.k1.hdfs.fileType = DataStream
a2.sinks.k1.hdfs.writeFormat = Text
# ---- Bindings ----
a2.sources.r1.channels = c1
a2.sinks.k1.channel = c1
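Only the Agent2 configuration is listed here. Assuming Agent3 and Agent4 differ from it only in the host-specific HDFS subdirectory and file prefix, their files could be derived from the Agent2 file with `sed`, as in the sketch below. The `derive_agent_conf` helper and the output names `agent3-avro-to-hdfs.conf` and `agent4-avro-to-hdfs.conf` are hypothetical, and the demo runs on a two-line excerpt in a throwaway directory.

```bash
# derive_agent_conf TEMPLATE HOST SHORT_TAG
# Prints TEMPLATE with the hadoop102-specific parts rewritten for another host.
derive_agent_conf() {
  sed -e "s/hadoop102/$2/g" -e "s/h102-/$3-/g" "$1"
}

# Demo on an excerpt of agent2-avro-to-hdfs.conf
work=$(mktemp -d)
cat > "$work/agent2-avro-to-hdfs.conf" <<'EOF'
a2.sinks.k1.hdfs.path = hdfs://hadoop101:8020/flume/failover/hadoop102/%Y%m%d/%H
a2.sinks.k1.hdfs.filePrefix = h102-
EOF
derive_agent_conf "$work/agent2-avro-to-hdfs.conf" hadoop103 h103 > "$work/agent3-avro-to-hdfs.conf"
derive_agent_conf "$work/agent2-avro-to-hdfs.conf" hadoop104 h104 > "$work/agent4-avro-to-hdfs.conf"
cat "$work/agent3-avro-to-hdfs.conf"
```

The property prefix stays `a2` in the derived files, so each downstream agent would be started with `--name a2`, while Agent1 is started with `--name a1`. The value of `--name` has to match the prefix used in the file. Starting the downstream agents before Agent1 lets the primary sink connect on its first attempt.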
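13. Chapter Summary
13.1 Key Points
| Topic | Description |
|---|---|
| What Flume is | A distributed system for collecting, aggregating, and moving log data |
| Core architecture | Source → Channel → Sink |
| Event structure | Header (key-value metadata) + Body (byte array) |
| Common sources | Exec, Taildir, Spooling Directory, Avro, Netcat, Kafka |
| Common channels | Memory (fast but not persistent), File (slower but durable), Kafka (high throughput) |
| Common sinks | HDFS, Logger, Avro, Kafka, HBase, File Roll |
| Interceptors | Timestamp, Host, Static, Regex Filter, Regex Extractor |
| Channel selectors | Replicating (the default, copies to every channel), Multiplexing (routes by header value) |
| Reliability | Transactions (put/take) + File Channel persistence |
| Load balancing | SinkGroup + load_balance processor (round_robin / random) |
| Failover | SinkGroup + failover processor (switches by priority) |
13.2 Configuration Quick Reference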
properties
# Note: the trailing "# ..." annotations are for this reference only. A properties
# file has no inline comments, so copy only the key and the value into a real configuration.
# ---------- Source types ----------
a1.sources.r1.type = netcat # listens on a TCP port
a1.sources.r1.type = exec # runs a Linux command
a1.sources.r1.type = spooldir # watches a directory for new files
a1.sources.r1.type = taildir # tails appended file content (recommended)
a1.sources.r1.type = avro # receives Avro RPC
a1.sources.r1.type = org.apache.flume.source.kafka.KafkaSource # Kafka consumer
# ---------- Channel types ----------
a1.channels.c1.type = memory # memory channel
a1.channels.c1.type = file # file channel
# ---------- Sink types ----------
a1.sinks.k1.type = logger # logs Events to the console
a1.sinks.k1.type = hdfs # HDFS
a1.sinks.k1.type = avro # sends Avro RPC
a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink # Kafka producer
a1.sinks.k1.type = hbase # HBase
a1.sinks.k1.type = file_roll # local files
# ---------- Interceptor types ----------
a1.sources.r1.interceptors.i1.type = timestamp # timestamp
a1.sources.r1.interceptors.i1.type = host # host
a1.sources.r1.interceptors.i1.type = static # static key-value pair
a1.sources.r1.interceptors.i1.type = regex_filter # regex filter
a1.sources.r1.interceptors.i1.type = regex_extractor # regex extractor
# ---------- Channel selectors ----------
a1.sources.r1.selector.type = replicating # replicating (default)
a1.sources.r1.selector.type = multiplexing # multiplexing
# ---------- Sink processors ----------
a1.sinkgroups.g1.processor.type = load_balance # load balancing
a1.sinkgroups.g1.processor.type = failover # failover
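13.3 Common Problems and Notes
| Problem | Solution |
|---|---|
| The Spooling Directory Source reads a file that is still being written | Add the suffix of the temporary files to ignorePattern, or write the file in another directory and move it in once it is complete |
| The Taildir Source re-collects data after a restart | Make sure positionFile points to a persistent, writable path, and do not delete the position file by hand |
| Too many small files on HDFS | Increase rollInterval and rollSize, and set rollCount to 0 so that files do not roll by Event count |
| Data is lost after an agent crash | Use a File Channel instead of a Memory Channel |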
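| Several sinks conflict when writing the same file | Give each sink its own hdfs.path or hdfs.filePrefix (for example, one prefix per host, as in the advanced case) so that no two sinks write to the same file |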
| "Port already in use" error at startup | Use netstat -tunlp to find the process holding the port, then stop it or choose another port |
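The small-file row can be made concrete with a little arithmetic. The sketch below gives an upper bound on the number of files one sink writes per day into one path when only time-based rolling fires, assuming Events arrive continuously and the size threshold is never reached first. `files_per_day` is a hypothetical helper.

```bash
# files_per_day ROLL_INTERVAL_SECONDS
# Upper bound on files per day for one sink and one path under time-based rolling.
files_per_day() {
  echo $(( 86400 / $1 ))
}

for interval in 30 60 3600; do
  echo "rollInterval=$interval -> up to $(files_per_day "$interval") files per day"
done
```

With the 30-second interval used in Case 2 the bound is 2880 files per day, against 24 with an hourly interval, which is why raising rollInterval together with rollSize is the usual first step.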