Big Data Ingestion: Apache Flume Notes

1 Flume Installation and Deployment

1.1 Official Site

Flume website: http://flume.apache.org/
User guide: http://flume.apache.org/FlumeUserGuide.html
Downloads: http://archive.apache.org/dist/flume/

1.2 Installation and Deployment

Upload apache-flume-1.10.1-bin.tar.gz to the /opt/software directory on the Linux host.
Extract apache-flume-1.10.1-bin.tar.gz into /opt/module/:
```
tar -zxvf /opt/software/apache-flume-1.10.1-bin.tar.gz -C /opt/module/
```
Rename apache-flume-1.10.1-bin to flume:
```
mv /opt/module/apache-flume-1.10.1-bin /opt/module/flume
```

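Edit log4j2.xml under conf to set where logs are printed. After line 53, insert an AppenderRef for the Console appender:

<!-- line 53, existing: <AppenderRef ref="LogFile" /> -->
<!-- line 54, inserted: -->
<AppenderRef ref="Console" />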

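2 Flume Getting-Started Examples

2.1 Official Example: Monitoring Port Data

1) Requirement

Use Flume to listen on a port, collect the data arriving on that port, and print it to the console.

2) Steps

Install the netcat tool: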
```
yum install -y nc
```
Check whether port 44444 is already in use:
```
netstat -nlp | grep 44444
```
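
Because `grep 44444` matches the string anywhere on the line (a PID such as 144444 would also match), a stricter check can compare only the local-address column. The following is a minimal sketch rather than part of the original steps: `port_listed` is a hypothetical helper name, and it assumes the listing carries the local address in the fourth column, as `netstat -nlt` and `ss -ltn` print it.

```shell
# port_listed PORT
# Reads a `netstat -nlt` / `ss -ltn` style listing on stdin and succeeds
# (exit status 0) only if some local address in column 4 ends in ":PORT".
port_listed() {
  awk -v suffix=":$1" '
    {
      # Compare only the tail of the local-address field with ":PORT".
      n = length($4) - length(suffix)
      if (n >= 0 && substr($4, n + 1) == suffix) found = 1
    }
    END { exit !found }'
}
```

A possible use is `netstat -nlt | port_listed 44444 && echo "port 44444 is in use"`.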
Create the Flume agent configuration file nc-flume-log.conf in the conf folder:
```
vim nc-flume-log.conf
```

Add the following content to nc-flume-log.conf:

# Define the agent's components (a1 is the agent name)
a1.sources = r1
a1.sinks = k1
a1.channels = c1

# Source configuration
# Source type: Netcat TCP Source (reads data from a network port)
a1.sources.r1.type = netcat
# Address to bind to
a1.sources.r1.bind = 0.0.0.0
# Port to listen on
a1.sources.r1.port = 44444

# Sink configuration
# Sink type: Logger Sink (writes events to the log, which goes to the console here)
a1.sinks.k1.type = logger

# Channel configuration
# Channel type: Memory Channel (stores events in memory)
a1.channels.c1.type = memory
# Channel capacity (number of events)
a1.channels.c1.capacity = 1000
# Transaction capacity (note: transactionCapacity <= capacity)
a1.channels.c1.transactionCapacity = 100

# Wire the components together
# Channel that source r1 writes to (the data a source reads is written into this channel)
a1.sources.r1.channels = c1
# Channel that sink k1 reads from (the data a sink emits is taken from this channel)
a1.sinks.k1.channel = c1
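
The constraint transactionCapacity <= capacity noted in the channel section can be checked before starting the agent. The following is a minimal sketch under stated assumptions, not a Flume feature: `get_prop` and `check_channel` are hypothetical helper names, and the parser only handles simple `key = value` lines like the ones in this file.

```shell
# get_prop FILE KEY
# Prints the value of KEY from a properties file (last occurrence wins),
# skipping comment lines and trimming blanks around the key and the value.
get_prop() {
  awk -F'=' -v key="$2" '
    /^[ \t]*#/ { next }
    {
      k = $1; gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) { v = $2; gsub(/^[ \t]+|[ \t]+$/, "", v); print v }
    }' "$1" | tail -n 1
}

# check_channel FILE AGENT CHANNEL
# Succeeds only if both values are set and transactionCapacity <= capacity.
check_channel() {
  chk_cap=$(get_prop "$1" "$2.channels.$3.capacity")
  chk_tx=$(get_prop "$1" "$2.channels.$3.transactionCapacity")
  [ -n "$chk_cap" ] && [ -n "$chk_tx" ] && [ "$chk_tx" -le "$chk_cap" ] 2>/dev/null
}
```

For the file above, `check_channel conf/nc-flume-log.conf a1 c1 || echo "transactionCapacity exceeds capacity"` would stay silent, since 100 <= 1000.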

Start Flume first so that it listens on the port.

Option 1:
```
flume-ng agent -n a1 -c conf -f conf/nc-flume-log.conf -Dflume.root.logger=INFO,console
```
Option 2:
```
flume-ng agent --conf conf --conf-file conf/nc-flume-log.conf --name a1 -Dflume.root.logger=INFO,console
```
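Parameters:

--conf/-c: the directory that holds the configuration files, here conf/.

--name/-n: the name of the agent, here a1.

--conf-file/-f: the configuration file Flume reads for this run, here nc-flume-log.conf in the conf folder.

-Dflume.root.logger=INFO,console: -D overrides the flume.root.logger property at runtime and sets the console log level to INFO. Log levels include debug, info, warn, and error. Because the logging settings were already changed in log4j2.xml (section 1.2), this option does not need to be repeated.

Use netcat to send data to port 44444 on the local machine: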
```
nc localhost 44444
```

Observe the received data in the Flume console:

Event: { headers:{} body: 31 30                                           10 }

Source code behind the event output

The process method of LoggerSink:

if (event != null) {
    if (logger.isInfoEnabled()) {
        logger.info("Event: " + EventHelper.dumpEvent(event, maxBytesToLog));
    }
}

Return value of dumpEvent: buffer is a fixed-width string whose leading part shows the ASCII codes of the body's characters in hexadecimal.

return "{ headers:" + event.getHeaders() + " body:" + buffer + " }";
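
To see why sending 10 shows up as `31 30` in the event body, the same hexadecimal ASCII codes can be reproduced from the shell. This is a minimal sketch using the standard `od` utility, not Flume's own dump code; `hex_of` is a hypothetical helper name.

```shell
# hex_of STRING
# Prints the bytes of STRING as space-separated two-digit hex values,
# similar to the leading part of the body shown by the logger sink.
hex_of() {
  printf '%s' "$1" | od -An -tx1 | tr -s ' \n' ' ' | sed 's/^ //; s/ $//'
}

hex_of "10"   # prints: 31 30
```

The characters `1` and `0` have ASCII codes 0x31 and 0x30, which matches the body logged for the event above.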