Author email: yugongshiye@sina.cn. Location: Huizhou, Guangdong

▲ Goals of this chapter

⚪ Master the AVRO Source;

⚪ Master the Exec Source;

⚪ Master the Spooling Directory Source;

⚪ Master the Netcat Source;

⚪ Master the Sequence Generator Source;

⚪ Master the HTTP Source;

⚪ Master the Custom Source;

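I. AVRO Source

1. Overview

The AVRO Source listens on a specified port and receives AVRO-serialized data sent by other nodes.
Combined with the AVRO Sink, it supports more complex flow topologies, including multi-hop flow, fan-in flow, and fan-out flow.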

2. Configuration properties

| Property | Description |
|----------|-------------|
| type | Must be avro |
| bind | Hostname or IP address to listen on |
| port | Port to listen on |

3. Example

Edit the configuration file and add the following:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the AVRO Source
# Must be avro
a1.sources.s1.type = avro
# Host to listen on
a1.sources.s1.bind = hadoop01
# Port to listen on
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f avrosource.conf -Dflume.root.logger=INFO,console

Create a file to send.

cd /home/software/apache-flume-1.9.0-bin/data
vim a.txt

Run the AVRO client.

../bin/flume-ng avro-client -H hadoop01 -p 8090 -F a.txt

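II. Exec Source

1. Overview

The Exec Source runs a specified command and collects the command's output as log data.
This Source can be used to monitor a file, or the result of some other operation, in real time.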

2. Configuration properties

| Property | Description |
|----------|-------------|
| type | Must be exec |
| command | Command to run and monitor |
| shell | Recommended; specifies the shell used to run the command |

3. Example

Requirement: monitor changes to the file /home/a.txt in real time.
Edit the configuration file and add the following:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Exec Source
# Must be exec
a1.sources.s1.type = exec
# Command to run
a1.sources.s1.command = tail -F /home/a.txt
# Shell used to run the command
a1.sources.s1.shell = /bin/bash -c

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f execsource.conf -Dflume.root.logger=INFO,console

Modify the file contents.

echo "hello" >> /home/a.txt

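III. Spooling Directory Source

1. Overview

The Spooling Directory Source watches a specified directory and automatically collects the contents of new files that appear in it.
By default, once a file has been collected it is renamed with the suffix .COMPLETED; the suffix can be changed with the fileSuffix property.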

2. Configuration properties

| Property | Description |
|----------|-------------|
| type | Must be spooldir |
| spoolDir | Directory to watch |
| fileSuffix | Suffix appended to a file after it has been collected; defaults to .COMPLETED |
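
Because the Source picks up any new file in the watched directory, a file should be complete before it appears there: the Flume documentation warns that a file modified after being placed in the spooling directory, or a file name that is reused, causes an error. The sketch below shows one way to respect that rule, assuming a staging directory on the same filesystem as the spooling directory. The function name drop_into_spool_dir and the staging directory are illustrative assumptions, not part of Flume.

```python
import os
import tempfile


def drop_into_spool_dir(staging_dir, spool_dir, file_name, content):
    """Write content to a staging file, then rename it into spool_dir.

    Assumption: staging_dir and spool_dir are on the same filesystem,
    so the rename is atomic and the watcher never sees a partial file.
    """
    os.makedirs(staging_dir, exist_ok=True)
    final_path = os.path.join(spool_dir, file_name)

    # Refuse to reuse a file name that already exists in the watched directory.
    if os.path.exists(final_path):
        raise FileExistsError(final_path)

    # Write the complete file outside the watched directory first.
    fd, tmp_path = tempfile.mkstemp(dir=staging_dir)
    with os.fdopen(fd, "w", encoding="utf-8") as handle:
        handle.write(content)

    # Move the finished file into the watched directory in one step.
    os.replace(tmp_path, final_path)
    return final_path
```

With the example configuration of this section, a call might look like drop_into_spool_dir("/home/flumestaging", "/home/flumedata", "a.log", "hello\n"), where /home/flumestaging is a hypothetical path.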

3. Example

Edit the configuration file and add the following:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Spooling Directory Source
# Must be spooldir
a1.sources.s1.type = spooldir
# Directory to watch
a1.sources.s1.spoolDir = /home/flumedata

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f spoolingdirsource.conf -Dflume.root.logger=INFO,console

IV. Netcat Source

1. Overview

Since Flume 1.9, the Netcat Source comes in two variants: the Netcat TCP Source and the Netcat UDP Source.
Unless UDP is specified, the Netcat Source listens for TCP connections.

2. Configuration properties

| Property | Description |
|----------|-------------|
| type | Use netcat to listen for TCP connections; use netcatudp to listen for UDP datagrams |
| bind | Hostname or IP address to listen on |
| port | Port to listen on |

3. Example

Edit the configuration file and add the following (using UDP as the example):

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Netcat UDP Source
# Must be netcatudp
a1.sources.s1.type = netcatudp
# Host to listen on
a1.sources.s1.bind = 0.0.0.0
# Port to listen on
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f netcatudpsource.conf -Dflume.root.logger=INFO,console

Start nc as a UDP client and type a few lines to send.

nc -u hadoop01 8090
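
If nc is not available on the sending machine, a datagram can also be sent with a few lines of Python. This is a minimal sketch using only the standard socket module; the function name send_udp_line is an illustrative assumption, and each message is terminated with a newline, as a line typed into nc would be.

```python
import socket


def send_udp_line(host, port, line):
    """Send one newline-terminated line as a single UDP datagram.

    Returns the number of bytes handed to the network stack.
    """
    payload = (line.rstrip("\n") + "\n").encode("utf-8")
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        return sock.sendto(payload, (host, port))
```

Against the agent configured in this section, the call would be send_udp_line("hadoop01", 8090, "hello"). UDP gives no delivery confirmation, so the only way to check the result is the logger output on the Flume console.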

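V. Sequence Generator Source

1. Overview

The Sequence Generator Source is essentially a sequence generator: it starts at 0 and increments by 1 for each event.
Unless configured otherwise, it keeps counting up to Long.MAX_VALUE.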

2. Configuration properties

| Property | Description |
|----------|-------------|
| type | Must be seq |
| totalEvents | Number of events to generate, which sets where the sequence stops |
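
To make the effect of totalEvents concrete, the following sketch models the sequence in plain Python. It is not Flume code. It assumes that totalEvents counts the events emitted, so a value of 10 yields the bodies 0 through 9, and it treats the default as unbounded. The function name sequence_bodies is an illustrative assumption.

```python
from itertools import count, islice


def sequence_bodies(total_events=None):
    """Yield event bodies "0", "1", "2", ... as strings.

    total_events=None models the default, which is effectively unbounded.
    """
    numbers = count(0)
    if total_events is not None:
        # Stop after total_events values have been produced.
        numbers = islice(numbers, total_events)
    for number in numbers:
        yield str(number)
```

Under that assumption, list(sequence_bodies(10)) gives the ten strings "0" to "9".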

3. Example

Edit the configuration file and add the following:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the Sequence Generator Source
# Must be seq
a1.sources.s1.type = seq
# Where the sequence stops
a1.sources.s1.totalEvents = 10

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f seqsource.conf -Dflume.root.logger=INFO,console

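VI. HTTP Source

1. Overview

The HTTP Source accepts events over HTTP, but only through POST and GET requests.
GET is intended for experimentation only, so in practice this Source is used to receive POST requests.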

2. Configuration properties

| Property | Description |
|----------|-------------|
| type | Must be http |
| port | Port to listen on |

3. Example

Edit the configuration file and add the following:

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the HTTP Source
# Must be http
a1.sources.s1.type = http
# Port to listen on
a1.sources.s1.port = 8090

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f httpsource.conf -Dflume.root.logger=INFO,console

Send a POST request.

curl -X POST -d '[{"headers":{"kind":"test","class":"bigdata"},"body":"testing"}]' http://hadoop01:8090
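
The request body is a JSON array in which every element is one event with a headers object and a body string. When events are generated by a program rather than typed by hand, the same payload can be built and posted from Python's standard library. This is a minimal sketch; the function names build_payload and post_events are illustrative assumptions, and the Content-Type header is added as a courtesy rather than taken from the curl command.

```python
import json
import urllib.request


def build_payload(events):
    """Encode (headers, body) pairs as the JSON array shown in the curl example."""
    records = [{"headers": dict(headers), "body": body} for headers, body in events]
    return json.dumps(records).encode("utf-8")


def post_events(url, events, timeout=5.0):
    """POST the events to the given URL and return the HTTP status code."""
    request = urllib.request.Request(
        url,
        data=build_payload(events),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status
```

The curl request of this section corresponds to post_events("http://hadoop01:8090", [({"kind": "test", "class": "bigdata"}, "testing")]).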

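VII. Custom Source

1. Overview

A custom Source is a class that implements one of the sub-interfaces of the Source interface: EventDrivenSource or PollableSource.

a. EventDrivenSource: an event-driven, passive Source. You have to create your own threads to obtain and process the data.

b. PollableSource: a polling, active Source. The framework provides the thread that fetches data, so you only need to decide how to process it.

Besides implementing one of these two interfaces, the custom class usually also implements the Configurable interface, whose method gives access to the properties set in the configuration file.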

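2. Steps

Create a Maven project and import the required POM dependencies.
Define a class that extends AbstractSource and implements the EventDrivenSource and Configurable interfaces.
Override the configure, start, and stop methods.
Package the finished class into a jar and place it in the lib directory of the Flume installation.
Write the configuration file, for example: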

a1.sources = s1
a1.channels = c1
a1.sinks = k1

# Configure the custom Source
# Must be the fully qualified class name
a1.sources.s1.type = cn.tedu.flume.source.AuthSource
# End of the range
a1.sources.s1.end = 100
# Increment step
a1.sources.s1.step = 5

a1.channels.c1.type = memory

a1.sinks.k1.type = logger

a1.sources.s1.channels = c1
a1.sinks.k1.channel = c1

Start Flume.

../bin/flume-ng agent -n a1 -c ../conf -f authsource.conf -Dflume.root.logger=INFO,console
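
The configure, start, and stop lifecycle of an event-driven Source can be modeled in a few lines. The sketch below is Python, not Flume code: a real Source has to be written in Java against the Flume API. The class name AuthSourceModel, the queue standing in for the channel, and the reading of end and step (count from 0 up to, but not including, end in increments of step) are all assumptions made for illustration.

```python
import queue
import threading


class AuthSourceModel:
    """Python model of an event-driven source lifecycle (not the Flume API)."""

    def __init__(self, channel):
        self.channel = channel  # stands in for the Flume channel
        self.end = 0
        self.step = 1
        self._stop_requested = threading.Event()
        self._worker = None

    def configure(self, context):
        # Read the properties that the configuration file sets on the source.
        self.end = int(context.get("end", 100))
        self.step = int(context.get("step", 1))
        if self.step <= 0:
            raise ValueError("step must be positive")

    def start(self):
        # An event-driven source owns its thread, so start() creates one.
        self._stop_requested.clear()
        self._worker = threading.Thread(target=self._run, daemon=True)
        self._worker.start()

    def _run(self):
        for value in range(0, self.end, self.step):
            if self._stop_requested.is_set():
                return
            self.channel.put(str(value))

    def stop(self):
        # Ask the worker to finish and wait until it has exited.
        self._stop_requested.set()
        if self._worker is not None:
            self._worker.join()
```

With the values from the configuration file (end = 100, step = 5), this model puts the strings "0", "5", and so on up to "95" into the queue, provided the worker is allowed to finish before stop is called.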

Big Data Course D3: Flume Sources
