211. Flume: Collecting Hive Logs on Linux in Real Time with Flume and Writing Them to HDFS (Tested, with Screenshots)

## 1. Purpose

This experiment uses Flume to collect Hive's operation log in real time and write it to HDFS.

## 2. Preparation

### (1) Install Hadoop, Hive, Flume, and the other required tools

### (2) Find the path of the Hive log file on the Linux system

    [root@hurys23 conf]# find / -name hive.log
    /home/log/hive312/hive.log

![](https://file.jishuzhan.net/article/1732227535908900866/925c0ac93f52132bd5e4e6b2536bc155.webp)

### (3) Create a `flume` folder in HDFS as the destination for the Hive logs

![](https://file.jishuzhan.net/article/1732227535908900866/4dd84bdbf2a302a772a996a281ab0ce6.webp)

## 3. Creating the Flume job file

    [root@hurys23 conf]# vi flume-file-hdfs.conf

    # Name the components on this agent
    a2.sources = r2
    a2.sinks = k2
    a2.channels = c2

    # Describe/configure the source
    a2.sources.r2.type = exec
    a2.sources.r2.command = tail -F /home/log/hive312/hive.log

    # Describe the sink
    a2.sinks.k2.type = hdfs
    a2.sinks.k2.hdfs.path = hdfs://hurys23:8020/flume/%Y%m%d/%H
    # Prefix for the uploaded files
    a2.sinks.k2.hdfs.filePrefix = logs-
    # Whether to roll folders by time
    a2.sinks.k2.hdfs.round = true
    # How many time units pass before a new folder is created
    a2.sinks.k2.hdfs.roundValue = 1
    # Time unit used for the rounding
    a2.sinks.k2.hdfs.roundUnit = hour
    # Whether to use the local timestamp
    a2.sinks.k2.hdfs.useLocalTimeStamp = true
    # Number of events to accumulate before flushing to HDFS
    a2.sinks.k2.hdfs.batchSize = 100
    # File type (DataStream is uncompressed; compressed output is also supported)
    a2.sinks.k2.hdfs.fileType = DataStream
    # How often, in seconds, a new file is started
    a2.sinks.k2.hdfs.rollInterval = 60
    # File size, in bytes, that triggers a roll (just under 128 MB)
    a2.sinks.k2.hdfs.rollSize = 134217700
    # File rolling does not depend on the number of events
    a2.sinks.k2.hdfs.rollCount = 0

    # Use a channel which buffers events in memory
    a2.channels.c2.type = memory
    a2.channels.c2.capacity = 1000
    a2.channels.c2.transactionCapacity = 100

    # Bind the source and sink to the channel
    a2.sources.r2.channels = c2
    a2.sinks.k2.channel = c2

![](https://file.jishuzhan.net/article/1732227535908900866/5f52110b784700ee6c9857ca3892dec0.webp)

Note:

1. The parameters in this configuration file are for demonstration only, just enough to get the pipeline working end to end. In a real project they need to be tuned.
2. In `a2.sources.r2.command = tail -F /home/log/hive312/hive.log`, the path is the location of hive.log on the Linux system.
3. In `a2.sinks.k2.hdfs.path = hdfs://hurys23:8020/flume/%Y%m%d/%H`, the path is the HDFS location the logs are written to.

## 4. Starting the Flume job

    [root@hurys23 flume190]# bin/flume-ng agent -n a2 -f /usr/local/hurys/dc_env/flume/flume190/conf/flume-file-hdfs.conf

![](https://file.jishuzhan.net/article/1732227535908900866/3d00612ea4037dee9925f020805617ad.webp)

## 5. HDFS files written while the Flume job is running
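Because `hdfs.path` ends in `%Y%m%d/%H` and `useLocalTimeStamp` is `true`, the folder an event lands in follows from the agent host's local time. As a rough cross-check, the sketch below turns an epoch value in milliseconds into that sub-path. It assumes GNU `date` and that the host runs in the `Asia/Shanghai` time zone; the function name `flume_subdir` and the time zone are assumptions made for illustration, not something stated in this post.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map an epoch time in milliseconds to the
# %Y%m%d/%H sub-path used in hdfs.path. Requires GNU date.
flume_subdir() {
  local epoch_ms="$1"
  local tz="${2:-Asia/Shanghai}"   # assumed time zone of the agent host
  # Drop the milliseconds, then format in the chosen time zone.
  TZ="$tz" date -d "@$((epoch_ms / 1000))" +%Y%m%d/%H
}

flume_subdir 1701757858263   # prints 20231205/14
```

The value `1701757858263` is the numeric suffix of the `logs-` file examined in this section, which appears to be a millisecond epoch value. The result can be fed to a listing such as `hdfs dfs -ls "/flume/$(flume_subdir 1701757858263)"` to look at the matching folder.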
### (1) Current time: 2023/12/5, hour 14

![](https://file.jishuzhan.net/article/1732227535908900866/b8816e90474dad42caaf3c2af5a7edf2.webp)

### (2) Under the `flume` folder in HDFS, the `20231205` folder, the `14` folder, and the logs files are created automatically from the timestamp

![](https://file.jishuzhan.net/article/1732227535908900866/4216ee9175f0ec8d45df81dd5b5d90ea.webp)

![](https://file.jishuzhan.net/article/1732227535908900866/7b6a06f7322c62883967989023481e09.webp)

### (3) Contents of a log file in HDFS, taking `logs-.1701757858263` as an example

![](https://file.jishuzhan.net/article/1732227535908900866/1466a8b19563e6eac25b908b6d40f5a0.webp)

## 6. Stopping the Flume job

First run `jps` to find the Flume process (listed as `Application`), then kill it directly.

    [root@hurys23 conf]# jps
    28385 NodeManager
    27938 SecondaryNameNode
    16642 RunJar
    27496 NameNode
    27657 DataNode
    8717 Jps
    28215 ResourceManager
    8282 Application
    [root@hurys23 conf]# kill -9 8282

![](https://file.jishuzhan.net/article/1732227535908900866/d1d22f2102464dfc94a4b35b0adcc9b5.webp)

The Hive log was collected into HDFS successfully. That is all for this demonstration; I will update the post later if needed.

One more reminder: the parameters in the Flume configuration file in this post are for demonstration only. Adjust them to your actual environment and do not copy them as-is!
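`kill -9` ends the agent immediately, so the JVM gets no chance to run its shutdown hooks and close the file it is currently writing. A gentler variant, sketched here under assumptions, picks the Flume PID out of `jps` output and sends the default `SIGTERM` instead. The function names `flume_pid` and `stop_flume` are hypothetical, and matching on the `Application` label assumes that exactly one JVM on the host shows up under that name.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `jps` output on stdin and print the PID
# whose label is exactly "Application" (how the agent appears in jps).
flume_pid() {
  awk '$2 == "Application" { print $1 }'
}

# Hypothetical helper: ask the agent to stop with SIGTERM (plain `kill`).
# Assumes a single Flume agent is running on this host.
stop_flume() {
  local pid
  pid="$(jps | flume_pid)"
  if [ -n "$pid" ]; then
    kill "$pid"
  else
    echo "no Flume agent found in jps output" >&2
    return 1
  fi
}
```

Calling `stop_flume` on the host would replace the manual `jps` and `kill -9 8282` steps. If the process is still listed by `jps` after a reasonable wait, `kill -9` remains available as a last resort.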