Flink: Consuming Kafka and Writing to Dynamic HDFS Directories

Overview

An external tool syncs MySQL data into Kafka; Flink then consumes the Kafka topic and writes the data into HDFS directories. The Kafka messages are JSON:

json
{
    "database": "gmall",
    "table": "order_refund_info",
    "type": "update",
    "ts": 1675414185,
    "xid": 38713,
    "commit": true,
    "data": {
        "id": 6903,
        "user_id": 191,
        "order_id": 36103,
        "sku_id": 34,
        "refund_type": "1502",
        "refund_num": 1,
        "refund_amount": 3927,
        "refund_reason_type": "1301",
        "refund_reason_txt": "退款原因具体:0445847967",
        "refund_status": "0705",
        "create_time": "2022-06-14 11:00:53",
        "operate_time": "2022-06-14 12:00:58"
    },
    "old": {
        "refund_status": "1006",
        "operate_time": "2022-06-14 12:00:53"
    }
}

The data field holds the MySQL row, and ts is the capture time as a Unix timestamp. By default ts is later than the row's data time operate_time. The requirement is to write each record into the HDFS partition directory that matches the data time, at day granularity, e.g. /warehouse/order_refund_info/dtime=20220614.
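
For example, the data time "2022-06-14 12:00:58" maps to the day-granularity partition value 20220614. Below is a minimal sketch of that conversion with SimpleDateFormat; the str2Date helper used later in the code is assumed to wrap something similar:

java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class DtimeDemo {
    public static void main(String[] args) throws ParseException {
        // parse the MySQL-style timestamp carried in the "data" block of the Kafka message
        Date operateTime = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse("2022-06-14 12:00:58");
        // reformat it at day granularity to build the partition directory name
        String dtime = new SimpleDateFormat("yyyyMMdd").format(operateTime);
        System.out.println("/warehouse/order_refund_info/dtime=" + dtime);  // /warehouse/order_refund_info/dtime=20220614
    }
}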

Implementation

Flink: 1.17  Java: 1.8  Kafka:  Hadoop:

Maven configuration

xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
    <modelVersion>4.0.0</modelVersion>

    <groupId>org.example</groupId>
    <artifactId>kafka2hdfs</artifactId>
    <version>1.0-SNAPSHOT</version>

    <properties>
        <maven.compiler.source>8</maven.compiler.source>
        <maven.compiler.target>8</maven.compiler.target>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <flink.version>1.17.1</flink.version>
    </properties>
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-clients</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-files</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-datagen</artifactId>
            <version>${flink.version}</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>mysql</groupId>
            <artifactId>mysql-connector-java</artifactId>
            <version>8.0.27</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-jdbc</artifactId>
            <version>1.17-SNAPSHOT</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.fasterxml.jackson.core</groupId>
            <artifactId>jackson-databind</artifactId>
            <version>2.12.5</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>3.3.1</version>
            <scope>provided</scope>
        </dependency>
        <dependency>
            <groupId>com.alibaba</groupId>
            <artifactId>fastjson</artifactId>
            <version>1.2.78</version>
        </dependency>

    </dependencies>
    <repositories>
        <repository>
            <id>apache-snapshots</id>
            <name>apache snapshots</name>
            <url>https://repository.apache.org/content/repositories/snapshots/</url>
<!--            <url>https://maven.aliyun.com/repository/apache-snapshots</url>-->
        </repository>
    </repositories>
    <build>
        <plugins>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <version>3.6.0</version>
                <configuration>
                    <source>1.8</source>
                    <target>1.8</target>
                </configuration>
            </plugin>
            <plugin>
                <groupId>org.apache.maven.plugins</groupId>
                <artifactId>maven-shade-plugin</artifactId>
                <version>3.1.0</version>
                <executions>
                    <execution>
                        <phase>package</phase>
                        <goals>
                            <goal>shade</goal>
                        </goals>
                        <configuration>
                            <transformers>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                                </transformer>
                                <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                                    <resource>reference.conf</resource>
                                </transformer>
                            </transformers>
                            <relocations>
                                <relocation>
                                    <pattern>org.codehaus.plexus.util</pattern>
                                    <shadedPattern>org.shaded.plexus.util</shadedPattern>
                                    <excludes>
                                        <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                                        <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                                    </excludes>
                                </relocation>
                            </relocations>
                        </configuration>
                    </execution>
                </executions>
            </plugin>
        </plugins>
    </build>
</project>

Main program

The main program is straightforward: it consumes the Kafka topic and writes the records to HDFS.

java
package kafka2hdfs;

import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;

import utils.Config;
import utils.WareHousePath;
import utils.str2Date;
import utils.warehouseAssigner;

import java.time.Duration;
import java.time.ZoneId;
import java.util.HashMap;

public class kafka2hdfs2 {
    public static void main(String[] args) throws Exception{
        // The config file path is passed on the command line (--config <path>)
        ParameterTool parameterTool = ParameterTool.fromArgs(args);
        String configPath = parameterTool.get("config");
        // Parse the config file and extract the parameters we need
        Config config = new Config(configPath);
        HashMap<String, String> confMap = config.readData();
        String hdfspath = confMap.getOrDefault("hdfs_path", "hdfs://localhost:9000/warehouse/gmall/log/");
        WareHousePath wareHousePath = new WareHousePath(
                hdfspath,
                confMap.get("database")+"_"+confMap.get("table_name")
        );

        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        env.setParallelism(1);  // Not recommended as-is for production. It is 1 here because the topic has a single partition; scale the parallelism with the number of Kafka partitions.
        env.enableCheckpointing(2000, CheckpointingMode.EXACTLY_ONCE);

        String bootstrapServer = confMap.get("bootstrapservers");
        String groupId = confMap.get("groupid");
        String topic = confMap.get("topic");
        String appName = confMap.getOrDefault("appname", "flinkApplication");

        KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
                .setBootstrapServers(bootstrapServer)
                .setGroupId(groupId)
                .setTopics(topic)
                .setValueOnlyDeserializer(new SimpleStringSchema())
                .setStartingOffsets(OffsetsInitializer.latest())
                .build();
        DataStreamSource<String> kafkaDS = env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafkaSource");

        SingleOutputStreamOperator<String> mappedDS = kafkaDS.map(
                new MapFunction<String, String>() {
                    @Override
                    public String map(String value) throws Exception {

                        JSONObject map = JSON.parseObject(value);
                        JSONObject data2 = map.getJSONObject("data");
                        String operateTime = data2.get("operate_time").toString();
                        str2Date dateobj = new str2Date(operateTime);
                        String dtime = dateobj.getDate("yyyyMMdd");
                        map.put("ts", dtime);  // overwrite ts with the day-granularity data date (yyyyMMdd)
                        return map.toString();
                    }
                }
        );

        //? Sink to the file system
        //? Build the FileSink
        String hdfs_path = wareHousePath.getPath();
        Path path = new Path(hdfs_path);
        String tb_name = wareHousePath.getTb_name();
        FileSink<String> fs = FileSink.<String>forRowFormat(
                        path,
                        new SimpleStringEncoder<>("UTF-8")
                ).withOutputFileConfig(OutputFileConfig.builder()
                        .withPartPrefix(tb_name)  //? part-file name prefix
                        .withPartSuffix(".log")  //? part-file name suffix
                        .build())
                //? Bucket assigner: bucket by (data) time -> determines the output directory
                .withBucketAssigner(new warehouseAssigner())
                //? Rolling policy: once either condition is met (time or file size), the current file is closed and a new one is started
                .withRollingPolicy(
                        DefaultRollingPolicy.builder()
                                .withRolloverInterval(Duration.ofSeconds(10 * 60))  //? roll to a new file every 10 minutes
                                .withMaxPartSize(new MemorySize(1024 * 1024 * 256))  //? or once the current file reaches 256 MB
                                .build()
                ).build()
                ;
        mappedDS.sinkTo(fs);
        env.execute(appName);
    }

}
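
The utils helper classes imported above (Config, WareHousePath, str2Date) are not included in the post. The sketch below is a hypothetical reconstruction, inferred only from how they are called in the main program; it assumes a plain key=value config file with the keys read there (hdfs_path, database, table_name, bootstrapservers, groupid, topic, appname) and may differ from the actual implementation:

java
// Hypothetical reconstructions, one file each under src/main/java/utils/.

// utils/Config.java -- reads a key=value config file (e.g. "topic=order_refund_info") into a map
package utils;

import java.io.BufferedReader;
import java.io.FileReader;
import java.util.HashMap;

public class Config {
    private final String path;

    public Config(String path) { this.path = path; }

    public HashMap<String, String> readData() throws Exception {
        HashMap<String, String> conf = new HashMap<>();
        try (BufferedReader reader = new BufferedReader(new FileReader(path))) {
            String line;
            while ((line = reader.readLine()) != null) {
                int idx = line.indexOf('=');
                if (idx > 0) {
                    conf.put(line.substring(0, idx).trim(), line.substring(idx + 1).trim());
                }
            }
        }
        return conf;
    }
}

// utils/WareHousePath.java -- joins the HDFS base path and the table directory name
package utils;

public class WareHousePath {
    private final String basePath;
    private final String tbName;

    public WareHousePath(String basePath, String tbName) {
        this.basePath = basePath;
        this.tbName = tbName;
    }

    public String getPath() {
        return basePath.endsWith("/") ? basePath + tbName : basePath + "/" + tbName;
    }

    public String getTb_name() { return tbName; }
}

// utils/str2Date.java -- parses "yyyy-MM-dd HH:mm:ss" strings and reformats them
package utils;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class str2Date {
    private final Date date;

    public str2Date(String s) throws ParseException {
        this.date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(s);
    }

    public String getDate(String pattern) {
        return new SimpleDateFormat(pattern).format(date);
    }
}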

Overriding the BucketAssigner

To get dynamic HDFS directories, implement BucketAssigner and override its getBucketId method. The standard approach buckets by the current system time:

java
.withBucketAssigner(new DateTimeBucketAssigner<>("yyyyMMdd", ZoneId.systemDefault()))

But the requirement here is to partition by the data time, so getBucketId is overridden as follows:

java
package utils;

import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.JSON;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;


public class warehouseAssigner implements BucketAssigner<String, String> {
    @Override
    public String getBucketId(String s, Context context) {
        // s is the record itself (a JSON string); parse it with fastjson and extract the data time
        JSONObject map = JSON.parseObject(s);
        JSONObject data2 = map.getJSONObject("data");
        String operateTime = data2.get("operate_time").toString();
        String dtime;
        try {
            // parse operate_time and reformat it as yyyyMMdd
            str2Date dateobj = new str2Date(operateTime);
            dtime = dateobj.getDate("yyyyMMdd");
        } catch (Exception e) {
            // rethrow the checked ParseException (or anything else) as an unchecked exception
            throw new RuntimeException(e);
        }
        return "dtime=" + dtime;  // return it in the required partition-directory format
    }

    @Override
    public SimpleVersionedSerializer<String> getSerializer() {
        // Must return this serializer instance; otherwise the sink will not work correctly
        return SimpleVersionedStringSerializer.INSTANCE;
    }
}
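
As a quick sanity check, the assigner can be exercised directly with a trimmed-down version of the sample message from the overview. getBucketId does not use the Context argument, so passing null is fine for this illustration; with the assumed str2Date behaviour this should print dtime=20220614:

java
import utils.warehouseAssigner;

public class AssignerSmokeTest {
    public static void main(String[] args) {
        warehouseAssigner assigner = new warehouseAssigner();
        // trimmed-down sample record containing only the field the assigner reads
        String sample = "{\"data\":{\"operate_time\":\"2022-06-14 12:00:58\"}}";
        System.out.println(assigner.getBucketId(sample, null));  // expected: dtime=20220614
    }
}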

Final result

Troubleshooting

Writes to the HDFS directory fail: first confirm that the HDFS path (IP, port, etc.) is correct; you can check Hadoop's core-site.xml to verify the configured URL.

Then check the HDFS directory permissions. The simplest option is to make the target directory writable by all users; alternatively, configure the appropriate user permissions.
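
If the failure is permission-related and the cluster uses simple (non-Kerberos) authentication, one common workaround is to run the job as the user that owns the target directory, for example via HADOOP_USER_NAME. A minimal sketch (the user name hdfs is only an example):

java
// Export HADOOP_USER_NAME before submitting the job, or set it very early in main(),
// before any HDFS access, so the Hadoop client authenticates as that user.
System.setProperty("HADOOP_USER_NAME", "hdfs");  // example user; use the owner of the target directory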

References

www.jianshu.com/p/22bf54da7...

www.cnblogs.com/lalalalawan...
