Overview
Another tool syncs MySQL data into Kafka; a Flink job then consumes the Kafka topic and writes the records into an HDFS directory. The Kafka messages are JSON:
json
{
"database": "gmall",
"table": "order_refund_info",
"type": "update",
"ts": 1675414185,
"xid": 38713,
"commit": 'true',
"data": {
"id": 6903,
"user_id": 191,
"order_id": 36103,
"sku_id": 34,
"refund_type": "1502",
"refund_num": 1,
"refund_amount": 3927,
"refund_reason_type": "1301",
"refund_reason_txt": "退款原因具体:0445847967",
"refund_status": "0705",
"create_time": "2022-06-14 11:00:53",
"operate_time": "2022-06-14 12:00:58"
},
"old": {
"refund_status": "1006",
"operate_time": "2022-06-14 12:00:53"
}
}
The data field holds the MySQL row, and ts is the capture time as a Unix timestamp. This ts is normally later than the row's operate_time. The requirement is to write each record into the HDFS partition directory that corresponds to the data time, at day granularity, e.g. /warehouse/order_refund_info/dtime=20220614.
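In other words, the job just has to turn each record's operate_time into a yyyyMMdd string and use it as the dtime= partition value. Here is a minimal, standalone sketch of that conversion (the class and variable names are illustrative only, not part of the project):
java
import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class PartitionDateDemo {
    public static void main(String[] args) throws ParseException {
        // Take the row's operate_time ...
        String operateTime = "2022-06-14 12:00:58";
        Date parsed = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(operateTime);
        // ... and reformat it as the day-granularity partition value.
        String dtime = new SimpleDateFormat("yyyyMMdd").format(parsed);
        System.out.println("/warehouse/order_refund_info/dtime=" + dtime); // dtime=20220614
    }
}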
Implementation
Environment: Flink 1.17.1, Java 1.8, Hadoop client 3.3.1 (as declared in the pom below); the Kafka client library is pulled in transitively by flink-connector-kafka.
Maven configuration
xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0"
xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>org.example</groupId>
<artifactId>kafka2hdfs</artifactId>
<version>1.0-SNAPSHOT</version>
<properties>
<maven.compiler.source>8</maven.compiler.source>
<maven.compiler.target>8</maven.compiler.target>
<project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
<flink.version>1.17.1</flink.version>
</properties>
<dependencies>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-streaming-java</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-clients</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-files</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-kafka</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-datagen</artifactId>
<version>${flink.version}</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>mysql</groupId>
<artifactId>mysql-connector-java</artifactId>
<version>8.0.27</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.flink</groupId>
<artifactId>flink-connector-jdbc</artifactId>
<version>1.17-SNAPSHOT</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.12.5</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-client</artifactId>
<version>3.3.1</version>
<scope>provided</scope>
</dependency>
<dependency>
<groupId>com.alibaba</groupId>
<artifactId>fastjson</artifactId>
<version>1.2.78</version>
</dependency>
</dependencies>
<repositories>
<repository>
<id>apache-snapshots</id>
<name>apache snapshots</name>
<url>https://repository.apache.org/content/repositories/snapshots/</url>
<!-- <url>https://maven.aliyun.com/repository/apache-snapshots</url>-->
</repository>
</repositories>
<build>
<plugins>
<plugin>
<artifactId>maven-compiler-plugin</artifactId>
<version>3.6.0</version>
<configuration>
<source>1.8</source>
<target>1.8</target>
</configuration>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-shade-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<phase>package</phase>
<goals>
<goal>shade</goal>
</goals>
<configuration>
<transformers>
<transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
</transformer>
<transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
<resource>reference.conf</resource>
</transformer>
</transformers>
<relocations>
<relocation>
<pattern>org.codehaus.plexus.util</pattern>
<shadedPattern>org.shaded.plexus.util</shadedPattern>
<excludes>
<exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
<exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
</excludes>
</relocation>
</relocations>
</configuration>
</execution>
</executions>
</plugin>
</plugins>
</build>
</project>
Main program
The main program is fairly simple: it consumes from Kafka and writes the records to HDFS.
java
package kafka2hdfs;
import com.alibaba.fastjson.JSON;
import com.alibaba.fastjson.JSONObject;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.api.common.serialization.SimpleStringEncoder;
import org.apache.flink.api.common.serialization.SimpleStringSchema;
import org.apache.flink.api.java.utils.ParameterTool;
import org.apache.flink.configuration.MemorySize;
import org.apache.flink.connector.file.sink.FileSink;
import org.apache.flink.connector.kafka.source.KafkaSource;
import org.apache.flink.connector.kafka.source.enumerator.initializer.OffsetsInitializer;
import org.apache.flink.core.fs.Path;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.datastream.DataStreamSource;
import org.apache.flink.streaming.api.datastream.SingleOutputStreamOperator;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.streaming.api.functions.sink.filesystem.OutputFileConfig;
import org.apache.flink.streaming.api.functions.sink.filesystem.rollingpolicies.DefaultRollingPolicy;
import utils.Config;
import utils.WareHousePath;
import utils.str2Date;
import utils.warehouseAssigner;
import java.time.Duration;
import java.time.ZoneId;
import java.util.HashMap;
public class kafka2hdfs2 {
public static void main(String[] args) throws Exception{
// The path to the configuration file is passed on the command line
ParameterTool parameterTool = ParameterTool.fromArgs(args);
String configPath = parameterTool.get("config");
// Parse the configuration file and read the required parameters
Config config = new Config(configPath);
HashMap<String, String> confMap = config.readData();
String hdfspath = confMap.getOrDefault("hdfs_path", "hdfs://localhost:9000/warehouse/gmall/log/");
WareHousePath wareHousePath = new WareHousePath(
hdfspath,
confMap.get("database")+"_"+confMap.get("table_name")
);
StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.setParallelism(1); // Not recommended as-is for production: parallelism is set to 1 here because the Kafka topic has only one partition. Adjust the job's parallelism to match the number of topic partitions.
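// Checkpointing is required: the FileSink only finalizes in-progress part files
// when a checkpoint completes, so without it no finished files would ever appear.
// Here a checkpoint is taken every 2 seconds with exactly-once semantics.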
env.enableCheckpointing(2000, CheckpointingMode.EXACTLY_ONCE);
String bootstrapServer = confMap.get("bootstrapservers");
String groupId = confMap.get("groupid");
String topic = confMap.get("topic");
String appName = confMap.getOrDefault("appname", "flinkApplication");
KafkaSource<String> kafkaSource = KafkaSource.<String>builder()
.setBootstrapServers(bootstrapServer)
.setGroupId(groupId)
.setTopics(topic)
.setValueOnlyDeserializer(new SimpleStringSchema())
.setStartingOffsets(OffsetsInitializer.latest())
.build();
DataStreamSource<String> kafkaDS = env.fromSource(kafkaSource, WatermarkStrategy.noWatermarks(), "kafkaSource");
SingleOutputStreamOperator<String> mappedDS = kafkaDS.map(
new MapFunction<String, String>() {
@Override
public String map(String value) throws Exception {
JSONObject map = JSON.parseObject(value);
JSONObject data2 = map.getJSONObject("data");
String operateTime = data2.get("operate_time").toString();
str2Date dateobj = new str2Date(operateTime);
String dtime = dateobj.getDate("yyyyMMdd");
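// Overwrite ts with the yyyyMMdd date derived from operate_time, so the record
// written to HDFS carries the partition date instead of the capture timestamp.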
map.put("ts", dtime);
return map.toString();
}
}
);
//? Sink to the file system
//? Build the FileSink instance
String hdfs_path = wareHousePath.getPath();
Path path = new Path(hdfs_path);
String tb_name = wareHousePath.getTb_name();
FileSink<String> fs = FileSink.<String>forRowFormat(
path,
new SimpleStringEncoder<>("UTF-8")
).withOutputFileConfig(OutputFileConfig.builder()
.withPartPrefix(tb_name) //? file name prefix
.withPartSuffix(".log") //? file name suffix
.build())
//? Bucket assigner: buckets by time, which determines the output directory
.withBucketAssigner(new warehouseAssigner())
//? Rolling policy: once a condition is met, the current file is closed and a new one is started. Either condition (time or file size) triggers a roll.
.withRollingPolicy(
DefaultRollingPolicy.builder()
.withRolloverInterval(Duration.ofSeconds(10 * 60)) //? roll to a new file every 10 minutes
.withMaxPartSize(new MemorySize(1024 * 1024 * 256)) //? roll when a file reaches 256 MB
.build()
).build()
;
mappedDS.sinkTo(fs);
env.execute(appName);
}
}
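The program also relies on three small helper classes under utils that are not listed in this post: Config (reads the configuration file into a HashMap), WareHousePath (joins the warehouse base path with the table name), and str2Date (reformats the date string). Their exact code is not shown here; the following is only a rough sketch of what they might look like, assuming a key=value configuration file and SimpleDateFormat-based parsing (each class lives in its own file under utils):
java
// utils/Config.java -- assumed implementation: reads key=value lines into a HashMap
package utils;

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;

public class Config {
    private final String path;

    public Config(String path) { this.path = path; }

    public HashMap<String, String> readData() throws IOException {
        HashMap<String, String> conf = new HashMap<>();
        for (String line : Files.readAllLines(Paths.get(path))) {
            line = line.trim();
            if (line.isEmpty() || line.startsWith("#")) continue;
            int idx = line.indexOf('=');
            if (idx > 0) conf.put(line.substring(0, idx).trim(), line.substring(idx + 1).trim());
        }
        return conf;
    }
}

// utils/WareHousePath.java -- assumed implementation: joins the base path and table name
package utils;

public class WareHousePath {
    private final String basePath;
    private final String tbName;

    public WareHousePath(String basePath, String tbName) {
        this.basePath = basePath.endsWith("/") ? basePath : basePath + "/";
        this.tbName = tbName;
    }

    public String getPath() { return basePath + tbName; }

    public String getTb_name() { return tbName; }
}

// utils/str2Date.java -- assumed implementation: parses "yyyy-MM-dd HH:mm:ss" and reformats it
package utils;

import java.text.ParseException;
import java.text.SimpleDateFormat;
import java.util.Date;

public class str2Date {
    private final Date date;

    public str2Date(String source) throws ParseException {
        this.date = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss").parse(source);
    }

    public String getDate(String pattern) {
        return new SimpleDateFormat(pattern).format(date);
    }
}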
Overriding the BucketAssigner
To write to dynamic HDFS directories, you need to override the getBucketId method of BucketAssigner. The usual approach buckets by the current system time:
java
.withBucketAssigner(new DateTimeBucketAssigner<>("yyyyMMdd", ZoneId.systemDefault()))
But the requirement is to partition by the data time, so getBucketId is overridden as follows:
java
package utils;
import org.apache.flink.core.io.SimpleVersionedSerializer;
import org.apache.flink.streaming.api.functions.sink.filesystem.BucketAssigner;
import com.alibaba.fastjson.JSONObject;
import com.alibaba.fastjson.JSON;
import org.apache.flink.streaming.api.functions.sink.filesystem.bucketassigners.SimpleVersionedStringSerializer;
import utils.str2Date;
import java.text.ParseException;
public class warehouseAssigner implements BucketAssigner<String, String> {
@Override
public String getBucketId(String s, Context context) {
// s is the record itself, a JSON string; parse it with fastjson and pull out the date
JSONObject map = JSON.parseObject(s);
JSONObject data2 = map.getJSONObject("data");
String operateTime = data2.get("operate_time").toString();
str2Date dateobj = null;
String dtime;
try {
dateobj = new str2Date(operateTime);
dtime = dateobj.getDate("yyyyMMdd");
} catch (Exception e) { // ParseException from str2Date, or anything else unexpected
throw new RuntimeException(e);
}
return "dtime=" + dtime; // 拼接成需求的格式返回
}
@Override
public SimpleVersionedSerializer<String> getSerializer() {
// This serializer instance must be returned, otherwise the sink will not work properly
return SimpleVersionedStringSerializer.INSTANCE;
}
}
Final result
Common issues
Writes to the HDFS directory fail: first make sure the HDFS path is correct, including host and port; you can check fs.defaultFS in Hadoop's core-site.xml to confirm the configured URL.
The other common cause is HDFS directory permissions. The simplest fix is to make the target directory writable by all users; alternatively, configure the appropriate user and permissions.
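If the cluster uses simple authentication (no Kerberos), one common workaround is to have the job access HDFS as a user that already has write permission by setting HADOOP_USER_NAME before any HDFS access happens. A minimal sketch (the user name is only an example):
java
// Assumption: simple authentication (no Kerberos). Put this at the very top of main(),
// before the job touches HDFS; alternatively export HADOOP_USER_NAME in the environment
// of the process that runs the job. "hdfs" here is only an example user name.
System.setProperty("HADOOP_USER_NAME", "hdfs");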