flink sink doris

接上文:一文说清flink从编码到部署上线

网上关于flink sink drois的例子较多,大部分不太全面,故本文详细说明,且提供完整代码。

flink doris版本对照表

1.添加依赖

xml 复制代码
<!--doris cdc-->
        <!-- 参考:"https://doris.apache.org/zh-CN/docs/ecosystem/flink-doris-connector"版本对照表。
        到"https://repo.maven.apache.org/maven2/org/apache/doris/"下载对应版本的jar-->
        <!--mvn install:install-file -Dfile=D:/maven/flink-doris-connector-1.14_2.11-1.1.1.jar -DgroupId=com.flink.doris -DartifactId=flink-doris-connector-1.14_2.11 -Dversion=1.1.1 -Dpackaging=jar-->
        <dependency>
            <groupId>com.flink.doris</groupId>
            <artifactId>flink-doris-connector-1.14_2.11</artifactId>
            <version>1.1.1</version>
        </dependency>

2.运行环境工具类

EnvUtil具体实现如下:

java 复制代码
package com.zl.utils;

import org.apache.flink.configuration.Configuration;
import org.apache.flink.contrib.streaming.state.EmbeddedRocksDBStateBackend;
import org.apache.flink.streaming.api.CheckpointingMode;
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.bridge.java.StreamTableEnvironment;

import java.time.Duration;
import java.time.ZoneOffset;
import java.util.concurrent.TimeUnit;

/**
 * EnvUtil
 * @description:
 */
public class EnvUtil {
    /**
     * 设置flink执行环境
     * @param parallelism 并行度
     */
    public static StreamExecutionEnvironment setFlinkEnv(int parallelism) {
        // System.setProperty("HADOOP_USER_NAME", "用户名") 对应的是 hdfs文件系统目录下的路径:/user/用户名的文件夹名,本文为root
        System.setProperty("HADOOP_USER_NAME", "root");
        Configuration conf = new Configuration();
        conf.setInteger("rest.port", 1000);
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment(conf);

        if (parallelism >0 ){
            //设置并行度
            env.setParallelism(parallelism);
        } else {
            env.setParallelism(1);// 默认1
        }

        // 添加重启机制
//        env.setRestartStrategy(RestartStrategies.fixedDelayRestart(50, Time.minutes(6)));
        // 没有这个配置,会导致"Flink 任务没报错,但是无法同步数据到doris"。
        // 启动checkpoint,设置模式为精确一次 (这是默认值),10*60*1000=60000
        env.enableCheckpointing(60000, CheckpointingMode.EXACTLY_ONCE);
        //rocksdb状态后端,启用增量checkpoint
        env.setStateBackend(new EmbeddedRocksDBStateBackend(true));
        //设置checkpoint路径
        CheckpointConfig checkpointConfig = env.getCheckpointConfig();

        // 同一时间只允许一个 checkpoint 进行(默认)
        checkpointConfig.setMaxConcurrentCheckpoints(1);
        //最小间隔,10*60*1000=60000
        checkpointConfig.setMinPauseBetweenCheckpoints(60000);
        // 取消任务后,checkpoint仍然保存
        checkpointConfig.enableExternalizedCheckpoints(CheckpointConfig.ExternalizedCheckpointCleanup.RETAIN_ON_CANCELLATION);
        //checkpoint容忍失败的次数
        checkpointConfig.setTolerableCheckpointFailureNumber(5);
        //checkpoint超时时间 默认10分钟
        checkpointConfig.setCheckpointTimeout(TimeUnit.MINUTES.toMillis(10));
        //禁用operator chain(方便排查反压)
        env.disableOperatorChaining();
        return env;
    }
}

3.CDC实现

相关sql详见文末代码:"resources/doris"。

3.1 创建doris数据库脚本

sql 复制代码
CREATE DATABASE IF NOT EXISTS `flink_test`;
USE `flink_test`;

DROP TABLE IF EXISTS `rv_table`;
CREATE TABLE `rv_table` (`dt` date NULL COMMENT '分区时间',
                         `uuid` varchar(30) NULL COMMENT '',
                         `report_time` bigint(20) NULL COMMENT '过车时间'
) ENGINE=OLAP
DUPLICATE KEY(`dt`, `uuid`)
COMMENT 'RV 数据'
PARTITION BY RANGE(`dt`)
()
DISTRIBUTED BY HASH(`uuid`) BUCKETS 10
PROPERTIES (
"replication_allocation" = "tag.location.default: 1",
"is_being_synced" = "false",
"dynamic_partition.enable" = "true",
"dynamic_partition.time_unit" = "DAY",
"dynamic_partition.time_zone" = "Asia/Shanghai",
"dynamic_partition.start" = "-2147483648",
"dynamic_partition.end" = "3",
"dynamic_partition.prefix" = "p_rv_table",
"dynamic_partition.replication_allocation" = "tag.location.default: 1",
"dynamic_partition.buckets" = "32",
"dynamic_partition.create_history_partition" = "false",
"dynamic_partition.history_partition_num" = "-1",
"dynamic_partition.hot_partition_num" = "0",
"dynamic_partition.reserved_history_periods" = "NULL",
"dynamic_partition.storage_policy" = "",
"storage_format" = "V2",
"light_schema_change" = "true",
"disable_auto_compaction" = "false",
"enable_single_replica_compaction" = "false"
);

3.2 创建mysql数据库脚本

sql 复制代码
CREATE DATABASE IF NOT EXISTS `flink_test1`;
USE `flink_test1`;

SET FOREIGN_KEY_CHECKS=0;

-- ----------------------------
-- Table structure for rv_table
-- ----------------------------
DROP TABLE IF EXISTS `rv_table`;
CREATE TABLE `rv_table` (
  `dt` varchar(10) NOT NULL,
  `uuid` varchar(30) DEFAULT NULL,
  `report_time` bigint(20) DEFAULT NULL,
  PRIMARY KEY (`dt`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8mb4;

注意:"dt"是varchar类型,开始尝试使用"date"类型,运行正常。但是本来是"2024-12-20"结果到doris是"2020-07-09",由于不在动态分区,保存失败。后续继续定位......

3.3 核心代码实现

java 复制代码
package com.zl.doris;

import com.alibaba.fastjson.JSONObject;
import com.ververica.cdc.connectors.mysql.source.MySqlSource;
import com.ververica.cdc.connectors.mysql.table.StartupOptions;
import com.ververica.cdc.debezium.JsonDebeziumDeserializationSchema;
import com.zl.utils.EnvUtil;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.JsonDebeziumSchemaSerializer;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.eventtime.WatermarkStrategy;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.LocalDate;
import java.util.Arrays;
import java.util.List;
import java.util.Properties;

public class DorisExampleCDC {
    public static void main(String[] args) throws Exception {

        // 配置运行环境,并行度1
        StreamExecutionEnvironment env = EnvUtil.setFlinkEnv(1);
        // 程序间隔离,每个程序单独设置
        env.getCheckpointConfig().setCheckpointStorage("hdfs://10.86.97.191:9000/flinktest/DorisExampleCDC");
        env.setRuntimeMode(RuntimeExecutionMode.STREAMING);// 跟DataSteam、RowData不一样。


        DorisOptions.Builder dorisOptions = DorisOptions.builder();
        dorisOptions.setFenodes("10.86.97.191:8130")//FE_IP:HTTP_PORT
                .setTableIdentifier("flink_test.rv_table")// db.table
                .setUsername("root")// root
                .setPassword("pwd");// password


        Properties properties = new Properties();
        // 上游是 json 写入时,需要开启配置
        properties.setProperty("format", "json");
        properties.setProperty("read_json_by_line", "true");
        DorisExecutionOptions.Builder  executionBuilder = DorisExecutionOptions.builder();
        executionBuilder.setLabelPrefix("doris_example_cdc3") //streamload label prefix
                .setDeletable(true)// 此处配置与DataSteam、RowData不同
                .setStreamLoadProp(properties);

        DorisSink.Builder<String> builder = DorisSink.builder();
        builder.setDorisReadOptions(DorisReadOptions.builder().build())
                .setDorisExecutionOptions(executionBuilder.build())
                .setDorisOptions(dorisOptions.build())
                .setSerializer(JsonDebeziumSchemaSerializer.builder().setDorisOptions(dorisOptions.build()).build());

        /*
        // 这种写法,整个运行过程都正常,但是最后就是没把数据存储到doris,当然doris官方示例,也没这么写。具体原因后面再找......
        JSONObject rvJsonObject = new JSONObject();
        rvJsonObject.put("dt","2024-12-21");
        rvJsonObject.put("uuid","cdc-2");
        rvJsonObject.put("report_time",1733881971621L);
        env.fromElements(JSONObject.toJSONString(rvJsonObject)).sinkTo(builder.build()).name("doris_sink").uid("doris_sink");*/

        /// mysql source
        List<String> SYNC_TABLES = Arrays.asList("flink_test.rv_table");
        MySqlSource<String> mySqlSource = MySqlSource.<String>builder()
                .hostname("10.86.37.169")
                .port(3306)
                .databaseList("flink_test")
                .tableList(String.join(",", SYNC_TABLES))
                .username("root")
                .password("pwd")// 记得修改为实际密码
                .startupOptions(StartupOptions.initial())
                .deserializer(new JsonDebeziumDeserializationSchema())
                .build();

        env.fromSource(mySqlSource, WatermarkStrategy.noWatermarks(), "MySQL Source").sinkTo(builder.build()).name("doris_sink").uid("doris_sink");
        env.execute("dorisSinkJob");
    }// main

}

3.4 运行效果

控制台输出,如下图所示:

flink web UI:

cdc成功后mysql与doris数据一致如下图所示:

4.DataSteam实现

注意:相关数据库脚本,参考CDC部分说明。

java 复制代码
package com.zl.doris;

import com.alibaba.fastjson.JSONObject;
import com.zl.utils.EnvUtil;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.SimpleStringSerializer;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

import java.time.LocalDate;
import java.util.Properties;

public class DorisExampleDataSteam {
    public static void main(String[] args) throws Exception {

        // 配置运行环境,并行度1
        StreamExecutionEnvironment env = EnvUtil.setFlinkEnv(1);
        // 程序间隔离,每个程序单独设置
        env.getCheckpointConfig().setCheckpointStorage("hdfs://10.86.97.191:9000/flinktest/DorisExampleDataSteam");
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);// 没有这个配置,会导致"Flink 任务没报错,但是无法同步数据到doris"。

        DorisOptions.Builder dorisOptions = DorisOptions.builder();
        dorisOptions.setFenodes("10.86.97.191:8130")//FE_IP:HTTP_PORT
                .setTableIdentifier("flink_test.rv_table")// db.table
                .setUsername("root")// root
                .setPassword("pwd");// password


        Properties properties = new Properties();
        // 上游是 json 写入时,需要开启配置
        properties.setProperty("format", "json");
        properties.setProperty("read_json_by_line", "true");
        DorisExecutionOptions.Builder  executionBuilder = DorisExecutionOptions.builder();
        executionBuilder.setLabelPrefix("doris_example_datasteam") //streamload label prefix
                .setDeletable(false)
                .setStreamLoadProp(properties);

        DorisSink.Builder<String> builder = DorisSink.builder();
        builder.setDorisReadOptions(DorisReadOptions.builder().build())
                .setDorisExecutionOptions(executionBuilder.build())
                .setSerializer(new SimpleStringSerializer()) //serialize according to string
                .setDorisOptions(dorisOptions.build());

        JSONObject rvJsonObject = new JSONObject();
        rvJsonObject.put("dt","2024-12-20");// 日期取当天
        rvJsonObject.put("uuid","data-stream-1");
        rvJsonObject.put("report_time",1733881971621L);

        env.fromElements(JSONObject.toJSONString(rvJsonObject)).sinkTo(builder.build()).name("doris_sink").uid("doris_sink");

        env.execute("dorisSinkJob");
    }// main

}

5.RowData实现

注意:相关数据库脚本,参考CDC部分说明。

java 复制代码
package com.zl.doris;

import com.zl.utils.EnvUtil;
import org.apache.doris.flink.cfg.DorisExecutionOptions;
import org.apache.doris.flink.cfg.DorisOptions;
import org.apache.doris.flink.cfg.DorisReadOptions;
import org.apache.doris.flink.sink.DorisSink;
import org.apache.doris.flink.sink.writer.RowDataSerializer;
import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.common.functions.MapFunction;
import org.apache.flink.streaming.api.datastream.DataStream;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.data.GenericRowData;
import org.apache.flink.table.data.RowData;
import org.apache.flink.table.data.StringData;
import org.apache.flink.table.types.DataType;

import java.time.LocalDate;
import java.util.Properties;

public class DorisExampleRowData {
    public static void main(String[] args) throws Exception {

        // 配置运行环境,并行度1
        StreamExecutionEnvironment env = EnvUtil.setFlinkEnv(1);
        // 程序间隔离,每个程序单独设置
        env.getCheckpointConfig().setCheckpointStorage("hdfs://10.86.97.191:9000/flinktest/DorisExampleRowData");
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);// 没有这个配置,会导致"Flink 任务没报错,但是无法同步数据到doris"。

        DorisOptions.Builder dorisOptions = DorisOptions.builder();
        dorisOptions.setFenodes("10.86.97.191:8130")//FE_IP:HTTP_PORT
                .setTableIdentifier("flink_test.rv_table")// db.table
                .setUsername("root")// root
                .setPassword("pwd");// password


        Properties properties = new Properties();
        // 上游是 json 写入时,需要开启配置
        properties.setProperty("format", "json");
        properties.setProperty("read_json_by_line", "true");
        DorisExecutionOptions.Builder  executionBuilder = DorisExecutionOptions.builder();
        executionBuilder.setLabelPrefix("doris_example_rowdata") //streamload label prefix
                .setDeletable(false)
                .setStreamLoadProp(properties);

        String[] fields = {"dt", "uuid", "report_time"};
        DataType[] types = {DataTypes.DATE(),DataTypes.VARCHAR(30), DataTypes.BIGINT() };
        DorisSink.Builder<RowData> builder = DorisSink.builder();
        builder.setDorisReadOptions(DorisReadOptions.builder().build())
                .setDorisExecutionOptions(executionBuilder.build())
                .setSerializer(RowDataSerializer.builder()    //serialize according to rowdata
                        .setFieldNames(fields)
                        .setType("json")           //json format
                        .setFieldType(types).build())
                .setDorisOptions(dorisOptions.build());

        DataStream<RowData> source = env.fromElements("")
                .map(new MapFunction<String, RowData>() {
                    @Override
                    public RowData map(String value) throws Exception {
                        GenericRowData genericRowData = new GenericRowData(3);
                        Long myDate= LocalDate.of(2024,12,21).toEpochDay();// 日期取当天
                        genericRowData.setField(0, myDate.intValue());// 计算指定日期与1970-01-01相差的天数
                        genericRowData.setField(1, StringData.fromString("row-data-1"));
                        genericRowData.setField(2, 1733881971621L);
                        return genericRowData;
                    }
                });

        source.sinkTo(builder.build()).setParallelism(1).name("doris_sink").uid("doris_sink").name("doris_sink").uid("doris_sink");

        env.execute("dorisSinkJob");
    }// main

}

6.完整代码

完整代码见:完整代码

官方文档:flink-doris-connector

首先还是要参考官方文档,其次是照着官方文档多尝试。不同产品、不同章节,编写文档水平、代码风格还是有些差异,多体会其中核心思想。

7.遇到问题

7.1 问题1

遇到问题:整个执行过程都很正常,但是最后表里面没有数据。

问题原因:插入的数据不属于任何分区。

解决方法:注意本实例中是以日期(dt)作为分区字段的,所以dt要是具体的日期,比如"2024-12-20"。

7.2 问题2

遇到问题:整个执行过程都很正常,但是最后表里面没有数据。

官方文档:Connector1.1.0 版本以前,是攒批写入的,写入均是由数据驱动,需要判断上游是否有数据写入。1.1.0 之后,依赖 Checkpoint,必须开启 Checkpoint 才能写入。

其实还要:env.setRuntimeMode(RuntimeExecutionMode.BATCH)。

7.3 问题3

问题:DorisRuntimeException: Failed to get backend via

解决:doris连接信息(ip、端口、账号、密码)错误导致。

7.4 问题4

问题:DorisRuntimeException: stream load error: [INTERNAL_ERROR]too many filtered rows

原因:所在动态分区不存在。

相关推荐
qiquandongkh7 分钟前
期权懂|期权新手入门知识:如何挑选期权活跃合约?
大数据·区块链
木易小熙23 分钟前
MySQL最左匹配原则是什么
数据库·mysql
重生之绝世牛码1 小时前
Java设计模式 —— 【结构型模式】享元模式(Flyweight Pattern) 详解
java·大数据·开发语言·设计模式·享元模式·设计原则
喝醉酒的小白1 小时前
ElasticSearch 的核心功能
大数据·elasticsearch·jenkins
苹果酱05671 小时前
Golang的文件解压技术研究与应用案例
java·vue.js·spring boot·mysql·课程设计
hu_yuchen1 小时前
从单机到微服务的转型之路
redis·mysql·微服务
知识的宝藏2 小时前
Django models中的增删改查与MySQL SQL的对应关系
sql·mysql·django·django models
路在脚下@2 小时前
MySQL的索引失效的原因有那些
数据库·mysql
新子-存在了3 小时前
linux中 mysql备份
linux·运维·mysql
蚂蚁数据AntData3 小时前
流批一体向量化计算引擎 Flex 在蚂蚁的探索和实践
大数据·数据仓库·spark·数据库架构