Writing to StarRocks from Flink DataStream: Notes from Practice

Introduction to StarRocks

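StarRocks is an "all-rounder" among new-generation OLAP engines. Its strengths are broad scenario coverage and balanced performance, and it suits full-scale data analysis at medium and large enterprises. Put simply, you can write to StarRocks much as you would write to MySQL. Its main features are listed below. (Our main reason for choosing it came down to one point: StarRocks Primary Key tables support real-time updates, which ClickHouse cannot do.)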

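Excellent aggregate query performance: a fully vectorized engine and a CBO (cost-based optimizer), with strong results on complex queries
Excellent data update capability: an efficient Primary Key model with real-time updates
Excellent high-concurrency support: designed for highly concurrent workloads
Highly compatible with the MySQL protocol and easy to use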

Writing to StarRocks with the Flink DataStream API

Official reference: Continuously load data from Apache Flink®

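To write to StarRocks from Flink, the official recommendation is the Flink connector provided by StarRocks, which is more stable and performs better (the StarRocks documentation compares it with Flink's own flink-connector-jdbc). Its basic principle: the Flink connector accumulates small batches of data in memory, then imports each batch into StarRocks in one go through Stream Load. In our code we only need to add the corresponding sink, as follows: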

// 5. Write to StarRocks (using a SinkFunction)
enrichedStream.addSink(StarRocksStateTimingSinkFactory.create());

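The corresponding sink factory looks roughly like this:

(Note: the following is how to write to a Primary Key table with flink-connector version <= 1.2.7; higher versions differ slightly.)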

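import java.sql.Timestamp;

import com.starrocks.connector.flink.StarRocksSink;
import com.starrocks.connector.flink.row.sink.StarRocksSinkOP;
import com.starrocks.connector.flink.row.sink.StarRocksSinkRowBuilder;
import com.starrocks.connector.flink.table.sink.StarRocksSinkOptions;
import org.apache.flink.streaming.api.functions.sink.SinkFunction;
import org.apache.flink.table.api.DataTypes;
import org.apache.flink.table.api.TableSchema;

// ConfigUtils and StateTimingRecord are project-specific classes (not shown here).
public final class StarRocksStateTimingSinkFactory {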

    private static final long MILLIS_PER_MINUTE = 60_000L;

    private StarRocksStateTimingSinkFactory() {
    }

    public static SinkFunction<StateTimingRecord> create() {
        TableSchema tableSchema = TableSchema.builder()
                // Primary key columns of a Primary Key table must not be null
                .field("id", DataTypes.BIGINT().notNull())
                .field("record_id", DataTypes.STRING().notNull())
                .field("create_time", DataTypes.TIMESTAMP().notNull())
                .field("is_deleted", DataTypes.INT().notNull())
                .field("biz_type", DataTypes.INT())
                .field("update_time", DataTypes.TIMESTAMP())
                // Declare the primary key explicitly
                .primaryKey("id", "record_id", "create_time")
                .build();

        StarRocksSinkOptions.Builder builder = StarRocksSinkOptions.builder()
                .withProperty("jdbc-url", ConfigUtils.getStarRocksJdbcUrl())
                .withProperty("load-url", ConfigUtils.getStarRocksLoadUrl())
                .withProperty("database-name", ConfigUtils.getStarRocksDatabase())
                .withProperty("table-name", ConfigUtils.getStarRocksStateTimeTable())
                .withProperty("username", ConfigUtils.getStarRocksUsername())
                .withProperty("password", ConfigUtils.getStarRocksPassword())
                .withProperty("sink.buffer-flush.max-rows", "500000")
                .withProperty("sink.buffer-flush.max-bytes", "94371840")
                .withProperty("sink.buffer-flush.interval-ms", "5000")
                .withProperty("sink.connect.timeout-ms", "30000")
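                // Sink parallelism. Note: the connector documentation describes this option as
                // Flink SQL only; in a DataStream job, consider also calling setParallelism(...)
                // on the result of addSink(...).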
                .withProperty("sink.parallelism", String.valueOf(ConfigUtils.getStarRocksSinkParallelism()))
                //only for Flink connector version <= 1.2.7
                .withProperty("sink.properties.columns","id,record_id,create_time,is_deleted,biz_type,update_time,__op")
                ;

        return StarRocksSink.sink(tableSchema, builder.build(), new StateTimingRowBuilder());
    }

    private static final class StateTimingRowBuilder implements StarRocksSinkRowBuilder<StateTimingRecord> {

        @Override
        public void accept(Object[] rowData, StateTimingRecord record) {
            int idx = 0;
            rowData[idx++] = record.getId();
            rowData[idx++] = record.getRecordId();
            rowData[idx++] = toTimestamp(record.getCreateTime());
            rowData[idx++] = defaultInteger(record.getIsDeleted());
            rowData[idx++] = defaultInteger(record.getBizType());
            rowData[idx++] = toTimestamp(record.getUpdateTime());
            // When the StarRocks table is a Primary Key table, you need to set the last element to indicate whether the data loading is an UPSERT or DELETE operation.
            rowData[idx++] = StarRocksSinkOP.UPSERT.ordinal();
        }

        private Timestamp toTimestamp(Long epochMillis) {
            return epochMillis == null ? null : new Timestamp(epochMillis);
        }

        private Long toMinute(Long epochMillis) {
            return epochMillis == null ? null : epochMillis / MILLIS_PER_MINUTE;
        }

        private Integer defaultInteger(Integer value) {
            return value == null ? 0 : value;
        }
    }
}
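
To make the "accumulate small batches in memory, then flush" principle and the three sink.buffer-flush.* options more concrete, here is a minimal sketch of threshold-based buffering. It is not the connector's actual implementation: MiniBatchBuffer and its methods are hypothetical names, rows are plain strings, and the time check happens only when a row is added, whereas a real sink would typically also flush from a timer and on checkpoints.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Illustrative stand-in for mini-batch buffering; NOT the connector's actual code. */
final class MiniBatchBuffer {
    private final long maxRows;    // plays the role of sink.buffer-flush.max-rows
    private final long maxBytes;   // plays the role of sink.buffer-flush.max-bytes
    private final long intervalMs; // plays the role of sink.buffer-flush.interval-ms

    private final List<String> rows = new ArrayList<>();
    private long bufferedBytes = 0L;
    private long lastFlushMs;

    MiniBatchBuffer(long maxRows, long maxBytes, long intervalMs, long nowMs) {
        this.maxRows = maxRows;
        this.maxBytes = maxBytes;
        this.intervalMs = intervalMs;
        this.lastFlushMs = nowMs;
    }

    /**
     * Buffers one serialized row. Returns the batch to send if any threshold
     * (row count, byte size, or elapsed time) was reached, otherwise an empty list.
     */
    List<String> add(String row, long nowMs) {
        rows.add(row);
        bufferedBytes += row.getBytes(StandardCharsets.UTF_8).length;
        boolean full = rows.size() >= maxRows || bufferedBytes >= maxBytes;
        boolean due = nowMs - lastFlushMs >= intervalMs;
        return (full || due) ? flush(nowMs) : Collections.<String>emptyList();
    }

    /** Hands over everything buffered so far and resets the counters. */
    List<String> flush(long nowMs) {
        List<String> batch = new ArrayList<>(rows);
        rows.clear();
        bufferedBytes = 0L;
        lastFlushMs = nowMs;
        return batch;
    }
}
```

With the values used in the factory above (500000 rows, 94371840 bytes, i.e. 90 MiB, and 5000 ms), whichever limit is hit first would trigger the flush in this sketch; on a low-traffic stream that is usually the 5-second interval.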

Some caveats

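When using a Primary Key table
In the CREATE TABLE statement, primary key columns must be defined before all other columns. The primary key must include the partition columns and the bucketing columns.
Primary key columns support the following data types: numeric (including integer and boolean), date, and string. With default settings, the maximum encoded length of a single primary key value is 128 bytes.
The primary key cannot be modified after the table is created.
Values of primary key columns cannot be updated, to avoid breaking data consistency.
Data cleaning
For string types, filter out special characters such as carriage returns and line feeds from the field content.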
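
The string cleaning step could be sketched as follows. This is only one possible approach: StringFieldCleaner is a hypothetical helper, and replacing each control character with a space (rather than dropping it) is an assumption made here so that words separated only by a line break do not run together.

```java
/** Hypothetical helper for cleaning string fields before they reach the sink. */
final class StringFieldCleaner {

    private StringFieldCleaner() {
    }

    /**
     * Replaces carriage returns, line feeds, tabs and other ISO control
     * characters with a single space each. A null input stays null.
     */
    static String clean(String value) {
        if (value == null) {
            return null;
        }
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            sb.append(Character.isISOControl(c) ? ' ' : c);
        }
        return sb.toString();
    }
}
```

In a row builder like the one shown earlier, such a helper would be applied to each string value before it is assigned to rowData, for example StringFieldCleaner.clean(record.getRecordId()).
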
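Differences between versions
If your code uses a lower version of flink-connector (version <= 1.2.7), a few extra steps are needed.
When configuring the TableSchema, SinkOptions, and RowBuilder in code, do the following:
Primary key columns of a Primary Key table must not be null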

field("uuid", DataTypes.STRING().notNull())
The primary key must be declared explicitly

primaryKey("uuid", "field_id", "create_time")
sink.properties.columns must be configured explicitly

// only for Flink connector version <= 1.2.7; append the __op field at the end
.withProperty("sink.properties.columns", "uuid,field_id,field_data_id,field_value_code_md5,create_time,is_deleted,space_id,category_code,field_component_name,field_label,field_data_name,field_data_type,field_value_string,field_value_long,field_value_double,field_value_boolean,field_value_code,__op");
The last column in rowData is the __op field

// When the StarRocks table is a Primary Key table, you need to set the last element to indicate whether the data loading is an UPSERT or DELETE operation.
rowData[idx++] = StarRocksSinkOP.UPSERT.ordinal();
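
The role of the trailing __op slot can be illustrated with a small self-contained sketch. The Op enum below is a local stand-in that mirrors the order of the connector's StarRocksSinkOP (UPSERT = 0, DELETE = 1), and OpFlagSketch is a hypothetical name. Mapping a deleted flag to DELETE is purely an illustration: the factory in this article always sends UPSERT and keeps is_deleted as an ordinary column.

```java
/** Illustrative sketch of filling the trailing __op slot of a row array. */
final class OpFlagSketch {

    /** Local stand-in for the connector's StarRocksSinkOP: UPSERT = 0, DELETE = 1. */
    enum Op { UPSERT, DELETE }

    private OpFlagSketch() {
    }

    /**
     * Builds a row with two data columns plus one extra trailing element
     * that carries the operation type for a Primary Key table.
     */
    static Object[] buildRow(long id, String recordId, boolean deleted) {
        Object[] row = new Object[3];
        int idx = 0;
        row[idx++] = id;
        row[idx++] = recordId;
        // The last element is the __op value, matching the last name in sink.properties.columns.
        row[idx++] = (deleted ? Op.DELETE : Op.UPSERT).ordinal();
        return row;
    }
}
```

The point to carry over is the alignment: the array has one more element than the table has columns, and its position matches __op at the end of the sink.properties.columns list.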