Apache SeaTunnel: Enhancing CSV Field Splitting on Read

Overview

The changes in this article are based on SeaTunnel 2.3.4.

Our product needs a feature that synchronizes data via CSV, but SeaTunnel's CSV reader does not split strings according to CSV rules: it simply splits rows on newline characters and columns with String.split. As soon as a column value contains a newline or the delimiter, the result is wrong.

So here we rework Apache SeaTunnel's existing CSV reading so that it can genuinely handle CSV.
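To make the failure concrete, here is a minimal, hypothetical illustration (plain JDK, no SeaTunnel code) of what goes wrong when a quoted CSV field contains the delimiter:

```java
public class SplitVsCsv {
    public static void main(String[] args) {
        // A valid CSV record: the first field is "Shanghai, China" (quoted, contains a comma)
        String line = "\"Shanghai, China\",tom,28";

        // Naive splitting cuts inside the quoted field: 4 pieces instead of 3
        String[] parts = line.split(",");
        System.out.println(parts.length); // 4
        System.out.println(parts[0]);     // "Shanghai   (wrong: the quote is left open)
    }
}
```

A CSV-aware parser would return exactly three fields, with the comma preserved inside the first one.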

Getting Started

CSV handling lives in the connector-file-base module. In the enum org.apache.seatunnel.connectors.seatunnel.file.config.FileFormat you can see that reading the CSV type uses org.apache.seatunnel.connectors.seatunnel.file.source.reader.TextReadStrategy:

public enum FileFormat implements Serializable {
    CSV("csv") {
        @Override
        public WriteStrategy getWriteStrategy(FileSinkConfig fileSinkConfig) {
            fileSinkConfig.setFieldDelimiter(",");
            return new TextWriteStrategy(fileSinkConfig);
        }

        @Override
        public ReadStrategy getReadStrategy() {
            // CSV read strategy
            return new TextReadStrategy();
        }
    },
    ...
    ...
}

Stepping into TextReadStrategy's read method:

// org.apache.seatunnel.connectors.seatunnel.file.source.reader.TextReadStrategy

    @Override
    public void read(String path, String tableId, Collector<SeaTunnelRow> output)
            throws FileConnectorException, IOException {
        ....
        ....
        ....

        // a BufferedReader reads the file stream
        try (BufferedReader reader =
                new BufferedReader(new InputStreamReader(inputStream, StandardCharsets.UTF_8))) {
            // rows are identified by newline characters
            reader.lines()
                    .skip(skipHeaderNumber)
                    .forEach(
                            line -> {
                                try {
                                    // deserializationSchema is org.apache.seatunnel.format.text.TextDeserializationSchema;
                                    // its deserialize method splits with String.split
                                    SeaTunnelRow seaTunnelRow =
                                            deserializationSchema.deserialize(line.getBytes());
                                    if (!readColumns.isEmpty()) {
                                        // need column projection
                                        Object[] fields;
                                        if (isMergePartition) {
                                            fields =
                                                    new Object
                                                            [readColumns.size()
                                                                    + partitionsMap.size()];
                                        } else {
                                            fields = new Object[readColumns.size()];
                                        }
                                        for (int i = 0; i < indexes.length; i++) {
                                            fields[i] = seaTunnelRow.getField(indexes[i]);
                                        }
                                        seaTunnelRow = new SeaTunnelRow(fields);
                                    }
                                    if (isMergePartition) {
                                        int index = seaTunnelRowType.getTotalFields();
                                        for (String value : partitionsMap.values()) {
                                            seaTunnelRow.setField(index++, value);
                                        }
                                    }
                                    seaTunnelRow.setTableId(tableId);
                                    output.collect(seaTunnelRow);
                                } catch (IOException e) {
                                    String errorMsg =
                                            String.format(
                                                    "Deserialize this data [%s] failed, please check the origin data",
                                                    line);
                                    throw new FileConnectorException(
                                            FileConnectorErrorCode.DATA_DESERIALIZE_FAILED,
                                            errorMsg,
                                            e);
                                }
                            });
        }
    }

So this strategy splits rows with BufferedReader.lines, which is just splitting on newline characters, and splits fields with String.split on the delimiter character.

Here is the problem: if a field's content itself contains a newline or the delimiter, the output is garbled, because things that should not be cut get cut.

So we need to rework this logic so that it splits strings correctly.

Adding Dedicated CSV Handling

Our goal is to replace SeaTunnel's original CSV handling with logic written specifically for CSV.

Introducing a Third-Party CSV Library

I chose the opencsv library to handle CSV.

Add opencsv to the pom.xml of connector-file-base:

<dependency>
    <groupId>com.opencsv</groupId>
    <artifactId>opencsv</artifactId>
    <version>5.7.1</version>
</dependency>

Adding a CSV DeserializationSchema

Create a new package org.apache.seatunnel.connectors.seatunnel.file.format.csv and, inside it, a CSV deserialization class CsvDeserializationSchema implementing the DeserializationSchema interface (modeled on TextDeserializationSchema). Below is the complete code for CsvDeserializationSchema and the StrToNumberUtil helper it needs; if you don't want to read through it, you can copy it as-is.

New utility class org.apache.seatunnel.connectors.seatunnel.file.utils.StrToNumberUtil:

package org.apache.seatunnel.connectors.seatunnel.file.utils;

import org.apache.commons.lang3.StringUtils;

import java.math.BigDecimal;
import java.util.Optional;

public class StrToNumberUtil {

    public static Double str2Double(String str) {
        if (StringUtils.isBlank(str)) {
            return null;
        }
        try {
            return Double.parseDouble(str.trim());
        } catch (Exception e) {
            return null;
        }
    }

    public static Long str2Long(String str) {
        if (StringUtils.isBlank(str)) {
            return null;
        }
        str = str.trim();
        // more than one decimal point: not a number, give up
        if (str.indexOf('.') != str.lastIndexOf('.')) {
            return null;
        }
        // take the integer part
        String sub = str.indexOf('.') >= 0 ? str.substring(0, str.indexOf('.')) : str;
        try {
            return Long.parseLong(sub);
        } catch (Exception e) {
            return null;
        }
    }

    public static Byte str2Byte(String s) {
        return Optional.ofNullable(str2Long(s)).map(Long::byteValue).orElse(null);
    }

    public static Short str2Short(String s) {
        return Optional.ofNullable(str2Long(s)).map(Long::shortValue).orElse(null);
    }

    public static Integer str2Int(String s) {
        return Optional.ofNullable(str2Long(s)).map(Long::intValue).orElse(null);
    }

    public static Float str2Float(String s) {
        return Optional.ofNullable(str2Double(s)).map(Double::floatValue).orElse(null);
    }

    public static BigDecimal str2BigDecimal(String s) {
        if (StringUtils.isBlank(s)) {
            return null;
        }
        try {
            return new BigDecimal(s.trim());
        } catch (Exception e) {
            return null;
        }
    }

}
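As a quick sanity check of the lenient-parsing idea, here is a standalone sketch of the same str2Long logic with the StringUtils.isBlank check inlined so it runs without commons-lang3:

```java
public class LenientParseDemo {
    // Same logic as StrToNumberUtil.str2Long above, with isBlank inlined
    static Long str2Long(String str) {
        if (str == null || str.trim().isEmpty()) {
            return null;
        }
        str = str.trim();
        // more than one decimal point: not a number
        if (str.indexOf('.') != str.lastIndexOf('.')) {
            return null;
        }
        // take the integer part
        String sub = str.indexOf('.') >= 0 ? str.substring(0, str.indexOf('.')) : str;
        try {
            return Long.parseLong(sub);
        } catch (Exception e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(str2Long("42"));    // 42
        System.out.println(str2Long("42.9"));  // 42   (integer part, no exception)
        System.out.println(str2Long("1.2.3")); // null (not a number)
        System.out.println(str2Long("  "));    // null instead of NumberFormatException
    }
}
```

The point is that a malformed numeric cell in a user's file degrades to null instead of failing the whole row.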

New deserialization class org.apache.seatunnel.connectors.seatunnel.file.format.csv.CsvDeserializationSchema:

package org.apache.seatunnel.connectors.seatunnel.file.format.csv;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import org.apache.seatunnel.api.serialization.DeserializationSchema;
import org.apache.seatunnel.api.table.type.ArrayType;
import org.apache.seatunnel.api.table.type.BasicType;
import org.apache.seatunnel.api.table.type.MapType;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;
import org.apache.seatunnel.common.exception.CommonErrorCodeDeprecated;
import org.apache.seatunnel.common.utils.DateTimeUtils;
import org.apache.seatunnel.common.utils.DateUtils;
import org.apache.seatunnel.common.utils.TimeUtils;
import org.apache.seatunnel.connectors.seatunnel.file.utils.StrToNumberUtil;
import org.apache.seatunnel.format.text.constant.TextFormatConstant;
import org.apache.seatunnel.format.text.exception.SeaTunnelTextFormatException;

import org.apache.commons.lang3.StringUtils;

import lombok.NonNull;

import java.io.IOException;
import java.io.InputStreamReader;
import java.io.StringReader;
import java.math.BigDecimal;
import java.nio.charset.StandardCharsets;
import java.util.*;

public class CsvDeserializationSchema implements DeserializationSchema<SeaTunnelRow> {
    private final SeaTunnelRowType seaTunnelRowType;
    private final String[] separators;
    private final DateUtils.Formatter dateFormatter;
    private final DateTimeUtils.Formatter dateTimeFormatter;
    private final TimeUtils.Formatter timeFormatter;

    private CsvDeserializationSchema(
            @NonNull SeaTunnelRowType seaTunnelRowType,
            String[] separators,
            DateUtils.Formatter dateFormatter,
            DateTimeUtils.Formatter dateTimeFormatter,
            TimeUtils.Formatter timeFormatter) {
        this.seaTunnelRowType = seaTunnelRowType;
        this.separators = separators;
        this.dateFormatter = dateFormatter;
        this.dateTimeFormatter = dateTimeFormatter;
        this.timeFormatter = timeFormatter;
    }

    public static Builder builder() {
        return new Builder();
    }

    public static class Builder {
        private SeaTunnelRowType seaTunnelRowType;
        private String[] separators = TextFormatConstant.SEPARATOR.clone();
        private DateUtils.Formatter dateFormatter = DateUtils.Formatter.YYYY_MM_DD;
        private DateTimeUtils.Formatter dateTimeFormatter =
                DateTimeUtils.Formatter.YYYY_MM_DD_HH_MM_SS;
        private TimeUtils.Formatter timeFormatter = TimeUtils.Formatter.HH_MM_SS;

        private Builder() {}

        public Builder seaTunnelRowType(SeaTunnelRowType seaTunnelRowType) {
            this.seaTunnelRowType = seaTunnelRowType;
            return this;
        }

        public Builder delimiter(String delimiter) {
            this.separators[0] = delimiter;
            return this;
        }

        public Builder separators(String[] separators) {
            this.separators = separators;
            return this;
        }

        public Builder dateFormatter(DateUtils.Formatter dateFormatter) {
            this.dateFormatter = dateFormatter;
            return this;
        }

        public Builder dateTimeFormatter(DateTimeUtils.Formatter dateTimeFormatter) {
            this.dateTimeFormatter = dateTimeFormatter;
            return this;
        }

        public Builder timeFormatter(TimeUtils.Formatter timeFormatter) {
            this.timeFormatter = timeFormatter;
            return this;
        }

        public CsvDeserializationSchema build() {
            return new CsvDeserializationSchema(
                    seaTunnelRowType, separators, dateFormatter, dateTimeFormatter, timeFormatter);
        }
    }

    /**
     * Deserialize a JSON string whose keys are column indexes (starting from 0) and whose
     * values are the column values.
     *
     * @param message bytes of a JSON string: keys are column indexes (from 0), values are the column values
     */
    @Override
    public SeaTunnelRow deserialize(byte[] message) throws IOException {
        String content = new String(message);
        ObjectMapper objectMapper = new ObjectMapper();
        com.fasterxml.jackson.databind.type.MapType javaType =
                objectMapper
                        .getTypeFactory()
                        .constructMapType(HashMap.class, Integer.class, String.class);
        Map<Integer, String> splitsMap = objectMapper.readValue(content, javaType);
        Object[] objects = new Object[seaTunnelRowType.getTotalFields()];
        for (int i = 0; i < objects.length; i++) {
            objects[i] = convert(splitsMap.get(i), seaTunnelRowType.getFieldType(i), 0);
        }
        return new SeaTunnelRow(objects);
    }

    @Override
    public SeaTunnelDataType<SeaTunnelRow> getProducedType() {
        return seaTunnelRowType;
    }

    private Map<Integer, String> splitLineBySeaTunnelRowType(
            String line, SeaTunnelRowType seaTunnelRowType, int level) {
        String[] splits = splitLineWithCsvMethod(line, separators[level].charAt(0));
        LinkedHashMap<Integer, String> splitsMap = new LinkedHashMap<>();
        SeaTunnelDataType<?>[] fieldTypes = seaTunnelRowType.getFieldTypes();
        for (int i = 0; i < splits.length; i++) {
            splitsMap.put(i, splits[i]);
        }
        if (fieldTypes.length > splits.length) {
            // contains partition columns
            for (int i = splits.length; i < fieldTypes.length; i++) {
                splitsMap.put(i, null);
            }
        }
        return splitsMap;
    }

    private String[] splitLineWithCsvMethod(String line, char sep) {
        CSVParser csvParser = new CSVParserBuilder().withSeparator(sep).build();
        try (CSVReader reader = new CSVReaderBuilder(new StringReader(line))
                        .withCSVParser(csvParser)
                        .build()) {
            Iterator<String[]> iterator = reader.iterator();
            if (iterator.hasNext()) {
                return iterator.next();
            }
            return new String[0];
        } catch (Exception e) {
            return new String[]{line};
        }
    }

    private Object convert(String field, SeaTunnelDataType<?> fieldType, int level) {
        if (StringUtils.isBlank(field)) {
            return null;
        }
        switch (fieldType.getSqlType()) {
            case ARRAY:
                BasicType<?> elementType = ((ArrayType<?, ?>) fieldType).getElementType();
                String[] elements = field.split(separators[level + 1]);
                ArrayList<Object> objectArrayList = new ArrayList<>();
                for (String element : elements) {
                    objectArrayList.add(convert(element, elementType, level + 1));
                }
                switch (elementType.getSqlType()) {
                    case STRING:
                        return objectArrayList.toArray(new String[0]);
                    case BOOLEAN:
                        return objectArrayList.toArray(new Boolean[0]);
                    case TINYINT:
                        return objectArrayList.toArray(new Byte[0]);
                    case SMALLINT:
                        return objectArrayList.toArray(new Short[0]);
                    case INT:
                        return objectArrayList.toArray(new Integer[0]);
                    case BIGINT:
                        return objectArrayList.toArray(new Long[0]);
                    case FLOAT:
                        return objectArrayList.toArray(new Float[0]);
                    case DOUBLE:
                        return objectArrayList.toArray(new Double[0]);
                    default:
                        throw new SeaTunnelTextFormatException(
                                CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE,
                                String.format(
                                        "SeaTunnel array not support this data type [%s]",
                                        elementType.getSqlType()));
                }
            case MAP:
                SeaTunnelDataType<?> keyType = ((MapType<?, ?>) fieldType).getKeyType();
                SeaTunnelDataType<?> valueType = ((MapType<?, ?>) fieldType).getValueType();
                LinkedHashMap<Object, Object> objectMap = new LinkedHashMap<>();
                String[] kvs = field.split(separators[level + 1]);
                for (String kv : kvs) {
                    String[] splits = kv.split(separators[level + 2]);
                    if (splits.length < 2) {
                        objectMap.put(convert(splits[0], keyType, level + 1), null);
                    } else {
                        objectMap.put(
                                convert(splits[0], keyType, level + 1),
                                convert(splits[1], valueType, level + 1));
                    }
                }
                return objectMap;
            case STRING:
                return field;
            case BOOLEAN:
                return Boolean.parseBoolean(field);
            case TINYINT:
                return StrToNumberUtil.str2Byte(field);
            case SMALLINT:
                return StrToNumberUtil.str2Short(field);
            case INT:
                return StrToNumberUtil.str2Int(field);
            case BIGINT:
                return StrToNumberUtil.str2Long(field);
            case FLOAT:
                return StrToNumberUtil.str2Float(field);
            case DOUBLE:
                return StrToNumberUtil.str2Double(field);
            case DECIMAL:
                return StrToNumberUtil.str2BigDecimal(field);
            case NULL:
                return null;
            case BYTES:
                return field.getBytes();
            case DATE:
                return DateUtils.parse(field, dateFormatter);
            case TIME:
                return TimeUtils.parse(field, timeFormatter);
            case TIMESTAMP:
                return DateTimeUtils.parse(field, dateTimeFormatter);
            case ROW:
                Map<Integer, String> splitsMap =
                        splitLineBySeaTunnelRowType(field, (SeaTunnelRowType) fieldType, level + 1);
                Object[] objects = new Object[splitsMap.size()];
                for (int i = 0; i < objects.length; i++) {
                    objects[i] =
                            convert(
                                    splitsMap.get(i),
                                    ((SeaTunnelRowType) fieldType).getFieldType(i),
                                    level + 1);
                }
                return new SeaTunnelRow(objects);
            default:
                throw new SeaTunnelTextFormatException(
                        CommonErrorCodeDeprecated.UNSUPPORTED_DATA_TYPE,
                        String.format(
                                "SeaTunnel not support this data type [%s]",
                                fieldType.getSqlType()));
        }
    }
}

Its main differences from TextDeserializationSchema are in the deserialize, splitLineBySeaTunnelRowType, and convert methods.

deserialize method:
  • TextDeserializationSchema receives the bytes of one line of text read upstream, then calls splitLineBySeaTunnelRowType to split it on the delimiter into a column-index → value map, which it passes to convert.
  • CsvDeserializationSchema receives the bytes of a JSON string whose keys are already column indexes (starting from 0) and whose values are the column values; once the JSON is parsed into a map, it can be passed straight to convert.

splitLineBySeaTunnelRowType method:
  • TextDeserializationSchema splits directly with String.split on the delimiter; if a value also contains the delimiter, the result is wrong.
  • CsvDeserializationSchema splits with opencsv, following the CSV rules exactly — rock solid!

convert method:
  • For the TINYINT, SMALLINT, INT, BIGINT, FLOAT, DOUBLE, and DECIMAL types, CsvDeserializationSchema uses our custom string-to-number utility (StrToNumberUtil), mainly so that numeric values in user-provided files that don't match the format expected by Java's built-in parsers don't cause exceptions.
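The deserialize handoff can be pictured with a plain-JDK sketch: a hand-rolled, hypothetical stand-in for the Jackson serialization the real code uses (proper JSON escaping of quotes and backslashes is omitted for brevity):

```java
public class RowToJsonSketch {
    // Build the column-index-keyed JSON that deserialize() expects, e.g. {"0":"a","1":"b"}
    static String toIndexedJson(String[] row) {
        StringBuilder json = new StringBuilder("{");
        for (int i = 0; i < row.length; i++) {
            if (i > 0) json.append(',');
            json.append('"').append(i).append("\":\"").append(row[i]).append('"');
        }
        return json.append('}').toString();
    }

    public static void main(String[] args) {
        // One parsed CSV record, as opencsv would return it
        String[] row = {"Shanghai, China", "tom", "28"};
        System.out.println(toIndexedJson(row));
        // {"0":"Shanghai, China","1":"tom","2":"28"}
    }
}
```

Because the fields are already separated before JSON encoding, the embedded comma in the first value can no longer confuse any downstream split.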

The other differences are minor, so I won't describe them all; compare the two files with IDEA's diff feature if you're curious.

Adding a CSV Read Strategy: CsvReadStrategy

Add a CsvReadStrategy class under the org.apache.seatunnel.connectors.seatunnel.file.source.reader package, modeled on org.apache.seatunnel.connectors.seatunnel.file.source.reader.TextReadStrategy.

Here is the complete CsvReadStrategy class:

package org.apache.seatunnel.connectors.seatunnel.file.source.reader;

import com.fasterxml.jackson.databind.ObjectMapper;
import com.opencsv.CSVParser;
import com.opencsv.CSVParserBuilder;
import com.opencsv.CSVReader;
import com.opencsv.CSVReaderBuilder;
import org.apache.seatunnel.api.common.SeaTunnelAPIErrorCode;
import org.apache.seatunnel.api.configuration.ReadonlyConfig;
import org.apache.seatunnel.api.serialization.DeserializationSchema;
import org.apache.seatunnel.api.source.Collector;
import org.apache.seatunnel.api.table.catalog.CatalogTableUtil;
import org.apache.seatunnel.api.table.type.SeaTunnelDataType;
import org.apache.seatunnel.api.table.type.SeaTunnelRow;
import org.apache.seatunnel.api.table.type.SeaTunnelRowType;
import org.apache.seatunnel.common.utils.DateTimeUtils;
import org.apache.seatunnel.common.utils.DateUtils;
import org.apache.seatunnel.common.utils.TimeUtils;
import org.apache.seatunnel.connectors.seatunnel.file.config.BaseSourceConfigOptions;
import org.apache.seatunnel.connectors.seatunnel.file.config.CompressFormat;
import org.apache.seatunnel.connectors.seatunnel.file.config.FileFormat;
import org.apache.seatunnel.connectors.seatunnel.file.exception.FileConnectorErrorCode;
import org.apache.seatunnel.connectors.seatunnel.file.exception.FileConnectorException;
import org.apache.seatunnel.connectors.seatunnel.file.format.csv.CsvDeserializationSchema;
import org.apache.seatunnel.format.text.constant.TextFormatConstant;

import io.airlift.compress.lzo.LzopCodec;
import lombok.extern.slf4j.Slf4j;

import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;
import java.util.Optional;

@Slf4j
public class CsvReadStrategy extends AbstractReadStrategy {
    private DeserializationSchema<SeaTunnelRow> deserializationSchema;
    private String fieldDelimiter = BaseSourceConfigOptions.FIELD_DELIMITER.defaultValue();
    private DateUtils.Formatter dateFormat = BaseSourceConfigOptions.DATE_FORMAT.defaultValue();
    private DateTimeUtils.Formatter datetimeFormat =
            BaseSourceConfigOptions.DATETIME_FORMAT.defaultValue();
    private TimeUtils.Formatter timeFormat = BaseSourceConfigOptions.TIME_FORMAT.defaultValue();
    private CompressFormat compressFormat = BaseSourceConfigOptions.COMPRESS_CODEC.defaultValue();
    private int[] indexes;

    @Override
    public void read(String path, String tableId, Collector<SeaTunnelRow> output)
            throws FileConnectorException, IOException {
        Map<String, String> partitionsMap = parsePartitionsByPath(path);
        InputStream inputStream;
        switch (compressFormat) {
            case LZO:
                LzopCodec lzo = new LzopCodec();
                inputStream = lzo.createInputStream(hadoopFileSystemProxy.getInputStream(path));
                break;
            case NONE:
                inputStream = hadoopFileSystemProxy.getInputStream(path);
                break;
            default:
                log.warn(
                        "Text file does not support this compress type: {}",
                        compressFormat.getCompressCodec());
                inputStream = hadoopFileSystemProxy.getInputStream(path);
                break;
        }

        char separator =
                (fieldDelimiter == null || fieldDelimiter.isEmpty())
                        ? ','
                        : fieldDelimiter.charAt(0);
        CSVParser csvParser = new CSVParserBuilder().withSeparator(separator).build();
        try (CSVReader reader = new CSVReaderBuilder(new InputStreamReader(inputStream, StandardCharsets.UTF_8))
                        .withSkipLines((int) skipHeaderNumber)
                        .withCSVParser(csvParser)
                        .build()) {
            Iterator<String[]> iterator = reader.iterator();
            ObjectMapper objectMapper = new ObjectMapper();
            while (iterator.hasNext()) {
                String[] row = iterator.next();
                if (row.length == 0) {
                    continue;
                }
                try {
                    Map<Integer, String> rowMap = new HashMap<>();
                    for (int i = 0; i < row.length; i++) {
                        rowMap.put(i, row[i]);
                    }
                    byte[] bytes = objectMapper.writeValueAsBytes(rowMap);
                    SeaTunnelRow seaTunnelRow = deserializationSchema.deserialize(bytes);
                    if (!readColumns.isEmpty()) {
                        // need column projection
                        Object[] fields;
                        if (isMergePartition) {
                            fields = new Object[readColumns.size() + partitionsMap.size()];
                        } else {
                            fields = new Object[readColumns.size()];
                        }
                        for (int i = 0; i < indexes.length; i++) {
                            fields[i] = seaTunnelRow.getField(indexes[i]);
                        }
                        seaTunnelRow = new SeaTunnelRow(fields);
                    }
                    if (isMergePartition) {
                        int index = seaTunnelRowType.getTotalFields();
                        for (String value : partitionsMap.values()) {
                            seaTunnelRow.setField(index++, value);
                        }
                    }
                    seaTunnelRow.setTableId(tableId);
                    output.collect(seaTunnelRow);
                } catch (Exception e) {
                    String errorMsg =
                            String.format(
                                    "Deserialize this data [%s] failed, please check the origin data",
                                    String.join(separator + "", row));
                    throw new FileConnectorException(
                            FileConnectorErrorCode.DATA_DESERIALIZE_FAILED, errorMsg, e);
                }
            }
        }
    }

    @Override
    public SeaTunnelRowType getSeaTunnelRowTypeInfo(String path) {
        this.seaTunnelRowType = CatalogTableUtil.buildSimpleTextSchema();
        this.seaTunnelRowTypeWithPartition =
                mergePartitionTypes(fileNames.get(0), seaTunnelRowType);
        initFormatter();
        if (pluginConfig.hasPath(BaseSourceConfigOptions.READ_COLUMNS.key())) {
            throw new FileConnectorException(
                    SeaTunnelAPIErrorCode.CONFIG_VALIDATION_FAILED,
                    "When reading json/text/csv files, if user has not specified schema information, "
                            + "SeaTunnel will not support column projection");
        }
        CsvDeserializationSchema.Builder builder =
                CsvDeserializationSchema.builder()
                        .delimiter(TextFormatConstant.PLACEHOLDER)
                        .dateFormatter(dateFormat)
                        .dateTimeFormatter(datetimeFormat)
                        .timeFormatter(timeFormat);
        if (isMergePartition) {
            deserializationSchema =
                    builder.seaTunnelRowType(this.seaTunnelRowTypeWithPartition).build();
        } else {
            deserializationSchema = builder.seaTunnelRowType(this.seaTunnelRowType).build();
        }
        return getActualSeaTunnelRowTypeInfo();
    }

    @Override
    public void setSeaTunnelRowTypeInfo(SeaTunnelRowType seaTunnelRowType) {
        SeaTunnelRowType userDefinedRowTypeWithPartition =
                mergePartitionTypes(fileNames.get(0), seaTunnelRowType);
        Optional<String> fieldDelimiterOptional =
                ReadonlyConfig.fromConfig(pluginConfig)
                        .getOptional(BaseSourceConfigOptions.FIELD_DELIMITER);
        if (fieldDelimiterOptional.isPresent()) {
            fieldDelimiter = fieldDelimiterOptional.get();
        } else {
            FileFormat fileFormat =
                    FileFormat.valueOf(
                            pluginConfig
                                    .getString(BaseSourceConfigOptions.FILE_FORMAT_TYPE.key())
                                    .toUpperCase());
            if (fileFormat == FileFormat.CSV) {
                fieldDelimiter = ",";
            }
        }
        initFormatter();
        CsvDeserializationSchema.Builder builder =
                CsvDeserializationSchema.builder()
                        .delimiter(fieldDelimiter)
                        .dateFormatter(dateFormat)
                        .dateTimeFormatter(datetimeFormat)
                        .timeFormatter(timeFormat);
        if (isMergePartition) {
            deserializationSchema =
                    builder.seaTunnelRowType(userDefinedRowTypeWithPartition).build();
        } else {
            deserializationSchema = builder.seaTunnelRowType(seaTunnelRowType).build();
        }
        // column projection
        if (pluginConfig.hasPath(BaseSourceConfigOptions.READ_COLUMNS.key())) {
            // get the read column index from user-defined row type
            indexes = new int[readColumns.size()];
            String[] fields = new String[readColumns.size()];
            SeaTunnelDataType<?>[] types = new SeaTunnelDataType[readColumns.size()];
            for (int i = 0; i < indexes.length; i++) {
                indexes[i] = seaTunnelRowType.indexOf(readColumns.get(i));
                fields[i] = seaTunnelRowType.getFieldName(indexes[i]);
                types[i] = seaTunnelRowType.getFieldType(indexes[i]);
            }
            this.seaTunnelRowType = new SeaTunnelRowType(fields, types);
            this.seaTunnelRowTypeWithPartition =
                    mergePartitionTypes(fileNames.get(0), this.seaTunnelRowType);
        } else {
            this.seaTunnelRowType = seaTunnelRowType;
            this.seaTunnelRowTypeWithPartition = userDefinedRowTypeWithPartition;
        }
    }

    private void initFormatter() {
        if (pluginConfig.hasPath(BaseSourceConfigOptions.DATE_FORMAT.key())) {
            dateFormat =
                    DateUtils.Formatter.parse(
                            pluginConfig.getString(BaseSourceConfigOptions.DATE_FORMAT.key()));
        }
        if (pluginConfig.hasPath(BaseSourceConfigOptions.DATETIME_FORMAT.key())) {
            datetimeFormat =
                    DateTimeUtils.Formatter.parse(
                            pluginConfig.getString(BaseSourceConfigOptions.DATETIME_FORMAT.key()));
        }
        if (pluginConfig.hasPath(BaseSourceConfigOptions.TIME_FORMAT.key())) {
            timeFormat =
                    TimeUtils.Formatter.parse(
                            pluginConfig.getString(BaseSourceConfigOptions.TIME_FORMAT.key()));
        }
        if (pluginConfig.hasPath(BaseSourceConfigOptions.COMPRESS_CODEC.key())) {
            String compressCodec =
                    pluginConfig.getString(BaseSourceConfigOptions.COMPRESS_CODEC.key());
            compressFormat = CompressFormat.valueOf(compressCodec.toUpperCase());
        }
    }
}

Differences from TextReadStrategy:

read method:

  • TextReadStrategy splits rows with BufferedReader
  • CsvReadStrategy splits rows with opencsv's CSVReader
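The difference matters when a quoted field spans lines. Here is a minimal, hand-written sketch of the quote-aware scanning a CSV reader does internally (simplified: escaped quotes are not handled):

```java
import java.util.ArrayList;
import java.util.List;

public class QuoteAwareSplit {
    /** Split input into CSV records, treating newlines inside quotes as data. */
    static List<String> records(String input) {
        List<String> out = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        boolean inQuotes = false;
        for (char c : input.toCharArray()) {
            if (c == '"') {
                inQuotes = !inQuotes;   // entering or leaving a quoted field
                cur.append(c);
            } else if (c == '\n' && !inQuotes) {
                out.add(cur.toString()); // record boundary only outside quotes
                cur.setLength(0);
            } else {
                cur.append(c);
            }
        }
        if (cur.length() > 0) out.add(cur.toString());
        return out;
    }

    public static void main(String[] args) {
        String csv = "id,note\n1,\"line one\nline two\"\n2,ok";
        // BufferedReader.lines() would report 4 lines; a CSV-aware reader sees 3 records
        System.out.println(records(csv).size()); // 3
    }
}
```

This is why CsvReadStrategy hands the whole stream to CSVReader instead of pre-splitting it into lines.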

getSeaTunnelRowTypeInfo and setSeaTunnelRowTypeInfo methods:

  • TextReadStrategy obtains its deserializationSchema from TextDeserializationSchema.Builder
  • CsvReadStrategy obtains its deserializationSchema from CsvDeserializationSchema.Builder

Registering the CSV Strategy

Finally, wire CsvReadStrategy into the org.apache.seatunnel.connectors.seatunnel.file.config.FileFormat enum: in the CSV entry's getReadStrategy method, change new TextReadStrategy() to new CsvReadStrategy().

Packaging

Apply Globally

If you want every plugin based on connector-file-base to pick up the change, package the connector-file-base module:

mvn clean package -DskipTests=true -pl seatunnel-connectors-v2/connector-file/connector-file-base -am

Then put the connector-file-base JAR into the lib directory of your SeaTunnel deployment; every plugin based on connector-file-base will use it.

Apply to Specific Plugins

If you only want certain connector-file-base-based plugins to pick up the change, repackage just those plugins:

mvn clean package -DskipTests=true -pl [path-to-the-plugin-module] -am

Then replace the old JAR with the new one.

Miscellaneous

Packaging may fail because of Maven's Spotless plugin; if so, try running mvn spotless:apply first, then package again.

That's all!

This article is published with support from 白鲸开源科技 (WhaleOps).
