1. Download the Spark source and build it
mkdir -p /home/bigdata && cd /home/bigdata
wget https://archive.apache.org/dist/spark/spark-3.4.3/spark-3.4.3.tgz
Extract the archive
tar -zxf spark-3.4.3.tgz
cd spark-3.4.3
git apply Celeborn_Dynamic_Allocation_spark3_4.patch
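Build the distribution from source:
./dev/make-distribution.sh --name lukeyan --pip --tgz -Dhadoop.version=3.3.6 -Phive -Phive-thriftserver -Pkubernetes -Pvolcano
The build completes successfully.
Once the build finishes, extract the generated archive and add the required JAR files.
Extract the built distribution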
tar -zxvf spark-3.4.3-bin-lukeyan.tgz
cd spark-3.4.3-bin-lukeyan
Add the JAR files
cd jars/
ls
wget https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.6/hadoop-aws-3.3.6.jar
Add the Paimon integration dependency
wget https://repo1.maven.org/maven2/org/apache/paimon/paimon-spark-3.4/0.9.0/paimon-spark-3.4-0.9.0.jar
If your Kubernetes distribution is K3s, RKE2, or similar, also add the following dependencies
wget https://repo1.maven.org/maven2/org/bouncycastle/bcpkix-jdk18on/1.77/bcpkix-jdk18on-1.77.jar
wget https://repo1.maven.org/maven2/org/bouncycastle/bcprov-jdk18on/1.77/bcprov-jdk18on-1.77.jar
cd ..
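The wget URLs above follow the standard Maven Central layout (group path, artifact, version, then artifact-version.jar). A minimal shell sketch, assuming a hypothetical helper named `maven_jar_url` (not part of the original steps), derives each URL from its coordinates so that a version bump only touches one place:

```shell
#!/bin/sh
# Hypothetical helper: build a Maven Central JAR URL from groupId, artifactId and version.
maven_jar_url() {
  # Maven stores the groupId as a directory path, so dots become slashes.
  group_path=$(printf '%s' "$1" | tr '.' '/')
  printf 'https://repo1.maven.org/maven2/%s/%s/%s/%s-%s.jar\n' \
    "$group_path" "$2" "$3" "$2" "$3"
}

# Print the URLs of the four JARs added to jars/ in the steps above.
maven_jar_url org.apache.hadoop hadoop-aws 3.3.6
maven_jar_url org.apache.paimon paimon-spark-3.4 0.9.0
maven_jar_url org.bouncycastle bcpkix-jdk18on 1.77
maven_jar_url org.bouncycastle bcprov-jdk18on 1.77
```

Each printed URL can be passed to wget directly, for example `wget "$(maven_jar_url org.apache.hadoop hadoop-aws 3.3.6)"`.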
Build the Docker image
docker buildx build --load --platform linux/arm64 --tag spark-paimon-s3:3.4.3_2.12 .
Check the image architecture
docker inspect --format '{{.Architecture}}' azul/zulu-openjdk:17.0.9-17.46.19-jre
docker images
docker save -o jdk.tar azul/zulu-openjdk:17.0.9-17.46.19-jre
docker save -o flink.tar flink:1.19-scala_2.12-java17
docker pull --platform linux/arm64 azul/zulu-openjdk:17.0.9-17.46.19-jre
docker inspect --format '{{.Architecture}}' azul/zulu-openjdk:17.0.9-17.46.19-jre
docker buildx ls
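The commands above check which architecture an image was built for. A small shell sketch, assuming a hypothetical helper `docker_platform_for` (not from the original), compares the host architecture with the linux/arm64 target so it is clear up front whether the build is a cross-build, which typically relies on buildx emulation:

```shell
#!/bin/sh
# Hypothetical helper: map a `uname -m` machine name to a Docker --platform value.
docker_platform_for() {
  case "$1" in
    x86_64|amd64)  echo "linux/amd64" ;;
    aarch64|arm64) echo "linux/arm64" ;;
    *)             echo "unknown"; return 1 ;;
  esac
}

# Platform used by the build commands in this guide.
target_platform="linux/arm64"
host_platform=$(docker_platform_for "$(uname -m)") || host_platform="unknown"

if [ "$host_platform" = "$target_platform" ]; then
  echo "native build: host and target are both $target_platform"
else
  echo "cross build: host=$host_platform target=$target_platform"
fi
```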
For building Arm images on x86, see the CSDN blog post "Centos7的x86上构建arm镜像docker" (Building Arm Docker images on x86 CentOS 7).
Copy the Dockerfile into the current directory
FROM azul/zulu-openjdk:17.0.9-17.46.19-jre
ARG spark_uid=185
ENV HADOOP_CONF_DIR=/etc/hadoop/conf
# Before building the docker image, first build and make a Spark distribution following
# the instructions in https://spark.apache.org/docs/latest/building-spark.html.
# If this docker file is being used in the context of building your images from a Spark
# distribution, the docker build command should be invoked from the top level directory
# of the Spark distribution. E.g.:
# docker build -t spark:latest -f kubernetes/dockerfiles/spark/Dockerfile .
RUN set -ex && \
apt-get update && \
ln -s /lib /lib64 && \
apt install -y bash tini libc6 libpam-modules krb5-user libnss3 procps net-tools && \
mkdir -p /opt/spark && \
mkdir -p /opt/spark/examples && \
mkdir -p /opt/spark/work-dir && \
touch /opt/spark/RELEASE && \
rm /bin/sh && \
ln -sv /bin/bash /bin/sh && \
echo "auth required pam_wheel.so use_uid" >> /etc/pam.d/su && \
chgrp root /etc/passwd && chmod ug+rw /etc/passwd && \
rm -rf /var/cache/apt/* && rm -rf /var/lib/apt/lists/*
COPY jars /opt/spark/jars
# Copy RELEASE file if exists
COPY RELEAS[E] /opt/spark/RELEASE
COPY bin /opt/spark/bin
COPY sbin /opt/spark/sbin
COPY kubernetes/dockerfiles/spark/entrypoint.sh /opt/
COPY kubernetes/dockerfiles/spark/decom.sh /opt/
COPY examples /opt/spark/examples
COPY kubernetes/tests /opt/spark/tests
COPY data /opt/spark/data
ENV SPARK_HOME=/opt/spark
WORKDIR /opt/spark/work-dir
RUN chmod g+w /opt/spark/work-dir
RUN chmod a+x /opt/decom.sh
ENTRYPOINT [ "/opt/entrypoint.sh" ]
# Specify the User that the actual main process will run as
USER ${spark_uid}
Run the image build command:
docker buildx build --load --platform linux/arm64 --tag spark-paimon-s3:3.4.3_2.12 .
This produces the base image spark-paimon-s3:3.4.3_2.12.
2. Write the program
KafkaSparkPaimonS3
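The program uses Spark to consume records from Kafka and saves the fixed-format data to S3-compatible object storage.
MinIO serves as the S3 backend here.
Program code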
package com.example.cloud;
import org.apache.spark.sql.streaming.{DataStreamReader, StreamingQuery}
import org.apache.spark.sql.types._
import org.apache.spark.sql.{DataFrame, Dataset, Row, SparkSession}
object KafkaSparkPaimonS3 {
def main(args: Array[String]): Unit = {
val kafkaConsumer: String = "kafka-service:9092"
val kafkaTopic: String = "mysql-flink-cdc-kafka"
val startingOffsets: String = "latest"
val kafkaGroupId: String = "KafkaSparkPaimonS3Group"
val failOnDataLoss: Boolean = false
val maxOffsetsPerTrigger: Int = 3000
val lakePath: String = "s3a://paimon/warehouse"
val checkpointLocation: String = "s3a://spark/checkpoints"
val s3endpoint: String = "http://minio:9000"
val s3access: String = "uotAvnxXwcz90yNxWhq2"
val s3secret: String = "MlDBAOfRDG9lwFTUo9Qic9dLbuFfHsxJfwkjFD4v"
val schema_base = StructType(List(
StructField("before", StringType),
StructField("after", StringType),
StructField("source", MapType(StringType, StringType)),
StructField("op", StringType),
StructField("ts_ms", LongType),
StructField("transaction", StringType)
))
println("create spark session ..........................................................")
val sparkConf = SparkSession.builder()
.config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
.config("spark.sql.catalog.paimon.metastore", "filesystem")
.config("spark.sql.catalog.paimon.warehouse", lakePath)
.config("spark.sql.catalog.paimon.s3.endpoint", s3endpoint)
.config("spark.sql.catalog.paimon.s3.access-key", s3access)
.config("spark.sql.catalog.paimon.s3.secret-key", s3secret)
.config("spark.sql.catalog.paimon", "org.apache.paimon.spark.SparkCatalog")
.config("spark.sql.catalog.paimon.s3.path-style.access", "true")
.config("spark.sql.extensions", "org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions")
.config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
.config("spark.hadoop.fs.s3a.multipart.size", "104857600")
.config("spark.hadoop.fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
.config("spark.hadoop.fs.s3a.access.key", s3access)
.config("spark.hadoop.fs.s3a.secret.key", s3secret)
.config("spark.hadoop.fs.s3a.endpoint", s3endpoint)
.config("spark.hadoop.fs.s3a.connection.timeout", "200000")
val sparkSession: SparkSession = sparkConf.getOrCreate()
println("get spark DataStreamReader start ..........................................................")
val dsr: DataStreamReader = sparkSession
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", kafkaConsumer)
.option("subscribe", kafkaTopic)
.option("startingOffsets", startingOffsets)
.option("failOnDataLoss", failOnDataLoss)
.option("maxOffsetsPerTrigger", maxOffsetsPerTrigger)
.option("kafka.group.id", kafkaGroupId)
.option("includeHeaders", "true")
println("get spark DataStreamReader end ..........................................................")
val df: DataFrame = dsr.load()
println("kafka consumer stream configured, spark DataFrame end ..........................................................")
import org.apache.spark.sql.functions._
import sparkSession.implicits._
val frame: Dataset[Row] = df.select(from_json($"value".cast("string"), schema_base) as "value").select($"value.*")
.alias("data")
.select(
// get_json_object takes a Column and a JSONPath expression that starts with "$"
get_json_object($"data.after", "$.uuid").as("uuid"),
get_json_object($"data.after", "$.product").as("product"),
get_json_object($"data.after", "$.promotion").as("promotion"),
get_json_object($"data.after", "$.value_added_service").as("value_added_service"),
get_json_object($"data.after", "$.logistics").as("logistics"),
get_json_object($"data.after", "$.weight").as("weight"),
get_json_object($"data.after", "$.color").as("color"),
get_json_object($"data.after", "$.version").as("version"),
get_json_object($"data.after", "$.shop").as("shop"),
get_json_object($"data.after", "$.evaluate").as("evaluate"),
get_json_object($"data.after", "$.order_num").as("order_num"),
get_json_object($"data.after", "$.rider").as("rider"),
get_json_object($"data.after", "$.order_time").as("order_time"),
get_json_object($"data.after", "$.create_time").as("create_time"),
get_json_object($"data.after", "$.pay_price").as("pay_price"),
get_json_object($"data.after", "$.pay_type").as("pay_type"),
get_json_object($"data.after", "$.address").as("address")
)
println("get spark Dataset from kafka ..........................................................")
sparkSession.sql("USE paimon;")
println("spark engine use paimon catalog ..........................................................")
sparkSession.sql("CREATE DATABASE IF NOT EXISTS m31094;")
println("create my favourite database for u ..........................................................")
val tablePath = "paimon.m31094.my_table"
println("create table to store data ..........................................................")
sparkSession.sql("use m31094;")
sparkSession.sql(
s"""
CREATE TABLE IF NOT EXISTS $tablePath (
uuid STRING,
product STRING,
promotion STRING,
value_added_service STRING,
logistics STRING,
weight STRING,
color STRING,
version STRING,
shop STRING,
evaluate STRING,
order_num STRING,
rider STRING,
order_time STRING,
create_time STRING,
pay_price STRING,
pay_type STRING,
address STRING
) TBLPROPERTIES (
'partitioned_by' = 'uuid'
)
""")
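println("writing the DataFrame to the Paimon table ..........................................................")
println("printing the data in as much detail as possible ..........................................................")
val query: StreamingQuery = frame // An existing Dataset[Row], here the data parsed from the Kafka stream.
.writeStream // Begin a streaming write.
// Caution: a DataStreamWriter keeps a single sink, so the later .format("paimon") call most likely
// replaces this foreachBatch sink and the callback below is then not invoked.
.foreachBatch { (batchDF: Dataset[Row], batchId: Long) =>
println(s"processing micro-batch, batch ID: $batchId")
// Print the data of the current batch
println("batch contents ..........................................................")
batchDF.show(truncate = false) // truncate = false shows the full column contents
}
.format("paimon")
// Use Paimon as the output format.
.option("write.merge-schema", "true")
// Allow schema merging on write, so the table schema is updated dynamically to fit new data.
.option("write.merge-schema.explicit-cast", "true")
// When merging schemas, cast data types explicitly to keep them compatible and correct.
.outputMode("append")
// Append mode: only new rows are added to the target table; existing rows are not updated or deleted.
.option("checkpointLocation", checkpointLocation)
// Checkpoint location, which lets the streaming job resume after a failure.
.start("s3a://paimon/warehouse/m31094.db/my_table") // Start the streaming query and write to the given S3 path
println("spark stream is writing to the data lake through paimon ..........................................................")
println("inspect the data content and structure ..........................................................")
println(df.schema) // Print the schema
println("printed the schema ..........................................................")
println("Stream processing started...")
query.awaitTermination() // Block the current thread until the streaming query ends, so the program keeps running until it is stopped manually or fails.
println("Stream processing has ended; the program terminates.")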
}
}
pom.xml
<?xml version="1.0" encoding="UTF-8"?>
<project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
<modelVersion>4.0.0</modelVersion>
<groupId>com.example.cloud</groupId>
<artifactId>KafkaSparkPaimonS3</artifactId>
<version>2.4.5</version>
<name>KafkaSparkPaimonS3</name>
<properties>
<java.version>1.8</java.version>
<scala.binary.version>2.12</scala.binary.version>
<spark.version>3.4.3</spark.version>
<paimon.version>0.9.0</paimon.version>
</properties>
<dependencies>
<dependency>
<groupId>com.amazonaws</groupId>
<artifactId>aws-java-sdk-bundle</artifactId>
<version>1.12.367</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-aws</artifactId>
<version>3.3.6</version>
</dependency>
<dependency>
<groupId>org.apache.hadoop</groupId>
<artifactId>hadoop-common</artifactId>
<version>3.3.6</version>
</dependency>
<dependency>
<groupId>com.fasterxml.jackson.core</groupId>
<artifactId>jackson-databind</artifactId>
<version>2.14.0</version>
</dependency>
<dependency>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
<version>1.8</version>
</dependency>
<dependency>
<groupId>org.apache.paimon</groupId>
<artifactId>paimon-spark-common</artifactId>
<version>${paimon.version}</version>
</dependency>
<dependency>
<groupId>org.apache.paimon</groupId>
<artifactId>paimon-s3</artifactId>
<version>${paimon.version}</version>
</dependency>
<dependency>
<groupId>org.apache.paimon</groupId>
<artifactId>paimon-spark-3.4</artifactId>
<version>${paimon.version}</version>
</dependency>
<dependency>
<groupId>org.apache.paimon</groupId>
<artifactId>paimon-s3-impl</artifactId>
<version>${paimon.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-streaming_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-core_2.12</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<groupId>org.apache.commons</groupId>
<artifactId>commons-text</artifactId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql_2.12</artifactId>
<version>${spark.version}</version>
<exclusions>
<exclusion>
<artifactId>audience-annotations</artifactId>
<groupId>org.apache.yetus</groupId>
</exclusion>
</exclusions>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-sql-kafka-0-10_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>org.apache.spark</groupId>
<artifactId>spark-token-provider-kafka-0-10_2.12</artifactId>
<version>${spark.version}</version>
</dependency>
<dependency>
<groupId>cn.hutool</groupId>
<artifactId>hutool-all</artifactId>
<version>5.8.20</version>
</dependency>
</dependencies>
<build>
<finalName>${project.artifactId}</finalName>
<resources>
<resource>
<directory>src/main/resources</directory>
</resource>
</resources>
<plugins>
<plugin>
<groupId>net.alchim31.maven</groupId>
<artifactId>scala-maven-plugin</artifactId>
<executions>
<execution>
<id>scala-compile-first</id>
<phase>process-resources</phase>
<goals>
<goal>add-source</goal>
<goal>compile</goal>
</goals>
</execution>
<execution>
<id>scala-test-compile</id>
<phase>process-test-resources</phase>
<goals>
<goal>testCompile</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<artifactId>maven-assembly-plugin</artifactId>
<configuration>
<archive>
<manifest>
<mainClass>com.example.cloud.KafkaSparkPaimonS3</mainClass>
</manifest>
</archive>
<descriptorRefs>
<descriptorRef>jar-with-dependencies</descriptorRef>
</descriptorRefs>
</configuration>
<executions>
<execution>
<id>make-assembly</id>
<phase>package</phase>
<goals>
<goal>single</goal>
</goals>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-dependency-plugin</artifactId>
<version>3.1.0</version>
<executions>
<execution>
<phase>prepare-package</phase>
<goals>
<goal>copy-dependencies</goal>
</goals>
<configuration>
<outputDirectory>${project.build.directory}/lib</outputDirectory>
<excludeTransitive>false</excludeTransitive>
<stripVersion>false</stripVersion>
<includeScope>runtime</includeScope>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-resources-plugin</artifactId>
<executions>
<execution>
<id>copy-resources</id>
<phase>package</phase>
<goals>
<goal>copy-resources</goal>
</goals>
<configuration>
<encoding>UTF-8</encoding>
<outputDirectory>
${project.build.directory}/config
</outputDirectory>
<resources>
<resource>
<directory>src/main/resources/</directory>
</resource>
</resources>
</configuration>
</execution>
<execution>
<id>copy-sh</id>
<phase>package</phase>
<goals>
<goal>copy-resources</goal>
</goals>
<configuration>
<encoding>UTF-8</encoding>
<outputDirectory>
${project.build.directory}
</outputDirectory>
<resources>
<resource>
<directory>bin/</directory>
</resource>
</resources>
</configuration>
</execution>
</executions>
</plugin>
<plugin>
<groupId>org.apache.maven.plugins</groupId>
<artifactId>maven-compiler-plugin</artifactId>
<configuration>
<source>8</source>
<target>8</target>
</configuration>
</plugin>
</plugins>
</build>
</project>
Dockerfile
FROM spark-paimon-s3:3.4.3_2.12
RUN mkdir -p /opt/spark/examples/jars
COPY target /opt/spark/examples/jars
Image build command:
docker buildx build --load --platform linux/arm64 --tag spark-paimon-s3-app:3.4.3_2.12 --no-cache .
docker save -o spark-paimon-s3-app.tar spark-paimon-s3-app:3.4.3_2.12
3. Configure MinIO
minio.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: minio
  namespace: default
spec:
  replicas: 1
  selector:
    matchLabels:
      app: minio
  template:
    metadata:
      labels:
        app: minio
    spec:
      containers:
        - name: minio
          image: minio/minio:latest
          imagePullPolicy: IfNotPresent
          args:
            - server
            - /data
          env:
            - name: MINIO_ROOT_USER
              value: "admin"
            - name: MINIO_ROOT_PASSWORD
              value: "<password>"
          command:
            - /bin/sh
            - -c
            - minio server /data --console-address ":5000"
          ports:
            - name: api
              protocol: TCP
              containerPort: 9000
            - name: ui
              protocol: TCP
              containerPort: 5000
          volumeMounts:
            - name: minio-storage
              mountPath: /data
      volumes:
        - name: minio-storage
          persistentVolumeClaim:
            claimName: minio-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: minio
  namespace: default
spec:
  selector:
    app: minio
  type: NodePort
  ports:
    - name: api
      protocol: TCP
      port: 9000
      targetPort: 9000
    - name: ui
      protocol: TCP
      port: 5000
      targetPort: 5000
minio-pvc.yaml
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: minio-pvc # name of the PVC
  namespace: default
spec:
  accessModes:
    - ReadWriteMany # access mode; ReadWriteMany allows read-write mounts from multiple nodes
  resources:
    requests:
      storage: 100Gi # requested storage capacity
  storageClassName: nfs-client # storage class; choose one that suits your cluster
4. Run the programs
4.1 Generate the raw data with Spring Boot and MySQL
The raw data generated in MySQL
4.2 Move data from MySQL to Kafka
MySQL -> Flink CDC -> Kafka
MysqlFlinkCdcToKafka
Submit the Flink job on Kubernetes
/home/d/flink/bin/flink run-application \
  --target kubernetes-application \
  -Dkubernetes.namespace=default \
  -Dkubernetes.cluster-id=flink-cdc-mysql \
  -Dkubernetes.container.image.ref=flinkcdctokafka:0.1-snapshot \
  -Dkubernetes.container.image.pull-policy=IfNotPresent \
  -Dkubernetes.service-account=default \
  -Dkubernetes.rest-service.exposed.type=NodePort \
  -Djobmanager.memory.process.size=2048mb \
  -Dtaskmanager.memory.process.size=2024mb \
  -Dtaskmanager.numberOfTaskSlots=1 \
  -Dhigh-availability.type=kubernetes \
  -Dhigh-availability.storageDir=s3a://flink-cdc/recovery \
  -Dstate.checkpoints.dir=s3a://flink-cdc/flink_cp \
  -Dstate.savepoints.dir=s3a://flink-cdc/flink_sp \
  -Dstate.backend.incremental=true \
  -Ds3.access-key=uotAvnxXwcz90yNxWhq2 \
  -Ds3.secret-key=MlDBAOfRDG9lwFTUo9Qic9dLbuFfHsxJfwkjFD4v \
  -Ds3.path.style.access=true \
  -Ds3.endpoint=http://minio:9000 \
  -Duser.timezone=Asia/Shanghai \
  -c "com.example.cloud.MysqlFlinkCdcToKafka" \
  local:///opt/flink/usrlib/MysqlFlinkCdcToKafka-jar-with-dependencies.jar
Flink CDC writes the MySQL data to the specified Kafka topic.
4.3 Kafka to MinIO
Kafka -> Spark -> MinIO
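Spark submit command for running the job on the cluster. The --master URL below points at a Spark standalone master (spark://10.10.10.99:7077); submitting directly to Kubernetes would use a k8s:// master URL instead.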
/opt/streaming/spark-3.4.3-bin-hadoop3/bin/spark-submit \
  --name KafkaSparkPaimonS3 \
  --master spark://10.10.10.99:7077 \
  --deploy-mode client \
  --driver-cores 2 \
  --driver-memory 4g \
  --num-executors 2 \
  --executor-cores 2 \
  --executor-memory 4g \
  --class com.example.cloud.KafkaSparkPaimonS3 \
  --conf spark.driver.extraClassPath=/opt/streaming/spark-3.4.3-bin-hadoop3/jars \
  --conf spark.executor.extraClassPath=/opt/streaming/spark-3.4.3-bin-hadoop3/jars \
  --jars /opt/lib/kafka-clients-3.8.0.jar,/opt/lib/spark-sql-kafka-0-10_2.12-3.4.3.jar,/opt/lib/spark-token-provider-kafka-0-10_2.12-3.4.3.jar \
  /opt/KafkaSparkPaimonS3-jar-with-dependencies.jar
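Spark connector JARs carry the Scala binary version in their file name (`_2.12`, `_2.13`), and the JARs passed to --jars should match the Scala version of the Spark build (the image in this guide is tagged 3.4.3_2.12). A minimal shell sketch, assuming two hypothetical helpers `scala_suffix_of` and `check_scala_suffixes` that are not part of the original, flags mismatched JARs before submitting:

```shell
#!/bin/sh
# Hypothetical helper: print the Scala binary version embedded in a JAR file name,
# or nothing when the name carries no _2.x suffix (e.g. kafka-clients).
scala_suffix_of() {
  basename "$1" | sed -n 's/.*_\(2\.[0-9][0-9]*\)-[0-9].*\.jar$/\1/p'
}

# Hypothetical helper: $1 is the expected Scala version, the rest are JAR paths.
# Returns non-zero if any JAR is built for a different Scala version.
check_scala_suffixes() {
  css_expected="$1"; shift
  css_status=0
  for css_jar in "$@"; do
    css_found=$(scala_suffix_of "$css_jar")
    if [ -n "$css_found" ] && [ "$css_found" != "$css_expected" ]; then
      echo "mismatch: $css_jar targets Scala $css_found, expected $css_expected"
      css_status=1
    fi
  done
  return "$css_status"
}

# Example with one deliberately mismatched JAR name.
check_scala_suffixes 2.12 \
  /opt/lib/kafka-clients-3.8.0.jar \
  /opt/lib/spark-sql-kafka-0-10_2.12-3.4.3.jar \
  /opt/lib/spark-token-provider-kafka-0-10_2.13-3.4.3.jar \
  || echo "fix the JAR list before running spark-submit"
```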
With a local Spark installation, you can inspect the written data through Spark SQL.
Run spark-sql locally
/opt/streaming/spark-3.4.3-bin-hadoop3/bin/spark-sql \
  --jars /opt/lib/paimon-spark-3.4-0.9.0.jar \
  --conf 'spark.sql.catalog.paimon.metastore=filesystem' \
  --conf 'spark.sql.catalog.paimon.warehouse=s3a://paimon/warehouse' \
  --conf 'spark.sql.catalog.paimon.s3.endpoint=http://10.10.10.99:31212' \
  --conf 'spark.sql.catalog.paimon.s3.access-key=uotAvnxXwcz90yNxWhq2' \
  --conf 'spark.sql.catalog.paimon.s3.secret-key=MlDBAOfRDG9lwFTUo9Qic9dLbuFfHsxJfwkjFD4v' \
  --conf 'spark.serializer=org.apache.spark.serializer.KryoSerializer' \
  --conf 'spark.sql.catalog.paimon=org.apache.paimon.spark.SparkCatalog' \
  --conf 'spark.sql.extensions=org.apache.paimon.spark.extensions.PaimonSparkSessionExtensions' \
  --conf 'spark.sql.catalog.paimon.s3.path-style.access=true' \
  --conf 'spark.delta.logStore.class=org.apache.spark.sql.delta.storage.S3SingleDriverLogStore' \
  --conf 'spark.hadoop.fs.s3a.multipart.size=104857600' \
  --conf 'spark.hadoop.fs.s3a.impl=org.apache.hadoop.fs.s3a.S3AFileSystem' \
  --conf 'spark.hadoop.fs.s3a.access.key=uotAvnxXwcz90yNxWhq2' \
  --conf 'spark.hadoop.fs.s3a.secret.key=MlDBAOfRDG9lwFTUo9Qic9dLbuFfHsxJfwkjFD4v' \
  --conf 'spark.hadoop.fs.s3a.endpoint=http://10.10.10.99:31212' \
  --conf 'spark.hadoop.fs.s3a.connection.timeout=200000'
After running the local spark-sql command above and the Spark SQL shell opens, run:
use paimon;
show databases;
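The spark-sql invocation above repeats the MinIO endpoint and credentials across several --conf flags. A minimal shell sketch, assuming the values are supplied through environment variables (`S3_ENDPOINT`, `S3_ACCESS_KEY`, `S3_SECRET_KEY` and the `paimon_s3_confs` helper are hypothetical names, not part of the original), generates those flags in one place instead of hard-coding them:

```shell
#!/bin/sh
# Hypothetical helper: print the endpoint and credential flags, one "--conf key=value" per line.
# $1 = S3 endpoint, $2 = access key, $3 = secret key
paimon_s3_confs() {
  for kv in \
    "spark.sql.catalog.paimon.s3.endpoint=$1" \
    "spark.sql.catalog.paimon.s3.access-key=$2" \
    "spark.sql.catalog.paimon.s3.secret-key=$3" \
    "spark.hadoop.fs.s3a.endpoint=$1" \
    "spark.hadoop.fs.s3a.access.key=$2" \
    "spark.hadoop.fs.s3a.secret.key=$3"
  do
    printf '%s %s\n' "--conf" "$kv"
  done
}

# Placeholder defaults (assumptions); export the real values before running.
S3_ENDPOINT="${S3_ENDPOINT:-http://minio:9000}"
S3_ACCESS_KEY="${S3_ACCESS_KEY:-example-access-key}"
S3_SECRET_KEY="${S3_SECRET_KEY:-example-secret-key}"

# Preview the generated flags.
paimon_s3_confs "$S3_ENDPOINT" "$S3_ACCESS_KEY" "$S3_SECRET_KEY"
```

Provided the values contain no whitespace, the output can be spliced into the command as `spark-sql $(paimon_s3_confs "$S3_ENDPOINT" "$S3_ACCESS_KEY" "$S3_SECRET_KEY")` followed by the remaining flags.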