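StarRocks Shared-Data + Spark + Hive Metastore + MinIO: Building a Data Lake End to End
Goal: build a complete hot/cold tiered data lake. Hot data stays in StarRocks; cold data is moved by Spark to MinIO, with its metadata managed by Hive Metastore, and StarRocks queries it directly through an External Catalog.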
Overall architecture
┌────────────────────────────────────────────────────────────┐
│                         StarRocks                          │
│  ┌──────────────────┐              ┌──────────────────┐    │
│  │ Internal Catalog │              │ External Catalog │    │
│  │    (hot data)    │              │ (Hive Metastore) │    │
│  │    read/write    │              │    read-only     │    │
│  └──────────────────┘              └──────────────────┘    │
│           │                                 │              │
│           │ Spark ETL                       │ read metadata│
│           ▼                                 ▼              │
│  ┌──────────────────┐ saveAsTable  ┌──────────────────┐    │
│  │      Spark       │ ───────────> │  Hive Metastore  │    │
│  │    (reads SR)    │              │    (metadata)    │    │
│  └──────────────────┘              └──────────────────┘    │
│           │                                                │
│           │ writes Parquet                                 │
│           ▼                                                │
│  ┌──────────────────┐                                      │
│  │      MinIO       │ <── External Catalog reads Parquet   │
│  │   (S3 storage)   │                                      │
│  └──────────────────┘                                      │
└────────────────────────────────────────────────────────────┘
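1. Managing everything with Docker Compose
All components live under spark-docker/, in three independent docker-compose.yml files that share one Docker network, spark-docker_spark-net. Because every compose file declares that network as external, it has to exist before the first docker-compose up (for example, created once with docker network create spark-docker_spark-net).
spark-docker/
├── docker-compose.yml          ← Spark cluster
├── starrocks-cluster/
│   └── docker-compose.yml      ← StarRocks FE/BE
└── hive-metastore/
    └── docker-compose.yml      ← Hive Metastore
MinIO is a pre-existing container; it only needs to be attached to this network.
2. MinIO container deployment (pre-existing)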
bash
# Check MinIO's port mappings to confirm the S3 API port
docker port minio
# Output: 9000/tcp -> 0.0.0.0:9000 (S3 API)
# Output: 9090/tcp -> 0.0.0.0:9090 (web console)
# Attach MinIO to the Spark network
docker network connect spark-docker_spark-net minio
3. Containerized Spark cluster deployment
3.1 docker-compose.yml
yaml
services:
  spark-master:
    image: apache/spark:3.5.4
    container_name: spark-master
    hostname: spark-master
    restart: unless-stopped
    environment:
      # Spark's daemon script only checks whether this variable is set;
      # any value (even "false") keeps the process in the foreground.
      - SPARK_NO_DAEMONIZE=true
    ports:
      - "8180:8080"
      - "7077:7077"
    command: /opt/spark/sbin/start-master.sh --host 0.0.0.0 --port 7077 --webui-port 8080
    networks:
      - spark-net
  spark-worker-1:
    image: apache/spark:3.5.4
    container_name: spark-worker-1
    hostname: spark-worker-1
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      - SPARK_NO_DAEMONIZE=true
    ports:
      - "8181:8081"
    command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077 --cores 4 --memory 4g --webui-port 8081
    networks:
      - spark-net
  spark-worker-2:
    image: apache/spark:3.5.4
    container_name: spark-worker-2
    hostname: spark-worker-2
    restart: unless-stopped
    depends_on:
      - spark-master
    environment:
      - SPARK_NO_DAEMONIZE=true
    ports:
      - "8182:8081"
    command: /opt/spark/sbin/start-worker.sh spark://spark-master:7077 --cores 4 --memory 4g --webui-port 8081
    networks:
      - spark-net
networks:
  spark-net:
    external: true
    name: spark-docker_spark-net
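Key points:
- Uses the apache/spark:3.5.4 image, which bundles Hadoop and a Java 11 JVM
- --host 0.0.0.0 makes the Master listen on all network interfaces
- All containers share one external network, spark-docker_spark-net
3.2 Startup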
bash
cd spark-docker
docker-compose up -d
docker ps --filter "name=spark"
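3.3 Preparing JARs
Spark needs extra JARs to connect to StarRocks and MinIO. The commands below cover the StarRocks side; the S3A JARs for MinIO are bundled into the application JAR in section 6.1.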
bash
# StarRocks Connector
curl -o starrocks-spark-connector-3.5_2.12-1.1.3.jar \
  https://repo1.maven.org/maven2/com/starrocks/starrocks-spark-connector-3.5_2.12/1.1.3/starrocks-spark-connector-3.5_2.12-1.1.3.jar
# MySQL JDBC driver (taken from the local Maven repository)
cp ~/.m2/repository/mysql/mysql-connector-java/8.0.28/mysql-connector-java-8.0.28.jar .
# Copy both into the Spark Master container
docker cp starrocks-spark-connector-3.5_2.12-1.1.3.jar spark-master:/opt/spark/jars/
docker cp mysql-connector-java-8.0.28.jar spark-master:/opt/spark/jars/
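4. Setting up the StarRocks shared-data architecture
4.1 Why split FE and BE
The official StarRocks allin1-ubuntu image packages the FE and BE in a single container. The drawback is that the BE's priority_networks is hard-coded to 127.0.0.1/32, so the Spark Connector running in another container cannot reach the BE's Thrift port. With FE and BE split into two separate containers, network binding can be configured for each one, and the Connector connects to the BE directly over the Docker network.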
4.2 docker-compose.yml
yaml
services:
  sr-fe:
    image: starrocks/fe-ubuntu:3.5.0
    container_name: sr-fe
    hostname: sr-fe
    restart: unless-stopped
    environment:
      - HOST_TYPE=FQDN
    ports:
      - "8030:8030"
      - "9030:9030"
    volumes:
      - sr-fe-meta:/opt/starrocks/fe/meta
      - sr-fe-log:/opt/starrocks/fe/log
    command: /opt/starrocks/fe/bin/start_fe.sh --logconsole
    networks:
      - spark-net
  sr-be:
    image: starrocks/be-ubuntu:3.5.0
    container_name: sr-be
    hostname: sr-be
    restart: unless-stopped
    environment:
      - HOST_TYPE=FQDN
    ports:
      - "8040:8040"
    volumes:
      - sr-be-data:/opt/starrocks/be/storage
      - sr-be-log:/opt/starrocks/be/log
    command: /opt/starrocks/be/bin/start_be.sh --logconsole
    networks:
      - spark-net
networks:
  spark-net:
    external: true
    name: spark-docker_spark-net
volumes:
  sr-fe-meta:
  sr-fe-log:
  sr-be-data:
  sr-be-log:
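Key points:
- --logconsole keeps FE/BE running in the foreground, so the containers do not exit
- No fixed IPs are assigned; Docker allocates them automatically
- After the BE starts, it must be registered with the FE manually
4.3 Registering the BE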
bash
docker exec sr-fe mysql -h 127.0.0.1 -P 9030 -u root \
-e "ALTER SYSTEM ADD BACKEND 'sr-be:9050';"
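The registration command only succeeds once the FE is accepting connections on its MySQL-protocol port (9030), which may take a little while after the containers start. The following is a minimal sketch of a readiness probe, assuming that a successful TCP connect is a good enough signal (it does not prove the FE has finished initializing). `PortProbe`, `isOpen` and `waitForPort` are illustrative names, not part of any StarRocks tooling, and the host/port in `main` assume the probe runs on the Docker host where the compose file publishes 9030.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.time.Duration;

/** Hypothetical helper: polls a TCP port until it accepts connections. */
final class PortProbe {

    /** Returns true if host:port accepts a TCP connection within connectTimeoutMs. */
    static boolean isOpen(String host, int port, int connectTimeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), connectTimeoutMs);
            return true;
        } catch (IOException e) {
            return false; // refused, unreachable, or timed out
        }
    }

    /** Retries roughly once per second until the port opens or the timeout elapses. */
    static boolean waitForPort(String host, int port, Duration timeout) {
        long deadline = System.nanoTime() + timeout.toNanos();
        while (true) {
            if (isOpen(host, port, 1000)) {
                return true;
            }
            if (System.nanoTime() >= deadline) {
                return false;
            }
            try {
                Thread.sleep(1000);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Assumed values: the FE query port as published to the Docker host.
        boolean ready = waitForPort("127.0.0.1", 9030, Duration.ofSeconds(60));
        System.out.println(ready ? "FE query port is open" : "FE query port not reachable");
    }
}
```

The same probe could be pointed at sr-be:9050 from inside the Docker network before running ALTER SYSTEM ADD BACKEND, since that is the address the FE will try to reach.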
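4.4 Shared-data storage for the Internal Catalog
StarRocks 3.5 supports the storage_volume property, which stores internal-table data directly in MinIO. As far as the StarRocks documentation describes, storage volumes belong to shared-data mode (run_mode = shared_data in fe.conf); the compose file in 4.2 does not show that setting, so it is worth confirming the run mode before executing the statements below: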
sql
-- Step 1: create a storage volume pointing at MinIO
CREATE STORAGE VOLUME minio_volume
TYPE = S3
LOCATIONS = ("s3://spark-output")
PROPERTIES (
    "enabled" = "true",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "<MINIO_ACCESS_KEY>",
    "aws.s3.secret_key" = "<MINIO_SECRET_KEY>",
    "aws.s3.enable_path_style_access" = "true"
);
-- Step 2: make it the default storage volume
SET minio_volume AS DEFAULT STORAGE VOLUME;
-- Step 3: reference it when creating a table
CREATE TABLE db1.dim_product (
    product_id INT,
    product_name VARCHAR(200),
    ...
) ENGINE=OLAP
DUPLICATE KEY(product_id)
DISTRIBUTED BY HASH(product_id) BUCKETS 10
PROPERTIES (
    "storage_volume" = "minio_volume",
    "replication_num" = "1"
);
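With this in place, the data files of SR internal tables are stored in MinIO rather than on the BE's local disk. DML is unaffected: INSERT, UPDATE and DELETE keep working as before, subject to the usual restrictions of the table type.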
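The "aws.s3.enable_path_style_access" = "true" property in the storage volume matters for MinIO because of how S3 clients build request URLs. The sketch below contrasts the two addressing styles; `S3Addressing` and the object key are made up for illustration and say nothing about how StarRocks actually lays out its files. With virtual-hosted style the bucket becomes part of the hostname, and by default Docker's embedded DNS resolves the container name minio but not a bucket-prefixed name such as spark-output.minio.

```java
import java.net.URI;

/** Illustrative only: builds the two URL shapes an S3 client can use. */
final class S3Addressing {

    /** Path-style: host stays as-is, the bucket is the first path segment. */
    static String pathStyle(String endpoint, String bucket, String key) {
        return trimSlash(endpoint) + "/" + bucket + "/" + key;
    }

    /** Virtual-hosted-style: the bucket becomes a subdomain of the endpoint host. */
    static String virtualHostedStyle(String endpoint, String bucket, String key) {
        URI uri = URI.create(trimSlash(endpoint));
        String port = uri.getPort() == -1 ? "" : ":" + uri.getPort();
        return uri.getScheme() + "://" + bucket + "." + uri.getHost() + port + "/" + key;
    }

    private static String trimSlash(String s) {
        return s.endsWith("/") ? s.substring(0, s.length() - 1) : s;
    }

    public static void main(String[] args) {
        String endpoint = "http://minio:9000";   // endpoint used throughout this article
        String bucket = "spark-output";          // bucket from the storage volume
        String key = "example/object.dat";       // hypothetical object key
        System.out.println(pathStyle(endpoint, bucket, key));
        System.out.println(virtualHostedStyle(endpoint, bucket, key));
    }
}
```

The first URL keeps minio as the host, which is why every S3 configuration in this setup (StarRocks, Hive Metastore, Spark) enables path-style access.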
5. Containerized Hive Metastore deployment
5.1 docker-compose.yml
yaml
services:
  hive-metastore:
    image: apache/hive:3.1.3
    container_name: hive-metastore
    restart: unless-stopped
    environment:
      SERVICE_NAME: metastore
      DB_DRIVER: mysql
      SERVICE_OPTS: >-
        -Djavax.jdo.option.ConnectionDriverName=com.mysql.cj.jdbc.Driver
        -Djavax.jdo.option.ConnectionURL=jdbc:mysql://<HOST_IP>:3306/hive_metastore
        -Djavax.jdo.option.ConnectionUserName=hive
        -Djavax.jdo.option.ConnectionPassword=<MYSQL_PASSWORD>
        -Dhive.metastore.warehouse.dir=s3a://data-lake/warehouse
        -Dfs.s3a.endpoint=http://minio:9000
        -Dfs.s3a.access.key=<MINIO_ACCESS_KEY>
        -Dfs.s3a.secret.key=<MINIO_SECRET_KEY>
        -Dfs.s3a.path.style.access=true
    ports:
      - "9083:9083"
    networks:
      - spark-net
networks:
  spark-net:
    external: true
    name: spark-docker_spark-net
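Key points:
- Uses apache/hive:3.1.3, which is compatible with the Hive client embedded in Spark 3.5.x
- The MySQL metadata database must be created beforehand (it runs in MySQL on the host machine; <HOST_IP> is the host's address)
- warehouse.dir points to an S3 path on MinIO
5.2 Preparing the MySQL metadata database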
sql
CREATE DATABASE hive_metastore DEFAULT CHARACTER SET latin1;
CREATE USER 'hive'@'%' IDENTIFIED BY '<MYSQL_PASSWORD>';
GRANT ALL ON hive_metastore.* TO 'hive'@'%';
FLUSH PRIVILEGES;
5.3 S3A JAR dependencies
The Hive Metastore image does not ship the S3A driver by default, so hadoop-aws and aws-java-sdk-bundle have to be added manually:
bash
curl -o hadoop-aws-3.3.4.jar https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/3.3.4/hadoop-aws-3.3.4.jar
curl -o aws-java-sdk-bundle-1.12.367.jar https://repo1.maven.org/maven2/com/amazonaws/aws-java-sdk-bundle/1.12.367/aws-java-sdk-bundle-1.12.367.jar
docker cp hadoop-aws-3.3.4.jar hive-metastore:/opt/hive/lib/
docker cp aws-java-sdk-bundle-1.12.367.jar hive-metastore:/opt/hive/lib/
docker restart hive-metastore
5.4 Creating the Hive Catalog in StarRocks
sql
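-- <HOST_IP> is the Docker host's address (port 9083 is published there).
-- Inside the shared Docker network, thrift://hive-metastore:9083 (the address
-- Spark uses in 6.2) should resolve as well.
CREATE EXTERNAL CATALOG minio_catalog
PROPERTIES (
    "type" = "hive",
    "hive.metastore.uris" = "thrift://<HOST_IP>:9083",
    "aws.s3.endpoint" = "http://minio:9000",
    "aws.s3.access_key" = "<MINIO_ACCESS_KEY>",
    "aws.s3.secret_key" = "<MINIO_SECRET_KEY>",
    "aws.s3.enable_path_style_access" = "true"
);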
6. Reading from SR with Spark and writing to MinIO + Hive Metastore
6.1 Maven project configuration
Key dependencies in pom.xml:
xml
<properties>
    <java.version>11</java.version>
    <maven.compiler.source>11</maven.compiler.source>
    <maven.compiler.target>11</maven.compiler.target>
</properties>
<dependencies>
    <!-- Spark SQL (provided: supplied by the cluster) -->
    <dependency>
        <groupId>org.apache.spark</groupId>
        <artifactId>spark-sql_2.12</artifactId>
        <version>3.5.4</version>
        <scope>provided</scope>
    </dependency>
    <!-- StarRocks Connector (bundled into the JAR) -->
    <dependency>
        <groupId>com.starrocks</groupId>
        <artifactId>starrocks-spark-connector-3.5_2.12</artifactId>
        <version>1.1.3</version>
    </dependency>
    <!-- MySQL JDBC driver (bundled into the JAR) -->
    <dependency>
        <groupId>mysql</groupId>
        <artifactId>mysql-connector-java</artifactId>
        <version>8.0.28</version>
    </dependency>
    <!-- S3A (bundled into the JAR) -->
    <dependency>
        <groupId>org.apache.hadoop</groupId>
        <artifactId>hadoop-aws</artifactId>
        <version>3.3.4</version>
    </dependency>
    <dependency>
        <groupId>com.amazonaws</groupId>
        <artifactId>aws-java-sdk-bundle</artifactId>
        <version>1.12.367</version>
    </dependency>
</dependencies>
<build>
    <plugins>
        <plugin>
            <groupId>org.apache.maven.plugins</groupId>
            <artifactId>maven-shade-plugin</artifactId>
            <version>3.5.1</version>
            <executions>
                <execution>
                    <phase>package</phase>
                    <goals><goal>shade</goal></goals>
                </execution>
            </executions>
        </plugin>
    </plugins>
</build>
The compile target is set to 11 because the JVM inside the Spark containers is Java 11.
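If a job still fails with an unsupported class version error, the bytecode level of a compiled class can be checked directly. The sketch below reads the class-file header: the first four bytes are the magic number 0xCAFEBABE, followed by a two-byte minor and a two-byte major version, where major 55 corresponds to Java 11 and 61 to Java 17. `ClassFileVersion` is an illustrative helper, not part of Spark or Maven.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative helper: reports which Java release a .class file targets. */
final class ClassFileVersion {

    /** Extracts the major version from the first 8 bytes of a class file. */
    static int majorVersion(byte[] header) {
        if (header.length < 8) {
            throw new IllegalArgumentException("need at least 8 header bytes");
        }
        int magic = ((header[0] & 0xFF) << 24) | ((header[1] & 0xFF) << 16)
                | ((header[2] & 0xFF) << 8) | (header[3] & 0xFF);
        if (magic != 0xCAFEBABE) {
            throw new IllegalArgumentException("not a class file");
        }
        // Bytes 4-5 are the minor version, bytes 6-7 the major version.
        return ((header[6] & 0xFF) << 8) | (header[7] & 0xFF);
    }

    /** Maps a major version to its Java release (52 -> 8, 55 -> 11, 61 -> 17). */
    static int javaRelease(int major) {
        return major - 44;
    }

    /** True if a class with this major version can load on the given runtime release. */
    static boolean runsOn(int major, int runtimeRelease) {
        return javaRelease(major) <= runtimeRelease;
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ClassFileVersion <path-to-.class>");
            return;
        }
        try (InputStream in = Files.newInputStream(Paths.get(args[0]))) {
            int major = majorVersion(in.readNBytes(8));
            System.out.println("major=" + major + ", Java " + javaRelease(major)
                    + ", runs on Java 11: " + runsOn(major, 11));
        }
    }
}
```

Pointing it at target/classes/com/example/SparkStarRocksDemo.class after a build is one way to confirm that the compiler settings above took effect.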
6.2 Java code
java
package com.example;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SaveMode;
import org.apache.spark.sql.SparkSession;

public class SparkStarRocksDemo {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("SR-to-MinIO-Hive")
                // === MinIO S3A configuration ===
                .config("spark.hadoop.fs.s3a.endpoint", "http://minio:9000")
                .config("spark.hadoop.fs.s3a.access.key", "<MINIO_ACCESS_KEY>")
                .config("spark.hadoop.fs.s3a.secret.key", "<MINIO_SECRET_KEY>")
                .config("spark.hadoop.fs.s3a.path.style.access", "true")
                // === Hive Metastore configuration ===
                .config("hive.metastore.uris", "thrift://hive-metastore:9083")
                .config("spark.sql.warehouse.dir", "s3a://data-lake/warehouse")
                .enableHiveSupport()
                .getOrCreate();

        // Create the database (recorded in the Metastore)
        spark.sql("CREATE DATABASE IF NOT EXISTS db1");

        // Register the StarRocks table as a temporary view
        spark.sql(
                "CREATE OR REPLACE TEMPORARY VIEW dim_product " +
                "USING starrocks " +
                "OPTIONS (" +
                " 'starrocks.fe.http.url' = 'sr-fe:8030'," +
                " 'starrocks.fe.jdbc.url' = 'jdbc:mysql://sr-fe:9030'," +
                " 'starrocks.table.identifier' = 'db1.dim_product'," +
                " 'starrocks.user' = 'root'," +
                " 'starrocks.password' = ''" +
                ")"
        );

        // Read the data from StarRocks
        Dataset<Row> df = spark.sql("SELECT * FROM dim_product");

        // saveAsTable does two things:
        // 1. writes Parquet to MinIO (s3a://data-lake/warehouse/db1.db/dim_product/)
        // 2. registers the table schema in Hive Metastore
        df.write()
                .mode(SaveMode.Overwrite)
                .saveAsTable("db1.dim_product");

        System.out.println("Done.");
        spark.stop();
    }
}
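When more than one table has to be moved, the view-registration statement can be generated rather than hand-written. The following is a rough sketch under stated assumptions: `StarRocksViewSql` is a hypothetical helper, it reuses only the option keys that appear in the code above, and the quote escaping assumes Spark SQL's default backslash-escaped string literals. It only builds the SQL text; passing the result to spark.sql(...) is left to the job.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

/** Hypothetical helper: builds the CREATE TEMPORARY VIEW statement for one table. */
final class StarRocksViewSql {

    static String createViewSql(String viewName, String feHttpUrl, String feJdbcUrl,
                                String tableIdentifier, String user, String password) {
        // LinkedHashMap keeps the options in a stable, readable order.
        Map<String, String> options = new LinkedHashMap<>();
        options.put("starrocks.fe.http.url", feHttpUrl);
        options.put("starrocks.fe.jdbc.url", feJdbcUrl);
        options.put("starrocks.table.identifier", tableIdentifier);
        options.put("starrocks.user", user);
        options.put("starrocks.password", password);

        StringJoiner clause = new StringJoiner(", ", "OPTIONS (", ")");
        for (Map.Entry<String, String> option : options.entrySet()) {
            clause.add("'" + option.getKey() + "' = '" + escape(option.getValue()) + "'");
        }
        return "CREATE OR REPLACE TEMPORARY VIEW " + viewName + " USING starrocks " + clause;
    }

    /** Escapes backslashes and single quotes for use inside a single-quoted literal. */
    static String escape(String value) {
        return value.replace("\\", "\\\\").replace("'", "\\'");
    }

    public static void main(String[] args) {
        System.out.println(createViewSql(
                "dim_product", "sr-fe:8030", "jdbc:mysql://sr-fe:9030",
                "db1.dim_product", "root", ""));
    }
}
```

In the job, a loop over a list of table names could call createViewSql for each one and then run the same read and saveAsTable steps; reading the password from an environment variable instead of a literal would be a natural next step.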
6.3 Packaging and submission
bash
# 1. Package
mvn clean package -DskipTests
# 2. Copy the JAR into the Spark container
docker cp target/spark-starrocks-demo-1.0.0.jar spark-master:/tmp/
# 3. Submit the job
docker exec spark-master /opt/spark/bin/spark-submit \
  --class com.example.SparkStarRocksDemo \
  --master spark://spark-master:7077 \
  --deploy-mode client \
  /tmp/spark-starrocks-demo-1.0.0.jar
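If the submitted job fails with a ClassNotFoundException, one thing worth checking is whether the shaded JAR really contains the bundled dependencies. The sketch below lists the JAR's entries and reports which expected classes are absent. `ShadedJarCheck` is an illustrative helper; the class names it looks for are the application class from 6.2, the S3A filesystem class from hadoop-aws, and the MySQL driver class named in the Metastore configuration.

```java
import java.io.IOException;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.stream.Collectors;

/** Illustrative helper: checks that a JAR contains the classes a job needs. */
final class ShadedJarCheck {

    /** Converts a fully qualified class name to its entry path inside a JAR. */
    static String toEntryName(String className) {
        return className.replace('.', '/') + ".class";
    }

    /** Returns the required classes that have no matching entry. */
    static List<String> missingClasses(Set<String> entryNames, List<String> requiredClasses) {
        List<String> missing = new ArrayList<>();
        for (String className : requiredClasses) {
            if (!entryNames.contains(toEntryName(className))) {
                missing.add(className);
            }
        }
        return missing;
    }

    /** Reads all entry names from the JAR at the given path. */
    static Set<String> entryNames(Path jar) throws IOException {
        try (JarFile jarFile = new JarFile(jar.toFile())) {
            return jarFile.stream().map(JarEntry::getName).collect(Collectors.toSet());
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: ShadedJarCheck <path-to-jar>");
            return;
        }
        List<String> required = Arrays.asList(
                "com.example.SparkStarRocksDemo",
                "org.apache.hadoop.fs.s3a.S3AFileSystem",
                "com.mysql.cj.jdbc.Driver");
        List<String> missing = missingClasses(entryNames(Paths.get(args[0])), required);
        System.out.println(missing.isEmpty() ? "all required classes present" : "missing: " + missing);
    }
}
```

Run against target/spark-starrocks-demo-1.0.0.jar, an empty result suggests the shade plugin did its job; Spark's own classes are expected to be absent because spark-sql is declared with provided scope.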
6.4 Verification
After the Spark job finishes, verify the result from the StarRocks side:
sql
-- Switch to the Hive Catalog
SET CATALOG minio_catalog;
-- List databases
SHOW DATABASES;
-- Query the data (reads Parquet directly from MinIO)
USE db1;
SELECT COUNT(*) FROM dim_product;
SELECT * FROM dim_product LIMIT 10;
-- Cross-catalog JOIN (internal table + external table)
SELECT a.*, b.extra_info
FROM default_catalog.db1.hot_table a
JOIN minio_catalog.db1.dim_product b
    ON a.product_id = b.product_id;
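To cross-check on the MinIO side, it helps to know where the Parquet files should appear. The sketch below derives the default managed-table location from the warehouse directory, following the usual Hive layout of <warehouse>/<db>.db/<table> (tables in the default database sit directly under the warehouse). `WarehousePaths` is an illustrative helper, and the result only holds if the database was created without a custom LOCATION while this warehouse directory was in effect.

```java
import java.util.Locale;

/** Illustrative helper: predicts where a managed Hive table's files are stored. */
final class WarehousePaths {

    static String tableLocation(String warehouseDir, String database, String table) {
        String base = warehouseDir.endsWith("/")
                ? warehouseDir.substring(0, warehouseDir.length() - 1)
                : warehouseDir;
        // The Metastore stores database and table names in lower case.
        String db = database.toLowerCase(Locale.ROOT);
        String tbl = table.toLowerCase(Locale.ROOT);
        // The "default" database has no "<db>.db" directory of its own.
        String dbDir = db.equals("default") ? base : base + "/" + db + ".db";
        return dbDir + "/" + tbl;
    }

    public static void main(String[] args) {
        System.out.println(tableLocation("s3a://data-lake/warehouse", "db1", "dim_product"));
    }
}
```

For the table written in 6.2 this gives s3a://data-lake/warehouse/db1.db/dim_product, that is, the warehouse/db1.db/dim_product/ prefix inside the data-lake bucket in the MinIO console.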
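7. Key pitfalls
| Problem | Cause | Fix |
|---|---|---|
| Connector cannot reach the BE of the all-in-one SR image | The BE is registered in metadata as 127.0.0.1, which is unreachable from other containers | Split FE and BE into two containers |
| JVM version mismatch in the Spark containers | Spark runs on Java 11, but the code was compiled with Java 17 | Set the compile target to 11 in pom.xml |
| saveAsTable fails with Invalid method name: 'get_table' | The Hive 2.3.x client embedded in Spark is incompatible with the Metastore 4.0.0 API | Downgrade the Metastore to 3.1.3 |
| S3AFileSystem not found | The Metastore container lacks the S3A JARs | Copy in hadoop-aws and aws-java-sdk-bundle manually |
| Stream Load is redirected to the BE's internal IP | Docker-internal IPs are unreachable from the host | Run curl from inside the FE container |
| All rows filtered out during CSV import | CSV files generated on Windows are not UTF-8 by default | Pass encoding='utf-8' to Python's open() |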
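For the CSV encoding pitfall in the last row, a file can be checked before it is loaded. The sketch below uses a strict UTF-8 decoder that reports malformed input instead of silently replacing it. `Utf8Check` is an illustrative helper; it complements, rather than replaces, fixing the encoding where the file is produced.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

/** Illustrative helper: tells whether a byte sequence is well-formed UTF-8. */
final class Utf8Check {

    static boolean isValidUtf8(byte[] bytes) {
        // REPORT makes the decoder throw on bad input instead of substituting U+FFFD.
        CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try {
            decoder.decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: Utf8Check <path-to-csv>");
            return;
        }
        byte[] content = Files.readAllBytes(Paths.get(args[0]));
        System.out.println(isValidUtf8(content) ? "valid UTF-8" : "NOT valid UTF-8");
    }
}
```

A pure-ASCII file passes in any case, so the check is only informative for files that contain non-ASCII text such as Chinese product names.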
8. Summary
Components in the finished setup:
| Component | Container name | Ports | Notes |
|---|---|---|---|
| Spark Master | spark-master | 8180 | Web UI |
| Spark Worker x2 | spark-worker-1/2 | 8181/8182 | 4 cores, 4 GB each |
| StarRocks FE | sr-fe | 8030/9030 | HTTP/JDBC |
| StarRocks BE | sr-be | 8040/9060 | HTTP/Thrift |
| Hive Metastore | hive-metastore | 9083 | Thrift |
| MinIO | minio | 9000/9090 | S3 API / Console |
Data flow summary:
- Write path: Spark Connector reads SR → saveAsTable() → Parquet lands in MinIO and metadata in Hive Metastore
- Read path: StarRocks Hive Catalog → fetches the schema from the Metastore → reads Parquet directly from MinIO
- Shared-data internal tables: the SR Internal Catalog stores table data directly in MinIO via storage_volume