[Hive in Practice] A Survey of hive-testbench

A testbench for experimenting with Apache Hive at any data scale.

Overview

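hive-testbench is a data generator and a set of queries that let you experiment with Apache Hive at scale. The testbench lets you experience base Hive performance on large datasets and gives you an easy way to see the impact of Hive tuning parameters and advanced settings.

Generates base tables and data

Generates sample queries

Parameter tuning?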

Prerequisites

You will need:

A Hadoop 2.2 or later cluster or Sandbox.
Apache Hive.
Between 15 minutes and 2 days to generate data (depending on the Scale Factor you choose and available hardware).

In my own test, generating 2 GB of data took more than 1 hour.
If you plan to generate 1 TB or more of data, using Apache Hive 13+ to generate the data is strongly suggested.

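Install and Setup

All of these steps should be carried out on your Hadoop cluster.

Step 1: Prepare your environment.

In addition to Hadoop and Hive, make sure gcc is installed and available on your system path before you begin. If it is not installed, install it with yum or apt-get.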
```
yum -y install gcc gcc-c++
```
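Before building, it can help to confirm that every tool the later steps rely on is actually on the PATH. The following is only a sketch: `missing_tools` is a helper name made up for illustration, and the tool list (gcc, mvn, hive) is simply what this article uses, so adjust it to your environment.

```shell
# Print the name of every command in the argument list that is not on PATH.
# missing_tools is an illustrative helper, not part of hive-testbench.
missing_tools() {
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "$tool"
        fi
    done
}

# Tools this article relies on: gcc for the generators, Maven for packaging,
# and the hive CLI for loading data.
missing=$(missing_tools gcc mvn hive)
if [ -n "$missing" ]; then
    echo "Missing tools:" $missing
else
    echo "All required tools found"
fi
```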
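Step 2: Decide which test suite(s) you want to use.

hive-testbench comes with data generators and sample queries based on both the TPC-DS and TPC-H benchmarks. You can experiment with either one or both.
Step 3: Compile and package the appropriate data generator.

For TPC-DS, ./tpcds-build.sh downloads, compiles and packages the TPC-DS data generator. For TPC-H, ./tpch-build.sh does the same for the TPC-H data generator.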

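Maven is required for the build.

The TPC toolkits referenced by the repository can no longer be downloaded from the original location, so you need to obtain them elsewhere yourself. See the Makefile for details.

The build produces a platform-specific executable: dsdgen.exe on Windows and dsdgen on Linux. The binaries are not portable across platforms.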
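Step 4: Decide how much data you want to generate.

You need to decide on a "Scale Factor", which represents how much data you will generate. The Scale Factor roughly translates to gigabytes, so a Scale Factor of 100 is about 100 gigabytes, and one terabyte is Scale Factor 1000. Decide how much data you want and keep the number in mind for the next step. If you have a cluster of 4-10 nodes or just want to experiment at a small scale, Scale Factor 1000 (1 TB) is a good starting point. For large clusters, choose Scale Factor 10000 (10 TB) or more. The notion of scale factor is similar between TPC-DS and TPC-H.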
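The arithmetic behind the scale factor can be written down as two small shell helpers. This is only a sketch: `scale_for_tb` and `suggest_scale` are names made up for illustration, and the 10-node cut-off is an assumption read off the guidance above rather than a rule from hive-testbench.

```shell
# The scale factor is roughly the data size in GB, so N terabytes ~ N * 1000.
scale_for_tb() {
    echo $(( $1 * 1000 ))
}

# Suggest a starting scale factor from the node count.
# The cut-off at 10 nodes is an assumption based on the guidance above.
suggest_scale() {
    if [ "$1" -le 10 ]; then
        echo 1000
    else
        echo 10000
    fi
}

scale_for_tb 30   # prints 30000
suggest_scale 6   # prints 1000
```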

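If you want to generate a large amount of data, use Hive 13 or later. Hive 13 introduced an optimization that allows far more scalable data partitioning. Hive 12 and earlier are very likely to crash when generating more than a few hundred GB of data, and the problem is hard to debug. You can generate text or RCFile data in Hive 13 and use it with multiple versions of Hive.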
Step 5: Generate and load the data.

The scripts tpcds-setup.sh and tpch-setup.sh generate and load data for TPC-DS and TPC-H, respectively. General usage is tpcds-setup.sh scale_factor [directory] or tpch-setup.sh scale_factor [directory].

Some examples:

Build 1 TB of TPC-DS data: ./tpcds-setup.sh 1000

Build 1 TB of TPC-H data: ./tpch-setup.sh 1000

Build 100 TB of TPC-DS data: ./tpcds-setup.sh 100000

Build 30 TB of text formatted TPC-DS data: FORMAT=textfile ./tpcds-setup.sh 30000

Build 30 TB of RCFile formatted TPC-DS data: FORMAT=rcfile ./tpcds-setup.sh 30000

Also check the other parameters in the setup scripts. An important one is BUCKET_DATA.
Step 6: Run queries.

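More than 50 sample TPC-DS queries and all TPC-H queries are included for you to try. You can use hive, beeline or any SQL tool of your choice. The testbench also provides a set of suggested settings.

This example assumes you generated 1 TB of TPC-DS data in Step 5: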
```
 cd sample-queries-tpcds
 hive -i testbench.settings
 hive> use tpcds_bin_partitioned_orc_1000;
 hive> source query55.sql;
```
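Note that the database name is derived from the Scale Factor chosen in Step 4. For TPC-DS at Scale Factor 10000, the database is named tpcds_bin_partitioned_orc_10000; for TPC-H at Scale Factor 1000, it is named tpch_flat_orc_1000. You can always run show databases to get a list of available databases.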
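The naming pattern can be captured in a small helper. This is a sketch under assumptions: `testbench_db` is a made-up name, and the two patterns are taken from the database names that appear in this article, which correspond to ORC-formatted data. Other FORMAT values are likely to produce different names, so confirm with show databases.

```shell
# Build the database name the setup scripts are expected to create.
# Assumes ORC-formatted data; the name patterns come from the examples
# in this article. testbench_db is an illustrative helper.
testbench_db() {
    suite=$1
    scale=$2
    case "$suite" in
        tpcds) echo "tpcds_bin_partitioned_orc_${scale}" ;;
        tpch)  echo "tpch_flat_orc_${scale}" ;;
        *)     echo "unknown suite: $suite" >&2; return 1 ;;
    esac
}

testbench_db tpcds 1000   # prints tpcds_bin_partitioned_orc_1000
testbench_db tpch 1000    # prints tpch_flat_orc_1000
```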

Similarly, if you generated 1 TB of TPC-H data in Step 5:
```
cd sample-queries-tpch
hive -i testbench.settings
hive> use tpch_flat_orc_1000;
hive> source tpch_query1.sql;
```
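To run several queries in one go instead of sourcing them one by one, a loop around the hive CLI is one option. The sketch below uses the CLI's -i, --database and -f options; `run_queries` and the `HIVE_CMD` override are names introduced here for illustration, and the per-query log file naming is an arbitrary choice.

```shell
# Run each query file against one database and print its wall-clock time.
# HIVE_CMD defaults to hive; run_queries is an illustrative helper.
run_queries() {
    db=$1
    shift
    for q in "$@"; do
        start=$(date +%s)
        # Output of each query goes to <query file>.log in the current directory.
        if ${HIVE_CMD:-hive} -i testbench.settings --database "$db" -f "$q" >"$q.log" 2>&1; then
            status=ok
        else
            status=failed
        fi
        end=$(date +%s)
        echo "$q $status $(( end - start ))s"
    done
}

# Example (run from sample-queries-tpcds after loading Scale Factor 1000):
# run_queries tpcds_bin_partitioned_orc_1000 query*.sql
```

Timing with whole seconds is coarse, but it is usually enough to compare the effect of different settings on long-running queries.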

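TPC-DS usage example

Download tpcds-kit-2.9.0.zip and place it in the hive-testbench-hdp3\tpcds-gen directory.

Edit hive-testbench-hdp3\tpcds-gen\Makefile to disable the tpcds_kit download: the curl commands are commented out and replaced with an echo of 200.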

```
all: target/lib/dsdgen.jar target/tpcds-gen-1.0-SNAPSHOT.jar

target/tpcds-gen-1.0-SNAPSHOT.jar: $(shell find -name *.java) 
	mvn package

target/tpcds_kit.zip: tpcds_kit.zip
	mkdir -p target/
	cp tpcds_kit.zip target/tpcds_kit.zip

tpcds_kit.zip:
#   curl https://public-repo-1.hortonworks.com/hive-testbench/tpcds/README
#   curl --output tpcds_kit.zip https://public-repo-1.hortonworks.com/hive-testbench/tpcds/TPCDS_Tools.zip
	echo "200"

target/lib/dsdgen.jar: target/tools/dsdgen
	cd target/; mkdir -p lib/; ( jar cvf lib/dsdgen.jar tools/ || gjar cvf lib/dsdgen.jar tools/ )

target/tools/dsdgen: target/tpcds_kit.zip
	test -d target/tools/ || (cd target; unzip tpcds_kit.zip)
	test -d target/tools/ || (cd target; mv */tools tools)
	cd target/tools; cat ../../patches/all/*.patch | patch -p0
	cd target/tools; cat ../../patches/${MYOS}/*.patch | patch -p1
	cd target/tools; make clean; make dsdgen

clean:
	mvn clean
```

Edit hive-testbench-hdp3\tpcds-gen\pom.xml and set the hadoop-client version to match your cluster.

Run tpcds-build.sh. The build of hive-testbench-hdp3/tpcds-gen/target/tpcds-gen-1.0-SNAPSHOT.jar succeeds:

```
[INFO] Copying servlet-api-2.5.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/servlet-api-2.5.jar
[INFO] Copying guava-11.0.2.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/guava-11.0.2.jar
[INFO] Copying hadoop-mapreduce-client-common-2.9.1.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/hadoop-mapreduce-client-common-2.9.1.jar
[INFO] Copying mssql-jdbc-6.2.1.jre7.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/mssql-jdbc-6.2.1.jre7.jar
[INFO] Copying hadoop-yarn-registry-2.9.1.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/hadoop-yarn-registry-2.9.1.jar
[INFO] Copying httpcore-4.4.4.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/httpcore-4.4.4.jar
[INFO] Copying hamcrest-core-1.1.jar to /data/bigdata/software/hive-testbench-hdp3/tpcds-gen/target/lib/hamcrest-core-1.1.jar
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time:  6.214 s
[INFO] Finished at: 2025-09-05T19:42:23+08:00
[INFO] ------------------------------------------------------------------------
[WARNING] 
[WARNING] Plugin validation issues were detected in 3 plugin(s)
[WARNING] 
[WARNING]  * org.apache.maven.plugins:maven-compiler-plugin:3.10.1
[WARNING]  * org.apache.maven.plugins:maven-dependency-plugin:2.8
[WARNING]  * org.apache.maven.plugins:maven-resources-plugin:3.3.0
[WARNING] 
[WARNING] For more or less details, use 'maven.plugin.validation' property with one of the values (case insensitive): [BRIEF, DEFAULT, VERBOSE]
[WARNING]
```

Run the following on a machine where Hive is installed. Edit tpcds-setup.sh so that it uses the hive command directly:
```
HIVE="hive"
```
Generate the data and write it into the Hive database. This takes a long time, so run the shell script in the background.
```
nohup ./tpcds-setup.sh 2 /tmp  > out.log 2>&1 &
```

How it works

Core flow of tpcds-setup.sh
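1. Start.
2. Check whether tpcds-gen-1.0-SNAPSHOT.jar exists.
3. Check whether the hive command exists.
4. Read the scale factor and directory arguments.
5. Check whether the scale factor is an integer greater than 1.
6. Create the directory on HDFS.
7. Run tpcds-gen-1.0-SNAPSHOT.jar to build the data files and upload them to the HDFS directory.
8. Set the permissions of the HDFS target directory to 777.
9. Define the Hive address.
10. Load the data into Hive external tables.
11. Create the partitioned and bucketed tables.
12. Populate the tables.
13. Load the constraints.
14. Data loading finished.

If any of the checks fails, the script ends abnormally.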
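The checks at the start of this flow can be sketched in shell as follows. This is not the actual content of tpcds-setup.sh: `valid_scale` and `preflight` are helper names made up for illustration, the error messages are placeholders, and the default jar path assumes you run from the repository root.

```shell
# Validate the scale factor as the flow describes:
# it must be an integer greater than 1.
valid_scale() {
    case "$1" in
        ''|*[!0-9]*) return 1 ;;   # empty or contains a non-digit
    esac
    [ "$1" -gt 1 ]
}

# Preflight checks before generating data. The default jar path is the one
# built earlier in this article; preflight is an illustrative helper.
preflight() {
    scale=$1
    jar=${2:-tpcds-gen/target/tpcds-gen-1.0-SNAPSHOT.jar}
    [ -f "$jar" ] || { echo "Build the data generator first" >&2; return 1; }
    command -v hive >/dev/null 2>&1 || { echo "hive not found on PATH" >&2; return 1; }
    valid_scale "$scale" || { echo "Scale factor must be an integer greater than 1" >&2; return 1; }
    echo "preflight ok"
}
```

Calling `preflight 2` before launching the long-running setup script catches the cheap failures early, before any HDFS directories are created.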

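Other notes

Not all of the 99 SQL queries generated for TPC-DS are compatible; some need to be adjusted before they will run.

TPC-H usage example

Download tpch-kit-tpch_2_17_0.zip from https://github.com/gregrahn/tpch-kit and place it in the hive-testbench-hdp3\tpch-gen directory.

Edit hive-testbench-hdp3\tpch-gen\Makefile to adjust how the tpch_kit package is obtained, echoing 200 instead.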

```
MYOS=$(shell uname -s)

all: target/lib/dbgen.jar target/tpch-gen-1.0-SNAPSHOT.jar

target/tpch-gen-1.0-SNAPSHOT.jar: $(shell find -name *.java) 
	mvn package

target/tpch_kit.zip: tpch_kit.zip
# 	mkdir -p target/
# 	cp tpch_kit.zip target/tpch_kit.zip
	echo "200"

tpch_kit.zip:
	curl http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/README
	curl --output tpch_kit.zip http://dev.hortonworks.com.s3.amazonaws.com/hive-testbench/tpch/tpch_kit.zip

target/lib/dbgen.jar: target/tools/dbgen
	cd target/; mkdir -p lib/; ( jar cvf lib/dbgen.jar tools/ || gjar cvf lib/dbgen.jar tools/ )

target/tools/dbgen: target/tpch_kit.zip
	test -d target/tools/ || (cd target; unzip tpch_kit.zip -x __MACOSX/; ln -sf $$PWD/*/dbgen/ tools)
	cd target/tools; cat ../../../patches/${MYOS}/*.patch | patch -p0
	cd target/tools; make -f makefile.suite clean; make -f makefile.suite CC=gcc DATABASE=ORACLE MACHINE=LINUX WORKLOAD=TPCH

clean:
	mvn clean
```

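Edit hive-testbench-hdp3\tpch-gen\pom.xml and set the hadoop-client version to match your cluster.
Run tpch-build.sh. The build of hive-testbench-hdp3/tpch-gen/target/tpch-gen-1.0-SNAPSHOT.jar succeeds.
Generate the data and write it into the Hive database. This takes a long time, so run the shell script in the background.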
```
nohup ./tpch-setup.sh 2 /tmp  > out.log 2>&1 &
```

Other notes

Not all of the 22 SQL queries generated for TPC-H are compatible; some need to be adjusted before they will run.