Chapter 06: Performance Testing with TPC-DS

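Chapter 06: Performance Testing with TPC-DS

6.1 Set up the TPC-DS environment

6.1.1 Download the project

6.1.2 Prepare the Java build environment

6.1.3 Prepare the local build environment

6.1.4 Build the project

6.1.5 Generate test data and tables

6.2 Run the TPC-DS test

6.2.1 Write the submission script

6.2.2 Run the script to perform the TPC-DS test

6.3 SPARK2/SPARK3 performance test results at the 5 TB data scale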

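Chapter 06: Performance Testing with TPC-DS

Introduction to TPC-DS

TPC-DS uses multidimensional data models such as star and snowflake schemas. It contains 7 fact tables and 17 dimension tables, with an average of 18 columns per table. Its workload consists of 99 SQL queries covering the core of SQL:1999 and SQL:2003 as well as OLAP. The test suite covers complex applications over large data sets, including statistics, report generation, online queries, and data mining. The test data and values are skewed, consistent with real-world data.

TPC-DS is therefore a test suite that is very close to real-world scenarios, and also a fairly demanding one.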

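6.1 Set up the TPC-DS environment

When we tune Spark SQL parameters, does performance actually improve? And when we upgrade from Hive to Spark SQL, how much do we gain? Answering these questions requires a benchmark tool. Here we set up the build environment on a local Mac. Because the macOS environment differs from Linux in a few places, some modifications are needed. On Linux the build is simpler: the project compiles without any code changes.

Note: hive-testbench is an open-source big data SQL performance testing tool from Hortonworks. It can run TPC-DS, TPC-H, and similar benchmarks against Hive and Spark SQL.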

6.1.1 Download the project

Download the hive-testbench project from GitHub to your local machine.

6.1.2 Prepare the Java build environment

Building hive-testbench requires Java 1.8 and Maven, with the corresponding environment variables configured.
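Before starting the build, it can help to confirm that the required tools are on PATH. The following is a minimal Python 3 sketch, not part of hive-testbench. The function name find_missing_tools and the tool list are assumptions: the text above names only Java and Maven, while make and gcc are added here because the later steps run make on C sources.

```python
import shutil

# Tools assumed to be needed for the build. Only java and mvn come from the
# text above; make and gcc are an assumption based on the later make step.
REQUIRED_TOOLS = ["java", "mvn", "make", "gcc"]


def find_missing_tools(tools=REQUIRED_TOOLS, which=shutil.which):
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = find_missing_tools()
    if missing:
        print("Missing build tools: " + ", ".join(missing))
    else:
        print("All build tools found on PATH")
```

The `which` parameter exists only so the lookup can be replaced in tests; in normal use the default shutil.which is enough.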

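6.1.3 Prepare the local build environment

1) Replace values.h: macOS does not provide values.h, so in porting.h swap the #include <values.h> line for the two includes below.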

vi ./hive-testbench-hdp3/tpcds-gen/target/tools/porting.h

#include <limits.h>

#include <float.h>

2) Replace malloc.h with sys/malloc.h

cd ./hive-testbench-hdp3/tpcds-gen

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/w_call_center.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/permute.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/dist.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/dcgram.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/date.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/w_household_demographics.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/dcomp.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/query_handler.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/misc.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/StringBuffer.c

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/tokenizer.l

sed -i "" 's/malloc.h/sys\/malloc.h/' ./target/tools/decimal.c
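The twelve sed commands above can also be sketched as one loop in Python. This is an illustrative alternative, not part of the project. The names TOOL_FILES and patch_malloc_includes are hypothetical, and the sketch assumes the headers are included as <malloc.h>. Unlike the sed commands, which rewrite any "malloc.h" on a line, it only rewrites that exact include form, so running it twice does not produce sys/sys/malloc.h.

```python
from pathlib import Path

# The twelve files patched by the sed commands above.
TOOL_FILES = [
    "w_call_center.c", "permute.c", "dist.c", "dcgram.c", "date.c",
    "w_household_demographics.c", "dcomp.c", "query_handler.c", "misc.c",
    "StringBuffer.c", "tokenizer.l", "decimal.c",
]


def patch_malloc_includes(tools_dir, filenames=TOOL_FILES):
    """Rewrite <malloc.h> to <sys/malloc.h>; return the names of changed files."""
    changed = []
    for name in filenames:
        path = Path(tools_dir) / name
        if not path.is_file():
            continue  # skip files that are not present
        data = path.read_bytes()  # bytes keep encoding and line endings intact
        patched = data.replace(b"<malloc.h>", b"<sys/malloc.h>")
        if patched != data:
            path.write_bytes(patched)
            changed.append(name)
    return changed


if __name__ == "__main__":
    # Path taken from the commands above; adjust to your checkout.
    print(patch_malloc_includes("./hive-testbench-hdp3/tpcds-gen/target/tools"))
```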

3) Define MAXINT

vi ./hive-testbench-hdp3/tpcds-gen/target/tools/genrand.c

After line 55, add: #define MAXINT INT_MAX


vi ./hive-testbench-hdp3/tpcds-gen/target/tools/nulls.c

After line 40, add: #define MAXINT INT_MAX


4) Modify the source code to shorten the data directory path; otherwise the run fails on macOS.

cd ./hive-testbench-hdp3/tpcds-gen/src/main/java/org/notmysock/tpcds/

vi GenTable.java

Modify the map method at line 171 as follows (the commented-out lines are the original code):

// cmd[i] = (new File(".")).getAbsolutePath();

cmd[i] = (new File(".")).getAbsoluteFile().getParentFile().getParentFile().getAbsolutePath();

//File cwd = new File(".");

File cwd = new File( (new File(".")).getAbsoluteFile().getParentFile().getParentFile().getAbsolutePath() );

References:

https://www.baifachuan.com/posts/34f97a60.html

https://www.baifachuan.com/posts/3696fe4b.html

http://nebofeng.com/2022/12/08/tpcds-hive-testbench%e8%bf%90%e8%a1%8c%e6%8a%a5%e9%94%99status-139%e7%9a%84%e8%a7%a3%e5%86%b3%e6%96%b9%e6%b3%95/

6.1.4 Build the project

Change into hive-testbench-hdp3/tpcds-gen, the directory patched in the steps above, and run make.


If you need to rebuild from scratch, delete the target directory first.

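6.1.5 Generate test data and tables

Run ./tpcds-setup.sh followed by a number. The number is the data scale in GB.

For example, after ./tpcds-setup.sh 10 completes successfully, a tpcds_text_10 database is created in Hive and the data produced by the MapReduce job is loaded into Hive tables. The data can then be queried with Spark SQL.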

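6.2 Run the TPC-DS test

6.2.1 Write the submission script

We use a Python script to submit Spark SQL jobs in batch. The script calls spark-sql -f to execute each TPC-DS SQL file. When comparing the performance of two engines, apply the controlled-variable method: start both test commands at the same time so that both runs see the same cluster conditions.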

import multiprocessing
import os
import sys

# testsql(filepath, sparkSql, filedir) is called below but is not shown in this listing.
def main(sparkVersion, concurrentsize, dbname):
    sparkSql = " "
    if "spark" == sparkVersion:
        sparkSql = "spark-sql --hivevar DB=" + dbname + " -i settings/init2.sql --queue default --conf spark.sql.crossJoin.enabled=true --executor-memory 3G --conf spark.dynamicAllocation.maxExecutors=5 --name "
    # else:
    #     sparkSql = "/usr/hdp/3.1.4.0-315/spark3/bin/spark-sql --conf spark.sql.hive.cnotallow=false --conf spark.sql.hive.metastore.version=2.3.9 --hivevar DB=" + dbname + " -i settings/init2.sql --queue default --conf spark.sql.crossJoin.enabled=true --executor-memory 3G --conf spark.dynamicAllocation.maxExecutors=5 --name "
    # Run up to concurrentsize queries in parallel.
    pool = multiprocessing.Pool(int(concurrentsize))
    filedir = "spark-queries-tpcds/"
    filearr = os.listdir(filedir)
    for filepath in filearr:
        # testsql(filepath, sparkSql, filedir)
        pool.apply_async(testsql, (filepath, sparkSql, filedir,))
    pool.close()
    pool.join()

if __name__ == "__main__":
    sparkVersion = sys.argv[1]
    concurrentsize = sys.argv[2]
    dbname = sys.argv[3]
    main(sparkVersion, concurrentsize, dbname)

6.2.2 Run the script to perform the TPC-DS test

Run test.py to start the test against the Hive database given as the third argument:

nohup python2.7 test.py spark 1 tpcds_text_10 > log/test.log 2>&1 &

Filter the log on the keyword 耗时 (elapsed time) to get the execution time of each SQL file:

cat log/test.log | grep "耗时"

sql:q18.sql,耗时:70.207 seconds

sql:q7.sql,耗时:87.11 seconds

sql:q6.sql,耗时:73.565 seconds

sql:q19.sql,耗时:68.82 seconds

……
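To compare runs, the filtered lines can be parsed into numbers. The sketch below assumes the log format shown above ("sql:<file>,耗时:<seconds> seconds"); the names LINE_RE and parse_timings are hypothetical, and the sample lines are the four entries from the output above.

```python
# -*- coding: utf-8 -*-
import re

# Matches lines such as "sql:q18.sql,耗时:70.207 seconds".
LINE_RE = re.compile(r"^sql:(?P<name>[^,]+),耗时:(?P<seconds>[0-9.]+) seconds")


def parse_timings(lines):
    """Return {sql file name: seconds} for every line matching the log format."""
    timings = {}
    for line in lines:
        match = LINE_RE.match(line.strip())
        if match:
            timings[match.group("name")] = float(match.group("seconds"))
    return timings


sample = [
    "sql:q18.sql,耗时:70.207 seconds",
    "sql:q7.sql,耗时:87.11 seconds",
    "sql:q6.sql,耗时:73.565 seconds",
    "sql:q19.sql,耗时:68.82 seconds",
]
timings = parse_timings(sample)

# List the queries from slowest to fastest.
slowest = sorted(timings.items(), key=lambda item: item[1], reverse=True)
for name, seconds in slowest:
    print(name, seconds)
```

For a real run, the lines can be read from log/test.log instead of the sample list.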

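6.3 SPARK2/SPARK3 performance test results at the 5 TB data scale

This is the report from our TPC-DS test in a production environment. The data scale is 5 TB, the big data cluster has 1,000 machines, the Spark SQL parameters are the same for both runs, and the Spark versions are 2.3.2 and 3.2.0.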


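The TPC-DS results show that spark3 reduces execution time by 41% on average compared with spark2. In other words, a SQL statement that takes 100 seconds on spark2 takes only 59 seconds on spark3.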
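The arithmetic behind this comparison can be checked with a few lines of Python. This is only a worked example of the 41% figure above; the function names are hypothetical, and the speedup factor is derived here rather than taken from the report.

```python
def time_after_reduction(baseline_seconds, reduction):
    """Execution time left after cutting `reduction` (a fraction) of the baseline."""
    return baseline_seconds * (1.0 - reduction)


def speedup(reduction):
    """Speedup factor equivalent to a given fractional time reduction."""
    return 1.0 / (1.0 - reduction)


spark3_seconds = time_after_reduction(100.0, 0.41)
print(round(spark3_seconds, 2))   # 59.0
print(round(speedup(0.41), 2))    # 1.69
```

A 41% reduction in execution time therefore corresponds to roughly a 1.69x speedup.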

Source: the video course "Spark SQL性能优化" (Spark SQL Performance Optimization)

Link:

https://edu.51cto.com/course/34516.html
