Spark Setup (3) - TPCH and TPCDS benchmarks

This blog introduces the steps to set up the TPCH and TPCDS benchmarks in a local Spark environment on macOS.

  • Deploy TPCH and TPCDS in Spark – an example with local Spark/HDFS/Hive
  • Generate queries for TPCH and TPCDS in SparkSQL syntax

Get Code Repository and Package

Get the packaged jar file from my repository named spark-stage-tuning for the TPCH and TPCDS deployment, where I have prepared automated scripts to generate and run benchmarks.

git clone git@github.com:Angryrou/spark-stage-tuning.git
cd spark-stage-tuning
sbt package

Benchmark-kit

The tpch-kit and tpcds-kit are the official libraries for generating TPCH and TPCDS datasets for benchmark experiments. However, they were originally designed for single-node SQL systems, so they need to be customized for a big data system. Please check my customized versions of tpch-kit and tpcds-kit.

TPCH

Here are the steps from downloading the customized tpch-kit to running the first TPCH query in the local Spark cluster.

Get tpch-kit
# ----------------
# TPCH
cd benchmark-res/script
bash 2.get-spark-tpch-kit-local.sh MACOS
# OR - from scratch    
mkdir -p benchmark-res/dataset-gen
cd benchmark-res/dataset-gen
git clone --branch spark-tpch-kit-gen --depth 1 git@github.com:Angryrou/tpch-kit.git
cd tpch-kit/dbgen
make MACHINE=MACOS DATABASE=POSTGRESQL
cd .. # to tpch-kit
bash validate_sparksql_gen.sh # there should be no output.
echo "Done"  
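Optionally, you can sanity-check the compiled dbgen by generating one small table directly. The flags below are standard tpch-kit options (-T L selects lineitem, -f forces overwrite); the Spark job in the next step is what actually populates the warehouse, so this is only a smoke test:

```shell
# smoke test: generate only the lineitem table at scale factor 1
cd dbgen
export DSS_CONFIG=${PWD}      # dbgen reads dists.dss from DSS_CONFIG
./dbgen -f -s 1 -T L
head -n 2 lineitem.tbl        # pipe-delimited rows
cd ..
```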
Generate TPCH dataset
# ----------------
# TPCH
DIR=/Users/chenghao/ResearchHub/repos/spark-stage-tuning
SPARK_PATH=/Users/chenghao/ResearchHub/repos/spark/dist
SF=1
${SPARK_PATH}/bin/spark-submit \
--master "local[*]" \
--deploy-mode client \
--class edu.polytechnique.cedar.spark.benchmark.SetBenchmark \
--name set-benchmark \
--conf spark.driver.cores=5 \
--conf spark.driver.memory=4g \
--conf spark.default.parallelism=40 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.parquet.compression.codec=snappy \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.driver.extraClassPath=file://${DIR}/benchmark-res/libs/mysql-connector-j-8.0.33.jar \
--jars file://${SPARK_PATH}/examples/jars/scopt_2.12-3.7.1.jar,file://${DIR}/benchmark-res/libs/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
${DIR}/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
-b TPCH -d ${DIR}/benchmark-res/dataset-gen/tpch-kit -s $SF -l hdfs://localhost:8020/user/spark_benchmark
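After the job completes, the generated tables should be visible under the warehouse path given via -l. A quick check, assuming HDFS is running on localhost:8020 as above:

```shell
# list the benchmark warehouse; each TPCH table should appear as a directory of parquet files
hdfs dfs -ls -R hdfs://localhost:8020/user/spark_benchmark | head -n 20
```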
Locate TPCH queries

As mentioned, tpch-kit is not designed for Spark, so I adjusted the query syntax to produce a Spark SQL version for execution. The default template queries are located at spark-stage-tuning/benchmark-res/dataset-gen/tpch-kit/groundtruth_spark_queries.

To generate more Spark SQL queries from the 22 templates, use the ./dbgen/spark_wrapper.sh script:

cd spark-stage-tuning/benchmark-res/dataset-gen/tpch-kit
export DSS_CONFIG=${PWD}/dbgen
export DSS_QUERY=${DSS_CONFIG}/queries_sparksql
OUTPATH=./spark-sqls
SF=1 # scale factor
t=1 # query template id (22 in total)
i=1 # query template random seed
mkdir -p ${OUTPATH}/${t}
dbgen/spark_wrapper.sh "qgen -r ${i} -s ${SF} ${t}" ${OUTPATH}/${t}/${t}-${i}.sql
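The single-query command generalizes naturally to a loop. A sketch that generates N variants for each of the 22 templates, using the same qgen flags as above (N=3 is an arbitrary choice):

```shell
export DSS_CONFIG=${PWD}/dbgen
export DSS_QUERY=${DSS_CONFIG}/queries_sparksql
OUTPATH=./spark-sqls
SF=1
N=3                              # variants per template
for t in $(seq 1 22); do         # 22 TPCH templates
  mkdir -p ${OUTPATH}/${t}
  for i in $(seq 1 ${N}); do     # seed i produces variant ${t}-${i}.sql
    dbgen/spark_wrapper.sh "qgen -r ${i} -s ${SF} ${t}" ${OUTPATH}/${t}/${t}-${i}.sql
  done
done
```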
Run TPCH Q1
DIR=/Users/chenghao/ResearchHub/repos/spark-stage-tuning
SPARK_PATH=/Users/chenghao/ResearchHub/repos/spark/dist
BM=TPCH
SF=1
t=1 # query template id (22 in total)
i=1 # query template random seed
${SPARK_PATH}/bin/spark-submit \
--master "local[*]" \
--deploy-mode client \
--class edu.polytechnique.cedar.spark.benchmark.RunTemplateQuery \
--name ${BM}-${t}-${i} \
--conf spark.default.parallelism=40 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.parquet.compression.codec=snappy \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.driver.extraClassPath=file://${DIR}/benchmark-res/libs/mysql-connector-j-8.0.33.jar \
--jars file://${SPARK_PATH}/examples/jars/scopt_2.12-3.7.1.jar \
${DIR}/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
-b ${BM} -t $t -q $i -s $SF -l ${DIR}/benchmark-res/dataset-gen/tpch-kit/spark-sqls -d false
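To run the whole workload rather than a single query, the spark-submit call can be wrapped in a loop over template ids. A sketch assuming one variant per template (i=1) and reusing DIR, SPARK_PATH, BM, and SF from above:

```shell
# run TPCH templates 1..22 sequentially, one spark-submit per query
i=1
for t in $(seq 1 22); do
  ${SPARK_PATH}/bin/spark-submit \
  --master "local[*]" \
  --deploy-mode client \
  --class edu.polytechnique.cedar.spark.benchmark.RunTemplateQuery \
  --name ${BM}-${t}-${i} \
  --conf spark.default.parallelism=40 \
  --conf spark.sql.adaptive.enabled=true \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  --conf spark.driver.extraClassPath=file://${DIR}/benchmark-res/libs/mysql-connector-j-8.0.33.jar \
  --jars file://${SPARK_PATH}/examples/jars/scopt_2.12-3.7.1.jar \
  ${DIR}/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
  -b ${BM} -t $t -q $i -s $SF -l ${DIR}/benchmark-res/dataset-gen/tpch-kit/spark-sqls -d false
done
```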

TPCDS

Here are the steps from downloading the customized tpcds-kit to running the first TPCDS query in the local Spark cluster.

Get tpcds-kit
# ----------------
# TPCDS
cd benchmark-res/script
bash 2.get-spark-tpcds-kit-local.sh MACOS
# OR - from scratch
mkdir -p benchmark-res/dataset-gen
cd benchmark-res/dataset-gen
git clone --depth 1 git@github.com:Angryrou/tpcds-kit.git
cd tpcds-kit/tools
make OS=MACOS
cd .. # to tpcds-kit
bash validate_sparksql_gen.sh # there should be no output.
echo "Done"  
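As with dbgen, you can sanity-check the compiled dsdgen by generating one small table directly. The flags below are standard tpcds-kit options (the output directory /tmp/tpcds-check is an arbitrary choice); the Spark job in the next step is what actually populates the warehouse:

```shell
# smoke test: generate a single small table with the compiled dsdgen
cd tools
mkdir -p /tmp/tpcds-check
./dsdgen -SCALE 1 -TABLE call_center -DIR /tmp/tpcds-check -FORCE Y
head -n 1 /tmp/tpcds-check/call_center.dat   # pipe-delimited row
cd ..
```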
Generate TPCDS dataset
# ----------------
# TPCDS
DIR=/Users/chenghao/ResearchHub/repos/spark-stage-tuning
SPARK_PATH=/Users/chenghao/ResearchHub/repos/spark/dist
SF=1
${SPARK_PATH}/bin/spark-submit \
--master "local[*]" \
--deploy-mode client \
--class edu.polytechnique.cedar.spark.benchmark.SetBenchmark \
--name set-benchmark \
--conf spark.driver.cores=5 \
--conf spark.driver.memory=4g \
--conf spark.default.parallelism=40 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.parquet.compression.codec=snappy \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.driver.extraClassPath=file://${DIR}/benchmark-res/libs/mysql-connector-j-8.0.33.jar \
--jars file://${SPARK_PATH}/examples/jars/scopt_2.12-3.7.1.jar,file://${DIR}/benchmark-res/libs/spark-sql-perf_2.12-0.5.1-SNAPSHOT.jar \
${DIR}/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
-b TPCDS -d ${DIR}/benchmark-res/dataset-gen/tpcds-kit -s $SF -l hdfs://localhost:8020/user/spark_benchmark
Locate TPCDS queries

For the same reason as with tpch-kit, I adjusted the query syntax in tpcds-kit to produce a Spark SQL version for execution. The default template queries are located at spark-stage-tuning/benchmark-res/dataset-gen/tpcds-kit/groundtruth_spark_queries_adjusted.

To generate more Spark SQL queries from the 103 templates, use dsqgen under the tools directory:

cd spark-stage-tuning/benchmark-res/dataset-gen/tpcds-kit
KITPATH=${PWD}
OUTPATH=${KITPATH}/spark-sqls
SF=1 # scale factor
t=1 # query template id (103 choices in total)
QPT=10 # query per template
mkdir -p ${OUTPATH}/${t}
cd tools
./dsqgen \
-DIRECTORY ../query_templates_sparksql \
-TEMPLATE query${t}.tpl \
-SCALE $SF \
-DIALECT spark \
-VERBOSE Y \
-STREAMS $QPT \
-OUTPUT_DIR ../spark-sqls/${t}
for ((i = 1; i <= QPT; i++)); do
    mv ${OUTPATH}/${t}/query_$((i-1)).sql ${OUTPATH}/${t}/${t}-${i}.sql
done
cd ..
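To cover every template rather than just one, loop over the .tpl files; deriving the id from the filename handles any gaps in the numbering. A sketch, reusing OUTPATH, SF, and QPT from above:

```shell
cd tools
for tpl in ../query_templates_sparksql/query*.tpl; do
  t=$(basename ${tpl} .tpl)          # e.g. query5
  t=${t#query}                       # strip the prefix, keep the numeric id
  mkdir -p ${OUTPATH}/${t}
  ./dsqgen \
  -DIRECTORY ../query_templates_sparksql \
  -TEMPLATE query${t}.tpl \
  -SCALE $SF \
  -DIALECT spark \
  -STREAMS $QPT \
  -OUTPUT_DIR ${OUTPATH}/${t}
  for ((i = 1; i <= QPT; i++)); do   # rename 0-based query_*.sql to 1-based ${t}-${i}.sql
    mv ${OUTPATH}/${t}/query_$((i-1)).sql ${OUTPATH}/${t}/${t}-${i}.sql
  done
done
cd ..
```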
Run TPCDS Q1
DIR=/Users/chenghao/ResearchHub/repos/spark-stage-tuning
SPARK_PATH=/Users/chenghao/ResearchHub/repos/spark/dist
BM=TPCDS
SF=1
t=1 # query template id
i=1 # query template random seed
${SPARK_PATH}/bin/spark-submit \
--master "local[*]" \
--deploy-mode client \
--class edu.polytechnique.cedar.spark.benchmark.RunTemplateQuery \
--name ${BM}-${t}-${i} \
--conf spark.default.parallelism=40 \
--conf spark.sql.adaptive.enabled=true \
--conf spark.sql.parquet.compression.codec=snappy \
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
--conf spark.kryoserializer.buffer.max=512m \
--conf spark.driver.extraClassPath=file://${DIR}/benchmark-res/libs/mysql-connector-j-8.0.33.jar \
--jars file://${SPARK_PATH}/examples/jars/scopt_2.12-3.7.1.jar \
${DIR}/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
-b $BM -t $t -q $i -s $SF -l ${DIR}/benchmark-res/dataset-gen/tpcds-kit/spark-sqls -d false
Chenghao Lyu
Ph.D. Candidate

My research interests include big data analytics systems, machine learning and multi-objective optimizations.