Spark Setup (1) - building and deploying Apache Spark from scratch

Last updated on Apr 28, 2023

To use features on Spark’s master branch that have not yet been merged into an official release (at the time of writing, the latest release is 3.4.0), I built Apache Spark from source on an M1/M2 chip. This post records the steps.

Building from Scratch

set up environment on M1
- git v2.4.1
- JDK 8 (for M1/M2, try Zulu Community 8)
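Before building, it can help to confirm that the shell actually picks up JDK 8. The snippet below is a minimal sketch: `is_java8` is a helper introduced here for illustration, and it only inspects the first line printed by `java -version`.

```shell
# Succeed if the given `java -version` banner reports a 1.8.x runtime.
is_java8() {
  case "$1" in
    *'version "1.8.'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Report the local toolchain; print warnings instead of aborting.
if command -v git >/dev/null 2>&1; then
  git --version
else
  echo "warning: git not found on PATH"
fi

if command -v java >/dev/null 2>&1; then
  # `java -version` writes its banner to stderr.
  banner="$(java -version 2>&1 | head -n 1)"
  if is_java8 "$banner"; then
    echo "JDK 8 found: $banner"
  else
    echo "warning: expected JDK 8, found: $banner"
  fi
else
  echo "warning: java not found on PATH"
fi
```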

clone the repository and check out the commit of the PR for SPARK-42963, which has passed all the checks in the GitHub Actions workflow.

# use --filter=tree:0 (a treeless partial clone) to reduce the size of the download
git clone --filter=tree:0 git@github.com:apache/spark.git
cd spark
git checkout 3bc66da6680844cc08cbc181a51d3a8d988b41bb
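To double-check that the working tree really sits on the intended commit before a long build, something like the following could be run from inside the clone. It is a sketch only; `at_commit` is a hypothetical helper, not part of Spark or git.

```shell
EXPECTED=3bc66da6680844cc08cbc181a51d3a8d988b41bb

# Succeed only when the repository in directory $1 has commit $2 checked out.
at_commit() {
  [ "$(git -C "$1" rev-parse HEAD 2>/dev/null)" = "$2" ]
}

if at_commit . "$EXPECTED"; then
  echo "HEAD is at the expected commit"
else
  echo "HEAD is not at $EXPECTED (or this is not a git repository)"
fi
```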

build a runnable distribution against Hadoop 3.3.0 (for deployment on a cluster)

export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
./dev/make-distribution.sh --name stuning-spark --pip --tgz -Dhadoop.version=3.3.0 -Phive -Phive-thriftserver -Pyarn
# to build for local use only (no distribution tarball):
# ./build/mvn -DskipTests clean package

get the customized release at ./spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz
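The tarball name follows the pattern `spark-<version>-bin-<name>.tgz`, where `<name>` is the value passed to `--name`. A small sketch for locating and peeking into the archive, assuming it is run from the Spark source root (`dist_tarball` is a helper made up for this example):

```shell
# Compose the tarball name produced for a given Spark version and --name value.
dist_tarball() {
  printf 'spark-%s-bin-%s.tgz\n' "$1" "$2"
}

tarball="$(dist_tarball 3.5.0-SNAPSHOT stuning-spark)"
if [ -f "$tarball" ]; then
  # List the first few entries without extracting anything.
  tar -tzf "$tarball" | head -n 5
else
  echo "not found: $tarball (run this from the Spark source root)"
fi
```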

Deploying

An example of deploying the customized Spark release to a cluster of 6 nodes, each with:

2 x Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
32 cores, 754G memory
CentOS Linux 7 (Core)
JDK 8 “1.8.0_211”

Steps:

put the customized release under the same directory on all 6 nodes

# on each worker node
scp hex1@node1:~/chenghao/spark/spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz ~/chenghao
tar -xvzf ~/chenghao/spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz -C ~/chenghao

# optional: symlink the Spark directory to ~/spark
ln -s ~/chenghao/spark-3.5.0-SNAPSHOT-bin-stuning-spark ~/spark
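Instead of logging in to each worker, the same steps could be driven from node1 in a loop. The sketch below only prints the commands (a dry run) so they can be reviewed first. The worker hostnames `node2` to `node6` are placeholders, `deploy_cmds` is a hypothetical helper, and `~/chenghao` is assumed to exist on every worker.

```shell
# Placeholder worker hostnames; replace with the real ones.
WORKERS="node2 node3 node4 node5 node6"
TARBALL=spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz

# Print (do not run) the commands that copy, unpack, and link the release on node $1.
deploy_cmds() {
  echo "scp ~/chenghao/spark/$TARBALL $1:chenghao/"
  # ln -sfn replaces an existing ~/spark link, so the step can be repeated.
  echo "ssh $1 'tar -xzf ~/chenghao/$TARBALL -C ~/chenghao && ln -sfn ~/chenghao/${TARBALL%.tgz} ~/spark'"
}

for w in $WORKERS; do
  deploy_cmds "$w"
done
```

Once the printed commands look right, the output can be piped to `sh` to execute them.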

set up the common configurations (more details can be found on the internet)
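As one example of such a configuration, the history server can only show applications whose event logs it is able to read, so event logging usually needs to be enabled. The sketch below writes a minimal `spark-defaults.conf` into a scratch directory. The log location `hdfs:///tmp/spark-events` is a placeholder rather than the setting used on this cluster, and `write_spark_defaults` is a helper introduced for illustration.

```shell
# Write a minimal spark-defaults.conf into directory $1, pointing both the
# applications and the history server at event-log directory $2.
write_spark_defaults() {
  mkdir -p "$1" &&
  cat > "$1/spark-defaults.conf" <<EOF
spark.eventLog.enabled           true
spark.eventLog.dir               $2
spark.history.fs.logDirectory    $2
EOF
}

# Dry run into a scratch directory; on the cluster the target would be ~/spark/conf.
conf_dir="${TMPDIR:-/tmp}/spark-conf-example"
write_spark_defaults "$conf_dir" "hdfs:///tmp/spark-events"
cat "$conf_dir/spark-defaults.conf"
```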

start the cluster

~/spark/sbin/start-history-server.sh
# start the Hive metastore service in the background
nohup hive --service metastore &
bash ~/spark/sbin/start-thriftserver.sh
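A quick smoke test for the Thrift server is to connect with the `beeline` client that ships in the distribution. This is a sketch under assumptions: it assumes the server listens on the default port 10000 on localhost with no authentication configured, and `jdbc_url` is a hypothetical helper. The server may need a short while after startup before it accepts connections.

```shell
# Build a HiveServer2-style JDBC URL; defaults to localhost and port 10000.
jdbc_url() {
  echo "jdbc:hive2://${1:-localhost}:${2:-10000}"
}

if [ -x "$HOME/spark/bin/beeline" ]; then
  # Run a single statement and exit.
  "$HOME/spark/bin/beeline" -u "$(jdbc_url)" -n "$(whoami)" -e 'SHOW DATABASES;'
else
  echo "beeline not found under ~/spark/bin; skipping the check"
fi
```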

done.

Chenghao Lyu

Ph.D.