Spark Setup (1) - building and deploying Apache Spark from scratch
To use new features on Spark’s master branch that have not yet been merged into an official release (at the time of writing, the latest release is 3.4.0), I recorded my journey of building Apache Spark from scratch on an M1/M2 Mac.
Building from Scratch
- set up the environment on M1
  - git v2.4.1
  - JDK 8 (for M1/M2, try Zulu Community 8)
- clone the repository and check out the target commit for the PR SPARK-42963, which has passed all the checks in the workflow actions

  ```bash
  # use --filter to reduce the size of the log history
  git clone --filter=tree:0 git@github.com:apache/spark.git
  cd spark
  git checkout 3bc66da6680844cc08cbc181a51d3a8d988b41bb
  ```
- build a runnable distribution with pre-built Hadoop 3.3.0 (for deployment over a cluster)

  ```bash
  # give Maven enough stack, heap, and code-cache space
  export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
  ./dev/make-distribution.sh --name stuning-spark --pip --tgz \
    -Dhadoop.version=3.3.0 -Phive -Phive-thriftserver -Pyarn
  # for a local-only build:
  # ./build/mvn -DskipTests clean package
  ```
- get the customized release at `./spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz`
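For reference, `make-distribution.sh` assembles the archive name from the project version in the Maven build and the value passed to `--name`, so the output path can be derived ahead of time. A small sketch:

```shell
# sketch: how the tarball name is assembled
# VERSION comes from the Maven build; NAME is the value passed to --name
VERSION=3.5.0-SNAPSHOT
NAME=stuning-spark
TARBALL="spark-${VERSION}-bin-${NAME}.tgz"
echo "$TARBALL"   # spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz
```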
Deploying
An example of deploying the customized Spark release to a cluster of 6 nodes, each with:
- 2 x Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
- 32 cores, 754 GB memory
- CentOS Linux 7 (Core)
- JDK 8 (1.8.0_211)
Steps:
- put the customized release under the same directory on each of the 6 nodes

  ```bash
  # on each worker node
  scp hex1@node1:~/chenghao/spark/spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz ~/chenghao
  cd ~/chenghao && tar -xvzf spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz
  # optional: symlink the Spark directory to ~/spark
  ln -s ~/chenghao/spark-3.5.0-SNAPSHOT-bin-stuning-spark ~/spark
  ```
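Rather than repeating the commands on every node by hand, a loop over the worker hostnames can drive the rollout from one machine. This is a sketch: the hostnames `node2`..`node6` are placeholders for the actual cluster, and the `echo` makes it a dry run.

```shell
# placeholder hostnames: replace node2..node6 with the real worker nodes
TARBALL=spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz
for host in node2 node3 node4 node5 node6; do
  # drop the leading 'echo' to actually execute the remote command
  echo ssh "$host" "scp hex1@node1:~/chenghao/spark/$TARBALL ~/chenghao && cd ~/chenghao && tar -xzf $TARBALL && ln -s ~/chenghao/${TARBALL%.tgz} ~/spark"
done
```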
- set up the common configurations (more details can be found on the internet)
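As a minimal sketch, assuming a standalone deployment sized to the nodes above (32 cores, ~754 GB per node), `conf/spark-env.sh` on every node might look like the following. The master hostname `node1`, the worker memory cap, and the `JAVA_HOME` path are assumptions to adjust for your site:

```shell
# conf/spark-env.sh -- sketch for a standalone cluster; adjust to your site
export SPARK_MASTER_HOST=node1              # assumption: node1 runs the master
export SPARK_WORKER_CORES=32                # one worker using all 32 cores
export SPARK_WORKER_MEMORY=600g             # leave headroom out of the 754 GB
export JAVA_HOME=/usr/lib/jvm/java-1.8.0    # assumption: JDK 8 install path
```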
- start the cluster

  ```bash
  ~/spark/sbin/start-history-server.sh
  # to enable the hive-metastore
  nohup hive --service metastore &
  bash ~/spark/sbin/start-thriftserver.sh
  ```
- done.