Spark Setup (1) - building and deploying Apache Spark from scratch

Some features on Spark’s master branch have not yet been merged into an official release (as of writing, the latest release is 3.4.0). To use them, I built Apache Spark from scratch on an M1/M2 Mac, and this post records the process.

Building from Scratch

  1. set up the environment on M1

  2. clone the repository and check out the commit for the PR SPARK-42963, which has passed all the checks in the GitHub Actions workflows.

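    # --filter=tree:0 makes a treeless clone: trees and blobs are fetched on demand,
    # which shrinks the initial download (the commit history itself is still complete)
    git clone --filter=tree:0 git@github.com:apache/spark.git
    cd spark
    git checkout 3bc66da6680844cc08cbc181a51d3a8d988b41bb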
    
  3. build a runnable distribution against Hadoop 3.3.0 (for deployment on a cluster)

    export MAVEN_OPTS="-Xss64m -Xmx2g -XX:ReservedCodeCacheSize=1g"
    ./dev/make-distribution.sh --name stuning-spark --pip --tgz -Dhadoop.version=3.3.0 -Phive -Phive-thriftserver -Pyarn    
    # to build for local use only, run this instead:
    # ./build/mvn -DskipTests clean package
    
  4. find the customized release at ./spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz
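
The tarball name in step 4 follows the pattern spark-<version>-bin-<name>.tgz, where <name> is the value passed to --name. Below is a minimal sketch for locating and sanity-checking the artifact before copying it anywhere. The helper names dist_tarball and check_dist are hypothetical and are not part of the Spark build scripts.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; not part of Spark's dev/ scripts.

# Compose the tarball name that make-distribution.sh --tgz writes to the source root.
dist_tarball() {
  local version="$1" name="$2"
  echo "spark-${version}-bin-${name}.tgz"
}

# Return 0 if the tarball exists and contains bin/spark-submit, non-zero otherwise.
check_dist() {
  local tarball="$1"
  [ -f "$tarball" ] || { echo "missing: $tarball" >&2; return 1; }
  tar -tzf "$tarball" | grep -q 'bin/spark-submit$'
}

# Usage (from the Spark source root):
#   check_dist "$(dist_tarball 3.5.0-SNAPSHOT stuning-spark)" && echo "distribution looks complete"
```

Checking for bin/spark-submit is only a cheap smoke test; it does not prove that the Hive or YARN profiles were built in.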

Deploying

An example of deploying the customized Spark release to a cluster of 6 nodes, each with

  • 2 x Intel(R) Xeon(R) Gold 6130 CPU @ 2.10GHz
  • 32 cores, 754G memory
  • CentOS Linux 7 (Core)
  • JDK 8 “1.8.0_211”

Steps:

  1. put the customized release under the same directory on all 6 nodes

    # on each worker node: copy the tarball from node1 and unpack it in ~/chenghao
    scp hex1@node1:~/chenghao/spark/spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz ~/chenghao
    tar -xvzf ~/chenghao/spark-3.5.0-SNAPSHOT-bin-stuning-spark.tgz -C ~/chenghao
    
    # optional: symlink the Spark directory to ~/spark
    ln -s ~/chenghao/spark-3.5.0-SNAPSHOT-bin-stuning-spark ~/spark
    
  2. set up the common configurations (more details can be found on the internet)

  3. start the cluster

    ~/spark/sbin/start-history-server.sh
    # to enable hive-metastore
    nohup hive --service metastore &
    bash ~/spark/sbin/start-thriftserver.sh
    
  4. done.
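
Step 2 leaves the configuration open. As one illustration, the history server started in step 3 only shows applications whose event logs it can read, so applications must write event logs (spark.eventLog.enabled, spark.eventLog.dir) to the directory the server reads from (spark.history.fs.logDirectory). The sketch below appends those three properties to a spark-defaults.conf file. It is not the configuration used on the cluster above: the helper name write_history_conf and the log directory in the usage line are assumptions.

```shell
#!/usr/bin/env bash
# Hypothetical helper for illustration; the property names are standard Spark settings,
# but the values are placeholders to adapt to your own cluster.

# Append history-server-related settings to the given spark-defaults.conf file.
#   $1: path to the conf file
#   $2: event log directory (must already exist and be readable by the history server)
write_history_conf() {
  local conf_file="$1" log_dir="$2"
  cat >> "$conf_file" <<EOF
spark.eventLog.enabled           true
spark.eventLog.dir               ${log_dir}
spark.history.fs.logDirectory    ${log_dir}
EOF
}

# Usage (assumed HDFS path):
#   write_history_conf ~/spark/conf/spark-defaults.conf hdfs:///spark-logs
```

The helper appends rather than overwrites, so existing settings in the file are kept; run it once per node, or write the file once and copy it to the other nodes.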

Chenghao Lyu
Ph.D. Candidate

My research interests include big data analytics systems, machine learning, and multi-objective optimization.