Spark Dev (1) - debugging a Spark program on an M1/M2 chip

When I first switched to an M2-chip MacBook Air, I ran into trouble downloading JDK 8 from oracle.com. As a result, my plan to set up Spark locally was put on hold for a while. However, needing to re-compile Spark (3.2.1) for my research, I recently picked up where I left off and finally completed the setup.

Setup

  1. JDK 8 (Zulu Community 8)
    $ java -version
    openjdk version "1.8.0_362"
    OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09)
    OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)
    
  2. Scala 2.12
    $ scala -version
    Scala code runner version 2.12.17 -- Copyright 2002-2022, LAMP/EPFL and Lightbend, Inc. 
    
  3. SBT version: 1.5.7
# make sure the sbt version is set in `./project/build.properties`
    sbt.version=1.5.7
    

Get hands dirty

I have put an example at https://github.com/Angryrou/spark-starter. Here are the key steps:

  1. create the build file build.sbt to specify the dependencies (you can add more entries to libraryDependencies; see the sketch after these steps)

    name := "Spark Starter"
    version := "1.0"
    scalaVersion := "2.12.17"
    libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"
    
  2. create the package directory src/main/scala/debug

    mkdir -p src/main/scala/debug
    
  3. add an example Scala object at src/main/scala/debug/SimpleApp.scala

    import org.apache.spark.sql.SparkSession

    object SimpleApp {
      def main(args: Array[String]): Unit = {
        val logFile = "README.md" // Should be some file on your system
        val spark = try {
          // works when the master is provided externally, e.g., via spark-submit
          SparkSession
            .builder()
            .appName("Simple Application")
            .getOrCreate()
        } catch {
          case _: Exception =>
            // fallback for running directly in the IDE: hardcode a local master
            SparkSession
              .builder()
              .appName("Simple Application")
              .config("spark.master", "local[2]")
              .getOrCreate()
        }
        val logData = spark.read.textFile(logFile).cache()
        val numAs = logData.filter(line => line.contains("a")).count()
        val numBs = logData.filter(line => line.contains("b")).count()
        println(s"Lines with a: $numAs, Lines with b: $numBs")
        spark.stop()
      }
    }
    
  4. build the package from the command line

    sbt package
    
  5. IntelliJ (IJ): run SimpleApp directly from its main entry point.
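
As promised in step 1, here is a sketch of how one might extend libraryDependencies. The extra spark-mllib artifact and the "provided" scope are assumptions for illustration (they are not part of the spark-starter repo); "provided" only makes sense when the jar will be submitted to a cluster that already ships Spark:

    // build.sbt -- a sketch, assuming the same Spark 3.2.1 / Scala 2.12 versions as above
    libraryDependencies ++= Seq(
      "org.apache.spark" %% "spark-sql"   % "3.2.1" % "provided",  // the cluster supplies Spark at runtime
      "org.apache.spark" %% "spark-mllib" % "3.2.1" % "provided"   // further Spark modules can be added the same way
    )

Note that "provided" dependencies are not on the runtime classpath when you run the main class directly, so for pure local debugging the plain compile scope from step 1 is simpler.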

Takeaways

  1. It is not necessary to deploy Spark locally to debug your Spark program. You can run it in IJ by hardcoding spark.master (e.g., to local[4]) in the program.

  2. To run the program on a server, you further need to package your project (e.g., sbt package) after removing the hardcoded master from the SparkSession creation (or use the try-catch above to automate the switch); a minimal sketch of the un-hardcoded version is shown below.
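
Here is a minimal Scala sketch of the un-hardcoded session creation (it is simply the SimpleApp above with the local fallback dropped); on a server, the master is supplied by spark-submit --master ... rather than by the program:

    import org.apache.spark.sql.SparkSession

    // sketch: no master is hardcoded; spark-submit / the cluster manager provides it
    val spark = SparkSession
      .builder()
      .appName("Simple Application")
      .getOrCreate()

    // for local debugging in IJ, the only change is to set a master explicitly, e.g.:
    //   .master("local[2]")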
