Spark Dev (1) - debug Spark program on M1/M2 chip
When I first switched to an M2-chip MacBook Air, I ran into trouble downloading JDK 8 from oracle.com, so my plan to set up Spark locally was put on hold for a while. However, needing to re-compile Spark (3.2.1) for my research, I recently picked up where I left off and was finally able to complete the setup.
Setup
- JDK 8 (Zulu Community 8)

  ```
  $ java -version
  openjdk version "1.8.0_362"
  OpenJDK Runtime Environment (Zulu 8.68.0.21-CA-macos-aarch64) (build 1.8.0_362-b09)
  OpenJDK 64-Bit Server VM (Zulu 8.68.0.21-CA-macos-aarch64) (build 25.362-b09, mixed mode)
  ```

- Scala 2.12

  ```
  $ scala -version
  Scala code runner version 2.12.17 -- Copyright 2002-2022, LAMP/EPFL and Lightbend, Inc.
  ```
- SBT 1.5.7

  ```
  # make sure to pin the SBT version in ./project/build.properties
  sbt.version=1.5.7
  ```
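If you still need to install these tools on Apple Silicon, SDKMAN! is one convenient option. Below is a minimal sketch under that assumption; the exact version identifiers change over time, so confirm them with `sdk list java`, `sdk list scala`, and `sdk list sbt` first.

```bash
# SDKMAN! installation: https://sdkman.io
# The identifiers below are illustrative; check `sdk list <candidate>` for current ones.
sdk install java 8.0.362-zulu   # a Zulu JDK 8 build with a macOS aarch64 binary
sdk install scala 2.12.17
sdk install sbt 1.5.7
```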
Getting your hands dirty
I have put a full example at https://github.com/Angryrou/spark-starter. Here are the key steps:
- Create the build file `build.sbt` to specify the dependencies (you can add more entries to `libraryDependencies`; see the sketch after this list):

  ```scala
  name := "Spark Starter"

  version := "1.0"

  scalaVersion := "2.12.17"

  libraryDependencies += "org.apache.spark" %% "spark-sql" % "3.2.1"
  ```
- Create a package `debug`, i.e., the directory `src/main/scala/debug`:

  ```bash
  mkdir -p src/main/scala/debug
  ```
- Add an example Scala script at `src/main/scala/debug/SimpleApp.scala`:

  ```scala
  package debug

  import org.apache.spark.sql.SparkSession

  object SimpleApp {
    def main(args: Array[String]): Unit = {
      val logFile = "README.md" // should be some file on your system
      val spark =
        try {
          // works when the master is provided externally, e.g., via spark-submit
          SparkSession
            .builder()
            .appName("Simple Application")
            .getOrCreate()
        } catch {
          case _: Exception =>
            // fall back to a local master, e.g., when running directly from IntelliJ
            SparkSession
              .builder()
              .appName("Simple Application")
              .config("spark.master", "local[2]")
              .getOrCreate()
        }
      val logData = spark.read.textFile(logFile).cache()
      val numAs = logData.filter(line => line.contains("a")).count()
      val numBs = logData.filter(line => line.contains("b")).count()
      println(s"Lines with a: $numAs, Lines with b: $numBs")
      spark.stop()
    }
  }
  ```
- Build the package from the command line:

  ```bash
  sbt package
  ```
- Alternatively, in IntelliJ IDEA (IJ), run the class directly from its main entry point.
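As an illustration of extending `libraryDependencies` (mentioned in the `build.sbt` step above), the sketch below also pulls in Spark MLlib; MLlib is only an example here, not something the starter project needs.

```scala
// build.sbt -- sketch with an additional, purely illustrative dependency
name := "Spark Starter"

version := "1.0"

scalaVersion := "2.12.17"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"   % "3.2.1",
  "org.apache.spark" %% "spark-mllib" % "3.2.1" // only if your program uses MLlib
)
```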
Takeaways
- It is not necessary to deploy Spark locally to debug your Spark program. You can run it in IJ by hardcoding `spark.master` as `local[4]` in the program.
- To run the program on a server, you further need to package your project (e.g., `sbt package`) after removing the hardcoded master used to create the SparkSession (or use the try-catch above to automate the switch). A submission sketch follows below.
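For reference, here is a minimal spark-submit sketch for the packaged jar. The jar path assumes sbt's default naming for this project (name "Spark Starter", version 1.0, Scala 2.12), the `--class` value assumes SimpleApp is declared in the `debug` package as above, and the master URL is a placeholder for your cluster.

```bash
# build the jar; check target/scala-2.12/ for the exact file name
sbt package

# submit to the cluster; replace <master-host> (or use --master yarn) to match your setup
spark-submit \
  --class debug.SimpleApp \
  --master spark://<master-host>:7077 \
  target/scala-2.12/spark-starter_2.12-1.0.jar
```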