How to remote debug Spark via IntelliJ?

To debug a Spark program running on a remote cluster, set things up as follows.

Architecture

  • frontal node (accessible with a public IP)
  • Spark cluster (only accessible from the frontal node)
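
Because the cluster nodes are reachable only from the frontal node, SSH access to the master typically jumps through it. A minimal sketch using SSH's jump-host option (the user and host names below are placeholders, not from the original setup):

```shell
# Log in to the Spark master by jumping through the frontal node.
# "frontal.example.com" and "spark-master" are placeholder host names.
ssh -J user@frontal.example.com user@spark-master
```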

Prepare

Package the application into a jar (e.g. by running sbt package).

Steps

  1. Upload the packaged jar file to the master node of the remote Spark cluster.

  2. On the master node of the remote cluster, set SPARK_SUBMIT_OPTS and submit the application.

    export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
    
    # run TPCH Q18 with specified Spark configurations
    spath=/opt/hex_users/$USER/chenghao/spark-stage-tuning
    jpath=/opt/hex_users/$USER/spark-3.2.1-hadoop3.3.0/jdk1.8
    lpath=/opt/hex_users/$USER/chenghao/spark-stage-tuning/src/main/resources/log4j2.properties
    name=TPCH100_q18-1_8,5,4,2,2,False,True,60,2,2,1,True
    
    ~/spark/bin/spark-submit \
    --class edu.polytechnique.cedar.spark.sql.RunTemplateQuery \
    --name "$name" \
    --master yarn \
    --deploy-mode client \
    --conf spark.executorEnv.JAVA_HOME=${jpath} \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=${jpath} \
    --conf spark.executor.memory=16g \
    --conf spark.executor.cores=5 \
    --conf spark.executor.instances=4 \
    --conf spark.default.parallelism=40 \
    --conf spark.reducer.maxSizeInFlight=48m \
    --conf spark.shuffle.sort.bypassMergeThreshold=200 \
    --conf spark.shuffle.compress=true \
    --conf spark.memory.fraction=0.6 \
    --conf spark.sql.inMemoryColumnarStorage.batchSize=10000 \
    --conf spark.sql.files.maxPartitionBytes=128MB \
    --conf spark.sql.autoBroadcastJoinThreshold=10MB \
    --conf spark.sql.shuffle.partitions=200 \
    --conf spark.yarn.am.cores=5 \
    --conf spark.yarn.am.memory=16g \
    --conf spark.sql.adaptive.enabled=true \
    --conf spark.sql.parquet.compression.codec=snappy \
    --conf spark.sql.broadcastTimeout=10000 \
    --conf spark.rpc.askTimeout=12000 \
    --conf spark.shuffle.io.retryWait=60 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryoserializer.buffer.max=512m \
    --driver-java-options "-Dlog4j.configuration=file:$lpath" \
    --conf "spark.driver.extraJavaOptions=-Xms20g" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j2.properties" \
    --files "$lpath" \
    --jars ~/spark/examples/jars/scopt_2.12-3.7.1.jar \
    $spath/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
    -b TPCH -t 18 -q 1 -s 100 -l resources/tpch-kit/spark-sqls  
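
A note on the SPARK_SUBMIT_OPTS string used above: server=y makes the JVM listen for a debugger, suspend=y pauses execution until one attaches, and address=5005 is the listening port. A small shell sketch extracting the port from the options string:

```shell
# The JDWP agent string from step 2; the port is whatever follows "address=".
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
port="${SPARK_SUBMIT_OPTS##*address=}"
echo "$port"   # prints 5005
```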
    
    
  3. The program on the cluster will then wait until a debugger attaches, as shown below:

    Listening for transport dt_socket at address: 5005
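
To attach from IntelliJ, first forward the JDWP port from the local machine to the master node through the frontal node (host names below are placeholders), then create a Run/Debug configuration of type "Remote JVM Debug" with host localhost and port 5005 and start it; the waiting spark-submit resumes as soon as the debugger attaches.

```shell
# Forward local port 5005 to port 5005 on the Spark master, via the frontal node.
# "spark-master" and "frontal.example.com" are placeholder host names.
ssh -N -L 5005:spark-master:5005 user@frontal.example.com
```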
    

Local debug

To reproduce and debug locally, package the jar and run spark-submit in local mode with the same debug options:

    sbt package
    export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
    /Users/chenghao/ResearchHub/repos/spark/dist/bin/spark-submit \
    --master "local[*]" \
    --deploy-mode client \
    --class edu.polytechnique.cedar.spark.sql.RunTemplateQuery \
    --name 20-1 \
    --conf spark.default.parallelism=40 \
    --conf spark.reducer.maxSizeInFlight=48m \
    --conf spark.shuffle.sort.bypassMergeThreshold=200 \
    --conf spark.shuffle.compress=true \
    --conf spark.memory.fraction=0.6 \
    --conf spark.sql.inMemoryColumnarStorage.batchSize=10000 \
    --conf spark.sql.files.maxPartitionBytes=128MB \
    --conf spark.sql.autoBroadcastJoinThreshold=10MB \
    --conf spark.sql.shuffle.partitions=200 \
    --conf spark.sql.adaptive.enabled=true \
    --conf spark.sql.parquet.compression.codec=snappy \
    --conf spark.sql.broadcastTimeout=10000 \
    --conf spark.rpc.askTimeout=12000 \
    --conf spark.shuffle.io.retryWait=60 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryoserializer.buffer.max=512m \
    --conf spark.driver.extraClassPath=file:///Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/mysql-connector-j-8.0.33.jar \
    --jars file:///Users/chenghao/ResearchHub/repos/spark/dist/examples/jars/scopt_2.12-3.7.1.jar \
    /Users/chenghao/ResearchHub/repos/spark-stage-tuning/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
    -b TPCH -t 20 -q 1 -s 1 -l /Users/chenghao/ResearchHub/repos/UDAO2022/resources/tpch-kit/spark-sqls -d false
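
In local mode no SSH tunnel is needed: the same IntelliJ "Remote JVM Debug" configuration can attach to localhost:5005 directly. As a quick command-line alternative, jdb (shipped with the JDK) can attach to the same port; a sketch assuming the spark-submit above is still waiting for a debugger:

```shell
# Attach the JDK's command-line debugger to the JVM listening on port 5005.
# Requires a JVM started with the JDWP options above, still waiting to attach.
jdb -attach localhost:5005
```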

Chenghao Lyu
Ph.D. Candidate

My research interests include big data analytics systems, machine learning, and multi-objective optimization.