How to remote debug Spark via IntelliJ?

To debug a Spark program running on a remote cluster, set things up as follows.

Architecture

  • frontal node (accessible with a public IP)
  • Spark cluster (only accessible from the frontal node)
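
Because the cluster nodes are reachable only from the frontal node, SSH access to the master typically jumps through it. A minimal sketch using SSH's jump-host option (the user and host names below are placeholders, not from the original setup):

```shell
# Log in to the Spark master by jumping through the frontal node.
# "frontal.example.com" and "spark-master" are placeholder host names.
ssh -J user@frontal.example.com user@spark-master
```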

Prepare

Package the application into a jar (e.g. by running sbt package).

Steps

  1. Upload the packaged jar file to the master node of the remote Spark cluster.

  2. On the master node of the remote cluster, set SPARK_SUBMIT_OPTS and submit the application.

    export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
    
    # run TPCH Q18 with specified Spark configurations
    spath=/opt/hex_users/$USER/chenghao/spark-stage-tuning
    jpath=/opt/hex_users/$USER/spark-3.2.1-hadoop3.3.0/jdk1.8
    lpath=/opt/hex_users/$USER/chenghao/spark-stage-tuning/src/main/resources/log4j2.properties
    name=TPCH100_q18-1_8,5,4,2,2,False,True,60,2,2,1,True
    
    ~/spark/bin/spark-submit \
    --class edu.polytechnique.cedar.spark.sql.RunTemplateQuery \
    --name "$name" \
    --master yarn \
    --deploy-mode client \
    --conf spark.executorEnv.JAVA_HOME=${jpath} \
    --conf spark.yarn.appMasterEnv.JAVA_HOME=${jpath} \
    --conf spark.executor.memory=16g \
    --conf spark.executor.cores=5 \
    --conf spark.executor.instances=4 \
    --conf spark.default.parallelism=40 \
    --conf spark.reducer.maxSizeInFlight=48m \
    --conf spark.shuffle.sort.bypassMergeThreshold=200 \
    --conf spark.shuffle.compress=true \
    --conf spark.memory.fraction=0.6 \
    --conf spark.sql.inMemoryColumnarStorage.batchSize=10000 \
    --conf spark.sql.files.maxPartitionBytes=128MB \
    --conf spark.sql.autoBroadcastJoinThreshold=10MB \
    --conf spark.sql.shuffle.partitions=200 \
    --conf spark.yarn.am.cores=5 \
    --conf spark.yarn.am.memory=16g \
    --conf spark.sql.adaptive.enabled=true \
    --conf spark.sql.parquet.compression.codec=snappy \
    --conf spark.sql.broadcastTimeout=10000 \
    --conf spark.rpc.askTimeout=12000 \
    --conf spark.shuffle.io.retryWait=60 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryoserializer.buffer.max=512m \
    --driver-java-options "-Dlog4j.configuration=file:$lpath" \
    --conf "spark.driver.extraJavaOptions=-Xms20g" \
    --conf "spark.executor.extraJavaOptions=-Dlog4j.configuration=file:log4j2.properties" \
    --files "$lpath" \
    --jars ~/spark/examples/jars/scopt_2.12-3.7.1.jar \
    $spath/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
    -b TPCH -t 18 -q 1 -s 100 -l resources/tpch-kit/spark-sqls  
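
A note on the SPARK_SUBMIT_OPTS string used above: server=y makes the JVM listen for a debugger, suspend=y pauses execution until one attaches, and address=5005 is the listening port. A small shell sketch extracting the port from the options string:

```shell
# The JDWP agent string from step 2; the port is whatever follows "address=".
SPARK_SUBMIT_OPTS="-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005"
port="${SPARK_SUBMIT_OPTS##*address=}"
echo "$port"   # prints 5005
```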
    
    
  3. The program on the cluster will then wait until a debugger attaches, as shown below:

    Listening for transport dt_socket at address: 5005
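
To attach from IntelliJ, first forward the JDWP port from the local machine to the master node through the frontal node (host names below are placeholders), then create a Run/Debug configuration of type "Remote JVM Debug" with host localhost and port 5005 and start it; the waiting spark-submit resumes as soon as the debugger attaches.

```shell
# Forward local port 5005 to port 5005 on the Spark master, via the frontal node.
# "spark-master" and "frontal.example.com" are placeholder host names.
ssh -N -L 5005:spark-master:5005 user@frontal.example.com
```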
    

Local debug

To reproduce and debug locally, package the jar and run spark-submit in local mode with the same debug options:

    sbt package
    export SPARK_SUBMIT_OPTS=-agentlib:jdwp=transport=dt_socket,server=y,suspend=y,address=5005
    /Users/chenghao/ResearchHub/repos/spark/dist/bin/spark-submit \
    --master "local[*]" \
    --deploy-mode client \
    --class edu.polytechnique.cedar.spark.sql.RunTemplateQuery \
    --name 20-1 \
    --conf spark.default.parallelism=40 \
    --conf spark.reducer.maxSizeInFlight=48m \
    --conf spark.shuffle.sort.bypassMergeThreshold=200 \
    --conf spark.shuffle.compress=true \
    --conf spark.memory.fraction=0.6 \
    --conf spark.sql.inMemoryColumnarStorage.batchSize=10000 \
    --conf spark.sql.files.maxPartitionBytes=128MB \
    --conf spark.sql.autoBroadcastJoinThreshold=10MB \
    --conf spark.sql.shuffle.partitions=200 \
    --conf spark.sql.adaptive.enabled=true \
    --conf spark.sql.parquet.compression.codec=snappy \
    --conf spark.sql.broadcastTimeout=10000 \
    --conf spark.rpc.askTimeout=12000 \
    --conf spark.shuffle.io.retryWait=60 \
    --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
    --conf spark.kryoserializer.buffer.max=512m \
    --conf spark.driver.extraClassPath=file:///Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/mysql-connector-j-8.0.33.jar \
    --jars file:///Users/chenghao/ResearchHub/repos/spark/dist/examples/jars/scopt_2.12-3.7.1.jar \
    /Users/chenghao/ResearchHub/repos/spark-stage-tuning/target/scala-2.12/spark-stage-tuning_2.12-1.0-SNAPSHOT.jar \
    -b TPCH -t 20 -q 1 -s 1 -l /Users/chenghao/ResearchHub/repos/UDAO2022/resources/tpch-kit/spark-sqls -d false
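
In local mode no SSH tunnel is needed: the same IntelliJ "Remote JVM Debug" configuration can attach to localhost:5005 directly. As a quick command-line alternative, jdb (shipped with the JDK) can attach to the same port; a sketch assuming the spark-submit above is still waiting for a debugger:

```shell
# Attach the JDK's command-line debugger to the JVM listening on port 5005.
# Requires a JVM started with the JDWP options above, still waiting to attach.
jdb -attach localhost:5005
```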

Chenghao Lyu
Ph.D. Candidate

My research interests include big data analytics systems, machine learning, and multi-objective optimization.