Spark Setup (2) - local deployment for Spark, Hadoop (Yarn), Hive
The process of writing a Spark program and deploying it to a real cluster includes (1) debugging the Spark program locally; (2) packaging the locally verified code into a jar file; (3) deploying or uploading the jar file to the remote Spark cluster; and (4) going back to step one whenever bugs show up on the remote cluster.
It is non-trivial due to the gap between the local environment and a real cluster, summarized in the following table.
| Environment  | Cluster Manager           | Associated Software                                                 |
|------------- |-------------------------- |-------------------------------------------------------------------- |
| Local        | Local mode                | Single-node software                                                 |
| Real Cluster | Yarn-cluster, Yarn-client | Software deployed in a cluster, such as Hadoop (Yarn, HDFS), Hive    |
This blog addresses the "Associated Software" column by setting up local versions of those distributed systems: a local Hadoop, a local Hive, and a local Spark.
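To make the gap concrete, the sketch below shows how the same application would be submitted in local mode versus on a YARN cluster; the jar name and main class are placeholders rather than part of this setup.

```bash
# Hypothetical jar and main class, shown only to illustrate the two deployment targets.
# Local debugging: driver and executors all run inside one JVM on the laptop.
spark-submit --master "local[*]" --class com.example.MyApp my-app.jar

# Real cluster: the same jar is handed to YARN (yarn-cluster / yarn-client in the older syntax).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
```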
Steps
Setup Hadoop (local)
Install Hadoop 3.3.0
- from a package manager (`brew` or `sdkman`), e.g., `sdk install hadoop 3.3.0`
- from an official release - see an example here
Setup configurations
Under Hadoop's configuration directory (following the `sdk install` example above):
- locate Hadoop's configuration directory. In my example,

```bash
export HADOOP_HOME=/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
```
- set JDK 8 in `$HADOOP_CONF_DIR/hadoop-env.sh`

```bash
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
# or
export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home
```
- configure `$HADOOP_CONF_DIR/hdfs-site.xml`

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/namesecondary</value>
  </property>
</configuration>
```
- configure `$HADOOP_CONF_DIR/core-site.xml` for HDFS

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
```
- (optional if you only run Spark) configure `$HADOOP_CONF_DIR/mapred-site.xml` for MapReduce

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
```
- verify that `ssh localhost` works. Otherwise, enable Remote Login and fix the `ssh-key` issue as in this blog (a sketch of the standard key setup follows this list).
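For reference, the passwordless-ssh setup usually amounts to the following; this is a minimal sketch of the standard steps, not a substitute for the linked blog.

```bash
# generate a key pair without a passphrase (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# authorize the public key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# should now log in without a password prompt
ssh localhost
```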
Run Hadoop
- Format NameNode

```bash
# double check hdfs is in the environment path; otherwise cd to $HADOOP_HOME/bin
which hdfs
# /Users/chenghao/.sdkman/candidates/hadoop/current/bin/hdfs
hdfs namenode -format
```
- Start/Stop Hadoop

```bash
cd $HADOOP_HOME
./sbin/start-dfs.sh  # to start hadoop
./sbin/stop-dfs.sh   # to stop hadoop
```
- double check that the Hadoop UI is up at http://localhost:9870
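As a quick sanity check, and to pre-create the event-log directory that the Spark configuration below will point at, the following commands should work once HDFS is up (the path mirrors the `spark.eventLog.dir` used later):

```bash
# one live datanode is expected in this single-node setup
hdfs dfsadmin -report
# create the directory that spark.eventLog.dir / spark.history.fs.logDirectory will use
hdfs dfs -mkdir -p /user/spark/eventlog
hdfs dfs -ls /user/spark
```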
Setup Hive (local) to work with Spark (Hive metastore with MySQL)
Install MySQL 8
- Install MySQL 8.0.33

```bash
brew install mysql@8.0
brew services restart mysql
# test mysql login
mysql -u root  # should be able to get into the mysql shell without a password
```
- Assign a password to the root account (in the mysql shell)

```sql
ALTER USER 'root'@'localhost' IDENTIFIED BY 'Root1234!';
```
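To confirm the new credentials before wiring them into Hive, a quick check (using the password set above):

```bash
# should print the server version once the password change takes effect
mysql -u root -p'Root1234!' -e "SELECT VERSION();"
```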
Download hive 3.1.3
- with Intel chips

```bash
brew install hive
```
- with Apple Silicon (M1/M2) chips, we have to install from scratch (`brew install hive` does not support M1/M2 yet)

```bash
cd /Users/chenghao/ResearchHub/softwares
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar zxvf apache-hive-3.1.3-bin.tar.gz
sudo ln -s $PWD/apache-hive-3.1.3-bin /usr/local/bin/hive  # add to the global environment
```
Configure hive
- add Hive environment variables in `~/.zshrc`, then `source ~/.zshrc`

```bash
export HIVE_HOME=/usr/local/bin/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
```
- prepare the `hive-site.xml`

```bash
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
```
- configure the Hive metastore in `hive-site.xml` with MySQL as the connector

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/dreamlab?allowPublicKeyRetrieval=true&amp;useSSL=false&amp;createDatabaseIfNotExist=true&amp;characterEncoding=utf-8&amp;useJDBCCompliantTimezoneShift=true&amp;useLegacyDatetimeCode=false&amp;serverTimezone=Europe/Paris</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Root1234!</value>
</property>
```
- add `mysql-connector-j-8.0.33.tar.gz` to `$HIVE_HOME/lib`

```bash
cd ~/Downloads
wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
tar zxvf mysql-connector-j-8.0.33.tar.gz
cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar $HIVE_HOME/lib
```
Run hive
- Initialize the metastore schema with MySQL (one time)

```bash
cd $HIVE_HOME
bin/schematool -initSchema -dbType mysql
```
- potential error 1: check this blog for a fix (a sketch of the commonly used fix also follows this list)

```
$ bin/schematool -initSchema -dbType mysql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)
```
- potential error 2: delete the illegal character at the reported position (row 3216, col 96) in `$HIVE_HOME/conf/hive-site.xml` (a one-liner for stripping it also follows this list)

```
$ bin/schematool -initSchema -dbType mysql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8 at [row,col,system-id]: [3216,96,"file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/conf/hive-site.xml"]
```
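For reference, here is a sketch of fixes for the two errors above; the exact Guava jar versions are assumptions based on the stock Hive 3.1.3 and Hadoop 3.3.0 distributions, so double check the file names under your own `lib` directories before removing anything.

```bash
# potential error 1 (Guava NoSuchMethodError): Hive 3.1.3 bundles an older Guava than
# Hadoop 3.3.0; replace Hive's copy with Hadoop's (jar versions may differ on your install).
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

# potential error 2 (illegal character entity): the stock template usually contains a
# literal "&#8;" reference in a description near that position; strip it in place
# (a .bak copy of hive-site.xml is kept).
perl -i.bak -pe 's/&#8;//g' $HIVE_HOME/conf/hive-site.xml
```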
Enable the Hive Metastore with Spark
```bash
cd $HIVE_HOME
nohup hive --service metastore &
```
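A quick way to confirm the metastore service came up (it listens on port 9083 by default):

```bash
# the standalone metastore listens on 9083 unless hive.metastore.uris says otherwise
lsof -i :9083
# watch the service log for errors (written to nohup.out in $HIVE_HOME)
tail -f nohup.out
```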
Setup Spark (local)
Download and deploy
Download a pre-built distribution or build Spark from the source code
Setup configurations
- locate Spark's configuration directory. In my example,

```bash
export SPARK_HOME=/Users/chenghao/ResearchHub/repos/spark/dist
export SPARK_HOME_CONF=$SPARK_HOME/conf
```
- configure `$SPARK_HOME_CONF/spark-env.sh`

```bash
SPARK_LOCAL_DIRS=/Users/chenghao/ResearchHub/softwares/spark_local
SPARK_CONF_DIR=${SPARK_HOME}/conf
SPARK_LOG_DIR=${SPARK_HOME}/logs
SPARK_PID_DIR=/tmp/hex_pids
```
- configure `$SPARK_HOME_CONF/spark-defaults.conf`

```properties
spark.eventLog.dir=hdfs://localhost:8020/user/spark/eventlog
spark.eventLog.enabled=true
spark.history.fs.logDirectory=hdfs://localhost:8020/user/spark/eventlog
spark.history.ui.port=18088
spark.yarn.historyServer.address=http://localhost:18088
# optional below
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m
spark.sql.crossJoin.enabled=true
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false
spark.sql.adaptive.enabled=true
spark.sql.cbo.enabled=true
spark.sql.cbo.joinReorder.dp.star.filter=true
spark.sql.cbo.joinReorder.enabled=true
spark.sql.cbo.planStats.enabled=true
spark.sql.cbo.starSchemaDetection=true
spark.sql.statistics.histogram.enabled=true
spark.locality.wait=0s
```
- add the Hadoop/Hive configurations to Spark

```bash
cd $SPARK_HOME/conf
ln -s $HIVE_HOME/conf/hive-site.xml .
ln -s $HADOOP_HOME/etc/hadoop/hdfs-site.xml .
ln -s $HADOOP_HOME/etc/hadoop/core-site.xml .
```
Run Spark server
```bash
cd $SPARK_HOME
bash sbin/start-master.sh
bash sbin/start-history-server.sh
bash sbin/start-thriftserver.sh
```
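Finally, a quick end-to-end check that Spark sees HDFS and the Hive metastore; this sketch assumes the Thrift server runs on its default port 10000.

```bash
# Spark SQL should connect to the Hive metastore configured above
$SPARK_HOME/bin/spark-sql -e "SHOW DATABASES;"

# the thrift server started above can be reached with beeline (default port 10000)
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"

# the history server UI from spark-defaults.conf should be at http://localhost:18088
```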