Spark Setup (2) - local deployment for Spark, Hadoop (Yarn), Hive
The process of writing a Spark program and deploying it to a real cluster includes (1) debugging the Spark program locally; (2) packaging the locally verified code into a jar file; (3) deploying or uploading the jar file to the remote Spark cluster; and (4) going back to step one whenever bugs show up on the remote cluster.
It is non-trivial due to the gap between the local environment and a real cluster, summarized in the following table.
| Environment  | Cluster Manager           | Associated Software                                                 |
|------------- |-------------------------- |-------------------------------------------------------------------- |
| Local        | Local mode                | Single-node software                                                 |
| Real Cluster | Yarn-cluster, Yarn-client | Software deployed in a cluster, such as Hadoop (Yarn, HDFS), Hive    |
This blog addresses the "Associated Software" column by setting up local versions of those distributed systems: a local Hadoop, a local Hive, and a local Spark.
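To make the gap concrete, the sketch below shows how the same application would be submitted in local mode versus on a YARN cluster; the jar name and main class are placeholders rather than part of this setup.

```bash
# Hypothetical jar and main class, shown only to illustrate the two deployment targets.
# Local debugging: driver and executors all run inside one JVM on the laptop.
spark-submit --master "local[*]" --class com.example.MyApp my-app.jar

# Real cluster: the same jar is handed to YARN (yarn-cluster / yarn-client in the older syntax).
spark-submit --master yarn --deploy-mode cluster --class com.example.MyApp my-app.jar
```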
Steps
Setup Hadoop (local)
Install Hadoop 3.3.0
- from a package manager (`brew` or `sdkman`), e.g., `sdk install hadoop 3.3.0`
- from an official release - see an example here
Setup configurations
Under Hadoop's configuration directory (following the `sdk install` example above):
- locate Hadoop's configuration directory. In my example,

```bash
export HADOOP_HOME=/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/
export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
```
- set JDK 8 in `$HADOOP_CONF_DIR/hadoop-env.sh`

```bash
export JAVA_HOME=`/usr/libexec/java_home -v 1.8`
# or
export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home
```
- configure `$HADOOP_CONF_DIR/hdfs-site.xml`

```xml
<configuration>
  <property>
    <name>dfs.replication</name>
    <value>1</value>
  </property>
  <property>
    <name>dfs.namenode.name.dir</name>
    <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/name</value>
  </property>
  <property>
    <name>dfs.datanode.data.dir</name>
    <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/data</value>
  </property>
  <property>
    <name>dfs.namenode.checkpoint.dir</name>
    <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/namesecondary</value>
  </property>
</configuration>
```
- configure `$HADOOP_CONF_DIR/core-site.xml` for HDFS

```xml
<configuration>
  <property>
    <name>fs.default.name</name>
    <value>hdfs://localhost:8020</value>
  </property>
</configuration>
```
- (optional if you only run Spark) configure `$HADOOP_CONF_DIR/mapred-site.xml` for MapReduce

```xml
<configuration>
  <property>
    <name>mapred.job.tracker</name>
    <value>localhost:8021</value>
  </property>
</configuration>
```
- verify that `ssh localhost` works. Otherwise, enable Remote Login and fix the `ssh-key` issue as in this blog (a sketch of the standard key setup follows this list).
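For reference, the passwordless-ssh setup usually amounts to the following; this is a minimal sketch of the standard steps, not a substitute for the linked blog.

```bash
# generate a key pair without a passphrase (skip if ~/.ssh/id_rsa already exists)
ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa
# authorize the public key for logins to this machine
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
chmod 600 ~/.ssh/authorized_keys
# should now log in without a password prompt
ssh localhost
```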
Run Hadoop
- Format NameNode

```bash
# double check hdfs is in the environment path; otherwise cd to $HADOOP_HOME/bin
which hdfs
# /Users/chenghao/.sdkman/candidates/hadoop/current/bin/hdfs
hdfs namenode -format
```
- Start/Stop Hadoop

```bash
cd $HADOOP_HOME
./sbin/start-dfs.sh  # to start hadoop
./sbin/stop-dfs.sh   # to stop hadoop
```
- double check that the Hadoop UI is up at http://localhost:9870
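As a quick sanity check, and to pre-create the event-log directory that the Spark configuration below will point at, the following commands should work once HDFS is up (the path mirrors the `spark.eventLog.dir` used later):

```bash
# one live datanode is expected in this single-node setup
hdfs dfsadmin -report
# create the directory that spark.eventLog.dir / spark.history.fs.logDirectory will use
hdfs dfs -mkdir -p /user/spark/eventlog
hdfs dfs -ls /user/spark
```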
Setup Hive (local) to work with Spark (Hive metastore with MySQL)
Install MySQL 8
- Install MySQL 8.0.33

```bash
brew install mysql@8.0
brew services restart mysql
# test mysql login
mysql -u root  # should be able to get into the mysql shell without a password
```
- Assign a password to the root account (in the mysql shell)

```sql
ALTER USER 'root'@'localhost' IDENTIFIED BY 'Root1234!';
```
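To confirm the new credentials before wiring them into Hive, a quick check (using the password set above):

```bash
# should print the server version once the password change takes effect
mysql -u root -p'Root1234!' -e "SELECT VERSION();"
```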
Download hive 3.1.3
- with Intel chips

```bash
brew install hive
```
- with Apple Silicon (M1/M2) chips, we have to install from scratch (`brew install hive` does not support M1/M2 yet)

```bash
cd /Users/chenghao/ResearchHub/softwares
wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
tar zxvf apache-hive-3.1.3-bin.tar.gz
sudo ln -s $PWD/apache-hive-3.1.3-bin /usr/local/bin/hive  # add to the global environment
```
Configure hive
- add Hive environment variables in `~/.zshrc`, then `source ~/.zshrc`

```bash
export HIVE_HOME=/usr/local/bin/hive
export PATH=$PATH:$HIVE_HOME/bin
export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
```
- prepare the `hive-site.xml`

```bash
cd $HIVE_HOME/conf
cp hive-default.xml.template hive-site.xml
```
- configure the Hive metastore in `hive-site.xml` with MySQL as the connector

```xml
<property>
  <name>javax.jdo.option.ConnectionURL</name>
  <value>jdbc:mysql://localhost:3306/dreamlab?allowPublicKeyRetrieval=true&amp;useSSL=false&amp;createDatabaseIfNotExist=true&amp;characterEncoding=utf-8&amp;useJDBCCompliantTimezoneShift=true&amp;useLegacyDatetimeCode=false&amp;serverTimezone=Europe/Paris</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionDriverName</name>
  <value>com.mysql.cj.jdbc.Driver</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionUserName</name>
  <value>root</value>
</property>
<property>
  <name>javax.jdo.option.ConnectionPassword</name>
  <value>Root1234!</value>
</property>
```
- add `mysql-connector-j-8.0.33.tar.gz` to `$HIVE_HOME/lib`

```bash
cd ~/Downloads
wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
tar zxvf mysql-connector-j-8.0.33.tar.gz
cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar $HIVE_HOME/lib
```
Run hive
- Initialize the metastore schema with MySQL (one time)

```bash
cd $HIVE_HOME
bin/schematool -initSchema -dbType mysql
```
- potential error 1: check this blog for a fix (a sketch of the commonly used fix also follows this list)

```
$ bin/schematool -initSchema -dbType mysql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)
```
- potential error 2: delete the illegal character at the reported position (row 3216, col 96) in `$HIVE_HOME/conf/hive-site.xml` (a one-liner for stripping it also follows this list)

```
$ bin/schematool -initSchema -dbType mysql
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
Exception in thread "main" java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8 at [row,col,system-id]: [3216,96,"file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/conf/hive-site.xml"]
```
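For reference, here is a sketch of fixes for the two errors above; the exact Guava jar versions are assumptions based on the stock Hive 3.1.3 and Hadoop 3.3.0 distributions, so double check the file names under your own `lib` directories before removing anything.

```bash
# potential error 1 (Guava NoSuchMethodError): Hive 3.1.3 bundles an older Guava than
# Hadoop 3.3.0; replace Hive's copy with Hadoop's (jar versions may differ on your install).
rm $HIVE_HOME/lib/guava-19.0.jar
cp $HADOOP_HOME/share/hadoop/hdfs/lib/guava-27.0-jre.jar $HIVE_HOME/lib/

# potential error 2 (illegal character entity): the stock template usually contains a
# literal "&#8;" reference in a description near that position; strip it in place
# (a .bak copy of hive-site.xml is kept).
perl -i.bak -pe 's/&#8;//g' $HIVE_HOME/conf/hive-site.xml
```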
Enable the Hive Metastore with Spark
```bash
cd $HIVE_HOME
nohup hive --service metastore &
```
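A quick way to confirm the metastore service came up (it listens on port 9083 by default):

```bash
# the standalone metastore listens on 9083 unless hive.metastore.uris says otherwise
lsof -i :9083
# watch the service log for errors (written to nohup.out in $HIVE_HOME)
tail -f nohup.out
```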
Setup Spark (local)
Download and deploy
Download a pre-built distribution or build Spark from the source code
Setup configurations
- locate Spark's configuration directory. In my example,

```bash
export SPARK_HOME=/Users/chenghao/ResearchHub/repos/spark/dist
export SPARK_HOME_CONF=$SPARK_HOME/conf
```
- configure `$SPARK_HOME_CONF/spark-env.sh`

```bash
SPARK_LOCAL_DIRS=/Users/chenghao/ResearchHub/softwares/spark_local
SPARK_CONF_DIR=${SPARK_HOME}/conf
SPARK_LOG_DIR=${SPARK_HOME}/logs
SPARK_PID_DIR=/tmp/hex_pids
```
- configure `$SPARK_HOME_CONF/spark-defaults.conf`

```properties
spark.eventLog.dir=hdfs://localhost:8020/user/spark/eventlog
spark.eventLog.enabled=true
spark.history.fs.logDirectory=hdfs://localhost:8020/user/spark/eventlog
spark.history.ui.port=18088
spark.yarn.historyServer.address=http://localhost:18088
# optional below
spark.serializer=org.apache.spark.serializer.KryoSerializer
spark.kryoserializer.buffer.max=512m
spark.sql.crossJoin.enabled=true
spark.dynamicAllocation.enabled=false
spark.shuffle.service.enabled=false
spark.sql.adaptive.enabled=true
spark.sql.cbo.enabled=true
spark.sql.cbo.joinReorder.dp.star.filter=true
spark.sql.cbo.joinReorder.enabled=true
spark.sql.cbo.planStats.enabled=true
spark.sql.cbo.starSchemaDetection=true
spark.sql.statistics.histogram.enabled=true
spark.locality.wait=0s
```
- add the Hadoop/Hive configurations to Spark

```bash
cd $SPARK_HOME/conf
ln -s $HIVE_HOME/conf/hive-site.xml .
ln -s $HADOOP_HOME/etc/hadoop/hdfs-site.xml .
ln -s $HADOOP_HOME/etc/hadoop/core-site.xml .
```
Run Spark server
```bash
cd $SPARK_HOME
bash sbin/start-master.sh
bash sbin/start-history-server.sh
bash sbin/start-thriftserver.sh
```
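Finally, a quick end-to-end check that Spark sees HDFS and the Hive metastore; this sketch assumes the Thrift server runs on its default port 10000.

```bash
# Spark SQL should connect to the Hive metastore configured above
$SPARK_HOME/bin/spark-sql -e "SHOW DATABASES;"

# the thrift server started above can be reached with beeline (default port 10000)
$SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"

# the history server UI from spark-defaults.conf should be at http://localhost:18088
```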