Spark Setup (2) - local deployment for Spark, Hadoop (YARN), Hive

The process of writing a Spark program and deploying it to a real cluster typically includes (1) debugging the Spark program locally; (2) packaging the locally verified code into a jar file; (3) deploying or uploading the jar file to the remote Spark cluster; and (4) going back to step one whenever bugs show up on the remote cluster.
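
For concreteness, a minimal sketch of steps (2) and (3), assuming an sbt-based Scala project and a YARN cluster reachable from the client machine (the class name and jar path below are hypothetical):

    # (2) package the locally verified code into a jar
    sbt package
    # (3) submit the jar to the remote cluster in yarn cluster mode
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --class com.example.MyApp \
      target/scala-2.12/myapp_2.12-0.1.jar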

This workflow is non-trivial because of the gap between the local environment and a real cluster, summarized in the following table.

|              | Cluster Manager           | Associated Software                                                |
|--------------|---------------------------|--------------------------------------------------------------------|
| Local        | Local mode                | Single-node software                                               |
| Real Cluster | yarn-cluster, yarn-client | Software deployed in a cluster, such as Hadoop (YARN, HDFS), Hive  |

This post focuses on the Associated Software column: it walks through setting up local versions of those distributed components, i.e., a local Hadoop, a local Hive, and a local Spark.

Steps

Set up Hadoop (local)

Install Hadoop 3.3.0

  • from a package manager (brew or sdkman), e.g., sdk install hadoop 3.3.0
  • from an official release - see an example here

Set up configurations

Under Hadoop's configuration directory (following the sdk install example above):

  1. locate Hadoop's configuration directory. In my example,
    export HADOOP_HOME=/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/
    export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop/
    
  2. set JDK 8 in $HADOOP_CONF_DIR/hadoop-env.sh
    export JAVA_HOME=`/usr/libexec/java_home -v 1.8` 
    # or 
    export JAVA_HOME=/Library/Java/JavaVirtualMachines/zulu-8.jdk/Contents/Home
    
  3. configure $HADOOP_CONF_DIR/hdfs-site.xml
    <configuration>
      <property>
        <name>dfs.replication</name>
        <value>1</value>
      </property>
      <property>
        <name>dfs.namenode.name.dir</name>
        <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/name</value>
      </property>
      <property>
        <name>dfs.datanode.data.dir</name>
        <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/data</value>
      </property>
      <property>
        <name>dfs.namenode.checkpoint.dir</name>
        <value>file:///Users/chenghao/Documents/spark3-tpc/gen-dataset/dfs/namesecondary</value>
      </property>
    </configuration>
    
  4. configure $HADOOP_CONF_DIR/core-site.xml for HDFS
    <configuration>
      <property>
        <name>fs.default.name</name>
        <value>hdfs://localhost:8020</value>
      </property>
    </configuration>
    
  5. (optional for running Spark) configure $HADOOP_CONF_DIR/mapred-site.xml for MapReduce
    <configuration>
      <property>
        <name>mapred.job.tracker</name>
        <value>localhost:8021</value>
      </property>
    </configuration>
    
  6. Verify that ssh localhost works. Otherwise, enable Remote Login (on macOS) and sort out the ssh-key issue as in this blog; a minimal key setup is sketched right below this list.
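
A minimal passwordless-ssh setup, assuming ~/.ssh/id_rsa does not already exist (otherwise reuse your existing key):

    ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa     # key with an empty passphrase
    cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys
    chmod 0600 ~/.ssh/authorized_keys
    ssh localhost                                # should log in without a password prompt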

Run Hadoop

  1. Format NameNode
    # double check hdfs is in the environment path; otherwise cd to $HADOOP_HOME/bin
    which hdfs
    # /Users/chenghao/.sdkman/candidates/hadoop/current/bin/hdfs
    hdfs namenode -format
    
  2. Start/Stop Hadoop
    cd $HADOOP_HOME
    ./sbin/start-dfs.sh # to start hadoop
    
    ./sbin/stop-dfs.sh # to stop hadoop
    
  3. double-check that the Hadoop NameNode UI is up at http://localhost:9870
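
Beyond the UI, a quick smoke test with the hdfs CLI confirms that files can actually be written and read:

    hdfs dfsadmin -report                        # expect one live datanode
    hdfs dfs -mkdir -p /user/$(whoami)
    echo hello | hdfs dfs -put - /user/$(whoami)/hello.txt
    hdfs dfs -cat /user/$(whoami)/hello.txt      # should print "hello"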

Set up Hive (local) to work with Spark (Hive metastore backed by MySQL)

Install MySQL 8

  1. Install MySQL 8.0.33
    brew install mysql@8.0
    brew services restart mysql
    
    # test mysql login
    mysql -u root # should be able to get into the mysql shell without a password
    
  2. Assign a password to the root account (in the MySQL shell)
    ALTER USER 'root'@'localhost' IDENTIFIED BY 'Root1234!';
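
As a quick sanity check, log in with the new password; the dreamlab database used later in the Hive metastore JDBC URL can also be created here (createDatabaseIfNotExist=true in that URL would otherwise create it on first use):

    mysql -u root -p'Root1234!' -e "CREATE DATABASE IF NOT EXISTS dreamlab; SHOW DATABASES;"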
    

Download Hive 3.1.3

  1. with Intel chips
    brew install hive
    
  2. with Apple Silicon (M1/M2) chips, we have to install from the release tarball (brew install does not yet support M1/M2 for Hive)
    cd /Users/chenghao/ResearchHub/softwares
    wget https://dlcdn.apache.org/hive/hive-3.1.3/apache-hive-3.1.3-bin.tar.gz
    tar zxvf apache-hive-3.1.3-bin.tar.gz
    sudo ln -s $PWD/apache-hive-3.1.3-bin /usr/local/bin/hive # add to the global environment
    

Configure Hive

  1. add the Hive environment variables to ~/.zshrc, then source ~/.zshrc
    export HIVE_HOME=/usr/local/bin/hive
    export PATH=$PATH:$HIVE_HOME/bin
    export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
    
  2. prepare the hive-site.xml
    cd $HIVE_HOME/conf
    cp hive-default.xml.template hive-site.xml
    
  3. configure the Hive metastore in hive-site.xml, with MySQL as the backing database
    <property>
        <name>javax.jdo.option.ConnectionURL</name>
        <value>jdbc:mysql://localhost:3306/dreamlab?allowPublicKeyRetrieval=true&amp;useSSL=false&amp;createDatabaseIfNotExist=true&amp;characterEncoding=utf-8&amp;useJDBCCompliantTimezoneShift=true&amp;useLegacyDatetimeCode=false&amp;serverTimezone=Europe/Paris</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionDriverName</name>
        <value>com.mysql.cj.jdbc.Driver</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionUserName</name>
        <value>root</value>
    </property>
    <property>
        <name>javax.jdo.option.ConnectionPassword</name>
        <value>Root1234!</value>
    </property>    
    
    
  4. add the MySQL connector jar (mysql-connector-j-8.0.33) to $HIVE_HOME/lib
    cd ~/Downloads
    wget https://cdn.mysql.com//Downloads/Connector-J/mysql-connector-j-8.0.33.tar.gz
    tar zxvf mysql-connector-j-8.0.33.tar.gz
    cp mysql-connector-j-8.0.33/mysql-connector-j-8.0.33.jar $HIVE_HOME/lib
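
Hive typically also needs its HDFS scratch and warehouse directories to exist with group write permission before tables can be created; with the default paths:

    hdfs dfs -mkdir -p /tmp /user/hive/warehouse
    hdfs dfs -chmod g+w /tmp /user/hive/warehouse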
    

Run Hive

  1. Initialize the metastore schema in MySQL (one time)

    cd $HIVE_HOME
    bin/schematool -initSchema -dbType mysql
    
  2. potential error 1 (usually a Guava version conflict between Hive and Hadoop): check this blog for a fix; a common workaround is sketched after this list

    $ bin/schematool -initSchema -dbType mysql
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Exception in thread "main" java.lang.NoSuchMethodError: com.google.common.base.Preconditions.checkArgument(ZLjava/lang/String;Ljava/lang/Object;)
    
  3. potential error 2: delete the illegal character at the reported position (row 3216, col 96) of $HIVE_HOME/conf/hive-site.xml

    $ bin/schematool -initSchema -dbType mysql
    SLF4J: Class path contains multiple SLF4J bindings.
    SLF4J: Found binding in [jar:file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/lib/log4j-slf4j-impl-2.17.1.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: Found binding in [jar:file:/Users/chenghao/.sdkman/candidates/hadoop/3.3.0/share/hadoop/common/lib/slf4j-log4j12-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
    SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
    SLF4J: Actual binding is of type [org.apache.logging.slf4j.Log4jLoggerFactory]
    Exception in thread "main" java.lang.RuntimeException: com.ctc.wstx.exc.WstxParsingException: Illegal character entity: expansion character (code 0x8
     at [row,col,system-id]: [3216,96,"file:/Users/chenghao/ResearchHub/softwares/apache-hive-3.1.3-bin/conf/hive-site.xml"]
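
For potential error 1, a common workaround is to replace Hive's bundled Guava with the newer one shipped by Hadoop (the jar versions below are what Hive 3.1.3 and Hadoop 3.3.0 ship; adjust to whatever is actually on disk):

    rm $HIVE_HOME/lib/guava-19.0.jar
    cp $HADOOP_HOME/share/hadoop/common/lib/guava-27.0-jre.jar $HIVE_HOME/lib/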
    

Enable the Hive Metastore with Spark

cd $HIVE_HOME
nohup hive --service metastore &
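
The metastore listens on port 9083 by default; a quick check that it is up (lsof is one option on macOS), plus the logs captured by nohup:

    lsof -nP -iTCP:9083 -sTCP:LISTEN   # should show a java process
    tail nohup.out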

Set up Spark (local)

Download and deploy

Download a prebuilt release or build from the source code
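
If building from source, the distribution used below (SPARK_HOME points at spark/dist) can be produced with Spark's make-distribution script; a sketch, assuming only the profiles needed here (Hive, the thrift server, and YARN):

    cd /Users/chenghao/ResearchHub/repos/spark        # the Spark source checkout
    ./dev/make-distribution.sh --name local-build -Phive -Phive-thriftserver -Pyarn
    # the runnable distribution ends up in ./dist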

Set up configurations

  1. locate Spark's configuration directory. In my example,

    export SPARK_HOME=/Users/chenghao/ResearchHub/repos/spark/dist
    export SPARK_HOME_CONF=$SPARK_HOME/conf
    
  2. configure $SPARK_HOME_CONF/spark-env.sh

    SPARK_LOCAL_DIRS=/Users/chenghao/ResearchHub/softwares/spark_local
    SPARK_CONF_DIR=${SPARK_HOME}/conf
    SPARK_LOG_DIR=${SPARK_HOME}/logs
    SPARK_PID_DIR=/tmp/hex_pids
    
  3. configure $SPARK_HOME_CONF/spark-defaults.conf (the HDFS event-log directory must exist before the history server starts; see the sketch after this list)

    spark.eventLog.dir=hdfs://localhost:8020/user/spark/eventlog
    spark.eventLog.enabled=true
    spark.history.fs.logDirectory=hdfs://localhost:8020/user/spark/eventlog
    spark.history.ui.port=18088
    spark.yarn.historyServer.address=http://localhost:18088
    
    # optional below
    spark.serializer=org.apache.spark.serializer.KryoSerializer
    spark.kryoserializer.buffer.max=512m
    spark.sql.crossJoin.enabled=true
    spark.dynamicAllocation.enabled=false
    spark.shuffle.service.enabled=false
    spark.sql.adaptive.enabled=true
    spark.sql.cbo.enabled=true
    spark.sql.cbo.joinReorder.dp.star.filter=true
    spark.sql.cbo.joinReorder.enabled=true
    spark.sql.cbo.planStats.enabled=true
    spark.sql.cbo.starSchemaDetection=true
    spark.sql.statistics.histogram.enabled=true
    spark.locality.wait=0s    
    
  4. add the Hadoop/Hive configurations to Spark

    cd $SPARK_HOME/conf
    ln -s $HIVE_HOME/conf/hive-site.xml .
    ln -s $HADOOP_HOME/etc/hadoop/hdfs-site.xml .
    ln -s $HADOOP_HOME/etc/hadoop/core-site.xml .
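
The event-log directory referenced in spark-defaults.conf must exist on HDFS before the history server starts:

    hdfs dfs -mkdir -p /user/spark/eventlog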
    

Run Spark services

cd $SPARK_HOME
bash sbin/start-master.sh
bash sbin/start-history-server.sh
bash sbin/start-thriftserver.sh
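
To verify the whole stack: the master UI should be at http://localhost:8080, the history server at http://localhost:18088 (as configured above), and a SQL round-trip through the thrift server (default port 10000) also exercises the MySQL-backed Hive metastore:

    $SPARK_HOME/bin/beeline -u jdbc:hive2://localhost:10000 -e "SHOW DATABASES;"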
