
Building and Deploying Hadoop Solutions on Linux on Power


Apache Hadoop Version : 2.7.2
Apache Hive Version : 1.2.1
Apache Derby Version : 10.12.1.1
Apache Spark Version : 2.0.0 ( the steps below apply to Spark 1.6.0 as well )
Protobuf ( Google Protocol Buffer ) : Protoc 2.5.0
R : 3.2.5
Scala : 2.11.8
Zeppelin : 0.6.0 ( Node.js is required for the Zeppelin build )
Linux Distro : RHEL 7.2 PPC64LE

Build the Hadoop Source Code

    a)	Build Protobuf ( prerequisite for the Hadoop build ; see the sketch below )
    b)	Build Hadoop 2.7.2 :
            -> wget http://www-eu.apache.org/dist/hadoop/common/hadoop-2.7.2/hadoop-2.7.2-src.tar.gz
            -> tar -xvf hadoop-2.7.2-src.tar.gz
            -> cd hadoop-2.7.2-src
            -> mvn package -Pdist,native -DskipTests -Dtar
            The Hadoop distribution tarball will be created at :
            /tempdisk/software/hadoop/hadoop-2.7.2-src/hadoop-dist/target/hadoop-2.7.2.tar.gz
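
    A minimal sketch for step ( a ), assuming the GitHub release tarball for protoc 2.5.0 ; note that
    protobuf 2.5.0 predates official ppc64le support, so on Power the build may additionally need the
    community patch that adds Power atomics support :
            -> wget https://github.com/protocolbuffers/protobuf/releases/download/v2.5.0/protobuf-2.5.0.tar.gz
            -> tar -xzf protobuf-2.5.0.tar.gz
            -> cd protobuf-2.5.0
            -> ./configure --prefix=/usr/local
            -> make && make install
            -> protoc --version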

Configure and Run Hadoop

AS ROOT USER

1. Disable IPv6 on all nodes ( on both the Master and all the Slaves )

    vi /etc/sysctl.conf
    add ---->
    #disable ipv6
    net.ipv6.conf.all.disable_ipv6 = 1
    net.ipv6.conf.default.disable_ipv6 = 1
    net.ipv6.conf.lo.disable_ipv6 = 1
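
    Apply the change without a reboot by reloading the file :
      sysctl -p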

2. Disable SELinux and the firewall ( on both the Master and all the Slaves )

    install iptables-services if not already installed
      -> yum install iptables-services -y
      -> vi /etc/selinux/config
      set ----->
      -> SELINUX=disabled
      -> setenforce 0    ( takes effect immediately ; the config change applies from the next reboot )
      also stop the firewall and keep it off on boot ( RHEL 7 uses systemd units, not /etc/init.d scripts )
      -> service iptables save
      -> systemctl stop iptables
      -> systemctl disable iptables

3. Set a proper hostname ( on both the Master and all the Slaves )

      hostnamectl set-hostname bigdatahdfs1.ibm.com

4. Create a hadoop group and user

      groupadd hadoop
      useradd -g hadoop hadoop
      passwd hadoop

AS HADOOP USER

5. Add environment variables ( on both the Master and all the Slaves )

      Append the following to the hadoop user's ~/.bashrc :

      #JAVA ENVIRONMENT VARIABLES
      export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le"
      export PATH=$JAVA_HOME/bin:$PATH
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/lib
      export JAVA_LDFLAGS="-L/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le/jre/lib/ppc64/server -R/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le/jre/lib/ppc64/server -ljvm"
      export JAVA_CPPFLAGS="-I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le/include -I/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le/include/linux"
      export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le/jre/lib/ppc64/server
      
      #HADOOP ENVIRONMENT VARIABLES
      export HADOOP_HOME=/home/hadoop/hadoop-2.7.2
      export HADOOP_MAPRED_HOME=$HADOOP_HOME
      export HADOOP_COMMON_HOME=$HADOOP_HOME
      export HADOOP_HDFS_HOME=$HADOOP_HOME
      export YARN_HOME=$HADOOP_HOME
      export HADOOP_YARN_HOME=$YARN_HOME
      export HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
      export YARN_CONF_DIR=$HADOOP_HOME/etc/hadoop

6. Set up passwordless ssh ( on both the Master and all the Slaves )

      ssh-keygen -t rsa -P ""
      cat /home/hadoop/.ssh/id_rsa.pub >> /home/hadoop/.ssh/authorized_keys
      chmod 700 ~/.ssh
      chmod 600 ~/.ssh/authorized_keys
      ssh localhost
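
      For a multi-node cluster, the Master's public key must also be authorized on every Slave.
      A minimal sketch, using a hypothetical slave hostname :
      ssh-copy-id hadoop@bigdatahdfs2.ibm.com
      ssh bigdatahdfs2.ibm.com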

7. Modify Hadoop environment variables ( on both the Master and all the Slaves )

      vi $HADOOP_HOME/libexec/hadoop-config.sh
      export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le"
      vi $HADOOP_HOME/etc/hadoop/hadoop-env.sh
      export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.31-2.b13.ael7b.ppc64le"

8. Check the Hadoop installation ( on both the Master and all the Slaves )

      [hadoop@sys-77402 bin]$ ./hadoop version
      Hadoop 2.7.2
      Subversion Unknown -r Unknown
      Compiled by root on 2016-07-15T07:02Z
      Compiled with protoc 2.5.0
      From source with checksum d0fda26633fa762bff87ec759ebe689c
      This command was run using /home/hadoop/hadoop-2.7.2/share/hadoop/common/hadoop-common-2.7.2.jar
      [hadoop@sys-77402 bin]$

9. Create the Hadoop temp folder ( on both the Master and all the Slaves )

      mkdir -p $HADOOP_HOME/tmp

10. Add Hadoop Slaves ( Master only )

      Add entries for all DataNodes to the Hadoop configuration on the Master node :
      vi $HADOOP_HOME/etc/hadoop/slaves
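
      One hostname per line ; the slave names below are hypothetical examples :
      bigdatahdfs2.ibm.com
      bigdatahdfs3.ibm.com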

11. Hadoop Configuration ( on both the Master and all the Slaves )

      Add the following entry for core-site.xml ( $HADOOP_HOME/etc/hadoop/core-site.xml )
               <configuration>
                   <property>
                         <name>fs.defaultFS</name>
                        <value>hdfs://bigdatahdfs1.ibm.com:9000</value>
                   </property>
                   <property>
                        <name>hadoop.tmp.dir</name>
                        <value>/home/hadoop/hadoop-2.7.2/tmp</value>
                   </property>
                </configuration>


      Add the following entry for hdfs-site.xml ( $HADOOP_HOME/etc/hadoop/hdfs-site.xml )
               <configuration>
                   <property>
                         <name>dfs.replication</name>
                         <value>1</value>
                   </property>
                   <property>
                         <name>dfs.namenode.name.dir</name>
                         <value>file://${hadoop.tmp.dir}/dfs/name</value>
                   </property>
                   <property>
                         <name>dfs.datanode.data.dir</name>
                         <value>file:/home/hadoop/hadoop-data</value>
                   </property>
                   <property>
                         <name>dfs.permissions</name>
                         <value>false</value>
                   </property>
               </configuration>

      Copy the shipped template to create mapred-site.xml, then add the following entry ( $HADOOP_HOME/etc/hadoop/mapred-site.xml )
      cp $HADOOP_HOME/etc/hadoop/mapred-site.xml.template $HADOOP_HOME/etc/hadoop/mapred-site.xml
                <configuration>
                   <property>
                        <name>mapreduce.framework.name</name>
                        <value>yarn</value>
                   </property>
                </configuration>

      Add the following entry for yarn-site.xml ( $HADOOP_HOME/etc/hadoop/yarn-site.xml )
                <configuration>
                <property>
                    <name>yarn.nodemanager.aux-services</name>
                    <value>mapreduce_shuffle</value>
                </property>
                <property>
                    <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
                    <value>org.apache.hadoop.mapred.ShuffleHandler</value>
                </property>
                <property>
                    <name>yarn.resourcemanager.resource-tracker.address</name>
                    <value>bigdatahdfs1.ibm.com:8025</value>
                </property>
                <property>
                    <name>yarn.resourcemanager.scheduler.address</name>
                    <value>bigdatahdfs1.ibm.com:8030</value>
                </property>
                <property>
                    <name>yarn.resourcemanager.address</name>
                    <value>bigdatahdfs1.ibm.com:8032</value>
                </property>
                </configuration>
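
      After editing the files on the Master, push the same four files to every Slave ;
      a minimal sketch, using a hypothetical slave hostname :
                scp $HADOOP_HOME/etc/hadoop/{core-site.xml,hdfs-site.xml,mapred-site.xml,yarn-site.xml} \
                    hadoop@bigdatahdfs2.ibm.com:/home/hadoop/hadoop-2.7.2/etc/hadoop/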

12. Format the NameNode

            cd $HADOOP_HOME/bin
            ./hdfs namenode -format

13. Start all the Hadoop processes

         su - hadoop
         cd $HADOOP_HOME/sbin
         ./start-dfs.sh                                  ==> starts the NameNode and DataNodes
         ./start-yarn.sh                                 ==> starts YARN ( ResourceManager and NodeManagers )
         ./mr-jobhistory-daemon.sh start historyserver   ==> starts the MapReduce Job History Server
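
         To verify, jps ( shipped with the JDK ) should list the running daemons ; on a single node
         hosting both Master and Slave roles the output looks roughly like this ( PIDs will vary ) :
         $ jps
         <pid> NameNode
         <pid> SecondaryNameNode
         <pid> DataNode
         <pid> ResourceManager
         <pid> NodeManager
         <pid> JobHistoryServer
         <pid> Jps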

14. Stop all processes

         a) Stop the Job History Server [ as user hadoop, from $HADOOP_HOME/sbin ]
          ./mr-jobhistory-daemon.sh stop historyserver
         b) Stop YARN [ as user hadoop, from $HADOOP_HOME/sbin ]
          ./stop-yarn.sh
         c) Stop the HDFS daemons [ as user hadoop, from $HADOOP_HOME/sbin ]
          ./stop-dfs.sh

Build, Configure and Run Hue

       -> Hue is a lightweight web server that lets you use Hadoop directly from your browser.
       -> Hue is just a 'view on top of any Hadoop distribution' and can be installed on any machine.

         a) Install pre-requisites
               $ yum install java-1.8.0-openjdk*
               $ yum install ant asciidoc
               $ yum install cyrus-sasl-devel cyrus-sasl-gssapi cyrus-sasl-plain gcc gcc-c++ krb5-devel libffi-devel libtidy libxml2-devel libxslt-devel make maven mysql mysql-devel openldap-devel
               $ yum install python-devel  sqlite-devel openssl-devel gmp-devel


          b) Build Hue
             $ git clone https://github.com/cloudera/hue.git
             $ cd hue
             $ make apps
             $ build/env/bin/hue runserver

          c) Hue should now be up and running at :
             http://localhost:8000
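
          d) To point Hue at the cluster instead of its defaults, the HDFS endpoints can be set in
             desktop/conf/pseudo-distributed.ini ; the keys below follow Hue's standard ini layout,
             and the values assume the hostname and ports used earlier in this guide :
             [hadoop]
               [[hdfs_clusters]]
                 [[[default]]]
                   fs_defaultfs=hdfs://bigdatahdfs1.ibm.com:9000
                   webhdfs_url=http://bigdatahdfs1.ibm.com:50070/webhdfs/v1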

Build, Configure and Run Hive

       -> a data warehouse infrastructure tool
       -> not a relational database
       -> not for OLTP
       -> not for real-time workloads
       -> stores its schema in a database and processes data in HDFS
       -> designed for OLAP
       -> provides an SQL-like query language called HiveQL ( HQL )

       a) Pre-requisites :
              Java 1.8
              Hadoop 2+
       b) Get the Hive source
              wget https://archive.apache.org/dist/hive/stable/apache-hive-1.2.1-src.tar.gz
              tar -xzvf apache-hive-1.2.1-src.tar.gz
       c) Build Hive
              cd apache-hive-1.2.1-src
              mvn clean install -Phadoop-2,dist -Dmaven.test.skip=true -e -X
       d) The Hive packages are built at :
              /tempdisk/software/hive/apache-hive-1.2.1-src/packaging/target/apache-hive-1.2.1-bin.tar.gz
       e) Test Hive Setup 
              1. Setup HIVE Environment Variables :
                     #HIVE Environment Variables
                      export HIVE_HOME=/home/hadoop/apache-hive-1.2.1-bin
                      export PATH=$PATH:$HIVE_HOME/bin
                      export CLASSPATH=$CLASSPATH:$HADOOP_HOME/lib/*:.
                      export CLASSPATH=$CLASSPATH:$HIVE_HOME/lib/*:.
              2. Setup DERBY Environment Variables :
                     #DERBY ENVIRONMENT VARIABLES
                     export DERBY_HOME=/home/hadoop/db-derby-10.12.1.1-src
                     export DERBY_INSTALL=$DERBY_HOME
                     export CLASSPATH=$CLASSPATH:$DERBY_INSTALL/lib/derby.jar:$DERBY_INSTALL/lib/derbytools.jar
               3. Hive runs on top of Hadoop so we must have Hadoop on the path 
                     #HADOOP_HOME Environment Variable
                     export HADOOP_HOME=/home/hadoop/hadoop-2.7.2
               4. Create Hive Directories
                        $HADOOP_HOME/bin/hadoop fs -mkdir -p    /tmp
                        $HADOOP_HOME/bin/hadoop fs -mkdir -p    /user/hive/warehouse
                       $HADOOP_HOME/bin/hadoop fs -chmod g+w   /tmp
                       $HADOOP_HOME/bin/hadoop fs -chmod g+w   /user/hive/warehouse
               5. Run Hive CLI
                       $HIVE_HOME/bin/hive
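                6. A quick smoke test from the shell ( the table name is arbitrary ) :
                        $HIVE_HOME/bin/hive -e "CREATE TABLE smoke_test (id INT); SHOW TABLES; DROP TABLE smoke_test;"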

Build, Configure and Run R on Power

      a) Get the R source
            wget https://cran.r-project.org/src/base/R-3/R-3.2.5.tar.gz
            tar zxvf R-3.2.5.tar.gz
            cd /tempdisk/software/R/R-3.2.5
      b) Install Dependency
            yum install readline-devel
      c) Build
            ./configure --with-x=no
            make
            make install
      d) Set up the environment variable
             export R_HOME=/tempdisk/software/R/R-3.2.5
      e) Validate the build and install of R packages
            echo "sessionInfo()" | R --save

Build, Configure and Run Spark on Power

       a) Checkout Spark Code
             git clone https://github.com/apache/spark
             cd spark
             git checkout tags/v2.0.0

       b) Export the SPARK_HOME environment variable ( optional ; needed only for functional tests against the built Spark )
            export SPARK_HOME=$HOME/spark

       c) Invoke the Maven build process ( this takes quite some time, 5 to 6 hours in some cases )

            mvn -Pyarn -Phadoop-2.7 -Phive -Phive-thriftserver -DskipTests package

        d) Create a distributable package of Spark ( SparkR may temporarily be missing from the package ) :

            ./dev/make-distribution.sh --name spark-2.0.0-hadoop2.7-ppc64le --tgz -Psparkr -Phadoop-2.7 -Phive -Phive-thriftserver -Pyarn

        e) Built package location :
              
            /tempdisk/software/spark/spark/spark-2.0.0-bin-spark-2.0.0-hadoop2.7-ppc64le.tgz

        f) Set up the environment variables for Spark

             #SPARK ENVIRONMENT VARIABLES
             export SPARK_HOME=/home/hadoop/spark-2.0.0-bin-spark-2.0.0-hadoop2.7-ppc64le
             export R_HOME=/home/hadoop/R-3.2.5
             export PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin
             export CLASSPATH=$SPARK_HOME/jars/*:$CLASSPATH
             export SPARK_CLASSPATH=$HADOOP_CLASSPATH:$SPARK_HOME/jars/*:$CLASSPATH
             export SPARK_LOG_DIR=/home/hadoop/logs/spark
             export SPARK_WORKER_DIR=/tmp/spark


        g) Start the Spark Master in standalone mode
                cd $SPARK_HOME/sbin
                ./start-master.sh
        h) Start a Spark Worker in standalone mode
                cd $SPARK_HOME/sbin
                ./start-slave.sh <master_url>
                    ( e.g. : ./start-slave.sh spark://sys-77402.dal-ebis.ihost.com:7077 )
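        i) Smoke-test the deployment by submitting the bundled SparkPi example against the
           standalone Master ( master URL as in the example above ) :
                $SPARK_HOME/bin/run-example --master spark://sys-77402.dal-ebis.ihost.com:7077 SparkPi 10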

Build, Configure and Run Scala on Power

        -> Scala does not need to be built from source ; the published RPM runs on Power as-is.
        -> Prerequisites for the Scala install :
                 -> wget and OpenJDK 1.8
        -> Install Scala
                 -> wget http://downloads.typesafe.com/scala/2.11.8/scala-2.11.8.rpm
                 -> sudo rpm -ivh scala-2.11.8.rpm
        -> Test Scala
                 -> scala -version
                    [root@sys-77402 scala-2.11.8]# scala -version
                              Scala code runner version 2.11.8 -- Copyright 2002-2016, LAMP/EPFL
                    [root@sys-77402 scala-2.11.8]# lscpu
                    Architecture:          ppc64le
                    Byte Order:            Little Endian
                    CPU(s):                8
                    On-line CPU(s) list:   0-7
                    Thread(s) per core:    8
                    Core(s) per socket:    1
                    Socket(s):             1
                    NUMA node(s):          1
                    Model:                 IBM pSeries (emulated by qemu)
                    L1d cache:             64K
                    L1i cache:             32K
                    NUMA node0 CPU(s):     0-7
                    [root@sys-77402 scala-2.11.8]# 

Build, Configure and Run Zeppelin on Power