Hadoop Performance Tuning on IBM OpenPOWER

System Tunings
A) System Parameters

     Disk readahead: 32MB
     Disk I/O scheduler: deadline
     Disk queue depth: 128
     Disk nr_requests: 4096
     Huge pages (16MB): 4104
     Infiniband send_queue_size: 1024
     Infiniband recv_queue_size: 2048
     Infiniband mode: connected
     Infiniband MTU: 65520
     Infiniband txqueuelen: 10000
     Loopback MTU: 65520
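
These settings can be applied with standard Linux tools; a minimal sketch is shown below (the device name sdb and the interface name ib0 are assumptions for illustration; the IPoIB send/recv queue sizes are normally set via ib_ipoib module options):

       # Block device settings (hypothetical device /dev/sdb)
       blockdev --setra 65536 /dev/sdb                     # 32MB readahead = 65536 x 512-byte sectors
       echo deadline > /sys/block/sdb/queue/scheduler      # deadline I/O scheduler
       echo 128 > /sys/block/sdb/device/queue_depth        # queue depth
       echo 4096 > /sys/block/sdb/queue/nr_requests        # nr_requests

       # IPoIB and loopback settings (hypothetical interface ib0)
       echo connected > /sys/class/net/ib0/mode            # connected mode
       ip link set ib0 mtu 65520 txqueuelen 10000
       ip link set lo mtu 65520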

B) Kernel Parameters

       net.core.wmem_default=262144
       net.core.rmem_default=262144
       net.core.wmem_max=16777216
       net.core.rmem_max=16777216
       net.core.netdev_max_backlog=250000
       net.ipv4.tcp_wmem="4096 262144 16777216"
       net.ipv4.tcp_rmem="4096 262144 16777216"
       net.ipv4.tcp_timestamps=0
       net.ipv4.tcp_no_metrics_save=1
       net.ipv4.tcp_mem="16777216 16777216 16777216"
       net.ipv4.tcp_sack=1
       vm.dirty_background_ratio=1
       vm.dirty_ratio=2
       vm.dirty_writeback_centisecs=100
       vm.dirty_expire_centisecs=1000
       vm.swappiness=5
       kernel.sem="250 32000 100 128"
       fs.aio-max-nr=1048576
       net.core.somaxconn=1000
       vm.min_free_kbytes=16777216

C) For better data traffic across the nodes, tune the TCP window settings by enabling TCP window scaling, which increases the TCP buffer limits and the backlog size. Add the following lines to "/etc/sysctl.conf":

       # set min free memory, replace 1024 with 6% of the total amount of physical memory
       vm.min_free_kbytes= 1024
       # enable tcp window scaling
       net.ipv4.tcp_window_scaling = 1
       # increase Linux TCP buffer limits
       net.core.rmem_max = 8388608
       net.core.wmem_max = 8388608
       # increase default and maximum Linux TCP buffer sizes
       net.ipv4.tcp_rmem = 4096 262144 8388608
       net.ipv4.tcp_wmem = 4096 262144 8388608
       # increase max backlog to avoid dropped packets
       net.core.netdev_max_backlog=2500
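
After editing /etc/sysctl.conf, the settings can be reloaded and verified without a reboot, for example:

       sysctl -p                                  # reload /etc/sysctl.conf
       sysctl net.ipv4.tcp_window_scaling         # verify an individual key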

(If GPFS is integrated into the Hadoop cluster)
D) The intermediate data location should be on a local disk and not on GPFS.

If there is no local disk, several LUNs (with ext3 or ext4 file systems created on them) can be assigned to the "yarn.nodemanager.local-dirs" and "hadoop.tmp.dir" parameters. For example (see the sample yarn-site.xml entry after this list):
-> mount ext4/LUN1 at /hadoop/yarn/local1 and ext4/LUN2 at /hadoop/yarn/local2 on each node
-> configure yarn.nodemanager.local-dirs="/hadoop/yarn/local1,/hadoop/yarn/local2"
-> yarn.nodemanager.local-dirs takes effect on all nodes.
-> Assign the same number of ext4 LUNs to each node.
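
A minimal yarn-site.xml entry matching the example above (the paths are illustrative):

    <property>
        <name>yarn.nodemanager.local-dirs</name>
        <value>/hadoop/yarn/local1,/hadoop/yarn/local2</value>
    </property>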

(Applicable to Hadoop clusters in all application scenarios)
E) Below are the Hadoop configuration parameters to be considered for performance tuning

   a) dfs.block.size : size of the blocks into which the input data is split
   b) mapred.compress.map.output : whether to compress the output of a map
   c) mapred.tasktracker.map/reduce.tasks.maximum : the maximum number of map/reduce tasks run simultaneously by a task tracker
   d) io.sort.mb : the size of the in-memory buffer (in MB) used by a map task for sorting its output
   e) io.sort.factor : the maximum number of streams to merge at once when sorting files; this property is also used in the reduce phase, and it is fairly common to increase it to 100
   f) mapred.job.reuse.jvm.num.tasks : the maximum number of tasks of a given job to run per JVM on a tasktracker; a value of -1 indicates no limit, i.e. the same JVM may be used for all tasks of a job
   g) mapred.reduce.parallel.copies : the number of threads used to copy map outputs to the reducer
   h) mapreduce.map.java.opts : JVM settings for map tasks
   i) mapreduce.reduce.java.opts : JVM settings for reduce tasks

      The following parameters limit the total memory (possibly virtual) available to your tasks, including heap,
      stack and class definitions:
         mapreduce.map.memory.mb
         mapreduce.reduce.memory.mb

      The following properties specify the options passed to the JVMs running your tasks. Their heap sizes should
      be approximately 75% of the values allocated to the corresponding "memory.mb" parameters above (see the
      sketch after this section):
          mapreduce.map.java.opts
          mapreduce.reduce.java.opts

      The following parameters control the memory available to the Application Master in a YARN cluster.
      The command-opts value should likewise be about 75% of the value of "am.resource.mb":
          yarn.app.mapreduce.am.command-opts
          yarn.app.mapreduce.am.resource.mb
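
A minimal mapred-site.xml sketch of the 75% rule above (the 4096/3072 figures are illustrative, not recommendations):

    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>4096</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <!-- heap set to roughly 75% of mapreduce.map.memory.mb -->
        <value>-Xmx3072m</value>
    </property>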

F) Hadoop Configuration Files to be considered for tuning

   hadoop-env.sh : environment variables used to run Hadoop
   core-site.xml : configuration settings for Hadoop core, such as I/O settings common to HDFS and MapReduce
   hdfs-site.xml : configuration settings for the HDFS daemons: the NameNode, the Secondary NameNode and the DataNodes
   mapred-site.xml : configuration settings for the MapReduce daemons: the JobTracker and the TaskTrackers

G) JVM Settings

    **** With IBM Java ****    
     -server -Xnoclassgc -Xgcpolicy:optthruput -Xms890m -Xmx890m -Xgcthreads4 -XlockReservation -Djava.net.preferIPv4Stack=true -Xcodecache4m -Xcompressedrefs -Xlp:codecache:pagesize=16m -Xlp:objectheap:pagesize=16m -jit:compilationThreads=1

   **** With OpenJDK ****
        -server -Xms700M -Xmx700M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:NewSize=60M -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M -XX:InitialSurvivorRatio=28

    ****  GC logging can be enabled for GC monitoring using the parameters below  ****
     -verbose:gc -Xloggc:<path_to_GC_log_folder>/@taskid@.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps

    **** Sample mapred-site.xml file entry ****
    <property>
        <name>mapred.child.java.opts</name>
        <value>
             -server -Xms1024M -Xmx2048M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy 
             -XX:+DisableExplicitGC -XX:NewSize=60M -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M 
             -XX:InitialSurvivorRatio=28
        </value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>
             -server -Xms1024M -Xmx2048M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy 
             -XX:+DisableExplicitGC -XX:NewSize=60M  -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M 
             -XX:InitialSurvivorRatio=28
        </value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>
             -server -Xms1024M -Xmx2048M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy 
             -XX:+DisableExplicitGC -XX:NewSize=60M  -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M  
             -XX:InitialSurvivorRatio=28
        </value>
     </property>

H) Huge Pages Tuning and application in OpenJDK

   Steps to set up 16MB huge pages for OpenJDK 1.8

   a) Calculate the system parameter values
           The values below were calculated for a 16MB huge page size (the arithmetic is sketched below).
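
           A sketch of the arithmetic behind the values used in steps b) and c) (the 4160-page reservation is an assumption; size it to your own JVM heaps):

              # 16MB huge pages, 4160 pages reserved:
              #   kernel.shmmax = 4160 * 16 MiB = 4160 * 16777216 = 69793218560 bytes
              #   memlock (limits.conf, in KiB) = 4160 * 16384 = 68157440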

   b) Add the below entries to /etc/sysctl.conf (as root user)
           kernel.shmmax=69793218560
           vm.nr_hugepages=4160
           vm.hugetlb_shm_group=<group-id-of-the-user-that-will-run-the-application>   (tested here with the running user's group ID)

   c) Add the below entries to /etc/security/limits.conf (as root user)
           username soft memlock 68157440
           username hard memlock 68157440



   d) User Java flags : -XX:+UseLargePages -XX:LargePageSizeInBytes=16m -XX:+UseSHM
           Make sure the "+UseSHM" option is used; without it, large-page allocation fails with an error.
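
           For Hadoop task JVMs these flags can be appended to the existing java.opts values, for example (a sketch; the heap size is illustrative):

              <property>
                  <name>mapred.child.java.opts</name>
                  <value>-Xmx890m -XX:+UseLargePages -XX:LargePageSizeInBytes=16m -XX:+UseSHM</value>
              </property>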

   e) Verify with the sample commands below that huge pages are enabled

            watch -d grep Huge /sys/devices/system/node/node*/meminfo
             Every 2.0s: grep Huge /sys/devices/system/node/node0/...  Wed Sep 30 07:15:50 2015 
             /sys/devices/system/node/node0/meminfo:Node 0 AnonHugePages:         0 kB
             /sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total:  1040
             /sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free:   1040
             /sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp:      0
             /sys/devices/system/node/node16/meminfo:Node 16 AnonHugePages:         0 kB
             /sys/devices/system/node/node16/meminfo:Node 16 HugePages_Total:  1040
             /sys/devices/system/node/node16/meminfo:Node 16 HugePages_Free:   1040
             /sys/devices/system/node/node16/meminfo:Node 16 HugePages_Surp:      0
             /sys/devices/system/node/node17/meminfo:Node 17 AnonHugePages:         0 kB
             /sys/devices/system/node/node17/meminfo:Node 17 HugePages_Total:  1040
             /sys/devices/system/node/node17/meminfo:Node 17 HugePages_Free:   1040
             /sys/devices/system/node/node17/meminfo:Node 17 HugePages_Surp:      0
             /sys/devices/system/node/node1/meminfo:Node 1 AnonHugePages:         0 kB
             /sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total:  1040
             /sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free:   1040
             /sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp:      0
             root@p215n17:~# java  -XX:+UseLargePages -XX:LargePageSizeInBytes=16m -XX:+UseSHM -version
             openjdk version "1.8.0_45-internal"
             OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
             OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
             root@p215n17:~#

I) Sample Map-Reduce parameters

 mapreduce.task.io.sort.mb=650
 mapreduce.job.reduce.slowstart.completedmaps=0.0005
 dfs.blocksize=512MB
 Compression=Lz4
 mapreduce.reduce.input.buffer.percent=0.96
 mapreduce.reduce.shuffle.merge.percent=0.96
 mapreduce.reduce.shuffle.input.buffer.percent=0.8
 mapreduce.task.io.sort.factor=200
 mapreduce.reduce.shuffle.copy.mapout.unit=50
 mapreduce.reduce.shuffle.parallelcopies=10
 mapreduce.job.type=UnrecoverableJob
 mapreduce.job.local.dir.locator=roundrobin
 mapred.map.child.log.level=WARN
 mapred.reduce.child.log.level=WARN
 mapreduce.job.intermediatedata.checksum=false
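
 These can also be supplied per job on the command line rather than in the site configuration files, for example (a sketch reusing the TeraSort invocation from the next section):

    yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort \
        -Dmapreduce.task.io.sort.mb=650 \
        -Dmapreduce.task.io.sort.factor=200 \
        -Dmapreduce.reduce.shuffle.parallelcopies=10 \
        /hadoop/teragen-output /hadoop/terasort-output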

J) Performance Benchmark - Tera-gen/Tera-sort/Tera-validate

   -> The TeraSort benchmark is probably the most well-known Hadoop benchmark.    
   -> The Goal of TeraSort is to sort 1TB of data as fast as possible.     
   -> It is a benchmark that combines testing of the HDFS and MapReduce layers of a Hadoop cluster.
   -> A full TeraSort benchmark run consists of the following three steps:    
           -> Generating the input data via TeraGen.    
           -> Running the actual TeraSort on the input data.     
           -> Validating the sorted output data via TeraValidate.     

    1. Set up environment variables (these should be set for the operating user)
    export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le"
    export HADOOP_HOME="/usr/iop/4.1.0.0/hadoop"
    export HADOOP_MR_DIR="/usr/iop/current/hadoop-mapreduce-client"
    export SPARK_HOME="/usr/iop/4.1.0.0/spark"

    2. Execute Teragen    
    yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen 1000 /hadoop/teragen-output  

    Note : the teragen program also accepts command-line arguments for a number of settings, for example:
    hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen -Ddfs.block.size=$BLOCK_SIZE \
    -Dmapred.map.tasks=$NUM_MAP $NUM_ROWS $OUTPUT_DATA
     NUM_MAP=157 # Number of map tasks to execute
     NUM_ROWS=2500000000 # Number of rows to generate
     BLOCK_SIZE=536870912 # Block size in bytes
     OUTPUT_DATA="/hadoop/teragen-output" # Output directory where the generated data is stored
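
     With the variable values above substituted, the command resolves to:

     hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen \
         -Ddfs.block.size=536870912 -Dmapred.map.tasks=157 \
         2500000000 /hadoop/teragen-output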

    3. Execute Terasort   
    yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort /hadoop/teragen-output /hadoop/terasort-output       

    Note : as with teragen, terasort also accepts command-line arguments, for example:
       hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort \
            -Dmapred.reduce.tasks=$NUM_REDUCE \
            -Ddfs.block.size=$BLOCK_SIZE \
            -Dmapred.child.java.opts=-Xmx1200m \
            -Dmapred.map.child.java.opts=-Xmx1200m \
            -Dmapred.reduce.child.java.opts=$MEM_REDUCE \
            -Dio.sort.mb=$MEM_IOSORT \
            -Dmapred.reduce.slowstart.completed.maps=0.0 \
            -Dio.sort.spill.percent=1.0 \
            -Dio.sort.record.percent=0.17 \
            $INPUT_DATA $OUTPUT_DATA

            NUM_REDUCE=288 # Number of reduce tasks to execute
            BLOCK_SIZE=536870912 # Block size in bytes
            MEM_REDUCE=-Xmx1200m # Reduce child java opts
            MEM_IOSORT=1024 # io.sort.mb value
            INPUT_DATA="/user/biadmin/in-dir" # Input directory where the data is present
            OUTPUT_DATA="/user/biadmin/out-dir" # Output directory where the sorted data is stored
                  

    4. Execute Teravalidate    
    yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teravalidate /hadoop/terasort-output /hadoop/teravalidate-output        

    5. Teragen/Terasort tuning tips
         -> Set dfs.block.size the same for Teragen and Terasort.
         -> Set mapred.map.tasks to at least the number of hardware threads in the entire cluster.
                   For example, with 9 cluster nodes use 18, 27, 36, and so on.
         -> You can compute the number of Terasort map tasks as follows (see the worked example after this list):
                   Terasort map tasks = sum of blocks (of size dfs.block.size) across the files generated by Teragen
                   Terasort map tasks = (number of files) * (blocks per file)
         -> To minimize Terasort map tasks, avoid files whose last block (based on dfs.block.size) is small.
            Instead, the last block should be large; the closer to dfs.block.size, the better.
         -> To make Teragen fast, you can set dfs.replication=1.
         -> But if you plan to run Terasort on the generated data, then use dfs.replication=3.
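
     A worked example of the map-task arithmetic above, using the Teragen settings from step 2 (Teragen rows are 100 bytes each; the file and block counts below follow from those settings):

        # input size = 2,500,000,000 rows * 100 B/row = 250,000,000,000 B
        # 157 Teragen map tasks -> 157 output files of ~1.59 GB each
        # blocks per file at a 512MB (536870912-byte) block size = ceil(1.59 GB / 512 MB) = 3
        #   (the last block is ~495 MB, i.e. close to a full block, as recommended above)
        # Terasort map tasks = 157 files * 3 blocks/file = 471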

K) Monitoring

     -> Job Tracker : http://hostname:50130/jobtracker.jsp      
     -> DFS Health  : http://hostname:50170/dfshealth.jsp