Hadoop Performance Tuning on IBM OpenPOWER
System Tunings
A) System Parameters
Disk readahead: 32MB
Disk IO Schedule: deadline
Disk queue depth: 128
Disk nr_requests: 4096
Huge Pages (16MB): 4104
Infiniband send_queue_size: 1024
Infiniband recv_queue_size: 2048
Infiniband mode: connected
Infiniband MTU: 65520
Infiniband txqueuelen: 10000
Loopback MTU: 65520
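As an illustration, the parameters above could be applied at runtime with commands along the following lines. This is only a sketch: the device name sdb and the interface name ib0 are placeholders, and it assumes the system's default huge page size is 16MB; adjust for your hardware.
blockdev --setra 65536 /dev/sdb                  # 32MB readahead (65536 x 512-byte sectors)
echo deadline > /sys/block/sdb/queue/scheduler   # deadline I/O scheduler
echo 128 > /sys/block/sdb/device/queue_depth     # disk queue depth
echo 4096 > /sys/block/sdb/queue/nr_requests     # disk nr_requests
echo 4104 > /proc/sys/vm/nr_hugepages            # huge pages (assumes 16MB default huge page size)
echo connected > /sys/class/net/ib0/mode         # IPoIB connected mode
ip link set dev ib0 mtu 65520 txqueuelen 10000   # Infiniband MTU and txqueuelen
ip link set dev lo mtu 65520                     # loopback MTU
# IPoIB send/recv queue sizes are ib_ipoib module options, e.g. in /etc/modprobe.d/ib_ipoib.conf:
# options ib_ipoib send_queue_size=1024 recv_queue_size=2048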
B) Kernel Parameters
net.core.wmem_default=262144
net.core.rmem_default=262144
net.core.wmem_max=16777216
net.core.rmem_max=16777216
net.core.netdev_max_backlog=250000
net.ipv4.tcp_wmem="4096 262144 16777216"
net.ipv4.tcp_rmem="4096 262144 16777216"
net.ipv4.tcp_timestamps=0
net.ipv4.tcp_no_metrics_save=1
net.ipv4.tcp_mem="16777216 16777216 16777216"
net.ipv4.tcp_sack=1
vm.dirty_background_ratio=1
vm.dirty_ratio=2
vm.dirty_writeback_centisecs=100
vm.dirty_expire_centisecs=1000
vm.swappiness=5
kernel.sem="250 32000 100 128"
fs.aio-max-nr=1048576
net.core.somaxconn=1000
vm.min_free_kbytes=16777216
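The kernel parameters above can be tried at runtime with sysctl before being persisted, for example:
sysctl -w vm.swappiness=5
sysctl -w net.core.somaxconn=1000
sysctl net.core.rmem_max        # read back the current value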
C) For better data traffic across the nodes, tune the TCP window settings by enabling TCP window scaling, which increases the TCP buffer limit and the backlog size. Add the following lines to "/etc/sysctl.conf":
# set min free memory, replace 1024 with 6% of the total amount of physical memory
vm.min_free_kbytes= 1024
# enable tcp window scaling
net.ipv4.tcp_window_scaling = 1
# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
# increase default and maximum Linux TCP buffer sizes
net.ipv4.tcp_rmem = 4096 262144 8388608
net.ipv4.tcp_wmem = 4096 262144 8388608
# increase max backlog to avoid dropped packets
net.core.netdev_max_backlog=2500
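After editing /etc/sysctl.conf, the new values can be loaded without a reboot and then verified, for example:
sysctl -p /etc/sysctl.conf
sysctl net.ipv4.tcp_window_scaling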
(If GPFS is integrated into the Hadoop cluster)
D) The intermediate data location should be on a local disk, not on GPFS.
If there is no local disk, assign several LUNs (with ext3 or ext4 filesystems created on them) to the "yarn.nodemanager.local-dirs" and "hadoop.tmp.dir" parameters. For example (see the configuration sketch after this list):
-> mount ext4/LUN1 into /hadoop/yarn/local1, ext4/LUN2 into /hadoop/yarn/local2 on each node
-> configure yarn.nodemanager.local-dirs="/hadoop/yarn/local1,/hadoop/yarn/local2"
-> yarn.nodemanager.local-dirs is effective for all nodes.
-> Assign the same number of ext4 LUNs to each node.
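A minimal sketch of the corresponding entries (yarn.nodemanager.local-dirs in yarn-site.xml, hadoop.tmp.dir in core-site.xml; the paths follow the example above, and the tmp subdirectory is just a placeholder):
<property>
<name>yarn.nodemanager.local-dirs</name>
<value>/hadoop/yarn/local1,/hadoop/yarn/local2</value>
</property>
<property>
<name>hadoop.tmp.dir</name>
<value>/hadoop/yarn/local1/tmp</value>
</property>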
(Applicable to the Hadoop cluster for all application scenarios)
E) Below are the Hadoop configuration parameters to consider for performance tuning:
a) dfs.block.size : Size of the data block in which the input data is split
b) mapred.compress.map.output : Specify whether to compress output of a map
c) mapred.tasktracker.map/reduce.tasks.maximum : The maximum number of map/reduce tasks that will be run simultaneously by a task tracker
d) io.sort.mb : The size of in-memory buffer (in MBs) used by map task for sorting its output.
e) io.sort.factor : The maximum number of streams to merge at once when sorting files. This property is also used in reduce phase. It’s fairly common to increase this to 100.
f) mapred.job.reuse.jvm.num.tasks : The maximum number of tasks to run for a given job for each JVM on a tasktracker. A value of -1 indicates no limit: the same JVM may be used for all tasks for a job.
g) mapred.reduce.parallel.copies : The number of threads used to copy map outputs to the reducer
h) mapreduce.map.java.opts : Apply JVM Settings for Map Tasks
i) mapreduce.reduce.java.opts : Apply JVM Settings for Reduce Tasks
The following properties let you limit the total memory (possibly virtual) available to your tasks, including heap, stack, and class definitions:
mapreduce.map.memory.mb
mapreduce.reduce.memory.mb
The following properties let you specify options passed to the JVMs running your tasks. Their heap sizes (-Xmx) should be approximately 75% of the values allocated via the memory.mb parameters above:
mapreduce.map.java.opts
mapreduce.reduce.java.opts
The following parameters control the memory available to the Application Master in a YARN cluster. The command-opts value should be about 75% of the value of yarn.app.mapreduce.am.resource.mb (a sample mapred-site.xml fragment follows this list):
yarn.app.mapreduce.am.command-opts
yarn.app.mapreduce.am.resource.mb
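As an illustration of the 75% rule, a mapred-site.xml fragment might look like the following; the 2048MB and 4096MB container sizes are placeholders, only the ratio between memory.mb and -Xmx matters here:
<property>
<name>mapreduce.map.memory.mb</name>
<value>2048</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1536m</value>
</property>
<property>
<name>mapreduce.reduce.memory.mb</name>
<value>4096</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>-Xmx3072m</value>
</property>
<property>
<name>yarn.app.mapreduce.am.resource.mb</name>
<value>2048</value>
</property>
<property>
<name>yarn.app.mapreduce.am.command-opts</name>
<value>-Xmx1536m</value>
</property>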
F) Hadoop Configuration Files to be considered for tuning
hadoop-env.sh : Environment variables used to run Hadoop
core-site.xml : Configuration settings for Hadoop core, such as I/O settings common to HDFS and MapReduce
hdfs-site.xml : Configuration settings for the HDFS daemons: the NameNode, Secondary NameNode, and DataNodes
mapred-site.xml : Configuration settings for the MapReduce daemons: the JobTracker and TaskTrackers
G). JVM Settings
**** With IBM Java ****
-server -Xnoclassgc -Xgcpolicy:optthruput -Xms890m -Xmx890m -Xgcthreads4 -XlockReservation -Djava.net.preferIPv4Stack=true -Xcodecache4m -Xcompressedrefs -Xlp:codecache:pagesize=16m -Xlp:objectheap:pagesize=16m -jit:compilationThreads=1
**** With OpenJDK ****
-server -Xms700M -Xmx700M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy -XX:+DisableExplicitGC -XX:NewSize=60M -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M -XX:InitialSurvivorRatio=28
**** GC Logging can be enabled using below parameter for GC Monitoring ****
-verbose:gc -Xloggc:<path_to_GC_log_folder>/@taskid@gc.log -XX:+PrintGC -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
**** Sample mapred-site.xml file entry ****
<property>
<name>mapred.child.java.opts</name>
<value>
-server -Xms1024M -Xmx2048M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy
-XX:+DisableExplicitGC -XX:NewSize=60M -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M
-XX:InitialSurvivorRatio=28
</value>
</property>
<property>
<name>mapreduce.map.java.opts</name>
<value>
-server -Xms1024M -Xmx2048M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy
-XX:+DisableExplicitGC -XX:NewSize=60M -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M
-XX:InitialSurvivorRatio=28
</value>
</property>
<property>
<name>mapreduce.reduce.java.opts</name>
<value>
-server -Xms1024M -Xmx2048M -XX:+UseParallelGC -XX:ParallelGCThreads=4 -XX:-UseAdaptiveSizePolicy
-XX:+DisableExplicitGC -XX:NewSize=60M -XX:MaxNewSize=60M -XX:PermSize=12M -XX:MaxPermSize=12M
-XX:InitialSurvivorRatio=28
</value>
</property>
H) Huge Pages Tuning and application in OpenJDK
Steps for setting up 16MB huge pages with OpenJDK 1.8
a) Calculate the system parameter values
The values below were calculated for a 16MB huge page size (see the calculation note after step c).
b) Add below entries to /etc/sysctl.conf [ as root user ]
kernel.shmmax=69793218560
vm.nr_hugepages=4160
vm.hugetlb_shm_group=<group-id-under-which-the-application-will-run> ====> tested against the running user's group ID
c) Add below entries to /etc/security/limits.conf ( as root user )
username soft memlock 68157440
username hard memlock 68157440
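For reference, the values above are consistent with 4160 huge pages of 16MB each:
4160 pages x 16 MB = 66,560 MB
66,560 MB x 1024 = 68,157,440 KB (the memlock limit above)
68,157,440 KB x 1024 = 69,793,218,560 bytes (kernel.shmmax above)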
d) User Java flags : -XX:+UseLargePages -XX:LargePageSizeInBytes=16m -XX:+UseSHM
Make sure you use the "+UseSHM" option; otherwise you will get an error.
e) Test with the sample commands below to confirm that huge pages are enabled
watch -d grep Huge /sys/devices/system/node/node*/meminfo
Every 2.0s: grep Huge /sys/devices/system/node/node0/... Wed Sep 30 07:15:50 2015
/sys/devices/system/node/node0/meminfo:Node 0 AnonHugePages: 0 kB
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Total: 1040
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Free: 1040
/sys/devices/system/node/node0/meminfo:Node 0 HugePages_Surp: 0
/sys/devices/system/node/node16/meminfo:Node 16 AnonHugePages: 0 kB
/sys/devices/system/node/node16/meminfo:Node 16 HugePages_Total: 1040
/sys/devices/system/node/node16/meminfo:Node 16 HugePages_Free: 1040
/sys/devices/system/node/node16/meminfo:Node 16 HugePages_Surp: 0
/sys/devices/system/node/node17/meminfo:Node 17 AnonHugePages: 0 kB
/sys/devices/system/node/node17/meminfo:Node 17 HugePages_Total: 1040
/sys/devices/system/node/node17/meminfo:Node 17 HugePages_Free: 1040
/sys/devices/system/node/node17/meminfo:Node 17 HugePages_Surp: 0
/sys/devices/system/node/node1/meminfo:Node 1 AnonHugePages: 0 kB
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Total: 1040
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Free: 1040
/sys/devices/system/node/node1/meminfo:Node 1 HugePages_Surp: 0
root@p215n17:~# java -XX:+UseLargePages -XX:LargePageSizeInBytes=16m -XX:+UseSHM -version
openjdk version "1.8.0_45-internal"
OpenJDK Runtime Environment (build 1.8.0_45-internal-b14)
OpenJDK 64-Bit Server VM (build 25.45-b02, mixed mode)
root@p215n17:~#
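To apply the same flags to Hadoop task JVMs, they could be appended to the task JVM options in mapred-site.xml. This is only a sketch: the heap size is a placeholder, and the user running the containers must belong to the vm.hugetlb_shm_group group and have the memlock limits configured above.
<property>
<name>mapreduce.map.java.opts</name>
<value>-Xmx1536M -XX:+UseLargePages -XX:LargePageSizeInBytes=16m -XX:+UseSHM</value>
</property>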
I) Sample Map-Reduce parameters
mapreduce.task.io.sort.mb=650
mapreduce.job.reduce.slowstart.completedmaps=0.0005
dfs.blocksize=512MB
Compression=Lz4
mapreduce.reduce.input.buffer.percent=0.96
mapreduce.reduce.shuffle.merge.percent=0.96
mapreduce.reduce.shuffle.input.buffer.percent=0.8
mapreduce.task.io.sort.factor=200
mapreduce.reduce.shuffle.copy.mapout.unit=50
mapreduce.reduce.shuffle.parallelcopies=10
mapreduce.job.type=UnrecoverableJob
mapreduce.job.local.dir.locator=roundrobin
mapred.map.child.log.level=WARN
mapred.reduce.child.log.level=WARN
mapreduce.job.intermediatedata.checksum=false
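Most of these parameters can also be supplied per job on the command line via -D, for example (the input and output paths reuse the benchmark directories from section J below):
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort \
  -Dmapreduce.task.io.sort.mb=650 -Dmapreduce.task.io.sort.factor=200 \
  -Dmapreduce.reduce.shuffle.parallelcopies=10 \
  /hadoop/teragen-output /hadoop/terasort-output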
J) Performance Benchmark - Tera-gen/Tera-sort/Tera-validate
-> The TeraSort benchmark is probably the most well-known Hadoop benchmark.
-> The Goal of TeraSort is to sort 1TB of data as fast as possible.
-> It is a benchmark that combines testing the HDFS and MapReduce layers of a Hadoop cluster.
-> A full TeraSort benchmark run consists of the following three steps:
-> Generating the input data via TeraGen.
-> Running the actual TeraSort on the input data.
-> Validating the sorted output data via TeraValidate.
1. Set up environment variables (these should be set for the operating user)
export JAVA_HOME="/usr/lib/jvm/java-1.8.0-openjdk-1.8.0.65-3.b17.el7.ppc64le"
export HADOOP_HOME="/usr/iop/4.1.0.0/hadoop"
export HADOOP_MR_DIR="/usr/iop/current/hadoop-mapreduce-client"
export SPARK_HOME="/usr/iop/4.1.0.0/spark"
2. Execute Teragen
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen 1000 /hadoop/teragen-output
Note : The Teragen program also accepts command-line arguments for a number of tuning parameters, for example:
hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen -Ddfs.block.size=$BLOCK_SIZE
-Dmapred.map.tasks=$NUM_MAP $NUM_ROWS $OUTPUT_DATA
NUM_MAP=157 # Number of map tasks to execute
NUM_ROWS=2500000000 # Number of rows to generate
BLOCK_SIZE=536870912 #Block size
OUTPUT_DATA="/hadoop/teragen-output" # Output directory where generated data is stored
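With the sample values above substituted, the Teragen invocation would look like:
hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teragen \
  -Ddfs.block.size=536870912 -Dmapred.map.tasks=157 \
  2500000000 /hadoop/teragen-output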
3. Execute Terasort
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort /hadoop/teragen-output /hadoop/terasort-output
Note : Terasort also accepts command-line tuning parameters, for example:
hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort
-Dmapred.reduce.tasks=$NUM_REDUCE
-Ddfs.block.size=$BLOCK_SIZE
-Dmapred.child.java.opts=-Xmx1200m
-Dmapred.map.child.java.opts=-Xmx1200m
-Dmapred.reduce.child.java.opts=$MEM_REDUCE
-Dio.sort.mb=$MEM_IOSORT
-Dmapred.reduce.slowstart.completed.maps=0.0
-Dio.sort.spill.percent=1.0
-Dio.sort.record.percent=0.17
$INPUT_DATA $OUTPUT_DATA
NUM_REDUCE=288 # Number of reduce tasks to execute
BLOCK_SIZE=536870912 #Block size
MEM_REDUCE=-Xmx1200m # MapReduce child java opts
MEM_IOSORT=1024 # IO sort memory
INPUT_DATA="/user/biadmin/in-dir" #Input directory where data is present
OUTPUT_DATA="/user/biadmin/out-dir" # Output directory where sorted data is stored
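Similarly, with the sample values above substituted, the Terasort invocation would look like:
hadoop jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar terasort \
  -Dmapred.reduce.tasks=288 -Ddfs.block.size=536870912 \
  -Dmapred.child.java.opts=-Xmx1200m -Dmapred.map.child.java.opts=-Xmx1200m \
  -Dmapred.reduce.child.java.opts=-Xmx1200m -Dio.sort.mb=1024 \
  -Dmapred.reduce.slowstart.completed.maps=0.0 -Dio.sort.spill.percent=1.0 \
  -Dio.sort.record.percent=0.17 /user/biadmin/in-dir /user/biadmin/out-dir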
4. Execute Teravalidate
yarn jar $HADOOP_MR_DIR/hadoop-mapreduce-examples.jar teravalidate /hadoop/terasort-output /hadoop/teravalidate-output
5. Tuning Teragen/Terasort tips
-> Set dfs.block.size the same for Teragen and Terasort.
-> Set mapred.map.tasks to at least the number of hardware threads in the entire cluster.
For example, if you have 9 cluster nodes, use 18, 27, 36, etc.
-> You can compute the number of Terasort map tasks as follows (a worked example follows this list):
Terasort map tasks = sum of blocks (of size dfs.block.size) across the files generated by Teragen
Terasort map tasks = (number of files) * (blocks per file)
-> To minimize Terasort map tasks, avoid files whose last block (based on dfs.block.size) is small.
Instead, you want the last block to be large; the closer to dfs.block.size, the better.
-> To make Teragen fast, you can set dfs.replication=1.
-> But if you plan to run Terasort on the generated data, then use dfs.replication=3.
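As a worked example using the Teragen settings from step 2 (Teragen writes 100-byte rows, so 2,500,000,000 rows is about 250,000,000,000 bytes spread across 157 output files):
bytes per file ≈ 250,000,000,000 / 157 ≈ 1,592,356,688
blocks per file = ceil(1,592,356,688 / 536,870,912) = 3 (the last block is about 97% full, which is what you want)
Terasort map tasks ≈ 157 files x 3 blocks per file = 471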
K) Monitoring
-> Job Tracker : http://hostname:50130/jobtracker.jsp
-> DFS Health : http://hostname:50170/dfshealth.jsp