Spectrum Scale ( GPFS ) Setup and Hadoop Integration
Spectrum Scale is the new brand name for GPFS ( General Parallel File System ).
Spectrum Scale Installables
GPFS_4.1_ADV_LSP.tar.gz => Base GPFS
GPFS-4.1.0.4-power-Linux.advanced.tar.gz => Service Pack
Test Environment
RHEL 6.5 BE ( LE was not available at the time of writing this blog )
POWER8 - S822L ( 24 cores , 256 GB RAM )
Benefits of GPFS-FPO
- POSIX-compliant file system
- Used in IBM Blue Gene , Watson , and the Argonne National Laboratory MIRA system
- Used in biomedical research and financial analytics
- High I/O performance
- GPFS-FPO is an implementation of a shared-nothing architecture
- GPFS supports block sizes ranging from 16KB to 16MB and defaults to 256KB
Setup Configuration
- GPFS Installables Location : /usr/lpp/4.1
- GPFS Installation Sample Configuration
- Server-1 : bigdatagpfs01
- Server-2 : bigdatagpfs02
- Server-3 : bigdatagpfs03
Installation prerequisite setup ( all nodes ; a combined sketch of these steps follows the list )
a) Configure "/etc/hosts"
b) Set up passwordless ssh for the root user
c) Disable SELinux
d) Install the prerequisite packages - ksh , expect , imake , gcc* , kernel-devel
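A minimal sketch of these prerequisite steps on RHEL 6.5 ( the IP addresses are illustrative ; adjust them to your network ):
# /etc/hosts entries on every node
10.0.0.1 bigdatagpfs01
10.0.0.2 bigdatagpfs02
10.0.0.3 bigdatagpfs03
# passwordless ssh for root ( generate once per node , then copy the key to every node )
ssh-keygen -t rsa -N "" -f /root/.ssh/id_rsa
ssh-copy-id root@bigdatagpfs01
ssh-copy-id root@bigdatagpfs02
ssh-copy-id root@bigdatagpfs03
# disable SELinux ( takes effect after a reboot )
sed -i 's/^SELINUX=.*/SELINUX=disabled/' /etc/selinux/config
# install the prerequisite packages
yum install -y ksh expect imake gcc gcc-c++ kernel-devel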
Installation Setup
a) Extract the installer from "GPFS_4.1_ADV_LSP.tar.gz" ( all nodes )
tar -xvf GPFS_4.1_ADV_LSP.tar.gz
b) Extract the GPFS RPMs from the installer ( all nodes )
- navigate to the directory extracted in step a and run the installer
- ./gpfs_install-4.1.0-0_ppc64
c) The RPMs are extracted to the default directory /usr/lpp/mmfs/4.1 . Install all the RPMs from that directory ( all nodes )
- cd /usr/lpp/mmfs/4.1
- rpm -ivh *.rpm
d) Update the GPFS install with the service pack ( all nodes )
- tar -xvf GPFS-4.1.0.4-power-Linux.advanced.tar.gz
- rpm -Uvh *.rpm
e) validate all the rpms installed ( all nodes )
- rpm -qa | grep gpfs
gpfs.gnr-4.1.0-0.ppc64
gpfs.gpl-4.1.0-4.noarch
gpfs.base-4.1.0-4.ppc64
gpfs.crypto-4.1.0-4.ppc64
gpfs.msg.en_US-4.1.0-4.noarch
gpfs.docs-4.1.0-4.noarch
gpfs.ext-4.1.0-4.ppc64
gpfs.gskit-8.0.50-32.ppc64
f) Build the GPFS portability layer ( all nodes )
go to directory "/usr/lpp/mmfs/src/"
make Autoconfig
make World
make InstallImages
Add the environment variables below to .bashrc
export SHARKCLONEROOT=/usr/lpp/mmfs/src
export PATH=$PATH:/usr/lpp/mmfs/bin/
g) Create the cluster by using the mmcrcluster command ( run from one node ; I have chosen bigdatagpfs01 )
1) Create the cluster
# mmcrcluster -N bigdatagpfs01:quorum-manager,bigdatagpfs02:quorum-manager,bigdatagpfs03:quorum -p bigdatagpfs01 -s bigdatagpfs02 -r /usr/bin/ssh -R /usr/bin/scp
2) Accept the license
# mmchlicense server --accept -N bigdatagpfs01,bigdatagpfs02,bigdatagpfs03
3) Start the cluster
# mmstartup -a
4) Check the status
# mmgetstate -a
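As an additional sanity check , mmlscluster prints the cluster name , the repository servers and the role of each node :
# mmlscluster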
h) Create NSDs :
Create a stanza file , which is the configuration file required to define the NSDs .
Format of Stanza File :
%nsd: device=DiskName nsd=NsdName servers=ServerList usage={dataOnly | metadataOnly | dataAndMetadata | descOnly} failureGroup=FailureGroup pool=StoragePool
Sample Stanza File : diskfile.txt
%nsd: nsd=gpfs1nsd device=/dev/vdc servers=bigdatagpfs01 usage=dataAndMetadata
%nsd: nsd=gpfs2nsd device=/dev/vdc servers=bigdatagpfs02 usage=dataAndMetadata
%nsd: nsd=gpfs3nsd device=/dev/vdc servers=bigdatagpfs03 usage=dataAndMetadata
Command to create the nsd :
mmcrnsd -F diskfile.txt
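The newly created NSDs can be listed with mmlsnsd :
mmlsnsd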
i) Create the file system using the NSDs just created .
mmcrfs gpfs1 -F diskfile.txt -A yes -T /gpfs
j) Mount the GPFS file system on all the nodes
mmmount gpfs1 -a
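If the mount succeeds , the file system appears under /gpfs on every node ; mmlsmount and df can confirm it :
mmlsmount gpfs1 -L
df -h /gpfs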
Integrating GPFS-FPO with Hadoop
The GPFS FPO Hadoop connector allows Hadoop to access data on a GPFS FPO file system .
( Run the following as the hadoop user )
a) Configuration
export HADOOP_HOME=/hadoop
cd /usr/lpp/mmfs/fpo/hadoop-2.4
ln -s /usr/lpp/mmfs/fpo/hadoop-2.4/hadoop-gpfs-2.4.jar $HADOOP_HOME/share/hadoop/common
ln -fs /usr/lpp/mmfs/fpo/hadoop-2.4/libgpfshadoop.64.so $HADOOP_HOME/lib/native/libgpfshadoop.so
ln -fs /usr/lpp/mmfs/lib/libgpfs.so $HADOOP_HOME/lib/native/libgpfs.so
ln -fs /usr/lpp/mmfs/fpo/hadoop-2.4/libgpfshadoop.64.so $HADOOP_HOME/lib/native/libgpfshadoop.so
ln -fs /usr/lpp/mmfs/lib/libgpfs.so $HADOOP_HOME/lib/libgpfs.so
ln -fs /usr/lpp/mmfs/fpo/hadoop-2.4/libgpfshadoop.64.so $HADOOP_HOME/lib/libgpfshadoop.so
cp /usr/lpp/mmfs/fpo/hadoop-2.4/hadoop-gpfs-2.4.jar $HADOOP_HOME/lib
cp gpfs-connector-daemon /var/mmfs/etc/
cp install_script/gpfs-callback_start_connector_daemon.sh /var/mmfs/etc/
cp install_script/gpfs-callback_stop_connector_daemon.sh /var/mmfs/etc/
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib
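To keep these settings across shells , the same exports can also be placed in hadoop-env.sh ( the path below assumes a stock Hadoop 2.4 layout ):
# $HADOOP_HOME/etc/hadoop/hadoop-env.sh
export HADOOP_HOME=/hadoop
export JAVA_LIBRARY_PATH=$HADOOP_HOME/lib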
b) Configure core-site.xml with the entries below
<configuration>
<property>
<name>hadoop.tmp.dir</name>
<value>[path_on_local_os]</value>
<!-- To use GPFS instead of a local OS:
[hostname] must be set to the unique hostname for each node.
This is because the namespace of a GPFS directory is globally
visible to all the nodes, not just the local node. This results
in each node having a unique configuration file. You must
use GPFS policies to set data replication to 1 for these local
directories. <value>/mnt/gpfs/tmp/[hostname]</value>
-->
</property>
<property>
<name>fs.default.name</name>
<value>gpfs:///</value>
</property>
<property>
<name>fs.gpfs.impl</name>
<value>org.apache.hadoop.fs.gpfs.GeneralParallelFileSystem</value>
</property>
<property>
<name>fs.AbstractFileSystem.gpfs.impl</name>
<value>org.apache.hadoop.fs.gpfs.GeneralParallelFs</value>
</property>
<property>
<name>gpfs.mount.dir</name>
<value>/mnt/gpfs</value>
</property>
<property>
<!-- Required when not running jobs as the root user ; configure the groups that will be privileged on the file system . Multiple groups are separated by commas -->
<name>gpfs.supergroup</name>
<value></value>
</property>
<property>
<!-- Optional. The default dfs.blocksize in Hadoop is 128MB. It can be overridden if another block size is adopted. -->
<name>dfs.blocksize</name>
<value>268435456</value>
</property>
</configuration>
c) Configure mapred-site.xml
<configuration>
<property>
<name>mapreduce.cluster.local.dir</name>
<value>[path_on_local_os]</value>
<!-- To use GPFS instead of a local OS:
[hostname] must be set to the unique hostname for each node.
This is because the namespace of a GPFS directory is globally
visible to all the nodes, not just the local node. This results
in each node having a unique configuration file. You must
use GPFS policies to set data replication to 1 for these local
directories. <value>/mnt/gpfs/mapred/local/[hostname]</value>
-->
</property>
</configuration>
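d) Validate the integration
Once the YARN processes and the job history server are running ( see the additional notes at the end ) , a simple word count run confirms that Hadoop reads and writes through the GPFS connector . The paths below are illustrative , and the examples jar name assumes a stock Apache Hadoop 2.4.0 build :
hadoop fs -mkdir -p /user/hadoop/wc-in
hadoop fs -put /etc/hosts /user/hadoop/wc-in
hadoop jar $HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-2.4.0.jar wordcount /user/hadoop/wc-in /user/hadoop/wc-out
hadoop fs -cat /user/hadoop/wc-out/part-r-00000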
** Some handy GPFS Commands **
1) Get the status of all GPFS nodes
#mmgetstate -a
2) List all the nodes that have GPFS file system mounted
#mmlsmount all -L
3) Mount all the GPFS file systems on all the nodes
#mmmount all -a
4) Start the GPFS cluster
#mmstartup -a
5) Shut down the GPFS cluster
#mmshutdown -a
** Some Additional Notes **
- With the GPFS Hadoop connector , only the YARN processes need to be started ; the HDFS processes ( NameNode / DataNode ) are not required
- The MapReduce job history server should be started as well ( see the commands below )
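On a stock Hadoop 2.4 layout , starting these typically amounts to ( script locations may differ in your distribution ):
$HADOOP_HOME/sbin/start-yarn.sh
$HADOOP_HOME/sbin/mr-jobhistory-daemon.sh start historyserver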