Building Native Hadoop (v 2.5.1) libraries for OS X

In one of my earlier blogs I explained how to build the native Hadoop (2.4.1) libraries on OS X. In the meantime Hadoop 2.5.1 has been released, so I was curious whether the source code had been patched so that building the libraries on OS X works out of the box. To my surprise, it still doesn't.

So in this blog I won't go into much detail; for that you can check the earlier post.

Issues faced on Building Native libraries On Mac OS X

1. Problem in hadoop-hdfs maven module

error:

 [exec] /Users/gaurav/GitHub/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/native/libhdfs/test/vecsum.c:61:23: error: use of undeclared identifier 'CLOCK_MONOTONIC'
 [exec]     if (clock_gettime(CLOCK_MONOTONIC, &watch->start)) {
 [exec]                       ^
 [exec] /Users/gaurav/GitHub/hadoop/hadoop-hdfs-project/hadoop-hdfs/src/main/native/libhdfs/test/vecsum.c:79:23: error: use of undeclared identifier 'CLOCK_MONOTONIC'
 [exec]     if (clock_gettime(CLOCK_MONOTONIC, &watch->stop)) {
 [exec]                       ^ 

Solution: Download the patch HDFS-6534.v2.patch attached to Jira issue HDFS-6534, apply it, and rebuild (a dry-run check is sketched after the commands):

  • git apply HDFS-6534.v2.patch
  • mvn package -Pdist,native -DskipTests -Dtar
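
If you want to make sure the patch applies cleanly before touching your working tree, git can do a dry run first (a small optional check, assuming the patch file sits in the root of your Hadoop checkout):

git apply --check HDFS-6534.v2.patch   # no output means the patch applies cleanly
git apply --stat HDFS-6534.v2.patch    # lists the files the patch will touch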

2. Problems in hadoop-yarn-server-nodemanager maven module

error:

 [exec] /Users/gaurav/GitHub/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c:501:33: error: use of undeclared identifier 'LOGIN_NAME_MAX'
 [exec]       if (strncmp(*users, user, LOGIN_NAME_MAX) == 0) {
 [exec]                                 ^   
 [exec] /Users/gaurav/GitHub/hadoop/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-nodemanager/src/main/native/container-executor/impl/container-executor.c:1266:48: error: too many arguments to function call, expected 4, have 5
 [exec]     if (mount("none", mount_path, "cgroup", 0, controller) == 0) {
 [exec]         ~~~~~                                  ^~~~~~~~~~
 [exec] /usr/include/sys/mount.h:384:1: note: 'mount' declared here
 [exec] int     mount(const char *, const char *, int, void *);
 [exec] ^

Solution: Download the patch YARN-2161.v1.patch attached to Jira issue YARN-2161, apply it, and rebuild:

  • git apply YARN-2161.v1.patch
  • mvn package -Pdist,native -DskipTests -Dtar

Result

The hadoop-dist/target/hadoop-2.5.1/lib/native folder should now contain the native libraries. Copy them to the hadoop-2.5.1/lib/native folder of your installation and restart the Hadoop cluster.
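
To verify that Hadoop actually loads the new libraries, you can ask it to report what it finds; this is a quick check using the checknative tool that ships with Hadoop 2.x:

bin/hadoop checknative -a
# prints one line per library (hadoop, zlib, snappy, lz4, ...) with true/false and the resolved path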

Building Native Hadoop (v 2.4.1) libraries for OS X

If you are reading this blog, I assume that you already have Hadoop (v2.4.1) installed on your OS X machine and that you are a bit annoyed by the following error message:

WARN util.NativeCodeLoader: Unable to load native-hadoop library for your platform… using builtin-java classes where applicable

If you are only planning to use Hadoop on OS X for development, this warning should not bother you. That was my case too, but the message annoyed me and I wanted to try building the native libraries from the source code.

Steps to build Native Hadoop libraries

  1. Download source from GitHub
  • git clone git@github.com:apache/hadoop.git
  • git checkout branch-2.4.1
  2. Dependencies
    Install cmake and zlib using the Homebrew package manager
  • brew install cmake
  • brew install zlib
  3. Run maven command
  • mvn package -Pdist,native -DskipTests -Dtar

On Linux machines the above procedure should be enough, but not on Mac OS X with Java 1.7. There you have to make a few more changes.

Issues faced on Building Native libraries On Mac OS X

1. Missing tools.jar

If you are building with Java 1.7, you will see an error about a missing tools.jar, which is caused by a bug in the Maven JSPC plugin; the related Jira issue is HADOOP-9350. The JSPC plugin expects classes.jar in the ../Classes folder, so we create a symlink.

error:

Exception in thread “main” java.lang.AssertionError: Missing tools.jar at: /Library/Java/JavaVirtualMachines/jdk1.7.0_17.jdk/Contents
/Home/Classes/classes.jar. Expression: file.exists()

Solution: Create a symbolic link to trick Java into believing that classes.jar is the same as tools.jar (a quick verification follows the commands):

  • sudo mkdir "$(/usr/libexec/java_home)/Classes"
  • sudo ln -s "$(/usr/libexec/java_home)/lib/tools.jar" "$(/usr/libexec/java_home)/Classes/classes.jar"
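
A quick way to double-check the result (note that /usr/libexec/java_home is an executable that prints the path of the active JDK):

/usr/libexec/java_home
ls -l "$(/usr/libexec/java_home)/Classes/classes.jar"   # should be a symlink pointing at lib/tools.jar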

2. Incompatible source code

Some code in Hadoop v2.4.1 does not compile on Mac OS X, so we need to apply the patch HADOOP-9648.v2.patch; the related Jira issue is HADOOP-10699.

error:

     [exec] /Users/gaurav/GitHub/hadoop/hadoop-common-project/hadoop-common/src/main/native/src/org/apache/hadoop/security/JniBasedUnixGroupsNetgroupMapping.c:77:26: error: invalid operands to binary expression ('void' and 'int')
     [exec]   if(setnetgrent(cgroup) == 1) {
     [exec]      ~~~~~~~~~~~~~~~~~~~ ^  ~
     [exec] 1 error generated.

Solution: Download the patch HADOOP-9648.v2.patch attached to Jira issue HADOOP-10699, apply it, and rebuild:

  • git apply HADOOP-9648.v2.patch
  • mvn package -Pdist,native -DskipTests -Dtar

Result

The hadoop-dist/target/hadoop-2.4.1/lib/native folder should now contain the native libraries. Copy them to the hadoop-2.4.1/lib/native folder of your installation and restart the Hadoop cluster.

References

  1. Native Libraries Guide documentation page.
  2. Hadoop Git repo
  3. HADOOP-10699 V2 Patch
  4. Details about Maven JSPC Issue

Apache Oozie Installation on Hadoop 2.4.1

Today I would like to explain how I managed to compile and install Apache Oozie 4.0.1 against the latest stable Hadoop version, 2.4.1.

Prerequisites:

  • Hadoop 2.4.1 : installation explained in another post
  • Maven
  • Java 1.6+
  • Unix/Mac machine

Download Oozie

wget http://apache.hippo.nl/oozie/4.0.1/oozie-4.0.1.tar.gz
tar xzvf oozie-4.0.1.tar.gz
cd oozie-4.0.1

Building against Hadoop 2.4.1

By default Oozie builds against Hadoop 1.1.1, so to build against Hadoop 2.4.1 we have to adjust the maven dependencies in pom.xml.

Change hadoop-2 maven profile

In the downloaded Oozie source code (pom.xml), the hadoop-2 maven profile sets hadoop.version and hadoop.auth.version to 2.3.0, so we change both to 2.4.1 (a way to verify the effective values is sketched after the snippet):

        <profile>
            <id>hadoop-2</id>
            <activation>
                <activeByDefault>false</activeByDefault>
            </activation>
            <properties>
                <hadoop.version>2.4.1</hadoop.version>
                <hadoop.auth.version>2.4.1</hadoop.auth.version>
                <pig.classifier>h2</pig.classifier>
                <sqoop.classifier>hadoop200</sqoop.classifier>
            </properties>
        </profile>
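
If you want to confirm which values maven will actually use, the standard maven help plugin can print the active profiles and the resolved property (just a sanity check, not required for the build):

mvn help:active-profiles -P hadoop-2
mvn help:evaluate -Dexpression=hadoop.version -P hadoop-2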

Change Hadooplibs maven module

The next step is to configure the hadooplibs maven module to build the libraries for version 2.4.1, so we change the pom.xml of the hadoop-2, hadoop-distcp-2 and hadoop-test-2 maven modules within the hadooplibs module; a quick check of the edits is shown after the list.

cd hadooplibs
File hadoop-2/pom.xml : change hadoop-client & hadoop-auth dependency version to 2.4.1
File hadoop-distcp-2/pom.xml: change hadoop-distcp version to 2.4.1
File hadoop-test-2/pom.xml: change hadoop-minicluster version to 2.4.1
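
After the edits, a quick grep from the hadooplibs directory confirms that each of the three pom files now references the new version:

grep -n "2.4.1" hadoop-2/pom.xml hadoop-distcp-2/pom.xml hadoop-test-2/pom.xml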

Build Oozie distro

Use Maven profile hadoop-2 to compile Oozie 4.0.1 against Hadoop 2.4.1

cd ..
bin/mkdistro.sh -P hadoop-2 -DskipTests 
or 
mvn clean package assembly:single -P hadoop-2 -DskipTests 

Setup Oozie server

Copy the Oozie distro to a new directory

cd ..
mkdir Oozie
cp -R oozie-4.0.1/distro/target/oozie-4.0.1-distro/oozie-4.0.1/ Oozie
cd Oozie
mkdir libext
cp -R ../oozie-4.0.1/hadooplibs/hadoop-2/target/hadooplibs/hadooplib-2.4.1.oozie-4.0.1/* libext
wget -P libext http://extjs.com/deploy/ext-2.2.zip

Prepare the Oozie war

./bin/oozie-setup.sh prepare-war

Create Sharelib Directory on HDFS

The following command internally issues an HDFS create-directory request to the NameNode running at hdfs://localhost:9000 and then copies the shared libraries to that directory.

./bin/oozie-setup.sh sharelib create -fs hdfs://localhost:9000 

* Make sure you use the right port number, otherwise you might get an error like Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag. That is the case when Oozie tries to talk to some other web service instead of the HDFS NameNode.
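
The port has to match whatever fs.defaultFS (fs.default.name on older setups) is set to in your Hadoop configuration, so an easy way to double-check it (assuming HADOOP_HOME points at your Hadoop installation):

grep -A1 "fs.defaultFS" $HADOOP_HOME/etc/hadoop/core-site.xml
# e.g. <value>hdfs://localhost:9000</value>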

Oozie Database

./bin/ooziedb.sh create -sqlfile oozie.sql -run

Configure Hadoop

Configure the Hadoop cluster with a proxyuser for the Oozie process. The following two properties are required in Hadoop's etc/hadoop/core-site.xml. If you are using a Hadoop version higher than 1.1.0, you can use wildcards as the property values. Replace "gaurav" with the username you will be running Oozie as; a way to reload these settings without a full restart is sketched after the snippet.

<property>
  <name>hadoop.proxyuser.gaurav.hosts</name>
  <value>*</value>
</property>
<property>
  <name>hadoop.proxyuser.gaurav.groups</name>
  <value>*</value>
</property>
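
After editing core-site.xml the running daemons need to pick up the new proxyuser settings. Restarting the cluster works; on Hadoop 2.x the admin refresh commands below should achieve the same without a restart (run from the Hadoop directory; a sketch, not something this setup strictly requires):

bin/hdfs dfsadmin -refreshSuperUserGroupsConfiguration
bin/yarn rmadmin -refreshSuperUserGroupsConfiguration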

Start Oozie

 ./bin/oozied.sh start

Oozie should now be accessible at http://localhost:11000/oozie
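
You can also check the server from the command line; the Oozie CLI has an admin status command that should report NORMAL once the server is up:

bin/oozie admin -oozie http://localhost:11000/oozie -status
# System mode: NORMAL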

Submit a Test Workflow

Now we will try to submit one of the example workflows shipped with Oozie: map-reduce. First we copy the Oozie examples directory to our home directory on HDFS, and then we submit the Oozie job.

From Hadoop Directory: bin/hdfs dfs -put path-to-oozie-directory/examples examples 
From Oozie Directory: bin/oozie job -oozie http://localhost:11000/oozie/ -config examples/apps/map-reduce/job.properties  -run

You might need to change job.properties before you submit the workflow, so that it uses the correct NameNode and JobTracker ports. If you are running YARN (MapReduce 2), the jobTracker property refers to the ResourceManager port. A way to check the job status from the CLI is sketched after the properties.

nameNode=hdfs://localhost:9000
jobTracker=localhost:8032
queueName=default
examplesRoot=examples

oozie.wf.application.path=${nameNode}/user/${user.name}/${examplesRoot}/apps/map-reduce
outputDir=map-reduce
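
The submit command prints a workflow job id; besides the web console, you can follow the job from the Oozie CLI (the job id below is just a placeholder):

bin/oozie job -oozie http://localhost:11000/oozie -info <job-id>              # details and status of one workflow
bin/oozie jobs -oozie http://localhost:11000/oozie -filter status=RUNNING     # list all running workflows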

Status and output of Workflow

Map-Reduce submitted in Oozie: http://localhost:11000/oozie/ (screenshot: workflow in RUNNING state)

Status of the Map-Reduce job in the Hadoop cluster (screenshot)

Map-Reduce finished status in Oozie (screenshot: workflow SUCCEEDED)

That’s It

So we have successfully configured Oozie 4.0.1 with Hadoop 2.4.1 and were able to submit a job. In the next post we will talk about other aspects of Oozie, such as sub-workflows and how to link workflows or make them depend on each other.

Possible Issues

Java heap space or PermGen space

While running the maven command to compile, you might face either a PermGen space or an OutOfMemory (Java heap space) error. In that case you need to increase the memory allocated to the maven process:

export MAVEN_OPTS="-Xmx1024m -XX:MaxPermSize=128m"

Hadoop History server

The Oozie server needs to talk to the Hadoop history server to learn the previous state of jobs, so we need to keep the history server running while using Oozie; the related error only shows up when you try to run a workflow. A quick check that the history server is up follows the command.

sbin/mr-jobhistory-daemon.sh start historyserver
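
With the default Hadoop 2.x settings the JobHistory server web UI listens on port 19888 (mapreduce.jobhistory.webapp.address), so a quick check that it is really up:

curl -s -o /dev/null -w "%{http_code}\n" http://localhost:19888/jobhistory
# 200 means the history server is answering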

Error related to impersonation

RemoteException: User: oozie is not allowed to impersonate oozie. This is caused by failing to configure proper hadoop.proxyuser.oozie.hosts and hadoop.proxyuser.oozie.groups properties in Hadoop; also make sure you use wildcards only if your Hadoop version is 1.1.0 or higher.

InvalidProtocolBufferException

Failed on local exception: com.google.protobuf.InvalidProtocolBufferException: Protocol message end-group tag did not match expected tag.;

This happens when you have compiled Oozie with a Protobuf library that is incompatible with the one used by Hadoop. In my case I compiled Oozie 4.0.1 with Protobuf 2.5.0 to work with Hadoop 2.4.1.
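
To check which Protobuf version your Oozie setup actually ships, you can look at the jars copied into libext (the hadooplibs built earlier should include protobuf-java) or ask maven for the resolved dependency from the Oozie source tree; both are just sanity checks:

ls libext | grep protobuf                                                      # e.g. protobuf-java-2.5.0.jar
mvn dependency:tree -Dincludes=com.google.protobuf:protobuf-java -P hadoop-2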

Jira setup with GlusterFS and Heartbeat for High Availability Part-II

In the previous post I described the setup for making a highly available cluster for Jira. In this post I will continue with that and explain how to configure GlusterFS and Heartbeat to make the setup happen.

How it works

So we start by installing GlusterFS. GlusterFS is a clustered file system that can be used to scale storage capacity with distributed volumes, or simply to provide high availability of storage using its replicated volumes feature. With replicated volumes, multiple servers maintain a replicated view of the main storage, so if one of them goes down, the other storage server/node can be used.

For our use case we use a replicated volume to store the JIRA_HOME directory and the Tomcat application, which means that all the data is always present on more than one server at any given time.

Before we can create replicated volumes, the GlusterFS server, and the client that Jira will use to access GlusterFS, have to be set up on both servers. In this setup we use two servers with the same configuration (memory, CPU), so that either of them can be used interchangeably as the master in case the original master crashes.
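
To give an idea of where this is heading, here is a rough sketch of what creating and mounting a replicated volume looks like once GlusterFS is installed on both machines (the hostnames jira1/jira2 and the brick and mount paths are placeholders):

# on jira1: add the second server to the trusted pool
gluster peer probe jira2

# create a two-way replicated volume with one brick per server, then start it
gluster volume create jira_vol replica 2 jira1:/export/jira-brick jira2:/export/jira-brick
gluster volume start jira_vol

# mount the volume through the Gluster native client where JIRA_HOME will live
mount -t glusterfs jira1:/jira_vol /opt/jira_home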

GlusterFS setup

  1. Install GlusterFS on both the servers. Download latest debian package from http://www.gluster.org/download/ and run following commands:
    sudo apt-get install openssh-server wget nfs-common
    sudo dpkg -i glusterfs-3.2.0.deb
  2. Install the Gluster Native Client, because it provides higher concurrency, better performance and transparent failover compared to NFS or CIFS clients. For this we need the fuse-utils package installed
    sudo apt-get install fuse-utils

JIRA setup with GlusterFS and Heartbeat for High Availability Part-I

JIRA is a great tool for Scrum/Kanban, and it can actually be used as a tracking tool for any set of activities with a defined workflow. Over the last 2-3 years Atlassian has introduced a lot of new features, the latest being the Rapid Board, the Workflow Designer and many more that are worth exploring for anyone looking for a ticketing/bug-tracking/project-management tool. In this blog I will only talk about one specific problem related to the deployment and high availability of JIRA as a service, and how I solved it.

Problem

For one of my clients I migrated their old JIRA 3.13 installation to JIRA 4.3.2, and during that period I investigated options for clustering JIRA to provide high availability and scalability. One really annoying thing about JIRA is that it doesn't provide any clustering options, either at the application level or at the operating-system level. So you can't have a setup in which JIRA is installed on two servers with a load balancer forwarding requests based on workload.

So the only option for scalability is to scale vertically, but for high availability I tried some tricks with GlusterFS and Heartbeat, which now allow me to have a passive node on which JIRA is started automatically when the active JIRA node goes down.

Solution

The trick was to use GlusterFS as a file system to provide data replication between two servers, so that all the attachments and other application data are always replicated in real time on two physical servers. This approach is better than taking daily or hourly backups and starting a new server with that data when the active server goes down, because that would mean a downtime of at least a few hours and might also involve some data loss.

Working with Joda-time and creating Extensions in Play

I started using the new web framework Play a few weeks back and came across the need to decide whether to use java.util.Date or the Joda-Time API. I started with java.util.Date, as it was a simple try-out of writing a small CRUD application in Play, but eventually moved to Joda-Time when more functionality was needed from the date fields (I should have thought of that from the start).

But moving from java.util.Date to Joda-Time had some side effects, and I thought there should be easy solutions for those. One of the problems was that Play has these cool Java extensions, format(pattern) and format(pattern, language), which you can use directly on Date objects in the view templates.

${new Date(1275910970000).format('dd MMMM yyyy hh:mm:ss')}
07 June 2010 01:42:50

When you use Joda-Time you don't have these extensions, but the good part about Play is that it is equally easy to create new ones. Just create a new class that extends play.templates.JavaExtensions, define a method, and you are good to go.

So let's start with using the Joda-Time API in your models and then create some Java extensions for Joda-Time.

Continue reading