Setting Up a Small-Budget Hadoop Cluster for Big Data Analysis

At the 2014 Esri User Conference, the Big Data team gave several presentations, including two technical workshops: ‘Big Data and Analytics: The Fundamentals’ and ‘Big Data and Analytics with ArcGIS’. We presented our open source GIS Tools for Hadoop (shared on GitHub), as well as some research that we’re currently pursuing (exciting things to come!). We gave demos using both our open source tools and the prototype tools currently under research.

For the demos, we ran all of our analytics on a Hadoop cluster back in Redlands; the source data consisted of more than 170 million points representing every taxi cab trip in New York City in 2013. A twenty-node cluster may seem like a big investment (and it can be), but it doesn’t have to be. Enter the DREDD cluster…
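To give a sense of what those analytics look like in practice, here is a minimal MapReduce sketch in Java. It is not our actual demo code: the TripsPerCell class, the CSV column positions, and the grid size are all hypothetical. It simply counts taxi pickups per 0.01-degree grid cell:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TripsPerCell {

  public static class CellMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text cell = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = line.toString().split(",");
      try {
        // Hypothetical column positions for pickup longitude/latitude
        double lon = Double.parseDouble(cols[10]);
        double lat = Double.parseDouble(cols[11]);
        // Snap each pickup to a 0.01-degree grid cell (roughly 1 km)
        cell.set(String.format("%.2f,%.2f",
            Math.floor(lon * 100) / 100, Math.floor(lat * 100) / 100));
        ctx.write(cell, ONE);
      } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
        // Skip header rows and malformed records
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "taxi trips per grid cell");
    job.setJarByClass(TripsPerCell.class);
    job.setMapperClass(CellMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In the real demos, the spatial operations came from the geometry operators in GIS Tools for Hadoop rather than the naive grid snap above.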

A few months ago we created the DREDD cluster at Esri for R&D use, built from old computers that were destined for the dump (yes, it’s named after the one-and-only Judge Dredd). Hadoop is often described as being able to run on clusters of commodity hardware, and what better test is there than a set of 5-7 year old desktops? The DREDD cluster was set up with twenty of these free computers (read: FREE), which Hadoop calls nodes. We did, however, update each with a new fast network card, more RAM, and a large, fast hard drive. This is all fairly inexpensive, especially when compared with new computers: buying twenty new machines would have cost around $75,000, and we got the DREDD cluster up and running for much less than 1/10th of that!

With old computers you can expect occasional failures, but luckily for us (and you), Hadoop is built to be fault tolerant: your data is replicated across computers to protect against failures, so old hardware isn’t a problem for the Hadoop infrastructure.
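As an illustration, here is a minimal sketch (not part of our cluster setup; the file path is hypothetical) showing how Hadoop’s Java FileSystem API reports and adjusts how many copies of a file’s blocks are kept:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; HDFS spreads copies of its blocks across nodes
    Path file = new Path("/data/taxi/trips.csv");
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication factor: " + current);

    // Raise the replication factor for an especially important file
    fs.setReplication(file, (short) 3);
  }
}
```

By default HDFS keeps 3 copies of each block, so a single failed node never takes your data with it.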

From there we were able to set up our Hadoop cluster using documentation readily available online.

As always, we would love to hear what you are doing with big data, what functionality you want to see in the future, and any questions or comments you have about setting up your own cluster using commodity hardware.

Technical specs of the DREDD cluster (per node):
OS: Linux (CentOS 6.5 distribution)
Hard Drive: 1 TB
RAM: 16 GB
CPU: Intel Xeon ~3.07 GHz quad-core
System Deployment: Clonezilla
Hadoop Management: Ambari

Thanks to Sarah Ambrose from the Big Data portion of the geodatabase team for this post.


Comments


  1. sambrose88_1 says:

    Thanks go out to Randall Whitman, Mike Park and Erik Hoel for contributing to the blog post.

  2. chandrabigdata says:

    Hi,

    I would like to set up a small cluster in my lab to process 2 to 3 TB of data across multiple nodes (say, for example, 5 systems).
    Can you please suggest how I can go ahead with this? I am looking for the technical specifications for the 5 nodes, and advice on which rack would be good.

    Thanks in advance


    • sambrose88_1 says:

      Hi Chandra,

      This is the first cluster we’ve set up; would something like this work for you?

      We’ll keep you posted as we try other Big Data methods in the future.

  3. bdschool says:

    JM – thank you for this article. Going back to school (both formal and self-taught). Building some BD boxes to work with is challenging for me (zero budget and no old boxes).

    I found another article, but it was for a 20 node setup – ouch $)

    From what I have read so far; here is what I am considering:

    - start with two nodes (config for each)
    — AMD Quad
    — 8 GB RAM (motherboard upgradable to 16 GB later)

    —- get my first Server Rack [no future room in house for 10 Nodes - unless I go "up"]

    ….. everyone please feel free to comment
    —– ? better $ path choice
    —– ? motherboards suggested (thinking ASUS with SATA III)
    —– ? CPU suggested
    —– ? more cost effective

    thank you for any and all comments

    - I’m still learning what questions to ask :-)

    • sambrose88_1 says:

      Hi bdschool,

      Since the team has only set up the DREDD cluster, and was limited by what we had, we didn’t have to make those choices. You might get some great feedback from the Hadoop Users mailing list. Another interesting point is that with only 2 nodes Hadoop can’t fully satisfy its default replication factor of 3, so you’ll probably want to start out with at least 3 nodes. You may also want to consider a cloud-based solution, which will cost much less upfront.
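      For example, lowering the default replication is one way around that on a very small cluster. This is a minimal sketch, not from our setup, and the value of 2 is just an illustration:

      ```java
      import org.apache.hadoop.conf.Configuration;

      public class SmallClusterConf {
        public static void main(String[] args) {
          // Equivalent to setting in hdfs-site.xml:
          //   <property><name>dfs.replication</name><value>2</value></property>
          Configuration conf = new Configuration();
          conf.set("dfs.replication", "2"); // the default is 3
          System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        }
      }
      ```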

  4. carloss7 says:

    I was wondering if you have the code for the taxi cab demos?

    Thank you.