Setting Up a Small-Budget Hadoop Cluster for Big Data Analysis

At the 2014 Esri User Conference, the Big Data team gave several presentations, including two technical workshops: ‘Big Data and Analytics: The Fundamentals’ and ‘Big Data and Analytics with ArcGIS’. We presented our open source GIS Tools for Hadoop (shared on GitHub), as well as some research that we’re currently pursuing (exciting things to come!). We gave demos using both our open source tools and the prototype tools currently under research.

For the demos, we ran all of our analytics on a Hadoop cluster back in Redlands; the source data consisted of more than 170 million points representing every taxi cab trip in New York City in 2013. A twenty-node cluster may seem like a big investment (and it can be), but it doesn’t have to be. Enter the DREDD cluster…
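To give a sense of what those analytics look like in practice, here is a minimal MapReduce sketch in Java. It is not our actual demo code: the TripsPerCell class, the CSV column positions, and the grid size are all hypothetical. It simply counts taxi pickups per 0.01-degree grid cell:

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TripsPerCell {

  public static class CellMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text cell = new Text();

    @Override
    protected void map(LongWritable key, Text line, Context ctx)
        throws IOException, InterruptedException {
      String[] cols = line.toString().split(",");
      try {
        // Hypothetical column positions for pickup longitude/latitude
        double lon = Double.parseDouble(cols[10]);
        double lat = Double.parseDouble(cols[11]);
        // Snap each pickup to a 0.01-degree grid cell (roughly 1 km)
        cell.set(String.format("%.2f,%.2f",
            Math.floor(lon * 100) / 100, Math.floor(lat * 100) / 100));
        ctx.write(cell, ONE);
      } catch (NumberFormatException | ArrayIndexOutOfBoundsException e) {
        // Skip header rows and malformed records
      }
    }
  }

  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable c : counts) sum += c.get();
      ctx.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "taxi trips per grid cell");
    job.setJarByClass(TripsPerCell.class);
    job.setMapperClass(CellMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

In the real demos, the spatial operations came from the geometry operators in GIS Tools for Hadoop rather than the naive grid snap above.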

A few months ago we created the DREDD cluster at Esri for R&D use, built from old computers that were destined for the dump (yes, it’s named after the one-and-only Judge Dredd). Hadoop is often described as being able to run on clusters of commodity hardware, and what better test is there than a set of 5-7 year old desktops? The DREDD cluster was set up with twenty of these free computers (read: FREE), which Hadoop calls nodes. We did, however, update each with a new fast network card, more RAM, and a large, fast hard drive. This is all fairly inexpensive, especially when compared with new computers: buying twenty new machines would have cost around $75,000, and we got the DREDD cluster up and running for much less than 1/10th of that!

With old computers you can expect occasional failures, but luckily for us (and you), Hadoop is built to be fault tolerant: your data is replicated across computers to protect against failures, so old hardware isn’t a problem for the Hadoop infrastructure.
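As an illustration, here is a minimal sketch (not part of our cluster setup; the file path is hypothetical) showing how Hadoop’s Java FileSystem API reports and adjusts how many copies of a file’s blocks are kept:

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class ReplicationCheck {
  public static void main(String[] args) throws Exception {
    // Picks up core-site.xml / hdfs-site.xml from the classpath
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);

    // Hypothetical file; HDFS spreads copies of its blocks across nodes
    Path file = new Path("/data/taxi/trips.csv");
    short current = fs.getFileStatus(file).getReplication();
    System.out.println("Current replication factor: " + current);

    // Raise the replication factor for an especially important file
    fs.setReplication(file, (short) 3);
  }
}
```

By default HDFS keeps 3 copies of each block, so a single failed node never takes your data with it.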

From there we were able to set up our Hadoop cluster using documentation readily available online.

As always, we would love to hear what you are doing with big data, what functionality you want to see in the future, and any questions or comments you have about setting up your own cluster using commodity hardware.

Technical specs of the DREDD cluster (per node):
OS: Linux (CentOS 6.5 distribution)
Hard Drive: 1 TB
RAM: 16 GB
CPU: Intel Xeon ~3.07 GHz quad-core
System Deployment: Clonezilla
Hadoop Management: Ambari

Thanks to Sarah Ambrose from the Big Data portion of the geodatabase team for this post.


Comments


  1. sambrose88_1 says:

    Thanks go out to Randall Whitman, Mike Park and Erik Hoel for contributing to the blog post.

  2. chandrabigdata says:

    Hi,

    I would like to set up a small cluster in my lab to process 2 to 3 TB of data across multiple nodes (say, for example, 5 systems).
    Can you please suggest how I can go ahead with this? I am looking for the technical specifications for the 5 nodes, and advice on which rack would be good.

    Thanks in advance


    • sambrose88_1 says:

      Hi Chandra,

      This is the first cluster we’ve set up; would something like this work for you?

      We’ll keep you posted as we try other Big Data methods in the future.

  3. bdschool says:

    JM – thank you for this article. Going back to school (both formal and self-taught). Building some BD boxes to work with is challenging for me (zero budget and no old boxes).

    I found another article, but it was for a 20 node setup – ouch $)

    From what I have read so far; here is what I am considering:

    - start with two nodes (config for each)
    — AMD Quad
    — 8 GB RAM (motherboard upgradable to 16 GB later)

    —- get my first Server Rack [no future room in house for 10 Nodes - unless I go "up"]

    ….. everyone please feel free to comment
    —– ? better $ path choice
    —– ? motherboards suggested (thinking ASUS with SATA III)
    —– ? CPU suggested
    —– ? more cost effective

    thank you for any and all comments

    - I’m still learning what questions to ask :-)

    • sambrose88_1 says:

      Hi bdschool,

      Since the team has only set up the DREDD cluster, and was limited by what we had, we didn’t have to make those choices. You might get some great feedback from the Hadoop Users mailing list. Another interesting point is that with only 2 nodes Hadoop can’t fully satisfy its default replication factor of 3, so you’ll probably want to start out with at least 3 nodes. You may also want to consider a cloud-based solution, which will cost much less upfront.
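      For example, lowering the default replication is one way around that on a very small cluster. This is a minimal sketch, not from our setup, and the value of 2 is just an illustration:

      ```java
      import org.apache.hadoop.conf.Configuration;

      public class SmallClusterConf {
        public static void main(String[] args) {
          // Equivalent to setting in hdfs-site.xml:
          //   <property><name>dfs.replication</name><value>2</value></property>
          Configuration conf = new Configuration();
          conf.set("dfs.replication", "2"); // the default is 3
          System.out.println("dfs.replication = " + conf.get("dfs.replication"));
        }
      }
      ```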

  4. carloss7 says:

    I was wondering if you have the code for the taxi cab demos?

    Thank you.