Hadoop Map-Reduce Framework Tutorials with Examples

Updated October 2015. Full sample code is available for many frameworks; see the list at the bottom of the article.

There are a lot of frameworks for writing map-reduce pipelines for Hadoop, but it can be pretty hard to navigate them all and get a good sense of which framework you should be using. I felt very overwhelmed when I started working with Hadoop, and things have only gotten harder for newcomers as the number of frameworks keeps growing.

Having now explored a number of frameworks, I thought it would be useful to list the major frameworks and provide examples of performing a common operation in each framework.

Generally speaking, the goal of each framework is to make building pipelines easier than using the basic map and reduce interfaces provided by hadoop-core. Instead of writing those functions directly, you usually write something higher-level that the framework 'compiles' into a pipeline of map-reduce jobs. This is particularly true of the highest-level frameworks (such as Hive), which require almost no programming knowledge to operate.
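For example, here is roughly what the classic word count looks like in Crunch's Java API. This is a minimal sketch modelled on Crunch's getting-started example, so treat the details as illustrative: the point is that you write collection-style transformations, and Crunch plans and runs the underlying map-reduce jobs for you.

```java
import org.apache.crunch.DoFn;
import org.apache.crunch.Emitter;
import org.apache.crunch.PCollection;
import org.apache.crunch.PTable;
import org.apache.crunch.Pipeline;
import org.apache.crunch.impl.mr.MRPipeline;
import org.apache.crunch.types.writable.Writables;

public class WordCount {
  public static void main(String[] args) throws Exception {
    Pipeline pipeline = new MRPipeline(WordCount.class);
    PCollection<String> lines = pipeline.readTextFile(args[0]);

    // Split each line into words; Crunch turns this into a map phase.
    PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
      @Override
      public void process(String line, Emitter<String> emitter) {
        for (String word : line.split("\\s+")) {
          emitter.emit(word);
        }
      }
    }, Writables.strings());

    // count() is 'compiled' down to a shuffle plus a reduce phase.
    PTable<String, Long> counts = words.count();
    pipeline.writeTextFile(counts, args[1]);
    pipeline.done();
  }
}
```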

List of Map-Reduce Frameworks by Language

Java
  • Basic Map Reduce (walkthrough, docs)
  • Cascading (walkthrough, docs)
  • Crunch (walkthrough coming soon, docs)

Clojure
  • Cascalog (walkthrough coming soon, docs)

Scala
  • Scrunch (walkthrough coming soon, docs)
  • Scalding (walkthrough, docs)
  • Scoobi (walkthrough, docs)

Any Language
  • Hadoop Streaming (walkthrough coming soon, docs)

Ruby
  • Wukong (walkthrough coming soon, docs)
  • Cascading JRuby (walkthrough coming soon, docs)

PHP (yes, really)
  • HadooPHP (walkthrough coming soon, docs)

Python
  • MRJob (walkthrough coming soon, docs)
  • Dumbo (walkthrough coming soon, docs)
  • Hadoopy (walkthrough coming soon, docs)
  • Pydoop (walkthrough coming soon, docs)
  • Luigi (walkthrough coming soon, docs)

R
  • RHadoop (walkthrough coming soon, docs)

New Languages
  • Hive (walkthrough, docs)
  • Pig (walkthrough, docs)
  • Spark (walkthrough coming soon, docs)

Please tweet me (@rathboma) if I have missed any.

Framework Walkthroughs

I will create a separate article for each framework (current articles listed here) in which I will build a small map-reduce pipeline to do the following:

Given two (fake) datasets:

  1. A set of user demographic information containing [id, email, language, location]
  2. A set of item purchases, containing fields [transaction-id, product-id, user-id, purchase-amount, product-description]

Calculate the number of locations in which a product is purchased.

Whilst this example is fairly simple, it requires a join of two datasets and a pipeline of two map-reduce jobs: step one joins users to purchases, and step two aggregates on location. Together these should demonstrate the unique attributes of each framework much better than the word-count example that is usually used as a demonstration.
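To make those two steps concrete, here is a minimal sketch of the pipeline against the raw Hadoop API. The class names are hypothetical, both inputs are assumed to be tab-delimited with the field orders listed above, and the job setup (e.g. wiring each mapper to its input with MultipleInputs) is omitted.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class ProductLocations {

  // Step one: a reduce-side join of users and purchases on user-id.
  // Each mapper keys its records by user-id and tags the value so the
  // reducer can tell which dataset a record came from.

  public static class UserMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // [id, email, language, location]
      String[] fields = value.toString().split("\t");
      ctx.write(new Text(fields[0]), new Text("U" + fields[3]));
    }
  }

  public static class PurchaseMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable key, Text value, Context ctx)
        throws IOException, InterruptedException {
      // [transaction-id, product-id, user-id, purchase-amount, product-description]
      String[] fields = value.toString().split("\t");
      ctx.write(new Text(fields[2]), new Text("P" + fields[1]));
    }
  }

  public static class JoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text userId, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      String location = null;
      // Buffer product-ids: without a secondary sort, the user record
      // can appear anywhere in the iteration.
      List<String> products = new ArrayList<String>();
      for (Text v : values) {
        String s = v.toString();
        if (s.startsWith("U")) location = s.substring(1);
        else products.add(s.substring(1));
      }
      if (location == null) return; // purchase with no matching user
      for (String product : products) {
        ctx.write(new Text(product), new Text(location)); // (product-id, location)
      }
    }
  }

  // Step two: group the join output by product-id and count distinct
  // locations. With KeyValueTextInputFormat the default identity mapper
  // can feed this reducer directly.
  public static class LocationCountReducer extends Reducer<Text, Text, Text, IntWritable> {
    @Override
    protected void reduce(Text productId, Iterable<Text> locations, Context ctx)
        throws IOException, InterruptedException {
      Set<String> distinct = new HashSet<String>();
      for (Text loc : locations) {
        distinct.add(loc.toString());
      }
      ctx.write(productId, new IntWritable(distinct.size()));
    }
  }
}
```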

As I complete each example I will update this document with a link to each example.

My Commonly Used Frameworks

  • Hive – Hive is amazing because anyone can query the data with a little knowledge of SQL. Hook it up to a visual query designer and you don’t even need that.
  • Pig – the perfect framework for prototyping and quick investigation. It’s a simple scripting language with a bunch of powerful map-reduce-specific features.
  • Scoobi – I use this a lot to build pipelines in Scala because it’s very functional; in many ways you just treat the data like a regular list, which is great.
  • Raw Map/Reduce – Sometimes I like to program directly against the API, especially when doing something mission-critical. I also find the individual map and reduce functions easier to test (a testing sketch follows this list).
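To illustrate that last point, here is a sketch of a mapper test using MRUnit, exercising the hypothetical UserMapper from the pipeline sketch above. The input record and expected output are invented for illustration.

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mrunit.mapreduce.MapDriver;
import org.junit.Test;

public class UserMapperTest {

  @Test
  public void emitsUserIdKeyedTaggedLocation() throws Exception {
    new MapDriver<LongWritable, Text, Text, Text>()
        .withMapper(new ProductLocations.UserMapper())
        // a fake user record: [id, email, language, location]
        .withInput(new LongWritable(0), new Text("1\tbob@example.com\ten\tUS"))
        // the mapper should key by user-id and tag the location with "U"
        .withOutput(new Text("1"), new Text("UUS"))
        .runTest();
  }
}
```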

