Hadoop MapReduce Framework Tutorials with Examples

Hire me to supercharge your Hadoop and Spark projects

I help businesses improve their return on investment from big data projects. I do everything from software architecture to staff training. Learn More

Updated October 2015. Full sample code is available for many frameworks; see the list at the bottom of the article.

There are a lot of frameworks for writing MapReduce pipelines for Hadoop, but it can be pretty hard to navigate everything and get a good sense of which framework you should be using. I felt very overwhelmed when I started working with Hadoop, and this has only gotten worse for newcomers as the number of frameworks keeps growing.

Having now explored a number of frameworks, I thought it would be useful to list the major frameworks and provide examples of performing a common operation in each framework.

Generally speaking, the goal of each framework is to make building pipelines easier than when using the basic map and reduce interface provided by hadoop-core. This usually means the frameworks do not require you to write these functions at all; instead, you write something more high-level that the framework can ‘compile’ into a pipeline of MapReduce jobs. This is particularly true for the higher-level frameworks (such as Hive), which don’t really require any knowledge of programming to operate.
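To make the "basic map and reduce interface" a little more concrete, here is a minimal in-memory sketch of the programming model in plain Python. It is not the Hadoop API: `run_mapreduce`, `word_mapper`, and `sum_reducer` are hypothetical names I made up for illustration, and a real job would run distributed across a cluster rather than in a single loop.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory (hypothetical helper, not Hadoop)."""
    # Map phase: every input record may emit any number of (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: each key is processed once, with all of its values.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def word_mapper(line):
    # Emit (word, 1) for every word in the line.
    for word in line.split():
        yield word.lower(), 1


def sum_reducer(word, counts):
    # Add up the 1s emitted for this word.
    yield word, sum(counts)


counts = run_mapreduce(["Hadoop is big", "big data is big"], word_mapper, sum_reducer)
print(counts)  # [('big', 3), ('data', 1), ('hadoop', 1), ('is', 2)]
```

The frameworks listed below mostly exist so that you do not have to write the mapper and reducer functions by hand like this.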

List of MapReduce Frameworks by Language

Java
  • Basic MapReduce (walkthrough, docs)
  • Cascading (walkthrough, docs)
  • Crunch (coming soon, docs)

Clojure
  • Cascalog (coming soon, docs)

Scala
  • Scrunch (coming soon, docs)
  • Scalding (walkthrough, docs)
  • Scoobi (walkthrough, docs)

Any Language
  • Hadoop Streaming (coming soon, docs)

Ruby
  • Wukong (coming soon, docs)
  • Cascading JRuby (coming soon, docs)

PHP (yes, really)
  • HadooPHP (coming soon, docs)

Python
  • Regular Python Streaming (walkthrough, docs)
  • mrjob (coming soon, docs)
  • Dumbo (coming soon, docs)
  • Hadoopy (coming soon, docs)
  • Pydoop (coming soon, docs)
  • Luigi (coming soon, docs)

R
  • RHadoop (coming soon, docs)

New Languages
  • Hive (walkthrough, docs)
  • Pig (walkthrough, docs)

Other
  • Spark (walkthrough, docs)

Please tweet me if I have missed any: @rathboma

Framework Walkthroughs

I will create a separate article for each framework (current articles listed here) in which I will build a small MapReduce pipeline to do the following:

Given two (fake) datasets:

  1. A set of user demographic information containing [id, email, language, location]
  2. A set of item purchases, containing fields [transaction-id, product-id, user-id, purchase-amount, product-description]

For each product, calculate the number of distinct locations in which it was purchased.

Whilst this example is fairly simple, it requires a join of two datasets and a pipeline of two MapReduce jobs. Step one joins users to purchases, while step two aggregates on location. Together, these two steps should demonstrate the unique attributes of each framework much better than the simple Word Count example that is usually used as a demonstration.
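As a rough illustration of the shape of this two-job pipeline, here is a small in-memory sketch in plain Python. Everything in it is an assumption for demonstration purposes: the sample rows are made up, `run_job` is a hypothetical stand-in for a real MapReduce job, and the join is done as a simple reduce-side join on the user id. It is not taken from any of the framework walkthroughs.

```python
from collections import defaultdict


def run_job(records, mapper, reducer):
    """Hypothetical in-memory stand-in for one MapReduce job."""
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            groups[key].append(value)
    return [out for key in sorted(groups) for out in reducer(key, groups[key])]


# Made-up sample data using the field layouts described above.
users = [  # [id, email, language, location]
    (1, "a@example.com", "EN", "US"),
    (2, "b@example.com", "EN", "GB"),
    (3, "c@example.com", "FR", "FR"),
]
purchases = [  # [transaction-id, product-id, user-id, purchase-amount, product-description]
    ("t1", "p1", 1, 10.0, "widget"),
    ("t2", "p1", 2, 10.0, "widget"),
    ("t3", "p1", 1, 10.0, "widget"),
    ("t4", "p2", 3, 5.0, "gadget"),
]


# Job 1: reduce-side join of users to purchases, keyed on user id.
def join_mapper(record):
    source, row = record
    if source == "user":
        yield row[0], ("location", row[3])
    else:
        yield row[2], ("product", row[1])


def join_reducer(user_id, tagged_values):
    locations = [v for tag, v in tagged_values if tag == "location"]
    products = [v for tag, v in tagged_values if tag == "product"]
    for location in locations:
        for product in products:
            yield product, location


# Job 2: count the distinct locations seen for each product.
def identity_mapper(pair):
    yield pair


def distinct_location_reducer(product, locations):
    yield product, len(set(locations))


tagged_input = [("user", u) for u in users] + [("purchase", p) for p in purchases]
joined = run_job(tagged_input, join_mapper, join_reducer)
result = run_job(joined, identity_mapper, distinct_location_reducer)
print(result)  # [('p1', 2), ('p2', 1)]
```

Note that product p1 is bought twice by the same US user, so job two has to count distinct locations rather than rows.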

As I complete each example, I will update this document with a link to it.

My Commonly Used Frameworks

  • Hive – Hive is amazing because anyone can query the data with a little knowledge of SQL. Hook it up to a visual query designer and you don’t even need that.
  • Pig – the perfect framework for prototyping and quick investigation. It’s a simple scripting language with a bunch of powerful MapReduce-specific features.
  • Scoobi – I use this a lot to build pipelines in Scala because it’s very functional, and in many ways you just treat the data like a regular list, which is great.
  • Raw Map/Reduce – Sometimes I like to program directly to the API, especially when doing something mission-critical. I also find the individual map and reduce functions easier to test.

Matthew Rathbone

Consultant Big Data Infrastructure Engineer at Rathbone Labs. British. Data Nerd. Lucky husband and father.
