Hadoop MapReduce Framework Tutorials with Examples

By Matthew Rathbone on January 05 2013

Hire me to supercharge your Hadoop and Spark projects

I help businesses improve their return on investment from big data projects. I do everything from software architecture to staff training.

Updated October 2015: Full sample code is available for many frameworks; see the list at the bottom of the article.

There are a lot of frameworks for writing MapReduce pipelines for Hadoop, but it can be pretty hard to navigate them all and get a good sense of which framework you should be using. I felt very overwhelmed when I started working with Hadoop, and this has only gotten worse for newcomers as the number of frameworks keeps growing.

Having now explored a number of frameworks, I thought it would be useful to list the major frameworks and provide examples of performing a common operation in each framework.

Generally speaking, the goal of each framework is to make building pipelines easier than when using the basic map and reduce interface provided by hadoop-core. This usually means the frameworks do not require you to write these functions at all, but rather something more high-level that the framework can ‘compile’ into a pipeline of MapReduce jobs. This is particularly true for the higher-level frameworks (such as Hive), which don’t really require any knowledge of programming to operate.
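To make the "basic map and reduce interface" concrete, here is a minimal in-memory sketch in plain Python. It is not the hadoop-core API: the run_mapreduce helper, the sample users, and the function names are all assumptions made for illustration. It only shows the shape of the two functions you write by hand and the group-by-key step that sits between them.

```python
from collections import defaultdict


def run_mapreduce(records, mapper, reducer):
    """Simulate one MapReduce job in memory (hypothetical helper, not Hadoop)."""
    # Map phase: every input record may emit zero or more (key, value) pairs.
    groups = defaultdict(list)
    for record in records:
        for key, value in mapper(record):
            # Shuffle phase: collect all values that share a key.
            groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    output = []
    for key in sorted(groups):
        output.extend(reducer(key, groups[key]))
    return output


def language_mapper(user):
    # Assumed record layout: (id, email, language, location).
    _user_id, _email, language, _location = user
    yield language, 1


def count_reducer(language, ones):
    yield language, sum(ones)


# Made-up sample data for illustration only.
users = [
    (1, "a@example.com", "EN", "US"),
    (2, "b@example.com", "FR", "FR"),
    (3, "c@example.com", "EN", "UK"),
]

result = run_mapreduce(users, language_mapper, count_reducer)
print(result)
```

A higher-level framework would typically let you express the same count as a single group-and-count statement and generate the equivalent of language_mapper and count_reducer for you.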

List of Map Reduce Frameworks for each language

Java
Basic Map Reduce	walkthrough	docs
Cascading	walkthrough	docs
Crunch	coming soon	docs

Clojure
Cascalog	coming soon	docs

Scala
Scrunch	coming soon	docs
Scalding	walkthrough	docs
Scoobi	walkthrough	docs

Any Language
Hadoop Streaming	coming soon	docs

Ruby
Wukong	coming soon	docs
Cascading JRuby	coming soon	docs

PHP (yes, really)
HadooPHP	coming soon	docs

Python
Regular Python Streaming	walkthrough	docs
mrjob	coming soon	docs
Dumbo	coming soon	docs
Hadoopy	coming soon	docs
Pydoop	coming soon	docs
Luigi	coming soon	docs

R
RHadoop	coming soon	docs

New Languages
Hive	walkthrough	docs
Pig	walkthrough	docs

Other
Spark	walkthrough	docs

Please tweet me if I have missed any: @rathboma

Framework Walkthroughs

I will create a separate article for each framework (current articles listed here) in which I will build a small MapReduce pipeline to do the following:

Given two (fake) datasets:

A set of user demographic information, containing fields [id, email, language, location]
A set of item purchases, containing fields [transaction-id, product-id, user-id, purchase-amount, product-description]

Calculate the number of distinct locations in which each product is purchased.

Whilst this example is fairly simple, it requires a join of two datasets and a pipeline of two MapReduce jobs. Step one joins users to purchases, while step two aggregates on location. Together, these two steps should demonstrate the unique attributes of each framework much better than the simple Word Count example that is usually used as a demonstration.
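The two steps can be sketched in plain Python as a pair of simulated jobs. This is only an in-memory illustration, not code from any of the walkthroughs: the run_job helper, the "user"/"purchase" tags, and the sample rows are assumptions, and I am reading "number of locations" as the number of distinct locations per product.

```python
from itertools import groupby
from operator import itemgetter


def run_job(records, mapper, reducer):
    """Simulate one MapReduce job: map, sort and group by key, reduce."""
    mapped = [pair for record in records for pair in mapper(record)]
    mapped.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(mapped, key=itemgetter(0)):
        output.extend(reducer(key, [value for _key, value in group]))
    return output


# Job 1: reduce-side join of users and purchases on user id.
def join_mapper(tagged_row):
    source, row = tagged_row
    if source == "user":
        user_id, _email, _language, location = row
        yield user_id, ("location", location)
    else:
        _txn_id, product_id, user_id, _amount, _description = row
        yield user_id, ("product", product_id)


def join_reducer(_user_id, tagged_values):
    locations = [v for tag, v in tagged_values if tag == "location"]
    products = [v for tag, v in tagged_values if tag == "product"]
    for location in locations:
        for product in products:
            yield product, location


# Job 2: count the distinct locations seen for each product.
def location_mapper(product_location):
    yield product_location  # already a (product, location) pair


def distinct_count_reducer(product, locations):
    yield product, len(set(locations))


# Made-up sample data for illustration only.
users = [
    (1, "a@example.com", "EN", "US"),
    (2, "b@example.com", "EN", "GB"),
    (3, "c@example.com", "FR", "FR"),
    (4, "d@example.com", "EN", "US"),
]
purchases = [
    (1, "product-a", 1, 300, "a product"),
    (2, "product-a", 2, 300, "a product"),
    (3, "product-b", 3, 100, "b product"),
    (4, "product-a", 4, 300, "a product"),
    (5, "product-b", 3, 100, "b product"),
]

tagged = [("user", u) for u in users] + [("purchase", p) for p in purchases]
joined = run_job(tagged, join_mapper, join_reducer)
result = run_job(joined, location_mapper, distinct_count_reducer)
print(result)
```

The output of the first job is the input to the second, which is the part that Word Count never exercises and where the frameworks differ most in how much plumbing they hide.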

As I complete each example, I will update this document with a link to it.

My Commonly used Frameworks

Hive – Hive is amazing because anyone can query the data with a little knowledge of SQL. Hook it up to a visual query designer and you don’t even need that.
Pig – the perfect framework for prototyping and quick investigation. It’s a simple scripting language with a bunch of powerful MapReduce-specific features.
Scoobi – I use this a lot to build pipelines in Scala because it’s very functional, and in many ways you just treat the data like a regular list, which is great.
Raw Map/Reduce – Sometimes I like to program directly to the API, especially when doing something mission-critical. I also find the individual map and reduce functions easier to test.
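As a rough illustration of why standalone map and reduce functions are easy to test, here is a small Python sketch. The function names are hypothetical, and the tab-separated line format is an assumption; the field order follows the purchases dataset described above. Because each function is a plain input-to-output transformation, it can be checked with ordinary assertions and no cluster.

```python
def purchase_mapper(line):
    """Parse one tab-separated purchase line and emit (product_id, amount)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 5:
        return []  # skip malformed lines instead of failing the whole job
    _txn_id, product_id, _user_id, amount, _description = fields
    return [(product_id, float(amount))]


def total_reducer(product_id, amounts):
    """Sum all purchase amounts seen for one product."""
    return [(product_id, sum(amounts))]


# Each function is tested on its own, with hand-written inputs.
assert purchase_mapper("1\tproduct-a\t7\t19.5\tA widget\n") == [("product-a", 19.5)]
assert purchase_mapper("not a purchase line") == []
assert total_reducer("product-a", [19.5, 0.5]) == [("product-a", 20.0)]
```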

Updates

2013-02-09: map reduce walkthrough published
2013-02-21: hive walkthrough published
2013-04-07: pig walkthrough published
2013-11-03: scoobi walkthrough published
2015-06-25: cascading walkthrough published
2015-10-19: scalding walkthrough published
