Cascading MapReduce Hadoop Tutorial with Examples
This article is part of my guide to MapReduce frameworks, in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.
This time I am implementing the solution using Cascading 2.7.0.
The Problem
Let me quickly restate the problem from my original article.
I have two datasets:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.
Previously I implemented this solution in plain Java, in Hive, and in Pig. The Java solution was ~500 lines of code; the Hive and Pig solutions were ~20 lines at most.
The Cascading Solution
If you are impatient, you can see the full implementation here.
Cascading is a data processing API that adds an abstraction layer over the Hadoop API. It is used for defining, sharing, and executing data processing workflows. Its goal is to make building pipelines easier than using the basic map and reduce interfaces provided by Hadoop. You use Cascading like any other API: make the Cascading jar libraries distributed by cascading.org available to your code. The fastest way to get started is to download the Cascading SDK, which includes Cascading and related projects in a single archive.
Setup
Cascading reads input data from, and writes output to, a data resource such as a file on the local file system, on the Hadoop distributed file system (HDFS), or on S3 (look here for more details). HDFS serves as the input and output file system in this example. To start working with the code, copy your input data to HDFS. The output folder will be created by the Cascading job; it must not exist when you start the execution of the program.
Thus, to start our task we copy the users file and the transactions file to HDFS.
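A minimal sketch of those commands, assuming the input files are named users.txt and transactions.txt; the HDFS paths are placeholders:

```sh
# Create an input directory on HDFS and copy both datasets into it
hadoop fs -mkdir -p /input
hadoop fs -put users.txt /input/users.txt
hadoop fs -put transactions.txt /input/transactions.txt
```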
The whole solution can be downloaded from my code repository; below I will go through each portion in detail.
Cascading provides an abstraction over MapReduce by describing a data processing workflow in terms of Taps, Pipes and Sinks. Data comes into the process from a Tap, passes through a few Pipes, and finally flows into a Sink.
Processing input data starts with reading it into a Tap:
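As a sketch, the two source taps might look like the following (the HDFS paths are placeholders; the field names match the datasets described above):

```java
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

// Name the columns of each dataset so we can refer to them by name later on.
Fields users = new Fields( "id", "email", "language", "location" );
Fields transactions = new Fields( "transaction-id", "product-id", "user-id",
                                  "purchase-amount", "item-description" );

// Source taps reading the tab-delimited input files from HDFS (paths are placeholders).
Tap usersTap = new Hfs( new TextDelimited( users, false, "\t" ), "/input/users.txt" );
Tap transactionsTap = new Hfs( new TextDelimited( transactions, false, "\t" ), "/input/transactions.txt" );
```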
It is rather important that we name the columns of the input data, or we will have to refer to them by number, which is inconvenient. It is also possible to let an input file with a header row define the names directly: to do this, the second parameter of the TextDelimited( users, true, "\t" ) constructor must be set to true. However, this is usually not a good solution for larger datasets.
The second step in our process is to create a Pipe. The data comes from the Tap and goes through a number of transformations.
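A minimal sketch of the two head pipes (the pipe names are only used later to bind each source tap to its pipe):

```java
import cascading.pipe.Pipe;

// Head pipes; their names are used in the Flow definition to attach the source taps.
Pipe usersPipe = new Pipe( "usersPipe" );
Pipe transactionsPipe = new Pipe( "transactionsPipe" );
```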
To connect all the Pipes together and send the output into a Sink, a Flow is needed:
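A sketch of that wiring, assuming the taps and pipes from the surrounding snippets (cdistPipe is the final pipe built further down; Main is a placeholder for the driver class and the output path is illustrative):

```java
import java.util.Properties;
import cascading.flow.Flow;
import cascading.flow.FlowDef;
import cascading.flow.hadoop.HadoopFlowConnector;
import cascading.property.AppProps;
import cascading.scheme.hadoop.TextDelimited;
import cascading.tap.Tap;
import cascading.tap.hadoop.Hfs;
import cascading.tuple.Fields;

Properties properties = new Properties();
// Tell Hadoop which jar to ship to the cluster; Main is a placeholder driver class.
AppProps.setApplicationJarClass( properties, Main.class );

// Sink tap writing tab-delimited output; the output folder must not exist yet.
Tap outputTap = new Hfs( new TextDelimited( Fields.ALL, false, "\t" ), "/output/product-locations" );

FlowDef flowDef = FlowDef.flowDef()
    .setName( "product-locations" )
    .addSource( usersPipe, usersTap )
    .addSource( transactionsPipe, transactionsTap )
    .addTailSink( cdistPipe, outputTap ); // cdistPipe is the last pipe, defined below

// Build and run the flow on Hadoop.
Flow flow = new HadoopFlowConnector( properties ).connect( flowDef );
flow.complete();
```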
In our case data for users and transactions goes straight to the joinPipe via transactionsPipe and usersPipe:
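As a sketch, the join can be expressed with a CoGroup pipe that matches the transactions' user-id against the users' id:

```java
import cascading.pipe.CoGroup;
import cascading.pipe.Pipe;
import cascading.pipe.joiner.InnerJoiner;
import cascading.tuple.Fields;

// Inner join: transactions.user-id == users.id
Pipe joinPipe = new CoGroup( transactionsPipe, new Fields( "user-id" ),
                             usersPipe, new Fields( "id" ),
                             new InnerJoiner() );
```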
The result is a tuple stream containing the fields from both joined datasets. We only need to retain two of them: product-id and location:
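For example, with the Retain subassembly:

```java
import cascading.pipe.Pipe;
import cascading.pipe.assembly.Retain;
import cascading.tuple.Fields;

// Keep only the two fields needed to count distinct locations per product.
Pipe retainPipe = new Retain( joinPipe, new Fields( "product-id", "location" ) );
```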
Now we count unique locations for each product:
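One way to sketch this is to drop duplicate (product-id, location) pairs with Unique and then count the remaining rows per product with CountBy (the loc-count field name is illustrative):

```java
import cascading.pipe.Pipe;
import cascading.pipe.assembly.CountBy;
import cascading.pipe.assembly.Unique;
import cascading.tuple.Fields;

// First drop duplicate (product-id, location) pairs ...
Pipe cdistPipe = new Unique( retainPipe, new Fields( "product-id", "location" ) );
// ... then count how many distinct locations remain for each product.
cdistPipe = new CountBy( cdistPipe, new Fields( "product-id" ), new Fields( "loc-count" ) );
```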
Summary
So the data travels via transactionsPipe and usersPipe into joinPipe; the selected fields then travel through retainPipe to cdistPipe. From this last pipe (as specified in .addTailSink( cdistPipe, outputTap )), the resulting data goes into the Sink, which is our output folder.
The compiled code is executed via the hadoop command:
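For example (the jar name, main class and paths are placeholders; the argument order depends on how the driver class parses its arguments):

```sh
hadoop jar cascading-example.jar com.example.Main \
  /input/users.txt /input/transactions.txt /output/product-locations
```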
The result on our data is:
Testing
The Cascading API includes test functionality, in particular the cascading.CascadingTestCase and cascading.PlatformTestCase classes, which allow the testing of custom Operations and Flows. Our example consists of one Flow, and the Cascading API suggests using the static validateLength() helper methods from the cascading.CascadingTestCase class for Flow testing. The different validateLength() methods validate that the output of a Flow has the correct length, has the correct tuple size, and that its values match a given regular expression pattern. In our example we can check that the final output has exactly two rows.
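As a sketch, a test along those lines might look like the following; buildFlow() is a hypothetical helper that assembles the taps and pipes shown earlier over small local fixture files:

```java
import cascading.CascadingTestCase;
import cascading.flow.Flow;
import org.junit.Test;

public class ProductLocationsTest {

  @Test
  public void outputHasTwoRows() throws Exception {
    // buildFlow() is a hypothetical helper wiring up the taps and pipes described above.
    Flow flow = buildFlow( "data/users.txt", "data/transactions.txt", "output/" );
    flow.complete();

    // Assert that the sink contains exactly two tuples (rows).
    CascadingTestCase.validateLength( flow, 2 );
  }

  private Flow buildFlow( String usersPath, String transactionsPath, String outputPath ) {
    // Placeholder: assemble the Flow as shown earlier in the article.
    throw new UnsupportedOperationException( "see the full test in the repository" );
  }
}
```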
A fully working unit test for this code using these libraries is included in the github repository.
Thoughts
Cascading simplifies data processing compared to plain MapReduce by providing a higher-level abstraction. The code is in Java and can easily become a small component of a bigger Java project.
Cascading pipes make it easy to write functions for data cleaning, matching, splitting, and so on. Sitting somewhere in between raw MapReduce and a higher-level language like Hive, Cascading provides a way to build data processing pipelines in a small(ish) amount of time.
Cascading Resources
- Cascading website
- Enterprise Data Workflows with Cascading by Paco Nathan from O’Reilly Media
- Check out the review of the book on the Cascading website.