Cascading MapReduce Hadoop Tutorial with Examples
Learning Hadoop and Spark?
I've scoured the internet and I think this free Big Data course from UC San Diego is a great way to jump in. It's hosted on Coursera, so you can audit the course for free.
This article is part of my guide to map reduce frameworks in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.
This time I am implementing the solution using Cascading 2.7.0.
Let me quickly restate the problem from my original article.
I have two datasets:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.
The Cascading Solution
If you are impatient, you can see the full implementation here.
Cascading is a data processing API that adds an abstraction layer over the Hadoop API. It is used for defining, sharing, and executing data processing workflows. Its goal is to make building pipelines easier than using the basic map and to reduce interface provided by Hadoop. You use Cascading like any other API by making your code accessible to the cascading jar libraries that are distributed by cascading.org. The fastest way to get started with Cascading is to download Cascading SDK as it includes cascading and related projects in a single archive.
Cascading reads input data and writes output to a data resource (such as a file on the local file system) on a Hadoop distributed file system, or on S3 (look here for more details). HDFS serves as an input and output file system in this example. To start working with the code, copy your input data to HDFS. The output folder will be created by the cascading project. It must not exist when you are starting the execution of the program.
Thus, to start our task we put to HDFS
The whole solution is here below (and can be downloaded from my code repository), I will go through each portion in detail:
Cascading provides an abstraction to MapReduce by describing data processing workflow in terms of Taps, Pipes and Sinks. Data comes into the process from a Tap, passes through a few Pipes and finally flows into a Sink.
Processing input data starts with reading it into a Tap:
It is rather important that we name the columns of the input data or we will have to refer to them by numbers, which is inconvenient. It is possible to provide an input file with a header to define the names directly. To do this, the second parameter of the
TextDelimited( users, *true*, "\t" ) function must be set to true. However this is usually not a good solution for larger datasets.
The second step in our process is to create a Pipe. The data comes from the Tap and goes through a number of transformations.
To connect all the Pipes together and get an output into a Sink a Flow is needed:
In our case data for users and transactions goes straight to the joinPipe via transactionsPipe and usersPipe:
This results are in a table with the fields that came from the two tables that were joined. We only need to retain two of them:
Now we count unique locations for each product:
So the data travels via
usersPipe into a
joinPipe, selected fields travel further through
retainPipe to a
cdistPipe. From this last pipe (as specified in
.addTailSink( cdistPipe, outputTap )) resulting data goes into the Sink, which is our output folder.
The compiled code is executed via the hadoop command:
The result on our data is:
The Cascading API includes test functionality, in particular
cascading.PlatformTestCase classes that allow the testing of custom Operations and Flows. Our example consists of one Flow. The Cascading API suggests using the
validateLength() static helper methods from
cascading.CascadingTestCase class for Flow testing. Different
validateLength() methods validate that a Flow is of the correct length, has the correct Tuple size, and its values match a given regular expression pattern. In our example we can check that the final output has exactly two rows.
A fully working unit test for this code using these libraries is included in the the github repository.
Cascading simplifies data processing in comparison to using plain MapReduce by providing a higher level abstraction. The code is in Java and can easily be a small block of a bigger java project.
Cascading pipes provide an ability to easily write functions for data cleaning, matching, splitting, etc. Sitting somewhere in between raw Map Reduce and a higher level language like Hive, Cascading provides a relatively quick way to build data processing pipelines in a small(ish) amount of time.