Scalding Hadoop MapReduce Tutorial With Example Code
image by Paul Dineed
Learning Hadoop and Spark?
I've scoured the internet and I think this free Big Data course from UC San Diego is a great way to jump in. It's hosted on Coursera, so you can audit the course for free.
This article is part of my guide to map reduce frameworks in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.
Perhaps it is best to read this article together with my article about Cascading, as Scalding is based on Cascading.
Let me quickly restate the problem from my original article.
I have two datasets:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.
The Scalding Solution
Scalding is a Scala API developed at Twitter for distributed data programming that uses the Cascading Java API, which in turn sits on top of Hadoop’s Java API. Scalding is pitched as a scala DSL for cascading, with the assetion that writing regular Cascading seem like assembly language programming in comparison.
As in the case with Cascading, the goal of Scalding is to make building data processing pipelines easier than using the basic map and reduce interface provided by Hadoop. The Scalding community is very active (as of the time of writing - October 2015), and there are many libraries built on top of scalding extending it’s functionality. Some pretty big organizations use Scalding, like Twitter, Foursquare, LinkedIn, Etsy, eBay and more.
For those who have read my scoobi walkthrough, scalding may feel familiar.
The tables that will be used for demonstration are called
Our code will read and write data from/to HDFS. Before starting work with the code we have to copy the input data to HDFS. Unlike Cascading, the Scalding job wants an output directory to be created for it prior to executing the job.
All code and data used in this post can be found in my hadoop framework examples GitHub repository.
The full solution is below, it’s a pretty small amount of code which is great, although for those unfamiliar with Scala it may look confusing.
Notice the fairly liberal use of scala symbols (
'iAmASymbol), something that likely has it’s roots in the Ruby world thanks to Twitter as few other scala libraries use this feature all that much.
The TextLine class we use to read input data reads the data line by line. We have the ability to immediately parse the input line to create a dataset with known column names with which we can operate upon further:
While this seems trivial for such simple data, such functionality is very useful for much larger datasets that form part of larger pipelines.
The most confusing part of the solution is the final block of code, but it’s actually quite straight forward.
Here it is step by step in plain english:
- Take the transactions and join them to users based on
projectretains only the specified fields from the resulting join, in this case
- They we select only unique pairs of each product/location combo. This allows us to use the
sizefunction to count distinct locations in the next step.
- Finally, we group our results by product id, and aggregate the remaining records by counting the number of locations associated with that product.
- At the end we write the results to our output directory.
Running the resulting jar
The JobTest class is used to construct unit tests for scalding jobs. The official scalding documentation on cascading.org advises that it should not be used unless it is for testing purposes, so I guess some folks have tried to use it in production somehow (don’t do this). For examples of unit testing the different parts of scalding code, see the tests included in the main scalding repository. The most relevant to our case and the most neat test example there is a WordCount test. Drawing on that I have written a fairly concise test that you can see below.
JobTest gives us the ability to define inputs with
.source, and sinks the output as an (Int, Int) tuple we read into a hashmap structure called
outMap and check two things:
- Product 1 has been sold in 3 locations:
outMap(1) shouldBe 3
- Product 2 has been sold in 1 location :
outMap(2) shouldBe 1
Both the main code and the test code are really concise with Scalding. The serialization and deserialization functions are pretty slick, and make working with Hadoop a lot easier. It is a good alternative to a Hive UDF for those who like working in code rather than SQL. I think developers who genuinely like Scala and need to write tasks for data processing on Hadoop benefit a lot from having this tool at their disposal.
The caveat to this is that you need to know scala to get a lot out of Scalding. Scala is a pretty complex language, so if you’re new to both Hadoop and Scala, this might be a pretty rough place to start.
The Cascading web-site has resources related to both Cascading and Scalding.
Programming MapReduce with Scalding is available on Amazon and should provide a practical guide to working with Scalding.