Hadoop MapReduce Scoobi Tutorial with Examples
Learning Hadoop and Spark?
I've scoured the internet and I think this free Big Data course from UC San Diego is a great way to jump in. It's hosted on Coursera, so you can audit the course for free.
This article is part of my guide to map reduce frameworks in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.
This time I’m using Scoobi.
Let me quickly restate the problem from my original article.
I have two datasets:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.
The Scoobi Solution
For the impatient, you can see the full implementation here.
Scoobi is a framework that allows you to write workflows in Scala using functional constructs that are designed to feel like regular list, map, and set operations.
Fundamentally the framework provides you with distributed lists that you transform and aggregate much like you would a regular list in scala.
For example, a simple map-only job in scoobi might look something like this:
It really doesn’t look like Map Reduce at all. In fact, Scoobi has to do a lot of complicated stuff under the hood to enable this kind of syntax (such as code generation).
The final scoobi solution tops out at ~50 lines of code, which is a 10x improvement over the Java mapreduce 500 lines of code. This is largely thanks to Scoobi’s built in
join syntax that lets us join two datasets together without having to worry about partitioners, groupers, or sorters.
As a demonstration, here is a simplified version of a left outer join in scoobi (I’ve ripped this straight from my code):
Oh yeah, you can also work with case classes, which is super nice.
Lets get on with it
So the basic workflow of this algorithm is scoobi is this:
- Read users and transactions
- Join users to transactions based on userid
- transform this dataset to be (productid, location) K/V pairs
- Group by product id to get (productid, Iterable[location])
- find the distinct number of locations for each product
- write (productid, distinct-locations) to the output directory somewhere
Here’s the code I used to accomplish this:
Pretty neat huh?
This is also fully unit-tested. Check out my code for the full test suite.
The final program is pretty succint and Scoobi makes it feel like you’re writing regular Scala. In itself these are excellent attributes of a Scala framework, but unfortunately they come with a cost.
Fundamentally if something doesn’t go quite right and you need to debug or optimize your code you’re going to be in a bad place.
Generated code means that stack traces from map/reduce tasks are not going to be very helpful, and operations in scoobi might do something different in practice than you imagined (“it looks like this should be a reduce task, but it’d doing something different”).
Secondly, something that may seem pretty simple on the surface (count distinct locations) can end up being complicated to implement in practice, not because the final solution is hard, but rather because the API is so very removed from
map, partition, group, sort, and reduce that you’re forced to do things differently. It also doesn’t help that there are multiple ways to solve the same problem.
The Right Framework?
The Scoobi community can be really helpful and responsive, but Scoobi only supports Scala 2.10+, and APIs change regularly from release to release, breaking old code to introduce new hot features.
Personally, I like the api for Scoobi a lot, but with the number of folks involved in the Scalding community it’s quickly becoming the framework to beat.
Scoobi comes somewhere inbetween Pig and Java in terms of ease of use. If you’re interested in how it compares to other frameworks, check out my guide to mapreduce frameworks for the links.