Apache Spark Tutorial with Example Code

image by blair_25

This article is part of my guide to map reduce frameworks in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.

Spark is isn’t actually a MapReduce framework. Instead it is a general-purpose framework for cluster computing, however it can be run, and is often run, on Hadoop’s YARN framework. Because it is often associated with Hadoop I am including it in my guide to map reduce frameworks as it often serves a similar function. Spark was designed to be fast for interactive queries and iterative algorithms that Hadoop MapReduce is a bit slow with.

The Problem

Let me quickly restate the problem from my original article.

I have two datasets:

  1. User information (id, email, language, location)
  2. Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)

Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.

Previously I have implemented this solution in java, with hive and with pig. The java solution was ~500 lines of code, hive and pig were like ~20 lines tops.

The Solution

I’ve published the solution to this problem over on the Beekeeper Blog.

Read the Spark solution and walkthrough here.

Matthew Rathbone's Picture

Matthew Rathbone

CEO of Beekeeper Data. British. Data Nerd. Lucky husband and father. More about me

Need More Hadoop Reading?

I've collected a list of the top Hadoop books on the market

Join the discussion

comments powered by Disqus