Apache Spark Scala Tutorial with Examples
image by Tony Webster
This article is part of my guide to MapReduce frameworks, in which I implement a solution to a real-world problem in each of the most popular Hadoop frameworks.
Spark isn’t actually a MapReduce framework. Instead, it is a general-purpose cluster-computing framework that can be run, and often is run, on Hadoop’s YARN resource manager. Because it is so often associated with Hadoop, and often serves a similar function, I am including it in my guide to MapReduce frameworks. Spark was designed to be fast for interactive queries and iterative algorithms, workloads that Hadoop MapReduce handles slowly.
Let me quickly restate the problem from my original article.
I have two datasets:
- User information (id, email, language, location)
- Transaction information (transaction-id, product-id, user-id, purchase-amount, item-description)
Given these datasets, I want to find the number of unique locations in which each product has been sold. To do that, I need to join the two datasets together.
The Spark Scala Solution
Spark is an open source project that has been built and is maintained by a thriving and diverse community of developers. Spark started in 2009 as a research project in the UC Berkeley RAD Lab, later to become the AMPLab. It was observed that MapReduce was inefficient for some iterative and interactive computing jobs, and Spark was designed in response. Spark’s aim is to be fast for interactive queries and iterative algorithms, bringing support for in-memory storage and efficient fault recovery. Iterative algorithms have always been hard for MapReduce, requiring multiple passes over the same data.
The two datasets used for demonstration are the users and transactions described above.
For this task we use Spark on a Hadoop YARN cluster. Our code reads and writes data from/to HDFS, so before starting work with the code we have to copy the input data into HDFS.
All code and data used in this post can be found in my Hadoop examples GitHub repository.
Before manipulating the data, we need to define a SparkContext. It is enough to set an app name and the location of a master node.
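A minimal sketch of that setup (the app name and local master URL here are illustrative assumptions, not taken from the original repository):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Minimal SparkContext setup: an application name plus a master URL.
// On the cluster this would be "yarn"; "local[2]" is convenient for local runs.
val conf = new SparkConf()
  .setAppName("SparkJoins")
  .setMaster("local[2]")
val sc = new SparkContext(conf)
```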
Spark’s core abstraction for working with data is the resilient distributed dataset (RDD). You can see this explicitly in the code: pair variables such as newUsersPair are RDDs — Key/Value RDDs, to be more precise.
An RDD in Spark is an immutable distributed collection of objects. Each RDD is split into multiple partitions, which may be computed on different nodes of the cluster. RDDs can contain any type of Python, Java, or Scala objects, including user-defined classes. Computations on RDDs are designed to feel like Scala’s native List operations.
In Spark all work is expressed as either creating new RDDs, transforming existing RDDs, or calling operations on RDDs to compute a result. Spark automatically distributes the data contained in RDDs across the cluster and parallelizes the operations that are performed on them.
Transforming existing RDDs is different from calling an action to compute a result. Actions trigger actual computation, whereas transformations are lazy: transformation code is not executed until a downstream action is called.
In our code we utilize a lot of Key/Value RDDs. Key/Value RDDs are commonly used to perform aggregations, such as countByKey(), and are useful for joins, such as leftOuterJoin().
In our case we use the action saveAsTextFile(), which writes the result to HDFS. Whereas a transformation only returns information about the format of the data after the transformation (because it doesn’t actually do anything), calling an action immediately produces logs about what is being done and the progress of the computation pipeline.
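A small illustration of this laziness (the file path here is a placeholder):

```scala
// map and filter are transformations: nothing runs when these lines execute.
val lengths = sc.textFile("hdfs:///input/users.txt")
  .map(_.length)
  .filter(_ > 0)

// count() is an action, so only this call triggers the actual computation.
val n = lengths.count()
```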
It’s really easy to see the transformation/action interplay by using the interactive Spark shell.
Transforming our data
The process of transforming the input text file into a Key/value RDD is rather self-explanatory:
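A sketch of what that looks like for our two inputs (the HDFS paths and field positions are assumptions based on the dataset layout described above):

```scala
// users:        id,email,language,location
// transactions: transaction-id,product-id,user-id,purchase-amount,item-description
val usersPair = sc.textFile("hdfs:///input/users.txt")
  .map(_.split(","))
  .map(u => (u(0).toInt, u(3)))        // (user_id, location)

val transactionsPair = sc.textFile("hdfs:///input/transactions.txt")
  .map(_.split(","))
  .map(t => (t(2).toInt, t(1).toInt))  // (user_id, product_id)
```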
After calling an action and computing a result, we transform it back into an RDD so that we can use the saveAsTextFile function to store the result in HDFS.
toSeq transforms the Map that the countByKey call in the processData function returns into an ArrayBuffer. This ArrayBuffer can then be given as input to SparkContext’s parallelize function to turn it back into an RDD.
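Sketched out (the joined RDD and the output path here are illustrative):

```scala
// countByKey() returns a plain Scala Map on the driver, not an RDD.
val counts: scala.collection.Map[Int, Long] = joined.countByKey()

// parallelize turns the local sequence back into an RDD,
// so the result can be written out with saveAsTextFile.
sc.parallelize(counts.toSeq)
  .saveAsTextFile("hdfs:///output/location-counts")
```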
Spark is designed with workflows like ours in mind, so join and key count operations are provided out of the box.
The leftOuterJoin() function joins two RDDs by key, which is why it was important that our RDDs are Key/Value RDDs. The result of the join is an RDD of the form RDD[(Int, (Int, Option[String]))].
The values() function lets us drop the key of the join (user_id), as it is not needed in the operations that follow the join.
The distinct() function then selects the distinct tuples from the values of the join, leaving an RDD of unique (product, location) pairs. Finally, countByKey() counts the number of locations in which each product was sold; it returns a Map from product to count on the driver.
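Putting the chain together, the pipeline might look like this (a sketch, assuming the (user_id, product_id) and (user_id, location) pair RDDs described earlier):

```scala
// transactionsPair: (user_id, product_id); usersPair: (user_id, location)
val result = transactionsPair
  .leftOuterJoin(usersPair)  // RDD[(Int, (Int, Option[String]))]
  .values                    // RDD[(Int, Option[String])]: (product, location)
  .distinct()                // unique (product, location) pairs
  .countByKey()              // Map[Int, Long]: unique locations per product
```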
Running the resulting jar
The best way to run a Spark job is with spark-submit.
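For example (the class name and jar path are placeholders for whatever your build produces):

```shell
spark-submit \
  --class com.example.ProductLocationJob \
  --master yarn \
  target/spark-example.jar
```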
As with the other frameworks, the idea was to follow the existing official tests in the Spark GitHub repository closely, in our case using ScalaTest and JUnit.
The test is fairly straightforward: at the end we check that the expected result equals the result obtained through Spark.
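A sketch of such a test using ScalaTest (the suite and fixture names are illustrative, assuming ScalaTest 3.1+ and a local master):

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.scalatest.funsuite.AnyFunSuite

class ProductLocationSpec extends AnyFunSuite {
  test("counts unique locations per product") {
    val sc = new SparkContext(
      new SparkConf().setAppName("test").setMaster("local[2]"))
    try {
      // (user_id, location) and (user_id, product_id) fixtures
      val users = sc.parallelize(Seq((1, "US"), (2, "UK")))
      val transactions = sc.parallelize(Seq((1, 100), (2, 100), (2, 200)))

      val counts = transactions
        .leftOuterJoin(users).values.distinct().countByKey()

      assert(counts(100) === 2) // product 100 sold in two locations
      assert(counts(200) === 1)
    } finally {
      sc.stop()
    }
  }
}
```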
Spark is used for a diverse range of applications. It contains different components: Spark Core, Spark SQL, Spark Streaming, MLlib, and GraphX. These libraries solve diverse tasks from data manipulation to performing complex operations on data.
In addition, Spark can run over a variety of cluster managers, including Hadoop YARN, Apache Mesos, and a simple cluster manager included in Spark itself called the Standalone Scheduler.
The tool is very versatile and worth learning due to its variety of uses. It’s easy to get started running Spark locally without a cluster, and then upgrade to a distributed deployment as needs increase.
O’Reilly’s ‘Learning Spark: Lightning-Fast Big Data Analysis’ by Holden Karau, Andy Konwinski, Patrick Wendell, and Matei Zaharia: Amazon Link.