Apache Hive Customization Tutorial Series

Apache Hive is a SQL-on-Hadoop framework that can use either MapReduce or Tez to execute queries. You can extend Hive with your own code. Its API is very flexible, so you can write code to do a whole bunch of things. Unfortunately, that flexibility comes at the expense of complexity.

Hive has three function APIs: UDF, UDTF, and UDAF. Each does something very different, and only with a solid grasp of all three will you truly be able to bend Hive to your will. Below are links to tutorials for each function type.

Hive Tutorials

Normal Functions (UDF)

Normal functions take inputs from a single row and output a single value. Examples of built-in functions include unix_timestamp(), round(), and cos().
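To make the "one row in, one value out" shape concrete, here is a minimal sketch in plain Java. It is a conceptual model only and does not use the Hive UDF API; the class name ScalarFunctionSketch and its methods are hypothetical names I made up for illustration.

```java
import java.util.List;
import java.util.stream.Collectors;

// Conceptual model only: plain Java, NOT the Hive UDF API.
// All names here are hypothetical and exist purely for illustration.
final class ScalarFunctionSketch {

    // A scalar function: values from one row in, exactly one value out.
    static double roundTo(double value, int places) {
        double scale = Math.pow(10, places);
        return Math.round(value * scale) / scale;
    }

    // Applying the function to a column yields one output per input row,
    // so the row count never changes.
    static List<Double> applyToColumn(List<Double> column, int places) {
        return column.stream()
                .map(v -> roundTo(v, places))
                .collect(Collectors.toList());
    }
}
```

The key property is that the output list always has the same length as the input list, which is what separates a normal function from the table and aggregate functions described below.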

Click here for my UDF tutorial

Table Functions (UDTF)

Table functions are similar to normal functions (UDFs), but they can output both multiple columns AND multiple rows of data (which is pretty nifty). Examples of built-in table functions include explode(), json_tuple(), and inline().
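As a rough illustration of "one row in, many rows and columns out", the sketch below turns a single string into zero or more (key, value) rows. It is plain Java, not the Hive UDTF API, and the class name, method name, and "k=v;k=v" input format are all assumptions of mine.

```java
import java.util.ArrayList;
import java.util.List;

// Conceptual model only: plain Java, NOT the Hive UDTF API.
// The names and the "k1=v1;k2=v2" input format are hypothetical.
final class TableFunctionSketch {

    // A table function: one input value in, zero or more output rows out.
    // Each output row has two columns: key and value.
    static List<String[]> explodePairs(String input) {
        List<String[]> rows = new ArrayList<>();
        if (input == null || input.isEmpty()) {
            return rows; // no pairs means no output rows at all
        }
        for (String pair : input.split(";")) {
            String[] kv = pair.split("=", 2);
            if (kv.length == 2) {
                rows.add(new String[] {kv[0], kv[1]});
            }
        }
        return rows;
    }
}
```

Unlike the normal function case, the number of output rows depends on the input: an empty string produces nothing, while "a=1;b=2" produces two rows.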

Click here for my UDTF tutorial

Aggregate Functions (UDAF)

Aggregate functions operate over many rows at once (a whole table, or each group in a GROUP BY) and reduce them to a single value. This sounds confusing, but it's very useful in practice. Examples of built-in aggregate functions include sum(), count(), min(), and histogram_numeric().
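Here is a minimal sketch of the "many rows in, one value out" idea, using an average as the example. It is plain Java rather than the Hive UDAF API, and the class and method names are hypothetical. I am also assuming that rows are processed in separate chunks whose partial results get combined, which is why the sketch keeps a count and a sum instead of a running average.

```java
// Conceptual model only: plain Java, NOT the Hive UDAF API.
// Class and method names are hypothetical, chosen for illustration.
final class AverageSketch {
    private long count = 0;
    private double sum = 0.0;

    // Fold one input row into the running state; nulls are skipped.
    void iterate(Double value) {
        if (value != null) {
            count++;
            sum += value;
        }
    }

    // Combine the partial state from another chunk of rows.
    void merge(AverageSketch other) {
        count += other.count;
        sum += other.sum;
    }

    // Produce the single final value, or null if no rows were seen.
    Double terminate() {
        return count == 0 ? null : sum / count;
    }
}
```

A usage example would be to feed rows 1.0 and 2.0 into one instance, 3.0 and 4.0 into another, merge the second into the first, and call terminate() to get the average over all four rows.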

Click here for my UDAF tutorial

Matthew Rathbone

CEO of Beekeeper Data. British. Data Nerd. Lucky husband and father. More about me