Hadoop Hive UDF Tutorial - Extending Hive with Custom Functions
This is part 1/3 in my tutorial series for extending Apache Hive.
There are two different interfaces you can use for writing UDFs for Apache Hive. One is really simple, the other… not so much.
The simple API (org.apache.hadoop.hive.ql.exec.UDF) can be used so long as your function reads and returns primitive types. By this I mean basic Hadoop & Hive writable types: Text, IntWritable, LongWritable, DoubleWritable, etc.
However, if you plan on writing a UDF that can manipulate embedded data structures, such as Set, then you’re stuck using org.apache.hadoop.hive.ql.udf.generic.GenericUDF, which is a little more involved.
- Simple API - org.apache.hadoop.hive.ql.exec.UDF
- Complex API - org.apache.hadoop.hive.ql.udf.generic.GenericUDF
I’m going to walk through an example of building a UDF in each interface. I will provide code and tests for everything I do.
If you want to browse the code, fork it on GitHub.
The Simple API
Building a UDF with the simpler UDF API involves little more than writing a class with one function (evaluate). Here is an example:
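The original listing is not reproduced in this copy, so here is a minimal sketch of the shape such a class takes. To keep it runnable without Hive on the classpath, it uses String where a real UDF would use org.apache.hadoop.io.Text, and it does not extend org.apache.hadoop.hive.ql.exec.UDF. The class name HelloUdfSketch and the greeting are illustrative assumptions, not the code from the repository.

```java
// Dependency-free sketch of a simple UDF's shape (assumption: not the original listing).
// In Hive, this class would extend org.apache.hadoop.hive.ql.exec.UDF, and
// evaluate would take and return org.apache.hadoop.io.Text instead of String.
class HelloUdfSketch {

    // Hive looks up the method named "evaluate" by reflection and calls it per row.
    public String evaluate(String input) {
        return "Hello " + input;
    }

    public static void main(String[] args) {
        HelloUdfSketch udf = new HelloUdfSketch();
        System.out.println(udf.evaluate("world")); // prints "Hello world"
    }
}
```

The whole contract is the evaluate method: its parameter and return types tell Hive which column types the function accepts and produces.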
Testing a simple UDF
Because the UDF is a single function, you can test it with regular testing tools, like JUnit.
Gotcha! Also Test in the Hive Console
You should also test the UDF in Hive directly, especially if you’re not totally sure that the function deals with the right types.
In fact, this UDF has a bug: it doesn’t check for null arguments. Nulls can be pretty common in big datasets, so plan appropriately.
In response, I added a simple null check to the function code and included a second test to verify it.
Running the tests with mvn test confirms that everything passes.
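The null check and its two tests can be sketched roughly as follows. This is a plain-Java stand-in so that it runs on its own: String replaces Text, the class name NullSafeHelloUdfSketch is hypothetical, and a hand-rolled check helper replaces JUnit. Returning null from evaluate is how a UDF hands a SQL NULL back to Hive.

```java
import java.util.Objects;

// Plain-Java stand-in for the null-safe UDF (assumption: String replaces Hadoop's Text).
class NullSafeHelloUdfSketch {

    // A NULL column value arrives as a Java null, so guard before touching the input.
    public String evaluate(String input) {
        if (input == null) {
            return null; // propagate NULL instead of failing or producing "Hello null"
        }
        return "Hello " + input;
    }

    // Small helper standing in for JUnit's assertEquals.
    static void check(Object expected, Object actual) {
        if (!Objects.equals(expected, actual)) {
            throw new AssertionError("expected " + expected + " but got " + actual);
        }
    }

    public static void main(String[] args) {
        NullSafeHelloUdfSketch udf = new NullSafeHelloUdfSketch();
        check("Hello world", udf.evaluate("world")); // the happy-path test
        check(null, udf.evaluate(null));             // the added null-argument test
        System.out.println("all checks passed");
    }
}
```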
The Complex API
The org.apache.hadoop.hive.ql.udf.generic.GenericUDF API provides a way to write code for objects that are not writable types, for example struct, map, and array types.
This API requires you to manually manage object inspectors for the function arguments, and to verify the number and types of the arguments you receive. An object inspector provides a consistent interface for underlying object types, so that different object implementations can all be accessed in the same way from within Hive (e.g. you could implement a struct as a Map, so long as you provide a corresponding object inspector).
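To make the object inspector idea concrete, here is a small dependency-free sketch. None of these types are Hive's: StructInspectorSketch, MapBackedInspector, and ArrayBackedInspector are hypothetical names invented for illustration (Hive's real interfaces live in org.apache.hadoop.hive.serde2.objectinspector and carry more methods). The point is only that the calling code reads a field the same way whether the struct is stored as a Map or as an array.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified inspector interface (not Hive's API).
interface StructInspectorSketch {
    Object getField(Object struct, String fieldName);
}

// Reads fields from a struct that happens to be stored as a Map.
class MapBackedInspector implements StructInspectorSketch {
    @Override
    public Object getField(Object struct, String fieldName) {
        return ((Map<?, ?>) struct).get(fieldName);
    }
}

// Reads fields from a struct stored as a positional array.
class ArrayBackedInspector implements StructInspectorSketch {
    private final List<String> fieldNames;

    ArrayBackedInspector(String... fieldNames) {
        this.fieldNames = Arrays.asList(fieldNames);
    }

    @Override
    public Object getField(Object struct, String fieldName) {
        int index = fieldNames.indexOf(fieldName);
        return index < 0 ? null : ((Object[]) struct)[index];
    }
}

class InspectorDemo {
    // Calling code only talks to the inspector, never to the concrete representation.
    static Object readName(Object struct, StructInspectorSketch inspector) {
        return inspector.getField(struct, "name");
    }

    public static void main(String[] args) {
        Map<String, Object> asMap = new HashMap<>();
        asMap.put("name", "hive");
        asMap.put("version", 3);
        Object asArray = new Object[] {"hive", 3};

        System.out.println(readName(asMap, new MapBackedInspector()));
        System.out.println(readName(asArray, new ArrayBackedInspector("name", "version")));
    }
}
```

Both calls print hive, even though the two structs have nothing in common at the Java level.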
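The API requires you to implement three methods:
- initialize(ObjectInspector[] arguments), which receives the object inspectors for the arguments and returns the object inspector for the result.
- evaluate(DeferredObject[] arguments), which is called with the argument values and returns the result.
- getDisplayString(String[] children), which returns a description of the function call that Hive can display, for example in EXPLAIN output.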
This probably doesn’t make any sense without an example, so let’s jump into that.
I’m going to walk through the creation of a function called containsString that takes two arguments:
- A list of Strings
- A String
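and returns true/false on whether the list contains the string that we provide. For example, containsString(List("a", "b", "c"), "b") returns true, while containsString(List("a", "b", "c"), "d") returns false.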
Unlike the UDF API, the GenericUDF API requires a little more boilerplate.
The call pattern for a UDF is the following:
- The UDF is initialized using a default constructor.
- udf.initialize() is called with the array of object inspectors for the UDF arguments (ListObjectInspector, StringObjectInspector).
  - We check that we have the right number of arguments (2), and that they are the right types (as above).
  - We store the object inspectors for use in evaluate().
  - We return an object inspector so Hive can read the result of the function (BooleanObjectInspector).
- evaluate() is called for each row in your query with the arguments provided (e.g. evaluate(List("a", "b", "c"), "c")).
  - We extract the values using the stored object inspectors.
  - We do our magic and return a value that aligns with the object inspector returned from initialize (list.contains(element)).
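The call pattern above can be modelled in plain Java as a rough sketch. This is not the GenericUDF implementation from the repository: ListInspectorSketch and StringInspectorSketch are hypothetical stand-ins for Hive's ListObjectInspector and StringObjectInspector, the "boolean" return value stands in for returning a boolean object inspector, and returning null for a null argument is an assumption about the intended behaviour.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for Hive's list and string object inspectors.
interface ListInspectorSketch { List<?> getList(Object data); }
interface StringInspectorSketch { String getString(Object data); }

// Dependency-free model of the call pattern; a real implementation would
// extend org.apache.hadoop.hive.ql.udf.generic.GenericUDF.
class ContainsStringSketch {
    private ListInspectorSketch listInspector;
    private StringInspectorSketch elementInspector;

    // Called once, before any rows: validate the arguments and keep their inspectors.
    public String initialize(Object[] inspectors) {
        if (inspectors.length != 2) {
            throw new IllegalArgumentException("containsString takes exactly 2 arguments");
        }
        if (!(inspectors[0] instanceof ListInspectorSketch)
                || !(inspectors[1] instanceof StringInspectorSketch)) {
            throw new IllegalArgumentException("expected (list of strings, string)");
        }
        listInspector = (ListInspectorSketch) inspectors[0];
        elementInspector = (StringInspectorSketch) inspectors[1];
        return "boolean"; // stands in for returning a boolean object inspector
    }

    // Called once per row: unwrap the raw values, then do the actual work.
    public Boolean evaluate(Object[] arguments) {
        List<?> list = listInspector.getList(arguments[0]);
        String element = elementInspector.getString(arguments[1]);
        if (list == null || element == null) {
            return null; // assumption: a NULL argument yields a NULL result
        }
        return list.contains(element);
    }

    public static void main(String[] args) {
        ContainsStringSketch udf = new ContainsStringSketch();      // default constructor
        ListInspectorSketch li = data -> (List<?>) data;
        StringInspectorSketch si = data -> (String) data;
        udf.initialize(new Object[] {li, si});                       // initialize once
        Object row = Arrays.asList("a", "b", "c");
        System.out.println(udf.evaluate(new Object[] {row, "c"}));   // prints "true"
        System.out.println(udf.evaluate(new Object[] {row, "d"}));   // prints "false"
        System.out.println(udf.evaluate(new Object[] {row, null}));  // prints "null"
    }
}
```

Splitting the work this way means the argument checks run once up front, and evaluate can stay small because it trusts the inspectors stored by initialize.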
The only complex part of testing the function is in the setup. Once the call order is clear and we know how to build object inspectors, it’s pretty easy.
My test reflects the functional examples provided earlier, with an additional null argument check.
Again, all the code in this blog post is open source and on GitHub.
Hopefully this article has given you an idea of how to extend Hive with custom functions.
Although I ignored them in this article, there are also User Defined Aggregation Functions (UDAFs), which allow the processing and aggregation of many rows in a single function. If you’re interested in learning more, there are a few resources online on this topic which can help.
Read More about Hive
Programming Hive contains brief tutorials and code samples for both UDFs and UDAFs. The examples are different to mine, so they should help in building your understanding.