Sometimes the Hive query you want to write cannot be expressed easily with Hive's built-in functions. By writing a user-defined function (UDF), you can plug your own processing code into Hive and invoke it from a Hive query.

There are three types of user-defined functions you can write for Hive:

  1. UDF – regular user-defined functions, the simplest to write; they operate on a single row and return a single value
  2. UDAF – user-defined aggregate functions, which aggregate many rows into one result and are typically used with GROUP BY
  3. UDTF – user-defined table-generating functions, which turn a single row into multiple rows, like Hive's built-in "explode" function

This post walks through an example of how to write a UDAF. Here is the code:

package com.effectivemeasure.hive.udaf;

import org.apache.hadoop.hive.ql.exec.UDAF;
import org.apache.hadoop.hive.ql.exec.UDAFEvaluator;

import java.util.HashMap;

/**
 * This class groups key/value pairs and returns a Map<Integer, Integer> per group-by key.
 *
 * For example, the query "SELECT id, GROUP_MAP(choice_id, question_id) FROM table GROUP BY id"
 *
 * could return values such as:
 *
 * 2f5d017334da977740ee723-61885258 -> {3165:411, 2:1, 3162:410, 3159:409}
 *
 * @author Eric Lin
 */
public final class GroupMap extends UDAF
{
    public static class Evaluator implements UDAFEvaluator
    {
        private HashMap<Integer, Integer> buffer;

        public Evaluator()
        {
            init();
        }

        /**
         * Initializes the evaluator and resets its internal state.
         */
        public void init()
        {
            buffer = new HashMap<Integer, Integer>();
        }

        /**
         * This function is called once for every row (value) to be aggregated.
         * The parameters match the arguments passed to the function in the Hive query.
         *
         * @param key Integer
         * @param value Integer
         * @return Boolean
         */
        public boolean iterate(Integer key, Integer value)
        {
            // keep only the first value seen for each key
            if (!buffer.containsKey(key)) {
                buffer.put(key, value);
            }

            return true;
        }

        /**
         * Called when Hive needs the partial aggregation result computed so far on one node (partial aggregation)
         *
         * @return HashMap
         */
        public HashMap<Integer, Integer> terminatePartial()
        {
            return buffer;
        }

        /**
         * Called when merging partial results calculated on other data nodes into this evaluator
         *
         * @param another HashMap
         * @return Boolean
         */
        public boolean merge(HashMap<Integer, Integer> another)
        {
            //null might be passed in case there is no input data.
            if (another == null) {
                return true;
            }

            for(Integer key : another.keySet()) {
                if(!buffer.containsKey(key)) {
                    buffer.put(key, another.get(key));
                }
            }

            return true;
        }

        /**
         * This function is called when the final result of the aggregation is needed
         *
         * @return HashMap
         */
        public HashMap<Integer, Integer> terminate()
        {
            if (buffer.isEmpty()) {
                return null;
            }

            return buffer;
        }
    }
}
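
Outside of a cluster, it can help to see how Hive drives these five methods. The sketch below is a standalone simulation (it reimplements the same first-value-wins logic without the Hive base classes, so it compiles and runs on its own): two "map task" evaluators each aggregate a slice of one group's rows, and a "reduce task" merges their partial results.

```java
import java.util.HashMap;

// Standalone sketch of the UDAF evaluator lifecycle. In production Hive
// itself calls init/iterate/terminatePartial/merge/terminate; here we
// drive the same logic by hand, mirroring GroupMap.Evaluator minus the
// Hive base classes.
public class GroupMapDemo {
    static class Evaluator {
        private HashMap<Integer, Integer> buffer = new HashMap<Integer, Integer>();

        void iterate(Integer key, Integer value) {
            if (!buffer.containsKey(key)) {   // first value per key wins
                buffer.put(key, value);
            }
        }

        HashMap<Integer, Integer> terminatePartial() {
            return buffer;
        }

        void merge(HashMap<Integer, Integer> another) {
            if (another == null) {
                return;
            }
            for (Integer key : another.keySet()) {
                if (!buffer.containsKey(key)) {
                    buffer.put(key, another.get(key));
                }
            }
        }

        HashMap<Integer, Integer> terminate() {
            return buffer.isEmpty() ? null : buffer;
        }
    }

    public static void main(String[] args) {
        // Two "map tasks" each aggregate a slice of one group's rows.
        Evaluator task1 = new Evaluator();
        task1.iterate(3165, 411);
        task1.iterate(2, 1);

        Evaluator task2 = new Evaluator();
        task2.iterate(3162, 410);
        task2.iterate(3165, 999); // duplicate key: dropped when merged

        // The "reduce task" merges the partial results and finalizes.
        Evaluator reducer = new Evaluator();
        reducer.merge(task1.terminatePartial());
        reducer.merge(task2.terminatePartial());
        System.out.println(reducer.terminate());
        // contains 3165=411, 2=1 and 3162=410 (the duplicate 3165=999 is ignored)
    }
}
```

Note how the duplicate key survives with the value from the first partial merged in, which is exactly the behavior iterate and merge enforce in the real UDAF.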

Key points:

  • A UDAF must be a subclass of org.apache.hadoop.hive.ql.exec.UDAF
  • It must contain one or more nested static classes implementing org.apache.hadoop.hive.ql.exec.UDAFEvaluator
  • The five required functions "init", "iterate", "terminatePartial", "merge" and "terminate" are explained in the code comments above.

To run it:

  1. Compile the Java code and package it into a JAR file
  2. Copy the JAR file to the machine where you run the Hive client, under any location, e.g. /tmp
  3. Run the Hive command "ADD JAR /tmp/my-udaf.jar"
  4. Run the Hive command "CREATE TEMPORARY FUNCTION group_map AS 'com.effectivemeasure.hive.udaf.GroupMap'" (note that the fully qualified class name, including the class itself, is required)
  5. Finally, use the function as normal in a Hive query: "SELECT id, GROUP_MAP(choice_id, question_id) FROM table GROUP BY id"
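
Before deploying, the result you should expect from that query for a single id group can be sanity-checked in plain Java: per group, GROUP_MAP behaves like collecting (choice_id, question_id) pairs into a map where the first value wins on duplicate keys. The sample rows below are invented for illustration.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

// Illustrates the result shape of GROUP_MAP for one group-by key:
// each (choice_id, question_id) pair is collected into a map, keeping
// the first value seen for duplicate keys -- the same rule the UDAF
// applies in iterate() and merge().
public class GroupMapResultDemo {
    public static void main(String[] args) {
        List<int[]> rows = Arrays.asList(           // (choice_id, question_id)
                new int[]{3165, 411},
                new int[]{2, 1},
                new int[]{3165, 999});              // duplicate choice_id

        Map<Integer, Integer> result = rows.stream()
                .collect(Collectors.toMap(
                        r -> r[0],                  // key: choice_id
                        r -> r[1],                  // value: question_id
                        (first, second) -> first)); // first value wins

        System.out.println(result);
    }
}
```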

That’s it, now you have a custom UDAF running on your cluster.
