Hive supports a couple of ways to read JSON data, however, I think the easiest way is to use custom JsonSerDe library. The article illustrated how to use this library to query JSON data stored in HDFS using Hive. 1. Download binary jar file based on your CDH version at: http://www.congiu.net/hive-json-serde/1.3/ to /tmp 2. Assumeing you have the following json data in HDFS under /tmp/json-test/test.json, one per line. Please note that no newline is allowed within each record. [generic] { “details”: { “first_name” : “First Name”, “last_name” : “Last Name” }, orders: [ {“id” : 1}, {“id” : 2} ] } { “details”: { “first_name” : “Foo”, “last_name” : “Bar” }, orders: [ {“id” : 4}, {“id” : 43}, {“id”: 111} ] } 3. Enter hive command line either using Beeline or Hive 4. Enter the following commands:
ADD JAR /tmp/json-serde-1.3-jar-with-dependencies.jar;

DROP TABLE IF EXISTS customer;

CREATE EXTERNAL TABLE IF NOT EXISTS customer ( 
    details struct,
    orders array>
) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
LOCATION '/tmp/json-test';
5. Now you can query the data using the following HQL:
SELECT 
    CONCAT(details.first_name, ' ', details.last_name), 
    orders[0].id 
FROM customer;

SELECT 
    CONCAT(details.first_name, ' ', details.last_name), 
    size(orders) 
FROM customer;
You can see how easy it is now to query JSON data using JsonSerDe and you can use build-in functions available for struct and arrays.

Leave a Reply

Your email address will not be published. Required fields are marked *