Welcome back to the fifth episode of my Ranger series. In this post, I will focus on what the Hive Plugin is in Ranger, how it works, and what happens behind the scenes when you create or modify tables in Hive. If you have missed the previous episodes, please review them below to get a basic understanding of what Ranger is:
- Introduction to Apache Ranger – Part I – Ranger vs Sentry
- Introduction to Apache Ranger – Part II – Architecture Overview
- Introduction to Apache Ranger – Part III – Security Zone
- Introduction to Apache Ranger – Part IV – Resource vs Tag Based Policies
In case you are not already aware, for Ranger policies to work on any component that you want to apply authorization to, you have to install the Ranger Plugin for that component. For example, in order to apply authorization to Hive entities, like databases, tables or columns, you have to install the Ranger Plugin in Hive. You also need plugins for HDFS, HBase, Kafka, etc. For simplicity, I will focus on the Hive Plugin in this post.
So, what is the plugin and what does it do? Well, Ranger’s Hive Plugin is basically a small piece of code attached to HiveServer2 that performs the extra functions required for Ranger policies to work. I have drawn the chart below to illustrate the relationships between the components:
Let me explain each part of the chart in more detail so that we can get a better understanding of what is happening.
Let's start from the very beginning. When a user runs a CREATE TABLE statement in Beeline, the query is submitted to HiveServer2 for processing. Before HiveServer2 runs the query, it checks the policy cache file (denoted by 1 in the chart) to make sure that the user who submitted the query has the appropriate permission to perform the task. Once the authorization check passes, the query is executed and the table is created.
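To make the idea of "checking the policy cache" more concrete, here is a minimal Python sketch. Please treat it as an illustration only: the `Policy` class, its fields and the `is_allowed` function are names I made up for this example, and the real cache kept by the plugin has a much richer format (groups, roles, deny rules, wildcards and so on).

```python
from dataclasses import dataclass, field


@dataclass
class Policy:
    """Hypothetical, simplified policy record (NOT Ranger's real schema)."""
    database: str
    table: str  # "*" matches any table in the database
    users: set = field(default_factory=set)
    actions: set = field(default_factory=set)


def is_allowed(cache, user, database, table, action):
    """Return True if any cached policy grants `action` on the resource to `user`."""
    for policy in cache:
        if policy.database != database:
            continue
        if policy.table not in ("*", table):
            continue
        if user in policy.users and action in policy.actions:
            return True
    return False  # deny by default when no policy matches


# A tiny in-memory stand-in for the local policy cache.
policy_cache = [
    Policy("sales", "*", users={"alice"}, actions={"create", "select"}),
    Policy("sales", "orders", users={"bob"}, actions={"select"}),
]

alice_can_create = is_allowed(policy_cache, "alice", "sales", "new_table", "create")
bob_can_create = is_allowed(policy_cache, "bob", "sales", "new_table", "create")
print(alice_can_create, bob_can_create)
```

The point to take away is that the decision is made locally against cached data, so no round trip to the Ranger server is needed for each query.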
Upon successful creation of the new table, two things are triggered by Ranger’s Hive Plugin:
- An audit event is sent to Solr and/or HDFS, depending on the configuration (denoted by 2 in the chart)
- A Kafka event is sent to the topic “ATLAS_HOOK” (denoted by 3 in the chart) to record that a new entity has been created, so effectively Ranger’s Hive Plugin is the producer for the “ATLAS_HOOK” topic in Kafka
Please note that for both of the above to work, appropriate policies should be set up in Ranger to allow the “hive” user to write to the “ATLAS_HOOK” topic in Kafka, create indexes in Solr and write data to HDFS (assuming that Hive’s impersonation is turned OFF). Otherwise, those operations will fail. If that happens, check HiveServer2’s server log (under /var/log/hive by default), where you can find the corresponding errors.
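The two events described above can be sketched roughly as follows. This is only a toy model under my own assumptions: `InMemorySink` stands in for Solr, HDFS and the Kafka topic, and the event field names are invented for illustration rather than copied from the real audit or notification formats.

```python
import json
import time


class InMemorySink:
    """Stand-in for a real destination (Solr, HDFS or a Kafka topic)."""

    def __init__(self, name):
        self.name = name
        self.messages = []

    def send(self, message):
        # Serialise to JSON so that every sink stores plain strings.
        self.messages.append(json.dumps(message, sort_keys=True))


def build_audit_event(user, action, resource, allowed):
    """Hypothetical audit record; the field names are illustrative only."""
    return {
        "user": user,
        "action": action,
        "resource": resource,
        "result": "ALLOWED" if allowed else "DENIED",
        "event_time": int(time.time()),
    }


def build_entity_notification(user, resource):
    """Hypothetical 'new entity' message for the ATLAS_HOOK topic."""
    return {"type": "ENTITY_CREATE", "user": user, "entity": resource}


def on_table_created(user, resource, audit_sinks, hook_topic):
    """Fan the audit event out to every audit sink, then notify Atlas."""
    audit = build_audit_event(user, "create", resource, allowed=True)
    for sink in audit_sinks:
        sink.send(audit)
    hook_topic.send(build_entity_notification(user, resource))


solr = InMemorySink("solr")
hdfs = InMemorySink("hdfs")
atlas_hook = InMemorySink("ATLAS_HOOK")

on_table_created("alice", "sales.new_table", [solr, hdfs], atlas_hook)
print(len(solr.messages), len(hdfs.messages), len(atlas_hook.messages))
```

In a real cluster each `send` is a write performed as the “hive” user, which is why the permissions mentioned above matter.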
Once the event reaches “ATLAS_HOOK” in Kafka, Atlas, as the consumer of this topic, will pick up the event and update its database to include the new entity/table (denoted by 4 in the chart), so that Atlas admin users can see the new entity in the web UI, view lineage information, attach Tags/Classifications, etc.
Once the audit event reaches Solr and is indexed properly, an admin user who goes to the Audits page in the Ranger web UI will be able to view the entry there (denoted by 5 in the chart).
Please note that the data in HDFS is for backup only and is not used by any service. Also, the audit data in Solr will expire after 90 days by default.
At this point, the series of events that follow the creation of a new Hive table is complete. However, other things can still happen, and their data needs to reach HiveServer2’s policy cache for everything to work correctly. Let’s look at what they are.
As I explained in my Introduction to Apache Ranger – Part IV – Resource vs Tag Based Policies post, Tags (called Classifications in Atlas) are managed by Atlas. An Atlas administrator can create, modify or delete Tags, and when that happens, all entities associated with those Tags/Classifications are sent as events to another Kafka topic called “ATLAS_ENTITIES” (denoted by 6 in the chart). These events are picked up by the Ranger TagSync service (denoted by 7 in the chart), and in turn Ranger’s backend DB is updated (denoted by 8 in the chart). Again, this requires an appropriate Ranger policy to be set up to allow the “atlas” user to write to the “ATLAS_ENTITIES” topic in Kafka.
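As a rough sketch of what TagSync does with those events, the snippet below keeps an entity-to-tags mapping up to date. The event shape (`op`, `entity`, `tags`) and the function name `apply_entity_event` are assumptions made for this example, and the plain dictionary is just a stand-in for Ranger’s backend DB.

```python
def apply_entity_event(tag_store, event):
    """Update the entity -> tags mapping from one (hypothetical) event."""
    entity = event["entity"]
    if event["op"] == "CLASSIFICATION_ADD":
        tag_store.setdefault(entity, set()).update(event["tags"])
    elif event["op"] == "CLASSIFICATION_DELETE":
        # Removing tags from an unknown entity is simply a no-op.
        tag_store.get(entity, set()).difference_update(event["tags"])
    return tag_store


# Events as they might arrive, in order, from the ATLAS_ENTITIES topic.
events = [
    {"op": "CLASSIFICATION_ADD", "entity": "sales.customers", "tags": ["PII"]},
    {"op": "CLASSIFICATION_ADD", "entity": "sales.customers", "tags": ["FINANCE"]},
    {"op": "CLASSIFICATION_DELETE", "entity": "sales.customers", "tags": ["FINANCE"]},
]

tag_store = {}  # stand-in for the tag tables in Ranger's database
for event in events:
    apply_entity_event(tag_store, event)

print(tag_store)
```

Tag-based policies are then evaluated against this mapping, which is why a Tag change in Atlas only takes effect in Hive after it has travelled through Kafka, TagSync and the plugin’s next policy download.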
Ranger also has another service called UserSync, which can be configured to sync user and group information from LDAP and store it in Ranger’s database.
Once the Tag information is updated in Ranger and users and groups are synced correctly, together with any updates to Resource Based Policies, the Hive Plugin on the HiveServer2 side will pull the updated policy information down to its local cache. This happens every 30 seconds by default (denoted by 9 in the chart), so that new requests are checked against the latest policies. It also means that if Ranger is down for whatever reason, authorization on the plugin side, such as in HiveServer2, will continue as normal using the cached policies. As soon as Ranger is back up and running again, the Hive Plugin will resume pulling policies and continue its operations.
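The "keep working from the cache while Ranger is down" behaviour can be sketched like this. It is a simplified model, not the plugin’s actual code: `PolicyRefresher`, `fetch_from_admin` and the `admin` dictionary are all hypothetical names, and the Ranger server is simulated in memory.

```python
import time


class PolicyRefresher:
    """Polls a policy source and keeps the last good copy on failure."""

    def __init__(self, fetch_policies, poll_interval_seconds=30):
        self.fetch_policies = fetch_policies
        self.poll_interval_seconds = poll_interval_seconds
        self.cache = []
        self.last_error = None

    def refresh_once(self):
        try:
            self.cache = self.fetch_policies()
            self.last_error = None
        except ConnectionError as exc:
            # Server unreachable: keep serving the previously cached policies.
            self.last_error = exc
        return self.cache

    def run(self, cycles, sleep=time.sleep):
        """Refresh `cycles` times, pausing between attempts."""
        for _ in range(cycles):
            self.refresh_once()
            sleep(self.poll_interval_seconds)


# Simulated Ranger server state (hypothetical).
admin = {"up": True, "policies": ["policy-v1"]}


def fetch_from_admin():
    if not admin["up"]:
        raise ConnectionError("Ranger is unreachable")
    return list(admin["policies"])


refresher = PolicyRefresher(fetch_from_admin)
initial = refresher.refresh_once()         # first download succeeds

admin["up"] = False
admin["policies"] = ["policy-v2"]
during_outage = refresher.refresh_once()   # still the old cached copy

admin["up"] = True
after_recovery = refresher.refresh_once()  # picks up the new policies

print(initial, during_outage, after_recovery)
```

One consequence worth keeping in mind is that a policy changed during an outage is not enforced until the first successful refresh after Ranger comes back.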
Finally, the cycle goes back to the beginning again as Hive users create or update databases, tables or columns in Hive.
This concludes my introduction to Ranger’s Hive Plugin. If you have any questions or comments, please leave them below and let me know.
Thanks for reading!!