I have seen lots of Hadoop users are not using Big Data technology correctly. Hadoop is designed for Big Data, so it works well with large file sizes and this is why we have block size for HDFS as 256MB or 512MB depending on use cases. However, lots of users, even from big corporate companies are not utilizing such technology by having lots of small files and partitions for a given Hive table. Some particular users have millions of partitions and hundreds of millions of files stored in HDFS, each file is in KB of size. This puts lots of pressure on all components in the Hadoop echo system, including HDFS, HiveServer2, Impala, ZooKeeper just to name a few.
In this particular post, I am going to discuss one of the side effect of such usage pattern that it will crash ZooKeeper with OutOfMemory error, combined with large string of Hive query being run.
The following was what happened:
1. User runs a Hive query with very long string (contains 100KB of characters)
2. This Hive query runs against a table with millions of partitions
3. The query will scan through about 20,000 partition
4. Since hive.support.concurrency=true, when the query is running, Hive will try to create one ZNode per partition in ZooKeeper to indicate that those partitions are locked
5. Hive will also store the full Hive query string against each ZNode for debugging purpose, so that when issue happened, user can check ZNode and see which query locks the partition
6. So we have 20,000 partitions * 100K each, we will end up creating 2GB of data in ZooKeeper, just for this query alone
7. If we have multiple similar queries, ZooKeeper can reach to memory limit easily in no time
To overcome this problem, Hive introduced a new feature to control the number of characters to be stored against each ZNode in such scenario, via upstream JIRA HIVE-16334
. This JIRA has been backported into CDH since 5.12.0.
However, the default size is 1MB (1,000,000 bytes), which is still big and above case will still happen. To work around this issue, we can simply reduce the number of Hive query being stored, say to 10K. (Storing of the query string is purely for debugging purpose, so in theory we can reduce to a very small size, but probably not a good idea if you want to troubleshoot other issues, so 10K should be a good starting point).
To do so, please follow below steps (assuming that you are using Cloudera Manager):
1. Go go CM > Hive > Configuration > HiveServer2 Advanced Configuration Snippet (Safety Valve) for hive-site.xml
2. Enter below into textarea (view as XML):
The maximum length of the query string to store in the lock. Set it to 10K.
3. Save and restart HiveServer2
After that, we should have less chance of hitting ZooKeeper OutOfMemory in the above scenario. However, the root cause was due to too many partitions, so the first priority is to reduce as much as possible so that each query will not scan more than 1000 partitions to get good performance.
For users using CDH older than CDH 5.12.0, suggestion is to upgrade.
Hope above helps.