Impala query failed with error “Specified block size is less than configured minimum value”

Currently Impala supports the Parquet file format pretty well. For those of you not familiar with it, Parquet is a free and open-source column-oriented data store in the Apache Hadoop ecosystem, and it provides easy and quick access to data with a large number of columns. For more details, please refer to the official Apache Parquet website.

However, even though Parquet is good at storing data with a large number of columns and can retrieve column data quickly, there is still a limit on how many columns you can store before processing engines like Impala stop working properly.

Recently I discovered a bug in Impala: when you have too many columns (more than 10,000), an Impala query will fail with the error below:

hdfsOpenFile(hdfs://nameservice1/user/hive/warehouse/default.db/table/_impala_insert_staging/f44e0332a3ec1af9_55c692eb00000000/.dh4e0332q3ac1af9-55c692wb00000003_1471427586_dir/dh4e0332q3ac1af9-55c692wb00000003_1471427586_data.0.parq): 
FileSystem#create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) error:
RemoteException: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): -1130396776 < 1048576
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2705)
at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2658)

The reason is that when a Parquet table has 10K+ columns, Impala tries to estimate the memory required to process the data, and the calculation overflows the 32-bit integer variable used in Impala's code, so a negative value is returned, which causes the error we saw above. For example, the negative value in the error, -1130396776, is exactly what a computed size of 3,164,570,520 bytes (about 3GB) becomes once it wraps around the signed 32-bit limit of 2,147,483,647. This has been reported in the upstream JIRA: IMPALA-7044.

There is no workaround for the issue at this stage, other than reducing the number of columns in the Parquet table.

Currently the maximum number of columns Impala can handle is around 8K-10K, depending on the column types, so the table has to be re-designed to use fewer columns.
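If you still need all the data to be queryable, one option is to split the columns across several narrower Parquet tables that share a row key and join them back when needed. Below is a minimal sketch with hypothetical table and column names:

CREATE TABLE wide_part1 (id BIGINT, col_0001 STRING, col_0002 STRING) STORED AS PARQUET;
CREATE TABLE wide_part2 (id BIGINT, col_5001 STRING, col_5002 STRING) STORED AS PARQUET;

SELECT a.col_0001, b.col_5001
FROM wide_part1 a
JOIN wide_part2 b ON a.id = b.id;

Each piece stays well under the limit, at the cost of a join whenever a query spans both halves.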

Hope the above information is helpful.

Unable to Insert data into VARCHAR data type in Impala

In this blog post, I will talk about an issue where an Impala user is not able to directly insert a STRING value into a table that has a VARCHAR column. To reproduce it, see the test case below:

CREATE TABLE test (a varchar(20));
INSERT INTO test SELECT 'a';
ERROR: AnalysisException: Possible loss of precision for target table 'default.test'.
Expression ''a'' (type: STRING) would need to be cast to VARCHAR(20) for column 'a'

I would like to advise that this is currently working as expected: Impala was designed to enforce data integrity by stopping users from having STRING data converted into VARCHAR behind the scenes, where a value longer than the VARCHAR length would be silently truncated without the user being aware of the data loss.

This has been reported in the upstream JIRA: IMPALA-1333. The plan is to allow such implicit casting between STRING and VARCHAR, so that third-party software can handle this case more easily. It will probably be implemented as a flag that can be turned ON and OFF at the session level when needed, rather than allowing implicit conversion blindly, as that carries the risk of unexpected data loss or corruption.

For now, if you know that your STRING data will fit into the VARCHAR's length, then, as mentioned in the JIRA, you can run the query below to overcome the issue:

INSERT INTO test SELECT CAST('a' AS varchar(20));
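The same CAST works when the data comes from another table rather than a literal. For example, with a hypothetical source table that has a STRING column:

INSERT INTO test SELECT CAST(source_col AS varchar(20)) FROM source_table;

Just make sure the STRING values really do fit into 20 characters, otherwise they will be truncated by the CAST.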

Impala query failed with error “Failed to parse view-definition statement” after upgrade

If you have just upgraded CDH or Impala, from CDH 5.11 or earlier to CDH 5.12 or later, or from Impala 2.8 or earlier to Impala 2.9 or later, and you notice that your Impala queries against certain VIEWs fail with the error below:

AnalysisException: Failed to load metadata for table: 'db.view_name' CAUSED BY: 
TableLoadingException: Failed to parse view-definition statement of view: db.view_name

Then please read on.

Firstly, please confirm whether your VIEW references a keyword like “SORT”, by running “SHOW CREATE TABLE view_name”. If the output looks something like the below:

CREATE VIEW default.test_view AS SELECT a, sort FROM default.test

Notice the word “sort” in the query; if it is there, you are hitting the issue I am talking about in this post.

This is because CDH 5.12.x / Impala 2.9 introduced IMPALA-4166, which adds support for the “SORT BY” clause and makes “SORT” a keyword in Impala from that version onwards. Because SORT is now a keyword, any unquoted reference to it elsewhere in a query makes the query invalid and causes it to fail.
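If you have a lot of views and want to find all the affected ones in one go, you can search the view definitions stored in the Hive Metastore database. The query below is only a sketch, assuming a MySQL-backed metastore; the exact table and column names may differ between metastore versions:

SELECT d.NAME AS db_name, t.TBL_NAME
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID
WHERE t.TBL_TYPE = 'VIRTUAL_VIEW'
AND LOWER(t.VIEW_ORIGINAL_TEXT) LIKE '%sort%';

This can return false positives (for example a column named “sorted”), so still confirm each view with “SHOW CREATE TABLE”.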

You can see the stack trace below in the Impala Coordinator log:

CAUSED BY: TableLoadingException: Failed to parse view-definition statement of view: db.view_name
@ 0x83efb9 impala::Status::Status()
@ 0xb747c2 impala::JniUtil::GetJniExceptionMsg()
@ 0xa7dfcb impala::Frontend::GetExecRequest()
@ 0xa98fc8 impala::ImpalaServer::ExecuteInternal()
@ 0xaa0278 impala::ImpalaServer::Execute()
@ 0xadf28e impala::ImpalaServer::query()
@ 0xd71cd5 beeswax::BeeswaxServiceProcessor::process_query()
@ 0xd74fd4 beeswax::BeeswaxServiceProcessor::dispatchCall()
@ 0x80ecdc apache::thrift::TDispatchProcessor::process()
@ 0x1b596ab apache::thrift::server::TThreadPoolServer::Task::run()
@ 0x1b40fc9 apache::thrift::concurrency::ThreadManager::Worker::run()
@ 0x9d5c69 impala::ThriftThread::RunRunnable()
@ 0x9d6a42 boost::detail::function::void_function_obj_invoker0<>::invoke()
@ 0xbd6fe2 impala::Thread::SuperviseThread()
@ 0xbd7744 boost::detail::thread_data<>::run()
@ 0xe6417a (unknown)

To fix the issue, there are two ways:

1. Quote the SORT keyword with backticks (“`”), by DROPping and re-CREATEing the view, like below:

DROP VIEW default.test_view;
Query: DROP VIEW default.test_view
Fetched 0 row(s) in 4.35s

CREATE VIEW default.test_view AS SELECT a, `sort` FROM default.test;
Query: CREATE VIEW default.test_view AS SELECT a, `sort` FROM default.test
Fetched 0 row(s) in 3.87s

SELECT * FROM default.test_view;
Query: SELECT * FROM default.test_view
Fetched 0 row(s) in 3.63s

2. If your view is complex, DROP and re-CREATE the VIEW in beeline, without modifying the VIEW creation query.

This works because Hive will add the backticks (“`”) for you automatically, which Impala does not.
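For reference, the definition Hive ends up storing should look roughly like the below (approximate; the exact quoting differs slightly between Hive versions):

CREATE VIEW default.test_view AS SELECT `a`, `sort` FROM `default`.`test`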

This should help to resolve your cluster upgrade issue.

If this post helps, please share your experience below in the comments, thanks.

Enable Debug Logging for YARN Application in Hive

When troubleshooting Hadoop-related issues, we constantly need to enable DEBUG level logging so that we can see more of what Hadoop is doing when running jobs. This blog post shows how you can enable DEBUG logging for YARN application jobs when you run them through Hive.

Doing so is simple. Just run the three statements below before your query in the Hive session:

SET mapreduce.map.log.level=DEBUG;
SET mapreduce.reduce.log.level=DEBUG;
SET yarn.app.mapreduce.am.log.level=DEBUG;

This will set the Mapper, Reducer and Application Master logs to DEBUG level. If you only need one of them, simply leave out the others.

Then you can just run your query as normal:

SELECT COUNT(*) FROM tablename;

Mapper log before DEBUG is enabled:

2018-05-22 21:06:30,267 WARN [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: org.apache.commons.configuration.ConfigurationException: Unable to load the configuration from the URL file:/run/cloudera-scm-agent/process/1370-yarn-NODEMANAGER/hadoop-metrics2.properties
2018-05-22 21:06:30,338 INFO [main] org.apache.hadoop.mapred.YarnChild: Executing with tokens:
2018-05-22 21:06:30,339 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: mapreduce.job, Service: job_1526873868324_0001, Ident: (org.apache.hadoop.mapreduce.security.token.JobTokenIdentifier@54b83d10)
2018-05-22 21:06:30,686 INFO [main] org.apache.hadoop.mapred.YarnChild: Kind: HDFS_DELEGATION_TOKEN, Service: 10.17.100.142:8020, Ident: (token for hive: HDFS_DELEGATION_TOKEN owner=hive/host-10-17-100-142.coe.cloudera.com@CDH511.COM, renewer=yarn, realUser=, issueDate=1527048353000, maxDate=1527653153000, sequenceNumber=787, masterKeyId=173)
2018-05-22 21:06:31,556 INFO [main] org.apache.hadoop.mapred.YarnChild: Sleeping for 0ms before retrying again. Got null now.
2018-05-22 21:06:32,194 INFO [main] org.apache.hadoop.mapred.YarnChild: mapreduce.cluster.local.dir for child: /yarn/nm2/usercache/hive/appcache/application_1526873868324_0001
2018-05-22 21:06:33,738 INFO [main] org.apache.hadoop.conf.Configuration.deprecation: session.id is deprecated. Instead, use dfs.metrics.session-id
2018-05-22 21:06:34,271 INFO [main] org.apache.hadoop.mapred.Task:  Using ResourceCalculatorProcessTree : [ ]

After DEBUG is enabled:

2018-05-22 21:08:00,607 DEBUG [main] org.apache.hadoop.mapred.YarnChild: Child starting
2018-05-22 21:08:01,936 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginSuccess with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Rate of successful kerberos logins and latency (milliseconds)], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,968 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.loginFailure with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Rate of failed kerberos logins and latency (milliseconds)], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,968 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field org.apache.hadoop.metrics2.lib.MutableRate org.apache.hadoop.security.UserGroupInformation$UgiMetrics.getGroups with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[GetGroups], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,969 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field private org.apache.hadoop.metrics2.lib.MutableGaugeLong org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailuresTotal with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Renewal failures since startup], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,969 DEBUG [main] org.apache.hadoop.metrics2.lib.MutableMetricsFactory: field private org.apache.hadoop.metrics2.lib.MutableGaugeInt org.apache.hadoop.security.UserGroupInformation$UgiMetrics.renewalFailures with annotation @org.apache.hadoop.metrics2.annotation.Metric(valueName=Time, about=, value=[Renewal failures since last successful login], always=false, type=DEFAULT, sampleName=Ops)
2018-05-22 21:08:01,970 DEBUG [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: UgiMetrics, User and group related metrics
2018-05-22 21:08:02,125 DEBUG [main] org.apache.hadoop.security.SecurityUtil: Setting hadoop.security.token.service.use_ip to true
2018-05-22 21:08:02,169 DEBUG [main] org.apache.hadoop.security.Groups:  Creating new Groups object
2018-05-22 21:08:02,211 DEBUG [main] org.apache.hadoop.util.Shell: Failed to detect a valid hadoop home directory
java.io.IOException: HADOOP_HOME or hadoop.home.dir are not set.
	at org.apache.hadoop.util.Shell.checkHadoopHome(Shell.java:351)
	at org.apache.hadoop.util.Shell.<clinit>(Shell.java:376)
	at org.apache.hadoop.util.StringUtils.<clinit>(StringUtils.java:79)
	at org.apache.hadoop.security.Groups.parseStaticMapping(Groups.java:168)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:132)
	at org.apache.hadoop.security.Groups.<init>(Groups.java:100)
	at org.apache.hadoop.security.Groups.getUserToGroupsMappingService(Groups.java:435)
	at org.apache.hadoop.security.UserGroupInformation.initialize(UserGroupInformation.java:337)
	at org.apache.hadoop.security.UserGroupInformation.setConfiguration(UserGroupInformation.java:374)
	at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:78)
2018-05-22 21:08:02,425 DEBUG [main] org.apache.hadoop.util.Shell: setsid exited with exit code 0
2018-05-22 21:08:02,521 DEBUG [main] org.apache.hadoop.security.Groups: Group mapping impl=org.apache.hadoop.security.ShellBasedUnixGroupsMapping; cacheTimeout=300000; warningDeltaMs=5000
2018-05-22 21:08:02,548 DEBUG [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: from system property: null
2018-05-22 21:08:02,549 DEBUG [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: from environment variable: null
2018-05-22 21:08:02,614 DEBUG [main] org.apache.commons.configuration.ConfigurationUtils: ConfigurationUtils.locate(): base is null, name is hadoop-metrics2-maptask.properties
2018-05-22 21:08:02,616 DEBUG [main] org.apache.commons.configuration.ConfigurationUtils: ConfigurationUtils.locate(): base is null, name is hadoop-metrics2.properties
2018-05-22 21:08:02,617 DEBUG [main] org.apache.commons.configuration.ConfigurationUtils: Loading configuration from the context classpath (hadoop-metrics2.properties)
2018-05-22 21:08:02,620 WARN [main] org.apache.hadoop.metrics2.impl.MetricsSystemImpl: Metrics system not started: org.apache.commons.configuration.ConfigurationException: Unable to load the configuration from the URL file:/run/cloudera-scm-agent/process/1370-yarn-NODEMANAGER/hadoop-metrics2.properties

Hope the above helps!

Patch for SENTRY-2240 – DROP UDF Permission Issue in Sentry

Last week, I discovered an issue in Sentry where it does not check permissions properly when a user tries to DROP a function. Reproducing it is easy: simply create a function under a database using an admin account, and make sure that a particular user does not have ANY permissions on the database the UDF was created under. Then use that user to DROP the function; the DROP succeeds when it should be denied.
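Below is a minimal sketch of the reproduction, with hypothetical database, function and JAR names:

-- as an admin user, create a function under a database
CREATE FUNCTION testdb.my_udf AS 'com.example.MyUDF' USING JAR 'hdfs:///tmp/my-udf.jar';

-- as a user with no privileges on testdb at all, this should be denied,
-- but before the fix it succeeds
DROP FUNCTION testdb.my_udf;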

I immediately checked whether any upstream JIRA had been reported for this, but I was not able to find one, hence I filed a new JIRA: please see SENTRY-2240.

I have contributed patches to Sqoop and Hive before, but not to Sentry yet, so I thought this JIRA would be a good one to start with. I went ahead and checked out the Sentry code from GitHub, examined it to see what was wrong, and could see that for CREATE and DROP FUNCTION calls, Sentry does not care which database the user is under:

HiveAuthzBindingHook.java#L226
HiveAuthzBindingHook.java#L230

      case HiveParser.TOK_CREATEFUNCTION:
        ........

        // create/drop function is allowed with any database
        currDB = Database.ALL;
        break;
      case HiveParser.TOK_DROPFUNCTION:
        // create/drop function is allowed with any database
        currDB = Database.ALL;
        break;

I spent last weekend thinking about the fix, applying it, and updating the test cases that are affected. I have forked the repository and created a branch to track my changes until the final version. Please refer to https://github.com/ericlin05/sentry/tree/SENTRY-2240.

If you have any comments on my patch or want to discuss it, please add your comments below.