Spark History Server Keeps Crashing With OutOfMemory Error

This article explains what to do when the Spark History Server fails to start, crashing repeatedly with OutOfMemory errors after Spark has been in use for some time.

To confirm that the Spark History Server keeps failing with an OutOfMemory error on startup, check its runtime process directory under /var/run/cloudera-scm-agent/process if you are using the CDH version of Spark. The stdout.log file contains the following error:

#
# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="/usr/lib64/cmf/service/common/killparent.sh"
# Executing /bin/sh -c "/usr/lib64/cmf/service/common/killparent.sh"...
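
If you are not sure which process directory belongs to the History Server, a grep across the agent's process directories can locate the right stdout.log. A quick sketch; the directory name pattern below is an assumption, as the exact role directory name varies per cluster:

# List stdout.log files containing the OOM error; the role directory is
# usually named something like NNN-spark_on_yarn-SPARK_YARN_HISTORY_SERVER.
grep -l OutOfMemoryError /var/run/cloudera-scm-agent/process/*SPARK*HISTORY*/logs/stdout.log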

One possible reason for the failure is large job history files under /user/spark/sparkApplicationHistory (or /user/spark/spark2ApplicationHistory for Spark2). On startup, the Spark History Server tries to load those files into memory, and if the history files are too big, the server will crash with an OutOfMemory error unless its heap size is increased through the Cloudera Manager interface.
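
In Cloudera Manager the heap is controlled by the History Server's Java heap size setting. On an unmanaged install, the equivalent is typically SPARK_DAEMON_MEMORY in spark-env.sh; a minimal sketch, assuming a standard Spark layout (the 4g value is just an example):

# spark-env.sh: raise the heap for Spark daemons started via spark-daemon.sh,
# which includes the Spark History Server.
export SPARK_DAEMON_MEMORY=4g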

To confirm whether large history files are the cause, just run:

hdfs dfs -ls /user/spark/sparkApplicationHistory

or

hdfs dfs -ls /user/spark/spark2ApplicationHistory

and check whether any files are hundreds of MB in size; if so, they are likely the cause of the problem.

You might also notice that some files have a “.inprogress” extension, like below:

/user/spark/spark2ApplicationHistory/application_1503451614878_0337.inprogress

Those files can be left stale in the HDFS directory when a Spark job fails prematurely and Spark does not get a chance to clean them up. Since the Spark History Server has no way of knowing whether those files were left over from a failed Spark job or are still being written, it blindly loads everything under the HDFS directory into memory, which causes the failure if those files are too big.
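
A quick way to list any stale .inprogress files (adjust the directory for Spark1 vs Spark2):

hdfs dfs -ls /user/spark/spark2ApplicationHistory | grep '\.inprogress$'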

Spark has a cleanup job that removes any files older than a pre-defined time period; however, it does not remove stale .inprogress files. This issue was reported in SPARK-8617.

Once SPARK-8617 is fixed, we should not see those stale .inprogress files anymore. At the time of writing, however, the fix has not been backported into CDH yet.

For now, we just need to delete all files under the /user/spark/sparkApplicationHistory or /user/spark/spark2ApplicationHistory directory in HDFS that are older than, say, one week, so that the big files get cleaned up. A sketch of one way to do this is shown below.
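
A minimal cleanup sketch, assuming GNU date/xargs and the default "hdfs dfs -ls" output format; run the awk step on its own first and review the file list before piping it into the delete:

CUTOFF=$(date -d "7 days ago" +%Y-%m-%d)
# Column 6 of 'hdfs dfs -ls' is the modification date (YYYY-MM-DD) and
# column 8 is the path; ISO dates compare correctly as plain strings.
hdfs dfs -ls /user/spark/spark2ApplicationHistory \
  | awk -v cutoff="$CUTOFF" 'NF >= 8 && $6 < cutoff {print $8}' \
  | xargs -r -n 100 hdfs dfs -rm -skipTrash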

After that, the Spark History Server should be able to start up without issues.

Show Create Table Output Truncated for VIEW in Hive

Yesterday I was working on a very weird issue in Hive where SHOW CREATE TABLE for a VIEW returned partial output. This happened in Hue and Beeline as well as the Hive CLI. Please see the test case below:

1. From the Hue interface, the output is truncated in the same way (screenshot omitted).

2. From Beeline:

0: jdbc:hive2://localhost:10000/default> SHOW CREATE TABLE test_view;
+------------------------------------+--+
|           createtab_stmt           |
+------------------------------------+--+
| CREATE VIEW `test_view` AS SELECT  |
|                                    |
| FROM `default`.`sample_07`         |
+------------------------------------+--+

3. From Hive CLI:

hive> SHOW CREATE TABLE test_view;
OK
CREATE VIEW `test_view` AS SELECT

FROM `default`.`sample_07`

After some research and testing on various CDH versions, I found that the issue was caused by a “\t” (tab) character in the VIEW’s CREATE statement, and that it only happens on CDH versions before 5.13.1. Any characters after the “\t” are removed from the output.
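
A quick way to reproduce it is to create a view whose definition contains a literal tab; the bash $'...' quoting below embeds one. The connection URL is from the test case above, while the column names are an assumption (any columns of the sample_07 table will do):

beeline -u jdbc:hive2://localhost:10000/default \
  -e $'CREATE VIEW test_view AS SELECT\tcode, description FROM sample_07;'
# Everything after the tab is then missing from the output:
beeline -u jdbc:hive2://localhost:10000/default \
  -e 'SHOW CREATE TABLE test_view;'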

What happens from CDH 5.13.1 onward is that Hive normalizes whitespace in the query string before saving it to the database, so this no longer happens for newly created views. However, simply upgrading CDH will not resolve the issue for existing views, as the code that reads from the Hive Metastore database has not changed.

To fix the issue, we need to update the data stored in the Hive Metastore backend database. Please follow the steps below (these are for MySQL; for other database types, please consult your vendor). A command-line sketch of the steps follows the list:

1. STOP the Hive services, so that no one can update Hive metadata by running queries
2. Dump the whole MySQL database out into a text file using the “mysqldump” command-line tool (please make sure all data is dumped correctly so that we can load it back)
3. Use a text editor to replace “tab” characters with spaces in the data stored in the “TBLS” table
4. Since we cannot RENAME a MySQL database, just drop it and re-create it
5. Reload the modified dump file into the newly created database
6. Restart Hive and confirm that the issue is fixed
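
A minimal sketch of steps 2–5, assuming the Metastore database is named “metastore” (the name and credentials are assumptions) and that GNU sed is available. Note that this replaces every literal tab in the dump, not only those in TBLS rows, so review the changes before reloading:

# Step 2: dump the Metastore database out to a text file.
mysqldump -u root -p metastore > metastore_dump.sql
# Step 3: replace literal tab characters with a space (GNU sed understands \t).
sed 's/\t/ /g' metastore_dump.sql > metastore_dump_fixed.sql
# Step 4: drop and re-create the database.
mysql -u root -p -e 'DROP DATABASE metastore; CREATE DATABASE metastore;'
# Step 5: reload the modified dump into the new database.
mysql -u root -p metastore < metastore_dump_fixed.sql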

It is better to test this in a DEV/TEST cluster before applying the same change in PROD, to minimize mistakes.

Hope the above helps.

HiveServer2 Failed to Start With “Table/View ‘DBS’ does not exist”

Recently I was dealing with an issue where, after the user upgraded CDH to 5.12.x, HiveServer2 was unable to start up. Examining the HiveServer2 log, we noticed that for some reason HiveServer2 was trying to access the HiveMetaStore database and getting a “table does not exist” error. See the full stacktrace below:

2017-12-01 20:17:15,419 WARN  org.apache.hadoop.hive.metastore.MetaStoreDirectSql: [Thread-13]: Self-test query [select "DB_ID" from "DBS"] failed; direct SQL is disabled
javax.jdo.JDODataStoreException: Error executing SQL query "select "DB_ID" from "DBS"".
        at org.datanucleus.api.jdo.NucleusJDOHelper.getJDOExceptionForNucleusException(NucleusJDOHelper.java:451)
        at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:230)
        at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.runTestQuery(MetaStoreDirectSql.java:226)
        at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:134)
        at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:347)
        at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:298)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:682)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:660)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:713)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:508)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:461)
        at org.apache.hive.service.auth.HiveAuthFactory.<init>(HiveAuthFactory.java:118)
        at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:58)
        at java.lang.Thread.run(Thread.java:745)
NestedThrowablesStackTrace:
java.sql.SQLSyntaxErrorException: Table/View 'DBS' does not exist.
        at org.apache.derby.impl.jdbc.SQLExceptionFactory.getSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.Util.generateCsSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.TransactionResourceImpl.wrapInSQLException(Unknown Source)
        at org.apache.derby.impl.jdbc.TransactionResourceImpl.handleException(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.handleException(Unknown Source)
        at org.apache.derby.impl.jdbc.ConnectionChild.handleException(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement.<init>(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedPreparedStatement42.<init>(Unknown Source)
        at org.apache.derby.jdbc.Driver42.newEmbedPreparedStatement(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
        at org.apache.derby.impl.jdbc.EmbedConnection.prepareStatement(Unknown Source)
        at com.jolbox.bonecp.ConnectionHandle.prepareStatement(ConnectionHandle.java:1193)
        at org.datanucleus.store.rdbms.SQLController.getStatementForQuery(SQLController.java:350)
        at org.datanucleus.store.rdbms.query.RDBMSQueryUtils.getPreparedStatementForQuery(RDBMSQueryUtils.java:194)
        at org.datanucleus.store.rdbms.query.SQLQuery.performExecute(SQLQuery.java:267)
        at org.datanucleus.store.query.Query.executeQuery(Query.java:1786)
        at org.datanucleus.store.query.AbstractSQLQuery.executeWithArray(AbstractSQLQuery.java:339)
        at org.datanucleus.store.query.Query.execute(Query.java:1654)
        at org.datanucleus.api.jdo.JDOQuery.execute(JDOQuery.java:221)
        at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.runTestQuery(MetaStoreDirectSql.java:226)
        at org.apache.hadoop.hive.metastore.MetaStoreDirectSql.<init>(MetaStoreDirectSql.java:134)
        at org.apache.hadoop.hive.metastore.ObjectStore.initialize(ObjectStore.java:347)
        at org.apache.hadoop.hive.metastore.ObjectStore.setConf(ObjectStore.java:298)
        at org.apache.hadoop.util.ReflectionUtils.setConf(ReflectionUtils.java:73)
        at org.apache.hadoop.util.ReflectionUtils.newInstance(ReflectionUtils.java:133)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.<init>(RawStoreProxy.java:60)
        at org.apache.hadoop.hive.metastore.RawStoreProxy.getProxy(RawStoreProxy.java:69)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.newRawStore(HiveMetaStore.java:682)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.getMS(HiveMetaStore.java:660)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.createDefaultDB(HiveMetaStore.java:713)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.init(HiveMetaStore.java:508)
        at org.apache.hadoop.hive.metastore.HiveMetaStore$HMSHandler.<init>(HiveMetaStore.java:461)
        at org.apache.hive.service.auth.HiveAuthFactory.<init>(HiveAuthFactory.java:118)
        at org.apache.hive.service.cli.thrift.ThriftBinaryCLIService.run(ThriftBinaryCLIService.java:58)
        at java.lang.Thread.run(Thread.java:745)
Caused by: ERROR 42X05: Table/View 'DBS' does not exist.
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.iapi.error.StandardException.newException(Unknown Source)
        at org.apache.derby.impl.sql.compile.FromBaseTable.bindTableDescriptor(Unknown Source)
        at org.apache.derby.impl.sql.compile.FromBaseTable.bindNonVTITables(Unknown Source)
        at org.apache.derby.impl.sql.compile.FromList.bindTables(Unknown Source)
        at org.apache.derby.impl.sql.compile.SelectNode.bindNonVTITables(Unknown Source)
        at org.apache.derby.impl.sql.compile.DMLStatementNode.bindTables(Unknown Source)
        at org.apache.derby.impl.sql.compile.DMLStatementNode.bind(Unknown Source)
        at org.apache.derby.impl.sql.compile.CursorNode.bindStatement(Unknown Source)
        at org.apache.derby.impl.sql.GenericStatement.prepMinion(Unknown Source)
        at org.apache.derby.impl.sql.GenericStatement.prepare(Unknown Source)
        at org.apache.derby.impl.sql.conn.GenericLanguageConnectionContext.prepareInternalStatement(Unknown Source)
        ... 29 more

If you look closely at the stacktrace, you will notice that HiveServer2 was trying to read those tables from a Derby database, and hence failed with the error. This does not make sense, because:

1. The backend database was configured to be MySQL, not Derby
2. The database access should happen in the HiveMetaStore service, not HiveServer2

After researching, I found out that this was actually caused by the following:

– A known issue in Cloudera Manager: when you choose DBTokenStore as the HiveMetaStore Delegation Token Store, Cloudera Manager will also configure HiveServer2 to use DBTokenStore
– HiveServer2 does not support DBTokenStore until CDH 5.13.0 (support was introduced by HIVE-12270)
– When DBTokenStore is used by HiveServer2, HiveServer2 uses Derby as the backend for the DBTokenStore, and somehow it tries to query the tables that are supposed to be used by HiveMetaStore (I have not figured out why yet)

Since the user had HiveMetaStore HA, rolling back to MemoryTokenStore was not an option, so the solution was to use ZooKeeperTokenStore instead, which is supported by both HiveMetaStore and HiveServer2.

So simply go to CM > Hive > Configuration > “Hive Metastore Delegation Token Store” and select “org.apache.hadoop.hive.thrift.ZooKeeperTokenStore” (screenshot omitted).

Save and then restart Hive services.
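
For reference, this setting corresponds to the hive.cluster.delegation.token.store.class property in hive-site.xml. In a CM-managed cluster you should change it through the UI rather than editing the file directly, but the generated configuration should contain something like:

<property>
  <name>hive.cluster.delegation.token.store.class</name>
  <value>org.apache.hadoop.hive.thrift.ZooKeeperTokenStore</value>
</property>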

Impala Failed to Read Parquet Decimal Data

If you are using the Hive JDBC/ODBC driver to create Parquet data via Hive, and you see the error below when querying that data in Impala:

File 'hdfs://namespace1/path/to/parquet/file/15110390563421' column 'column1' has a scale that does not match the table metadata scale.
File metadata scale: 0 Table metadata scale: 2

It is most likely that you have hit the known issue HIVE-15519.

Due to this bug, the DECIMAL scale value returned from the driver was incorrect (it returned null), which caused the data to be written to the Parquet file with the wrong scale for those columns. Impala then complains that the column’s definition in the table metadata does not match the column type stored in the Parquet file, because the scale values differ.
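
One way to confirm you are in this situation is to compare the scale recorded inside the Parquet file against the table definition. A sketch, assuming parquet-tools is available and using the placeholder path and table name from the error above:

# Pull the suspect file out of HDFS and inspect its embedded schema;
# look for the DECIMAL annotation's (precision, scale) on the column.
hdfs dfs -get /path/to/parquet/file/15110390563421 /tmp/suspect.parquet
parquet-tools schema /tmp/suspect.parquet
# Compare against the scale declared in the table definition.
beeline -u jdbc:hive2://localhost:10000/default -e 'DESCRIBE tablename;'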

Currently there is no workaround available for this issue; a patch is required, or you will have to wait for it to be fixed. At the time of writing, the latest Cloudera CDH release is 5.13.1.

Hive LPAD Function Hangs HiveServer2 and Subsequent Queries

Recently I noticed an issue in Hive where running the query below hangs HiveServer2 indefinitely until the query is cancelled. And because of HiveServer2’s global compilation lock, it prevents any further queries from being submitted to Hive, so all subsequent queries appear to be queued.

SELECT lpad("String",10,'');

After researching, I concluded that it was caused by an upstream Hive bug: HIVE-15792. This has been fixed in upstream Hive 2.3.0, and in CDH 5.12.1 and CDH 5.13.0 onwards.

Currently, the only workaround is to pass a non-empty string as the third parameter of the function, so the query below will work:

SELECT lpad("String",10,'v');

The issue is basically caused by code logic that loops forever when an empty pad string is passed into the function. You can have a look at the patch in the JIRA for details of the change.
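
Conceptually, the padding loop can never terminate, because prepending an empty pad never increases the string's length. This shell snippet is an illustration only, not the actual Hive code:

s="String"; pad=""
# ${#s} never reaches 10 when pad is empty, so this loops forever.
while [ ${#s} -lt 10 ]; do s="${pad}${s}"; done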

Hope the above information helps.