Impala currently supports the Parquet file format quite well. For those not familiar with it, Parquet is a free and open-source column-oriented data storage format in the Apache Hadoop ecosystem, and it provides quick and easy access to data with a large number of columns. For more details, refer to the official Apache Parquet website.
However, even though Parquet is good at storing data with a large number of columns and can retrieve column data quickly, there is still a limit on how many columns you can store if you want a processing engine such as Impala to work properly.
Recently I discovered a bug in Impala: when a table has too many columns (more than 10,000), the Impala query fails with the error below:
hdfsOpenFile(hdfs://nameservice1/user/hive/warehouse/default.db/table/_impala_insert_staging/f44e0332a3ec1af9_55c692eb00000000/.dh4e0332q3ac1af9-55c692wb00000003_1471427586_dir/dh4e0332q3ac1af9-55c692wb00000003_1471427586_data.0.parq): FileSystem#create((Lorg/apache/hadoop/fs/Path;ZISJ)Lorg/apache/hadoop/fs/FSDataOutputStream;) error: RemoteException: Specified block size is less than configured minimum value (dfs.namenode.fs-limits.min-block-size): -1130396776 < 1048576 at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFileInt(FSNamesystem.java:2705) at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.startFile(FSNamesystem.java:2658)
The reason is that when a Parquet table has 10K+ columns, Impala tries to estimate the memory required to process the data, and the calculation overflows a signed 32-bit integer (int32) variable used in the Impala code. The result wraps around to a negative value, which is what shows up as the block size in the error above (-1130396776), and HDFS rejects it because it is below the configured minimum of 1048576. This has been reported upstream in JIRA: IMPALA-7044.
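To make the overflow easier to picture, here is a minimal Python sketch of how a signed 32-bit value wraps around. It is not Impala's actual code: the helper to_int32 and the per-column figure HYPOTHETICAL_BYTES_PER_COLUMN are assumptions chosen only for illustration. The last lines show one unwrapped value that would produce the exact number in the error, assuming the counter wrapped only once.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the largest value a signed int32 can hold


def to_int32(value: int) -> int:
    """Wrap an arbitrary integer the way a signed 32-bit variable would."""
    return (value + 2**31) % 2**32 - 2**31


# Hypothetical per-column size, NOT the real figure used by Impala.
HYPOTHETICAL_BYTES_PER_COLUMN = 256 * 1024

for num_columns in (8_000, 10_000):
    true_size = num_columns * HYPOTHETICAL_BYTES_PER_COLUMN
    stored = to_int32(true_size)
    status = "ok" if true_size <= INT32_MAX else "overflow"
    print(f"{num_columns} columns: true={true_size} stored={stored} ({status})")

# One value that wraps to the block size seen in the error message
# (assumption: the int32 wrapped around exactly once).
wrapped_block_size = to_int32(3_164_570_520)
print(wrapped_block_size)
```

With these assumed numbers, 8,000 columns still fit in an int32 while 10,000 columns do not, which is the same kind of cliff described above, although the real threshold depends on the column types.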
There is no real workaround for this issue at this stage, other than reducing the number of columns in the Parquet table.
Currently the maximum number of columns Impala can handle is around 8K-10K, depending on the column types, so you will have to re-design the table to use fewer columns.
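One possible way to re-design a very wide table is to split it into several narrower tables that share a key column and can be joined back when needed. The sketch below only plans the column groups; the function split_columns, the key name "id", the column names and the limit of 5,000 are all hypothetical and should be adapted to your own schema.

```python
def split_columns(columns, key, max_columns):
    """Split a wide column list into groups of at most max_columns.

    Every group starts with the shared key column so that the resulting
    narrow tables can be joined back together on that key.
    """
    if max_columns < 2:
        raise ValueError("max_columns must allow the key plus one other column")
    others = [c for c in columns if c != key]
    step = max_columns - 1  # reserve one slot per group for the key
    return [[key] + others[i:i + step] for i in range(0, len(others), step)]


# Hypothetical wide table: a key column plus 12,000 data columns.
wide_columns = ["id"] + [f"c{i}" for i in range(12_000)]

# Assumed limit, chosen to stay well under the 8K-10K range mentioned above.
groups = split_columns(wide_columns, key="id", max_columns=5_000)

for n, group in enumerate(groups):
    print(f"table_part_{n}: {len(group)} columns")
```

Each group can then become its own Parquet table, at the cost of a join whenever columns from different groups are needed together.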
I hope the above information is helpful.