This article explains what to do when you are unable to start up SparkHistory server which keeps crashing with OutOfMemory errors after using Spark for some time.
To confirm that Spark History Server keeps failing with OutOfMemory error on start up, we can check the run time process directory for Spark under /var/run/cloudera-scm-agent/process directory if you using using CDH version of Spark. The stdout.log file contains the following error:
# # java.lang.OutOfMemoryError: Java heap space # -XX:OnOutOfMemoryError="/usr/lib64/cmf/service/common/killparent.sh" # Executing /bin/sh -c "/usr/lib64/cmf/service/common/killparent.sh"...
One of the possible reason for the failure was due to some large job history files under /user/spark/sparkApplicationHistory (or /user/spark/spark2ApplicationHistory for Spark2). On SparkHistory server startup, it will try to load those files into memory, and if those history files are too big, it will cause history server to crash with OutOfMemory error unless heap size is increased through Cloudera Manager interface.
To confirm this, just run:
hdfs dfs -ls /user/spark/sparkApplicationHistory
hdfs dfs -ls /user/spark/spark2ApplicationHistory
and see if there are any files that are in hundreds of MBs in size, if yes, then that will be a problem.
You might also notice that some files might have “.inprogress” extension, like below:
Those files can become stale in the HDFS directory in the case that Spark job failed prematurely, and Spark did not get a chance to clean up those files. Since SparkHistory server has no way of knowing if those files were left over from a failed Spark job, or if they were still being processed, hence it will just blindly load everything under the HDFS directory into memory, which will cause failure if those files are too big.
Spark has a clean up job to remove any old files that are longer than a pre-defined time period, however, it does not remove stale .inprogress files. This issue was reported in SPARK-8617.
Once SPARK-8617 is fixed, we should not see those stale .inprogress files anymore. But at the time of writing, it has not been backported into CDH yet.
For now, we just need to delete all the files under /user/spark/sparkApplicationHistory or /user/spark/spark2ApplicationHistory directory in HDFS that are older than, say one week, so that those big files can be cleaned up.
After that, we should be able to get SparkHistory server startup without issues.