We all know that the leap second caused major outages at lots of big sites — Reddit, Gawker, LinkedIn, Foursquare and Yelp all reported problems in early July. I didn’t understand why it happened and never thought it would affect the servers that host our Hadoop cluster. When we got into the office on Monday morning, we noticed that most of our processing tasks were running very slowly, and we had received a flood of Hadoop namenode overload warning emails from OpsView.
Initially we thought it was caused by our monthly processing, which is quite resource intensive as it needs to churn through terabytes of data. However, the problem persisted even after the monthly processing had finished: the warnings kept coming, and the Hive server kept going up and down.
On Tuesday night, I worked with our system administrator until midnight trying to restart all the Hadoop datanodes in the cluster, but with no luck.
Eventually we concluded that it was a Linux kernel bug triggered by the leap second: the leap second insertion left the kernel’s high-resolution timers in a bad state, causing processes (Java among them) to spin and burn CPU. The solution was to apply the kernel bug fix and reset the servers’ clocks.
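For anyone hitting the same thing, this is roughly the workaround that was widely circulated at the time — a hedged sketch, not our exact commands; the ntp init script path and service name vary by distro, and both steps need root:

```shell
# Stop ntpd first so it cannot re-flag a leap second
# (path/service name varies by distro; assumes root).
/etc/init.d/ntp stop

# Setting the clock -- even to its current value -- clears the
# stuck timer state left behind by the leap second insertion.
date -s "$(date)"
```

After the clock reset, CPU usage on the affected machines reportedly dropped back to normal without a reboot.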
It wasn’t an obvious fix, but we all learned from it. Next time it will be easier to identify — but when will that be?