We got to experience our own version of the thundering herd problem.
Well, a pseudo-thundering herd problem. Maybe.
So, what is it?
“This can happen under Unix when you have a number of processes that are waiting on a single event. When that event (a connection to the web server, say) happens, every process which could possibly handle the event is awakened. In the end, only one of those processes will actually be able to do the work, but, in the meantime, all the others wake up and contend for CPU time before being put back to sleep. Thus the system thrashes briefly while a herd of processes thunders through. If this starts to happen many times per second, the performance impact can be significant.” Source Link.
It didn’t happen exactly this way, but we experienced something similar. Our already-loaded database servers used to experience a spike every 8 hours. The spike was more pronounced during morning hours.
At first, we thought it was just a surge of traffic, but it became a clearly visible pattern after a few days.
After debugging, we realized the culprit was our cache expiration and refresh strategy. We cache heavyweight Product objects. A Product encapsulates every possible definition: its policies, price, availability, and so on.
We had set the expiration policy to 8 hours, i.e., a business day. The idea was that we would cache these heavyweight Product objects and refresh them the next day.
But, why the spike?
Well, the entire cache used to be flushed every 8 hours. Multiple HTTP requests would then find the cache empty, and each of them would access the database to load the top Products. Here is the culprit code –
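The original snippet is not reproduced here; what follows is a minimal sketch of that cache-miss path, written in Java for illustration (the class and method names, and the in-memory list standing in for the cache and database, are all assumptions, not the original code):

```java
import java.util.Arrays;
import java.util.List;

// Illustrative sketch of the cache-miss path described above.
class ProductCache {
    // In the real system this cached entry expires every 8 hours.
    private static List<String> topProducts = null;

    static List<String> getTopProducts() {
        if (topProducts == null) {
            // Every request that sees an empty cache takes this branch
            // and issues the expensive database query itself.
            topProducts = loadTopProductsFromDatabase();
        }
        return topProducts;
    }

    private static List<String> loadTopProductsFromDatabase() {
        // Stand-in for the heavyweight database call.
        return Arrays.asList("product-1", "product-2", "product-3");
    }
}
```

After the 8-hour flush, every concurrent request sees an empty cache and each one runs the database query, which is the spike.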
Now, multiple requests used to hit the database-load line at the same time, causing the temporary database spike.
We fixed that by putting the code in a lock statement so that only one thread could get in at a time. The new code, which solved the database spike, looks like this –
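The fix described above uses a C# `lock` statement; a sketch of the equivalent in Java is a `synchronized` block with a double-check, shown here under the same illustrative names (again an assumption, not the original code):

```java
import java.util.Arrays;
import java.util.List;

// Same sketch as before, now with a lock so only one thread loads on a miss.
class LockedProductCache {
    private static final Object cacheLock = new Object();
    // volatile so threads outside the lock see the freshly loaded list.
    private static volatile List<String> topProducts = null;

    static List<String> getTopProducts() {
        if (topProducts == null) {              // fast path: cache already populated
            synchronized (cacheLock) {          // one thread at a time past this point
                if (topProducts == null) {      // re-check: another thread may have loaded it
                    topProducts = loadTopProductsFromDatabase();
                }
            }
        }
        return topProducts;
    }

    private static List<String> loadTopProductsFromDatabase() {
        // Stand-in for the heavyweight database call.
        return Arrays.asList("product-1", "product-2", "product-3");
    }
}
```

Threads that arrive while the load is in progress now wait on the lock instead of duplicating the query, which is why some requests take longer while the database stays quiet.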
This code also has problems. Some HTTP requests end up taking longer than usual, but it has stopped the database spikes.
Isn’t it interesting that even well-intentioned caching strategies can end up being the culprit under heavy load, if not implemented properly?