We have been using a combination of .NET WCF, NHibernate and NCache for a high-availability application for the last couple of years. Its usage really picked up during the last six months of 2012, as more users migrated over to our application.
Apart from using NCache for our application data caching, we also have NHibernate configured to use NCache as a Second Level cache to speed up database fetches.
We use NCache as a distributed cache and have a cluster of NCache servers set up in production.
Below are a few NCache-related gotchas that we have encountered in the last six months while tuning and troubleshooting our application, and how we overcame them. Although these are specific to NCache, they may well apply to other distributed caching solutions.
Our solutions may not be the best for the moment, but we are still learning and improving.
We also received plenty of help from the Alachisoft folks during this period; their support has been very professional and prompt. Thanks, Alachisoft!
1. Use a separate cache instance in NCache for NHibernate’s Second Level cache – We were seeing our NCache object count suddenly drop to zero once or twice a week when we were using the same cache instance for NHibernate and our application. The reason was that NHibernate was invalidating the entire cache for certain update operations, which is why one should use a different cache instance for NHibernate’s second level cache.
I believe another way to solve this is to use NHibernate’s cache regions feature, but I haven’t had a chance to explore that yet.
2. Ensure that the NCache cluster and the application cluster are in the same VLAN – In our case the two clusters were originally in separate VLANs, and we were frequently getting “NCache Server Not Available” errors on our application servers. The first step to address this was to bring both clusters into the same VLAN.
3. Ensure there is no firewall between your application server and NCache – Even after bringing the clusters into the same VLAN, the “NCache Server Not Available” errors persisted. We then discovered a firewall sitting between the clusters, with pretty aggressive filtering rules, which was blocking NCache requests once they exceeded a certain threshold count. The security team could not remove the firewall between the clusters, so this was fixed by relaxing the firewall rules.
4. Managing NCache failover – While we were trying to understand the root cause of problems listed above, we had to manage NCache failover (or NCache not being available) on our application side.
For this, with Alachisoft’s help, we created a list of fatal NCache exceptions. When one of them occurs, our application stops querying NCache and goes to the original data source instead. A timer then periodically checks whether NCache is reachable again, and if it is, the application resumes using it.
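The failover idea above can be sketched roughly as follows. This is a minimal Python illustration of the pattern, not our production code: `CacheUnavailableError`, `FailoverCache`, the `get`/`put` cache interface and the 30-second retry interval are all hypothetical names and values, and the sketch replaces the background timer with a simple time check on the next request.

```python
import time


class CacheUnavailableError(Exception):
    """Hypothetical stand-in for a 'fatal' cache exception."""


class FailoverCache:
    """Wraps a cache client and bypasses it for a while after a fatal error."""

    # Exception types treated as fatal (an assumed list, for illustration only).
    FATAL_ERRORS = (CacheUnavailableError, ConnectionError, TimeoutError)

    def __init__(self, cache, load_from_source, retry_after_seconds=30.0,
                 clock=time.monotonic):
        self._cache = cache              # any object with get(key) / put(key, value)
        self._load = load_from_source    # fallback, e.g. a database query
        self._retry_after = retry_after_seconds
        self._clock = clock
        self._down_since = None          # None means the cache is considered healthy

    def _cache_usable(self):
        if self._down_since is None:
            return True
        # Once the retry interval has passed, let the next request probe the cache.
        if self._clock() - self._down_since >= self._retry_after:
            self._down_since = None
            return True
        return False

    def get(self, key):
        if self._cache_usable():
            try:
                value = self._cache.get(key)
            except self.FATAL_ERRORS:
                self._down_since = self._clock()   # stop querying the cache
            else:
                if value is not None:
                    return value
        value = self._load(key)
        if self._down_since is None:
            try:
                self._cache.put(key, value)
            except self.FATAL_ERRORS:
                self._down_since = self._clock()
        return value
```

Probing on the next request keeps the sketch free of background threads; a real timer, as described above, has the advantage that no user request pays for the probe.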
4b. Managing NCache failover for NHibernate – Building NHibernate session factories is expensive, and we have to build four of them when our application starts. If we want to use NCache as the second level cache, the session factory has to be built with this option enabled.
After an NCache failover, we cannot continue to use the session factory built with NCache support; it has to be rebuilt without second level caching support. Doing this at runtime meant a lot of failed user requests while the factories were being rebuilt, so we decided to build two versions of each NHibernate factory at application startup, one with second level caching support and one without. In total we built 8 session factories at application startup. Then, when NCache went down, we switched on the fly to the factories without second level caching support.
We could not find any examples on the web of switching NHibernate factories on the fly, so we had to test our application thoroughly with this feature. Happily, it has been running fine in production for the last three months.
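For anyone looking for the shape of the factory switch, here is a rough Python sketch of the idea. It is not NHibernate code: `SwitchableFactories` and the `build_factory(name, use_second_level_cache)` builder are hypothetical names standing in for whatever builds a session factory in your application.

```python
import threading


class SwitchableFactories:
    """Pre-builds cached and uncached factory variants and picks one per request."""

    def __init__(self, names, build_factory):
        # build_factory(name, use_second_level_cache) is a hypothetical,
        # expensive builder. Both variants are built once, at startup,
        # so nothing has to be rebuilt while users are waiting.
        self._with_cache = {n: build_factory(n, True) for n in names}
        self._without_cache = {n: build_factory(n, False) for n in names}
        self._cache_up = threading.Event()
        self._cache_up.set()

    def set_cache_available(self, available):
        """Called by the failover logic when the cache goes down or comes back."""
        if available:
            self._cache_up.set()
        else:
            self._cache_up.clear()

    def get(self, name):
        """Return the factory that new sessions should be opened from."""
        if self._cache_up.is_set():
            return self._with_cache[name]
        return self._without_cache[name]
```

With four names this builds eight factories up front. Sessions already opened from the old factory are not touched by the switch; only new sessions pick up the other variant.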
5. Use Batch and Bulk operations for long running NCache operations – We were trying to fetch over 1 GB of data from NCache with a single Get call, and then trying to delete a similar amount of data with a single Delete/Remove call. No wonder we were getting “Operation timed out” exceptions!
NCache provides GetBulk() and DeleteBulk() operations. If you have, say, 10,000 items, run the bulk operations in batches of, say, 500 items to avoid time-out issues. YMMV, so do check the performance.
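The batching itself is simple; a minimal Python sketch is below. The `bulk_get` and `bulk_delete` callables are assumptions standing in for the cache client's bulk calls (their real signatures are not shown here), and 500 is just the example batch size from above.

```python
from itertools import islice


def chunked(keys, size):
    """Yield successive lists of at most `size` keys."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(keys)
    while True:
        batch = list(islice(iterator, size))
        if not batch:
            return
        yield batch


def get_in_batches(bulk_get, keys, batch_size=500):
    """Fetch many keys with several small bulk calls instead of one huge call.

    bulk_get(list_of_keys) is assumed to return a dict of key -> value.
    """
    result = {}
    for batch in chunked(keys, batch_size):
        result.update(bulk_get(batch))
    return result


def delete_in_batches(bulk_delete, keys, batch_size=500):
    """Delete many keys with several small bulk calls."""
    for batch in chunked(keys, batch_size):
        bulk_delete(batch)
```

For 10,000 keys and a batch size of 500 this makes 20 bulk calls, each small enough to finish well within the operation timeout in our experience; the right batch size depends on item size and network, so measure it.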
6. The issue we are currently grappling with – Invalidating or deleting a large chunk of data in NCache, while ensuring there are no dirty reads while the data is being cleared.
When we start clearing the data, we keep locks/flags in NCache which indicate that Get operations should not query NCache for data until the flag has been removed. This works fine in some cases but becomes a performance hit in others, e.g. multiple Gets. The NCache Tags feature is helpful here for storing the flags: we tag the flags so that they can be retrieved quickly. NCache automatically creates an index for the tags you create.
Also, we have certain scenarios where a clear call for a parent entity ends up clearing its child entities as well, but a concurrent Get call for the child entities has no way of knowing that the parent entities are being cleared, so the Get calls for the child entities end up returning incorrect data. This one is particularly tough to handle at the moment, and we need to check whether NCache has any existing feature that would help us lock/invalidate a chunk of data all at once.
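To make the flag approach concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `GuardedCache`, the `clearing:` key prefix, and a plain dict standing in for the distributed cache. Letting a child Get check its parent's flag as well is only one possible direction for the parent/child problem, not something we have verified, and the sketch does not close the race between checking the flag and reading the item.

```python
class GuardedCache:
    """Skips the cache for reads while a related scope is being cleared."""

    FLAG_PREFIX = "clearing:"   # hypothetical naming scheme for flag keys

    def __init__(self, store, load_from_source):
        self._store = store            # dict-like stand-in for the distributed cache
        self._load = load_from_source  # fallback, e.g. a database query

    def begin_clear(self, scope):
        self._store[self.FLAG_PREFIX + scope] = True

    def end_clear(self, scope):
        self._store.pop(self.FLAG_PREFIX + scope, None)

    def clear_scope(self, scope, keys):
        """Raise the flag, remove the keys, then always lower the flag."""
        self.begin_clear(scope)
        try:
            for key in keys:
                self._store.pop(key, None)
        finally:
            self.end_clear(scope)

    def get(self, key, scopes):
        # Bypass the cache if any scope this key depends on is being cleared.
        # A child entity would pass both its own scope and its parent's scope.
        if any(self.FLAG_PREFIX + s in self._store for s in scopes):
            return self._load(key)
        if key in self._store:
            return self._store[key]
        value = self._load(key)
        self._store[key] = value
        return value
```

The performance hit mentioned above shows up in `get`: every read pays for one flag lookup per scope, which adds up quickly for multiple Gets.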
So, these were a few things we learnt while using NCache, which could very well apply to any other distributed caching solution too.
Hope this helps someone while setting up NCache for their solution or someone facing similar issues. Please do send in any questions, clarifications and comments, and thanks for reading!