Caching (part 1)

(For Static Pages)

Situation - you have created a site that serves pages. Congratulations. It is serving a lot of pages. My word. Your server is melting. Oh well, at least you don’t need to serve a large number of pages any more - everyone’s gone.

Solution - caching. But how? Where? Lengthways?

Hopefully this will help.

Level 1 - RAM

Memcached is ubiquitous - everyone in web tech knows that you just install it, turn it on, put things in it and sit back to watch the dollars roll in. For a lot of applications, yes, this will suit you. A default installation gives you a 64MB cache, enough for quite a few web pages. Because memcached evicts the least recently used items first, the most popular pages stay in the cache longest, so if a few pages get 90% of your traffic, you are done.

However, this is not fit for all purposes. What happens if a page in your cache has something terrible in it? How do you expire it immediately? More problematically, let’s say you cache all HTML, and your layout templates for a single page type have changed, so you need to regenerate all pages in one category.

Flushing. This clears the cache, bringing your site to its knees temporarily. A bit of overkill, to be honest.

Precision. Find and delete the cache keys that belong to the pages that are no longer valid. This is good, but there may be thousands of them, and the keys might just be hashes of the URL, or something equally unfathomable. This brings us to…

Level 1.5 - cache key strategy

URL. That’s fine, but memcached keys are limited to 250 bytes and can’t contain spaces or control characters - are you going to breach that? What about arbitrary query strings?

URL hash. Unintelligible, but fixed length and (depending on the algorithm) quick to compute in code. However, it may limit you if you want to use the built-in nginx caching layer, which I have no experience with, but which I think needs just the raw URI as the key.
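As a rough sketch of the URL hash approach, something like the following would do. The function name, the `page` prefix and the choice of SHA-256 are my assumptions, not anything memcached requires. Sorting the query parameters is one way to stop arbitrary query string ordering from creating duplicate entries.

```python
import hashlib
from urllib.parse import urlsplit, parse_qsl, urlencode


def cache_key_for_url(url: str, prefix: str = "page") -> str:
    """Return a fixed-length cache key for a URL (hypothetical helper).

    Query parameters are sorted so that ?a=1&b=2 and ?b=2&a=1 share a key.
    """
    parts = urlsplit(url)
    # Canonicalise the query string by sorting its (name, value) pairs.
    query = urlencode(sorted(parse_qsl(parts.query, keep_blank_values=True)))
    canonical = parts.path + ("?" + query if query else "")
    # A SHA-256 hex digest is always 64 characters, whatever the URL length.
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return f"{prefix}:{digest}"
```

The key is always the prefix plus 64 hex characters, so it stays well inside memcached's key length limit however long the URL gets.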

Semantic information. SEO is useful here, because your URL is probably broken down into category and article name, maybe with a trailing ID for the machine. You could use category-id as the cache key, which will probably sit comfortably under the key length limit, and we’re some way to solving the category expiration problem, because the keys contain human-readable information. But you can’t do it without a supporting script, either a hand-rolled or an off-the-shelf application, to send memcached a lot of delete calls. So:

Versioning. If you’re limited to one daemon, then use your app to control cache hits by incrementing a version number. Summarising this answer, cache your pages with a template-specific version number, like

2:news:123

Then change your app to increment the version number for the news category. Your app will now look for 3:news:123, miss, rebuild the page and put it in the cache. The old version will drop out when it expires, or when memcached evicts it to make room for newer items.
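Here is a minimal sketch of the versioning idea. A plain dict stands in for memcached so the example runs on its own, and `template_versions`, `page_key`, `get_page` and `invalidate_category` are all names I have made up for illustration. In a real app the version numbers would live in config or a database.

```python
page_cache = {}                    # stand-in for a memcached client
template_versions = {"news": 2}    # assumed to live in app config or a database


def page_key(category, page_id):
    """Build a key like 2:news:123 from the category's current version."""
    return f"{template_versions[category]}:{category}:{page_id}"


def get_page(category, page_id, render):
    """Return the cached page, rendering and storing it on a miss."""
    key = page_key(category, page_id)
    html = page_cache.get(key)
    if html is None:
        html = render(category, page_id)
        page_cache[key] = html
    return html


def invalidate_category(category):
    """Bump the version so every page in the category misses on next request."""
    template_versions[category] += 1
```

Calling `invalidate_category("news")` deletes nothing. The old 2:news:* entries are never asked for again, and pages are rebuilt one at a time as they are requested, rather than all at once as with a flush.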

Update - this page gives a step-by-step walkthrough of how a deeper version of this method can work (where the modified timestamp of the model classes forms part of the cache key).

Level 1.7 - API responses

If your page content has a dependency on a third party API, then you might want to apply the above practice not only to your own pages, but also to the raw data returned from these APIs. This reduces your reliance on the third party, and lets multiple pages share the same API request (latest posts, for example) without hammering it.
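A small sketch of sharing one API response between pages, assuming a dict as a stand-in for memcached and a hypothetical `cached_api_call` helper. The five-minute default lifetime is an arbitrary choice of mine. The clock is passed in only to make the expiry easy to test.

```python
import time

_api_cache = {}  # key -> (expires_at, data); stand-in for memcached


def cached_api_call(key, fetch, ttl_seconds=300, now=time.monotonic):
    """Return cached data for key, calling fetch() only on a miss or after expiry."""
    entry = _api_cache.get(key)
    if entry is not None and entry[0] > now():
        return entry[1]
    data = fetch()  # the slow third party request happens only here
    _api_cache[key] = (now() + ttl_seconds, data)
    return data
```

The homepage and a sidebar widget could both call `cached_api_call("api:latest_posts", fetch_latest_posts)`, where `fetch_latest_posts` is whatever function talks to the third party. Only the first caller within the lifetime pays for the request.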

Level 1.9 - 503 responses

If there’s nothing in memcache, then you have to do the hard graft of generating the data. This may take a long time. You may get multiple clients requesting the same data while the first build is still running, and they will pile up until one of them completes and populates the cache. A solution is to serve a 503 response immediately (with a meta-refresh), and put the request on a queue for background processing. This should be kinder to your search engine rankings than a slow page or a timeout, as crawlers treat a 503 as a temporary condition and come back later, particularly if you send a Retry-After header.
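The shape of this, sketched with in-process stand-ins: a dict for memcached, a `queue.Queue` for the job queue, and a set to stop the same page being queued twice. The function names and the five-second retry are assumptions. A real setup would run the worker in a separate process, and could use memcached's `add` command as the "already queued" marker, since a set in one process is not visible to the others.

```python
import queue

page_cache = {}               # stand-in for memcached
build_queue = queue.Queue()   # jobs for a background worker
in_progress = set()           # keys already queued, to avoid duplicate builds

RETRY_SECONDS = 5


def handle_request(key):
    """Return (status, headers, body) for a page request."""
    html = page_cache.get(key)
    if html is not None:
        return 200, {"Content-Type": "text/html"}, html
    if key not in in_progress:
        in_progress.add(key)
        build_queue.put(key)
    # Holding page: the browser retries via meta-refresh, crawlers via Retry-After.
    body = (
        f'<html><head><meta http-equiv="refresh" content="{RETRY_SECONDS}">'
        "</head><body>Building this page, please wait.</body></html>"
    )
    headers = {"Content-Type": "text/html", "Retry-After": str(RETRY_SECONDS)}
    return 503, headers, body


def run_worker_once(render):
    """Process one queued job; a real worker would loop in its own process."""
    key = build_queue.get_nowait()
    page_cache[key] = render(key)
    in_progress.discard(key)
```

However many clients ask for the same missing page, only one build is queued, and each of them gets an answer straight away instead of holding a connection open.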

Next - if things are expiring from memcache, let’s have a filecache.