Hard drives are cheap, CPUs aren’t; cache your data, fool!

I have been developing for a while now, both as a hobby and as part of my education, and I see many people fretting that their database could possibly be a few MB smaller. So, when you talk to them about caching some data, they fear their hard drives will fill up way too soon. And, on one side, they are right; if you cache all your output, you could very well end up with more space eaten up by cache than by the data and code used to generate it. I agree that for most people that’s not a very good solution. You don’t want to store every possible output indefinitely.

So, how about storing the output for whatever gets requested, leaving it there for an hour (or any other amount of time; how long exactly depends on the situation), and deleting it after that? Or possibly leaving it there until there has been no request for it for some time? This sounds like a good idea, and it is for the more basic things. (For example, it is something I could implement on parts of this site.) However, as soon as people can log in, this would mean caching a different version for each user, which is plain stupid. I personally do not believe that caching everything is a good option, except for specific websites. I do, however, have a few ideas that I feel could improve performance for quite a few applications I’ve seen.
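To make the "keep it for an hour" idea concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the class name `TimedOutputCache`, the in-memory dict (a real site would more likely write to disk or a cache server), and the `sliding` flag, which switches from "delete after a fixed time" to "delete once nobody has asked for it for a while".

```python
import time
from typing import Callable, Dict, Tuple


class TimedOutputCache:
    """Hypothetical helper: keeps rendered output for a limited time."""

    def __init__(self, ttl_seconds: float, sliding: bool = False,
                 clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.sliding = sliding  # True: every hit restarts the timer
        self.clock = clock      # injectable so the cache can be tested
        self._entries: Dict[str, Tuple[float, str]] = {}

    def get(self, key: str, render: Callable[[], str]) -> str:
        """Return the cached output for key, rendering it only when needed."""
        now = self.clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self.ttl_seconds:
            if self.sliding:
                # Keep the entry alive for as long as it keeps being requested.
                self._entries[key] = (now, entry[1])
            return entry[1]
        # Missing or too old: do the expensive work once and remember it.
        output = render()
        self._entries[key] = (now, output)
        return output

    def purge_expired(self) -> int:
        """Drop entries older than the TTL; returns how many were removed."""
        now = self.clock()
        stale = [key for key, (stored_at, _) in self._entries.items()
                 if now - stored_at >= self.ttl_seconds]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

Usage would look something like `cache.get("/about", render_about_page)`, where `render_about_page` is whatever function builds the page. Note that the key here is only the path, which is exactly why this breaks down once output differs per logged-in user.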

Parse caching

Parsing user input into something showable is something most web developers will, at some point, have to deal with. This is most relevant in forum software, so I’ll use that as an example. I have a few running locally, and I mainly see three different ways to deal with it:

  • Parse the data when it’s requested.
  • Parse the data when it’s submitted.
  • Parse the data halfway when it’s submitted, and finish it when it’s requested.

All three have their pros and cons; some are obvious, others only come to mind when you start thinking about it. I’ll briefly elaborate on all three approaches.
Parsing the data on request is, in my opinion, the most natural thing to do. It’s also the easiest, and it’s what most people do. However, it is also the most expensive, because every time somebody looks at a post you have to run through all the data, match everything that needs to be replaced, and make sure it’s safe to output. This all takes a lot of time, which we don’t want.
Parsing the data on submission, on the other hand, is probably the cheapest thing to do. At first glance, it’s just as hard as parsing on request, except you only need to do it once. However, this has one major drawback: what if you want to edit the data? I gave it a try, and trust me when I say un-parsing isn’t nearly as easy as parsing. Sure, HTML bold to BB bold is easy, but links are harder, and when you have more complicated stuff (like the /me feature that mimics what IRC does) it stops being fun.
Parsing halfway can be done in a lot of ways, but it will mostly combine the disadvantages of both other options.
May I suggest what I believe to be a solution to all the problems the previous methods cause, at the cost of just some disk space: parse caching. (OK, so I suck at naming things...) It’s very simple, yet I don’t think I’ve ever seen somebody do it. What you do is parse the data when it’s submitted, but also store the originally submitted data. This way, there is no need to parse it every time somebody wants to see it, yet you can still edit it easily by editing the original data and overwriting the parsed data with the parsed version of the new data. Alternatively, you could parse the data the first time it’s requested and then store it in the cache. In theory, that would save you the space the cache takes up for data that’s never requested, but I think that in the real world, people will often request the data they just submitted just to make sure it’s correct.
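A rough sketch of parse caching in Python might look like the following. The names `parse_post`, `Post`, and `PostStore` are made up for illustration, the parser is a toy that only handles `[b]...[/b]`, and the dict stands in for a database table with one column for the raw text and one for the parsed text.

```python
import html
import re
from dataclasses import dataclass
from typing import Dict


def parse_post(raw: str) -> str:
    """Toy parser (assumption): escape HTML, then turn [b]...[/b] into <strong>."""
    safe = html.escape(raw)
    return re.sub(r"\[b\](.*?)\[/b\]", r"<strong>\1</strong>", safe,
                  flags=re.DOTALL)


@dataclass
class Post:
    raw: str     # what the user typed; only ever shown in the edit form
    parsed: str  # cached result of parse_post(raw); shown to every reader


class PostStore:
    """In-memory stand-in for a table with a raw and a parsed column."""

    def __init__(self) -> None:
        self._posts: Dict[int, Post] = {}
        self._next_id = 1

    def submit(self, raw: str) -> int:
        """Parse once at submission time and keep both versions."""
        post_id = self._next_id
        self._next_id += 1
        self._posts[post_id] = Post(raw=raw, parsed=parse_post(raw))
        return post_id

    def edit(self, post_id: int, new_raw: str) -> None:
        """Editing replaces the raw text and overwrites the cached parse."""
        self._posts[post_id] = Post(raw=new_raw, parsed=parse_post(new_raw))

    def view(self, post_id: int) -> str:
        """Viewing never parses; it just returns the stored output."""
        return self._posts[post_id].parsed

    def raw_for_editing(self, post_id: int) -> str:
        """The edit form gets the original text, so no un-parsing is needed."""
        return self._posts[post_id].raw
```

The parser runs only in `submit` and `edit`, so a post that is viewed a thousand times is still parsed once. The lazy variant mentioned above would leave `parsed` empty at submission and fill it in on the first `view`.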

I think this read was long enough for now (I know the write was), so that’s it for now. I’ll give you an insight into my brain again soon. 🙂

About Jory

Born in 1988, Software Engineer, Dutch.
This entry was posted in Programming, The Internet.

2 Responses to Hard drives are cheap, CPUs aren’t; cache your data, fool!

  1. Whenever I’m working on an application that needs data parsed like what you described above, I always add one additional field to the database table to hold the unparsed data. And the only time this data is presented is to an editor. When the edit is complete, the unparsed data gets updated and the parsed data is overwritten with the new reflection of the unparsed.

    Does this take up more disk space? Yes. Who cares? Not me. When you start to talk about a high volume of visitors to the application, scalability is important. And this is the most scalable solution. You parse once when data is submitted, and never have to parse again, unless it’s edited. Simple logic, really.

    Don’t waste extra CPU cycles on unnecessary parsing of what is more often than not static data.

  2. Jory says:

    Man, you’re so much better at explaining stuff than I am... that’s exactly what I meant. 😛
