
Most Sites Just Aren’t That Popular (or, why you can stick with your RDBMS) 27 June 2009

Posted by manniwood in SQL.

Non-relational databases like BigTable, SimpleDB, CouchDB, and HBase are getting a lot of press lately. Predictably, articles are popping up that forecast the end of the relational database management system (RDBMS). Non-relational databases are being presented as solutions to the database scalability problems of the Googles, Amazons, and Twitters of the world.

Is your site the next Twitter? Should you be thinking of abandoning your RDBMS just in case you can’t scale to meet your site’s future demands? Or is your site, like most sites, just not that popular?

There’s a particularly strongly-argued blog entry called “Social Media Kills the Database” that declares

social media is driving the final stake into the large analytical RDBMS (Relational Database Management System).

and

The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more damage to engineers trying to write scalable software.

Unfortunately, the article extrapolates from the experience of building social media sites (like Twitter) out to all sites, and as a result draws some over-broad conclusions about the impending doom of the RDBMS.

The RDBMS is doubtless not the best data store for large social web sites like Twitter. But does that mean the RDBMS is unsuitable for all the other sites out there? Or, more importantly, for your site?

As a general rule of thumb, if you are not expecting Google traffic levels, I think the RDBMS is still the way to go.

First off, I think it’s really important to clarify what the tradeoffs are between RDBMSs like Oracle and PostgreSQL, and non-relational databases like BigTable or HBase.

So, what are the two attributes we really want from a database?

  1. data correctness (when I ask how many users are in the system, do I get the correct answer?) and
  2. speed (how long does it take to find out how many users are in the system?).

Now pick one.

Well, OK, that’s a bit cruel and a bit unrealistic. But in the real world, we do sometimes have to ask, when push comes to shove: would I choose speed over correctness, or correctness over speed?

With the RDBMS, we choose correctness first. That’s not to say that speed doesn’t matter, but when we must have the correct answers, an RDBMS provides us with lots of tools to ensure data integrity.
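
To make those integrity tools a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module as a stand-in for a full RDBMS like PostgreSQL or Oracle. The users and posts tables, their columns, and the try_insert helper are all made up for illustration; the point is only that the database itself, not the application code, refuses bad rows.

```python
import sqlite3


def build_demo_db():
    """Create an in-memory database with a few declarative constraints.

    The schema is hypothetical, invented for this example.
    """
    conn = sqlite3.connect(":memory:")
    # SQLite only enforces foreign keys when asked to, per connection.
    conn.execute("PRAGMA foreign_keys = ON")
    conn.execute(
        """
        CREATE TABLE users (
            id    INTEGER PRIMARY KEY,
            email TEXT NOT NULL UNIQUE
        )
        """
    )
    conn.execute(
        """
        CREATE TABLE posts (
            id      INTEGER PRIMARY KEY,
            user_id INTEGER NOT NULL REFERENCES users(id),
            title   TEXT NOT NULL CHECK (length(title) > 0)
        )
        """
    )
    return conn


def try_insert(conn, sql, params):
    """Run one INSERT in its own transaction.

    Returns True if the row was committed, False if a constraint
    rejected it (in which case the transaction is rolled back).
    """
    try:
        with conn:  # commits on success, rolls back on exception
            conn.execute(sql, params)
        return True
    except sqlite3.IntegrityError:
        return False


if __name__ == "__main__":
    conn = build_demo_db()
    add_user = "INSERT INTO users (email) VALUES (?)"
    add_post = "INSERT INTO posts (user_id, title) VALUES (?, ?)"

    print(try_insert(conn, add_user, ("ann@example.com",)))  # accepted
    print(try_insert(conn, add_user, ("ann@example.com",)))  # duplicate email
    print(try_insert(conn, add_post, (999, "Orphan post")))  # no such user
    print(conn.execute("SELECT COUNT(*) FROM users").fetchone()[0])
```

When I ask this database how many users are in the system, the answer cannot include a duplicate or a half-written row, because such rows were never allowed in to begin with.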

With the non-relational database, we choose performance first. That’s not to say that data integrity doesn’t matter; it’s just that speed and scalability matter even more with non-relational databases like BigTable and HBase.

If you know which quality is more important to your project (correctness or speed) then you really don’t need to read the rest of this blog entry.

But what if you need your RDBMS to scale? You’ll need to spend a lot of money to get a beefier single server, because to maintain their guarantees of data integrity, RDBMSs do not scale across multiple servers very easily. (For more on this, read the section “Persistence Layer” in the “Scaling Gracefully” chapter of Software Engineering for Internet Applications.)

On the other hand, if you need your non-relational database to offer guarantees of data correctness, to the best of my knowledge, you’re pretty much out of luck. (Which, I guess, could be translated as: spend the money on a really huge box to run an RDBMS, because it will still cost less than figuring out how to bring RDBMS-style data integrity guarantees to your non-relational database.)

In the account given in “Social Media Kills the Database”, data correctness, though nice, was not the primary requirement for their data store:

If our users are analysing millions of documents, they’re not going to care if there’s 15,000 unique Authors, or 15,001.

And this is why they were very wise to abandon their RDBMS.

On the other hand, I currently work on a database-backed web site that helps companies do financial budgeting and forecasting. When analysing millions of dollars, my clients will care if an expense is $15,000.00 or $15,001.00. I am therefore correct in having chosen an RDBMS to hold my applications’ data.
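
As a rough illustration of why this matters for financial data, the sketch below moves money between two budget lines inside a single transaction, again using Python's sqlite3 module as a stand-in for a production RDBMS. The budget_lines table, its contents, and the move_funds helper are hypothetical, invented for this example. Storing amounts as integer cents is also an assumption, made to avoid floating-point rounding.

```python
import sqlite3


def make_budget_db():
    """Create a tiny, hypothetical budget table with amounts in cents."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE budget_lines (
            name         TEXT PRIMARY KEY,
            amount_cents INTEGER NOT NULL CHECK (amount_cents >= 0)
        )
        """
    )
    with conn:
        conn.executemany(
            "INSERT INTO budget_lines (name, amount_cents) VALUES (?, ?)",
            [("travel", 1_500_000), ("training", 200_000)],  # $15,000.00, $2,000.00
        )
    return conn


def move_funds(conn, source, target, cents):
    """Move cents from one budget line to another, all or nothing.

    If the debit would drive the source line negative, the CHECK
    constraint fails and the already-applied credit is rolled back too.
    """
    try:
        with conn:  # one transaction: commit both updates or neither
            conn.execute(
                "UPDATE budget_lines SET amount_cents = amount_cents + ? "
                "WHERE name = ?",
                (cents, target),
            )
            conn.execute(
                "UPDATE budget_lines SET amount_cents = amount_cents - ? "
                "WHERE name = ?",
                (cents, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def amount(conn, name):
    """Return the current amount, in cents, for one budget line."""
    row = conn.execute(
        "SELECT amount_cents FROM budget_lines WHERE name = ?", (name,)
    ).fetchone()
    return row[0]


def total_cents(conn):
    """Return the sum of all budget lines, in cents."""
    return conn.execute("SELECT SUM(amount_cents) FROM budget_lines").fetchone()[0]
```

With this setup, move_funds(conn, "training", "travel", 300_000) returns False and leaves both rows exactly as they were, because the training line only holds $2,000.00. The total across all lines is the same before and after every call, whether the move succeeds or fails, which is the kind of guarantee my clients are paying for.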

This does not mean that I’m going to extrapolate out from my own experience and declare that because it worked for me, all sites should therefore use an RDBMS, and that financial web sites are driving a stake into the heart of those unreliable non-relational databases.

However…

This is a good time to ask yourself how popular you think your site will be. Because if you think it will be a small-to-medium-sized site, you should use an RDBMS even if you don’t think you need the data integrity guarantees of an RDBMS. I mean, if you can serve your audience at the speed they demand, and have data integrity on top of that, why not have it all?

Remember that most sites do not have to handle the traffic loads of Twitter or Google. Most sites’ traffic loads do not bog down an RDBMS on a decent piece of hardware. (In fact, given the newness of databases like BigTable and HBase, one can assume that most sites currently use an RDBMS, and, apparently, they are handling their loads just fine.)

Remember the old adage about premature optimisation being the root of all evil? As a general rule of thumb, use an RDBMS. It’s really nice to use a data store that has guarantees on data integrity.

But, by all means, if you are building the next Google or Twitter, do what you have to do to scale, scale, scale.

Just realise that most sites are not the next Google or Twitter.

Comments»

1. gsteph22 - 27 June 2009

Hey there,

Thanks for linking to my article! I mostly agree with what you said — what my post was about (probably not too clear), is that it’s the end of the swiss-army RDBMS. People have figured out that different data storage problems need different paradigms, and we finally have the tools to scale appropriately :)

manniwood - 28 June 2009

I liked your original post: it had a splashy, somewhat hyperbolic title in the manner of Coding Horror’s Jeff Atwood. Your post sounds like it was rooted in hard-won knowledge, as though you’d been burned by your RDBMS. ;-) I think I need to write another post on how RDBMSs are incorrectly (and dishonestly) marketed as swiss-army data stores, but how many developers intuitively know they are not Jacks-of-all-trades, but rather flawed/incomplete implementations of the relational data model.

2. Non-Relational Databases Are a Performance Hack « Manni Wood - 7 July 2009

[...] But if there’s one thing Wiggins got really, really right is that sometimes you have no choice but to throw guaranteed data integrity out the window to meet massive scalability requirements. I happily acknowledge that sometimes you just need a performance hack. But please realise that it is a performance hack, not a best practice. It should only be used sometimes, and certainly not as often as current thinking would have you believe. [...]

3. Ken Farmer - 22 July 2009

Nicely put. Many of the catastrophes that I see are because developers are misapplying lessons learned in one area to another.

And perhaps there is a social cognitive reason to explain why developers want to apply the lessons learned from managing petabytes to databases unlikely to grow beyond tens of gigabytes. Much like how *most* Americans believe that one day they’ll be rich and famous, most developers believe that their website will hit the big time. Even though history says otherwise.

So, perhaps for most people – deploying a non-relational database is like buying a lottery ticket. Something that’s very difficult to reason people through.

manniwood - 22 July 2009

Your comment is so spot on, it made me chuckle. I love your comparison of the (il)logic of planning for a future of fame and fortune to that of choosing a non-relational database right away. If I was going to re-title this blog entry, I’d go with “Non-relational databases: the aspirational data store!”