jump to navigation

Another Revenge of the Nerds 30 June 2009

Posted by manniwood in SQL.
3 comments

Any sufficiently complicated C or Fortran program contains an ad-hoc, informally-specified, bug-ridden, slow implementation of half of Common Lisp.

– Philip Greenspun

So what about other data models?—the “object oriented model”, for example, or the “heirarchic model”, or the CODASYL “network model”, or the “semistructured model”? In my view, these other models are just not in the same ballpark [as the relational model]. Indeed, I seriously question whether they deserve to be called models at all.

– C. J. Date, SQL and Relational Theory, Appendix A, “The Relational Model”

Manni’s Prediction: Any sufficiently complicated data store will contain an ad-hoc, informally-specified, bug-ridden, (but perhaps fast) implementation of half of the relational data model.

I predict that, the same way Paul Graham’s “Revenge of the Nerds” says that newer programming languages are becoming successively more like Lisp as computers get more powerful, so too will the relational data model become more popular as computers get more powerful.

And yet, just today, I read another article discussing the imminent demise of the RDBMS.

On the one hand, I don’t think all data need to be stored relationally, as I’ve said before. Sometimes you just need an object store, or a document store, or just a plain text file. Not all data have to be relational data.

But when we look at data stores like Amazon’s SimpleDB, or Google’s BigTable, would those data stores provide relational features if they could do so without hampering response time and scalability?

I think that as computers become more powerful, many data stores that currently eschew relational features for performance reasons will start to adopt those features—much the same way Paul Graham points out programming languages are getting more and more like Lisp.

I’m reminded of a quote from a blog entry (“Social Media Kills the Database”) that keeps bearing creative fruit for me:

The internals are totally opaque for most [RDBMS] solutions. Optimization, required for any performant solutions, becomes a “black art”.

The exact same criticism has been (fairly?) leveled at Lisp.

And yet the opacity of the internals of RDBMSs (or Lisp) may end up being their greatest strengths. As C. J. Date puts it, the fact that the relational data model is not tied to any one implementation leaves lots of opportunity “to adopt creative approaches to questions of implementation.”

Or, as Paul Graham says about Lisp, “Lisp was a piece of theory that unexpectedly got turned into a programming language.”

So while Lisp itself may or may not take over the world, languages are becoming more and more Lisp-like, because the theory behind Lisp has proven to be too good to ignore.

And, I would make a similar argument for the relational data model. Perhaps particular RDBMSs are on their way out; maybe even SQL will one day go away; but only to be replaced by even cleaner, faster implementations of the relational data model. The relational data model is not just here to stay: it’s actually going to become more popular as it becomes technically feasible to use it in places where it currently has performance issues. It’s just a matter of time.

Only the R Matters in RDBMS 29 June 2009

Posted by manniwood in SQL.
3 comments

“Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.”

– Frederick P. Brooks, Jr., The Mythical Man Month, chapter 9, section “Representation Is the Essence of Programming”.

Bradford’s blog entry, “Social Media Kills the Database”, really struck a chord with me. Although I am an unabashed RDBMS fanboy, I may be an unusual one, because in some ways, I wish RDBMSs were used less than they are today. Not very fanboy-ish of me, but let me explain.

I was struck by the way Bradford called modern RDBMSs the “Swiss Army knives” of data storage, and referred to their “jack-of-all-trades nature”.

At first I thought that wasn’t a fair characterisation of RDBMSs, because I only use the relational features of RDBMSs, and figured everybody else did too. But then I thought of how many RDBMS vendors have added non-relational capabilities to their database products, like object-relational features and XML support; and how their marketing departments describe their products as all things to all people. And I realised Bradford was on to something: RDBMS vendors are trying to sell their products as the Swiss Army knives of data storage! I would argue that this hurts RDBMSs, and it confuses developers who try to use RDBMSs for things RDBMSs aren’t good at.

So what are RDBMSs good at? They’re pretty good implementations of the relational data model. And the relational data model is a pretty useful way of representing (and thinking about) data.

Now one thing I consider to be a bit of a misperception of RDBMSs is that

Traditional relational databases (like those used in finance) evolved in the 70’s and 80’s as a glorified spreadsheet-and-file-cabinet. The metaphor prevails today.

The truth is actually a bit more peculiar and interesting.

The relational data model began as a set of research papers published by IBM employee E. F. Codd in 1969 and throughout the 1970s. So unlike Google’s BigTable, or Amazon’s SimpleDB, which began as implementations, the relational model of data began as a set of ideas, not as source code.

This really appeals to me: the relational data model strives to be a mathematically and logically rigorous way of representing data, and exists outside of any one implementation. I can’t think of any other data model that this is true of.

In fact, work on the relational model continues to this day! C. J. Date in particular continues to work on, and publish books about, the relational data model. It’s particularly enlightening (not to mention entertaining) to read books like Database in Depth, where Date says that modern RDMBSs — and standard SQL — do a poor job of implementing the relational data model.

But I must say, even though RDBMSs are an imperfect implementation of the relational data model, I still love RDBMSs (and the relational model) to death. Do you think Unix/Linux is cool for being built around a central metaphor (“everything’s a file”)? SQL is way cooler: it tries to be an implementation of an actual data model based in logic and set theory!

Of course, we have to ask: is the relational data model, as implemented by SQL in today’s RDBMSs, actually useful for anything?

Honest answer: it depends.

If you use heavily object-oriented code, and have a bunch of objects that you wish to store and retrieve, don’t use an RDBMS; use an object store. The impedence mismatch is too great. (Here I will link to my favourite blog entry of all time, “The Vietnam of Computer Science”, which says that Object/Relational Mapping is the Vietnam of computer science.)

If you have a bunch of hierarchical data to store, consider a filesystem, or XML. Don’t use an RDBMS.

If you have a bunch of documents to store, consider some sort of document store. Don’t use an RDBMS (unless you need transactions or guarantees of data integrity, and don’t mind storing your documents as BLOBs).

On the other hand, if you have a problem that is particularly data-driven, and you’d like to store and act upon your data in a way that is reasonably rigorous, and follows rules and constraints that you’d like to define, use an RDBMS.

Or, if your problem is particularly data-driven, and you want to use a tool that helps you think and explore ideas in a data-oriented language, use an RDBMS.

In fact, let me repeat Brooks’ quote here:

Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.

Often times, once you’ve correctly understood how to model your problem’s data, you’ll find that you’ve come to understand your problem at a deeper level as well, and the solutions to your problems will have become obvious. RDBMSs and SQL really help here. They are not a panacea, but they are useful tools.

A favourite example of mine is the writings of Philip Greenspun. Most of his code examples are in SQL. For instance the User Registration and Management chapter of Software Engineering for Internet Applications has all of its code examples in SQL. It’s just easier to think of data representation problems in SQL.

If data correctness is paramount, use an RDBMS. If data corruption is unacceptable, use an RDBMS.

Want to know my favourite (paraphrased) definition of what a relational database is? It comes from C. J. Date: A relational database is a series of propositions that always must evaluate to true.

RDBMSs are designed from the ground up not to lose data, not to corrupt data, not to have your data in an inconsistent state. RDBMSs try very hard to live up to C. J. Date’s ideal: every proposition in an RDBMS at any time ideally evaluates to true. This is incredibly useful behaviour for a multitude of purposes.

If you’re going to ask questions of your data, and you need correct, repeatable, reproducible answers, use an RDBMS.

Once you’ve found the best way to model your problem’s data, you can have your RDBMS enforce your schema and reject bad data. This goes a long way to helping you solve your problem and ensuring correct results. The declarative nature of SQL is a real boon for defining data and their relationships. Look at the sample table definitions in the User Registration and Management chapter of Software Engineering for Internet Applications, and tell me if you’d rather define all those relationships and constraints using EJBs. I didn’t think so.

And when I say use an RDBMS, I mean use your RDBMS.

Learn to think in SQL.

Write your data model in a text editor, coding the table and constraint definitions by hand. Keep your schema code in a plain text SQL file that you can load into your database via its command-line client — this will be a runnable/loadable definition of your database. Keep the SQL file under source-code management, as you do with your application code, so that you can maintain the history of your data schema as it evolves.

Do not hide your RDBMS behind an ORM! Realise that objects, though popular, are not the best representation of relational data: the relational model is the best representation of relational data. This means that you should try to hold onto that representation even in your business logic code: lists of maps translate very well to relations of tuples (or tables of rows; or result sets of rows). Objects do not. (Again, look at “The Vietnam of Computer Science”, and my own blog entry on why objects are overrated.) Instead, prefer systems like iBATIS or (shameless plug) Pybatis.

Finally, if you need many ways to report on the same set of data, use an RDBMS. It’s very sad that many developers seem to fear SQL. Do not fear SQL! Pick up a copy of the SQL Cookbook and marvel at the the concise power of SQL queries, and ask yourself: “would I really want to do this in Java?”

Unfortunately, a lot of misinformation surrounds RDBMSs, and some of it is perpetuated by over-eager marketing departments of the major vendors, trying to sell their wares as all things to all people. An RDBMS is not an object store. It’s not a heirarchical file system. It’s not Google’s BigTable. Even if a vendor tells you it is!

But if you use your RDBMS for its core competency (storing and querying data according to the relational model — at least as currently codified by SQL) it won’t take you long to wonder how you got by without it.

Most Sites Just Aren’t That Popular (or, why you can stick with your RDBMS) 27 June 2009

Posted by manniwood in SQL.
5 comments

Non-relational databases like BigTable, SimpleDB, CouchDB, and HBase are getting a lot of press lately. Predictably, articles are popping up predicting the end of the relational database management system (RDBMS). Non-relational databases are being presented as solutions to the database scalability problems of the
Googles, Amazons, and Twitters of the world.

Is your site the next Twitter? Should you be thinking of abandoning your RDBMS just in case you can’t scale to meet your site’s future demands? Or is your site, like most sites, just not that popular?

There’s a particularly strongly-argued blog entry called “Social Media Kills the Database” that declares

social media is driving the final stake into the large analytical RDBMS (Relational Database Management System).

and

The ACIDy, Transactional, RDBMS doesn’t scale, and it needs to be relegated to the proper dustbin before it does any more damage to engineers trying to write scalable software.

Unfortunately, the article extrapolates from the experience of building social media sites (like Twitter) out to all sites, and makes some over-broad conclusions about the impending doom of the RDBMS as a result.

The RDBMS is doubtless not the best data store for large social web sites like Twitter. But, does that mean the RDBMS is unsuitable for all the other sites out there? Or, more importantly, your site?

As a general rule of thumb, if you are not expecting Google traffic levels, I think the RDBMS is still the way to go.

First off, I think it’s really important to clarify what the tradeoffs are between RDBMSs like Oracle and PostgreSQL, and non-relational databases like BigTable or HBase.

First, what are two attributes we really want from a database?

  1. data correctness (when I ask how many users are in the system, do I get the correct answer?) and
  2. speed (how long does it take to find out how many users are in the system?).

Now pick one.

Well, OK, that’s a bit cruel and a bit unrealistic. But in the real world, we do have to sometimes say, when push comes to shove, would I choose speed over correctness, or correctness over speed?

With the RDBMS, we choose correctness first. It’s not to say that speed doesn’t matter, but when we must have the correct answers, an RDBMS provides us with lots of tools to ensure data integrity.

With the non-relational database, we choose performance first. It’s not to say that data integrity doesn’t matter, it’s just that speed and scalability matter even more with non-relational databases like BigTable and HBase.

If you know which quality is most important to your project (correctness or speed) then you really don’t need to read the rest of this blog entry.

But what if you need your RDBMS to scale? You’ll need to spend a lot of money to get a beefier single server, because to maintain their guarantees of data integrity, RDBMSs do not scale across multiple servers very easily. (In fact, read the section Persistence Layer in the “Scaling Gracefully” chapter of Software Engineering for Internet Applications.

On the other hand, if you need your non-relational database to offer guarantees of data correctness, to the best of my knowledge, you’re pretty much out of luck. (Which, I guess, could be translated as: spend the money on a really huge box to run an RDBMS, because it will still cost less than figuring out how to bring RDBMS-style data integrity guarantees to your non-relational database.)

In the recounting of “Social Media Kills the Database”, data correctness, though nice, was not the primary requisite for their data store:

If our users are analysing millions of documents, they’re not going to care if there’s 15,000 unique Authors, or 15,001.

And this is why they were very wise to abandon their RDBMS.

On the other hand, I currently work on a database-backed web site that helps companies do financial budgeting and forecasting. When analysing millions of dollars, my clients will care if an expense is $15,000.00 or $15,001.00. I am therefore correct in having chosen an RDBMS to hold my applications’ data.

This does not mean that I’m going to extrapolate out from my own experience and declare that because it worked for me, all sites should therefore use an RDBMS, and that financial web sites are driving a stake into the heart of those unreliable non-relational databases.

However…

This is a good time to ask yourself how popular you think your site will be. Because if you think it will be a small-to-medium sized site, you should use an RDBMS even if you don’t think you need the data integrity guarantees of an RDBMS. I mean, if you can serve your audience at the speed they demand, and have data integrity on top of that, why not have it all?

Remember that most sites do not have to handle the traffic loads of Twitter or Google. Most sites’ traffic loads do not bog down an RDBMS on a decent piece of hardware. (In fact, given the newness of databases like BigTable and HBase, one can assume that most sites currently use an RDBMS right now, and, apparently, they are handling their loads just fine.)

Remember the old adage about premature optimisation being the root of all evil? As a general rule of thumb, use an RDBMS. It’s really nice to use a data store that has guarantees on data integrity.

But, by all means, if you are building the next Google or Twitter, do what you have to do to scale, scale, scale.

Just realise that most sites are not the next Google or Twitter.