The RDBMS as Final Arbiter of Your Data Model 8 July 2009
Posted by manniwood in Django, SQL.trackback
Mwanji Ezana asked a question about my blog post, Objects are Hammers: Why do I want my RDBMS to be final arbiter of my data model? Shouldn’t my application’s object model be the final definition of my data model, and shouldn’t my RDBMS be subordinate to my application’s object model?
In some ways, my blog entry Only the R matters in RDBMS answers that question, but I was not specifically talking on this topic.
So, let me state my bias clearly: I think making your application’s object model the final arbiter of your data model is wrong. I realise this may be an unpopular bias. After all, even Django, which I have fallen in love with, encourages you to model your data with Python objects, using Django’s ORM facilities to write all your DDL code for you.
In fact, my copy of The Definitive Guide to Django says in chapter 5, page 66, that
Writing Python is fun, and keeping everything in Python limits the number of times your brain has to do a “context switch.” It helps productivity if you keep yourself in a single programming environment/mentality for as long as possible. Having to write SQL, then Python, and then SQL is disruptive.
I think that, for most developers, we could swap out Python for Java, or Python for Ruby, and we would find much agreement that writing the application language is fun, and that (perhaps I’m inferring this) writing SQL is not so much fun.
I disagree strongly on at least two counts.
The first count is that context switching from language to language on a single project, even during a single coding session, is acceptable if the languages make the expression of certain ideas easier. For instance, my Django book does not go on to claim that all the HTML templating code be written in Python. There is a context switch every time I switch from Python to HTML to Django’s templating language (or, in my case, Jinja). And it’s totally worth it, because HTML is best expressed in, well, HTML, not Python. And dynamic HTML is better handled using Jinja (or Django’s templating language) and not Python.
But the same goes for SQL, and I cannot stress this enough. If you have data that needs to be modeled, manipulated, and stored reliably, you’ll find that the context switch from Python to SQL and back is as appropriate as the switch you’re already making from Python to HTML/templates and back.
Which brings me to my second count: coding in SQL is not only fun, but preferable to coding in Python (or any other general purpose programming language) when SQL is used for what it’s best at: data representation and manipulation. It’s actually kind of funny, because there’s a lot of interest in domain-specific languages lately, and yet developers are falling over themselves to avoid learning SQL, the ultimate domain-specific language for data modeling and manipulation!
I think that most developers are secretly afraid of learning SQL. I don’t know why this is, but I think that anybody who creates database-backed web sites for a living, as I do, has to know SQL. Not knowing SQL will severely limit a developer’s growth, not to mention his career. Also, on the face of it, there’s no sense in the idea that computer programmers don’t want to learn a computer programming language the way developers don’t want to learn SQL. If a pilot didn’t want to learn how to fly more than one type of plane, we would have a right to ask if the pilot enjoyed being a pilot, wouldn’t we?
Up until now, I have assumed that your RDBMS really is the best place to model and store your data. But I should address an obvious question: what if the object model in the business layer of your application really is the best representation of your data? Shouldn’t your RDBMS be subordinate to that data model?
I would argue no.
One reason is that, with object-relational modeling being the Vietnam of computer science, you should not be persisting your objects to an RDBMS; you should be using an object store or object database instead.
But I’ll confess that I’d have a difficult time believing that an application’s object model was the best representation of its problem-domain’s data, because the object oriented paradigm is a paradigm and not a data model. To quote C.J. Date:
So what about other data models?—the “object oriented model”, for example, or the “heirarchic model”, or the CODASYL “network model”, or the “semistructured model”? In my view, these other models are just not in the same ballpark [as the relational model]. Indeed, I seriously question whether they deserve to be called models at all.
(– C. J. Date, SQL and Relational Theory, Appendix A, “The Relational Model”)
Again, this goes back to domain-specific languages, or domain-specific anything. The sole purpose of the realational data model, and its implentation in RDBMSs, is to correctly describe and store data, and allow useful queries on those data. Not using SQL for your data needs is like not using a hammer to drive home a nail. And having your ORM write your SQL for you makes no sense either. SQL is succinct and powerful enough that you should want to write your SQL by hand to ensure you are getting the results you want. (The same way you write your HTML and CSS by hand, to get the results you want.)
Or, to put it simply, whenever I think data, I think SQL. (Actually, having read a lot of Joe Celko and C.J. Date, I also think the relational data model.) Philip Greenspun’s books re-inforce my bias. In his writings, when Greenspun starts talking about the data model behind a web site, the SQL examples start flowing. See his chapter on user registration and mangagement to see what I mean. Any discussion of data modelling naturally has code samples in SQL; not Java, not Ruby, not Python. SQL.
Thinking of Philip Greenspun, I am reminded as to why it is I am also biased towards beginning any one of my projects by writing SQL and not Python: Because I write websites that are actually databases. This will explain a lot of my biases.
In fact, my goal is to get as thin a wrapper as possible over my database. It turns out that the code wrapping the database of any project I’ve ever worked on is quite thick, and that’s because a lot of user-friendliness, security, polish, and error correction must liaise between a web browser and a database. But with the projects I’ve worked on, I consider getting the data model right to be probably the most important step. The stuff that wraps the database in many ways naturally comes together after the data model has been figured out.
Or, as Fred Brooks would say:
Show me your flowcharts and conceal your tables, and I shall continue to be mystified. Show me your tables, and I won’t usually need your flowcharts; they’ll be obvious.
(– Frederick P. Brooks, Jr., The Mythical Man Month, chapter 9, section “Representation Is the Essence of Programming”.)
There is another pragmatic reason I have found to make the RDBMS the final arbiter of your data model, and to have the rest of your application be subservient to the RDBMS: databases take on lives beyond the applications they were originally written for. In most corporations, once a database has been created, the data in that database become useful for a lot of different purposes. So the web application you wrote on top of that database becomes only one of many portals into the database. In addition to servicing its web front end, the database also services raw queries from developers, command-line queries from reporting engines running on cron jobs, and maybe even queries from applications developed later on. It’s most fitting, then, that the database is the final arbiter of its own data model, because that is the expectation of every other client of that database.
Finally, there is the often-stated requirement that another advantage of making your object model the final word on your data model is that it allows you to more easily switch SQL implementations should you wish to do so.
This makes no sense. There is a common misperception that because all databases run SQL, all SQLs are essentially the same, except for some pesky differences in syntax which your ORM should abstract away so that you don’t get tied to a particular vendor. This makes me wish that each RDBMS had a differnt name for its SQL implementation. This way, people would realise that the differences between, let’s say, MySQL and PostgreSQL are as large as the differences between Ruby and Python (actually, probably greater). Anybody who knows anything about databases would choose MySQL or PostgreSQL for completely different purposes. The same way you give a lot of thought to which language you will use before embarking on a project, you should put a lot of effort into choosing an RDBMS that helps you correctly model your data and keep it consistent and easy to query.
But once you have chosen your RDBMS, you should be no more afraid of using its most powerful features as you would fear leveraging the unique capabilities of the general-purpose programming language you just chose.
In fact, can you imagine choosing Python for your project, but not using some of its more dynamic features, just in case you had to switch to Java? Of course not. Or can you imagine using a language abstraction layer that dis-allowed the use of language features unique to only Ruby or Python, in case you wanted to switch from one to the other? I can’t imagine it either. So why do we do this with RDBMSs?
When I choose an RDBMs, I actually take advantage of its unique features, the same way I choose a general purpose programming language for its feature set (and libraries). That’s the point. And I use all of those features to make my (carefully-chosen) RDBMS the final arbiter of my application’s data model.
[...] Database Abstraction Layers Must Die!, where he makes a criticism very similar to one I made in The RDBMS as the Final Arbiter of Your Data Model, only much more succinctly: [Abstracting away your RDBMS so that you can switch vendors is] no [...]
[...] ORM comes from my copy of The Definitive Guide to Django, that I quoted in my earlier blog entry on The RDBMS as Final Arbiter of Your Data Model. It says in chapter 5, page 66, that Writing Python is fun, and keeping everything in Python [...]
[...] been pointed out by many people, including this humble blogger, that writing portable SQL, or worse, using an ORM to make it easy to switch RDBMSs, is about as [...]
[...] Bloom’s article is well worth reading. It treads over ground I’ve covered in my own blog, and I’m not surprised any time somebody discovers that “the database itself should hold the authoritative schema, not a class declaration in the code.” I agree. [...]