jump to navigation

Brandon Bloom Nails it on ORM; and ORM’s Definition Has Grown 7 October 2009

Posted by manniwood in SQL.
2 comments

As readers of my blog know, one of my favourite blog posts of all time is Ted Neward’s The Vietnam of Computer Science, where he discusses what a quagmire ORM is. But it looks like the definition of ORM has changed since Neward blogged about it, and I’m at risk of attacking database tools I love dearly, because I don’t consider these tools to be ORM—but others do.

I find when talking to developer friends of mine, and reading blogs, ORM is being quietly redefined to include tools like iBATIS, which doesn’t refer to itself as (and which I do not consider to be) an object relational mapper. iBATIS refers to itself a SQL mapper. It can be used to map objects and relational database records, but it can alsobe used to return primitive types, data structures made out of simple String types, etc, etc. It also makes it trivial to call stored procedures, which strikes me as decidedly non-ORM-ish.

Back when Ted Neward wrote his blog entry, I think it was assumed that ORM meant Hibernate, or Django’s ORM—tools that wrote SQL for you automagically in the background. I don’t believe that a tool like iBATIS was considered ORM back in 1996. Yet, in common parlance, iBATIS, and tools like it, seem to be considered ORM, which I just don’t get.

When Brandon Bloom discusses the shortcomings of Django’s ORM in his blog entry ORMs and Declarative Schemas, he explicitly singles out “the schema-generative ORM paradigm” for criticism, as though Django-style ORM is a sub-set of all other kinds of ORMs—which I guess now it is under this more expansive definition, even though I think Bloom’s sub-definition of ORM would not have been required a few years ago.

The acronym “ORM” seems to have become sort of a blanket term for any tool that helps you liaise between an RDBMS and any non-relational, (usually) object-oriented language. This is a bit unfortunate, because I used to think of ORMs (and I believe Neward’s original blog entry) describes ORMs as specifically tools that automagically generate SQL code on the fly based on your object model and configuration files. It used to be tools that allowed you to write SQL by hand were not considered to be ORMs. But I guess now they are.

By this new definition, my own Pybatis could be considered an ORM, even though I expressly designed it not to be an ORM. (Though nothing in Pybatis would prevent you from building an ORM on top of it.)

I almost feel like going back and editing my old blog entries, because when I criticise ORM, I mean tools that auto-generate SQL, not tools like iBATIS or Pybatis that allow you to write SQL by hand. Note I don’t say “force you to write SQL by hand”. This I consider to be an important philosophical divide between what I used to consider ORM and what I consider not ORM: “let our tool write all that nasty SQL for you” (ORM) versus “let us make writing SQL easier” (not ORM).

Bloom’s article is well worth reading. It treads over ground I’ve covered in my own blog, and I’m not surprised any time somebody discovers that “the database itself should hold the authoritative schema, not a class declaration in the code.” I agree.

I have a feeling that this lesson will continually need to be learned. After all, Bloom is posting 4 years after Neward, but hte conclusions remain the same. The siren call of ORM (sorry—specific ORMs that write your SQL for you) is still strong.

Maybe I should take heart in the idea that ORM is getting redefined to include tools that encourage you to write SQL by hand and that use the RDBMS as the authoritative data model for your project. Perhaps I shouldn’t care what people call that activity, as long as it gets pursued. Still, I wish there was a different name for this practice. (NoRM?) I think calling it ORM muddies the waters a little, and hides important differences in philosophy.

I’m even happy to cede that in certain greenfield projects, traditional ORM might be the way to go. (Hey—it is quick to get a site up and running with Rails-style or Django-style ORM.)

Perhaps I will start calling my favourite method of liaising with the RDBMS NoRM.

Lorenzo Alberton’s Graphs in the Database 9 September 2009

Posted by manniwood in SQL.
add a comment

Here’s a recommendation to read Lorenzo Alberton’s Graphs in the database: SQL meets social networks.

Alberton shows how to represent both directed and undirected graphs in SQL, with a heavy focus on how that applies to linkedin– and facebook–style social networking.

The SQL fanboy in me loves articles like this.

RDBMS Static Typing vs. General Purpose Language Dynamic Typing 23 August 2009

Posted by manniwood in Programming, Python, SQL, Uncategorized.
add a comment

One thing that’s fun about being on the job interview circuit is that you get to re-think a lot of your biases and tastes.

Something that has occurred to me is that I prefer to program in dynamically typed languages when I can (lately that would be manifested as a preference for Python over Java) and yet this fails to explain my love for RDBMSs, and the static typing that goes with them!

For general-purpose programming, Paul Graham’s opinion definitely rings true, so I’ll quote him at length:

As far as I can tell, the way they taught me to program in college was all wrong. You should figure out programs as you’re writing them, just as writers and painters and architects do.

Realizing this has real implications for software design. It means that a programming language should, above all, be malleable. A programming language is for thinking of programs, not for expressing programs you’ve already thought of. It should be a pencil, not a pen. Static typing would be a fine idea if people actually did write programs the way they taught me to in college. But that’s not how any of the hackers I know write programs. We need a language that lets us scribble and smudge and smear, not a language where you have to sit with a teacup of types balanced on your knee and make polite conversation with a strict old aunt of a compiler.

—Paul Graham, Hackers and Painters

What’s interesting is that I find for the projects I’ve worked on, I really like to nail down the data model in an RDBMS, and then code the business logic and display code using as flexible a language as possible. I think this bias comes from two requirements I’ve encountered a lot in my programming career:

1) make stuff easy to report on

2) change data capture and display on a fairly routine basis.

To make stuff easy to report on, it’s nice to have constraints and other forms of data integrity enforcement, so that your reports are already half-way complete just because of how your RDBMS stores (or refuses to store) your data: you know your data are correct, and you just have to aggregate them. Honestly, I sometimes find RDBMSs a bit too rigid to deal with the ever-changing needs of business, but all mature RDBMSs seem to have good support for alter <anything> commands.

And if you can get your data stored in a consistent way in an RDBMS, SQL is at your beck and call to report on your data in all sorts of interesting ways. This generally pleases the business very much.

For the projects I’ve tended to work on, after all that hard work has gone into getting the data model right, I generally prefer to make as thin a layer as possible over that database, so that the database can essentially speak for itself. A dynamically typed language comes in handy here, because it will have good support for lists and maps, and not a whole lot of casting or type conversion has to happen to shuttle data back and forth from the display to the RDBMS, and vice versa. Manipulating and validating form input is generally much easier in higher-level languages as well, with their somewhat easier out-of-the-box implementations of regular expressions, etc, etc.

Of course, this leaves open some interesting possibilities. If my next project has ever-changing needs, and a data model cannot really be pinned down, will I try to squeeze that design into an RDBMS, or will I try to find a more flexible, non-traditional data store? I can easily imagine turning to a non-traditional data store if the data are too unruly, and don’t follow the sorts of patterns that would suggest using an RDBMS.

If the data were to rarely be reported on, I might not even miss SQL that much. After all, SQL really shines when it comes to reporting.

But if I had to generate reports from non-relational data, that might start to get interesting. As much as I love dynamically typed languages, this might be one area where I’d wish for SQL’s strong typing, because programatically preparing reports would feel sort of like re-inventing the SQL wheel.

I don’t know if I’ll ever be able to explain the seeming inconsistency in my love of dynamic typing in general-purpose programming languages, and my love of SQL and the strong typing that goes with it. Maybe it’s because I like storing my data in a formal way (when the data are amenable to it), but I like manipulating my data in a flexible way, especially when I know I can count on them being correct.

All I know is that on the projects I’ve worked on, it’s been a great combination. But I also know that all projects are different, and that not all data are relational. Maybe one day I’ll work on a project that uses dynamic typing from top to bottom. That would be interesting.

Relational Data Model and SQL Collide at BOOLEAN 4 August 2009

Posted by manniwood in PostgreSQL, SQL, SQL NULLs.
add a comment

Database luminaries whose opinions I trust do not seem to like the BOOLEAN type in SQL.

This discussion at Tom Kyte’s ‘Ask Tom’ blog gives you a flavour of what I’m talking about: Oracle has no BOOLEAN type, and Tom says it’s not needed.

Joe Celko points out on pages 228 and 229 of Thinking in Sets that NULL throws a wrench into the way BOOLEANs work anyway. On pages 104 and 105 of his 3rd edition of SQL for Smarties, he says that programmers often think of encodings in the wrong way anyway, and that the two values of boolean are often too limiting: in addition to true and false, what about unknown or not applicable, or pending, or whatever? In one of his other books (can’t find the reference right now) instead of recording a boolean value (let’s say for a has_graduated column in a student table), Celko says why not record more information about the event? (Such as, date_graduated.)

On the other hand, C. J. Date, who has done a lot of work describing the relational data model (and the ways in which SQL is a poor implementation of that model) says this about BOOLEAN on page 275 of SQL and Relational Theory:

The set of system defined scalar types is required to include type BOOLEAN—the most fundamental type of all—but a real system will surely support others as well (INTEGER, CHAR, and so on).

But then again, C. J. Date says the relational data model should not have NULLs, so the problems with BOOLEANs and SQL’s three valued logic go away when NULLs are not part of the data model. (I could have called this post SQL NULLs are evil, part 5.)

I’m rather happy that PostgreSQL supports the BOOLEAN data type, and I’m using it in the data model of the product I’m currently working on. As long as you declare your BOOLEAN fields NOT NULL, you’re restricting that column to contain only true or false, so that it actually behaves like a boolean:

create table users (
    ...
    is_locked boolean default false not null,
    ...
);

Also, despite all the ways in which I might have made is_locked more SQL-ish by doing any of the following:

    locked_on timestamp with time zone not null,

or

    lock_status varchar(8) constraint
    "need valid lock status"
    check (lock_status in
    ('locked', 'unlocked'))
    default 'unlocked' not null,

or

    lock_status_id bigint constraint
    "must point to valid lock status"
    references lock_statuses(id)
    not null,

sometimes all I really want to do is store a boolean value, plain and simple. Sometimes, something really is only true or false (“Is this user locked or not?”), and that’s the most useful encoding for the only two states that are possible (or that my data model is concerned with).

Interestingly, the newer SQL standards do seem to have a BOOLEAN type. This should create lots of hair-pulling, given that SQL also supports NULLs and UNKNOWN. BOOLEANs can’t be strictly booleans in SQL’s world of three-valued logic. I get this.

Nonetheless, I’ll be happy to use BOOLEAN (restricted to be NOT NULL) where appropriate.

If we had a choice to either 1) get rid of BOOLEANs and keep NULLs in the SQL standard, or 2) get rid of NULLs and keep BOOLEANs in the SQL standard, I’d nuke NULL from the SQL standard in a heartbeat. I’m siding with C. J. Date on this one: BOOLEAN is the primary (and only required) data type of the relational data model, and SQL needs to follow the relational data model more closely.

SQL NULLs are Evil! Part III 29 July 2009

Posted by manniwood in SQL, SQL NULLs.
add a comment

After reading SQL NULLs are Evil! and SQL NULLs are Evil! Part II, if you are still not convinced that SQL NULLs are evil, read David Wheeler’s Neither NULL nor NOT NULL: An SQL WTF.

Call me old fashioned; call me pedantic; but I like assertions about my data to be either true or false, not unknown.

When Was the Last Time You Switched RDBMSs? 22 July 2009

Posted by manniwood in Programming, SQL.
1 comment so far

It’s been pointed out by many people, including this humble blogger, that writing portable SQL, or worse, using an ORM to make it easy to switch RDBMSs, is about as silly as using only a subset of the features of Python or Ruby in case you have to switch between them, or back to Java.

Thomas Kyte is particularly lucid (not to mention entertaining) on this point, especially in the section “Openness” in Chapter 1 of Expert Oracle Database Architecture. He talks about a team that had committed to some very specific front-end technologies, but were worried about what would happen if they switched databases, and wanted to write the most portable SQL possible.

What really blows me away is that in the 13 years I’ve been building database-backed web sites, I’ve only seen a web site switch out its back end once (at my recommendation), but I have seen web sites switch their front ends plenty of times.

In fact, if I compare my latest front-end switch to the only back-end switch I’ve ever done, I’d have to say that they were both worthwhile, but that the back-end switch was easier.

With the back-end switch, I moved from a small Oracle database to a small PostgreSQL database, mostly for cost reasons (can’t beat the cost of PostgreSQL!). I was (and still am) a huge fan of Oracle, especially its locking model (writers don’t block readers / readers don’t block writers), and I found the same locking model in PostgreSQL. And trust me, this was going from Oracle-specific, leverage-every-Oracle-advantage, vendor-locked-in code to PostgreSQL-specific, leverage-every-PostgreSQL-advantage, vendor-locked-in code. The switch was way more straightforward than the front-end switch I did from Java/J2EE to Python/Django. (But I’ll repeat that both switches were worthwhile.)

But this was the only time I’d ever switched back ends, and I don’t think I’m alone in noting how rare it is to switch RDBMSs. Front-end switches are way more common.

Why?

I think I was listening to a stackoverflow.com podcast where Jeff Atwood said the stackoverflow data was their most important asset—more important than even their front-end code. I think most websites / businesses feel that way about their data. Their data help them make decisions, make money, etc, etc. The processes that act upon those data are important too, but subordinate to the data.

Same thing with a web site. The front end is obviously important (no front-end, no visitors!), but the data are king. I think this is why most web sites and businesses are very loathe to do anything that might disrupt the database. But the front end? There’s room to experiment there. In fact, most good databases end up with multiple front-ends, because their data are so valuable. And once you have multiple front-ends, you become even more loathe to change your database, because now more clients depend on it.

So not only does the oft-quoted ORM argument of “but what if you want to switch databases?” ring hollow to me, it actually strikes me as downright silly. Because front-ends, (you know, like where the ORM is kept?) change way more than back ends, and yet we still haven’t seen tools nearly as silly as ORMs to help people abstract away the front-end in case of an inevitable switch.

And if having no front-end abstraction layers has not meant the end of the world, I think we should do away with back-end abstraction layers, and other foolish attempts to write “platform independent code at all costs”. Swapping out the back end is less common than swapping out the front end. Or, to put it more bluntly, designing for the eventual swapping-out of the back end is a very unusual, perhaps even bogus, project requirement.

SQL NULLs are Evil! Part II 21 July 2009

Posted by manniwood in SQL, SQL NULLs.
1 comment so far

In SQL NULLs are Evil (Part I), I detailed the evilness of SQL NULLs, and solutions for more accurately representing missing data in tables.

One thing I did not cover was the evilness of NULLs in foreign keys, and solutions to address that.

For instance, let’s say we have a table of products where we want to keep a record of which user created a row, and which user modified a row, like so:

create sequence user_ids start 1;
create table users (
    id bigint constraint "User id must be unique."
        primary key not null,
    ...
);

create sequence product_ids start 1;
create table products (
    id bigint constraint "Product id must be unique."
        primary key not null,
    name varchar(80) constraint
        "Product name must be unique."
        unique not null,
    inserted_on timestamp with time zone
        default current_timestamp not null,
    updated_on  timestamp with time zone
        default current_timestamp not null,
    inserted_by bigint constraint
        "Product must be inserted by existing user."
        references users(id) not null,
    updated_by  bigint constraint
        "Product must be updated by existing user."
        references users(id) not null
);

One thing we could do for freshly-inserted products is just cheat, and say that the updated_by user is the same as the inserted_by user, even though, strictly speaking, freshly-inserted products have never been updated. (In fact, that’s the cheat we use for updated_on; we just use the current time the new row is inserted. This is not exactly accurate; another solution could be to use a sentry date that is impossible in the real world.)

Another solution is to literally make an unknown user and set any new row’s inserted_by foreign key to the unknown user. As you can see, I’ve set up the sequences (used for generating primary keys) to begin at 1; I’ll use ids of less than 1 for special purposes. 0 seems like a good idea for the unknown user:

insert into users (id, username, ...)
           values (0, 'unselected' ...);

I could even go back and put a default value of 0 in products.updated_by’s column definition, so that freshly-inserted products’ updated_by would default to the unknown user automatically.

Getting More Extreme

Once you start to think this way, you can decide what sorts of sentry values you want or need. For instance, let’s say you are working on a database where columns in a bugs table may have foreign keys to a program features table. Now, consider this list of program features:

program_features
  -3   'Any'  -- used for search dropdowns
  -2   'Unsure'
  -1   'Not Applicable'
   0   'Unselected'
   1   'Back End'
   2   'Business Logic'
   3   'User Interface'
   ...

Furthermore, let’s assume there is only a handful of features, and that they fit in a drop-down list when the user enters a new bug. ‘Unselected’ could be entered as the program feature of a new bug if the user did not bother to select a program feature from the drop-down menu. ‘Not Applicable’ could be used if the bug did not pertain to any particular feature of the program, or ‘Unsure’ could be selected if the user was unsure.

In search menus, where there may be a drop-down to search for bugs by program feature, you may be using -3 as a sentry value to mean ‘Search all’. You may (or may not, depending on your tastes) encode the ‘all’ sentry value into your table as a reminder of what sentry value you are using (and, maybe you’d add a check to your bugs table to disallow -3 as a program feature foreign key).

These are, of course, just suggestions. I’ve found some of these techniques useful in the databases I build. Used judiciously, these sorts of sentry values let you have non-null foreign keys in many areas of your database where you might have put up with NULLs.

You can also usefully represent kinds of missing data, and search on them later. So, for instance, instead of looking for bugs where the product feature is unknown (which is what SQL NULL really means; not “missing”, but “unknown”), you can actually search for bugs where the user purposefully didn’t choose a program feature, or expressly didn’t know which program feature contained the bug, or whatever. When these sorts of things need to be tracked, it’s nice to know you can do so.

If you start to think about your data this way, you’ll find that there are different flavours of “missing”. You can decide how much you care about such things, and design your database accordingly. As a pleasant benefit, there will be one less place in your data model where NULLs are allowed.

SQL NULLs are Evil! 15 July 2009

Posted by manniwood in PostgreSQL, SQL, SQL NULLs.
2 comments

In my last post, I spoke about how null in your business application language is not the same as NULL in SQL. Whereas null in your business application language (Java, Python) happens to throw a null pointer error (or equivalent) when you try to add it to numbers or concatenate it to strings or otherwise interact with it, SQL happily returns NULL.

As a result, Joe Celko says to avoid NULL as much as you can in your SQL. C. J. Date says that the relational data model isn’t even supposed to contain NULL! (Although it should be noted that NULL exists in SQL because C. F. Codd put it in his original writings on the topic, and SQL was quick to pick it up. Alas.)

As I said in my previous post, in SQL, NULL does not mean “there’s nothing here!”; it means “I have no idea!”

Let’s explore this evil further, shall we?

I’m going to fire up psql, the PostgreSQL client, and try a few things.

First, I want NULLs to show up as ‘NULL’ and not whitespace:

mydb# \pset null 'NULL'
Null display is "NULL".

Now let’s create a table t that holds quarterly earnings (in whole dollars only, so I can use int for my columns; this is just a toy demo).

mydb# create table t (year int, q1 int, q2 int, q3 int,
mydb# q4 int);
CREATE TABLE

Now let’s enter earnings for the first two quarters of this year:

mydb# insert into t (year, q1, q2) values (2009, 50, 40);
INSERT 0 1

Now let’s calculate earnings for this year so far:

mydb# select q1 + q2 + q3 + q4 as "earnings"
mydb# from t where year = 2009;
 earnings
----------
     NULL
(1 row)

Any grizzled SQL programmer will tell you this makes perfect sense, and that 90 is not the answer.

mydb# select * from t;
 year | q1 | q2 |  q3  |  q4
------+----+----+------+------
 2009 | 50 | 40 | NULL | NULL
(1 row)

We don’t know the values for q3 and q4 yet, so we have left them null. Here’s the problem. In SQL, NULL means “I have no idea”, which literally means q3 and q4 could have values, it’s just that the database does not know what they are! So q1 + q2 + q3 + q4 translates to 50 + 40 + “I have no idea” + “I have no idea”. The only logical total is “I have no idea”! Because when q3 and q4 do not have empty values, but unknown values (which could be 0, but could be 65, or could be -1…), then the sum of q1 through q4 is also unknown. Ah, the joys of SQL.

The same thing happens with text too, by the way. If we create a table for users, and we leave the address2 field NULL (a common thing), but then we try to form a complete address by concatenating the second address field to the other columns, here’s what happens:

mydb# create table u (username text, addr1 text, addr2 text);
CREATE TABLE
mydb# insert into u (username, addr1) values
mydb# ('bob', '32 Short St.');
INSERT 0 1
mydb# select username || ': ' || addr1 || ' ' ||
mydb# addr2 as "ADDR" from u;
 ADDR
------
 NULL
(1 row)

This should come as no surprise now.

What are the solutions?

There are two.

1. Use COALESCE. If you are saddled with a database that is riddled with NULLs being (incorrectly) used to indicate emptiness (instead of unknown), SQL’s COALESCE is here to help. I won’t discuss it any further, but it’s very useful when you’re faced with NULLs that you can’t vanquish.

2. Fix your data model to not allow NULLs! One thing I’ve come to really appreciate about SQL is NOT NULL and DEFAULT values. They really help you make more accurate, more robust data model. Another thing I’ve come to appreciate about PostgreSQL in particular is that it allows you to store empty strings, whereas other RDBMSs (I’m looking at you, Oracle) silently convert empty strings to NULLs! I do love Oracle, but, when you can, choose an RDBMS that allows empty strings in your data model. It’s less evil that way.

Here’s how to fix our earnings table:

mydb# update t set q3 = 0, q4 = 0 where year = 2009;
UPDATE 1
mydb# select * from t;
 year | q1 | q2 | q3 | q4
------+----+----+----+----
 2009 | 50 | 40 |  0 |  0
(1 row)
mydb# alter table t alter q1 set not null,
mydb# alter q1 set default 0;
ALTER TABLE
mydb# alter table t alter q2 set not null,
mydb# alter q2 set default 0;
ALTER TABLE
mydb# alter table t alter q3 set not null,
mydb# alter q3 set default 0;
ALTER TABLE
mydb# alter table t alter q4 set not null,
mydb# alter q4 set default 0;
ALTER TABLE

So now we have made our earnings table not accept nulls, and we have provided sane defaults for each quarter! If you approach each column of each table as being NOT NULL but with a DEFAULT, you’ll be surprised at how many sane defaults there are in most situations. For instance, in the case of quarterly earnings, the earnings for a quarter that hasn’t come yet are not unknown (NULL); they are 0! We know they are 0, because that quarter hasn’t arrived yet, so we’ve earned 0 dollars!

Hence, here’s all we have to do to enter the row for next year, now that we have NOT NULL and sane DEFAULTs:

mydb# insert into t (year) values (2010);
INSERT 0 1
mydb# select * from t where year = 2010;
 year | q1 | q2 | q3 | q4
------+----+----+----+----
 2010 |  0 |  0 |  0 |  0
(1 row)

This will come in very handy for reporting, when you come to do that later on:

mydb# select q1 + q2 + q3 + q4 as "earnings"
mydb# from t where year = 2009;
 earnings
----------
       90
(1 row)

Here’s how we fix our user table:

mydb# update u set addr2 = '' where username = 'bob';
UPDATE 1
mydb# alter table u alter username set not null;
ALTER TABLE
mydb# alter table u alter addr1 set not null,
mydb# alter addr1 set default '';
ALTER TABLE
mydb# alter table u alter addr2 set not null,
mydb# alter addr2 set default '';
ALTER TABLE

Now we’ve made our address fields not accept NULLs, and made them default to the empty string. This makes more sense, because when we did not provide addr2 when we inserted user bob, we did so because we knew addr2 was not provided. So addr2 was not unknown (NULL), it was known to be empty, making the empty string the best representation of this state. Naturally, now this works too:

mydb# select username || ': ' || addr1 || ' ' || addr2 as "ADDR"
mydb# from u where username = 'bob';
        ADDR
--------------------
 bob: 32 Short St.
(1 row)

If there’s one thing I’ve discovered from designing data models that eschew NULLs wherever possible, it’s this:

1. they’re usually not necessary, and

2. there’s usually a sane default that actually makes the data model more informative than the “Idunno” shrug of the shoulders that is SQL’s NULL.

Also, if we take C. J. Date’s idea seriously that a relational database is a series of propositions all of which must evaluate to true, the use of NULLs in your data model throws a real wrench in the machinery. Because with NULLs, there’s always the problem that certain propositions will be neither provably true nor false, but UNKNOWN.

I haven’t yet touched upon interesting things you can do to avoid NULLs in foreign keys, which I may discuss in a later post.

SQL NULL is not Application Language NULL 14 July 2009

Posted by manniwood in Java, PostgreSQL, Programming, Python, SQL, SQL NULLs.
3 comments

When I first started using Python, it irritated me that null was None. C uses NULL, Java and JavaScript use null; why did Python have to use None? And why did Lisp have to use nil? Can’t all languages just standardise on null? It seems so arbitrary and capricious.

I’ve had a change of heart: I actually think the word ‘null’ needs to go away: languages should specify exactly what they mean by null, the way Python means None when it says None.

Like most application languages, Python throws errors when you try to add None to an integer, or concatenate None with strings:

>>> 1 + None
Traceback (most recent call last):
  File "", line 1, in
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
>>> "foo " + None
Traceback (most recent call last):
  File "", line 1, in
TypeError: cannot concatenate 'str' and 'NoneType' objects

In Java, you get null pointer errors trying to do similar things with Integer objects and Strings. Maybe it would be nice if Java also called null None, seeing as that’s what null really means in Java too.

In the case of SQL, it would be nice if NULL was called UNKNOWN. (Unfortunately, UNKNOWN is already taken! When was the last time you used the UNKNOWN keyword in SQL? Me neither. But it’s there…)

SQL’s NULL is often (ab)used to indicate “none”, but that’s incorrect: really a SQL NULL means “I have no idea”.

Watch this to see what I mean.

I fire up a PostgreSQL client, and ask it to print “NULL” instead of whitespace whenever NULL is encountered in result sets:

mydb=# \pset null 'NULL'
Null display is "NULL".

and now I try to add an integer and NULL, then try to concatenate NULL to a string:

mydb# select 1 + null as "RESULT";
 RESULT
--------
   NULL
(1 row)

mydb# select 'foo ' || null as "RESULT";
 RESULT
--------
 NULL
(1 row)

No tracebacks or null pointer errors in SQL! Nope. You get results. Why? Because SQL NULL does not mean “None”, it means “I have no idea!”

Hence, the only logical answer to adding one and “I have no idea” is “I have no idea”, because we have no idea what to what we are adding the number one—so if we are forced to provide an answer, we must respond “I have no idea!” And I’ll say it again: in SQL, NULL means “I have no idea”, not “None”. (Given that SQL has the UNKNOWN keyword, you’d think that UNKNOWN would be the result to the above queries. For whatever reason, SQL returns NULL instead. On the plus side, this re-inforces the fact that even SQL considers NULL to be the equivalent of UNKONWN.)

Likewise for concatenating ‘foo ‘ and SQL NULL: The answer is not ‘foo ‘ as you might expect if you were concatenating ‘foo ‘ and nothing; the answer is SQL NULL, because we are concatenating ‘foo ‘ and “I have no idea”. And when we have no idea what we are concatenating ‘foo ‘ to, we cannot predict what the resulting string would be, so the only logical answer is “I have no idea!”

In a sense, you could argue that Java null and Python None also mean “I have no idea”, it’s just that, in those languages, when you try to add 1 to “I have no idea”, you get an error. The compiler is basically stopping and asking for a time out: “Just what are you asking me to do here? I can’t add one and ‘I have no idea’!” Wouldn’t it be nice if SQL did that? But it does not. SQL happily returns the result: “I have no idea what one plus ‘I have no idea’ equals.”

My next blog post will be an appreciation of why Joe Celko says to avoid nulls in SQL whenever you can, and why C. J. Date says nulls should not even be part of the relational data model. I want to keep on typing, but I want to leave this blog entry with only one major idea: that nulls in your code’s business logic layer are not the same as nulls in your RDBMS!

Why Maps Are Not the Default Return Type of Python’s DB API 13 July 2009

Posted by manniwood in PostgreSQL, Python, SQL.
2 comments

One thing that irritates me about Python’s DB API is that rows in result sets are by default integer indexed, rather than column-name indexed.

So, for the query

select id as "ID",
       username as "USERNAME"
  from users

I have to retrieve “USERNAME” from a row of my result set with a numeric offset, like so:

username = myrow[1]

Python’s DB API does not, by default, let me access column rows by name. Why?

It turns out this irritating state of affairs comes from a “feature” of SQL: columns in result sets do not need unique names.

I figured I’d test this in my favourite RDBMS, PostgreSQL:

mydb=# create table t (t text not null, t text not null);
ERROR:  column "t" specified more than once

This made me happy: tables cannot contain duplicate row names.

But, result sets can contain duplicate row names:

mydb=# create table t1 (t text not null);
CREATE TABLE
mydb=# create table t2 (t text not null);
CREATE TABLE
mydb=# insert into t1 (t) values ('Hello');
INSERT 0 1
mydb=# insert into t2 (t) values ('world');
INSERT 0 1
mydb=# select * from t1, t2;
   t   |   t
-------+-------
 Hello | world
(1 row)

mydb=# select t1.t, t2.t from t1, t2;
   t   |   t
-------+-------
 Hello | world
(1 row)

So if I tried to retrieve column ‘t’ in the above two queries, I wouldn’t be providing enough information to get a single column.

Now it turns out that the psycopg2 implementation of the PostgreSQL DB API driver does allow you to return maps instead of sequences for each row:

import psycopg2.extras
curs = conn.cursor(cursor_factory=\
psycopg2.extras.RealDictCursor)

And I use this as the default in Pybatis. That way, I can do the following using my original query at the top of this post:

username = myrow["USERNAME"]

But it’s irritating to know that SQL allows same-named columns in result sets. It’s part of SQL’s legacy of being a flawed implementation of the relational data model. In the relational data model, tables and result sets are of the same type: they are both relations. So they both are constrained to having unique column names. (There aren’t actually columns in the relational data model, but I’ll ignore that right now.)

On the other hand, in SQL, tables and result sets are different. Tables must have unique column names, but result sets don’t have to.

Happily, there is an easy way to get around this. Always explicitly name the columns of result sets to ensure all of the column names are unique. Then, maps (or hashes, or dictionaries) can be used to represent rows from result sets:

mydb=# select t1.t as "t1_t", t2.t as "t2_t" from t1, t2;
 t1_t  | t2_t
-------+-------
 Hello | world
(1 row)

Also, that’s why I named my columns in the sample query opening this post. It’s just a habit now: it allows me to guarantee what names I’ll use when accessing my data in the dictionaries I get back as my result set.

This is one area where I wish SQL more closely followed the relational model. In the relational model, the target of a query is a relation (or relations), and the result of a query is a relation. Nice and simple.

If SQL followed the relational data model more closely, result sets would no more be allowed duplicate column names than tables. And that would be a beautiful thing.