Relational Data Model and SQL Collide at BOOLEAN 4 August 2009
Posted by manniwood in PostgreSQL, SQL, SQL NULLs.add a comment
Database luminaries whose opinions I trust do not seem to like the BOOLEAN type in SQL.
This discussion at Tom Kyte’s ‘Ask Tom’ blog gives you a flavour of what I’m talking about: Oracle has no BOOLEAN type, and Tom says it’s not needed.
Joe Celko points out on pages 228 and 229 of Thinking in Sets that NULL throws a wrench into the way BOOLEANs work anyway. On pages 104 and 105 of his 3rd edition of SQL for Smarties, he says that programmers often think of encodings in the wrong way anyway, and that the two values of boolean are often too limiting: in addition to true and false, what about unknown or not applicable, or pending, or whatever? In one of his other books (can’t find the reference right now) instead of recording a boolean value (let’s say for a has_graduated column in a student table), Celko says why not record more information about the event? (Such as, date_graduated.)
On the other hand, C. J. Date, who has done a lot of work describing the relational data model (and the ways in which SQL is a poor implementation of that model) says this about BOOLEAN on page 275 of SQL and Relational Theory:
The set of system defined scalar types is required to include type BOOLEAN—the most fundamental type of all—but a real system will surely support others as well (INTEGER, CHAR, and so on).
But then again, C. J. Date says the relational data model should not have NULLs, so the problems with BOOLEANs and SQL’s three valued logic go away when NULLs are not part of the data model. (I could have called this post SQL NULLs are evil, part 5.)
I’m rather happy that PostgreSQL supports the BOOLEAN data type, and I’m using it in the data model of the product I’m currently working on. As long as you declare your BOOLEAN fields NOT NULL, you’re restricting that column to contain only true or false, so that it actually behaves like a boolean:
create table users (
...
is_locked boolean default false not null,
...
);
Also, despite all the ways in which I might have made is_locked more SQL-ish by doing any of the following:
locked_on timestamp with time zone not null,
or
lock_status varchar(8) constraint
"need valid lock status"
check (lock_status in
('locked', 'unlocked'))
default 'unlocked' not null,
or
lock_status_id bigint constraint
"must point to valid lock status"
references lock_statuses(id)
not null,
sometimes all I really want to do is store a boolean value, plain and simple. Sometimes, something really is only true or false (“Is this user locked or not?”), and that’s the most useful encoding for the only two states that are possible (or that my data model is concerned with).
Interestingly, the newer SQL standards do seem to have a BOOLEAN type. This should create lots of hair-pulling, given that SQL also supports NULLs and UNKNOWN. BOOLEANs can’t be strictly booleans in SQL’s world of three-valued logic. I get this.
Nonetheless, I’ll be happy to use BOOLEAN (restricted to be NOT NULL) where appropriate.
If we had a choice to either 1) get rid of BOOLEANs and keep NULLs in the SQL standard, or 2) get rid of NULLs and keep BOOLEANs in the SQL standard, I’d nuke NULL from the SQL standard in a heartbeat. I’m siding with C. J. Date on this one: BOOLEAN is the primary (and only required) data type of the relational data model, and SQL needs to follow the relational data model more closely.
SQL NULLs are Evil! 15 July 2009
Posted by manniwood in PostgreSQL, SQL, SQL NULLs.2 comments
In my last post, I spoke about how null in your business application language is not the same as NULL in SQL. Whereas null in your business application language (Java, Python) happens to throw a null pointer error (or equivalent) when you try to add it to numbers or concatenate it to strings or otherwise interact with it, SQL happily returns NULL.
As a result, Joe Celko says to avoid NULL as much as you can in your SQL. C. J. Date says that the relational data model isn’t even supposed to contain NULL! (Although it should be noted that NULL exists in SQL because C. F. Codd put it in his original writings on the topic, and SQL was quick to pick it up. Alas.)
As I said in my previous post, in SQL, NULL does not mean “there’s nothing here!”; it means “I have no idea!”
Let’s explore this evil further, shall we?
I’m going to fire up psql, the PostgreSQL client, and try a few things.
First, I want NULLs to show up as ‘NULL’ and not whitespace:
mydb# \pset null 'NULL' Null display is "NULL".
Now let’s create a table t that holds quarterly earnings (in whole dollars only, so I can use int for my columns; this is just a toy demo).
mydb# create table t (year int, q1 int, q2 int, q3 int, mydb# q4 int); CREATE TABLE
Now let’s enter earnings for the first two quarters of this year:
mydb# insert into t (year, q1, q2) values (2009, 50, 40); INSERT 0 1
Now let’s calculate earnings for this year so far:
mydb# select q1 + q2 + q3 + q4 as "earnings"
mydb# from t where year = 2009;
earnings
----------
NULL
(1 row)
Any grizzled SQL programmer will tell you this makes perfect sense, and that 90 is not the answer.
mydb# select * from t; year | q1 | q2 | q3 | q4 ------+----+----+------+------ 2009 | 50 | 40 | NULL | NULL (1 row)
We don’t know the values for q3 and q4 yet, so we have left them null. Here’s the problem. In SQL, NULL means “I have no idea”, which literally means q3 and q4 could have values, it’s just that the database does not know what they are! So q1 + q2 + q3 + q4 translates to 50 + 40 + “I have no idea” + “I have no idea”. The only logical total is “I have no idea”! Because when q3 and q4 do not have empty values, but unknown values (which could be 0, but could be 65, or could be -1…), then the sum of q1 through q4 is also unknown. Ah, the joys of SQL.
The same thing happens with text too, by the way. If we create a table for users, and we leave the address2 field NULL (a common thing), but then we try to form a complete address by concatenating the second address field to the other columns, here’s what happens:
mydb# create table u (username text, addr1 text, addr2 text);
CREATE TABLE
mydb# insert into u (username, addr1) values
mydb# ('bob', '32 Short St.');
INSERT 0 1
mydb# select username || ': ' || addr1 || ' ' ||
mydb# addr2 as "ADDR" from u;
ADDR
------
NULL
(1 row)
This should come as no surprise now.
What are the solutions?
There are two.
1. Use COALESCE. If you are saddled with a database that is riddled with NULLs being (incorrectly) used to indicate emptiness (instead of unknown), SQL’s COALESCE is here to help. I won’t discuss it any further, but it’s very useful when you’re faced with NULLs that you can’t vanquish.
2. Fix your data model to not allow NULLs! One thing I’ve come to really appreciate about SQL is NOT NULL and DEFAULT values. They really help you make more accurate, more robust data model. Another thing I’ve come to appreciate about PostgreSQL in particular is that it allows you to store empty strings, whereas other RDBMSs (I’m looking at you, Oracle) silently convert empty strings to NULLs! I do love Oracle, but, when you can, choose an RDBMS that allows empty strings in your data model. It’s less evil that way.
Here’s how to fix our earnings table:
mydb# update t set q3 = 0, q4 = 0 where year = 2009; UPDATE 1 mydb# select * from t; year | q1 | q2 | q3 | q4 ------+----+----+----+---- 2009 | 50 | 40 | 0 | 0 (1 row) mydb# alter table t alter q1 set not null, mydb# alter q1 set default 0; ALTER TABLE mydb# alter table t alter q2 set not null, mydb# alter q2 set default 0; ALTER TABLE mydb# alter table t alter q3 set not null, mydb# alter q3 set default 0; ALTER TABLE mydb# alter table t alter q4 set not null, mydb# alter q4 set default 0; ALTER TABLE
So now we have made our earnings table not accept nulls, and we have provided sane defaults for each quarter! If you approach each column of each table as being NOT NULL but with a DEFAULT, you’ll be surprised at how many sane defaults there are in most situations. For instance, in the case of quarterly earnings, the earnings for a quarter that hasn’t come yet are not unknown (NULL); they are 0! We know they are 0, because that quarter hasn’t arrived yet, so we’ve earned 0 dollars!
Hence, here’s all we have to do to enter the row for next year, now that we have NOT NULL and sane DEFAULTs:
mydb# insert into t (year) values (2010); INSERT 0 1 mydb# select * from t where year = 2010; year | q1 | q2 | q3 | q4 ------+----+----+----+---- 2010 | 0 | 0 | 0 | 0 (1 row)
This will come in very handy for reporting, when you come to do that later on:
mydb# select q1 + q2 + q3 + q4 as "earnings"
mydb# from t where year = 2009;
earnings
----------
90
(1 row)
Here’s how we fix our user table:
mydb# update u set addr2 = '' where username = 'bob'; UPDATE 1 mydb# alter table u alter username set not null; ALTER TABLE mydb# alter table u alter addr1 set not null, mydb# alter addr1 set default ''; ALTER TABLE mydb# alter table u alter addr2 set not null, mydb# alter addr2 set default ''; ALTER TABLE
Now we’ve made our address fields not accept NULLs, and made them default to the empty string. This makes more sense, because when we did not provide addr2 when we inserted user bob, we did so because we knew addr2 was not provided. So addr2 was not unknown (NULL), it was known to be empty, making the empty string the best representation of this state. Naturally, now this works too:
mydb# select username || ': ' || addr1 || ' ' || addr2 as "ADDR"
mydb# from u where username = 'bob';
ADDR
--------------------
bob: 32 Short St.
(1 row)
If there’s one thing I’ve discovered from designing data models that eschew NULLs wherever possible, it’s this:
1. they’re usually not necessary, and
2. there’s usually a sane default that actually makes the data model more informative than the “Idunno” shrug of the shoulders that is SQL’s NULL.
Also, if we take C. J. Date’s idea seriously that a relational database is a series of propositions all of which must evaluate to true, the use of NULLs in your data model throws a real wrench in the machinery. Because with NULLs, there’s always the problem that certain propositions will be neither provably true nor false, but UNKNOWN.
I haven’t yet touched upon interesting things you can do to avoid NULLs in foreign keys, which I may discuss in a later post.
SQL NULL is not Application Language NULL 14 July 2009
Posted by manniwood in Java, PostgreSQL, Programming, Python, SQL, SQL NULLs.3 comments
When I first started using Python, it irritated me that null was None. C uses NULL, Java and JavaScript use null; why did Python have to use None? And why did Lisp have to use nil? Can’t all languages just standardise on null? It seems so arbitrary and capricious.
I’ve had a change of heart: I actually think the word ‘null’ needs to go away: languages should specify exactly what they mean by null, the way Python means None when it says None.
Like most application languages, Python throws errors when you try to add None to an integer, or concatenate None with strings:
>>> 1 + None Traceback (most recent call last): File "", line 1, in TypeError: unsupported operand type(s) for +: 'int' and 'NoneType' >>> "foo " + None Traceback (most recent call last): File "", line 1, in TypeError: cannot concatenate 'str' and 'NoneType' objects
In Java, you get null pointer errors trying to do similar things with Integer objects and Strings. Maybe it would be nice if Java also called null None, seeing as that’s what null really means in Java too.
In the case of SQL, it would be nice if NULL was called UNKNOWN. (Unfortunately, UNKNOWN is already taken! When was the last time you used the UNKNOWN keyword in SQL? Me neither. But it’s there…)
SQL’s NULL is often (ab)used to indicate “none”, but that’s incorrect: really a SQL NULL means “I have no idea”.
Watch this to see what I mean.
I fire up a PostgreSQL client, and ask it to print “NULL” instead of whitespace whenever NULL is encountered in result sets:
mydb=# \pset null 'NULL' Null display is "NULL".
and now I try to add an integer and NULL, then try to concatenate NULL to a string:
mydb# select 1 + null as "RESULT"; RESULT -------- NULL (1 row) mydb# select 'foo ' || null as "RESULT"; RESULT -------- NULL (1 row)
No tracebacks or null pointer errors in SQL! Nope. You get results. Why? Because SQL NULL does not mean “None”, it means “I have no idea!”
Hence, the only logical answer to adding one and “I have no idea” is “I have no idea”, because we have no idea what to what we are adding the number one—so if we are forced to provide an answer, we must respond “I have no idea!” And I’ll say it again: in SQL, NULL means “I have no idea”, not “None”. (Given that SQL has the UNKNOWN keyword, you’d think that UNKNOWN would be the result to the above queries. For whatever reason, SQL returns NULL instead. On the plus side, this re-inforces the fact that even SQL considers NULL to be the equivalent of UNKONWN.)
Likewise for concatenating ‘foo ‘ and SQL NULL: The answer is not ‘foo ‘ as you might expect if you were concatenating ‘foo ‘ and nothing; the answer is SQL NULL, because we are concatenating ‘foo ‘ and “I have no idea”. And when we have no idea what we are concatenating ‘foo ‘ to, we cannot predict what the resulting string would be, so the only logical answer is “I have no idea!”
In a sense, you could argue that Java null and Python None also mean “I have no idea”, it’s just that, in those languages, when you try to add 1 to “I have no idea”, you get an error. The compiler is basically stopping and asking for a time out: “Just what are you asking me to do here? I can’t add one and ‘I have no idea’!” Wouldn’t it be nice if SQL did that? But it does not. SQL happily returns the result: “I have no idea what one plus ‘I have no idea’ equals.”
My next blog post will be an appreciation of why Joe Celko says to avoid nulls in SQL whenever you can, and why C. J. Date says nulls should not even be part of the relational data model. I want to keep on typing, but I want to leave this blog entry with only one major idea: that nulls in your code’s business logic layer are not the same as nulls in your RDBMS!
Why Maps Are Not the Default Return Type of Python’s DB API 13 July 2009
Posted by manniwood in PostgreSQL, Python, SQL.2 comments
One thing that irritates me about Python’s DB API is that rows in result sets are by default integer indexed, rather than column-name indexed.
So, for the query
select id as "ID",
username as "USERNAME"
from users
I have to retrieve “USERNAME” from a row of my result set with a numeric offset, like so:
username = myrow[1]
Python’s DB API does not, by default, let me access column rows by name. Why?
It turns out this irritating state of affairs comes from a “feature” of SQL: columns in result sets do not need unique names.
I figured I’d test this in my favourite RDBMS, PostgreSQL:
mydb=# create table t (t text not null, t text not null); ERROR: column "t" specified more than once
This made me happy: tables cannot contain duplicate row names.
But, result sets can contain duplicate row names:
mydb=# create table t1 (t text not null);
CREATE TABLE
mydb=# create table t2 (t text not null);
CREATE TABLE
mydb=# insert into t1 (t) values ('Hello');
INSERT 0 1
mydb=# insert into t2 (t) values ('world');
INSERT 0 1
mydb=# select * from t1, t2;
t | t
-------+-------
Hello | world
(1 row)
mydb=# select t1.t, t2.t from t1, t2;
t | t
-------+-------
Hello | world
(1 row)
So if I tried to retrieve column ‘t’ in the above two queries, I wouldn’t be providing enough information to get a single column.
Now it turns out that the psycopg2 implementation of the PostgreSQL DB API driver does allow you to return maps instead of sequences for each row:
import psycopg2.extras curs = conn.cursor(cursor_factory=\ psycopg2.extras.RealDictCursor)
And I use this as the default in Pybatis. That way, I can do the following using my original query at the top of this post:
username = myrow["USERNAME"]
But it’s irritating to know that SQL allows same-named columns in result sets. It’s part of SQL’s legacy of being a flawed implementation of the relational data model. In the relational data model, tables and result sets are of the same type: they are both relations. So they both are constrained to having unique column names. (There aren’t actually columns in the relational data model, but I’ll ignore that right now.)
On the other hand, in SQL, tables and result sets are different. Tables must have unique column names, but result sets don’t have to.
Happily, there is an easy way to get around this. Always explicitly name the columns of result sets to ensure all of the column names are unique. Then, maps (or hashes, or dictionaries) can be used to represent rows from result sets:
mydb=# select t1.t as "t1_t", t2.t as "t2_t" from t1, t2; t1_t | t2_t -------+------- Hello | world (1 row)
Also, that’s why I named my columns in the sample query opening this post. It’s just a habit now: it allows me to guarantee what names I’ll use when accessing my data in the dictionaries I get back as my result set.
This is one area where I wish SQL more closely followed the relational model. In the relational model, the target of a query is a relation (or relations), and the result of a query is a relation. Nice and simple.
If SQL followed the relational data model more closely, result sets would no more be allowed duplicate column names than tables. And that would be a beautiful thing.