jump to navigation

SQL NULLs are Evil! Part III 29 July 2009

Posted by manniwood in SQL, SQL NULLs.
add a comment

After reading SQL NULLs are Evil! and SQL NULLs are Evil! Part II, if you are still not convinced that SQL NULLs are evil, read David Wheeler’s Neither NULL nor NOT NULL: An SQL WTF.

Call me old fashioned; call me pedantic; but I like assertions about my data to be either true or false, not unknown.

Will Microsoft Ever Save Money by Abandoning IE? 29 July 2009

Posted by manniwood in Uncategorized.
add a comment

I was reading Ars Technica’s description of the ballot page Microsoft’s operating systems will use in the EU, and I started wondering…

With the browser wars heating up again, and Microsoft having to pour resources into a product it gives away for free, when does it become cost-effective to abandon IE and let the better browsers take over?

Back when Microsoft won the first browser war around 2001, it stopped improving IE6 so that the internet would languish, and people would realise that the Windows desktop was where all new applications should be developed.

This did not work out exactly how Microsoft wanted. Instead, the internet continued to become popular, becoming as good as it possibly could under the limitations of IE6. So, I guess you could say Microsoft slowed down the development of the internet, but did not stop it.

Now, Microsoft is moving more of its own stuff into the cloud, realising the internet is here to stay. (Even MS Office is moving into the cloud.)

I wonder if IE isn’t now holding Microsoft back? It’s well known that a large portion of every web-based company’s budget is spent on supporting IE. If IE disappeared tomorrow, every web-based company would instantly become more profitable, because it would cost less money bringing new features to market on the remaining browsers, which are all standards compliant.

The funny thing is, I think this cost now affects even Microsoft. If Microsoft wants to compete in the cloud, they too need to leverage the best attributes browsers provide. Supporting IE must be a pain point for even Microsoft.

I think Microsoft is still living under the delusion that it can steer the internet in a direction more amenable to Microsoft’s goals, and that is why they persist in developing newer versions of IE, which is now pretty much permanently in a trailing position (feature-wise) behind every other browser. So Microsoft continues to spend money making a browser that makes their cloud division spend more money supporting it.

If this didn’t affect the rest of us, I’d laugh. As it is, I feel more like crying.

But who knows? Maybe one day, after hemorrhaging enough money, Microsoft will take the Google route, and base its browser on a real rendering engine. I never thought Microsoft would contribute code to Linux, and that happened last week…

When Was the Last Time You Switched RDBMSs? 22 July 2009

Posted by manniwood in Programming, SQL.
1 comment so far

It’s been pointed out by many people, including this humble blogger, that writing portable SQL, or worse, using an ORM to make it easy to switch RDBMSs, is about as silly as using only a subset of the features of Python or Ruby in case you have to switch between them, or back to Java.

Thomas Kyte is particularly lucid (not to mention entertaining) on this point, especially in the section “Openness” in Chapter 1 of Expert Oracle Database Architecture. He talks about a team that had committed to some very specific front-end technologies, but were worried about what would happen if they switched databases, and wanted to write the most portable SQL possible.

What really blows me away is that in the 13 years I’ve been building database-backed web sites, I’ve only seen a web site switch out its back end once (at my recommendation), but I have seen web sites switch their front ends plenty of times.

In fact, if I compare my latest front-end switch to the only back-end switch I’ve ever done, I’d have to say that they were both worthwhile, but that the back-end switch was easier.

With the back-end switch, I moved from a small Oracle database to a small PostgreSQL database, mostly for cost reasons (can’t beat the cost of PostgreSQL!). I was (and still am) a huge fan of Oracle, especially its locking model (writers don’t block readers / readers don’t block writers), and I found the same locking model in PostgreSQL. And trust me, this was going from Oracle-specific, leverage-every-Oracle-advantage, vendor-locked-in code to PostgreSQL-specific, leverage-every-PostgreSQL-advantage, vendor-locked-in code. The switch was way more straightforward than the front-end switch I did from Java/J2EE to Python/Django. (But I’ll repeat that both switches were worthwhile.)

But this was the only time I’d ever switched back ends, and I don’t think I’m alone in noting how rare it is to switch RDBMSs. Front-end switches are way more common.

Why?

I think I was listening to a stackoverflow.com podcast where Jeff Atwood said the stackoverflow data was their most important asset—more important than even their front-end code. I think most websites / businesses feel that way about their data. Their data help them make decisions, make money, etc, etc. The processes that act upon those data are important too, but subordinate to the data.

Same thing with a web site. The front end is obviously important (no front-end, no visitors!), but the data are king. I think this is why most web sites and businesses are very loathe to do anything that might disrupt the database. But the front end? There’s room to experiment there. In fact, most good databases end up with multiple front-ends, because their data are so valuable. And once you have multiple front-ends, you become even more loathe to change your database, because now more clients depend on it.

So not only does the oft-quoted ORM argument of “but what if you want to switch databases?” ring hollow to me, it actually strikes me as downright silly. Because front-ends, (you know, like where the ORM is kept?) change way more than back ends, and yet we still haven’t seen tools nearly as silly as ORMs to help people abstract away the front-end in case of an inevitable switch.

And if having no front-end abstraction layers has not meant the end of the world, I think we should do away with back-end abstraction layers, and other foolish attempts to write “platform independent code at all costs”. Swapping out the back end is less common than swapping out the front end. Or, to put it more bluntly, designing for the eventual swapping-out of the back end is a very unusual, perhaps even bogus, project requirement.

SQL NULLs are Evil! Part II 21 July 2009

Posted by manniwood in SQL, SQL NULLs.
1 comment so far

In SQL NULLs are Evil (Part I), I detailed the evilness of SQL NULLs, and solutions for more accurately representing missing data in tables.

One thing I did not cover was the evilness of NULLs in foreign keys, and solutions to address that.

For instance, let’s say we have a table of products where we want to keep a record of which user created a row, and which user modified a row, like so:

create sequence user_ids start 1;
create table users (
    id bigint constraint "User id must be unique."
        primary key not null,
    ...
);

create sequence product_ids start 1;
create table products (
    id bigint constraint "Product id must be unique."
        primary key not null,
    name varchar(80) constraint
        "Product name must be unique."
        unique not null,
    inserted_on timestamp with time zone
        default current_timestamp not null,
    updated_on  timestamp with time zone
        default current_timestamp not null,
    inserted_by bigint constraint
        "Product must be inserted by existing user."
        references users(id) not null,
    updated_by  bigint constraint
        "Product must be updated by existing user."
        references users(id) not null
);

One thing we could do for freshly-inserted products is just cheat, and say that the updated_by user is the same as the inserted_by user, even though, strictly speaking, freshly-inserted products have never been updated. (In fact, that’s the cheat we use for updated_on; we just use the current time the new row is inserted. This is not exactly accurate; another solution could be to use a sentry date that is impossible in the real world.)

Another solution is to literally make an unknown user and set any new row’s inserted_by foreign key to the unknown user. As you can see, I’ve set up the sequences (used for generating primary keys) to begin at 1; I’ll use ids of less than 1 for special purposes. 0 seems like a good idea for the unknown user:

insert into users (id, username, ...)
           values (0, 'unselected' ...);

I could even go back and put a default value of 0 in products.updated_by’s column definition, so that freshly-inserted products’ updated_by would default to the unknown user automatically.

Getting More Extreme

Once you start to think this way, you can decide what sorts of sentry values you want or need. For instance, let’s say you are working on a database where columns in a bugs table may have foreign keys to a program features table. Now, consider this list of program features:

program_features
  -3   'Any'  -- used for search dropdowns
  -2   'Unsure'
  -1   'Not Applicable'
   0   'Unselected'
   1   'Back End'
   2   'Business Logic'
   3   'User Interface'
   ...

Furthermore, let’s assume there is only a handful of features, and that they fit in a drop-down list when the user enters a new bug. ‘Unselected’ could be entered as the program feature of a new bug if the user did not bother to select a program feature from the drop-down menu. ‘Not Applicable’ could be used if the bug did not pertain to any particular feature of the program, or ‘Unsure’ could be selected if the user was unsure.

In search menus, where there may be a drop-down to search for bugs by program feature, you may be using -3 as a sentry value to mean ‘Search all’. You may (or may not, depending on your tastes) encode the ‘all’ sentry value into your table as a reminder of what sentry value you are using (and, maybe you’d add a check to your bugs table to disallow -3 as a program feature foreign key).

These are, of course, just suggestions. I’ve found some of these techniques useful in the databases I build. Used judiciously, these sorts of sentry values let you have non-null foreign keys in many areas of your database where you might have put up with NULLs.

You can also usefully represent kinds of missing data, and search on them later. So, for instance, instead of looking for bugs where the product feature is unknown (which is what SQL NULL really means; not “missing”, but “unknown”), you can actually search for bugs where the user purposefully didn’t choose a program feature, or expressly didn’t know which program feature contained the bug, or whatever. When these sorts of things need to be tracked, it’s nice to know you can do so.

If you start to think about your data this way, you’ll find that there are different flavours of “missing”. You can decide how much you care about such things, and design your database accordingly. As a pleasant benefit, there will be one less place in your data model where NULLs are allowed.

Polyglot Programming Eases Context Switching 20 July 2009

Posted by manniwood in Programming.
add a comment

One thing that has stuck in my head as a defence of ORM comes from my copy of The Definitive Guide to Django, that I quoted in my earlier blog entry on The RDBMS as Final Arbiter of Your Data Model. It says in chapter 5, page 66, that

Writing Python is fun, and keeping everything in Python limits the number of times your brain has to do a “context switch.” It helps productivity if you keep yourself in a single programming environment/mentality for as long as possible. Having to write SQL, then Python, and then SQL is disruptive.

It occurred to me that many of my projects have been polyglot: I little Perl, a little Java, some shell, maybe a Makefile, some html, some JavaScript, a dash of CSS, and a solid foundation of SQL. And, lately, Python in the place of Java.

One thing that I think the author of the quote above got wrong is that when you are switching from your business rules code to your data retrieval and manipulation code, you’re already making a context switch, whether you are switching languages or not.

When you switch from coding business rules to coding what will show up in a browser, you’re already making a context switch. It’s just that, in addition to making a context switch, you are also switching languages: you’re going from Python for the business rules code, to HTML (and perhaps templates) for the display code.

Not only is this language switch that accompanies your context switch OK, it’s actually preferable! After all, Python is a general purpose language that’s quite good for coding business rules, whereas HTML is not even an language: it’s a markup system for displaying web pages.

Not only do I think context switches already happen when going from one major area of your app to another (such as business rules to front-end display), I think that having domain-specific languages for both of these tasks eases the transition.

Naturally, there’s as much of a context switch going from your business code to your data definition and manipulation code, as there is going from your business code to your display code. And, the same way going from Python to HTML eases the transition to the display layer, so does SQL ease the transition to the data layer.

Another way I can put it is this: some books have charts. The charts can be described with words, so that there will not be a context switch between two different ways of visually representing ideas (words versus charts), but most of us prefer to see charts as charts, while still having other ideas conveyed as words. The reason why we have more than one way of representing ideas is because certain ideas are expressed more succinctly and accurately using one visual representation instead of another.

So too with the problems we try to solve in software. Within reason, different programming languages should not be feared, they should be embraced. If there’s a better way to describe a solution to your problem, use it! Do not fear domain-specific languages.

I remember how, for a time, I compiled my Java projects with Ant. I quickly switched back to Make, because not only was Make more succinct than the horrendous xml morass of Ant, it also allowed me to use the full expressive power of the Unix shell.

Obviously, you can go overboard, but, as a general rule of thumb, polyglot programming should be encouraged. Today’s software solutions solve a lot of different problems from a lot of different problem domains. All of these problems may add up to a unified solution, but when you attack each sub-problem, you don’t want a unified language. You want the expressive power of languages tailored for each problem domain.

The database-backed web sites that I build for a living have multiple problem domains, and that’s why I end up using all sorts of languages and markup: Bash, Make, SQL, Python, HTML, CSS, JavaScript. I never feel as though this is too many languages, because while I’m working on each sub-problem, the language or markup that I’m using seems like the right tool for the job.

Finally, I must say I like the renewed interest in domain-specific languages that seems to be happening right now. I hope one day, it’s trivial to build domain-specific languages for particular areas of a project. But in the meantime, enjoy the languages you are currently using—as long as they are making you more productive, that is.

Software Engineering Was Already Dead 19 July 2009

Posted by manniwood in Programming.
add a comment

Yesterday’s Coding Horror Blog, Software Engineering: Dead? points to software luminary Tom DeMarco’s article Software Engineering: An Idea Whose Time Has Come and Gone?, and I kind of chuckled.

I don’t think this is even a question. The vast majority of software was never engineered. It was crafted. And it’s not a new idea that maybe software engineering is dead, as though that’s a question that needs to be timidly asked, just to hedge one’s bets. It never even existed for the majority of the software that was ever written.

Back in 2002, I read Software Craftsmanship, by Pete McBreen. He successfully dismantled the idea that software development was software engineering, and found a better metaphor that served as the title of his book: software craftsmanship. And when I read his book, nodding with agreement the whole way through, I didn’t think that software engineering was dead in 2002, either. McBreen was documenting the fact that software engineering, for most software development, had never existed.

So props to you, Mr. McBreen: you called it 7 years ago, and others are catching up to you.

Has It All Been Said by Others? 17 July 2009

Posted by manniwood in Programming.
add a comment

I often find the technical issues I blog about are blogged by others as well. I may think that Objects are Hammers (and traditional data structures are like Swiss Army knives) but Jeff Atwood thinks that Objects Suck. He’s more succinct, and he blogged it before I did.

I may think that SQL generation is a templating problem rather than a code generation problem, and Ted Neward may think that ORM is the Vietnam of computer science, but recently, Jeremy Zawodny declared, simply, Database Abstraction Layers Must Die!, where he makes a criticism very similar to one I made in The RDBMS as the Final Arbiter of Your Data Model, only much more succinctly:

[Abstracting away your RDBMS so that you can switch vendors is] no different from saying “I’m going to limit myself to the subset of PHP that’s the same in Perl and C, because I might want to switch languages one day and ‘painlessly’ port my code.”

Well said, Mr Zawodny.

Sometimes I think I don’t need to blog about these issues, because other people do (and do it better), so the important stuff is already being said.

[UPDATE: But also, as I read other blogs with splashy titles like "Objects Suck" and "ORM: the Vietnam of Computer Science", I'm finally finding my own blogging voice.

I'm going back and shamelessly editing my earlier blog posts and toning them down.

Why?

I've decided that the best way to contribute to the technical blogosphere is to try to have more nuance and care in my writing. Although I'm happy to hold and to share a strongly-held opinion, overstating claims is already being done!

So to stand out, I'm going to tone down. :-) ]

  • ORM, though it has its place, is the Vietnam of Computer Science; a more user-friendly way of putting this would be to say that ORM is not a panacea, and there are a lot of edge cases where it does not work well.
  • Objects, though useful, are often {mis|over}used where classic data structures like lists and maps would do better
  • Your data model should generally be in your database, not your application
  • Understanding the relational data model, and how it differs from SQL, is very useful knowledge if you use an RDBMS a lot
  • Not all data are relational
  • If you code database-backed web sites, you must know SQL; no excuses, even if you use an ORM
  • Some things are templating problems, not code generation problems
  • etc, etc…

So maybe in addition to my own humble offerings, more people need to blog about these topics. Some of my favourite bloggers may overstate their claims—I may overstate my claims sometimes—but these are the topics that I’m passionate about, and that I’m most interested in adding my voice to.

Everything Is an Edge Case: Lessons from Frameworks that I Applied to Django 16 July 2009

Posted by manniwood in Django, Programming, Python.
add a comment

Commenters on my From J2EE to Django: Observations from Porting a Web App have noted that I don’t use a lot of Django’s capabilities: I wrote my own SQL mapper/templater (Pybatis), thereby jettisoning Django’s ORM; I wrote my own session persistence layer that leverages Pybatis; I swapped out Django’s templates for Jinja 2; I don’t use the admin features; and I don’t use the automated form generation!

That doesn’t leave a whole lot, but let me give props to Django on the stuff I had no desire to write on my own:

  • I have no desire to deal with all the raw response headers and such.
  • I have no desire to deal with mapping urls to functions in modules. I also love the way Django does it, both from the pretty-url point of view, and the functions-not-classes point of view. (It seems so bloated to me when I look back on my J2EE code and see that I had to make an entire servlet object for each new kind of user interaction, whereas with Django, I only make one function. Less is more!)
  • I have no desire to work hard at integrating Python with Apache. mod_wsgi and Django’s support for it make me very happy.
  • I have no desire to write the configuration stuff that Django provides.
  • I have no desire to write the middleware stuff that Django provides.

In my opinion, even though I’ve jettisoned what a lot of people might see as Django’s major components, I think the features I’ve chosen to use make using Django worth-while. And I really appreciate the fact that Django does not force me to use its ORM or its templating language. Thank you, Django team.

But, why don’t I use the other features of Django? The short answer is the title of this blog entry: Everything is an edge case. There’s a great description of Perl that I wish applied to web development: “Perl makes the easy things easy, and the hard things possible.” But it doesn’t apply to web development; it only applies to Perl.

With web development, there’s not a lot of easy stuff: it’s all hard stuff. For instance, for the last database-backed web application I wrote, there were no forms that could have been auto-generated even if I’d wanted to: every form was unique and complicated in its own way. Each form served the application well, but there wasn’t enough commonality between the forms that form code generation was possible. (Jinja includes eased some of the pain, but that’s templating, not code generation.)

Same with ORM: I already think that ORM is the Vietnam of computer science, and I wanted to create my data model in the database, directly in SQL, so I could get exactly the schema I wanted. I wanted to write my database access code in SQL myself (especially the reporting code). The ORM would have been in my way. So I used a thin, template-based access layer like Pybatis.

The whole automatically-generated admin site that Django could produce for me? It would have been great, but the demands of my project outstripped what could be automatically generated. Again, every form was tweaked and twisted and user-frienlified to the point where only hand-coding would do.

And this is not a criticism of Django, by the way. When I used J2EE, I had stopped using frameworks altogether: it was easier for me to use raw servlets and iBATIS, and cobble together my own “framework” of often-used idioms as I went along.

And you know what would happen? I’d start to try to codify stuff in higher levels of abstraction (“Now all my editors can be simply configured from a configuration file! The problem editing things is solved!”) only to have a new requirement render my newest abstraction useless. As the title of this blog says, when building non-trivial database-backed web applications for demanding clients, everything is an edge case. On all the web sites I end up working on, there are no common, repeated, trivial cases of anything that can be encoded in a framework.

On the other hand, there are a number of basic housekeeping features that a lot of frameworks provide nowadays. These features go unsung, and yet they are often the most useful.

Finally, I think one thing that works against (or with?) Django and many other Python frameworks is that Python is such a productive language to code in (compared to Java, anyway) that it’s often quite easy to take a framework that takes care of stuff you didn’t want to write yourself, and then code the rest of what you needed by hand in a couple of days. If there’s one thing that continues to impress me about Python, it’s that if I can’t find a library to do something I need, I generally surprise myself by how quickly I can cobble something together to do what I need.

SQL NULLs are Evil! 15 July 2009

Posted by manniwood in PostgreSQL, SQL, SQL NULLs.
2 comments

In my last post, I spoke about how null in your business application language is not the same as NULL in SQL. Whereas null in your business application language (Java, Python) happens to throw a null pointer error (or equivalent) when you try to add it to numbers or concatenate it to strings or otherwise interact with it, SQL happily returns NULL.

As a result, Joe Celko says to avoid NULL as much as you can in your SQL. C. J. Date says that the relational data model isn’t even supposed to contain NULL! (Although it should be noted that NULL exists in SQL because C. F. Codd put it in his original writings on the topic, and SQL was quick to pick it up. Alas.)

As I said in my previous post, in SQL, NULL does not mean “there’s nothing here!”; it means “I have no idea!”

Let’s explore this evil further, shall we?

I’m going to fire up psql, the PostgreSQL client, and try a few things.

First, I want NULLs to show up as ‘NULL’ and not whitespace:

mydb# \pset null 'NULL'
Null display is "NULL".

Now let’s create a table t that holds quarterly earnings (in whole dollars only, so I can use int for my columns; this is just a toy demo).

mydb# create table t (year int, q1 int, q2 int, q3 int,
mydb# q4 int);
CREATE TABLE

Now let’s enter earnings for the first two quarters of this year:

mydb# insert into t (year, q1, q2) values (2009, 50, 40);
INSERT 0 1

Now let’s calculate earnings for this year so far:

mydb# select q1 + q2 + q3 + q4 as "earnings"
mydb# from t where year = 2009;
 earnings
----------
     NULL
(1 row)

Any grizzled SQL programmer will tell you this makes perfect sense, and that 90 is not the answer.

mydb# select * from t;
 year | q1 | q2 |  q3  |  q4
------+----+----+------+------
 2009 | 50 | 40 | NULL | NULL
(1 row)

We don’t know the values for q3 and q4 yet, so we have left them null. Here’s the problem. In SQL, NULL means “I have no idea”, which literally means q3 and q4 could have values, it’s just that the database does not know what they are! So q1 + q2 + q3 + q4 translates to 50 + 40 + “I have no idea” + “I have no idea”. The only logical total is “I have no idea”! Because when q3 and q4 do not have empty values, but unknown values (which could be 0, but could be 65, or could be -1…), then the sum of q1 through q4 is also unknown. Ah, the joys of SQL.

The same thing happens with text too, by the way. If we create a table for users, and we leave the address2 field NULL (a common thing), but then we try to form a complete address by concatenating the second address field to the other columns, here’s what happens:

mydb# create table u (username text, addr1 text, addr2 text);
CREATE TABLE
mydb# insert into u (username, addr1) values
mydb# ('bob', '32 Short St.');
INSERT 0 1
mydb# select username || ': ' || addr1 || ' ' ||
mydb# addr2 as "ADDR" from u;
 ADDR
------
 NULL
(1 row)

This should come as no surprise now.

What are the solutions?

There are two.

1. Use COALESCE. If you are saddled with a database that is riddled with NULLs being (incorrectly) used to indicate emptiness (instead of unknown), SQL’s COALESCE is here to help. I won’t discuss it any further, but it’s very useful when you’re faced with NULLs that you can’t vanquish.

2. Fix your data model to not allow NULLs! One thing I’ve come to really appreciate about SQL is NOT NULL and DEFAULT values. They really help you make more accurate, more robust data model. Another thing I’ve come to appreciate about PostgreSQL in particular is that it allows you to store empty strings, whereas other RDBMSs (I’m looking at you, Oracle) silently convert empty strings to NULLs! I do love Oracle, but, when you can, choose an RDBMS that allows empty strings in your data model. It’s less evil that way.

Here’s how to fix our earnings table:

mydb# update t set q3 = 0, q4 = 0 where year = 2009;
UPDATE 1
mydb# select * from t;
 year | q1 | q2 | q3 | q4
------+----+----+----+----
 2009 | 50 | 40 |  0 |  0
(1 row)
mydb# alter table t alter q1 set not null,
mydb# alter q1 set default 0;
ALTER TABLE
mydb# alter table t alter q2 set not null,
mydb# alter q2 set default 0;
ALTER TABLE
mydb# alter table t alter q3 set not null,
mydb# alter q3 set default 0;
ALTER TABLE
mydb# alter table t alter q4 set not null,
mydb# alter q4 set default 0;
ALTER TABLE

So now we have made our earnings table not accept nulls, and we have provided sane defaults for each quarter! If you approach each column of each table as being NOT NULL but with a DEFAULT, you’ll be surprised at how many sane defaults there are in most situations. For instance, in the case of quarterly earnings, the earnings for a quarter that hasn’t come yet are not unknown (NULL); they are 0! We know they are 0, because that quarter hasn’t arrived yet, so we’ve earned 0 dollars!

Hence, here’s all we have to do to enter the row for next year, now that we have NOT NULL and sane DEFAULTs:

mydb# insert into t (year) values (2010);
INSERT 0 1
mydb# select * from t where year = 2010;
 year | q1 | q2 | q3 | q4
------+----+----+----+----
 2010 |  0 |  0 |  0 |  0
(1 row)

This will come in very handy for reporting, when you come to do that later on:

mydb# select q1 + q2 + q3 + q4 as "earnings"
mydb# from t where year = 2009;
 earnings
----------
       90
(1 row)

Here’s how we fix our user table:

mydb# update u set addr2 = '' where username = 'bob';
UPDATE 1
mydb# alter table u alter username set not null;
ALTER TABLE
mydb# alter table u alter addr1 set not null,
mydb# alter addr1 set default '';
ALTER TABLE
mydb# alter table u alter addr2 set not null,
mydb# alter addr2 set default '';
ALTER TABLE

Now we’ve made our address fields not accept NULLs, and made them default to the empty string. This makes more sense, because when we did not provide addr2 when we inserted user bob, we did so because we knew addr2 was not provided. So addr2 was not unknown (NULL), it was known to be empty, making the empty string the best representation of this state. Naturally, now this works too:

mydb# select username || ': ' || addr1 || ' ' || addr2 as "ADDR"
mydb# from u where username = 'bob';
        ADDR
--------------------
 bob: 32 Short St.
(1 row)

If there’s one thing I’ve discovered from designing data models that eschew NULLs wherever possible, it’s this:

1. they’re usually not necessary, and

2. there’s usually a sane default that actually makes the data model more informative than the “Idunno” shrug of the shoulders that is SQL’s NULL.

Also, if we take C. J. Date’s idea seriously that a relational database is a series of propositions all of which must evaluate to true, the use of NULLs in your data model throws a real wrench in the machinery. Because with NULLs, there’s always the problem that certain propositions will be neither provably true nor false, but UNKNOWN.

I haven’t yet touched upon interesting things you can do to avoid NULLs in foreign keys, which I may discuss in a later post.

SQL NULL is not Application Language NULL 14 July 2009

Posted by manniwood in Java, PostgreSQL, Programming, Python, SQL, SQL NULLs.
3 comments

When I first started using Python, it irritated me that null was None. C uses NULL, Java and JavaScript use null; why did Python have to use None? And why did Lisp have to use nil? Can’t all languages just standardise on null? It seems so arbitrary and capricious.

I’ve had a change of heart: I actually think the word ‘null’ needs to go away: languages should specify exactly what they mean by null, the way Python means None when it says None.

Like most application languages, Python throws errors when you try to add None to an integer, or concatenate None with strings:

>>> 1 + None
Traceback (most recent call last):
  File "", line 1, in
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
>>> "foo " + None
Traceback (most recent call last):
  File "", line 1, in
TypeError: cannot concatenate 'str' and 'NoneType' objects

In Java, you get null pointer errors trying to do similar things with Integer objects and Strings. Maybe it would be nice if Java also called null None, seeing as that’s what null really means in Java too.

In the case of SQL, it would be nice if NULL was called UNKNOWN. (Unfortunately, UNKNOWN is already taken! When was the last time you used the UNKNOWN keyword in SQL? Me neither. But it’s there…)

SQL’s NULL is often (ab)used to indicate “none”, but that’s incorrect: really a SQL NULL means “I have no idea”.

Watch this to see what I mean.

I fire up a PostgreSQL client, and ask it to print “NULL” instead of whitespace whenever NULL is encountered in result sets:

mydb=# \pset null 'NULL'
Null display is "NULL".

and now I try to add an integer and NULL, then try to concatenate NULL to a string:

mydb# select 1 + null as "RESULT";
 RESULT
--------
   NULL
(1 row)

mydb# select 'foo ' || null as "RESULT";
 RESULT
--------
 NULL
(1 row)

No tracebacks or null pointer errors in SQL! Nope. You get results. Why? Because SQL NULL does not mean “None”, it means “I have no idea!”

Hence, the only logical answer to adding one and “I have no idea” is “I have no idea”, because we have no idea what to what we are adding the number one—so if we are forced to provide an answer, we must respond “I have no idea!” And I’ll say it again: in SQL, NULL means “I have no idea”, not “None”. (Given that SQL has the UNKNOWN keyword, you’d think that UNKNOWN would be the result to the above queries. For whatever reason, SQL returns NULL instead. On the plus side, this re-inforces the fact that even SQL considers NULL to be the equivalent of UNKONWN.)

Likewise for concatenating ‘foo ‘ and SQL NULL: The answer is not ‘foo ‘ as you might expect if you were concatenating ‘foo ‘ and nothing; the answer is SQL NULL, because we are concatenating ‘foo ‘ and “I have no idea”. And when we have no idea what we are concatenating ‘foo ‘ to, we cannot predict what the resulting string would be, so the only logical answer is “I have no idea!”

In a sense, you could argue that Java null and Python None also mean “I have no idea”, it’s just that, in those languages, when you try to add 1 to “I have no idea”, you get an error. The compiler is basically stopping and asking for a time out: “Just what are you asking me to do here? I can’t add one and ‘I have no idea’!” Wouldn’t it be nice if SQL did that? But it does not. SQL happily returns the result: “I have no idea what one plus ‘I have no idea’ equals.”

My next blog post will be an appreciation of why Joe Celko says to avoid nulls in SQL whenever you can, and why C. J. Date says nulls should not even be part of the relational data model. I want to keep on typing, but I want to leave this blog entry with only one major idea: that nulls in your code’s business logic layer are not the same as nulls in your RDBMS!