jump to navigation

RDBMS Static Typing vs. General Purpose Language Dynamic Typing 23 August 2009

Posted by manniwood in Programming, Python, SQL, Uncategorized.
add a comment

One thing that’s fun about being on the job interview circuit is that you get to re-think a lot of your biases and tastes.

Something that has occurred to me is that I prefer to program in dynamically typed languages when I can (lately that would be manifested as a preference for Python over Java) and yet this fails to explain my love for RDBMSs, and the static typing that goes with them!

For general-purpose programming, Paul Graham’s opinion definitely rings true, so I’ll quote him at length:

As far as I can tell, the way they taught me to program in college was all wrong. You should figure out programs as you’re writing them, just as writers and painters and architects do.

Realizing this has real implications for software design. It means that a programming language should, above all, be malleable. A programming language is for thinking of programs, not for expressing programs you’ve already thought of. It should be a pencil, not a pen. Static typing would be a fine idea if people actually did write programs the way they taught me to in college. But that’s not how any of the hackers I know write programs. We need a language that lets us scribble and smudge and smear, not a language where you have to sit with a teacup of types balanced on your knee and make polite conversation with a strict old aunt of a compiler.

—Paul Graham, Hackers and Painters

What’s interesting is that I find for the projects I’ve worked on, I really like to nail down the data model in an RDBMS, and then code the business logic and display code using as flexible a language as possible. I think this bias comes from two requirements I’ve encountered a lot in my programming career:

1) make stuff easy to report on

2) change data capture and display on a fairly routine basis.

To make stuff easy to report on, it’s nice to have constraints and other forms of data integrity enforcement, so that your reports are already half-way complete just because of how your RDBMS stores (or refuses to store) your data: you know your data are correct, and you just have to aggregate them. Honestly, I sometimes find RDBMSs a bit too rigid to deal with the ever-changing needs of business, but all mature RDBMSs seem to have good support for alter <anything> commands.

And if you can get your data stored in a consistent way in an RDBMS, SQL is at your beck and call to report on your data in all sorts of interesting ways. This generally pleases the business very much.

For the projects I’ve tended to work on, after all that hard work has gone into getting the data model right, I generally prefer to make as thin a layer as possible over that database, so that the database can essentially speak for itself. A dynamically typed language comes in handy here, because it will have good support for lists and maps, and not a whole lot of casting or type conversion has to happen to shuttle data back and forth from the display to the RDBMS, and vice versa. Manipulating and validating form input is generally much easier in higher-level languages as well, with their somewhat easier out-of-the-box implementations of regular expressions, etc, etc.

Of course, this leaves open some interesting possibilities. If my next project has ever-changing needs, and a data model cannot really be pinned down, will I try to squeeze that design into an RDBMS, or will I try to find a more flexible, non-traditional data store? I can easily imagine turning to a non-traditional data store if the data are too unruly, and don’t follow the sorts of patterns that would suggest using an RDBMS.

If the data were to rarely be reported on, I might not even miss SQL that much. After all, SQL really shines when it comes to reporting.

But if I had to generate reports from non-relational data, that might start to get interesting. As much as I love dynamically typed languages, this might be one area where I’d wish for SQL’s strong typing, because programatically preparing reports would feel sort of like re-inventing the SQL wheel.

I don’t know if I’ll ever be able to explain the seeming inconsistency in my love of dynamic typing in general-purpose programming languages, and my love of SQL and the strong typing that goes with it. Maybe it’s because I like storing my data in a formal way (when the data are amenable to it), but I like manipulating my data in a flexible way, especially when I know I can count on them being correct.

All I know is that on the projects I’ve worked on, it’s been a great combination. But I also know that all projects are different, and that not all data are relational. Maybe one day I’ll work on a project that uses dynamic typing from top to bottom. That would be interesting.

Will Python Make Me A Better Java Programmer? 21 August 2009

Posted by manniwood in Java, Programming, Python.
add a comment

As I said in J2EE Still Rules the Roost, the Python jobs are few, and the Java jobs are many.

So I’ve been interviewing primarily for Java jobs lately.

But it strikes me that knowing Python is perhaps an advantage. For instance, employers generally want to know that you’re excited about learning new technologies, and that you have a proven track record of doing so. Well, if you’ve taught yourself a new language and migrated your latest project to that language, that’s pretty good proof, right?

Also, more generally, the world of technology is always moving forward. We all have to keep up, and we should all ideally be forward looking.

I happen to be particularly impatient for that future to arrive, so when left to my own devices, I tend to embrace newer technologies. This is why I’m still chomping at the bit to learn Lisp, or maybe even Arc: I think Paul Graham is on to something when he says that languages are converging towards the feature set offered by Lisp. (Even if not the syntax offered by Lisp. Apparently, Perl 6 is going to have Lisp-style macros—if it ever ships.)

More immediately, will my knowledge of Python help me or hinder me if I end up programming Java at my day job? I think it will only help. Any time you’ve got knowledge beyond what you are required to know to solve the immediate problem at hand, I think it can only be a good thing.

I never want to program without a REPL again 5 August 2009

Posted by manniwood in J2EE, Programming, Python.
add a comment

I’m hacking code in Python right now, and I just said to myself “I never want to program without a REPL again”.

I just read a blog entry at artfulcode.net, where the author said “Vim 7 has auto-complete, although I rarely use it. I’m much more comfortable with an enormous stack of O’Reilly books.” As someone who spends a lot of money on dead trees every year (you’re welcome, O’Reilly), this rang very true for me.

But in addition to consulting lots of programming books, I spend a lot of time with a python interpreter open, typing exploratory code. Often, this is quicker than picking up a book, because I’m already pretty sure what a bit of code will do—Python’s REPL just lets me test it to be certain I have the syntax right, etc, etc.

Now that I think about it, I spend a lot of days with a psql session open, too, so I can type exploratory SQL. There’s really no substitute for being able to turn “I wonder what would happen if…” into “Let’s find out right now!”.

Not too many months ago, when I was still hacking J2EE, I used Eclipse to help me through Java. For such a verbose language, a good development environment is really key, because there’s no REPL, and the APIs are gargantuan.

With Python, I’m back to hacking code in Vim again (sort of full circle, seeing as I used Vim back in my Perl/CGI days) because Python is so productive that a Python IDE isn’t that big of a win over a good editor, a stack of books, and… the REPL! I think a REPL has to be part of the baseline of any new language. (Sort of the way Java added garbage collection to the new baseline back in its day. After Java, you really couldn’t introduce a new general-purpose programming language without garbage collection.) The fact that I couldn’t do exploratory Java programming in a REPL was a real productivity killer that not even Eclipse could mitigate.

With the popularity of Python and Ruby, the baseline has moved up again: any new language must be interpreted as well as compiled. Eclipse tried really hard to compile Java on the fly, to make up for the fact that there was no Java interpreter to speed up development, but I always found myself stopping Tomcat, doing a build, and restarting, just to be certain that each iteration really worked and wasn’t using old class files.

Truth be told, I find myself doing the same thing with Python/Django: I run the full Apache/mod_wsgi/Django stack, and start and stop between iterations… but there’s no compile cycle! Huge restart difference! Every once in a while I run a ‘make clean’ target just to ensure all my .pyc files are fresh, but my stop/start cycle is practically instantaneous, because Python doesn’t compile the way Java does, and Apache is faster than Tomcat.

Of course, as I look to the future, I can easily see myself saying “I will never program without Lisp-style macros again!” The baseline keeps moving up.

I think Paul Graham is on to something when he says general-purpose programming languages are evolving towards lisp. (I just wonder if Lisp’s syntax can be decoupled from Lisp’s feature set.)

But for now, I’ll enjoy my REPL.

Everything Is an Edge Case: Lessons from Frameworks that I Applied to Django 16 July 2009

Posted by manniwood in Django, Programming, Python.
add a comment

Commenters on my From J2EE to Django: Observations from Porting a Web App have noted that I don’t use a lot of Django’s capabilities: I wrote my own SQL mapper/templater (Pybatis), thereby jettisoning Django’s ORM; I wrote my own session persistence layer that leverages Pybatis; I swapped out Django’s templates for Jinja 2; I don’t use the admin features; and I don’t use the automated form generation!

That doesn’t leave a whole lot, but let me give props to Django on the stuff I had no desire to write on my own:

  • I have no desire to deal with all the raw response headers and such.
  • I have no desire to deal with mapping urls to functions in modules. I also love the way Django does it, both from the pretty-url point of view, and the functions-not-classes point of view. (It seems so bloated to me when I look back on my J2EE code and see that I had to make an entire servlet object for each new kind of user interaction, whereas with Django, I only make one function. Less is more!)
  • I have no desire to work hard at integrating Python with Apache. mod_wsgi and Django’s support for it make me very happy.
  • I have no desire to write the configuration stuff that Django provides.
  • I have no desire to write the middleware stuff that Django provides.

In my opinion, even though I’ve jettisoned what a lot of people might see as Django’s major components, I think the features I’ve chosen to use make using Django worth-while. And I really appreciate the fact that Django does not force me to use its ORM or its templating language. Thank you, Django team.

But, why don’t I use the other features of Django? The short answer is the title of this blog entry: Everything is an edge case. There’s a great description of Perl that I wish applied to web development: “Perl makes the easy things easy, and the hard things possible.” But it doesn’t apply to web development; it only applies to Perl.

With web development, there’s not a lot of easy stuff: it’s all hard stuff. For instance, for the last database-backed web application I wrote, there were no forms that could have been auto-generated even if I’d wanted to: every form was unique and complicated in its own way. Each form served the application well, but there wasn’t enough commonality between the forms that form code generation was possible. (Jinja includes eased some of the pain, but that’s templating, not code generation.)

Same with ORM: I already think that ORM is the Vietnam of computer science, and I wanted to create my data model in the database, directly in SQL, so I could get exactly the schema I wanted. I wanted to write my database access code in SQL myself (especially the reporting code). The ORM would have been in my way. So I used a thin, template-based access layer like Pybatis.

The whole automatically-generated admin site that Django could produce for me? It would have been great, but the demands of my project outstripped what could be automatically generated. Again, every form was tweaked and twisted and user-frienlified to the point where only hand-coding would do.

And this is not a criticism of Django, by the way. When I used J2EE, I had stopped using frameworks altogether: it was easier for me to use raw servlets and iBATIS, and cobble together my own “framework” of often-used idioms as I went along.

And you know what would happen? I’d start to try to codify stuff in higher levels of abstraction (“Now all my editors can be simply configured from a configuration file! The problem editing things is solved!”) only to have a new requirement render my newest abstraction useless. As the title of this blog says, when building non-trivial database-backed web applications for demanding clients, everything is an edge case. On all the web sites I end up working on, there are no common, repeated, trivial cases of anything that can be encoded in a framework.

On the other hand, there are a number of basic housekeeping features that a lot of frameworks provide nowadays. These features go unsung, and yet they are often the most useful.

Finally, I think one thing that works against (or with?) Django and many other Python frameworks is that Python is such a productive language to code in (compared to Java, anyway) that it’s often quite easy to take a framework that takes care of stuff you didn’t want to write yourself, and then code the rest of what you needed by hand in a couple of days. If there’s one thing that continues to impress me about Python, it’s that if I can’t find a library to do something I need, I generally surprise myself by how quickly I can cobble something together to do what I need.

SQL NULL is not Application Language NULL 14 July 2009

Posted by manniwood in Java, PostgreSQL, Programming, Python, SQL, SQL NULLs.
3 comments

When I first started using Python, it irritated me that null was None. C uses NULL, Java and JavaScript use null; why did Python have to use None? And why did Lisp have to use nil? Can’t all languages just standardise on null? It seems so arbitrary and capricious.

I’ve had a change of heart: I actually think the word ‘null’ needs to go away: languages should specify exactly what they mean by null, the way Python means None when it says None.

Like most application languages, Python throws errors when you try to add None to an integer, or concatenate None with strings:

>>> 1 + None
Traceback (most recent call last):
  File "", line 1, in
TypeError: unsupported operand type(s) for +: 'int' and 'NoneType'
>>> "foo " + None
Traceback (most recent call last):
  File "", line 1, in
TypeError: cannot concatenate 'str' and 'NoneType' objects

In Java, you get null pointer errors trying to do similar things with Integer objects and Strings. Maybe it would be nice if Java also called null None, seeing as that’s what null really means in Java too.

In the case of SQL, it would be nice if NULL was called UNKNOWN. (Unfortunately, UNKNOWN is already taken! When was the last time you used the UNKNOWN keyword in SQL? Me neither. But it’s there…)

SQL’s NULL is often (ab)used to indicate “none”, but that’s incorrect: really a SQL NULL means “I have no idea”.

Watch this to see what I mean.

I fire up a PostgreSQL client, and ask it to print “NULL” instead of whitespace whenever NULL is encountered in result sets:

mydb=# \pset null 'NULL'
Null display is "NULL".

and now I try to add an integer and NULL, then try to concatenate NULL to a string:

mydb# select 1 + null as "RESULT";
 RESULT
--------
   NULL
(1 row)

mydb# select 'foo ' || null as "RESULT";
 RESULT
--------
 NULL
(1 row)

No tracebacks or null pointer errors in SQL! Nope. You get results. Why? Because SQL NULL does not mean “None”, it means “I have no idea!”

Hence, the only logical answer to adding one and “I have no idea” is “I have no idea”, because we have no idea what to what we are adding the number one—so if we are forced to provide an answer, we must respond “I have no idea!” And I’ll say it again: in SQL, NULL means “I have no idea”, not “None”. (Given that SQL has the UNKNOWN keyword, you’d think that UNKNOWN would be the result to the above queries. For whatever reason, SQL returns NULL instead. On the plus side, this re-inforces the fact that even SQL considers NULL to be the equivalent of UNKONWN.)

Likewise for concatenating ‘foo ‘ and SQL NULL: The answer is not ‘foo ‘ as you might expect if you were concatenating ‘foo ‘ and nothing; the answer is SQL NULL, because we are concatenating ‘foo ‘ and “I have no idea”. And when we have no idea what we are concatenating ‘foo ‘ to, we cannot predict what the resulting string would be, so the only logical answer is “I have no idea!”

In a sense, you could argue that Java null and Python None also mean “I have no idea”, it’s just that, in those languages, when you try to add 1 to “I have no idea”, you get an error. The compiler is basically stopping and asking for a time out: “Just what are you asking me to do here? I can’t add one and ‘I have no idea’!” Wouldn’t it be nice if SQL did that? But it does not. SQL happily returns the result: “I have no idea what one plus ‘I have no idea’ equals.”

My next blog post will be an appreciation of why Joe Celko says to avoid nulls in SQL whenever you can, and why C. J. Date says nulls should not even be part of the relational data model. I want to keep on typing, but I want to leave this blog entry with only one major idea: that nulls in your code’s business logic layer are not the same as nulls in your RDBMS!

Why Maps Are Not the Default Return Type of Python’s DB API 13 July 2009

Posted by manniwood in PostgreSQL, Python, SQL.
2 comments

One thing that irritates me about Python’s DB API is that rows in result sets are by default integer indexed, rather than column-name indexed.

So, for the query

select id as "ID",
       username as "USERNAME"
  from users

I have to retrieve “USERNAME” from a row of my result set with a numeric offset, like so:

username = myrow[1]

Python’s DB API does not, by default, let me access column rows by name. Why?

It turns out this irritating state of affairs comes from a “feature” of SQL: columns in result sets do not need unique names.

I figured I’d test this in my favourite RDBMS, PostgreSQL:

mydb=# create table t (t text not null, t text not null);
ERROR:  column "t" specified more than once

This made me happy: tables cannot contain duplicate row names.

But, result sets can contain duplicate row names:

mydb=# create table t1 (t text not null);
CREATE TABLE
mydb=# create table t2 (t text not null);
CREATE TABLE
mydb=# insert into t1 (t) values ('Hello');
INSERT 0 1
mydb=# insert into t2 (t) values ('world');
INSERT 0 1
mydb=# select * from t1, t2;
   t   |   t
-------+-------
 Hello | world
(1 row)

mydb=# select t1.t, t2.t from t1, t2;
   t   |   t
-------+-------
 Hello | world
(1 row)

So if I tried to retrieve column ‘t’ in the above two queries, I wouldn’t be providing enough information to get a single column.

Now it turns out that the psycopg2 implementation of the PostgreSQL DB API driver does allow you to return maps instead of sequences for each row:

import psycopg2.extras
curs = conn.cursor(cursor_factory=\
psycopg2.extras.RealDictCursor)

And I use this as the default in Pybatis. That way, I can do the following using my original query at the top of this post:

username = myrow["USERNAME"]

But it’s irritating to know that SQL allows same-named columns in result sets. It’s part of SQL’s legacy of being a flawed implementation of the relational data model. In the relational data model, tables and result sets are of the same type: they are both relations. So they both are constrained to having unique column names. (There aren’t actually columns in the relational data model, but I’ll ignore that right now.)

On the other hand, in SQL, tables and result sets are different. Tables must have unique column names, but result sets don’t have to.

Happily, there is an easy way to get around this. Always explicitly name the columns of result sets to ensure all of the column names are unique. Then, maps (or hashes, or dictionaries) can be used to represent rows from result sets:

mydb=# select t1.t as "t1_t", t2.t as "t2_t" from t1, t2;
 t1_t  | t2_t
-------+-------
 Hello | world
(1 row)

Also, that’s why I named my columns in the sample query opening this post. It’s just a habit now: it allows me to guarantee what names I’ll use when accessing my data in the dictionaries I get back as my result set.

This is one area where I wish SQL more closely followed the relational model. In the relational model, the target of a query is a relation (or relations), and the result of a query is a relation. Nice and simple.

If SQL followed the relational data model more closely, result sets would no more be allowed duplicate column names than tables. And that would be a beautiful thing.

SQL Generation Is a Templating Problem, Not a Code Generation Problem 2 July 2009

Posted by manniwood in Programming, Pybatis, Python, SQL.
2 comments

The only library I’ve ever used that I think got SQL generation right was iBATIS. Every other SQL generator, or object-relational mapper got it wrong.

The two basic flaws of everything from LINQ to Hibernate to Django’s ORM are 1) the assumption that developers don’t want to write SQL code, and 2) the use of a lower-level language to generate a higher-level language. Using .NET or Java or even Python to generate SQL makes as much sense to me as using Assembler to generete C. Code generation of this sort only works going from high-level to low-level; not the other way around. The other way around, you are using a more verbose language to generate a less verbose language. It escapes me why it is not intuitively obvious that using any other language to generate SQL is stupid. Just learn SQL!

I can find hope in the fact that nobody hides HTML generation behind layers of abstraction anymore. But object-relational-mappers continue to hide SQL generation behind layers of abstraction. Do you remember the old days when HTML could be generated by tools like Microsoft’s Front Page? Remember CGI.pm, which turned every HTML tag into a method?


print hr; # prints <hr>

Not only does this make no sense, it’s insulting. It assumes I don’t want to, or can’t, learn HTML. The same insult persists to this day with ORM-based SQL generation.

Here’s what I want to do. Let’s say I have captured user input in a map called form_fields that has a field called ‘creation_date’. When form_fields['creation_date'] is true, I want to show the creation date of the 20 most recently created users in my database, otherwise, not.

Here’s what I want the HTML to look like, using the Jinja templating engine:


<table>
<tr>
{% if form_fields.creation_date == True %}
<td>Created On</td>
{% endif %}
<td>Username</td>
</tr>
{% for user in users %}
<tr>
{% if form_fields.creation_date == True %}
<td>{{ user.created_on }}</td>
{% endif %}
<td>{{ user.username }}</td>
</tr>
{% endfor %}
</table>

Here’s what I want my SQL code to look like:


select
{% if form_fields.creation_date == True %}
created_on as "created_on",
{% endif %}
username as "username"
from users
limit 20

Obviously, I want to use templates for SQL the same way I do for HTML. The only difference is that whereas I send HTML to a browser once my templating engine has worked its magic, I send SQL to my RDBMS once my templating engine has worked its magic. And all I want back from my RDBMS is lists of maps with the key names specified by my “as” column aliases in my SQL code. (See my own Objects are Hammers, or Jeff Atwood’s Why Objects Suck or Ted Neward’s The Vietnam of Computer Science for why I do not want to return a list of objects or even engage in object-relational mapping in any way, shape, or form. Just give me a list of maps, please and thank you.)

In fact, that’s exactly what Pybatis allows me to do: I loved the central idea of iBATIS so much, I ported it to Python and named it Pybatis as an homage to the original. There are some differences in Pybatis: instead of using iBATIS’ custom XML templating syntax, I used Jinja templates, so that I could use the same templating engine for my SQL code as I was using for my HTML code. What could be easier? And, whereas iBATIS assumes you want to map result sets to objects (but, happily, allows outputing maps), Pybatis assumes you almost always want a list of maps (list of dicts, in Python). In fact, Pybatis was so easy to write (the original version took an afternoon), I’d encourage you to build your own SQL templating system using your favourite templating engine. You’ll never use ORM again.

In the realm of HTML generation, sanity has prevailed: everyone has realised that HTML generation is not a code generation problem but a templating problem. The beauty of this approach is that you are writing HTML most of the time, and you only drop into the templating language when you need to introduce variables and loops to the static world of HTML.

It turns out, the same approach works best for SQL as well. Sadly, this is not universally acknowledged like it is with HTML, but I hope one day it will be. For now, there is iBATIS, and my Python port (and homage), Pybatis.