
Some Sanity on NoSQL 5 November 2009

Posted by manniwood in Uncategorized.

I haven’t posted in a while, but if you’re looking for a short, good read, check out What I like about the NoSQL crowd. I couldn’t have said it better myself.

Java’s Long.toBinaryString(long l); come on, guys! 21 October 2009

Posted by manniwood in Uncategorized.
2 comments

So I’ve been doing some bit fiddling in Java, and because I don’t do a lot of bit fiddling, I want to print out the bits of longs so that I can get some feedback.

So I set up a long like so:

long someLong = 1L;

And I ask Java to show me each individual bit like so:

System.out.println(Long.toBinaryString(someLong));

And here’s what Java outputs.

1

Thanks for printing the other 63 zeros there, Java. Great effort.

Happily, I like collecting programming books, and John W. Perry’s Advanced C Programming by Example has a nice example of printing all the bits in short ints.

It’s actually kind of cool. First what you need is a way to test each bit in a series of bits. Let’s say you have the following byte:

00000010

and you want to test its second bit (from the end). You shift that bit to the end:

byte i = 2;  // i is 00000010
i >>= 1;  // shift one place to get 00000001

If we were interested in the first (end-most) bit, we would harmlessly shift zero places:

i = 2;  // i is 00000010
i >>= 0;  // shift zero places to get 00000010

If we were interested in the third bit, we would shift two places:

i = 2;  // i is 00000010
i >>= 2;  // shift two places to get 00000000

You take advantage of the fact that byte’s binary representation of 1 is

00000001

so if you & together 00000001 with any other byte, the first seven zeros are guaranteed to make your result start with seven zeros, but the final 1 will either & together with a 1 to give you 1, telling you the last bit was set, or & together with a 0 to give you 0, telling you the last bit was not set.

// let's test the second bit:
i = 2;  // i is 00000010
i >>= 1;  // shift one place to get 00000001
i &= 1;  // i is 00000001; the second bit was set

// let's test the second bit again:
i = 6;  // i is 00000110
i >>= 1;  // shift one place to get 00000011
//   00000011
// & 00000001
// -----------
// = 00000001
i &= 1;  // i is 00000001; the second bit was set

So you can write a testBit function (sorry—method; Java doesn’t have functions) that tests the i-th bit of a byte like so:

// return 1 if bitToTest-th bit of val was set,
// else return 0
byte testBit(byte val, int bitToTest) {
    val >>= bitToTest;
    val &= 1;
    return val;
}

And you can write a toBinaryString method that uses testBit like so:

String toBinaryString(byte val) {
    StringBuilder sb = new StringBuilder(8);
    for (int i = 7; i >= 0; i--) {
        sb.append((testBit(val, i) == 0) ? '0' : '1');
    }
    return sb.toString();
}

And so you can print all the zeros in your bytes:

byte i = 6;
System.out.println(toBinaryString(i));

Which will print this:

00000110

instead of this:

110

Nice, eh?
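Since the complaint that started this post was about longs rather than bytes, here is a minimal sketch of the same approach widened to 64 bits. The class name LongBits is just a placeholder of mine, and the unsigned shift >>> is a choice made for the sketch; the signed >> would also work here, because the & 1 mask throws away everything except the lowest bit.

```java
final class LongBits {
    // Returns 1 if the bit at position bitToTest (0 = end-most bit) is set, else 0.
    static int testBit(long val, int bitToTest) {
        return (int) ((val >>> bitToTest) & 1L);
    }

    // Builds a 64-character string, highest bit first, leading zeros included.
    static String toBinaryString(long val) {
        StringBuilder sb = new StringBuilder(Long.SIZE);
        for (int i = Long.SIZE - 1; i >= 0; i--) {
            sb.append(testBit(val, i) == 0 ? '0' : '1');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        long someLong = 1L;
        System.out.println(toBinaryString(someLong));
    }
}
```

If the bit testing itself is not the point, left-padding the built-in result, as in String.format("%64s", Long.toBinaryString(someLong)).replace(' ', '0'), should give the same string.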

Catching Up with Ted Neward 17 October 2009

Posted by manniwood in Uncategorized.
1 comment so far

Sometimes all I want to do in my blog is link to other blogs, like A Farewell to ORMs and ORMs are a thing of the past.

Not that ORMs are bad for all situations; just that they are not a panacea. I wonder what Ted Neward thinks of this continual rediscovery that ORMs have their issues?

ORM: Whatever Works 12 October 2009

Posted by manniwood in Uncategorized.
1 comment so far

Mwanji Ezana asked me in a comment on my previous blog post:

I think one of the major advantages of ORMs is their lazy-loading, caching and query-batching ability. It’s not just about generating a schema and queries.

In all your anti-ORM posts, I’ve never seen you mention these capabilities. What do you make of them? Are they unimportant to you?

(Thanks for reading, Mwanji!)

I have a few comments, and I’ll have to start with the definition of ORM itself: as I discussed in my previous blog post, ORM has come to mean a lot of things, including SQL mappers like iBATIS, which I personally don’t consider ORM, but which, apparently, much/some of the programming community does.

So I’ll repeat that I only dislike the ORMs that write SQL for me behind my back. For the rest of this blog entry, if I say ORM, just think ORM of the sort that automagically does things for you; not ORM that facilitates easier writing of SQL queries, such as iBATIS.

When it comes to lazy-loading, caching, and query-batching, I like them all! It’s just that I think they are all best when decoupled from ORM. Caching, in particular, is arguably something that you may want decoupled from your ORM.

For instance, Django has separate ORM and caching mechanisms: caching even has its own standalone chapter in the Django book.

I think this makes sense: caching is for more than just database queries, so I think it’s a great facility to offer outside of an ORM.

I consider query-batching to be another facility that can just as easily be offered outside of an ORM. If anything, the most effective way I know to do large batch jobs on RDBMSs is to use the tools offered by the RDBMSs themselves, rather than those offered by any ORM or library. RDBMSs’ batch tools usually work best from the command line, allowing you to not only avoid your ORM for batch jobs, but avoid the whole application altogether.

I think I’m partly anti-ORM for aesthetic reasons: I try to avoid impedance mismatches rather than embrace them.

Consider a project where an object model perfectly describes the business domain, but there’s a need to store the data in SQL, perhaps because of SQL’s ability to generate great reports, or whatever. It’s the classic object/relational impedance mismatch.

Some (most?) projects embrace this mismatch by using some sort of ORM to bridge the gap.

Personally, I’d see if I could find a really good object database so that I didn’t have to deal with the mismatch. Why not store my objects in an object database and not even have to deal with an RDBMS? Maybe there’s an object database out there that can still generate the reports I need, or do other things I thought I needed SQL to do.

On the other hand, if it turned out that the project needed features only SQL can provide, I’d think long and hard about whether or not my business data really had to be an object model. Maybe I could use the relational data model after all? Maybe in my application code I could use lists of maps to represent my data (so it would be a lot like SQL tables and/or result sets) and avoid the impedance mismatch by essentially bringing my relational model up into my application layer.
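To make the "lists of maps" idea concrete, here is a minimal sketch in Java. The column names (customer, amount_cents), the sample rows, and the class name RowsAsMaps are all made up for illustration; in a real project the rows would come back from a query rather than being built by hand.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class RowsAsMaps {
    // One "row": column name -> value, much like one row of a result set.
    static Map<String, Object> row(String customer, int amountCents) {
        Map<String, Object> r = new LinkedHashMap<>();
        r.put("customer", customer);
        r.put("amount_cents", amountCents);
        return r;
    }

    // Hypothetical data standing in for the result of a SELECT.
    static List<Map<String, Object>> sampleRows() {
        List<Map<String, Object>> rows = new ArrayList<>();
        rows.add(row("alice", 1250));
        rows.add(row("bob", 300));
        rows.add(row("alice", 500));
        return rows;
    }

    // Works on the rows directly; there is no domain object to map to or from.
    static int totalCentsFor(List<Map<String, Object>> rows, String customer) {
        int total = 0;
        for (Map<String, Object> r : rows) {
            if (customer.equals(r.get("customer"))) {
                total += (Integer) r.get("amount_cents");
            }
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(totalCentsFor(sampleRows(), "alice"));
    }
}
```

The trade-off is visible even at this size: nothing has to be mapped, but the compiler can no longer check column names or value types for you.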

But that’s just me. I’m heavily biased towards solving problems by not having to solve them in the first place: Got an impedance mismatch? Pick a side. Now you don’t have to liaise between two ways of looking at the same data, because you just eliminated one by deciding that the other was more important.

But I’m happy to admit to at least two things:

1) Not all developers have the elimination/simplicity bias that I have, and

2) Not all projects can just pick one paradigm and eliminate another.

The continued popularity of ORM must mean that a lot of people are using it in a lot of successful projects.

I may personally suspect that a lot of projects probably succeed in spite of ORM, but I have to admit that it’s only because ORM has never been a good fit for the projects I’ve worked on, so that experience has influenced my thinking. But I think we programmers need to admit that it’s a big world of programming problems out there, and one size does not fit all.

If your project is doing great, and you’re using ORM, then in the context of your project, you are right, and I am wrong. I’m glad you’re not listening to my criticisms of ORM, and I’m glad you’re sticking with what works.

On the other hand, if ORM is not working for you, or it’s showing some strain, check out Ted Neward’s The Vietnam of Computer Science, or about 25% of the blogs I’ve ever written. ;-) You may find some observations that ring true, even if you continue to use ORM.

Heroic Programming and Simple Programming 15 September 2009

Posted by manniwood in Uncategorized.
1 comment so far

The first popular language for web sites was Perl. I remember having C envy as a Perl programmer. My C envy was rooted in what I call productivity guilt. At the time, I thought Real Programmers did their own garbage collection. Real Programmers slung around null-terminated strings. Real Programmers used pointers. Perl took care of all that for me, and I got working code out the door fast, but I felt guilty about it, like I wasn’t using a Real Programming Language.

My guilt at not using A Real Programming Language had a silver lining: I learned C and C++. I ended up liking C quite a lot, even though I rarely use it professionally.

Joel Spolsky decries the rise of Java Schools (he wishes programmers still knew C/C++), and he has a point: even though I’ve never programmed a lot of C/C++ during my day job, a knowledge of C certainly makes me more conscious of what is going on under the hood of Perl and Java. It makes me a better programmer of both of those languages.

My favourite benefit of learning C is that it allows me to understand the code of open source software (a lot of which is still written in C). At one point, I got really interested in Apache, and my C knowledge came in handy.

I made a discovery, while learning about Apache, that has stuck with me until this day. It has to do with what I may as well call heroic programming versus simple programming.

Here was my view of heroic programming, back when I was a Perl developer, envious of the Real Coders Who Used C: Heroic programmers used their superior intellects to craft cleverly written software with all the pointer arithmetic and memory allocation/deallocation sprinkled throughout their code in a bug-free manner.

Here’s something I discovered with Apache: memory management (the do-it-yourself “garbage collection” that is one of the more difficult aspects of C programming) was abstracted away behind a brilliant architecture that made it easier to use.

The Apache designers took advantage of the fact that there are a lot of things that happen in a web server that are life-cycle based. The best example of this is servicing a request: it has a definite beginning and end. So the Apache designers thought: Why don’t we attach a pool of memory to each request? Whenever a piece of code servicing a request needs to allocate memory, it will allocate the memory out of the pool associated with that request. At the end of the request, the pool will be automatically deallocated.
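The shape of that idea can be sketched in a few lines. What follows is only an analogy in Java, not Apache's actual C API: the RequestPool class, its register method, and the log messages are hypothetical names of mine. The point is just that code servicing a request hands its cleanup work to the pool, and everything is released in one place when the request ends.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

final class RequestPoolSketch {
    // Hypothetical per-request pool: resources are registered as they are
    // acquired, and all of them are released together when the pool closes.
    static final class RequestPool implements AutoCloseable {
        private final Deque<Runnable> cleanups = new ArrayDeque<>();

        void register(Runnable cleanup) {
            cleanups.push(cleanup);
        }

        @Override
        public void close() {
            // Release in reverse order of registration.
            while (!cleanups.isEmpty()) {
                cleanups.pop().run();
            }
        }
    }

    // Stands in for servicing one request; returns a log of what happened.
    static List<String> handleRequest() {
        List<String> log = new ArrayList<>();
        try (RequestPool pool = new RequestPool()) {
            pool.register(() -> log.add("released buffer"));
            pool.register(() -> log.add("released header table"));
            log.add("request served");
        } // end of request: the pool releases everything registered with it
        return log;
    }

    public static void main(String[] args) {
        System.out.println(handleRequest());
    }
}
```

Nothing in handleRequest frees anything explicitly, which is the whole appeal: the life cycle does the bookkeeping.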

This taught me a huge lesson: real programmers do not make code difficult for its own sake. There is no honor in tackling a complex problem in a complex way, when you could tackle a complex problem in a simple way.

Real programmers take the most difficult problems of a project and solve them at the outset, abstracting them behind an API. The rest of the codebase leverages the API, and is simpler and easier to maintain as a result.

When I learned servlets, I immediately appreciated a similar design win: servlets make dealing with threads almost entirely worry-free. For most purposes, you only have to follow one rule: keep no mutable instance or class variables in your servlets, and your servlets will be thread-safe. Why? The servlet container takes care of starting and stopping threads for you: already-started threads call into your servlets, and because one servlet instance is shared by all of those threads, only state kept in local variables stays private to each request.
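The rule can be illustrated without a servlet container at all. In this minimal sketch, StatelessHandler is a hypothetical stand-in for a servlet and the thread pool stands in for the container's request threads; neither is part of the servlet API. One handler instance is shared by every thread, and it stays safe because all of its working state lives in local variables.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class SharedHandlerDemo {
    // Stand-in for a servlet: no fields at all, so nothing is shared between calls.
    static final class StatelessHandler {
        String handle(String name) {
            StringBuilder reply = new StringBuilder(); // local: private to the calling thread
            reply.append("Hello, ").append(name);
            return reply.toString();
        }
    }

    // Stand-in for the container: already-started threads call into one shared handler.
    static List<String> serveAll(List<String> names) {
        final StatelessHandler handler = new StatelessHandler();
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (String name : names) {
                Callable<String> task = () -> handler.handle(name);
                futures.add(pool.submit(task));
            }
            List<String> replies = new ArrayList<>();
            for (Future<String> f : futures) {
                replies.add(f.get()); // collect replies in submission order
            }
            return replies;
        } catch (InterruptedException | ExecutionException e) {
            throw new RuntimeException(e);
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<String> names = new ArrayList<>();
        names.add("Ada");
        names.add("Grace");
        System.out.println(serveAll(names));
    }
}
```

Had the StringBuilder been a field of StatelessHandler instead of a local, concurrent calls could interleave their appends, which is exactly the kind of bug the rule avoids.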

It is generally accepted that there are a few books on Java threading that every good developer has to have read. But I also appreciate how knowing a technology does not mean using it at every opportunity.

If I embark on a project that has a lot of Java threading, and I can’t find an API that already solves my problem, I’ll break out my copy of Java Concurrency in Practice, build a decent threading API (much as the servlet API does), and leverage that throughout the rest of my project.

Similarly, I’ve also abandoned my “productivity guilt” using higher-level languages. If a project allows me to use a higher-level language that abstracts away garbage collection or pointers or threading, I’ll do it, as long as I can still meet the performance requirements.

I no longer think Real Programmers always do their own garbage collection, or always manage their own threads, or always do their own pointer math. They know how to; but they also know when to.

The best programmers ship working, easy-to-maintain code before their competitors do.

Comments on Why It’s Impossible to Become a Programming Expert 12 September 2009

Posted by manniwood in Uncategorized.

A couple of quotes from Justin James’ Why it’s impossible to become a programming expert rang true to me:

All too often, an expert programmer is the person who is adept at using a variety of reference tools and documentation to find out how to achieve their goals.

and

…if you were to grill [good programmers] on anything outside a narrow area … there is a really good chance that they will know where to get the answer from but not actually know the answer.

So true!

Steve Yegge makes a good case that programmers should know more math (and they should—I should, anyway ;-) and Paul Graham thinks programmers have a lot to learn from painters and other makers.

But Justin James is on to something when he says, in essence, good programmers are good researchers.

This is where I get to chuckle a bit, and blow my own horn, because I have a master of library science degree. I’m quite happy admitting that I have math envy of people with CS degrees (hey, we’ve all got our weak spots) but I’m a killer researcher. (I wonder if, one day, computer programming will be seen as the truly multi-disciplinary field that it really is? Topic for another post…)

Another thing that struck me was that Justin James said at the start of his article that he wanted to learn more Lisp but just didn’t have the time, but at the end of his article, bemoaned the fact that programming languages and APIs and frameworks have grown to the point where you don’t have the time to become an expert in anything any more.

(I remember a joke that says an expert is someone who knows more and more about less and less.)

Perhaps the modern-day key to being a good programmer is being an effective, targeted generalist. (And a killer researcher.)

For instance, if you don’t do your reading and follow the tech world in general, how are you supposed to evaluate which technologies you should spend your time learning, and which you can safely ignore?

At a higher level, the tech world always seems to follow patterns: data structures (trees, lists, maps, graphs), encapsulation (functions or objects), code generation (compilers, templating systems, IDEs, Lisp macros), etc. A lot of these problems get solved with different tools, but the basic problems and goals keep repeating. Even architectures come back in new guises: how different is browser/server to client/server, really?

If you are a good generic programmer, you probably know a lot of the higher-level terrain of computing, so you don’t lose your bearings getting closer to any particular part of the landscape.

One final observation: although computing sometimes seems cyclical (trends fall into disuse and become re-popularised, disguised as new innovations), other parts of it are following a reasonably discernible evolutionary path.

I’m always fond of quoting Paul Graham when he says programming languages are becoming more and more like Lisp, so if you want to arrive at the final destination of programming languages, learn Lisp today.

Or, as Phil Greenspun puts it in his tenth rule:

Any sufficiently complicated C or Fortran program contains an ad hoc informally-specified bug-ridden slow implementation of half of Common Lisp.

I don’t know if Justin James will ever get around to learning a lot of Lisp, but I bet he’s been doing a lot of reading about Lisp because he knows Greenspun’s tenth rule.

I’ve posted in previous blogs about wanting to know more Lisp myself. Justin James just gave me another reason: it will make me a more effective generalist.

jQuery Django REST 11 September 2009

Posted by manniwood in Uncategorized.

While working on a webapp, I had a situation where a handful of items of the same type were presented on the same page, and all were editable, for the user’s convenience.

In the old days, I would have wrapped the handful of items in a single form, encouraging the user to edit all of the items in one go, and submit the bunch all at once. I would either report back with a “your changes have been saved” page (really old school) or reload the page with a “your changes have been saved” notification, and the forms filled out with the new edits.

But with AJAX now a robust and well-supported technology (especially under the spiffy new frameworks), I figured why not

  • put each item in its own form
  • allow each item to be submitted separately
  • use AJAX to notify the user of form submission success/failure using the current page, so that no full form-submission/round-trip/page-load would be necessary?

And if I was going to write back-end code to support this sort of item update, why not see how RESTful I could make it?

First off, let me say that the solution I came up with is not fully RESTful; maybe only REST-like or REST-ish.

I’ve been doing some poking around on REST and what it really is, and it seems that it is an architecture more than a specification, which is kind of nice, because you don’t have to adopt all of it, especially if it gets in the way of solving your problem.

(Aside: How I Explained REST to My Wife is the best explanation I’ve read of REST.)

Anyway, let me first dispense with two things I did in my code that were decidedly not RESTful.

First, because this work was in the context of a larger application, I required a user to be logged in to perform the updates on the items. My understanding of REST is that it should be stateless. My takeaway is that login credentials would have to be provided with each individual action to enable true statelessness. I ignored this.

Second, true RESTful resources are accessible individually through regular URLs without appended key/value pairs. Hence, the ideal format of URLs to my items would have been


https://myserver.com/items/1234

https://myserver.com/items/2345

whereas I was going to still access my items like so:


https://myserver.com/itemEditRest/?id=1234

https://myserver.com/itemEditRest/?id=2345

And, in fact, even that is not true, because what I was really going to do was pass the form data in the body of the request (POST-like data, if you will) and not even in the URL.

I chose to do this out of a desire for simplicity: all the other attributes of the items were already going to be passed in the body of the request, and were already going to be parsed out and incorporated into my SQL update statement. So getting all of the attributes from one source (the request body) instead of two sources (ID from the URL but other attributes from request body) seemed a better idea, for my purposes.

This still left me with some interesting RESTful stuff to do. First off, because my items were being updated (rather than created or destroyed), I would use the recommended HTTP PUT method, because apparently, PUT is the HTTP method that RESTful services use (by convention) to allow updates on items.

My first goal was to see if I could send a PUT request through jQuery (and, by extension, the XMLHttpRequest facility provided by all the major browsers).

It turns out I could! Whenever one of my forms’ save buttons (all sharing the same class="saveButton") was clicked, I would serialise that form’s data and send it to my server in the request body using PUT:

$('.saveButton').click(function() {
    // code omitted here, but basically just determining
    // the ID of the form that the
    // submit button was in, and assigning it to formID
    var formDataString = $(formID).serialize();
    $.ajax( { url:  '/itemEditRest',
        type: 'PUT',
        data: formDataString,  // data is request body
        processData: false,
        dataType: 'text',  // could also be "json"
                     // but I'll parse return data manually
        success: function(data, status) {
            // code omitted here;
            // do whatever I do when I'm successful
        },
        error: function(xhr, text_status, error_thrown) {
            var status = xhr.status;
            if (status == '400') {
                // code omitted;
                // handle bad form input
            } else if (status == '410') {
                // code omitted;
                // handle item no longer there
            } else {
                // code omitted;
                // handle any other truly
                // unexpected problem
            }
        }
      });
});

Some notes on the above code:

How cool is $(formID).serialize()? My understanding of PUT is that anything can be in a PUT’s request body: a text file, a PDF, a PNG, anything. I happened to want to put my form data in there (rather like a POST request). jQuery’s .serialize() method will take a locator that resolves to a form tag, and serialise all of that form’s “foo=bar, one=two” form inputs to the expected “foo=bar&one=two” query string. Very nice!

One thing that’s interesting is how little information I decided to send back from my server. For instance, if the data were successfully saved, I figured I may as well just send back a response with HTTP status code 204 (which means ‘no content’). jQuery correctly takes any 2xx response and calls its success handler! Very nice.

I figured if my form data were bad (for instance, the user input text where a number should be) I’d use the HTTP return status of 400 (which means bad request). What’s really cool is that an HTTP 400 response can have a body, so I actually send back JSON in the response body. The JSON contains a map of form field names and their associated human-readable errors (such as { "transfer_amount": "Not a valid monetary value" }). But it turns out I ignore the JSON in practice, because I’m already doing JavaScript validation on the client side, so the bad fields are already being called out. (In essence, I’m just having the server protect itself from garbage data if a user hits the “Save” button, ignoring the bad input warnings.)

Likewise I use status code 410 for missing data.

Remaining 4xx status codes, and all 5xx status codes, can be handled by the “else” part of my error handler. Brilliant!

Of course, none of this is any good if my server side cannot produce this output. But with Django, I can.

First off, in my urls.py, I map my URL to the module and function that will handle my RESTful call. Note that every HTTP method sent to itemEditRest (GET, POST, PUT, DELETE) will go to item_edit.rest:

(r'^itemEditRest$', item_edit.rest),

So, here’s what item_edit.rest looks like:

# decorator that requires we be logged in;
# not very RESTful, I know...
@logon.require_login_rest
def rest(request):
    allowed = ['PUT']  # extend this
                       # as we add more
    if request.method == 'PUT':
        return restful_update(request)
    elif request.method == 'POST':
        #not implemented yet
        return HttpResponseNotAllowed(allowed)  # status 405
    elif request.method == 'GET':
        #not implemented yet
        return HttpResponseNotAllowed(allowed)  # status 405
    else:
        return HttpResponseNotAllowed(allowed)  # status 405

Unlike handling a GET/POST directly, as you normally would in Django, you instead look at the request method, and call the appropriate handler from there. (There’s even an HTTP response code for unsupported methods! Very cool…)

Here’s my handler for the PUT method:

(But first, at the top of my file, I have to handle some differences between Python 2.5 [which I use in production] and 2.6 [which I'm using in dev]):

local_parse_qsl = None
import urlparse
if hasattr(urlparse, 'parse_qsl'):
    local_parse_qsl = urlparse.parse_qsl  # Python v. >= 2.6
else:
    import cgi
    local_parse_qsl = cgi.parse_qsl  # Python v. < 2.6

try:
    import json  # Python version >= 2.6
except ImportError:
    import simplejson as json  # Python version < 2.6

OK, on to my handler function:

def restful_update(request):
    key_val_pairs = local_parse_qsl(request.raw_post_data)
    form_values = {}
    for kvp in key_val_pairs:
        form_values[kvp[0]] = kvp[1]

    error_messages = {}
    # validate function populates error_messages
    # dict with form field names as keys, and
    # human-readable error messages as values,
    # e.g. { 'phone': 'Not a valid phone number' }
    # If error_messages remains empty, it means
    # all form fields were good.
    validate(form_values, error_messages)

    if len(error_messages) != 0:
        # status 400 means bad request
        return HttpResponse(json.dumps(error_messages),
                            status=400,
                            mimetype='application/json')

    # code omitted; detect if item
    # or one of its parents is deleted
    if item['is_deleted']:
        # HttpResponseGone is like HttpResponse
        # but uses a 410 status code
        return HttpResponseGone(
            json.dumps({'details': 'Item parent deleted.'}),
            mimetype='application/json')

    app_config.SQL_MAP.execute_commit(
        file='items/update.pgsql',
        map=form_values)

    # status 204 means no content,
    # which is very useful
    # we just need to indicate successful save
    return HttpResponse(status=204)

There are a few interesting things to note. First, Django will not parse out the form values for you, because this is not a GET or a POST. This is what all the local_parse_qsl stuff is about above. (Note, too, that because I am not using duplicate form field names, I can confidently turn my form values into a regular dict, rather than resorting to a dict of lists, “just in case” one of my form values is a multiple value.)

Another interesting thing is that although I’m returning some nice JSONified info in the response bodies, my front-end currently does not bother using the data, using only the return codes to decide what to do next. On the other hand, any future RESTful client may find the info useful.

Finally, my return code is not a typical HTTP 200 result; it is an explicit body-less 204 result. It’s lean, it’s mean, and it’s all that’s required to indicate a successful save.

As I’ve been happy to admit, this isn’t completely RESTful, but this dabbling has taught me a lot, and I now know jQuery and Django give me the tools I need to create full-blown RESTful services in the future, should I need to do so.

Now… go and get some REST.

Git Branches and Remote Repositories 5 September 2009

Posted by manniwood in Uncategorized.
2 comments

I’ll be honest: it took me a while to wrap my head around Git, how it really worked, and how to use it effectively.

I’m going to put in a high recommendation for Scott Chacon’s excellent Git Internals, published by PeepCode. It’s a US$9.00 pdf, and it’s better than any other book or online resource I’ve ever read about Git.

Chacon does a better job than anybody else of showing you how Git works, so that by the time he gets into everyday tasks, why you are doing what you are doing makes perfect sense, and you’re just learning the commands and syntax to leverage Git’s capabilities.

When it comes to sharing Git branches between repositories, there’s still not a perfectly good, clear resource out there, so I’m going to share with you what I scribbled on my copy of the tear-out “Git Command Quick Reference” from Travis Swicegood’s Pragmatic Version Control Using Git. Hopefully, what I had to scribble on my quick reference card will appear in a second edition.

First off, let me describe what I want Git to do for me.

I want to have a clone of my repository on another geographically remote server. (For purposes of this discussion, I will assume that it has already been created, and is called myremote.)

I want to create a new branch in my local repository on my development machine at my desktop, and at the end of the work day, I want to push this branch out to my remote repository—perhaps for sharing, perhaps just for easy backup purposes.

Git does not push new local branches out by default, and I (and maybe it’s just me) find the documentation very uninformative on how to manage remote branches.

Here’s how.

Let’s say I’m at my desktop computer and I’m using my local Git repository.

Let’s say I make a new branch based on master:


git branch my-new-branch master

Now let’s say it’s the end of the work day, and my work in the branch my-new-branch is not complete. I don’t want to merge my-new-branch back into master, but I do want to push my-new-branch out to my remote repository for backup purposes.

Here’s how:


git push myremote my-new-branch

From now on, while working locally in my-new-branch, doing a


git push myremote

should do the right thing and push out changes I make in my-new-branch.

Now let’s say it’s a day or two later, and I’ve merged my-new-branch back into master. I know I no longer need my-new-branch, so I delete it locally:


git branch -d my-new-branch

Done.

Of course, my-new-branch still exists in my remote repository; I’ve only deleted it in my local repository.

I delete my remote copy of my-new-branch like so:


git push myremote :my-new-branch

Yes, that’s right: putting a colon in front of a branch and pushing it will delete it on the remote repository. Not very obvious, is it?

I may go into detail about pushing local tags out to remote repositories in another post.

Happy Gitting in the meantime.

Comments on Mark Pilgrim’s [XML/XHTML] Thought Experiment 4 September 2009

Posted by manniwood in Uncategorized.

I think Mark Pilgrim is spot on about the realities of XML (especially XHTML) today: browsers have been accepting malformed XHTML since forever, so generating correctly-formed XHTML is really difficult, because every XHTML generator has a long history of never having had to.

Here’s the thing, though: I really wish Pilgrim had emphasised this point: the only reason why we can’t/won’t/don’t generate valid XHTML today is because of bad decisions that were made in the beginning. Conversely, the reason why the C programming language is parsed in such a consistent way is because of decisions that were made in the beginning of C. There’s no rule that says everything everywhere has to be poorly specified from the start, poorly implemented from the start, become popular, and have to therefore stay poorly specified and poorly implemented because “it’s always been that way, and it’s too difficult to change now”.

If anything, XHTML should be a lesson in the benefits and pitfalls of non-rigorous format specification and parsing.

I would be horrified if the takeaway from Pilgrim’s article was that we should always design clients for all new formats to be as accepting of garbage as possible.

Nonsense!

One lesson should be: take a little more care the next time you write a specification.

I can think of a great example: JSON.

Can you point to trillions of lines of malformed legacy JSON out in the wild? Nope. Bad JSON doesn’t parse. But bad XHTML does. Why the difference?

JSON is simpler than XHTML. It’s easy to implement, and easy to parse.

Another lesson should be: design simpler markup languages.

Then again, sometimes, you need something with the complexity and expressiveness of XHTML.

Yet another lesson could be: if you’re going to design something with the complexity and expressiveness of XHTML, expect the benefits and pitfalls of XHTML.

In other words, perhaps there is an inverse relationship between the “richness” of a markup language’s feature set and how strict we can expect the parsers of that language to be.

I’ve been (re-)reading a lot of Paul Graham lately, and one thing he says that rings true to me is:

Everyone by now presumably knows about the danger of premature optimization. I think we should be just as worried about premature design—deciding too early what a program should do.

—Paul Graham, Hackers and Painters

If there’s one thing I think XML generally (not just XHTML in particular) suffers from, it is a culture of premature design, especially in the way that it is used.

I remember a horrible phase of the late 1990s and early 2000s where everything had to be stored in XML. Key/value pairs were stored in XML instead of .ini files; tabular data was stored as XML rather than as .csv or fixed width files; hierarchical data were stored as XML rather than as JSON; sometimes entire databases were stored in XML instead of in an RDBMS… it was a horror show.

George Orwell’s second rule in his essay “Politics and the English Language” was “Never use a long word where a short one will do.” I think the same applies to markup schemes: never use a complex one where a simple one will do.

Or: Don’t use XML unless you absolutely have to.

When it comes to the current state of browsers, though, I think there’s a catch: I think the current demands we put on our browsers require us to use XML, or something of equal complexity that would end up looking a lot like it. There is no .csv or JSON solution for the browser’s markup problem. We need something like XML.

With XHTML, that’s exactly what we have.

HTML5 and the abandonment of XHTML 2.0 actually improve the situation: there’s a tacit admission that the way we (mis)parse XHTML4 now is its own markup language that is neither valid SGML nor valid XML. HTML5 is not strictly SGML or XML, but it’s something of equal complexity that ended up looking a lot like it. ;-)

So I have at least two lessons from Pilgrim’s thought experiment:

1. We cannot turn back the clock and correctly implement XHTML as actual, correct XML. Much to its credit, HTML5 accepts this: it is neither SGML nor XML; it has become its own markup language that merely looks like its forebears. Much of the markup that was “wrong” under XHTML (even though it would parse anyway) is now “correct” under HTML5 (because, well, it parses anyway).

2. Friends don’t let friends use XML. As the evolution of (X)HTML(5) has shown, large, feature-rich markup languages are hard to get right, and although they carry many benefits, they also carry problems. So if you need to solve a problem with markup, really look to see if you can use JSON or .ini or .csv or even a fixed-width flat file before jumping on the XML bandwagon. XML is often overkill anyway, except when it’s not.

Another reason why I need to learn Lisp 31 August 2009

Posted by manniwood in Uncategorized.

Today’s Hacker News linked to Why Lisp macros are cool, a Perl perspective, and it made me want to learn Lisp. I just need to make the time…

One thing that Lisp doesn’t have going for it is its syntax. Lisp syntax is so uniform that it’s not particularly user-friendly. It’s macro-friendly (read the above link and you’ll see why) but not as human-readable as programming languages with more syntactic sugar.

Apparently, Perl 6 is going to have Lisp-like macros, even though Perl 6 will still have Perl-like syntax. It will be interesting playing with that when Perl 6 (Rakudo?) has a beta release. Is it possible that we could have it all? Richer, less-uniform syntax, and yet still the power of macros? That would be nice…