
Comments on Mark Pilgrim’s [XML/XHTML] Thought Experiment 4 September 2009

Posted by manniwood in Uncategorized.

I think Mark Pilgrim is spot on about the realities of XML (especially XHTML) today: browsers have been accepting malformed XHTML since forever, so generating correctly-formed XHTML is really difficult, because every XHTML generator has a long history of never having had to.
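To make the contrast concrete, here is a rough sketch using only Python's standard library. The fragment `MALFORMED` and the helper names are my own, made up for illustration. Also note that Python's `html.parser` is only a forgiving tokenizer, not a browser's tree builder, so this is an analogy for browser behaviour rather than a reproduction of it.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Hypothetical malformed fragment: <b> and <i> overlap, and <p> is never closed.
MALFORMED = "<p>Hello <b>bold <i>both</b> italic</i>"


class TagCollector(HTMLParser):
    """Records every start tag the forgiving parser reports."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def strict_xml_accepts(markup):
    """Return True only if the markup is well-formed XML."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def lenient_html_tags(markup):
    """Return the start tags seen by the forgiving HTML tokenizer."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags


print(strict_xml_accepts(MALFORMED))  # the XML parser refuses the fragment
print(lenient_html_tags(MALFORMED))   # the HTML tokenizer carries on regardless
```

The XML parser stops at the first mismatched tag; the HTML tokenizer reports every tag it sees and never complains.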

Here’s the thing, though: I really wish Pilgrim had emphasised this point: the only reason we can’t/won’t/don’t generate valid XHTML today is that bad decisions were made in the beginning. Conversely, the C programming language is parsed so consistently because of decisions that were made in the beginning of C. There’s no rule that says everything everywhere has to be poorly specified from the start, poorly implemented from the start, become popular, and therefore have to stay poorly specified and poorly implemented because “it’s always been that way, and it’s too difficult to change now”.

If anything, XHTML should be a lesson in the benefits and pitfalls of non-rigorous format specification and parsing.

I would be horrified if the takeaway from Pilgrim’s article was that we should always design clients for all new formats to be as accepting of garbage as possible.

Nonsense!

One lesson should be: take a little more care the next time you write a specification.

I can think of a great example: JSON.

Can you point to trillions of lines of malformed legacy JSON out in the wild? Nope. Bad JSON doesn’t parse. But bad XHTML does. Why the difference?

JSON is simpler than XHTML. It’s easy to implement, and easy to parse.
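A quick sketch of what “bad JSON doesn’t parse” looks like in practice, using Python's standard `json` module. The helper name `parses_as_json` and the sample strings are mine, chosen only for illustration.

```python
import json


def parses_as_json(text):
    """Return True if the text is accepted by Python's json module."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


# Hypothetical samples: one valid document and three near misses.
samples = [
    '{"name": "value"}',   # valid
    '{"name": "value",}',  # trailing comma
    "{'name': 'value'}",   # single quotes instead of double quotes
    '{"name": "value"',    # missing closing brace
]

for sample in samples:
    print(sample, parses_as_json(sample))
```

Only the first sample is accepted. None of the near misses gets “repaired” on the way in, so there is no incentive for generators to produce them.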

Another lesson should be: design simpler markup languages.

Then again, sometimes, you need something with the complexity and expressiveness of XHTML.

Yet another lesson could be: if you’re going to design something with the complexity and expressiveness of XHTML, expect the benefits and pitfalls of XHTML.

In other words, perhaps there is an inverse relationship between the “richness” of a markup language’s feature set and how strict we can expect its parsers to be: the richer the language, the more malformed input its parsers will have to tolerate.

I’ve been (re-)reading a lot of Paul Graham lately, and one thing he says that rings true to me is:

Everyone by now presumably knows about the danger of premature optimization. I think we should be just as worried about premature design—deciding too early what a program should do.

—Paul Graham, Hackers and Painters

If there’s one thing I think XML generally (not just XHTML in particular) suffers from, it is a culture of premature design, especially in the way that it is used.

I remember a horrible phase of the late 1990s and early 2000s where everything had to be stored in XML. Key/value pairs were stored in XML instead of .ini files; tabular data was stored as XML rather than as .csv or fixed width files; hierarchical data were stored as XML rather than as JSON; sometimes entire databases were stored in XML instead of in an RDBMS… it was a horror show.
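For the key/value case, here is a small sketch of the same settings stored both ways and read back with Python's standard library. The setting names, the XML layout, and the helper functions are all invented for illustration; real XML-based config files of that era varied widely.

```python
import configparser
import xml.etree.ElementTree as ET

# Hypothetical settings as an .ini file.
INI_TEXT = """\
[server]
host = example.org
port = 8080
"""

# The same hypothetical settings in one of many possible XML layouts.
XML_TEXT = """\
<config>
  <section name="server">
    <entry key="host">example.org</entry>
    <entry key="port">8080</entry>
  </section>
</config>
"""


def load_ini(text):
    """Read the .ini text into a dict of sections."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return {name: dict(parser[name]) for name in parser.sections()}


def load_xml(text):
    """Read the XML text into the same dict-of-sections shape."""
    root = ET.fromstring(text)
    return {
        section.get("name"): {
            entry.get("key"): entry.text for entry in section.findall("entry")
        }
        for section in root.findall("section")
    }


print(load_ini(INI_TEXT))
print(load_xml(XML_TEXT))
```

Both loaders produce the same dictionary, but the XML version needs more markup on disk and a layout that somebody had to invent and document first.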

George Orwell’s second rule in his essay “Politics and the English Language” was “Never use a long word where a short one will do.” I think the same applies to markup schemes: never use a complex one where a simple one will do.

Or: Don’t use XML unless you absolutely have to.

When it comes to the current state of browsers, though, I think there’s a catch: I think the current demands we put on our browsers require us to use XML, or something of equal complexity that would end up looking a lot like it. There is no .csv or JSON solution for the browser markup problem. We need something like XML.

With XHTML, that’s exactly what we have.

HTML5 and the abandonment of XHTML 2.0 actually improve the situation: there’s a tacit admission that the way we (mis)parse (X)HTML now is its own markup language that is neither valid SGML nor valid XML. HTML5 is not strictly SGML or XML, but it’s something of equal complexity that ended up looking a lot like it. ;-)

So I have at least two lessons from Pilgrim’s thought experiment:

1. We cannot turn back the clock and correctly implement XHTML as actual, correct XML. Much to its credit, HTML5 accepts this: it is neither SGML nor XML, and has become its own markup language that merely looks like its forebears. Much of the markup that was “wrong” under XHTML (even though it would parse anyway) is now “correct” under HTML5 (because, well, it parses anyway).

2. Friends don’t let friends use XML. As the evolution of (X)HTML(5) has shown, large, feature-rich markup languages are hard to get right, and although they carry many benefits, they also carry problems. So if you need to solve a problem with markup, really look to see if you can use JSON or .ini or .csv or even a fixed-width flat file before jumping on the XML bandwagon. XML is often overkill anyway, except when it’s not.
