Itpastorn's Webdev + Thinkpad update & maintenance blog: July 2009

I believe there is value in using XHTML syntax for documents sent to browsers as text/html. That seemed like the normal thing to do just a short time ago. Now it is increasingly being met with skepticism and even ridicule. I believe I've encountered every XHTML myth-busting argument there is from the good people in the WHAT WG cabal, but I still see value in using XHTML syntax. My arguments are not centered around forward compatibility, extension mechanisms or XSLT, even though they could apply on the server side. My reason for using XHTML syntax is to avoid errors, misunderstandings and rookie mistakes. Since I teach web development for a living, I encounter a lot of those.

The issues

HTML 5 is clarifying what XHTML really is. A lot of web sites use an XHTML doctype even though the code will be parsed just like ordinary HTML, i.e. they use false XHTML. True XHTML, on the other hand, brings draconian error handling (including for Unicode errors), altered CSS applicability, and the breaking of 99 % of all JavaScript code in existence, including all major libraries. And of course, Microsoft does not implement true XHTML at all. All of this will continue to make true XHTML a non-option for anything but experiments and edge cases for the foreseeable future.

Put in one sentence: Specifying an XHTML doctype does not make a document XHTML. As we know by now, the doctype serves only one purpose in the browser, and that is to trigger standards mode (assuming a good doctype). And if a browser treats XHTML 1.0 strict exactly the same as it treats HTML 4.01 strict, why not opt for the latter? And as HTML 5 has no mechanism for specifying XHTML other than the MIME type, what one might choose to call false XHTML is no longer possible to use.

On the other hand, XHTML syntax that previously was illegal in HTML, like explicitly closing void elements (br, hr, input, meta) with a trailing slash, is now fully permitted, although described as a transitional feature. Judging from the fact that most new sites still use a transitional doctype, we can safely assume that there is nothing stopping us from using an XHTML-like syntax even in HTML 5. I will proceed to argue that it is often even a very good choice.

Polyglot documents

Pages using such syntax have even got a recently popularized name: polyglot documents. Let us consider a few features a polyglot document will lack, being sent as false XHTML:

No namespace support, but HTML 5 will (probably) special case SVG and MathML, so the most sought after compound documents will still be possible to author.
No draconian errors. A feature most developers won't miss at all.
XML parsers that rely on the MIME-declaration will fail or refuse the document. There should be easy workarounds for that.
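To make the point about draconian errors concrete, here is a minimal Python sketch using only the standard library. It feeds the same two snippets to an XML parser and to a lenient HTML tokenizer. Python's parsers merely stand in for a browser's XML and HTML parsers, which is an assumption made for illustration, and the names POLYGLOT, HTML_ONLY, parses_as_xml and html_start_tags are hypothetical.

```python
from html.parser import HTMLParser
from xml.dom import minidom
from xml.parsers.expat import ExpatError

POLYGLOT = '<p>Line one<br />Line two</p>'   # well-formed XML, also accepted as HTML
HTML_ONLY = '<p>Line one<br>Line two</p>'    # accepted as HTML, not well-formed XML


class TagCollector(HTMLParser):
    """Collects start tag names; it does not raise on sloppy markup."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        # Also reached for "<br />", via the default handle_startendtag.
        self.tags.append(tag)


def parses_as_xml(markup):
    """Return True if the XML parser accepts the markup, False if it rejects it."""
    try:
        minidom.parseString(markup)
        return True
    except ExpatError:
        # The draconian part: one unclosed element and the whole document is refused.
        return False


def html_start_tags(markup):
    """Return the start tags seen by the lenient HTML tokenizer."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

With these definitions, parses_as_xml(POLYGLOT) is True and parses_as_xml(HTML_ONLY) is False, while html_start_tags reports the same p and br tags for both snippets.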

The list can be expanded. I just want to illustrate the fact that in the near future, any benefits of using XHTML syntax will only to a small degree be related to technical XML features. Indeed, when an HTML 5 lib has become widely available and integrated with all server side scripting languages, we are promised that all of today's server side XML tools will work equally well for non-polyglot HTML documents.

The continued benefit 1: XHTML syntax works like a coding convention

Every major project that involves more than one programmer will soon run into the need to follow agreed upon standards for things like indentation, placement of braces, usage (or non-usage) of a space between arguments in function calls and definitions, etc. A programmer who does not know or care about this will quickly see his contributions rejected and is probably considered non-employable.

Douglas Crockford has introduced coding conventions for JavaScript to many and his JSLint tool has options that will ensure that you follow them. HTML Tidy has options to clean up code, but other than that I know of no common code convention for HTML. I know that for many of my friends the beauty of XHTML has been the clean syntax. For reasons like the following:

Enforced lower case element and attribute names are easier on the eye than code that SHOUTS.
Enforcing quotation marks around attribute values makes errors easier to avoid or spot.
Explicit closing of elements like li, tr, th, td and p also makes code easier to read. Not having to guess the intention (was the implicit close intended or just sloppiness?) makes it easier to work with other people's code, or even code that I've written myself a while ago.

Let me elaborate on that second point. One particularly nasty problem occurs when attribute values are generated using server side scripts. Let's say that for a few iterations in an application's life a particular value is always a single word, like "login". Suddenly another developer (or you) decides that it is better to use two words, like "login name". And since the code that generates this value might be miles apart from the template that outputs the actual HTML, say in another file and module, one cannot take for granted that such a change will not break anything. In a sentence: Quoting attribute values makes code more robust!
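The failure described above can be reproduced in a few lines. This is a minimal Python sketch, assuming the standard library's html.parser as a stand-in for a browser's tokenizer. The render_input template, the value attribute and the parsed_attrs helper are hypothetical and only serve as illustration.

```python
from html import escape
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Records the attributes of every start tag it sees."""

    def __init__(self):
        super().__init__()
        self.attrs = []

    def handle_starttag(self, tag, attrs):
        self.attrs.append(dict(attrs))


def render_input(value, quoted):
    """Hypothetical template: emit an input element carrying the given value."""
    if quoted:
        # XHTML style: quoted (and escaped) value, explicitly closed element.
        return '<input type="text" value="{}" />'.format(escape(value, quote=True))
    # Sloppy style: works only as long as the value is a single word.
    return '<input type=text value={}>'.format(value)


def parsed_attrs(markup):
    """Return the attributes of the first start tag in the markup."""
    collector = AttrCollector()
    collector.feed(markup)
    collector.close()
    return collector.attrs[0]
```

With the unquoted template, the value "login name" is split in two: the parser sees value=login plus a stray, valueless attribute called name. With the quoted template the value arrives intact.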

Counterargument: You can do that equally well in regular HTML

The primary counterargument usually sounds like this: But you don't need XHTML syntax. Nothing is stopping you from using lower case tag names and attributes, or the optional closing tags, in regular HTML if you wish. True. But nothing is enforcing it either! And there is no tool available for testing it, at least none that I know of.

Nor does this counterargument apply to all aspects of my second argument.

The continued benefit 2: XHTML syntax is good for beginners!

A few years ago Lachlan Hunt wrote that XHTML is too hard for beginners. There are basically two things that make me take a stance that is exactly the opposite of his. The first is that he is talking about true XHTML, while I am talking about false XHTML. The second is that my main job for almost a decade has been to teach complete newbies about web development. I would not presume to know even half of what he knows about the minute details of markup languages. I dare say, however, that I know much more about teaching this stuff to students.

Coding conventions should be taught from day one!

Here is a rule for all teachers of all things coding. Demand that students should use strict coding conventions from day one. Do not think that it can be introduced at a later stage. Sloppy habits are formed from day one, and are much harder to get rid of once they have formed. Often when I look over the shoulder of a student, and see ghastly looking code, the student will say that it will be fixed later. Judging from nearly a decade of experience I know that it will not happen!

Bad habits get picked up from day one. They should therefore be punished from day one. Requiring XHTML syntax is one way to enforce such practice.

XHTML is the more pedagogic syntax

Requiring students to close void elements is a very effective way of teaching them which elements are indeed void. In the days when we used named anchors for intra-page navigation (as opposed to setting the id attribute on any element) I had students who forgot, or lazily omitted, the closing tag. Their pages worked just fine. The only downside was a more complex DOM, and that was not discernible in their pages. In fact, some of them believed that such an anchor was a void element. It even took a while for me to grasp that it was not. XHTML helped me understand that, and I've seen it help other people come to grips with similar issues.

Explicitly closing elements helps students understand the concept of semantics. You do not insert an li just to get a bullet point. Everything between the starting and ending tag is a list item. You do not insert a p-tag to get some space between your lines. All text between the starting and the ending tag is a paragraph. Etc. Being forced to constantly ask oneself where something should start and where it should end is good for learning.

I would also like to add that requiring XHTML is good for my mental health as a teacher, since a lot of errors will be caught by the students themselves during validation, and their code will be easier for me to read.

The true downsides of false XHTML

Nothing in life is so rosy as to have no downsides. Every medicine has its side effects. The two most immediate ones for newbies both involve scripting:

Tag names are sometimes uppercased in the DOM

Such things happen when an organization badly applies the biblical principle of the left hand not knowing what the right hand is doing. However, this confusion will exist no matter which syntax you choose. Using HTML syntax with all uppercase element names is not common practice, and it won't be long until the principle has to be explained to a student anyway.

Technically redundant closing tags will cause un-intuitive text nodes to appear in the DOM

Consider this code:

<ul>
  <li>foo</li>
  <li>bar</li>
</ul>

How many child nodes does the ul element have in the DOM? To a newbie (and Internet Explorer) it looks like 2. To the trained eye it is 5: three white-space only text nodes surround the two li elements. But once again new technology comes to the rescue. With new DOM-walking APIs we will (in the future) be able to ignore those white-space only text nodes in a consistent, cross-browser manner, using native API calls.
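The count can be checked outside the browser as well. The following is a minimal Python sketch that parses the list above with the standard library's xml.dom.minidom. That is an XML parser rather than a browser's HTML parser, an assumption made for illustration, but for this well-formed snippet it builds the same five child nodes. The variable names are hypothetical.

```python
from xml.dom import minidom

MARKUP = """<ul>
  <li>foo</li>
  <li>bar</li>
</ul>"""

ul = minidom.parseString(MARKUP).documentElement

# Every direct child of the ul element, text nodes included.
all_children = list(ul.childNodes)

# Only the li elements, which is what a newbie expects to find.
element_children = [
    node for node in all_children if node.nodeType == node.ELEMENT_NODE
]

# The white-space only text nodes produced by line breaks and indentation.
whitespace_only = [
    node
    for node in all_children
    if node.nodeType == node.TEXT_NODE and not node.data.strip()
]

print(len(all_children), len(element_children), len(whitespace_only))
```

Filtering by node type, as done by hand here, is the kind of chore that native element-only traversal would make unnecessary.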

Note that the first redundant white-space only text node would still be there even if we had omitted the closing tags. And let's say for a moment that a student had authored a script that relied upon there being no closing tags. How confused would he/she not be when it suddenly stopped working because someone started using closing tags? How brittle would such code not be in real use?

The future

Maybe there are some technical benefits of XHTML as well, but I hope to have shown that even without them, the syntax has clear benefits — enough to tell my students that they should use it, either as XHTML 1.0 strict or as HTML 5 polyglot. So where do I want to go from here? This is my wishlist for the future:

The (X)HTML 5 spec should be strictly serialization and syntax neutral. XHTML syntax is not only something that should be allowed for transitional reasons.
I would love to have an (X)HTML coding convention tool that could check for even more details than the current validator does. Things like indentation, or allowing the shortened form for boolean attributes while enforcing quotation marks for all non-boolean ones, ought to be testable. Such a tool might even make me think that better alternatives than polyglot documents can arise.
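As a rough idea of what such a convention tool could look like, here is a minimal Python sketch built on the standard library's html.parser. It is not an existing tool. The ConventionChecker class and the check function are hypothetical, and only two conventions are covered: lower case names on start tags, and quotation marks around attribute values, with shortened boolean attributes allowed. Indentation, end tags and attribute name case are left out.

```python
import re
from html.parser import HTMLParser

QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')
# Once quoted values are blanked out, an "=" followed by anything but a
# quotation mark reveals an unquoted attribute value.
UNQUOTED = re.compile(r'=\s*[^\s"\'>]')
TAG_NAME = re.compile(r'<([^\s/>]+)')


class ConventionChecker(HTMLParser):
    """Collects (line, message) pairs for start tags that break the conventions."""

    def __init__(self):
        super().__init__()
        self.problems = []

    def handle_starttag(self, tag, attrs):
        # The parser lowercases names, so inspect the raw source text instead.
        raw = self.get_starttag_text()
        line, _ = self.getpos()
        name = TAG_NAME.match(raw).group(1)
        if name != name.lower():
            self.problems.append((line, "upper case tag name: " + name))
        if UNQUOTED.search(QUOTED.sub('""', raw)):
            self.problems.append((line, "unquoted attribute value in " + raw))


def check(markup):
    """Return a list of (line number, message) tuples for the given markup."""
    checker = ConventionChecker()
    checker.feed(markup)
    checker.close()
    return checker.problems
```

Calling check on a page returns an empty list when every start tag follows both conventions. A start tag such as <LI class=item> is reported twice, once for the tag name and once for the unquoted value.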

XHTML 2 is the Itanium of web technologies. You remember Intel and HP celebrating the Itanium architecture as the new super-duper technology, that should leave all RISC-based competitors in the dust. VLIW (re-branded as EPIC) was touted as a disruptive innovation. The future was Itanium.

But it was not! Forget the marketing speak from Intel and its only Itanium customer worthy of being mentioned, HP. Itanium has flopped. Yes, the RISC architectures are struggling. MIPS no longer powers high end servers from Silicon Graphics. SPARC is losing market share. Only IBM Power PC seems to be holding its ground in the server space. But the x86 architecture, which was supposed to die, is reigning more dominantly than ever.

Backwards compatibility is everything

Being backwards compatible is not only a nice feature. It is a prerequisite that simply seems non-negotiable. Let's look at a few successful products to get an idea.

Windows 95 and DOS-based games

Before Windows 95 all high end games were run from the DOS-prompt. Windows 3.x was only a nuisance for game developers. In order to achieve the highest possible speeds they often tweaked the hardware interaction in every possible way. Getting these games to run under Windows 95 proved a challenge, to say the least. Microsoft solved this by special-casing game after game. The operating system would recognize a particular piece of software, know that it required special handling and adjust accordingly. Even in ways that broke protocol.

Punch cards

When were punch cards invented? 1725. When did they become a big success? 1890. When were they surpassed by other technologies? In the early 1960s. When did IBM drop support for punch cards from their operating systems? Not for another 30 years. I would not even be surprised if it was possible to attach a punch card reader to a brand new z-series computer today, and actually have it work.

XHTML 2 was Dead Pre Arrival

It did not die as a markup language for the web. It never lived. The day the decision was made not to be backwards compatible, it was doomed. It never really mattered that it had every conceivable shiny new feature. Technical merits are simply not enough. Therefore it simply does not matter how much you shout about them, or how much you disdain HTML 5 for being, due to its legacy, awful and badly designed.

Yes, I use PHP too, and no matter how much one shouts about the relative technical merits of Ruby or Python, PHP does not seem to grow weaker. Ugliness just is not that big a factor. A strong user base, re-use of code and know-how, the ability to find advice and support: such things matter. Sociology always trumps technology.

The future for XHTML 2

I have friends who prefer XHTML 2 to DocBook for data storage on the server. Judging from Steven Pemberton's thoughts about the future, he seems to believe that this is a viable niche and that XHTML 2 can make a comeback that way. And why not? In a controlled environment the improved semantics of XHTML 2 over legacy HTML may provide significant benefits. Pushing XHTML 2 as a progressive enhancement, server side, might work.

Encouraged by none other than Ian Hickson himself, XHTML 2 will continue to be developed in a working group outside of the W3C. Can it make a comeback? Once upon a time the W3C decided to axe HTML. Many developers, myself included, thought it was the end of HTML. I was wrong. I therefore will not say good bye to XHTML 2, but au revoir.


Sunday, July 19, 2009

The value of false XHTML