Saturday, June 27, 2009

Validation and doctype myths and (inconvenient) truths

People like me who support web standards often talk about validation and doctypes. Yet, even within our camp, there seems to be a lot of confusion. I will try to address a few misconceptions, especially a few new ones that have come out of the ongoing debate about HTML 5, accessibility and RDFa.

Background: How a browser processes markup

Most people tend to think in two stages. There is markup and there is a rendered page on the screen. In reality this is a more complex process.

First comes parsing. The purpose of this is to convert the markup into a representation inside the program that is, so to speak, "understandable" to the computer and usable for rendering on the screen as well as by assistive technologies, like a screen reader. This internal representation is also accessible to manipulation through the DOM API. Indeed, it is often talked about as the internal DOM representation of the document. From now on I will simply call it the DOM. Just bear in mind that, in the rest of this article, I refer to this internal functionality as a whole, and not to the API.

Parsing in itself is a multi-step process. It involves the simple mapping of the HTML markup, but also the applying of CSS, handling events, etc.

The rendering (painting on the screen) is in turn made from the DOM, as is the exposure of the page to assistive technologies.

Personally I've found it helpful to think about this process in three stages:

  1. The code (HTML, CSS, etc) arrives as a network stream.
  2. The DOM, constructed by the parsing.
  3. The perceivable results (on the screen, in the speakers, in the braille terminal, on paper from print, etc.)

With this knowledge we can formulate the purpose of validation:

  • Validation provides a secure mapping between the markup and the DOM. You as a developer know what you are going to get.
  • Validation provides easier development, including easier error detection and better maintainability through cleaner and consistent code.

Browsers have always had mechanisms to handle badly written HTML. Indeed, they have reverse engineered each other in this regard so much that invalid, tag-soup, piece-of-shit code usually renders just fine. And the HTML 5 spec goes to great lengths to explain just how such code should be handled by a browser. If one has supreme knowledge about every little detail of how browsers work internally, one can therefore get predictable results even with code that does not validate at all.

Web developers should, however, not be required to have such in-depth knowledge. Validation is a tool that helps us stay within safe boundaries. Stepping outside them might work, but it will always lead to extra work in the end.

With this in mind we can formulate a few spin-off effects of validation, such as:

  • Validation is an act of courtesy towards other people who one day might be charged with taking over your code base, or indeed only be asked to take a look on a mailing list or forum, to help you solve a problem.
  • Validation is a mark of professionalism, a sign that you care about code quality.

But the main effect is that validation is a tool that helps you as a developer get to your desired results. I always tell my students to validate early and validate often. After every major change to the code, re-run the validator!

Or, to put it differently: valid code is not the end goal. It is a very good tool for reaching the end goals of predictability, consistency, maintainability and effectiveness.

Myth: You must validate in order to be accessible

Wrong. Validation is advisable, of course, but in a purely technical sense it is not a requirement. It is perfectly possible to write unsemantic code, without proper hooks for assistive technologies, that still validates. And it is perfectly possible to do it the other way round, although validation, especially against a strict doctype, will be a help towards accessible web sites.

Validation can check for the presence of accessibility features, such as alt-attributes, table column headings, etc. It can never ensure that the contents within those attributes and elements have been written in a usable way. Valid code is a good starting point for accessibility, not a guarantee of accessibility.
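To make the limit concrete, here is a minimal sketch of a presence check. It is not how any real validator works: `findImagesMissingAlt` is a hypothetical helper, the regular expressions are deliberately crude, and the sample markup is invented for illustration.

```javascript
// Hypothetical helper: a crude presence check, not a real validator.
// It lists every <img> tag that carries no alt attribute at all.
function findImagesMissingAlt(markup) {
  const imgTags = markup.match(/<img\b[^>]*>/gi) || [];
  // Keep only the tags where no "alt=" follows a whitespace character.
  return imgTags.filter((tag) => !/\salt\s*=/i.test(tag));
}

// Invented sample markup: one image without alt, one with a useless alt,
// and one with a meaningful alt.
const sample =
  '<img src="chart.png">' +
  '<img src="logo.png" alt="image">' +
  '<img src="team.jpg" alt="The five members of the support team">';

const missing = findImagesMissingAlt(sample);
console.log(missing);
```

Only the first image is reported. The second one passes, although alt="image" tells a screen reader user nothing, and that judgement is exactly what an automated check cannot make.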

This is especially true for the non-strict versions of HTML 4.01 and XHTML 1.0. These 4 (2 * 2) (X)HTML versions contain a lot of elements and attributes that should not be used. The validator will complain that they are deprecated, but it will still give the page a green light. Anything but strict doctypes or conformant HTML 5 (see below) should have been verboten long ago for any professional web developer.

Myth: There is no penalty for not validating

There is one camp, primarily accessibility experts, who would like browsers not to render pages that contain markup errors, or at least give clear warnings that they do. Recently it has been advocated that e.g. the HTML 5 canvas element should not render anything to sighted users unless there is a fallback for the blind. They also have argued that any refusal to incorporate ARIA or RDFa into HTML 5 can simply be overridden because validation does not matter.

In this context it is true. As long as browsers and assistive technologies support ARIA, and Google, Yahoo and other search providers honour RDFa, it will work. User agent behaviour is always the bottom line, the true de facto standard in practice.

This takes us back to my main theme for this article. Things might work when using code that is not valid. But you can be much more confident that they will if it validates. In my experience, the most common errors that I catch using a validator are spelling errors in tag and attribute names. Such errors may wreck your page in many ways, maybe even in unseen ways because you have misspelled an ARIA attribute. And such errors are easier to spot if they are not hidden behind hundreds of other validation errors that by themselves actually are benign.
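The misspelled ARIA attribute is a good example of an error that fails silently. The sketch below shows the kind of check a validator with ARIA support performs. The helper name `findUnknownAria` and the short attribute list are assumptions made for illustration, whereas a real validator works from the complete list in the specification.

```javascript
// A deliberately short, illustrative subset of ARIA attribute names.
const KNOWN_ARIA = new Set([
  "aria-label",
  "aria-labelledby",
  "aria-describedby",
  "aria-hidden",
  "aria-expanded",
  "aria-required",
]);

// Hypothetical helper: returns the aria-* names that are not in the set.
// Attributes without the aria- prefix are ignored by this sketch.
function findUnknownAria(attributeNames) {
  return attributeNames.filter(
    (name) => name.startsWith("aria-") && !KNOWN_ARIA.has(name)
  );
}

// "aria-labeledby" (one l) is a common misspelling of "aria-labelledby".
const suspects = findUnknownAria(["href", "aria-labeledby", "aria-hidden"]);
console.log(suspects);
```

A browser will not complain about the misspelled attribute. It simply never reaches the assistive technology, which is why a check like this is worth having.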

Let me repeat that. Many validation errors are benign: An un-encoded &, an unnecessary closing tag, and, with HTML 4.01, forgetting to specify the type attribute for a script. In themselves these errors will not harm the execution of the parser, the construction of the DOM or the rendering of the page on the screen and exposure of its contents to assistive technologies.

Actually, one may find oneself in a position where it is beneficial not to be valid. This applies both to HTML and CSS. New features, often available only in their experimental first forms, can be quite useful and add to a page's usability, esthetics or accessibility.

However, the problem with benign errors is that they often obscure the harmful ones. A page that contains several hundred validation errors is much harder to debug than one that only has a few. It is therefore imperative that one chooses the best possible validator for one's purpose. E.g. validators can be configured to ignore vendor-specific CSS rules or to include ARIA.

Actually, the main reason I use an HTML 5 doctype for all my new sites, and gradually change my own sites to do the same, is so that I can use a validator that supports these new technologies.

Myth: You can use JavaScript to cheat the validator

Technically this is not a myth. Yes, you can. Most validators will work on the raw HTML and will not process any scripts. However, the purpose of validation is not validation. The purpose is predictability, consistency, maintainability and effectiveness. Inserting or altering markup through scripts, or to put it better, making changes to the DOM, should be done in such a way as not to jeopardize the very reasons we wanted our code to be valid in the first place.

This is just like school. Cheating may get you the grade, but you won't get the benefit of the knowledge. Cheating the validator means that you render the validation practically useless.

There are tools available that will let you see the generated HTML, that is, the HTML as the browser has understood it to be, after parsing and after scripts have been run on the page. This is sort of like going back from step 2 above to the first step. That code should be equally valid, and not differ from the original input in any unexpected way.
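Without a dedicated tool, a rough version of the generated HTML can be read straight from the DOM. This is a minimal sketch, and the helper name `generatedMarkup` is my own. It relies only on the standard `documentElement` and `outerHTML` properties.

```javascript
// Hypothetical helper: serialise the DOM as the browser currently holds it.
// `doc` is expected to be a Document, or any object exposing the same
// documentElement.outerHTML property.
function generatedMarkup(doc) {
  return doc.documentElement.outerHTML;
}

// Only run against the real page when a browser document exists.
if (typeof document !== "undefined") {
  console.log(generatedMarkup(document));
}
```

Note that the result starts at the html element, so the doctype is not included and has to be added back before the output is pasted into a validator.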

Follow up question: Is it OK to use JavaScript instead of the target attribute?

One of the most common uses of JavaScript to cheat the validator is to replace this:

<a href="http://..." target="_blank">linktext</a>

With this:

<a href="http://...">linktext</a>
<script>
// jQuery code that attaches a click handler to all presumed external links
// (assumes jQuery has been loaded earlier in the page)
$('a[href^="http"]').click(function() {
    window.open(this.href);
    return false;
});
</script>

I believe this is good practice. Not because we are cheating the validator, but because we are using DOM scripting to handle behaviour. We are using the right tool for the right job.

Follow up question: Is it OK to use JavaScript to defeat browser bugs?

My first answer is: is that really necessary? For example, it is quite possible to have the object element work in all browsers, including Internet Explorer 6, to serve Flash or Java applets. Most JavaScript techniques to include Flash on a page have been obtrusive and have not degraded gracefully. And they have used outdated browser sniffing, making them likely to break as newer browser versions or alternative browsers like Chrome get released. Using unobtrusive DOM scripting to enhance the plugin experience is of course OK.

There are, however, bugs and gaps in the support for modern standards in some browsers (yes, we all know I am primarily talking about MSIE now) that can only be alleviated through scripting. Choose carefully what scripts to use, though! And remember that there are many users who might not get your scripts, perhaps because a corporate proxy has stripped out all content from your script elements.

Does the doctype matter?

Tied in to the question about validation is the choice of doctype. It serves two purposes:

  1. It declares what vocabulary a developer intends to use.
  2. It is the main way in which all sane browsers choose their rendering mode. I defer this topic to Henri Sivonen, while asking people to note that Internet Explorer 8 is not sane in any way...

As regards the first point, the doctype is of value to other people with whom you co-operate: it is a social contract between all developers in the team. But its main value is helping the validator see what rules to validate against.

Except for rendering mode switching, the doctype does not in any really discernible way affect how the browser will treat the functionality of your markup. E.g. even if you declare a strict doctype, it will happily honour elements like <font>, attributes like target or even the marquee element! To a (sane) browser, there is no such thing as different versions of HTML. XHTML 1.0 strict, HTML 4.01 frameset and HTML 5 are all the same. Any content sent with the MIME declaration text/html is treated the same.
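The rendering mode is the one doctype effect that a page can observe for itself, through the standard `document.compatMode` property. The sketch below wraps it in a small helper. The name `describeRenderingMode` is an assumption of mine, not something from a library.

```javascript
// Hypothetical helper around the standard document.compatMode property.
// "BackCompat" means quirks mode; "CSS1Compat" covers standards mode and
// almost-standards mode alike.
function describeRenderingMode(doc) {
  return doc.compatMode === "BackCompat" ? "quirks" : "standards";
}

// Only run against the real page when a browser document exists.
if (typeof document !== "undefined") {
  console.log(describeRenderingMode(document));
}
```

Nothing comparable exists for the "version" of HTML in use, which fits the point above: the browser keeps track of a rendering mode, not of a vocabulary.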

By dropping the DTD, HTML 5 makes explicit what so far has been implicit. There are different editions of the HTML standard, but in practice (inside the UA) there is but one HTML.

Even if you have used XHTML syntax and have an XHTML DTD, the markup will still be parsed as usual. To trigger XML parsing, one must change the MIME declaration, which of course will fail miserably with Internet Explorer and thus never happens except for niche web sites. It has also been conclusively proven that there is no benefit in switching modes depending on the UA. The speed difference between HTML and true XHTML is first of all negligible, and it only really concerns the first step (parsing), which from a performance perspective takes only a fraction of the time compared to what's going on in step 2 (the DOM) and step 3 (rendering). (There may be good reasons to use XHTML, but they are related to workflow, tools and data exchange.)
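The decision can be sketched as a function of the Content-Type header alone. This is a simplified illustration, not browser source code. `parserForContentType` is a hypothetical name, and other XML media types that browsers also hand to the XML parser are left out.

```javascript
// Hypothetical sketch: which parser a browser picks for a response.
// The doctype and the syntax of the markup are not inputs at all.
function parserForContentType(contentType) {
  // Drop parameters such as "; charset=utf-8" and normalise the case.
  const mime = contentType.split(";")[0].trim().toLowerCase();
  if (mime === "application/xhtml+xml") return "XML parser";
  if (mime === "text/html") return "HTML parser";
  return "not handled in this sketch";
}

console.log(parserForContentType("text/html; charset=utf-8"));
console.log(parserForContentType("application/xhtml+xml"));
```

A document with an XHTML 1.0 strict doctype that is served as text/html takes the first branch, just like any other HTML page.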

Myth: HTML 5 re-introduces bad markup

OK, this is not exactly a validation or doctype myth, but it is related. The short answer is of course that HTML 5 does not force bad markup down anyone's throat. All good practices are still doable.

This myth started in the early days of HTML 5, when authors started to look at the spec and saw monstrosities like <font>, or even <marquee>! The dual purpose of writing a spec on how to handle bad markup, part of the browser requirements, together with a spec about how to produce good markup quickly turned into a communications debacle. The core team behind HTML 5 is perhaps not the ones with the best people skills in the world. Communication has broken down repeatedly. (On the other hand they are unlikely to change any bad behaviour, perceived or factual, through being bashed.)

In practice, though, HTML 5 has conformance requirements (basically a new term for validity) that are even more far-reaching than the ones in HTML 4.01 strict or XHTML 1.0 strict. Being conformant should ensure that you are closer to adhering to best practice and accessibility principles. Validation is thus not less important; it is more important than ever. Just remember that validation never really has been about anything else but getting a predictable DOM from your markup and encouraging best practices. The conformance criteria in HTML 5 are being written in such a way as to take your code even further towards these lofty goals.

This is one of the reasons there is no DTD for HTML 5. These new rules for validity are sometimes so precise, or require such processing logic, that they cannot be expressed through a DTD. As a side effect we get a doctype one actually can learn by heart:

<!DOCTYPE html>

That will make my students very glad!

7 comments:

  1. > Building the DOM is the second step.

    Building the DOM is what the parser does, so should probably be part of the first step. :-)

    > It involves the simple mapping of the HTML markup, but also the applying of CSS, handling events, etc.

    Applying of CSS is part of the rendering step (which itself is a series of steps: parse the style sheet, match elements against selectors, build up the rendering tree based on the value of 'display' for each element, lay out the boxes of the rendering tree, paint those boxes on the screen).

    On effects of validation: it might be worth drawing the analogy to spell-checking here. You run a spell-checker to find typos and grammatical mistakes. A validator will similarly find typos and other mistakes.

    > No you must not.

    Swenglish? :-) Certainly it is allowed to validate.

  2. Zcorpan: I was trying to explain this from a web developer perspective. Simplification was necessary. Perhaps I am using the word "parsing" too narrowly. You are describing it as involving everything, including painting "boxes on the screen"!

    I am basing my thoughts primarily on remarks by David Baron (in a talk at Google) and Henri Sivonen (why HTML is just as speedy as XHTML: "Parsing is cheap"). Right now Henri is building an HTML 5 parser for Mozilla, and as I comprehend his work (not having looked at the source code), it corresponds to my first step above.

    If I should use another word for the actual process of "reading the code", what would you recommend?

    Swenglish? Hoppsan! Fixed.

  3. This comment is from Dean Edridge and was sent to me by e-mail:

    Good article, just one minor correction. There is no such thing as the "HTML5 doctype", a lot of people refer to it as the HTML5 doctype, but it's really just a generic doctype for (X)HTML. This new doctype for (X)HTML that was introduced in the (X)HTML5 spec "<!DOCTYPE html>", is simply called the "HTML doctype", (or I guess you could call it the "(X)HTML doctype") as it is not exclusively for just one language like doctypes of the past were (such as HTML4 Strict, XHTML1 Transitional). The "HTML doctype" will be used for all HTML and XHTML flavors from now on, such as: HTML6, XHTML6, HTML7, XHTML7 etc.

  4. My point was that your step 1 and step 2 are really the same step: the parser builds the DOM by reading the network stream. The second step would be rendering the DOM+CSS.

  5. OK, I have re-phrased my post, to make it more accurate. Many thanks for your input.

  6. There are certain elements that display differently by default depending on the doctype, such as the line height of an image element in XHTML Transitional or XHTML strict.

  7. @Anonymous:

    My point exactly. The doctype does not affect anything of importance except styling in a browser.
