Thursday, September 10, 2009

Pedagogic validation of HTML

I have been trying to make HTML5 better for education by participating in the HTML5 effort at WHATWG for a few years. Recently I also joined the W3C HTML5 Working Group. One of the things that might come out of this effort is an option in the HTML5 validator for pedagogic validation. I will try to explain what such an option should check for, and how it will be beneficial to teaching web development.

I have previously written about what I call the value of false XHTML. Now I have been joined by the HTML5 superfriends, who request an option for easy polyglot validation. Henri Sivonen, who no doubt is a parsing rules genius and exceptionally knowledgeable, basically replied that such a thing is very hard, and contains so many minute details, that it might be of no value. Did you know that a line feed immediately following the starting pre-tag is forbidden, when using XML parsing rules? I most certainly did not. (And I am still a bit unsure if I got it right...)

I usually also skip the tbody tags when I do tables, but they are nevertheless automatically inserted into the DOM in HTML, just like head and body are. That will not work for a true polyglot document, since in XHTML the DOM will not be the same. Henri also suggests that using XHTML syntax might lead to a false understanding, such as the belief that <script /> is possible to write in normal HTML.

I commented on Zeldman's blog that from my perspective these are non-issues. Let me tell you about the everyday problems I encounter as a teacher of markup languages, in addition to what a normal validation would reveal:

  • Students forgetting to quote attribute values, even though they contain multiple words: <img alt=My dog>
  • Students messing up the balance of the quotation marks: <img src="foo.jpg alt=My dog">
  • Students messing up the DOM since they do not (yet) know all the rules for when an element is implicitly closed by another element's start tag.
  • Students using document.write (and eval) in their scripts - yes I explicitly tell them not to, but they don't always listen, do they?

For reasons like these I tell them to use XHTML syntax today, since that will catch most of these errors.

document.write and eval are outside the scope of HTML validation, but ECMAScript 5th edition strict mode and JSLint will take care of most such problems. What I would like is for HTML validation to have similar checks, checks that enforce good habits and help to avoid rookie mistakes.

What's the problem with true polyglot documents then?

  • The minutiae. The stuff Henri Sivonen rightly reminds us of. The stuff that should be saved for a later class, since it is so highly technical and frankly will scare some away from coding by hand.
  • The boolean attributes. Some HTML5 form elements may have a lot of those! Allowing them to exist in their shortened form would mean less markup to type (= happier students, less bandwidth required).

As you see, I do share some of the concerns about XHTML syntax, but today the benefits clearly outweigh the drawbacks from a pedagogic perspective. This naturally leads me to the conclusion that there should be some middle ground, a way to specify a pedagogic profile for validation - and voila - Sam Ruby has started to work on such a feature!

I will now explain what features such additional checks should have, according to my experience, and how they are beneficial. I will grade my suggestions from 1 to 5, in rising order of importance.

Avoid implicit rules

But check that what's explicit complies with them. As a newbie, one should see a 1:1 correspondence between the DOM and the markup.

All elements should be explicit

This would mean that:

  • Root-element (html), head and body tags should not be optional (grade 5).
  • tbody tags should not be optional (grade 1).

In order to avoid making classes boring, I usually teach HTML together with some CSS from day one. I do not teach HTML first for a few weeks, and then I teach CSS. Besides being boooring, this will lead to some students starting to use presentational elements and attributes, because they will really want to have design features from day one.

CSS rules (usually) apply to the DOM, and not the markup. For example, a table#foo > tr selector will not match anything, even if there are no tbody tags in the code, because the parser inserts a tbody element anyway. The student would think that a tr is a child element of table, since implicitly added elements might not be taught until later. Tables are, however, rarely used early on, since they are not used for layout when I teach CSS in conjunction with HTML. I can therefore live with this particular check not being implemented, hence it is graded at 1 only.

Explicitly grouping metadata about a document in the head section, and specifically being able to put some scripts in the head and other scripts in the body, is however essential. In HTML5 we might also see scoped style elements, which makes the use of explicit head and body tags even more important.
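A check of this kind can be sketched with Python's standard html.parser module, which reports only the tags that literally appear in the source and does not add implied ones. This is a minimal sketch, not how any existing validator works; the names REQUIRED_TAGS, ExplicitTagCollector and missing_structure_tags are made up for the illustration.

```python
from html.parser import HTMLParser

# Tags that the grade 5 rule wants written out explicitly.
REQUIRED_TAGS = ("html", "head", "body")


class ExplicitTagCollector(HTMLParser):
    """Records which start and end tags literally appear in the markup."""

    def __init__(self):
        super().__init__()
        self.start_tags = set()
        self.end_tags = set()

    def handle_starttag(self, tag, attrs):
        self.start_tags.add(tag)

    def handle_endtag(self, tag):
        self.end_tags.add(tag)


def missing_structure_tags(markup):
    """Return one message per required start or end tag absent from the source."""
    collector = ExplicitTagCollector()
    collector.feed(markup)
    collector.close()
    problems = []
    for tag in REQUIRED_TAGS:
        if tag not in collector.start_tags:
            problems.append(f"missing explicit <{tag}> start tag")
        if tag not in collector.end_tags:
            problems.append(f"missing explicit </{tag}> end tag")
    return problems
```

Adding "tbody" to REQUIRED_TAGS would not be enough for the grade 1 rule, since a tbody should only be demanded when a table is present; that case would need a little more bookkeeping.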

All elements must be explicitly closed
  • All normal elements must have closing tags. (grade 5)
  • Void elements must have a trailing slash. (grade 3)

The use case for closing tags is really simple. Besides making things easier to understand, it also alerts students about implicitly closed elements. If they would try to include a table in a paragraph the validator would complain when it encounters the closing p-tag. For such reasons I tell my students that closing tags are mandatory, and I want a simple way to enforce that behavior.
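The table-in-a-paragraph case can be sketched as follows, again with Python's html.parser. The sketch only tracks whether a p element is open; it assumes standards mode, uses an abbreviated list of the start tags that close a paragraph, and ignores the other ways a p can end (for example when its parent is closed). All names in it are hypothetical.

```python
from html.parser import HTMLParser

# A subset of the start tags that implicitly close an open <p> in HTML.
# Assumption: abbreviated for illustration; the HTML specification
# defines the complete set.
CLOSES_P = {"address", "blockquote", "div", "dl", "fieldset", "form",
            "h1", "h2", "h3", "h4", "h5", "h6", "hr", "ol", "p", "pre",
            "table", "ul"}


class ImplicitParagraphCheck(HTMLParser):
    """Flags <p> elements that get closed implicitly by another start tag."""

    def __init__(self):
        super().__init__()
        self.open_p_line = None  # line of the currently open <p>, if any
        self.messages = []

    def handle_starttag(self, tag, attrs):
        line, _ = self.getpos()
        if tag in CLOSES_P and self.open_p_line is not None:
            self.messages.append(
                f"line {line}: <{tag}> implicitly closes the <p> "
                f"opened on line {self.open_p_line}")
            self.open_p_line = None
        if tag == "p":
            self.open_p_line = line

    def handle_endtag(self, tag):
        if tag != "p":
            return
        line, _ = self.getpos()
        if self.open_p_line is None:
            self.messages.append(
                f"line {line}: </p> has no open <p> to close")
        self.open_p_line = None


def find_implicit_p_closures(markup):
    """Return the messages produced for the given markup."""
    checker = ImplicitParagraphCheck()
    checker.feed(markup)
    checker.close()
    return checker.messages
```

For a table placed inside a paragraph, the sketch reports both the table start tag that ends the paragraph and the stray closing p-tag that follows it.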

All non-shortened attributes must have their values quoted. (grade 5)

As I've said above, this is a very common error. In the worst case scenario it might lead to very unexpected results. Look at this example, where the value attribute is supposed to contain the words Login name:

<input type=text value=Login name name=login>

This code snippet actually produces a DOM as if the markup had read:

<input type="text" value="Login" name="">

Arguably, enforcing quotation marks also leads to better readability.
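The result above can be reproduced with Python's html.parser. The module reports attributes as written, duplicates included, so the sketch applies the "first occurrence wins" rule of HTML parsing by hand. The class and function names are invented for the example.

```python
from html.parser import HTMLParser


class FirstStartTag(HTMLParser):
    """Captures the attribute list of the first start tag in the input."""

    def __init__(self):
        super().__init__()
        self.attrs = None

    def handle_starttag(self, tag, attrs):
        if self.attrs is None:
            self.attrs = attrs


def attributes_as_in_dom(markup):
    """Return the attributes of the first tag, keeping the first of duplicates."""
    parser = FirstStartTag()
    parser.feed(markup)
    parser.close()
    result = {}
    for name, value in parser.attrs or []:
        # html.parser passes duplicates through unchanged; an HTML parser
        # in a browser keeps the first occurrence and drops later ones.
        if name not in result:
            # A valueless attribute is reported as None; the DOM stores "".
            result[name] = "" if value is None else value
    return result


print(attributes_as_in_dom("<input type=text value=Login name name=login>"))
# prints {'type': 'text', 'value': 'Login', 'name': ''}
```

The stray word name becomes an empty attribute of its own, and the intended name=login is then discarded as a duplicate.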

There are a few attributes that might be exceptions to this rule:

  • If the only possible value is an integer.
  • If the only possible value is a keyword containing only US-ASCII letters.

However, enforcing good habits takes precedence over any other concern. I always start teaching the hardest possible rules, and then I gradually relax them. This works better than doing it the other way round.

Attribute values that contain > or = probably are erroneous (grade 3)

<abbr title="Et cetera>etc</abbr>
<abbr title="Et cetera class="foo>etc</abbr>

These are two examples of mismatched quotation marks. Yes, it happens a lot that I look over the shoulder of a student and say that they have forgotten to close the attribute, even though they have syntax highlighting on in their editors. (Not everyone's a genius and some are color blind!) If I could have tools that took care of the easy stuff, I'd be able to spend more time explaining the real issues and everyone in my classroom would be happier.

Since at least the second error might be confused with valid use, this behavior probably should generate a notice, not a real error: "Using the equal sign in an attribute value might be an indication of a real error. Please check that your markup is correct."

Many of these errors would probably get reported even with today's validation rules. I am gunning for those edge cases where two errors cancel each other out so that they mask each other.
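A notice of this kind could be sketched as below. The sketch looks at the raw text of each start tag, so that a correctly escaped &gt; is not flagged. Skipping href and src is an assumption made here because query strings legitimately contain equal signs; the exclusion list, the regular expression and all names are illustrative only.

```python
import re
from html.parser import HTMLParser

# Assumption: URL-carrying attributes are skipped, since query strings
# legitimately contain "=". The exact list is a guess.
URL_ATTRIBUTES = {"href", "src"}

# Matches name="value" or name='value' inside the raw text of a start tag.
QUOTED_ATTR = re.compile(r"""([^\s"'=<>/]+)\s*=\s*(?:"([^"]*)"|'([^']*)')""")


class SuspiciousValueCheck(HTMLParser):
    """Collects notices for quoted attribute values containing > or =."""

    def __init__(self):
        super().__init__()
        self.notices = []

    def handle_starttag(self, tag, attrs):
        line, _ = self.getpos()
        raw_tag = self.get_starttag_text() or ""
        for match in QUOTED_ATTR.finditer(raw_tag):
            name = match.group(1).lower()
            value = match.group(2)
            if value is None:
                value = match.group(3)
            if name in URL_ATTRIBUTES:
                continue
            if ">" in value or "=" in value:
                self.notices.append(
                    f'line {line}: value of "{name}" on <{tag}> contains '
                    f"> or =, please check the quotation marks")


def find_suspicious_values(markup):
    """Return the notices produced for the given markup."""
    checker = SuspiciousValueCheck()
    checker.feed(markup)
    checker.close()
    return checker.notices
```

Because the result is a list of notices and not errors, a document that uses an equal sign on purpose in, say, a title attribute would still pass.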

Language must be specified (grade 5)

This one is self-explanatory. There really should be a lang attribute set on the root element. This actually should be a regular conformance criterion, but since such a rule would wreck a lot of currently valid sites, that is probably not doable.
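As a rough sketch, such a check only needs to look at the first html start tag. Treating an empty lang value the same as a missing one is a choice made for this illustration, and the names are hypothetical.

```python
from html.parser import HTMLParser


class RootLangCheck(HTMLParser):
    """Remembers the lang attribute of the first <html> start tag."""

    def __init__(self):
        super().__init__()
        self.root_seen = False
        self.lang = None

    def handle_starttag(self, tag, attrs):
        if tag == "html" and not self.root_seen:
            self.root_seen = True
            self.lang = dict(attrs).get("lang")


def check_root_lang(markup):
    """Return None when a language is declared, otherwise a message."""
    checker = RootLangCheck()
    checker.feed(markup)
    checker.close()
    if not checker.root_seen:
        return "no explicit <html> start tag found"
    if not checker.lang:
        # Assumption: an empty value is treated like a missing attribute.
        return "the <html> element has no lang attribute"
    return None
```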

The alt attribute must always be present on images. (grade 5)

While the jury (maybe) still is out on whether this should be a regular conformance criterion (as I think it should) or not, at least the following can be said without any hesitation or doubt: No single argument against a mandatory alt attribute applies to the learning situation. Even if we say that sites like Flickr should be able to be conformant when users do not supply usable alt text, and even if, after every other option has been considered, it is decided to make the alt attribute optional, those use cases for sure do not apply in the classroom! If HTML5 eventually would go down that route, for this reason alone a pedagogic profile in the validator has earned its right to exist.

I am actually a bit reluctant to add this point. I fear it might re-open a can of worms and be taken as an argument against having alt as a mandatory attribute, since those who wish would now have a way of checking for its existence. However, I hope that everyone realizes that this is not the same discussion. My only point is, that if worst should come to worst, this feature is necessary.
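Checking for the mere presence of the attribute is simple, as this sketch with Python's html.parser suggests. It accepts alt="" as present, and it makes no attempt to judge whether the text is usable. The names are invented for the example.

```python
from html.parser import HTMLParser


class MissingAltCheck(HTMLParser):
    """Collects a message for every <img> without an alt attribute."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def handle_starttag(self, tag, attrs):
        # Also called for <img ... />, via the default handle_startendtag.
        if tag == "img" and "alt" not in dict(attrs):
            line, _ = self.getpos()
            self.messages.append(f"line {line}: <img> has no alt attribute")


def find_images_without_alt(markup):
    """Return the messages produced for the given markup."""
    checker = MissingAltCheck()
    checker.feed(markup)
    checker.close()
    return checker.messages
```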

What else?

I am going to give this some thought — and Sam Ruby a few initial test cases… After which I might revisit this subject and alter my list of things to test. Of course all feedback is welcome!

One thing that I've thought about is a check for code indentation. But first of all it is probably not possible to check for this in a reasonable manner. And would it be possible to agree on a standard? Nah! I don't think so.

P.S. If someone wonders why my blog has the word Thinkpad in its name, I do still have an ambition to document my joys and woes about using Fedora Linux on my Z61p (and on my still un-bought W700…). Patience, patience.


  1. A quick note to self:

    "Attribute values that contain > or = probably are erroneous (grade 3)"

    Another benefit of having such a check is that even when two errors do not even each other out, making the error visible using today's rules, the actual error message and location will be much more intuitive with my extra rules.

  2. Lars, this is a great post. My HTML / CSS / microformats teaching/workshops experiences echo yours. I've found that when students are instructed to write valid XHTML, they tend to make fewer errors (whether caught at validation time or browsing time) than if they write "just" HTML.

  3. I also enjoyed reading it, and I understand most of the concerns, which are fairly typical when you show people how difficult and how dull learning HTML can be *quite the opposite of the common expectations*. Still, my style has been clear: I have tried not to be "very forgiving" while teaching, the way recent browsers are when they somehow interpret bad HTML. I can't expect new learners to bypass difficult situations with ease and no trouble at all.

    If they want to understand how HTML marks-up and how it works, one should show them to take the standard path before showing Alice's hole. Besides, I don't believe one could understand Document Object Model and today's most XML applications by doing otherwise.

    I think "no pain there is no gain" applies here.

    I agree with Tantek's comments. Although the validator is not meant to be the embodiment of web standards itself, aiming for flawless results has always helped learners gain a great deal of HTML awareness, so doing so must be a decent practice. Then of course, in some appendix or addendum, you can tell them how stupid putting in empty ALT attributes is, or about other tricks. This will create your own (quality) difference :) best regards

  4. > Did you know that a line feed immediately following the starting pre-tag is forbidden, when using XML parsing rules? I most certainly did not. (And I am still a bit unsure if I got it right...)

    It's not forbidden, it's just that it means different things in text/html and in XML. In text/html, the line feed is eaten by the parser. In XML, it is not. So if you want your markup to work in the same way in text/html and in XML, you can't use a line feed there.

    > Attribute values that contain > or = probably are erroneous (grade 3)

    It is very common for URLs to contain =. I think mismatched quotes are caught in the validator already without banning > or = in quoted attributes.

    > Language must be specified (grade 5)

    Nooo! :-) This would result in even more bogus lang. (It is already used incorrectly so much that UAs are better off ignoring it and applying language analysis on the text.)

  5. @zcorpan:

    1. Yeah! I knew that I got it wrong. It is sort of forbidden if you want an identical DOM... I'll change it.

    2. Good catch! Yes, mismatched quotes usually cause validation errors today. I gave this a rather low grade because of that. There are still two aspects though: (a) when two errors even out. (Yes I've seen it.) (b) The possibility of giving the user a warning that is easier to understand. (Probably the better argument of the two.)

    You are indeed correct about the fact that equal signs are common in links. I should have said that one should exclude the href and the src attribute from that specific test.

    3. Do you mean that there is so much bogus lang that one should not teach students to use it at all?

    WCAG 1.0 specifies that you should use the lang-attribute, and WCAG 2 is interpreted to mean the same thing (3.1.1)

    Isn't it true that almost all bogus lang is "en", since students and lazy developers simply copy-paste code? Perhaps we could add an info notice to the validation report that explicitly interprets the attribute value into a full sentence - since we are talking about pedagogic validation!

  6. Beyond the polyglot validation issue, the W3C's validator needs to present errors in more student-friendly language. It's one thing for a student to identify a problem point in their code, it's another for them to understand the recommendation that is offered in what seems to them a foreign language.

  7. I don't think warning about attribute values containing equals signs is a good idea. The href and src attributes commonly do when they have query strings in the URL, and flagging that would be misleading. Perhaps if the warning was restricted to being applied to a more limited set of attributes, it might work.

    Also, I'm surprised you didn't mention unencoded ampersands and less than signs, which, from my experience, is among the most frequent errors.

  8. Quick answer to Lachlan:

    "unencoded ampersands and less than signs"

    I was under the impression that they are caught with normal validation. I do agree that they are very common, though.

    As for your first comment, I have already answered Zcorpan on that one. Bear in mind that this post is intended as a discussion starter. I am sure things might change during our discussion, perhaps even my mind!

  9. @Aarron.

    You are indeed right. I have not looked into this issue in detail, but generally I find the error messages in the HTML 5 validator easier to understand. There is still some work that needs to be done, though.

  10. @Lachlan:
    Ampersands and equal signs within URLs should always be encoded.

  11. - this converter shows how HTML5 pages may look now.

  12. Additional idea about what pedagogic validation could include:

    Mixing old school h1-h6 with sectioning elements should not be allowed. Thus there should be a warning if I have:

    <p>Lorem ipsum…</p>
    <p>Dolor sit amet…</p>

    It should be written like this:

    <p>Lorem ipsum…</p>
    <p>Dolor sit amet…</p>

    (Still using h2 to be backwards compatible.)

    Rule: Students should not mix the two ways of outlining a page, but pick one or the other.

  13. Addendum to my last comment. (Yes I am using the blog to collect my own ideas.)

    Of course, subheadings, nicely wrapped in <hgroup>, are allowed.

  14. Another note: One should not mix attributes that have values and boolean attributes:

    <input type="password" required name="pw" /> = Bad!

    <input type="password" name="pw" required /> = Good!
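The ordering rule from the last note can be sketched like this, assuming that "boolean attribute" means an attribute written without any value. An attribute written as required="required" counts as having a value here. The names are made up for the illustration.

```python
from html.parser import HTMLParser


class AttributeOrderCheck(HTMLParser):
    """Flags attributes with values that follow a valueless attribute."""

    def __init__(self):
        super().__init__()
        self.messages = []

    def handle_starttag(self, tag, attrs):
        line, _ = self.getpos()
        boolean_seen = None
        for name, value in attrs:
            if value is None:
                # html.parser reports a valueless attribute as None.
                boolean_seen = name
            elif boolean_seen is not None:
                self.messages.append(
                    f'line {line}: <{tag}> has "{name}" after the '
                    f'boolean attribute "{boolean_seen}"')


def find_misplaced_attributes(markup):
    """Return the messages produced for the given markup."""
    checker = AttributeOrderCheck()
    checker.feed(markup)
    checker.close()
    return checker.messages
```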