Thursday, September 10, 2009

Pedagogic validation of HTML

I have been trying to make HTML5 better for education by participating in the HTML5 effort at WHATWG for a few years. Recently I also joined the W3C HTML5 Working Group. One of the things that might come out of this effort is an option in the HTML5 validator for pedagogic validation. I will try to explain what such an option should check for, and how it will be beneficial to teaching web development.

I have previously written about what I call the value of false XHTML. Now I have been joined by the HTML5 superfriends, who request an option for easy polyglot validation. Henri Sivonen, who no doubt is a parsing rules genius and exceptionally knowledgeable, basically replied that such a thing is very hard, and contains so many minute details, that it might be of no value. Did you know that a line feed immediately following the starting pre-tag is forbidden, when using XML parsing rules? I most certainly did not. (And I am still a bit unsure if I got it right...)

I usually also skip the tbody tags when I write tables, but they are nevertheless automatically inserted into the DOM in HTML, just like head and body are. That will not work for a true polyglot document, since in XHTML the DOM will not be the same. Henri also suggests that using XHTML syntax might lead to a false understanding, for instance the belief that <script /> is possible to write in normal HTML.

I commented on Zeldman's blog that from my perspective these are non-issues. Let me tell you about the everyday problems I encounter as a teacher of markup languages, in addition to what a normal validation would reveal:

  • Students forgetting to quote attribute values, even though they contain multiple words: <img alt=My dog>
  • Students messing up the balance of the quotation marks: <img src="foo.jpg alt=My dog">
  • Students messing up the DOM since they do not (yet) know all the rules for when an element is implicitly closed by another element's start tag.
  • Students using document.write (and eval) in their scripts - yes I explicitly tell them not to, but they don't always listen, do they?

For reasons like these I tell them to use XHTML syntax today, since that will catch most of these errors.
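A minimal sketch of why XML parsing rules catch most of these mistakes. It uses Python's standard xml.etree.ElementTree as a stand-in for an XHTML parser. The helper name is_well_formed and the sample fragments are assumptions made for this illustration, not part of any validator.

```python
import xml.etree.ElementTree as ET


def is_well_formed(markup: str) -> bool:
    """Return True if the fragment parses under XML rules (hypothetical helper)."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


# Sample fragments, loosely based on the mistakes listed above.
samples = {
    "unquoted value": "<p><img alt=My dog /></p>",
    "missing end tag": "<ul><li>One<li>Two</li></ul>",
    "misplaced quote": '<p><img src="foo.jpg alt=My dog" /></p>',
    "correct": '<p><img src="foo.jpg" alt="My dog" /></p>',
}

for label, markup in samples.items():
    print(label, "->", "accepted" if is_well_formed(markup) else "rejected")
```

Note that the misplaced quote is accepted: with the trailing slash in place the fragment is well-formed XML, just with a strange src value. That is one of the cases where XHTML syntax alone is not enough.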

document.write and eval are outside the scope of HTML validation, but ECMAScript 5th edition strict mode and JSLint will take care of most such problems. What I would like is for HTML validation to have similar checks, checks that enforce good habits and help to avoid rookie mistakes.

What's the problem with true polyglot documents then?

  • The minutiae. The stuff Henri Sivonen rightly reminds us of. The stuff that should be saved for a later class, since it is so highly technical and frankly will scare some away from coding by hand.
  • The boolean attributes. Some HTML5 form elements may have a lot of those! Allowing them to exist in their shortened form would mean less markup to type (= happier students, less bandwidth required).

As you see, I do share some of the concerns about XHTML syntax. Today, however, the benefits clearly outweigh the drawbacks from a pedagogic perspective. This naturally leads me to the conclusion that there should be some middle ground, a way to specify a pedagogic profile for validation - and voilà - Sam Ruby has started to work on such a feature!

I will now explain what features such additional checks should have, according to my experience, and how they are beneficial. I will grade my suggestions from 1 to 5, in rising order of importance.

Avoid implicit rules

But check that what is explicit complies with them. As a newbie, one should see a 1:1 correlation between the DOM and the markup.

All elements should be explicit

This would mean that:

  • Root-element (html), head and body tags should not be optional (grade 5).
  • tbody tags should not be optional (grade 1).

In order to avoid making classes boring, I usually teach HTML together with some CSS from day one. I do not teach HTML first for a few weeks, and then I teach CSS. Besides being boooring, this will lead to some students starting to use presentational elements and attributes, because they will really want to have design features from day one.

CSS rules (usually) apply to the DOM, and not the markup, e.g. you cannot have a table#foo > tr selector, even if there are no tbody tags in the code. The student would think that a tr is a child element of table, since implicitly added elements might not be taught until later. However, one does not usually start using tables early on, since they are not used for layout when I teach CSS in conjunction with HTML. I can therefore live with this particular check not being implemented, hence it is graded at 1 only.

Explicitly grouping meta data about a document in the head section, and specifically being able to put some scripts in the head and other scripts in the body, is however very essential. In HTML5 we might also see scoped style-elements, which makes the use of explicit head and body tags even more important.
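A rough sketch of how the check for explicit html, head and body tags could work. It relies on Python's html.parser, which only reports tags that are actually written in the source and never adds implied ones. The class and function names are made up for this example.

```python
from html.parser import HTMLParser

# Elements whose tags the pedagogic profile would require to be written out.
REQUIRED = ("html", "head", "body")


class ExplicitStructureChecker(HTMLParser):
    """Records which start and end tags literally appear in the markup."""

    def __init__(self):
        super().__init__()
        self.started = set()
        self.ended = set()

    def handle_starttag(self, tag, attrs):
        self.started.add(tag)

    def handle_endtag(self, tag):
        self.ended.add(tag)


def missing_structure_tags(markup):
    """Return the required tags that are not explicit in the source."""
    checker = ExplicitStructureChecker()
    checker.feed(markup)
    checker.close()
    missing = []
    for tag in REQUIRED:
        if tag not in checker.started:
            missing.append(f"<{tag}>")
        if tag not in checker.ended:
            missing.append(f"</{tag}>")
    return missing


print(missing_structure_tags("<!DOCTYPE html><title>Hi</title><p>Hello</p>"))
```

The same idea would extend to tbody, the grade 1 item, by also requiring it inside every table.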

All elements must be explicitly closed
  • All normal elements must have closing tags. (grade 5)
  • Void elements must have a trailing slash. (grade 3)

The use case for closing tags is really simple. Besides making things easier to understand, it also alerts students about implicitly closed elements. If they would try to include a table in a paragraph the validator would complain when it encounters the closing p-tag. For such reasons I tell my students that closing tags are mandatory, and I want a simple way to enforce that behavior.
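The two rules above can be sketched as a simple stack-based check. This is an illustration under assumptions: the list of void elements is my own shorthand, the names are hypothetical, and the sketch only checks that tags are balanced in the source. It does not model the HTML rules for implicit closing, so it would not by itself catch the table-inside-paragraph case.

```python
from html.parser import HTMLParser

# Assumed list of void elements (elements that never have an end tag).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class ClosingTagChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_elements = []   # stack of elements awaiting an end tag
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID:
            # Written as <br> rather than <br />.
            self.problems.append(f"<{tag}> lacks a trailing slash")
        else:
            self.open_elements.append(tag)

    def handle_startendtag(self, tag, attrs):
        # Called for <tag ... />. On non-void elements an HTML parser
        # ignores the slash, so the element still needs an end tag.
        if tag not in VOID:
            self.open_elements.append(tag)

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        if tag not in self.open_elements:
            self.problems.append(f"stray </{tag}>")
            return
        # Everything opened after the matching start tag was never closed.
        while self.open_elements:
            top = self.open_elements.pop()
            if top == tag:
                break
            self.problems.append(f"<{top}> has no closing tag")


def closing_tag_problems(markup):
    checker = ClosingTagChecker()
    checker.feed(markup)
    checker.close()
    leftovers = [f"<{tag}> has no closing tag" for tag in checker.open_elements]
    return checker.problems + leftovers


print(closing_tag_problems("<ul><li>One<li>Two</li></ul>"))
```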

All non-shortened attributes must have their values quoted. (grade 5)

As I've said above, this is a very common error. In the worst case scenario it might lead to very unexpected results. Look at this example, where the value attribute is supposed to contain the words Login name:

<input type=text value=Login name name=login>

This code snippet actually produces a DOM as if the markup had read:

<input type="text" value="Login" name="">

Arguably, enforcing quotation marks also leads to better readability.
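To see this for yourself, here is a small sketch using Python's html.parser. The parser reports every attribute it finds, including duplicates, so the sketch keeps only the first occurrence of each name, which is how HTML parsing treats duplicate attributes. The names AttributeCollector and parse_attributes are invented for this example.

```python
from html.parser import HTMLParser


class AttributeCollector(HTMLParser):
    """Collects (tag, attributes) pairs roughly the way the DOM would hold them."""

    def __init__(self):
        super().__init__()
        self.elements = []

    def handle_starttag(self, tag, attrs):
        dom_attrs = {}
        for name, value in attrs:
            # The first occurrence of an attribute wins; later duplicates are dropped.
            if name not in dom_attrs:
                # An attribute without a value gets the empty string.
                dom_attrs[name] = "" if value is None else value
        self.elements.append((tag, dom_attrs))


def parse_attributes(markup):
    collector = AttributeCollector()
    collector.feed(markup)
    collector.close()
    return collector.elements


print(parse_attributes("<input type=text value=Login name name=login>"))
```

The second word of the intended value, name, is read as a separate attribute with no value. It then shadows the real name=login attribute.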

There are a few attributes that might be exceptions to this rule:

  • If the only possible value is an integer.
  • If the only possible value is a keyword containing only US-ASCII letters.

However, enforcing good habits takes precedence over any other concern. I always start teaching the hardest possible rules, and then I gradually relax them. This works better than doing it the other way round.
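A sketch of what the strict version of this check might look like, with no exceptions for integers or keywords. It looks at the raw text of each start tag, blanks out the correctly quoted values, and reports whatever bare values remain. The regular expressions and names are my own assumptions and are not tuned for every edge case.

```python
import re
from html.parser import HTMLParser

# Matches a correctly quoted attribute value.
QUOTED = re.compile(r'"[^"]*"|\'[^\']*\'')
# Matches name=value where the value does not start with a quotation mark.
UNQUOTED = re.compile(r'([^\s=/<>]+)\s*=\s*([^\s"\'>]+)')


class UnquotedValueChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.unquoted = []

    def handle_starttag(self, tag, attrs):
        raw = self.get_starttag_text()
        # Blank out quoted values so that only bare values stay visible.
        stripped = QUOTED.sub('""', raw)
        for name, value in UNQUOTED.findall(stripped):
            self.unquoted.append((tag, name.lower(), value))


def unquoted_values(markup):
    checker = UnquotedValueChecker()
    checker.feed(markup)
    checker.close()
    return checker.unquoted


print(unquoted_values("<input type=text value=Login name name=login>"))
```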

Attribute values that contain > or = probably are erroneous (grade 3)

<abbr title="Et cetera>etc</abbr>
<abbr title="Et cetera class="foo>etc</abbr>

These are two examples of mismatched quotation marks. Yes, it happens a lot that I look over the shoulder of a student and say that they have forgotten to close the attribute value, even though they have syntax highlighting on in their editors. (Not everyone's a genius and some are color blind!) If I could have tools that took care of the easy stuff, I'd be able to spend more time explaining the real issues and everyone in my classroom would be happier.

Since at least the second error might be confused with valid use, this behavior probably should generate a notice, not a real error, along the lines of: "Using the equal sign in an attribute value might be an indication of a real error, please check that your markup is correct."

Many of these errors would probably get reported even with today's validation rules. I am gunning for those edge cases where two errors cancel each other out so that they mask each other.
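A sketch of such a notice, assuming Python's html.parser as the tokenizer. It flags every attribute value that contains > or =, and the names are invented for this illustration. Since perfectly valid values such as URLs with query strings trip it as well, the result is meant as a notice and not an error.

```python
from html.parser import HTMLParser


class SuspiciousValueChecker(HTMLParser):
    """Collects attribute values that contain '>' or '='."""

    def __init__(self):
        super().__init__()
        self.notices = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (">" in value or "=" in value):
                self.notices.append((tag, name, value))


def suspicious_values(markup):
    checker = SuspiciousValueChecker()
    checker.feed(markup)
    checker.close()
    return checker.notices


# The second example from above: the title value swallows the class attribute.
print(suspicious_values('<abbr title="Et cetera class="foo>etc</abbr>'))
# A valid use that would still produce the notice.
print(suspicious_values('<a href="page.html?id=3">Page 3</a>'))
```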

Language must be specified (grade 5)

This one is self-explanatory. There really should be a lang attribute set on the root element. This actually should be a regular conformance criterion, but since such a rule would wreck a lot of currently valid sites, that is probably not doable.
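A minimal sketch of the check, again using Python's html.parser. It assumes that the first start tag in the source is the root element, and it treats an empty lang value as "not specified". Both are choices made for this illustration, as are the names.

```python
from html.parser import HTMLParser


class LangChecker(HTMLParser):
    """Looks for a non-empty lang attribute on an explicit html start tag."""

    def __init__(self):
        super().__init__()
        self.root_seen = False
        self.has_lang = False

    def handle_starttag(self, tag, attrs):
        if self.root_seen:
            return
        # Only the very first start tag in the source is considered.
        self.root_seen = True
        if tag == "html":
            self.has_lang = bool(dict(attrs).get("lang"))


def root_has_lang(markup):
    checker = LangChecker()
    checker.feed(markup)
    checker.close()
    return checker.has_lang


print(root_has_lang('<!DOCTYPE html><html lang="sv"><head><title>Hej</title></head><body></body></html>'))
```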

The alt attribute must always be present on images. (grade 5)

While the jury (maybe) is still out on whether this should be a regular conformance criterion (as I think it should) or not, at least the following can be said without any hesitation or doubt: no single argument against a mandatory alt attribute applies to the learning situation. Even if we say that sites like Flickr should be able to be conformant when users do not supply usable alt text, and even if, having considered every other option, it is decided to make the alt attribute optional, those use cases for sure do not apply in the classroom! If HTML5 eventually would go down that route, for this reason alone a pedagogic profile in the validator has earned its right to exist.

I am actually a bit reluctant to add this point. I fear it might re-open a can of worms and be taken as an argument against having alt as a mandatory attribute, since those who wish would now have a way of checking for its existence. However, I hope that everyone realizes that this is not the same discussion. My only point is, that if worst should come to worst, this feature is necessary.
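For the classroom case, the check itself is simple. Here is a sketch with invented names that lists the src of every img element lacking an alt attribute. An empty alt="" counts as present, since that is a deliberate choice by the author of the page.

```python
from html.parser import HTMLParser


class AltChecker(HTMLParser):
    """Collects the src of every img element that has no alt attribute."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append(attributes.get("src"))


def images_missing_alt(markup):
    checker = AltChecker()
    checker.feed(markup)
    checker.close()
    return checker.missing


sample = (
    '<p><img src="a.png" />'
    '<img src="b.png" alt="My dog" />'
    '<img src="c.png" alt="" /></p>'
)
print(images_missing_alt(sample))
```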

What else?

I am going to give this some thought, and give Sam Ruby a few initial test cases. After that I might revisit this subject and alter my list of things to test. Of course all feedback is welcome!

One thing that I've thought about is a check for code indentation. But first of all it is probably not possible to check for this in a reasonable manner. And would it be possible to agree on a standard? Nah! I don't think so.

P.S. If someone wonders why my blog has the word Thinkpad in its name, I do still have an ambition to document my joys and woes about using Fedora Linux on my Z61p (and on my still un-bought W700…). Patience, patience.