Friday, November 26, 2010

Tidy5 aka the future of HTML Tidy

UPDATE 2011-11-19: The most immediate of my concerns have been addressed by Björn Höhrmann, who has contributed basic HTML5 support to a fork of Tidy that is available on GitHub.

I have been a longtime fan of Tidy, a tool that cleans up HTML and runs some basic checks on it. However, the tool is not really being updated any more, and since I have moved to using HTML5 and ARIA on all my new projects, it has lost much of its usefulness.

I also see no momentum picking up, and thus think we should consider folding Tidy into html5lib. By that I mean using html5lib to get Tidy-like functionality.

Today I wrote an email that I cross-posted to the Tidy discussion list and the WHATWG help list. This blog post is essentially a longer version of that email.

Tidy must go HTML5

Here is the deal with HTML5. Pretty soon every browser will have an HTML5 parser. Except for IE, browsers do not have multiple parsers.

This means that tokenization and DOM tree building will follow the rules defined in HTML5 – as opposed to not really following any rules at all, since HTML 4 never defined them.

Simply put, there is no opting out of HTML5. An HTML 4 or XHTML 1.x doctype is nothing more than a contract between developers. Technically, all it does is put the browser in standards compliance mode.

Thus, I do not see any future in a tool that does not rely on the HTML5 parsing algorithm. Tidy cannot grow from its current code base, but needs to have the same html5lib at its core that is in the HTML5 validator, which basically is the same as the one being used in Firefox 4.

Additionally, Tidy suffers from:

  • Implementing WCAG 1.0 checks in a world that has moved on to WCAG 2.0.
  • Not recognizing ARIA, which is an extremely valuable technology on the script-heavy pages of today.
  • Not recognizing SVG and MathML.

I know one can set up rules to enable Tidy to recognize more elements and attributes, but for full HTML5 + ARIA + SVG + MathML (and perhaps RDFa), that is simply not doable without superhuman efforts.

The merge

A basic Tidy5 implementation could look like this:

  1. Parse the tag soup into a DOM.
  2. Serialize HTML from that DOM.
  3. Compare the start and the end result.
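The three steps above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: it uses the standard library's html.parser as a stand-in, which does NOT implement the HTML5 parsing algorithm, and the names Node, TreeBuilder, serialize and tidy are made up for this sketch. A real Tidy5 would put an HTML5 parser in the place of TreeBuilder.

```python
import difflib
from html import escape
from html.parser import HTMLParser

# Elements that never have an end tag.
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """A minimal element node: tag name, attribute pairs, children."""
    def __init__(self, tag, attrs=()):
        self.tag, self.attrs, self.children = tag, list(attrs), []


class TreeBuilder(HTMLParser):
    """Step 1: turn tag soup into a tree (stand-in, not the HTML5 algorithm)."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#root")
        self.stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self.stack[-1].children.append(node)
        if tag not in VOID:
            self.stack.append(node)

    def handle_endtag(self, tag):
        # Close everything up to the nearest matching open element;
        # stray end tags are ignored.
        for i in range(len(self.stack) - 1, 0, -1):
            if self.stack[i].tag == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        self.stack[-1].children.append(data)


def serialize(node):
    """Step 2: write the tree back out with explicit end tags."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#root":
        return inner
    attrs = "".join(
        f" {name}" if value is None else f' {name}="{escape(value)}"'
        for name, value in node.attrs
    )
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def tidy(source):
    """Step 3: return the cleaned markup and a diff against the input."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    cleaned = serialize(builder.root)
    diff = list(difflib.unified_diff(
        source.splitlines(), cleaned.splitlines(),
        fromfile="input", tofile="tidied", lineterm=""))
    return cleaned, diff
```

Calling tidy('<P class=x>Hello<br/><b>world') returns the normalized markup together with a non-empty diff, and running tidy again on its own output yields an empty diff. The sketch escapes the text inside script and style elements too, which a real tool must not do.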

Perhaps the error reporting could even be done during the parsing process. Henri Sivonen could probably say whether that is possible.

However, there is also talk about having a lint-like tool for HTML that goes beyond what the validator does. So in addition to the above, there could be settings for stuff like:

  • Implicit close of elements. Tolerate, require or drop all closing tags?
  • Implicit elements – tolerate, require or drop (maybe require body but drop tbody...)?
  • Shortened attributes – tolerate, require or drop?
  • HTML 4 style type attributes on <script> and <style> – tolerate, require or drop?
  • Explicit closing of void elements – tolerate, require or drop?
  • Full XHTML syntax (convert both ways)
  • Indentation. Preferably with an option to keep block elements with very short text content on one row, instead of breaking them up into 3 rows as Tidy does today.
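To make the "tolerate, require or drop" idea a bit more concrete, here is a rough sketch of how such settings could be modelled. Everything in it is hypothetical: the names Policy, LintSettings and apply_type_policy are mine, and only the rule for type attributes on <script> and <style> is actually implemented.

```python
from dataclasses import dataclass
from enum import Enum


class Policy(Enum):
    TOLERATE = "tolerate"   # leave the markup as the author wrote it
    REQUIRE = "require"     # add the construct where it is missing
    DROP = "drop"           # remove the construct where it is redundant


@dataclass
class LintSettings:
    """One policy per setting from the list above (hypothetical names)."""
    implicit_close: Policy = Policy.TOLERATE
    implicit_elements: Policy = Policy.TOLERATE
    shortened_attributes: Policy = Policy.TOLERATE
    legacy_type_attributes: Policy = Policy.TOLERATE
    void_element_closing: Policy = Policy.TOLERATE


# The HTML 4 style defaults that HTML5 makes optional.
DEFAULT_TYPES = {"script": "text/javascript", "style": "text/css"}


def apply_type_policy(tag, attrs, policy):
    """Return (new_attrs, messages) for one <script> or <style> start tag."""
    attrs = dict(attrs)
    messages = []
    default = DEFAULT_TYPES.get(tag)
    if default is None or policy is Policy.TOLERATE:
        return attrs, messages
    if policy is Policy.REQUIRE and "type" not in attrs:
        attrs["type"] = default
        messages.append(f'<{tag}>: added type="{default}"')
    elif policy is Policy.DROP and (attrs.get("type") or "").lower() == default:
        # Only the redundant default is dropped; other types carry meaning.
        del attrs["type"]
        messages.append(f'<{tag}>: dropped redundant type="{default}"')
    return attrs, messages
```

With Policy.DROP, a script tag with type="text/javascript" loses the attribute, while one with any other type value is left alone, since that value changes how the browser treats the script.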

Besides purification and linting, such a tool/library can be used for:

  • Security. This will require the possibility of whitelisting and/or blacklisting elements and attributes, and preferably also attribute values.
  • HTML post processing. This will enable authors to see indented code, that is explicit, while at the same time such "waste" can be removed before gzipping. This would be akin to JS minification and it could be performed on the fly from within PHP, Python, Java, Ruby, C#, server side JS or whatever. It can also be done manually before uploading from the development environment to production - or it could be integrated into the uploading tool!
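As an illustration of the security use case, a whitelist of elements, attributes and attribute values could work roughly like this. It is a sketch, not a vetted sanitizer: the ALLOWED table and the URL prefixes are example values I picked, and html.parser again stands in for an HTML5 parser.

```python
from html import escape
from html.parser import HTMLParser

# Example whitelist: element name -> allowed attribute names.
ALLOWED = {"p": set(), "em": set(), "strong": set(), "a": {"href"}}
# Example whitelist for attribute values: href must start with one of these.
SAFE_URL_PREFIXES = ("http://", "https://", "mailto:")
# Elements whose text content is thrown away together with the tags.
DROP_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = 0

    def handle_starttag(self, tag, attrs):
        if tag in DROP_CONTENT:
            self.skipping += 1
            return
        if tag not in ALLOWED:
            return
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.strip().lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_CONTENT:
            if self.skipping:
                self.skipping -= 1
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def sanitize(source):
    """Return the markup with everything outside the whitelist removed."""
    parser = WhitelistFilter()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)
```

Tags that are not on the list are removed but their text is kept, except for script and style, where the content goes as well. Filtering the token stream like this does not repair nesting, so in a real tool the filter would run on the DOM that the HTML5 parser produces.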

Checking templates

The main feature that Tidy has today is the ability to handle templates, by preserving/ignoring PHP or other server-side code. To what extent the HTML5 parser can be modified to handle that feature I do not know.
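One way to get template support without touching the parser at all might be to hide the server-side code before parsing and put it back afterwards. A minimal sketch, assuming PHP-style <?php ... ?> and <?= ... ?> blocks; the helper names protect and restore and the __tidy5_N__ placeholder format are invented for this example.

```python
import re

# PHP-style blocks; other template languages would need their own pattern.
SERVER_CODE = re.compile(r"<\?(?:php|=).*?\?>", re.DOTALL)
PLACEHOLDER = re.compile(r"__tidy5_(\d+)__")


def protect(source):
    """Swap each server-side block for a plain-text placeholder."""
    saved = []

    def stash(match):
        saved.append(match.group(0))
        return f"__tidy5_{len(saved) - 1}__"

    return SERVER_CODE.sub(stash, source), saved


def restore(markup, saved):
    """Put the original server-side blocks back in place of the placeholders."""
    return PLACEHOLDER.sub(lambda m: saved[int(m.group(1))], markup)
```

The placeholder contains no characters that need escaping, so it should survive a round trip both as text and inside an attribute value. The sketch breaks if the PHP code itself contains the string "?>", and a placeholder that ends up somewhere text is not allowed (directly inside a table, say) may be moved by the parser.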

From a maintenance and bug fixing point of view, I see huge wins in having a common base for Tidy, the HTML5 validator and HTML parsing in Gecko.

In fact, a very radical idea for Firefox (or any other browser using html5lib) would be to actually integrate these tidy-inspired features directly in their development tools, re-using the existing parser! A Firebug extension that lets me validate as well as tidy up my code directly within the browser would be super awesome!

But the actual possibility thereof is beyond my technical knowledge to evaluate, so I need to hear from people who know this stuff better than I do.

Integration with accessibility checking

Although automatic testing cannot substitute for manual tests, it can give a developer a ballpark idea of the accessibility of a page and help fix the most obvious mistakes.

The fact that Tidy today does integrate WCAG 1.0 checks is better than nothing, and any implementation of Tidy5 should strive to integrate WCAG 2.0 in a similar fashion. That really is a no-brainer. Having to use only one tool and getting all errors in the same buffer (for programmers) or the same console (for manual checks) is certainly convenient.
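To show what I mean by getting everything in the same buffer, here is a tiny sketch of a single accessibility check: images without an alt attribute, which falls under WCAG 2.0 Success Criterion 1.1.1 (Non-text Content). The class name AltTextCheck and the message format are my own, and html.parser is once more only a stand-in for the real parser.

```python
from html.parser import HTMLParser


class AltTextCheck(HTMLParser):
    """Collects a message for every <img> that has no alt attribute."""
    def __init__(self):
        super().__init__()
        self.messages = []  # the shared buffer other checks could append to

    def handle_starttag(self, tag, attrs):
        if tag == "img" and "alt" not in dict(attrs):
            line, column = self.getpos()
            self.messages.append(
                f"line {line}, column {column + 1}: "
                "<img> has no alt attribute (WCAG 2.0 SC 1.1.1)")


def check_accessibility(source):
    """Run the check and return the list of messages."""
    checker = AltTextCheck()
    checker.feed(source)
    checker.close()
    return checker.messages
```

An empty alt="" is accepted here, since that is the correct markup for purely decorative images. Whether a non-empty alt text is any good still takes a human to judge.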

OK, that was my two cents. What do you think?