Friday, November 26, 2010

Tidy5 aka the future of HTML Tidy

UPDATE 2011-11-19: The most immediate of my concerns have been addressed by Björn Höhrmann, who has submitted basic support for HTML5 in Tidy to a forked version available on GitHub.

I have been a longtime fan of Tidy, a tool to clean up HTML and run some basic checks on the code. However, the tool is not really being updated any more, and since I have moved to using HTML5 and ARIA on all my new projects, it has lost much of its usefulness.

I also see no momentum picking up, and thus think we should consider folding Tidy into html5lib. By that I mean using html5lib to get Tidy-like functionality.

Today I wrote a mail that I cross posted to the discussion list for Tidy and the help list for WHATWG. This blog post is essentially a longer version of that email.

Tidy must go HTML5

Here is the deal with HTML5. Pretty soon every browser will have an HTML5 parser. Except for IE, browsers do not have multiple parsers.

This means that tokenization and DOM tree building will follow the rules defined in HTML5 – as opposed to not really following any rules at all, since HTML 4 never defined them.

Simply put, there is no opt out of HTML5. An HTML 4 or XHTML 1.x doctype is nothing more than a contract between developers. Technically all it does is to set the browser in standards compliance mode.

Thus, I do not see any future in a tool that does not rely on the HTML5 parsing algorithm. Tidy can not grow from its current code base, but needs to have the same html5lib at its core that is in the HTML5 validator, which basically is the same as the one being used in Firefox 4.

Additionally, Tidy suffers from:

  • Implementing WCAG 1 checks in a world that has gone WCAG 2.0.
  • Not recognizing ARIA, which is an extremely valuable technology on the script heavy pages of today.
  • Not recognizing SVG and MathML.

I know one can set up rules to enable Tidy to recognize more elements and attributes, but for full HTML5 + ARIA + SVG + MathML (and perhaps RDFa), that is simply not doable without superhuman efforts.

The merge

A basic Tidy5 implementation could look like this:

  1. Parse the tag soup into a DOM.
  2. Serialize HTML from that DOM.
  3. Compare the start and the end result.
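
The three steps can be sketched roughly in Python. This is only an illustration under loud assumptions: it uses the standard library's html.parser as a stand-in, which does not implement the HTML5 tree building algorithm, and the names Node, TreeBuilder, serialize and tidy_diff are made up for this example. A real Tidy5 would put an actual HTML5 parser in place of step 1.

```python
import difflib
from html import escape
from html.parser import HTMLParser

# Elements that never have an end tag (subset, enough for the sketch).
VOID = {"area", "base", "br", "col", "embed", "hr", "img", "input",
        "link", "meta", "source", "track", "wbr"}


class Node:
    """Hypothetical minimal tree node; children are Node objects or str."""
    def __init__(self, tag, attrs=(), parent=None):
        self.tag, self.attrs, self.parent = tag, list(attrs), parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Step 1: build a small tree. NOT the HTML5 algorithm, just a stand-in."""
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with that name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.children.append(data)


def serialize(node):
    """Step 2: write the tree back out with every end tag made explicit."""
    if isinstance(node, str):
        return escape(node, quote=False)
    inner = "".join(serialize(child) for child in node.children)
    if node.tag == "#document":
        return inner
    attrs = "".join(
        f" {k}" if v is None else f' {k}="{escape(v)}"' for k, v in node.attrs)
    if node.tag in VOID:
        return f"<{node.tag}{attrs}>"
    return f"<{node.tag}{attrs}>{inner}</{node.tag}>"


def tidy_diff(source):
    """Step 3: compare the input with the re-serialized result."""
    builder = TreeBuilder()
    builder.feed(source)
    builder.close()
    cleaned = serialize(builder.root)
    diff = list(difflib.unified_diff(
        source.splitlines(), cleaned.splitlines(),
        "input", "tidied", lineterm=""))
    return cleaned, diff
```

Calling tidy_diff("<p>Hello <b>world</p>") returns the repaired markup together with a unified diff against the input; an empty diff means the source already round-trips unchanged.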

Perhaps error reporting can be done during the parsing process. Henri Sivonen could probably answer the question of whether that is possible.

However, there is also talk about having a lint like tool for HTML, that goes beyond what the validator does. So in addition to the above, there can be settings for stuff like:

  • Implicit close of elements. Tolerate, require or drop all closing tags?
  • Implicit elements – tolerate, require or drop (maybe require body but drop tbody...)?
  • Shortened attributes – tolerate, require or drop?
  • HTML 4 style type attributes on <script> and <style> – tolerate, require or drop?
  • Explicit closing of void elements – tolerate, require or drop?
  • Full XHTML syntax (convert both ways)
  • Indentation. Preferably with an option to keep block elements with very short text content on one line, rather than breaking them up into 3 rows as Tidy does today.
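
As a sketch of how such settings might be modelled, here is one possible shape in Python. Everything in it is hypothetical (the Policy enum, the LintSettings field names, the void_start_tag helper); it only illustrates the tolerate/require/drop pattern, including a per-element override along the lines of "require body but drop tbody".

```python
from dataclasses import dataclass, field
from enum import Enum


class Policy(Enum):
    TOLERATE = "tolerate"  # leave the construct as the author wrote it
    REQUIRE = "require"    # add it wherever it is missing
    DROP = "drop"          # remove it wherever it is present


@dataclass
class LintSettings:
    # Hypothetical option names, one per item in the list above.
    optional_end_tags: Policy = Policy.TOLERATE
    implicit_elements: Policy = Policy.TOLERATE
    implicit_element_overrides: dict = field(default_factory=dict)
    shortened_attributes: Policy = Policy.TOLERATE
    legacy_type_attributes: Policy = Policy.TOLERATE
    void_element_slash: Policy = Policy.TOLERATE

    def implicit_policy(self, tag):
        """Per-element override first, then the general setting."""
        return self.implicit_element_overrides.get(tag, self.implicit_elements)


def void_start_tag(tag, had_slash, settings):
    """Serialize a void element according to the void_element_slash policy."""
    policy = settings.void_element_slash
    if policy is Policy.REQUIRE:
        slash = True
    elif policy is Policy.DROP:
        slash = False
    else:  # TOLERATE: keep whatever the source had
        slash = had_slash
    return f"<{tag} />" if slash else f"<{tag}>"
```

With LintSettings(implicit_elements=Policy.DROP, implicit_element_overrides={"body": Policy.REQUIRE}), implicit_policy("body") yields REQUIRE while implicit_policy("tbody") falls back to DROP.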

Besides purification and linting, such a tool/library can be used for:

  • Security. This will require the possibility of white and/or blacklisting elements and attributes. And preferably also attribute values.
  • HTML post processing. This will enable authors to see indented code, that is explicit, while at the same time such "waste" can be removed before gzipping. This would be akin to JS minification and it could be performed on the fly from within PHP, Python, Java, Ruby, C#, server side JS or whatever. It can also be done manually before uploading from the development environment to production - or it could be integrated into the uploading tool!
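
The whitelisting idea could look something like the following Python sketch. It is an illustration only, not a vetted sanitizer: the ALLOWED table, the URL rule and the sanitize function are assumptions made up for this example, and the standard library's html.parser again stands in for a real HTML5 parser.

```python
from html import escape
from html.parser import HTMLParser

# Hypothetical whitelist: element name -> attribute names allowed on it.
ALLOWED = {"p": set(), "em": set(), "strong": set(), "a": {"href"}}
# Attribute values are checked too: only plain web links survive.
SAFE_URL_PREFIXES = ("http://", "https://")
# Elements whose text content is dropped together with the tag.
DROP_WITH_CONTENT = {"script", "style"}


class WhitelistFilter(HTMLParser):
    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skipping = False

    def handle_starttag(self, tag, attrs):
        if tag in DROP_WITH_CONTENT:
            self.skipping = True
            return
        if tag not in ALLOWED:
            return  # unknown element: drop the tag, keep its text
        kept = []
        for name, value in attrs:
            if name not in ALLOWED[tag] or value is None:
                continue
            if name == "href" and not value.lower().startswith(SAFE_URL_PREFIXES):
                continue
            kept.append(f' {name}="{escape(value)}"')
        self.out.append(f"<{tag}{''.join(kept)}>")

    def handle_endtag(self, tag):
        if tag in DROP_WITH_CONTENT:
            self.skipping = False
        elif tag in ALLOWED:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if not self.skipping:
            self.out.append(escape(data, quote=False))


def sanitize(source):
    """Return source with non-whitelisted elements and attributes removed."""
    flt = WhitelistFilter()
    flt.feed(source)
    flt.close()
    return "".join(flt.out)
```

A blacklist would be the same filter with the membership tests inverted, though a whitelist fails safer when new elements or attributes show up.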

Checking templates

The main feature that Tidy has today is the ability to handle templates by preserving/ignoring PHP or other server-side code. To what extent the HTML5 parser can be modified to handle that feature, I do not know.
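
One conceivable approach, for template blocks that sit where an HTML comment could go (the easy case described in Henri Sivonen's reply in the comments), is to swap each block for a placeholder comment before parsing and to put it back after serialization. A rough Python sketch; the placeholder format and the protect/restore names are invented for illustration, and blocks inside attribute values or between attributes are not handled at all.

```python
import re

# PHP blocks and <% ... %> style blocks, matched non-greedily.
# Assumption: the closing delimiter never appears inside the block itself.
TEMPLATE_BLOCK = re.compile(r"<\?php.*?\?>|<%.*?%>", re.DOTALL)
PLACEHOLDER = re.compile(r"<!--tidy5-template:(\d+)-->")


def protect(source):
    """Replace every template block with a numbered HTML comment."""
    blocks = []

    def stash(match):
        blocks.append(match.group(0))
        return f"<!--tidy5-template:{len(blocks) - 1}-->"

    return TEMPLATE_BLOCK.sub(stash, source), blocks


def restore(markup, blocks):
    """Put the original template blocks back in place of the placeholders."""
    return PLACEHOLDER.sub(lambda m: blocks[int(m.group(1))], markup)
```

The parse and serialize steps would run between protect() and restore(), and would need to pass comments through untouched for the round trip to work.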

From a maintenance and bug fixing point of view, I see huge wins in having a common base for Tidy, the HTML5 validator and HTML parsing in Gecko.

In fact, a very radical idea for Firefox (or any other browser using html5lib) would be to actually integrate these tidy-inspired features directly in their development tools, re-using the existing parser! A Firebug extension that lets me validate as well as tidy up my code directly within the browser would be super awesome!

But the actual possibility thereof is beyond my technical knowledge to evaluate, so I need to hear from people who know this stuff better than I do.

Integration with accessibility checking

Although automatic testing cannot substitute for manual tests, it can give a developer a ballpark idea of the accessibility of a page and help fix the most obvious mistakes.

The fact that Tidy today does integrate WCAG 1.0 checks is better than nothing, and any implementation of Tidy5 should strive to integrate WCAG 2.0 in a similar fashion. That really is a no brainer. Having to use only one tool and getting all errors in the same buffer (for programmers) or the same console (for manual checks) is certainly convenient.
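
To illustrate the "same buffer" point, here is a small Python sketch in which a markup check and an accessibility check append to one shared message list. The Checker class and the message format are hypothetical, html.parser is again only a stand-in for an HTML5 parser, and a missing alt attribute is just one of the few WCAG 2.0 issues (success criterion 1.1.1) that can be detected automatically.

```python
from html.parser import HTMLParser


class Checker(HTMLParser):
    """Collects markup and accessibility findings in a single list."""
    def __init__(self):
        super().__init__()
        self.messages = []  # (line, column, kind, text)

    def report(self, kind, text):
        line, column = self.getpos()
        self.messages.append((line, column, kind, text))

    def handle_starttag(self, tag, attrs):
        names = [name for name, _ in attrs]
        # Markup check: the same attribute given twice on one element.
        if len(names) != len(set(names)):
            self.report("markup", f"duplicate attribute on <{tag}>")
        # Accessibility check: images need a text alternative.
        # alt="" is accepted, since it marks a decorative image.
        if tag == "img" and "alt" not in names:
            self.report("accessibility",
                        "<img> without alt attribute (WCAG 2.0 SC 1.1.1)")


def check(source):
    checker = Checker()
    checker.feed(source)
    checker.close()
    return checker.messages
```

A caller gets every finding from check() in document order and can filter on the kind field if only one category is wanted.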

OK, that was my two cents. What do you think?

11 comments:

  1. I want it badly :-D Tidy5 including Lint and WCAG2 is the future!

  2. I see that "Laura" has added my blog post as a comment on http://rebuildingtheweb.com/en/html5-shortcomings/

    That post illustrates the problem we have with Tidy in an HTML5 world.

    I do disagree with that article's basic premise, though. I do not think we can go back and re-spec the HTML5 parsing rules to accommodate Tidy or any other non-browser tool the way that article suggests. It is a cry for HTML5 to be a bit more like XHTML 2.0 and that simply will not happen - like it or not.

  3. I guess Vlad has pretty realistic expectations.

    The problem is the rhetoric: HTML5 was touted as backward compatible.

    Firstly, I think, in many cases too much weight has been laid on backward compatibility. The discovery of the so-called HTML5 Shiv invalidated many of the earlier decisions when it comes to compatibility with IE6, without much actual reevaluation of anything happening as a result. For example, the figure element was for long stuck in a debate about which legacy element could possibly be compatible with IE6.

    Secondly, the rhetoric has not been true, as Vlad shows. The back compatibility has been about browser compatibility, at best. This is nothing new, actually. The point has been made in the HTMLwg by several.

    Much of the care for back-compat that has been built into HTML5 is quite cryptic. It is all about looking at what actually is possible to do with legacy browsers. The same kind of effort has not been put into trying to be compatible with e.g. authoring tools.

    Btw, one way in which HTML5 takes this into account - and Vlad does not say this - is by allowing some legacy strict-mode triggering doctypes. This works OK, except in those authoring tools which actually decide what they permit the author to do by looking at the DOCTYPE. This includes, I'm afraid, a few more authoring tools than I wish it did.

  4. Definitely a useful idea. I also like your suggested pattern of tolerate, require and drop for various properties.

    It would be ideal if this could serialize any of:

    • pure XML,
    • pure HTML, and
    • XHTML 1 appendix C compatible XHTML.

    Ideally in my view with the appendix C compatible as the default.

    I also think it would be useful to separate the tidying operations into the various phases:

    • parsing
    • DOM manipulation
    • attribute value tidying
    • serialization

    So

    • parsing
       - coalesce duplicate attributes (style, class)
       - recognize unknown element tags indicating:
          + head, body, either
          + block (so implied-p-close-tag) or phrase (so implied-no-p-close-tag)
          + void or non-void
          + content is CData or PCData

    • DOM manipulation
       - duplicate lang and xml:lang wherever the other appears (tolerate, require, drop)
       - remove duplicate IDs throughout the document (or alter them by appending an indexed-number)
       - add implied elements, by element (tolerate, require, drop)
       - add explicit close tags when implied, by element (tolerate, require, drop)
       - a user-provided list of unknown attribute names to recognize (for when omitting unknown attributes)
       - unknown attributes (tolerate, drop, optionally changing to comments)
       - enclose phrase text optionally in p or div in body element
       - enclose phrase text optionally in p or div all elements with flow content model
       - strip comments
       - escape CDATA tags
       - remove CDATA tags
       - add CDATA tags where needed for XML
       - add xml:space="preserve" when needed for XML (i.e., pre)
       - add optional user-specified lang and dir attributes to the root html element
       - duplicate id and xml:id on the same element (tolerate, require, drop)
       - convert HTML base element to xml:base or convert xml:base to html base element
       - automatically add id attribute values to block level elements or sections, or divs, etc. when missing
    • attribute value, comment content, and CDATA content tidying
       - URIs: change backslash to solidus
       - URIs: convert to IRI (from percent encoded international or puny code)
       - URI: convert to ASCII compatible percent-encoded URIs (including amp; quot; lt;, and apos;)
       - fix CDATA section prohibited characters (e.g., stray "]]>")
       - fix comment prohibited characters (e.g., stray "--")
       - compact boolean attributes (tolerate, require, change to full)
       - omit optional end tags (with self-closing or not)
       - character control to output either named reference, decimal reference, hexadecimal reference, or literal character for nbsp;, amp;, quot;, apos;, lt;, all other html repertoire, svg repertoire, mathml repertoire
       - casing control for element tag names, attribute names, attribute values to output:
          + preserved case
          + xml casing / schema determined casing
          + lower case
          + upper case

    Just my 2¢.

  5. I messed up something in my comment. The items starting:

    “- omit optional end tags (with self-closing or not)”

    should be under a “• serialization” list item.

  6. Definitely a useful idea. I also like your suggested pattern of tolerate, require and drop for various properties.

    It would be ideal if this could serialize any of:
    • pure XML,
    • pure HTML, and
    • XHTML 1 appendix C compatible XHTML.

    I also think it would be useful to separate the tidying operations into the various phases:

    • parsing
    • DOM manipulation
    • attribute value tidying
    • serialization


    • parsing
    - coalesce duplicate attributes (style, class)
    - recognize unknown element tags indicating:
    + head, body, either
    + block (so implied-p-close-tag) or phrase (so implied-no-p-close-tag)
    + void or non-void
    + content is CData, or Any, PCData/Mixed

    • DOM manipulation
    - duplicate lang and xml:lang wherever the other appears (tolerate, require, drop)
    - remove duplicate IDs throughout the document (or alter them by appending an indexed-number)
    - explicit implied HTML elements, by element (tolerate, require, drop)
    - explicit HTML close tags when implied, by element (tolerate, require, drop)
    - a user-provided list of unknown attribute names to recognize (for when omitting unknown attributes)
    - unknown attributes (tolerate or drop and optionally changing dropped unknown attributes to comments)
    - enclose stray phrase text optionally in p or div in body element
    - enclose stray phrase text optionally in p or div in all elements with flow content model in HTML 4.01 transitional but not in HTML5
    - drop all comments
    - escape CDATA tag escaping (in style and script elements): tolerate, require, drop (the escapes specifically)
    - CDATA sections: tolerate, require (for script and style elements), drop (relying on HTML definition so not XML compatible)
    - add xml:space="preserve" when needed for XML (i.e., pre)
    - add optional user-specified lang and dir attributes to the root html element
    - duplicate id and xml:id on the same element: tolerate, require, drop
    - convert HTML base element to xml:base or convert xml:base to html base element
    - automatically add id attribute and generate ID values for all block level elements (or all sectioning elements, all divs, all headings, or all p elements, etc.)

    • comment content and CDATA section content tidying (possibly needed at the parsing stage instead)
    - fix CDATA section prohibited characters (e.g., stray "]]>")
    - fix comment prohibited characters (e.g., stray "--")

    • attribute value tidying
    - URIs:
    + change backslash to solidus
    + convert to IRI (from percent encoded international or puny code)
    + convert to ASCII compatible percent-encoded URIs (including amp; quot; lt;, and apos;)

    • serialization
    - compact boolean attributes (tolerate, require, drop – i.e., change to full)
    - optional end tags (tolerate, require, drop)
    - self-closing non-void empty elements (tolerate, require, drop)
    - self-closing tag on void element: tolerate, require, drop
    - character control to output either:
    + named reference,
    + decimal reference,
    + hexadecimal reference,
    + or literal character
    for:
    + nbsp;,
    + amp;,
    + quot;,
    + apos;,
    + lt;,
    + gt;
    + all other characters in html character reference repertoire
    + svg repertoire
    + mathml repertoire
    - casing control to output
    + preserved case
    + xml casing / schema-driven casing
    + lower case
    + upper case
    for
    + element tag names,
    + attribute names,
    + attribute values (keywords, QNames, Names, IDs, NMTokens)

  7. Henri Sivonen was specifically asked to comment, but for strange technical reasons OpenID would not work for him. This is from an email I got instead:

    "The main feature that Tidy has today, is the ability to handle templates, by preserving/ignoring PHP or other server side code. To what extent the HTML5 parser can be modified to handle that feature I do not know."

    If you have a [?php ... ?] template (or [% ... %], etc.) in a place where you could instead put an HTML comment such that the comment would round-trip in parser and serialization, then hacking an HTML5 parser, the tree model and the serializer to preserve the template would be relatively easy.

    For other cases, it would be very, very hard. (PHP treats the stuff outside [?php ... ?] as arbitrary runs of bytes, so from an HTML point of view, [?php ... ?] can go *anywhere*, including inside attribute values, between attributes, in places where the absence of tags (that would get written by PHP) creates implicit elements, etc.)

    (Blogspot does not allow less than signs so I'm using brackets instead.)

    @Rob Burns: I decline to discuss your individual suggestions right now, even though I think many of them have some merit. We need *something* right now, and I think the focus must be HTML, not XML.

    If "Tidy5" could be expanded in the future, all the better, but I'd opt for the shortest route to get it out of the doors. That's why I suggested basing "Tidy5" on an existing HTML5 parser to begin with.

  8. Will future browsers still be able to read older HTML code with obsolete tags, or will CSS have to be used?

    ReplyDelete
  9. Browsers provide default styling through their CSS. That won't change.

    Semantic HTML will always be usable.

    HTML5 explicitly describes how to parse legacy content, even if it's malformed.

  10. Just making sure I'm not alone here...

    Once Tidy is configured correctly, the missing parts should not require a complete rewrite. Add definitions for new tags and new doctypes, and 90% is mostly done.

    Pragmatically speaking, if you wanted to add support to tidy, it would be worth it. If you want to parse everything in a DOM, you are defining a different type of project. Not saying it is worse, it's just not what tidy was ever about.

    Tidy is streamlined, written in C, has almost no dependencies, and will ultimately yield better performance. Beyond being written closer to the metal, Tidy is not a full DOM parsing tool. (Otherwise, it would be more like a formatting Expat.)

    Simply put, Tidy will not be replaced with 3 weeks of Python. This is coming from someone who loves Python; it's just never going to be C.

    Put in the real effort to update the codebase, or simply start a new project.
    IMO, it is inappropriate to use the same name for such a different project.

  11. Totally agree with you! We need HTML Tidy to support HTML5!
