The section of the WHATWG HTML spec about parsing XHTML begins with this note:
> An XML parser, for the purposes of this specification, is a construct that follows the rules given in XML to map a string of bytes or characters into a Document object.
> Note: At the time of writing, no such rules actually exist.
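What do the authors of HTML mean by this? Isn't there a spec for XML? There is -- here's what it has to say about comments (https://www.w3.org/TR/xml/#sec-comments):
    [15] Comment ::= '<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'
The HTML spec, on the other hand, writes out the token state machine explicitly. There are ten states involved with parsing comments; here's one (https://html.spec.whatwg.org/multipage/parsing.html#comment-...):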
12.2.5.45 Comment state
Consume the next input character:
    U+003C LESS-THAN SIGN (<)
        Append the current input character to the comment token's data. Switch to the comment less-than sign state.
    U+002D HYPHEN-MINUS (-)
        Switch to the comment end dash state.
    U+0000 NULL
        This is an unexpected-null-character parse error. Append a U+FFFD REPLACEMENT CHARACTER character to the comment token's data.
    EOF
        This is an eof-in-comment parse error. Emit the comment token. Emit an end-of-file token.
    Anything else
        Append the current input character to the comment token's data.
The spec defines what to do for every character, even characters that should not appear in valid HTML. Two parsers that follow it will therefore behave exactly the same on any input, valid or not.
You can see the success of this approach on the real web; inconsistent HTML parsing between browsers is no longer the issue it was 15 years ago. It may be more work to write, but I wish HTML's precise, step-by-step format were more common. Writing a spec as a list of rules makes it easier to implement (as a first pass, you can just go line-by-line and translate it to code) and reduces the chance of inconsistencies like the one in the article (and their associated security implications).
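To make the "go line-by-line and translate it to code" point concrete, here is a rough Python sketch of just the comment state quoted above. It is not taken from any real parser: the `Tokenizer` class, its field names, the tuple shape of emitted tokens, and the use of `None` as the EOF marker are all assumptions made for illustration, and the other comment states are not implemented.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

EOF = None  # assumption: end of input is signalled by passing None


@dataclass
class Tokenizer:
    # Hypothetical minimal tokenizer state; only the comment state is modelled.
    state: str = "comment"
    comment_data: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)
    emitted: List[Tuple[str, ...]] = field(default_factory=list)

    def comment_state(self, ch: Optional[str]) -> None:
        """Handle one input character, one branch per rule in the quoted state."""
        if ch == "<":
            # U+003C: append to the comment data, go to the less-than sign state.
            self.comment_data.append(ch)
            self.state = "comment less-than sign"
        elif ch == "-":
            # U+002D: go to the comment end dash state (nothing appended here).
            self.state = "comment end dash"
        elif ch == "\u0000":
            # U+0000: parse error, append U+FFFD instead of the NULL.
            self.errors.append("unexpected-null-character")
            self.comment_data.append("\ufffd")
        elif ch is EOF:
            # EOF: parse error, emit the comment token, then an end-of-file token.
            self.errors.append("eof-in-comment")
            self.emitted.append(("comment", "".join(self.comment_data)))
            self.emitted.append(("eof",))
        else:
            # Anything else: append to the comment data.
            self.comment_data.append(ch)
```

Each branch corresponds to one entry in the spec text, which is the property being praised here: there is no input for which the behavior is left open.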
I am very skeptical. This state-machine approach seems much more like an implementation than a specification. Having a reference implementation could certainly be a good thing, but this doesn't even look like something one could run and test against.
The declarative form of Comment, above, is wonderfully concise and clear compared to the several lines of the imperative, update-this/goto-there alternative. You can see in your head what it should match without mentally simulating these specific instructions against some imagined parser state.
There certainly can be a host of terrible issues with BNF-style grammars. When they're just used as a notation to write down a bunch of rules, with no regard to actually implementing these rules, the result can be a sprawling and terribly ambiguous mess. For instance, this[1] is an abject disaster, chock full of ambiguity and senseless distinctions.
But if one is prepared to take the effort to really write a machine-readable grammar like this[2], the result is a straightforward, high-level, concise spec that can be compiled into an implementation to boot. What's not to like?
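As a small illustration of a declarative rule turning directly into an implementation, the XML 1.0 Comment production, `'<!--' ((Char - '-') | ('-' (Char - '-')))* '-->'`, can be transcribed almost symbol for symbol into a regular expression. This is only a sketch under a stated simplification: `[^-]` stands in for `Char - '-'`, whereas the real `Char` production also excludes some control characters. The names `XML_COMMENT` and `is_xml_comment` are made up for the example.

```python
import re

# Simplification (assumption): [^-] approximates the spec's `Char - '-'`.
# The real Char production additionally rules out certain control characters.
XML_COMMENT = re.compile(r"<!--(?:[^-]|-[^-])*-->")


def is_xml_comment(text: str) -> bool:
    """Return True if the whole string matches the (simplified) Comment rule."""
    return XML_COMMENT.fullmatch(text) is not None


print(is_xml_comment("<!-- a-b -->"))   # single hyphens inside are allowed
print(is_xml_comment("<!-- a--b -->"))  # "--" inside the body is rejected
print(is_xml_comment("<!--a--->"))      # a body ending in "-" is rejected
```

Note what the pattern does not say: it only accepts or rejects, and gives no instruction for what a parser should do with the rejected inputs. That is the gap the state-machine style fills, at the cost of length.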