In case anyone is interested, I made up a little grammar to show the kind of thing I was thinking of, as a starting point and not an end point, based on recent posts. Maybe having something concrete helps.
. means a look-up in the lexical context, just a shorthand.
GRAMMAR:
document = (element | comment | pi )+
element = start-tag ( CHARACTER+ | element | comment | pi )* end-tag
start-tag = name attribute* EOM
name = START-TAG.TOKEN
attribute = attname ( typeable-token | ATTRIBUTE-TEXT)
attname = TOKEN
typeable-token = boolean | year | number | symbol
boolean = TOKEN
    == ( "true" | "false" )
    --> $boolean
year = TOKEN
    == ( DECIMAL+ "-" CHARACTER* )
    --> $yearDate
number = TOKEN
    == ( "-" )? DECIMAL+ ( "." CHARACTER+ )?
    --> $integer or $decimal
symbol = TOKEN
end-tag = END-TAG.TOKEN EOM
comment = COMMENT-TAG.CHARACTER* EOM
pi = piname CHARACTER* EOM
piname = PI-TAG.TOKEN EOM
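To make the typeable-token idea concrete, here is a minimal sketch of how a TOKEN could be typed by trying the productions in the order boolean, year, number, symbol. The function name `classify` and the regular expressions are my own assumptions, read off the productions above; DECIMAL is taken to mean an ASCII digit.

```python
import re

# Patterns read off the typeable-token productions (illustrative only).
BOOLEAN = re.compile(r"true|false")
YEAR = re.compile(r"[0-9]+-.*")            # DECIMAL+ "-" CHARACTER*
NUMBER = re.compile(r"-?[0-9]+(\..+)?")    # ("-")? DECIMAL+ ("." CHARACTER+)?


def classify(token: str) -> str:
    """Return a type name for a TOKEN, trying productions in grammar order."""
    if BOOLEAN.fullmatch(token):
        return "boolean"
    if YEAR.fullmatch(token):
        return "yearDate"
    m = NUMBER.fullmatch(token)
    if m:
        # A fractional part means $decimal, otherwise $integer.
        return "decimal" if m.group(1) else "integer"
    # Anything else stays an untyped symbol.
    return "symbol"
```

Because symbol is the fallback, every TOKEN gets some type, and nothing here needs a schema.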
Each lexical pass can be thread-parallelized by section. The passes themselves can also run in parallel, e.g. by queuing the results of one thread into the next as needed. And the recognition within a pass can be parallelized using SIMD.
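As a rough illustration of queuing one pass into another, the sketch below runs each pass in its own thread and hands sections down a chain of queues. The names `stage`, `run_pipeline` and `DONE` are hypothetical, and in standard CPython the global interpreter lock limits real speed-up, so this shows only the structure, not the performance.

```python
import queue
import threading

DONE = object()  # sentinel marking the end of the section stream


def stage(fn, inbox, outbox):
    """Start a thread that applies one pass to every section it receives."""
    def run():
        while True:
            item = inbox.get()
            if item is DONE:
                outbox.put(DONE)  # pass the end marker downstream
                return
            outbox.put(fn(item))

    thread = threading.Thread(target=run)
    thread.start()
    return thread


def run_pipeline(sections, passes):
    """Feed sections through the passes, one thread per pass."""
    queues = [queue.Queue() for _ in range(len(passes) + 1)]
    threads = [stage(fn, queues[i], queues[i + 1])
               for i, fn in enumerate(passes)]
    for section in sections:
        queues[0].put(section)
    queues[0].put(DONE)

    results = []
    while True:
        item = queues[-1].get()
        if item is DONE:
            break
        results.append(item)
    for thread in threads:
        thread.join()
    return results
```

With one thread per pass and FIFO queues, sections come out in the order they went in, so a later pass can start on the first section while an earlier pass is still working on the second.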
LEXICAL PASS 1: TAG DEMARCATION
TEXT = ws* ( "<" MARKUP EOM==">" DATA? )+
Note: A terminating "data" section should be marked as ws.
Note: EOM is the only delimiter signal the lexer needs to pass up. It is only actually needed for start-tags, and would not be part of an infoset.
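Pass 1 can be sketched with a single regular expression. This is only an illustration: `demarcate` and the event names are my own, and it assumes that a literal ">" never occurs inside markup, so the first ">" after "<" is always EOM.

```python
import re

# "<" MARKUP ">" followed by optional DATA.
# Assumption: ">" never appears literally inside MARKUP.
SECTION = re.compile(r"<([^>]*)>([^<]*)")


def demarcate(text: str):
    """Return a list of (kind, text) events for pass 1."""
    events = []
    matches = list(SECTION.finditer(text.lstrip()))  # skip leading ws*
    for i, m in enumerate(matches):
        events.append(("MARKUP", m.group(1)))
        events.append(("EOM", ">"))
        data = m.group(2)
        if data:
            # Per the note, a terminating data section is marked as ws.
            is_last = i == len(matches) - 1
            events.append(("ws" if is_last else "DATA", data))
    return events
```

For example, `demarcate('<a x="1">hi</a>\n')` yields MARKUP `a x="1"`, EOM, DATA `hi`, MARKUP `/a`, EOM, and a final ws event for the newline.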
LEXICAL PASS 2: ATTRIBUTE DEMARCATION
MARKUP = ( (?=[^!/?]) START-TAG | COMPLEX-TAG )
START-TAG = ( TAG-TEXT \" ATTRIBUTE-TEXT \"? )+
Note: apos not supported as attribute delimiter here.
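Since only the double quote delimits attribute values, pass 2 amounts to splitting the markup on `"` and alternating between tag text and attribute text. The sketch below assumes that reading; `demarcate_attributes` is a hypothetical name, and the quoted runs are labelled ATTRIBUTE-TEXT as in the grammar.

```python
def demarcate_attributes(markup: str):
    """Split MARKUP into TAG-TEXT / ATTRIBUTE-TEXT runs, or pass it on whole."""
    # The lookahead (?=[^!/?]) decides between START-TAG and COMPLEX-TAG.
    if markup[:1] in ("!", "/", "?"):
        return [("COMPLEX-TAG", markup)]

    runs = []
    for i, part in enumerate(markup.split('"')):
        # Even pieces lie outside quotes, odd pieces inside them.
        kind = "TAG-TEXT" if i % 2 == 0 else "ATTRIBUTE-TEXT"
        # Keep empty attribute values, drop empty tag text.
        if part or kind == "ATTRIBUTE-TEXT":
            runs.append((kind, part))
    return runs
```

Because the split never looks inside a run, whitespace and "=" within an attribute value are left for no later pass to misread as tag syntax.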
LEXICAL PASS 3: REFERENCE SUBSTITUTION
( DATA | ATTRIBUTE-TEXT | SIMPLE-TAG | COMPLEX-TAG )
    -> ( CHARACTER
       | NUMERIC-CHARACTER-REFERENCE -> CHARACTER
       | ENTITY-REFERENCE -> CHARACTER+ )*
Note: A numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &.
Note:
- I didn't bother to put in the reference production: just & is the start. Lazy.
- Hex NCR only?
- Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
- In SGML terms, all entities are CDATA: no markup or references allowed in entity references, and they must not expand to more characters than the reference.
- There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
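A minimal sketch of the substitution pass follows, under assumptions of my own: references take the XML-like forms `&#xHHHH;` and `&name;`, the `ENTITIES` table is a tiny hypothetical sample standing in for the full ISO/SGML/W3C/MathML set, and unknown names are left untouched.

```python
import re

# Hypothetical sample table; a real one would carry the full entity sets
# and could be overridden by the implementation.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "nbsp": "\u00a0"}

# Hex numeric character reference, or a named entity reference.
REFERENCE = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9.]*));")


def substitute(text: str, entities=ENTITIES) -> str:
    """Replace hex NCRs and entity references in a single left-to-right pass."""
    def replace(m):
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))  # hex NCR -> one character
        # Unknown names are left as written (an assumption of this sketch).
        return entities.get(m.group(2), m.group(0))

    return REFERENCE.sub(replace, text)
```

The replacement text is never rescanned, which matches the CDATA rule: `substitute("&amp;lt;")` gives the four characters `&lt;`, not `<`.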
LEXICAL PASS 4: TOKENIZATION
TAG-TEXT = ( ws | "=" | TOKEN )+
COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG
COMMENT-TAG = "!--" CHARACTER* "--"
PI-TAG = "?" TOKEN ws* CHARACTER* "?"
END-TAG = "/" TOKEN ws*
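The TAG-TEXT production can be sketched as a three-way alternation. The grammar does not define TOKEN, so this assumes a TOKEN is any run of characters that are neither whitespace nor "="; `tokenize` and the kind names are hypothetical.

```python
import re

# TAG-TEXT = ( ws | "=" | TOKEN )+
# Assumption: TOKEN is any run of non-whitespace, non-"=" characters.
TAG_TEXT = re.compile(r"(?P<ws>\s+)|(?P<eq>=)|(?P<TOKEN>[^\s=]+)")


def tokenize(tag_text: str):
    """Return (kind, text) pairs for one TAG-TEXT run."""
    # lastgroup is the name of the alternative that matched.
    return [(m.lastgroup, m.group()) for m in TAG_TEXT.finditer(tag_text)]
```

For the tag text `a x=` this gives TOKEN `a`, ws, TOKEN `x`, then `=`; the first TOKEN of a start-tag is the name and each later TOKEN before `=` is an attname.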