
  • From: Tim Bray <tbray@t...>
  • To: Rick Jelliffe <rjelliffe@a...>
  • Date: Thu, 22 Jul 2021 07:40:08 -0700

Example of what a document looks like?

On Thu., Jul. 22, 2021, 3:06 a.m. Rick Jelliffe, <rjelliffe@a...> wrote:
In case anyone is interested, I made up a little grammar to show the kind of thing that I was thinking of, as a starting point not an end point, based on recent posts. Maybe having something concrete helps.

So it is two parts: 
  • First, a grammar which is not made with parallel parsing considerations particularly in mind.  The capitalized names in the grammar are the non-terminals determined by the lexical processing.  (The sub-rules for recognizing the types of undelimited data values are given in the grammar, not the lexer, which I think is easiest to read if unfamiliar.)
  • Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in parallel, or as a pipeline, or as some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document =   (element | comment | pi )+

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.TOKEN 

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = TOKEN

    typeable-token = boolean |  year |  number |  symbol

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?

       --> $integer or $decimal

    symbol = TOKEN

    end-tag = END-TAG.TOKEN  EOM

    comment = COMMENT-TAG.CHARACTER*  EOM

    pi = piname  CHAR*  EOM

    piname = PI-TAG.TOKEN  EOM
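
To make the typeable-token rules concrete, here is a minimal Python sketch of how an unquoted attribute TOKEN might be classified as boolean, year, number or symbol. The function name `classify_token`, the tuple return shape, and the choice to fall back to symbol when a loosely matched number does not convert are all assumptions for illustration, not part of the grammar above.

```python
import re
from decimal import Decimal, InvalidOperation

# Loose patterns mirroring the grammar: DECIMAL+ "-" CHARACTER*  and
# ("-")? DECIMAL+ ("." CHARACTER+)?  (CHARACTER is treated as "any character").
YEAR_RE = re.compile(r"[0-9]+-.*", re.DOTALL)
NUMBER_RE = re.compile(r"-?[0-9]+(\..+)?", re.DOTALL)


def classify_token(token):
    """Hypothetical helper: return (type-name, converted-value) for a TOKEN."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if YEAR_RE.fullmatch(token):
        # The grammar only says "--> $yearDate"; the lexical form is kept as-is.
        return ("year", token)
    m = NUMBER_RE.fullmatch(token)
    if m:
        if m.group(1) is None:
            return ("integer", int(token))
        try:
            return ("decimal", Decimal(token))
        except InvalidOperation:
            pass  # e.g. "1.x" fits the loose rule but is not a decimal (assumed fallback)
    return ("symbol", token)
```

The order of the tests follows the order of the alternatives in typeable-token, so symbol acts as the catch-all.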


Each lexical pass can be thread-parallelized by section.  And the pass execution can be parallelized by e.g. queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.
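
As a rough illustration of Pass 1, the following Python sketch splits a text into MARKUP, EOM and DATA events. The function name `demarcate`, the event tuples and the sample input are assumptions. It also takes the rule literally, so the first ">" always ends the markup; how a ">" inside a comment or attribute value should be treated is not stated above and is not handled here.

```python
import re

# "<" MARKUP ">" DATA?  where MARKUP runs to the first ">" and DATA to the next "<".
SEGMENT_RE = re.compile(r"<([^>]*)>([^<]*)")


def demarcate(text):
    """Hypothetical Pass 1: return a list of (kind, value) events."""
    pos = len(text) - len(text.lstrip())  # skip the leading ws*
    events = []
    while pos < len(text):
        m = SEGMENT_RE.match(text, pos)
        if m is None:
            raise ValueError(f"expected a complete tag at offset {pos}")
        events.append(("MARKUP", m.group(1)))
        events.append(("EOM", ">"))
        data = m.group(2)
        if data:
            # Per the note: a terminating data section is marked as ws.
            is_trailing_ws = m.end() == len(text) and data.isspace()
            events.append(("ws" if is_trailing_ws else "DATA", data))
        pos = m.end()
    if not events:
        raise ValueError("TEXT needs at least one tag")
    return events
```

For example, `demarcate(' <a x="1">hi</a>\n')` yields the markup `a x="1"`, an EOM, the data `hi`, the markup `/a`, another EOM, and a final ws event.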


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?])  START-TAG  |  COMPLEX-TAG )

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) +

   Note:  apos not supported as attribute delimiter here. 
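
Pass 2 can be sketched in Python as a split on the double-quote character: pieces outside quotes are TAG-TEXT and pieces inside quotes are the attribute text (ATTRIBUTE-TEXT in the grammar). The function name `demarcate_attributes` and the output tuples are assumptions, and a tag with no quoted attribute is simply returned as one TAG-TEXT piece.

```python
def demarcate_attributes(markup):
    """Hypothetical Pass 2: split one MARKUP string into typed pieces."""
    # The lookahead (?=[^!/?]) selects start-tags; anything else is a COMPLEX-TAG.
    if markup[:1] in ("!", "/", "?"):
        return [("COMPLEX-TAG", markup)]
    pieces = []
    # Only the double quote delimits attributes (apos is not supported).
    for i, part in enumerate(markup.split('"')):
        if i % 2 == 1:
            pieces.append(("ATTRIBUTE-TEXT", part))  # inside quotes, may be empty
        elif part:
            pieces.append(("TAG-TEXT", part))        # outside quotes
    return pieces
```

Because a missing final quote just leaves the last piece at an odd index, this also covers the optional closing quote in the START-TAG rule.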


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: the numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal reference. I didn't bother to put the production in, but it looks for &. 

   Note:

  • I didn't bother to put the reference production: just & is start. Lazy.  
  • Hex NCR only?
  • Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
  •   In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
  •   There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
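
A minimal Python sketch of Pass 3 follows, under several assumptions: the reference syntax is taken to be XML-like (`&#xHHHH;` and `&name;`), Python's `html.entities.html5` table stands in for the W3C entity mappings, and a bare "&" that does not form a reference is left untouched. The name `substitute_references` and its `entities` override parameter are hypothetical.

```python
import re
from html.entities import html5  # stand-in table; keys look like "lt;"

# Hex numeric character reference or named entity reference, XML-style (assumed).
REF_RE = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9]*));")


def substitute_references(text, entities=None):
    """Hypothetical Pass 3: replace references by characters in a single scan."""
    table = html5 if entities is None else entities  # implementation override

    def expand(m):
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))  # hex NCR -> one character
        key = m.group(2) + ";"
        if key not in table:
            raise ValueError(f"unknown entity reference {m.group(0)}")
        value = table[key]
        # Expansion must not be longer than the reference itself.
        if len(value) > len(m.group(0)):
            raise ValueError(f"expansion of {m.group(0)} is too long")
        return value

    # re.sub never rescans replaced text, so expansions cannot introduce
    # new references or markup (entities behave like CDATA).
    return REF_RE.sub(expand, text)
```

Since no expansion is longer than its reference, an implementation could also rewrite a buffer in place without shifting later offsets, which may matter for the parallel passes.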


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG 
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" TOKEN ws* CHARACTER* "?" 

     END-TAG = "/" TOKEN ws*
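
The Pass 4 rules can be sketched with a few regular expressions in Python. The names `tokenize_tag_text` and `classify_complex_tag` are assumptions, as is the definition of TOKEN as a run of characters that are neither whitespace nor "=" (and, for a PI target, not "?"), since the grammar above does not spell TOKEN out.

```python
import re

TAG_TEXT_RE = re.compile(r"\s+|=|[^\s=]+")            # ws | "=" | TOKEN
END_TAG_RE = re.compile(r"/([^\s=]+)\s*")              # "/" TOKEN ws*
COMMENT_TAG_RE = re.compile(r"!--(.*)--", re.DOTALL)   # "!--" CHARACTER* "--"
PI_TAG_RE = re.compile(r"\?([^\s=?]+)\s*(.*)\?", re.DOTALL)  # "?" TOKEN ws* CHARACTER* "?"


def tokenize_tag_text(tag_text):
    """Hypothetical helper: drop ws, keep "=" and TOKEN items in order."""
    items = []
    for m in TAG_TEXT_RE.finditer(tag_text):
        piece = m.group(0)
        if piece == "=":
            items.append(("=", piece))
        elif not piece.isspace():
            items.append(("TOKEN", piece))
    return items


def classify_complex_tag(markup):
    """Hypothetical helper: recognize END-TAG, COMMENT-TAG or PI-TAG."""
    m = END_TAG_RE.fullmatch(markup)
    if m:
        return ("END-TAG", m.group(1))
    m = COMMENT_TAG_RE.fullmatch(markup)
    if m:
        return ("COMMENT-TAG", m.group(1))
    m = PI_TAG_RE.fullmatch(markup)
    if m:
        return ("PI-TAG", m.group(1), m.group(2))
    raise ValueError(f"not a COMPLEX-TAG: {markup!r}")
```

The input to both helpers is the text between "<" and ">" from Pass 1, after Pass 2 has separated out the quoted attribute text.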

      






