
Napkin grammar

  • From: Rick Jelliffe <rjelliffe@allette.com.au>
  • To: xml-dev <xml-dev@lists.xml.org>
  • Date: Thu, 22 Jul 2021 20:06:01 +1000

In case anyone is interested, I made up a little grammar to show the kind of thing that I was thinking of, as a start point not an end point, based on recent posts. Maybe having something concrete helps.

So it is two parts: 
  • First, a grammar which is not made with parallel parsing considerations particularly in mind. The capitalized names in the grammar are the non-terminals determined by the lexical processing. (The sub-rules for recognizing the types of undelimited data values are given in the grammar, not the lexer, which I think is easiest to read if unfamiliar.)
  • Second, the lexical processing, specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline, or in some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document =   (element | comment | pi )+

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.TOKEN 

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = TOKEN

    typeable-token = boolean | year | number | symbol

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?

       --> $integer or $decimal

    symbol = TOKEN

    end-tag = END-TAG.TOKEN  EOM

    comment = COMMENT-TAG.CHARACTER*  EOM

    pi = piname  CHARACTER*  EOM

    piname = PI-TAG.TOKEN  EOM
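
As a rough illustration of the typeable-token rules in the grammar, here is a minimal Python sketch that types an undelimited TOKEN. It assumes DECIMAL means an ASCII digit and that the alternatives are tried in the order boolean, year, number, symbol; the function name classify_token and the returned type labels are made up for this example and are not part of the grammar.

```python
import re
from decimal import Decimal, InvalidOperation

# Assumption: DECIMAL is an ASCII digit; CHARACTER is any character.
_YEAR = re.compile(r"[0-9]+-.*", re.DOTALL)            # DECIMAL+ "-" CHARACTER*
_NUMBER = re.compile(r"-?[0-9]+(?:\..+)?", re.DOTALL)  # ("-")? DECIMAL+ ("." CHARACTER+)?


def classify_token(token):
    """Return a (type_label, value) pair for an undelimited attribute value."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if _YEAR.fullmatch(token):
        # The grammar leaves the tail after "-" open, so the text is kept
        # as-is rather than parsed into a date object.
        return ("yearDate", token)
    if _NUMBER.fullmatch(token):
        if "." not in token:
            return ("integer", int(token))
        try:
            return ("decimal", Decimal(token))
        except InvalidOperation:
            # e.g. "1.x" has the loose lexical shape but is not a decimal.
            return ("symbol", token)
    return ("symbol", token)
```

Falling back to symbol when the conversion fails is a choice made for this sketch; the grammar itself does not say what happens in that case.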


Each lexical pass can be thread-parallelized by section. And the pass execution can be parallelized by, e.g., queuing the results of one thread into another as needed. And the recognition can be parallelized using SIMD.
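
To make "thread-parallelized by section" a bit more concrete, here is a small Python sketch. It assumes that every raw "<" opens markup, so a section boundary can safely be placed at any "<". The names split_sections, count_tags and run_pass are hypothetical, and count_tags is only a stand-in for a real lexical pass. (CPython threads will not speed up CPU-bound work; the point is only the shape of the decomposition.)

```python
from concurrent.futures import ThreadPoolExecutor


def split_sections(text, n):
    """Cut text into at most n sections, each cut placed at a "<"."""
    cuts = [0]
    for i in range(1, n):
        # Look for the first "<" at or after the i-th equal-size boundary.
        pos = text.find("<", len(text) * i // n)
        if pos == -1:
            break
        if pos > cuts[-1]:
            cuts.append(pos)
    cuts.append(len(text))
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


def count_tags(section):
    """Stand-in for a real lexical pass: count the markup openers."""
    return section.count("<")


def run_pass(text, n=4):
    """Run the stand-in pass over each section on its own worker thread."""
    sections = split_sections(text, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # map() returns the results in section order.
        return list(pool.map(count_tags, sections))
```

Because the sections concatenate back to the original text and results come back in order, the per-section outputs can simply be joined end to end.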

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to pass up to the grammar, but it is only actually needed for start-tags, and would not be part of an infoset.
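
A sequential sketch of pass 1 in Python may help. It assumes that MARKUP never contains ">" and DATA never contains "<", and it reads the first note as "whitespace-only data after the last tag is reported as ws". The function name demarcate and the (kind, value) event shape are invented for the example.

```python
import re

# One unit of the TEXT rule: "<" MARKUP ">" DATA?
_UNIT = re.compile(r"<([^>]*)>([^<]*)")


def demarcate(text):
    """Pass 1 sketch: split text into MARKUP, EOM and DATA/ws events."""
    body = text.lstrip()  # the leading ws* of the TEXT rule
    events = []
    pos = 0
    while pos < len(body):
        m = _UNIT.match(body, pos)
        if m is None:
            raise ValueError("expected '<' MARKUP '>' at offset %d" % pos)
        events.append(("MARKUP", m.group(1)))
        events.append(("EOM", ">"))
        data = m.group(2)
        if data:
            is_last = m.end() == len(body)
            # Assumption: only whitespace-only trailing data becomes ws.
            kind = "ws" if is_last and data.isspace() else "DATA"
            events.append((kind, data))
        pos = m.end()
    if not events:
        raise ValueError("the TEXT rule needs at least one tag")
    return events
```

For example, demarcate("<a>hi</a>\n") yields MARKUP "a", EOM, DATA "hi", MARKUP "/a", EOM, and a final ws event for the newline.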


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?])  START-TAG  |  COMPLEX-TAG )

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) +

   Note: the apostrophe (apos) is not supported as an attribute delimiter here.
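
Pass 2 can be sketched in a few lines of Python. The first character of MARKUP decides between a start-tag and a complex tag, and a start-tag is then cut at each double quote into alternating unquoted and quoted spans. The quoted spans are labelled ATTRIBUTE-TEXT, the name the grammar uses for delimited attribute values. The function name split_markup and the output shape are assumptions for this example.

```python
def split_markup(markup):
    """Pass 2 sketch: separate quoted attribute values from tag text."""
    if not markup:
        raise ValueError("empty markup")
    if markup[0] in "!/?":
        # The lookahead (?=[^!/?]) fails, so this is a COMPLEX-TAG.
        return [("COMPLEX-TAG", markup)]
    spans = []
    for i, part in enumerate(markup.split('"')):
        if i % 2 == 1:
            # Odd-numbered pieces sit between double quotes.
            spans.append(("ATTRIBUTE-TEXT", part))
        elif part:
            spans.append(("TAG-TEXT", part))
    return spans
```

Like the production, this tolerates a missing closing quote on the last value: whatever follows the final opening quote is taken as ATTRIBUTE-TEXT.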


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

   Notes:

  • A numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. (Hex NCR only?)
  • I didn't bother to put the reference production in: "&" is just the start. Lazy.
  • Entity references are to all ISO/SGML/W3C/MathML entities, with the W3C (MathML) mappings. An implementation can override them: good for some publishers?
  • In SGML terms, all entities are CDATA: no markup or references are allowed in entities, and an entity must not expand to more characters than its reference.
  • There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags; the entity cannot carry the bold.
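
Since the reference production is not given, the following Python sketch of pass 3 has to assume a syntax: XML-style "&#xHEX;" for numeric character references and "&name;" for entity references. The tiny ENTITIES table is a placeholder for the full entity set (and shows how an implementation could override it), and the name substitute is hypothetical.

```python
import re

# Placeholder table; a real one would hold the full entity set.
ENTITIES = {"amp": "&", "lt": "<", "gt": ">", "quot": '"', "apos": "'"}

# Assumed reference syntax: &#xHEX; or &name;  (no decimal form).
_REF = re.compile(r"&(?:#x([0-9A-Fa-f]+)|([A-Za-z][A-Za-z0-9.]*));")


def substitute(text, entities=ENTITIES):
    """Pass 3 sketch: replace references with the characters they name."""
    def expand(m):
        if m.group(1) is not None:
            return chr(int(m.group(1), 16))  # hex NCR to one character
        name = m.group(2)
        if name not in entities:
            raise KeyError("unknown entity: " + name)
        value = entities[name]
        # An entity must not expand to more characters than its reference.
        if len(value) > len(m.group(0)):
            raise ValueError("expansion longer than reference: " + name)
        return value
    # re.sub does not rescan replaced text, so an expansion is plain
    # character data: no markup or references inside entities.
    return _REF.sub(expand, text)
```

Because an expansion is never longer than its reference, the output is never longer than the input, which is convenient if sections are substituted independently.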


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG 
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" TOKEN ws* CHARACTER* "?" 

     END-TAG = "/" TOKEN ws*
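
Finally, a Python sketch of pass 4. It assumes a TOKEN is a maximal run of characters that are neither whitespace nor "=", and that "=" is only a separator, since the attribute rule in the grammar has no "=" in it. The names tokenize_tag_text and classify_complex_tag and the tuple shapes are made up for the example.

```python
import re

_TOKEN = re.compile(r"[^\s=]+")                      # assumed TOKEN shape
_COMMENT = re.compile(r"!--(.*)--", re.DOTALL)       # "!--" CHARACTER* "--"
_PI = re.compile(r"\?([^\s=]+)\s*(.*)\?", re.DOTALL) # "?" TOKEN ws* CHARACTER* "?"
_END = re.compile(r"/([^\s=]+)\s*")                  # "/" TOKEN ws*


def tokenize_tag_text(tag_text):
    """TAG-TEXT sketch: keep the TOKENs, drop ws and "=" separators."""
    return _TOKEN.findall(tag_text)


def classify_complex_tag(markup):
    """COMPLEX-TAG sketch: decide between comment, PI and end-tag."""
    m = _COMMENT.fullmatch(markup)
    if m:
        return ("COMMENT-TAG", m.group(1))
    m = _PI.fullmatch(markup)
    if m:
        return ("PI-TAG", m.group(1), m.group(2))
    m = _END.fullmatch(markup)
    if m:
        return ("END-TAG", m.group(1))
    raise ValueError("not a complex tag: " + repr(markup))
```

The markup strings passed in here are what pass 1 found between "<" and ">", so the leading "!--", "?" or "/" is still present.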

      




