
Re: Napkin grammar

  • From: Rick Jelliffe <rjelliffe@allette.com.au>
  • To: xml-dev <xml-dev@lists.xml.org>
  • Date: Sat, 24 Jul 2021 18:35:08 +1000

Here is an updated grammar and examples. Added are
   Clark names  {URL}:name
   Link tags  <:  :> 
   scoped IDREFs    rootid:myid
   short tags

So it is two parts: 
  • First, a grammar which is not made with parallel parsing considerations particularly in mind.  The capitalized names in the grammar are the non-terminals determined by the lexical processing.  (The sub-rules for recognizing the types of undelimited data values are given in the grammar rather than the lexer, which I think is easiest to read if unfamiliar.)
  • Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline or some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document = (link | comment | pi )*  element (element | comment | pi )*

    Comment: a document can have multiple branches not a single root


    link = prefix  attribute EOM

    Comment: a link is a kind of element that is scoped by namespace prefix or branch id:
         it declares property values for every element/attribute with the same namespace
         or branch id.  A branch id is the id of the branch root.

    prefix = LINK-START.TOKEN 

           == TOKEN (could be empty for default)

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.BI_TOKEN

        --> clark-name

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = BI_TOKEN

        --> clark-name

    typeable-token = boolean |  year |  number  |  symbol  | id | prefixed-name

    prefixed-name = BI_TOKEN | clark-name

      == contains ":"

       --> clark-name

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?    

       --> $integer or $decimal

    id = TOKEN 

        --> ID

        // iff lexer knows that this is a branch root and attribute name is "id", it can do this

    symbol = TOKEN

 

    end-tag = END-TAG.BI_TOKEN  EOM

        --> clark-name EOM

      Comment: the name in an end tag does not require a prefix or {} url


    comment = COMMENT-TAG.CHARACTER*  EOM

        --> clark-name EOM

    pi = piname  CHARACTER*  EOM

    piname = PI-TAG.BI_TOKEN  EOM

        --> clark-name  EOM

    clark-name =  ("{" .* "}:" )? TOKEN
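
As a rough illustration of the clark-name rule, here is a minimal Python sketch that splits a name of the form {URL}:name into its two parts. The helper name `split_clark_name` and the returned tuple are assumptions made for the example, not part of the grammar. It reads the URL part as "anything up to the first }", which is narrower than the `.*` in the rule, and it does nothing about resolving prefixes through link tags.

```python
import re

# Hypothetical helper mirroring:  clark-name = ("{" .* "}:")? TOKEN
# The braces part is optional; the local part must be a non-empty token.
# Assumption: the URL itself contains no "}" character.
_CLARK = re.compile(r'^(?:\{(?P<url>[^}]*)\}:)?(?P<local>[^\s<"=]+)$')

def split_clark_name(text):
    """Return (url_or_None, local_name), or raise ValueError if no match."""
    m = _CLARK.match(text)
    if m is None:
        raise ValueError(f"not a clark-name: {text!r}")
    return m.group("url"), m.group("local")
```

A name without braces, such as svg:circle, comes back with None for the URL and the whole token as the local part; mapping the svg prefix to a URL would be a later step.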


Each lexical pass can be thread-parallelized by section.  The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.
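
To make "parallelized by section" a little more concrete, here is a small Python sketch that cuts the input just before a "<" and hands each section to a worker thread. Everything in it is an assumption for illustration: the function names, the choice of `ThreadPoolExecutor`, and the stand-in `count_tags` pass. It also assumes "<" only ever appears as a tag opener. In CPython, threads will not actually speed up pure-Python work like this; the point is only the shape of the split.

```python
from concurrent.futures import ThreadPoolExecutor

def split_sections(text, n):
    """Cut text into roughly n pieces, each cut placed just before a "<"."""
    size = max(1, len(text) // n)
    cuts = [0]
    while True:
        nxt = text.find("<", cuts[-1] + size)
        if nxt == -1:
            break
        cuts.append(nxt)
    # Pair each cut with the next one; the last section runs to the end.
    return [text[a:b] for a, b in zip(cuts, cuts[1:] + [len(text)])]

def count_tags(section):
    """Stand-in for a real lexical pass: counts tag openers in a section."""
    return section.count("<")

def parallel_pass(text, n=4):
    """Run the stand-in pass over each section and return per-section results."""
    with ThreadPoolExecutor(max_workers=n) as pool:
        return list(pool.map(count_tags, split_sections(text, n)))
```

Because every section starts at a tag boundary, each worker can run tag demarcation without looking at its neighbours.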

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to pass up to the grammar, but it is only actually needed for start-tags, and would not be part of an infoset.
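
Pass 1 can be sketched in a few lines of Python. This is a naive reading of the TEXT rule, not a definitive implementation: the function name `demarcate` and the event tuples are assumptions, and it assumes that ">" never occurs inside markup (for example inside a quoted attribute value). It also does not apply the note about marking a terminating data section as ws.

```python
import re

# Naive sketch of pass 1:  TEXT = ws* ("<" MARKUP ">" DATA?)+
# Assumption: ">" never occurs inside markup; a real lexer would need to
# cope with quoted attribute values and comments that contain it.
_TAG_AND_DATA = re.compile(r'<([^>]*)>([^<]*)')

def demarcate(text):
    """Yield ("MARKUP", s), ("EOM", ">") and ("DATA", s) events in order."""
    pos = len(text) - len(text.lstrip())   # skip the leading ws*
    for m in _TAG_AND_DATA.finditer(text, pos):
        yield ("MARKUP", m.group(1))
        yield ("EOM", ">")
        if m.group(2):                     # DATA is optional
            yield ("DATA", m.group(2))
```

The MARKUP strings are left untouched here; deciding whether each one is a start-tag or a complex tag is the job of the next pass.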


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?:])  START-TAG  |  COMPLEX-TAG )

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) +

   Note:  apos not supported as attribute delimiter here. 


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG | LINK-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &. 

   Note:

  • I didn't bother to put the reference production: just & is start. Lazy.  
  • Hex NCR only?
  • Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
  •   In SGML terms, all entities are CDATA: no markup or references are allowed in entity references, and an entity must not expand to more characters than the reference itself.
  •   There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.
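
The substitution pass could look something like the following Python sketch. Several things here are assumptions, since the reference production is not given: references are taken to end with ";", hex references to look like &#xHHHH;, and Python's built-in `html.entities.html5` table stands in for the W3C (MathML) mappings. The name `substitute_references` is made up for the example.

```python
import re
from html.entities import html5  # entity table shipped with Python

# Sketch of pass 3. Assumed syntax: "&#x" hex digits ";" for numeric
# references and "&" name ";" for entity references. Decimal references
# are deliberately not recognized.
_REF = re.compile(r'&(#x[0-9A-Fa-f]+|[A-Za-z][A-Za-z0-9]*);')

def substitute_references(text):
    """Replace hex NCRs and known entity references; leave the rest as is."""
    def repl(m):
        body = m.group(1)
        if body.startswith("#x"):
            return chr(int(body[2:], 16))
        # Keys in the html5 table include the trailing ";" (e.g. "mdash;").
        return html5.get(body + ";", m.group(0))
    return _REF.sub(repl, text)
```

The replacement is done in a single left-to-right scan, so the text produced by one reference is never scanned again for further references, in keeping with the CDATA point in the notes.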


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | BI_TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG | LINK-TAG
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" BI_TOKEN ws* CHARACTER* "?"

     END-TAG = "/" BI_TOKEN ws*

     LINK-TAG = ":" TOKEN? ws* (TAG-TEXT   \"  ATTRIBUTE-TEXT  \"? ) + ":"

     BI_TOKEN = [^\s<"=]+
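
Tokenizing TAG-TEXT might be sketched in Python as below. BI_TOKEN is read here as "one or more characters that are not whitespace, <, a double quote or =", which is an interpretation of the rule; the function name `tokenize_tag_text` and the choice to drop whitespace from the output are assumptions for the example.

```python
import re

# Sketch of pass 4 over:  TAG-TEXT = ( ws | "=" | BI_TOKEN )+
# Quoted attribute text is assumed to have been split off in pass 2, so a
# double quote reaching this function is treated as an error.
_TAG_TEXT = re.compile(r'(?P<ws>\s+)|(?P<eq>=)|(?P<tok>[^\s<"=]+)')

def tokenize_tag_text(tag_text):
    """Return the "=" and BI_TOKEN items of a tag, dropping whitespace."""
    out = []
    pos = 0
    while pos < len(tag_text):
        m = _TAG_TEXT.match(tag_text, pos)
        if m is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if m.lastgroup != "ws":
            out.append(m.group())
        pos = m.end()
    return out
```

For the tag text `svg:circle cx=50 cy=50` this gives the element name followed by name, "=", value triples, which is the shape the start-tag and attribute rules expect.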


So, an example: the purchase order example could come in without change, but here I have used some typed recognition of numbers, dates and tokens in attributes.

   <?hello abcd ?>

  <!-- comment -->

<PurchaseOrder PurchaseOrderNumber=99503 OrderDate=1999-10-20>
  <Address Type=Shipping>
    <Name>Ellen Adams</Name>
    <Street>123 Maple Street</Street>
    <City>Mill Valley</City>
    <State>CA</State>
    <Zip>10999</Zip>
    <Country>USA</Country>
  </Address>
  <Address Type=Billing>
    <Name>Tai Yee</Name>
    <Street>8 Oak Avenue</Street>
    <City>Old Town</City>
    <State>PA</State>
    <Zip>95819</Zip>
    <Country>USA</Country>
  </Address>
  <DeliveryNotes>Please leave packages in shed by driveway.</DeliveryNotes>
  <Items>
    <Item PartNumber="872-AA">
      <ProductName>Lawnmower</ProductName>
      <Quantity>1</Quantity>
      <USPrice>148.95</USPrice>
      <Comment>Confirm this is electric</Comment>
    </Item>
    <Item PartNumber="926-AA">
      <ProductName>Baby Monitor</ProductName>
      <Quantity>2</Quantity>
      <USPrice>39.98</USPrice>
      <ShipDate>1999-05-21</ShipDate>
    </Item>
  </Items>
</PurchaseOrder>
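
The typed recognition used for the unquoted attribute values above can be sketched in Python as follows. The order of the tests and the returned (type, value) pairs are assumptions, as is the name `classify_token`. The part of a number after the "." is narrowed to digits so that the conversion cannot fail, and year values are kept as text rather than parsed into dates.

```python
import re
from decimal import Decimal

# Sketch of the typeable-token sub-rules (boolean, year, number, symbol).
# Assumption: tests are tried in this order and the first match wins.
def classify_token(token):
    """Return a (type_name, converted_value) pair for an unquoted value."""
    if token in ("true", "false"):
        return ("boolean", token == "true")
    if re.fullmatch(r'[0-9]+-.*', token):
        return ("yearDate", token)          # kept as text; no date parsing
    if re.fullmatch(r'-?[0-9]+', token):
        return ("integer", int(token))
    if re.fullmatch(r'-?[0-9]+\.[0-9]+', token):
        return ("decimal", Decimal(token))
    return ("symbol", token)
```

Under the year rule as written, an unquoted 872-AA would also match as a year, which may be one reason to keep PartNumber quoted in the example.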

A more wild example:

   <?hello  References can go everywhere &#xAB; &#mdash; but only standard entities ?>
   <!-- same with comments &#xAB; &#mdash; -->

 <!-- a link tag for the whole document -->
  <:"/" 
Content-Type="text/
plain"
:>


<!-- Link tag for svg prefix. -->
   <:svg 
xmlns="http://www.w3.org/2000/svg"
         version ="1.1"
         schema="svg.rlx" :> 
 

           <svg:svg height=100 width=100  id=ABC>
              <svg:circle cx=50 cy=50 r=40 stroke=black stroke-width=3 fill=red   id=XYZ />
            </sv&#x67;>


             <!--  Below we have examples of a full QName used, a scoped link, and a dropped-prefix end-tag -->

             <svg:svg width=400 height=110>
                   <svg:rect width=300 height=100 id=XYZ />

                  <{http://www.example.com/link}:somelink    to=ABC:XYZ ></somelink>

             </svg>

   <!-- note: end of document -->


On Thu, Jul 22, 2021 at 8:06 PM Rick Jelliffe <rjelliffe@allette.com.au> wrote:
In case anyone is interested, I made up a little grammar to show the kind of thing that I was thinking of as a start point, not an end point, based on recent posts. Maybe having something concrete helps. 

So it is two parts: 
  • First, a grammar which is not made with parallel parsing considerations particularly in mind.  The capitalized names in the grammar are the non-terminals determined by the lexical processing.  (The sub-rules for recognizing the types of undelimited data values are given in the grammar rather than the lexer, which I think is easiest to read if unfamiliar.)
  • Second, the lexical processing is specified as a series of logical passes. Each pass is amenable to being divided and run in a parallel fashion, or as a pipeline or some event system, or folded into the grammar; of course a real implementation might coalesce or rearrange them with the same intent.

This uses some extensions:
     == means "if" 
     -->  $something means a data type conversion
     -> means a substitution (handling references)
     .  means a look-up in the lexical context, just a shorthand.
 

GRAMMAR:

    document =   (element | comment | pi )+

    element =  start-tag ( CHARACTER+ | element | comment | pi)*  end-tag

    start-tag = name attribute* EOM   

    name = START-TAG.TOKEN 

    attribute =  attname ( typeable-token | ATTRIBUTE-TEXT)

    attname = TOKEN

    typeable-token = boolean |  year |  number  |  symbol

    boolean = TOKEN 

        ==  ("true" | "false" )  

        --> $boolean
    year = TOKEN
        ==  ( DECIMAL+ "-" CHARACTER*  ) 

        -->  $yearDate

    number = TOKEN
        == ("-")? DECIMAL+ ("."   CHARACTER+)?    

       --> $integer or $decimal

    symbol = TOKEN

    end-tag = END-TAG.TOKEN  EOM

    comment = COMMENT-TAG.CHARACTER*  EOM

    pi = piname  CHAR*  EOM

    piname = PI-TAG.TOKEN  EOM


Each lexical pass can be thread-parallelized by section.  The pass execution can also be parallelized, e.g. by queuing the results of one thread into another as needed.  And the recognition can be parallelized using SIMD.

LEXICAL PASS 1: TAG DEMARCATION

    TEXT = ws*  ("<"  MARKUP EOM==">"  DATA?  )+ 

    Note: A terminating "data" section should be marked as ws.

    Note: EOM is the only delimiter signal the lexer needs to provide up, but it is only actually needed for start-tags, and would not be part of an infoset.


LEXICAL PASS 2:  ATTRIBUTE DEMARCATION

    MARKUP =  ( (?=[^!/?])  START-TAG  |  COMPLEX-TAG )

    START-TAG =  (TAG-TEXT   \"  ATTRIBUTE-TAG  \"? ) +

   Note:  apos not supported as attribute delimiter here. 


LEXICAL PASS 3: REFERENCE SUBSTITUTION

   ( DATA | ATTRIBUTE-TEXT  | SIMPLE-TAG | COMPLEX-TAG )

              ->  (CHARACTER 

               | NUMERIC-CHARACTER-REFERENCE -> CHARACTER 

               | ENTITY-REFERENCE  -> CHARACTER+)*  

     Note: a numeric character reference is a hex numeric character reference to a Unicode number, a la XML. No decimal references. I didn't bother to put the production in, but it looks for &. 

   Note:

  • I didn't bother to put the reference production: just & is start. Lazy.  
  • Hex NCR only?
  • Entity reference is to all ISO/SGML/W3C/MathML entities with W3C (MathML) mappings. Implementation can override, good for some publishers?
  •   In SGML terms, all entities are CDATA: No markup or references allowed in entity references, and must not expand to more characters than reference.
  •   There is one MathML character that needs bold tagging: if used, it must be explicitly put into bold by tags, the bold cannot transport.


 LEXICAL PASS 4: TOKENIZATION

     TAG-TEXT = ( ws | "=" | TOKEN )+

     COMPLEX-TAG = END-TAG | COMMENT-TAG | PI-TAG 
     COMMENT-TAG = "!--"  CHARACTER*  "--"

     PI-TAG = "?" TOKEN ws* CHARACTER* "?" 

     END-TAG = "/" TOKEN ws*

      



