
Re: A few questions about building an XML parser

  • From: David Carlisle <d.p.carlisle@gmail.com>
  • To: Roger L Costello <costello@mitre.org>
  • Date: Thu, 10 Mar 2022 08:32:31 +0000


On Wed, 9 Mar 2022 at 13:02, Roger L Costello <costello@mitre.org> wrote:

Hi Folks,

For learning purposes (and for fun) I want to build an XML parser.

While an XML parser is not a compiler, I think that an XML parser performs the same steps as the front end of a compiler.

I am reading a compiler book [1] and it says this:

---------------------------------------------------

The front end can be divided into lexical analyzer, syntax analyzer, and semantic analyzer. The lexical analyzer, sometimes also called the scanner, carries out the simplest level of structural analysis. It will group the individual symbols of the source program text into their logical entities. Thus the sequence of characters ‘W’, ‘H’, ‘I’, ‘L’, and ‘E’ would be identified as the word ‘WHILE’ and the sequence of characters ‘1’, ‘.’, and ‘0’ would be identified as the floating-point number 1.0.

The syntax analyzer, often also called the parser, analyzes the overall structure of the whole program, grouping the simple entities identified by the scanner into the larger constructs, such as statements, loops, and routines, that make up the complete program.

Once the structure of the program has been determined we can then analyze its meaning (or semantics). We can determine which variables are to hold integers, and which to hold floating point numbers, we can check that the size of all arrays is defined and so on.

---------------------------------------------------

Okay, back to XML. Consider this non-well-formed XML:

<Publisher>Harper&amp;Row</Publsher>

(The end-tag is misspelled)

At what stage should the entity &amp; be converted to &?

  1. Lexical analysis stage
  2. Syntax analysis stage
  3. Semantic analysis stage

What stage should detect that the <Publisher> start-tag does not have a matching end-tag?

  1. Lexical analysis stage
  2. Syntax analysis stage
  3. Semantic analysis stage
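For illustration only, here is a minimal Python sketch of how the mismatch could be caught if the check were placed in the syntax analysis stage (option 2). It assumes a lexer has already produced (kind, value) tokens; the token kinds START, END and TEXT and the function name check_nesting are hypothetical names chosen for this sketch, not part of any XML specification or library.

```python
def check_nesting(tokens):
    """Verify that every start-tag is closed by a matching end-tag.

    tokens is a list of (kind, value) pairs, where kind is one of the
    hypothetical labels "START", "END" or "TEXT".
    """
    open_elements = []  # stack of element names still waiting for an end-tag
    for kind, value in tokens:
        if kind == "START":
            open_elements.append(value)
        elif kind == "END":
            if not open_elements:
                raise SyntaxError(f"end-tag </{value}> has no start-tag")
            expected = open_elements.pop()
            if value != expected:
                raise SyntaxError(f"expected </{expected}>, found </{value}>")
        # TEXT tokens do not affect nesting.
    if open_elements:
        raise SyntaxError(f"missing end-tag for <{open_elements[-1]}>")


# Tokens a lexer might emit for <Publisher>Harper&amp;Row</Publsher>
tokens = [("START", "Publisher"), ("TEXT", "Harper&amp;Row"), ("END", "Publsher")]
try:
    check_nesting(tokens)
except SyntaxError as err:
    print(err)  # expected </Publisher>, found </Publsher>
```

The lexer here only needs to recognise that something is a start-tag or an end-tag; pairing the names requires a stack, which is why the sketch puts the check after tokenization.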

Not shown in the example, but what stage should convert <![CDATA[Hello, World]]> to Hello, World?

  1. Lexical analysis stage
  2. Syntax analysis stage
  3. Semantic analysis stage

Some background information: Flex is a lexer generator; that is, it is a tool for generating lexical analyzers. The Flex manual shows an example [2] of a lexer that scans a string which is enclosed in quotes. For this input:

    "Hello\040World"

the lexical analyzer generates this token:

    Hello World

Notice that the octal entity ( \040 ) has been resolved to its character (the space character). That example leads me to conclude that a lexical analyzer is responsible for converting XML entities, e.g.,

    The lexical analyzer converts &amp; to &
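To make the Flex behaviour concrete, here is a rough Python equivalent of that kind of string scanner. It is a sketch, not the code from the Flex manual: the function name scan_string and the regular expression are assumptions made for this illustration, and only octal escapes are handled.

```python
import re

# A backslash followed by one to three octal digits, e.g. \040.
OCTAL_ESCAPE = re.compile(r"\\([0-7]{1,3})")


def scan_string(source):
    """Return the token value for a double-quoted string literal."""
    if len(source) < 2 or source[0] != '"' or source[-1] != '"':
        raise ValueError("expected a double-quoted string")
    body = source[1:-1]
    # Replace each octal escape with the character it denotes.
    return OCTAL_ESCAPE.sub(lambda m: chr(int(m.group(1), 8)), body)


print(scan_string(r'"Hello\040World"'))  # Hello World
```

The replacement is a fixed, context-free mapping from digits to a character, which is what makes it cheap to do inside the scanner.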

However, the Flex manual showed that a lexer “could” resolve an octal entity, but the manual didn’t say that the lexer “should” resolve entities, so I don’t know whether it is appropriate for the lexer to convert XML entities. What are your thoughts on this?


The \040 example is more like (although the comparison is still possibly misleading)

a &#38; b 

with a character (not entity) reference.
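As a sketch of why a character reference is the closer analogue: it can be replaced using nothing but the digits it contains, with no lookup of declarations. The Python below assumes a hypothetical helper name resolve_char_refs and, for brevity, does not check that the resulting code point is a legal XML character.

```python
import re

# Decimal (&#38;) or hexadecimal (&#x26;) character reference.
CHAR_REF = re.compile(r"&#(?:x([0-9A-Fa-f]+)|([0-9]+));")


def resolve_char_refs(text):
    """Replace each character reference with the character it names."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return CHAR_REF.sub(replace, text)


print(resolve_char_refs("a &#38; b"))   # a & b
print(resolve_char_refs("a &#x26; b"))  # a & b
print(resolve_char_refs("a &amp; b"))   # a &amp; b  (entity reference left alone)
```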

An entity reference is a named reference to a typically user-defined entity:

<!ENTITY wibble "1<b>2</b>3" >
....
a &wibble; b

is more like

wibble="1<b>2</b>3"
"a " + wibble + " b"

So it is not resolved by the parser, or at least certainly not by the lexical analysis. The name amp happens to be a predefined entity, but that doesn't really make much difference to the lexical analysis of the entity reference &amp;: it is structurally the same as a reference to a document-defined entity.
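A small Python sketch of that separation, under stated assumptions: the lexer only recognises that &name; is an entity reference and emits a token for it, and a later step looks the name up in a table of declarations. The names tokenize, expand, ENTITY_REF, TEXT and PREDEFINED are hypothetical, the name pattern is simplified to ASCII, and the replacement text is inserted as a plain string rather than being parsed again as a real XML processor would do.

```python
import re

# Either an entity reference (&name;) or a run of text containing no "&".
TOKEN = re.compile(r"&([A-Za-z_:][A-Za-z0-9_.:-]*);|([^&]+)")

# The five entities that XML predefines.
PREDEFINED = {"amp": "&", "lt": "<", "gt": ">", "apos": "'", "quot": '"'}


def tokenize(text):
    """Lexical step: split character data into TEXT and ENTITY_REF tokens."""
    tokens = []
    pos = 0
    while pos < len(text):
        match = TOKEN.match(text, pos)
        if match is None:
            raise ValueError(f"unexpected character at offset {pos}")
        if match.group(1) is not None:
            tokens.append(("ENTITY_REF", match.group(1)))
        else:
            tokens.append(("TEXT", match.group(2)))
        pos = match.end()
    return tokens


def expand(tokens, entities):
    """Later step: substitute each reference using the declared entities."""
    parts = []
    for kind, value in tokens:
        if kind == "ENTITY_REF":
            if value not in entities:
                raise KeyError(f"undeclared entity: {value}")
            parts.append(entities[value])
        else:
            parts.append(value)
    return "".join(parts)


tokens = tokenize("a &wibble; b")
print(tokens)  # [('TEXT', 'a '), ('ENTITY_REF', 'wibble'), ('TEXT', ' b')]
print(expand(tokens, {**PREDEFINED, "wibble": "1<b>2</b>3"}))  # a 1<b>2</b>3 b
print(expand(tokenize("Harper&amp;Row"), PREDEFINED))          # Harper&Row
```

Note that tokenize treats &amp; and &wibble; identically; only the table passed to expand differs.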

David

/Roger

[1] “Introduction to Compiling Techniques” by J.P. Bennett

[2] See page 24, https://epaperpress.com/lexandyacc/download/flex.pdf

 


