[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Re: Minimal set of rules for making HTML well-formed?

  • From: David Carlisle <d.p.carlisle@gmail.com>
  • To: "Costello, Roger L." <costello@mitre.org>
  • Date: Fri, 11 Oct 2019 17:56:00 +0100

Re:  Minimal set of rules for making HTML well-formed?


On Fri, 11 Oct 2019 at 17:22, Costello, Roger L. <costello@mitre.org> wrote:

Hi Folks,

 

Last week there was a suggestion to use SgmlReader to convert HTML to XHTML. After some experimenting, I discovered that SgmlReader has some problems. See below for one such problem.


That looks like user error, with sgml (or xml) you would need to use a catalogue to supply a dtd that defines & nbsp; to be & #160;

 

I’ve decided to implement my own tool to convert HTML to XHTML. I want the tool to make the minimal amount of changes to the HTML -- just make the changes necessary to make the HTML well-formed. As I see it, there are only 4 things that need to be done to make the HTML well-formed:

 

  1. Ensure that attribute values are delimited with either double or single quotes.
  2. Ensure that every start tag has a matching end tag.
  3. Ensure that elements are properly nested.
  4. Ensure that the XML reserved characters ( <, >, &, ', ") are escaped when used in data.

 

Am I missing anything? If my tool does those 4 things am I guaranteed that the resulting document will be well-formed?  /Roger




This is a black hole that you might not want to approach  see any of the 10000s of online discussions surrounding


:-)


The main problem that killed the polyglot idea is that making an html document that parses as well formed xhtml isn't enough in practice (you also want it to mean/render the same) but the requirement used in that document that the DOM trees resulting from an html or xml parse are the same, is very hard to achieve.


1 and 2 are rather hard to achieve without using an html parser on the front end (and if you have one of those, just dumping a serialisation of its dom tree will give xml (er except when it doesn't:-)

the html parse _always_ gives a well defined result so just knowing what are the tags isn't easy with an arbitrary tool not using the html parser, and the tag names may not be valid xml names

I just typed in this at random



and it shows for example an attribute with name <kk that can't be expressed in xml.

You might be tempted to say that's not valid input but again defining (and automatically checking ) what is valid really needs an html parser.


3) again yes so long as you know what the elements are and all the special rules around <script> etc.

4) of the characters that you mention only < and & need to be escaped in character data, ' and " do not, but you also need to escape any use of ]]> in character data, and use of control characters which are not allowed in xml 1.0 (actually you can't just escape the control characters, you need to remove them or encode them in some other way)


David




 



[Date Prev] | [Thread Prev] | [Thread Next] | [Date Next] -- [Date Index] | [Thread Index]


PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
 

Stylus Studio has published XML-DEV in RSS and ATOM formats, enabling users to easily subcribe to the list from their preferred news reader application.


Stylus Studio Sponsored Links are added links designed to provide related and additional information to the visitors of this website. they were not included by the author in the initial post. To view the content without the Sponsor Links please click here.

Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.