
Re: nextml

  • From: Liam R E Quin <liam@w3.org>
  • To: Michael Sokolov <sokolov@ifactory.com>
  • Date: Thu, 09 Dec 2010 00:56:24 -0500

On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
> I want to be able to record the 
> position of elements as byte offsets in an original source file and use 
> those to extract well-formed fragments as text 

You can only do this if you check that no one else has touched the file since
you last made your index.  The "XML Promise" (as I call it) is that any XML
tool is licensed to process any XML document.
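
One way to make that check concrete is to store a digest of the exact bytes alongside the offsets and refuse to extract when it no longer matches. This is only a sketch: the index layout and the names build_index and extract are made up for illustration, and SHA-256 is an arbitrary choice.

```python
import hashlib


def build_index(data: bytes, spans):
    """Pair byte-offset spans with a digest of the exact bytes they refer to."""
    return {"sha256": hashlib.sha256(data).hexdigest(), "spans": list(spans)}


def extract(data: bytes, index, i):
    """Return fragment i, refusing if the bytes changed since indexing."""
    if hashlib.sha256(data).hexdigest() != index["sha256"]:
        raise ValueError("file changed since the index was built; re-index")
    start, end = index["spans"][i]
    return data[start:end]


# Hypothetical source file, held in memory to keep the example self-contained.
original = b"<doc>\r\n<p>first</p>\r\n<p>second</p>\r\n</doc>"
start = original.index(b"<p>second")
end = original.index(b"</p>", start) + len(b"</p>")
index = build_index(original, [(start, end)])
fragment = extract(original, index, 0)

# Another tool rewrites the line endings: same infoset, different bytes.
rewritten = original.replace(b"\r\n", b"\n")
try:
    extract(rewritten, index, 0)
    stale_detected = False
except ValueError:
    stale_detected = True
```

The rewritten copy is still the same document as far as any XML tool is concerned, but every offset after the first line ending has moved, which is why the digest has to cover the raw bytes.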

> (think extracting 
> snippets and highlighting in search results).  This can't be done 
> reliably in a SAX or StaX handler if the parser alters text in a 
> non-reversible manner: you can make a guess if you know what the 
> original line endings were, but if they're mixed all bets are off.  
> Currently one has to use HTML parsers for this.

HTML parsers also normalise line endings though, no? Both HTML and XML
inherit some of this from SGML.
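
On the XML side, at least, the normalisation is easy to observe. A small sketch using Python's standard xml.etree.ElementTree (which sits on expat); the sample document is invented for the purpose:

```python
import xml.etree.ElementTree as ET

# Mixed line endings in the source bytes: CRLF, a lone CR, and a lone LF.
source = b"<a>one\r\ntwo\rthree\nfour</a>"

# The parser hands the application a single LF for each of them, so the
# original endings cannot be recovered from the parsed text alone.
text = ET.fromstring(source).text
```

After parsing, text contains three line feeds and no carriage returns, so character positions in it no longer line up with byte positions in source.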

> One more mini-addition: would it be possible to have parsers ignore the 
> BOM at the start of a UTF-8 file?  Some editors seem to insist on 
> creating them, they are allowed by the UTF-8 spec, and probably ought to 
> be considered external to the actual file content.  Also, maybe if we're 
> going to allow multiple root elements we could also allow whitespace in 
> the prolog?   People often put it there, and it seems like something 
> that could be tolerated easily enough.

I have always felt it was a bug in the XML spec that the XML declaration
becomes a regular processing instruction if there's a blank line in
front of it.
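
To see how one parser treats these cases, here is a rough check with Python's xml.etree.ElementTree; the helper name parses is hypothetical and other parsers may behave differently. Because xml is a reserved processing-instruction target, expat rejects the misplaced declaration outright instead of passing it through.

```python
import xml.etree.ElementTree as ET


def parses(data: bytes) -> bool:
    """Report whether the expat-based parser accepts the document."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


declaration = b"<?xml version='1.0' encoding='UTF-8'?>\n<a/>"

declaration_first = parses(declaration)           # declaration at byte 0
blank_line_first = parses(b"\n" + declaration)    # blank line in front of it
bom_first = parses(b"\xef\xbb\xbf" + declaration)  # UTF-8 BOM in front of it
```

With this parser the blank-line variant is refused while the BOM variant is accepted, so the BOM is already treated as external to the content here.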

> Yeah, I disagree about entities (and therefore DTDs).  Let me try to 
> explain why, briefly, and then I promise to stop whining about it. The 
> problem w/DTDs (and entity decls defined in them) as I see it is they 
> introduce a dependence on an external file.

They don't have to - you can put everything in one file.
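
A minimal sketch of the one-file approach, with the entity declared in the internal DTD subset. The entity value is made up, and Python's xml.etree.ElementTree is used only because it expands internal entities without fetching anything external:

```python
import xml.etree.ElementTree as ET

# The declaration lives in the internal subset, so no external file is needed.
document = """<!DOCTYPE doc [
  <!ENTITY productName "Example Product 1.0">
]>
<doc>Welcome to &productName;.</doc>"""

text = ET.fromstring(document).text
```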

>  If entities were defined by 
> the standard (and built in to parsers), or were required to be defined 
> inline, that would remove my objections.

You can't really define &productName; in the XML spec to be, say,
"Internet Explorer 12.1" :-), and the Unicode long character names are
all in English, which is obviously not OK.  The ISO SGML entities are
insane.  You're right that a goal of XInclude was to reduce the need for
entities; there are still places where they're used and XInclude can't
be, e.g.
    href="&server;&docroot;intro/chapter6.xml"
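
A sketch of that attribute case, assuming invented values for the two entities and a hypothetical chapter element; the parser is again Python's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

# Entity references are expanded inside the attribute value, which is
# something an XInclude element cannot do from within an attribute.
document = """<!DOCTYPE book [
  <!ENTITY server "http://example.org">
  <!ENTITY docroot "/docs/">
]>
<book><chapter href="&server;&docroot;intro/chapter6.xml"/></book>"""

href = ET.fromstring(document).find("chapter").get("href")
```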


> On restriction to UTF-8 (16 if we insist, but really do folks store 
> *files* as UTF-16?)

Yes. Frequently.

> : is this really a problem for non-western 
> languages?

If you manufacture memory and hard drives, then utf-8 is truly
delightful in countries where most characters will be 3 or more
bytes/octets in length in utf-8.
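
The size difference is easy to measure. A quick sketch with an arbitrary Japanese sample string (the little-endian codec is used so that no BOM is counted):

```python
# Eight characters, all in the Basic Multilingual Plane.
sample = "日本語のテキスト"

utf8_size = len(sample.encode("utf-8"))       # three bytes per character here
utf16_size = len(sample.encode("utf-16-le"))  # two bytes per character here
```

For text like this, UTF-8 needs half as much storage again as UTF-16.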

It's also a common misconception that Unicode is a 16-bit character set;
it defines more than 65536 characters, and "surrogate pairs" in
languages like Java make UTF-16 as complex as UTF-8; processing characters
in either UTF-8 or UTF-32 (UCS-4) is the most common choice outside the
Java world as far as I can tell.
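
To illustrate the surrogate-pair point, here is a small sketch using U+1D11E (MUSICAL SYMBOL G CLEF), chosen arbitrarily as a character outside the first 65536 code points:

```python
import struct

clef = "\U0001D11E"  # one character, beyond the 16-bit range

utf8_bytes = len(clef.encode("utf-8"))             # bytes in UTF-8
utf16_units = len(clef.encode("utf-16-le")) // 2   # 16-bit code units
utf32_units = len(clef.encode("utf-32-le")) // 4   # 32-bit code units

# Split the UTF-16 form into its high and low surrogates.
high, low = struct.unpack("<2H", clef.encode("utf-16-le"))
```

One character takes two code units in UTF-16, so code that indexes by 16-bit unit has to handle variable width just as UTF-8 code does.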

Liam

-- 
Liam Quin - XML Activity Lead, W3C, http://www.w3.org/People/Quin/
Pictures from old books: http://fromoldbooks.org/
Ankh: irc.sorcery.net irc.gnome.org www.advogato.org

