
Re: XSD - Validator problem

  • From: Tony Graham <Tony.Graham@M...>
  • To: xml-dev@l...
  • Date: Wed, 03 Oct 2007 10:52:53 +0100

On Wed, Oct 03 2007 10:07:55 +0100, mike@s... wrote:
>> In UTF-8, at the start of a file it is
>> just a nonsense character, useless, out-of-place, a sign of 
>> bad programming, and it messes up encoding detectors
>
> I'd have said it's a three-byte sequence which tells you pretty reliably
> that you're dealing with a UTF-8 encoded file - provided of course that you
> are looking for it. Agreed, BOM is a misnomer.
>
> I'm not defending the decision to add it to the XML spec by means of an
> erratum, however.

The usefulness or otherwise of U+FEFF in UTF-8 has been subject to
reinterpretation over the years.

In the Unicode Standard 2.0, there was no mention of U+FEFF with UTF-8,
either in the section on the BOM or in the appendix defining UTF-8.

In the Unicode Standard 3.0, section 13.6, "Specials", includes:

   Although there are never any questions of byte-order with UTF-8 text,
   this sequence can serve as signature for UTF-8 encoded text where the
   character set is unmarked.

In the Unicode Standard 5.0, section 3.10, "Unicode Encoding Schemes",
includes:

   While there is obviously no need for a byte order signature when
   using UTF-8, there are occasions when processes convert UTF-16 or
   UTF-32 data containing a byte order mark into UTF-8. When represented
   in UTF-8, the byte order mark turns into the byte sequence <EF BB
   BF>. Its usage at the beginning of a UTF-8 data stream is neither
   required nor recommended by the Unicode Standard, but its presence
   does not affect conformance to the UTF-8 encoding scheme.
   Identification of the <EF BB BF> byte sequence at the beginning of a
   data stream can, however, be taken as a near-certain indication that
   the data stream is using the UTF-8 encoding scheme.

So it's gone from irrelevant to useful to "Oh, if you must".
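
For anyone who wants to see the byte sequence from the 5.0 text for themselves, here is a minimal Python sketch. The function names are my own, purely for illustration, and it assumes nothing beyond the standard library's codecs module and the "utf-8-sig" codec:

```python
import codecs

# U+FEFF encoded as UTF-8 is the three-byte sequence <EF BB BF>.
UTF8_SIGNATURE = "\ufeff".encode("utf-8")


def has_utf8_signature(data: bytes) -> bool:
    """Return True if the byte stream starts with <EF BB BF>."""
    return data.startswith(codecs.BOM_UTF8)


def utf16_to_utf8(data: bytes) -> bytes:
    """Naively transcode UTF-16LE bytes to UTF-8, keeping any leading U+FEFF.

    This mimics the "processes convert UTF-16 ... into UTF-8" case:
    the byte order mark is carried over as an ordinary character.
    """
    return data.decode("utf-16-le").encode("utf-8")


def decode_utf8(data: bytes) -> str:
    """Decode UTF-8, dropping a leading signature if one is present."""
    return data.decode("utf-8-sig")
```

A UTF-16LE stream that begins FF FE therefore comes out of utf16_to_utf8() beginning EF BB BF, and decode_utf8() gives the same text whether or not the signature is there.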

(BTW, in other reinterpretations, "Unicode Encoding Scheme" results from
splitting the meaning of "UTF" into encoding forms and encoding schemes,
and the use of U+FEFF as a zero width no-break space is deprecated these
days.)

The Unicode FAQ both lists its use as a signature [1] and says to avoid
its use where "byte oriented protocols expect ASCII characters at the
beginning of a file" [2].  However, I don't think that XML necessarily
counts as one such byte oriented protocol.
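
As a rough illustration of that last point, and not a statement about any particular validator, the sketch below feeds the same document with and without the signature to Python's xml.etree.ElementTree. The helper name is hypothetical, and the behaviour shown is that of the expat-based parser bundled with Python when it is given bytes rather than an already decoded string:

```python
import xml.etree.ElementTree as ET


def root_tag(document: bytes) -> str:
    """Parse an XML document supplied as raw bytes and return the root tag."""
    return ET.fromstring(document).tag


PLAIN = b"<?xml version='1.0' encoding='UTF-8'?><doc/>"
SIGNED = b"\xef\xbb\xbf" + PLAIN  # same document with a leading <EF BB BF>

# Both forms are expected to parse to the same root element.
print(root_tag(PLAIN), root_tag(SIGNED))
```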

Regards,


Tony Graham.
======================================================================
Tony.Graham@M...   http://www.menteithconsulting.com

Menteith Consulting Ltd             Registered in Ireland - No. 428599
Registered Office: 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
----------------------------------------------------------------------
Menteith Consulting -- Understanding how markup works
======================================================================

[1] http://www.unicode.org/faq/utf_bom.html#29
[2] http://www.unicode.org/faq/utf_bom.html#28

