
UTF-8, BOM [Was: nextml]

  • From: Tony Graham <Tony.Graham@MenteithConsulting.com>
  • To: xml-dev@lists.xml.org
  • Date: Fri, 10 Dec 2010 08:37:09 +0000

On Thu, Dec 09 2010 05:56:24 +0000, liam@w3.org wrote:
> On Thu, 2010-12-09 at 00:37 -0500, Michael Sokolov wrote:
>> One more mini-addition: would it be possible to have parsers ignore the 
>> BOM at the start of a UTF-8 file?  Some editors seem to insist on 
>> creating them, they are allowed by the UTF-8 spec, and probably ought to 
>> be considered external to the actual file content.  Also, maybe if we're 

The definition of the BOM/ZWNBSP, the role of the BOM with UTF-8, and the
prominence of UTF-8 in the Unicode Standard have changed over time with
successive versions of the Unicode Standard [2].  The discussion of
detecting character encoding has also changed over time in successive
editions of XML 1.0.

You could review UTF-8 and BOM on the basis that much has changed since
the first XML 1.0 spec.
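
For illustration only, here is a minimal Python sketch of what "ignoring
the BOM" could look like on the consuming side.  The helper name
strip_utf8_bom is hypothetical and the sample document is invented; this
is not how any particular XML parser does it.

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data


# Invented sample input: the same document with and without a BOM.
without_bom = b'<?xml version="1.0"?><doc/>'
with_bom = codecs.BOM_UTF8 + without_bom

print(strip_utf8_bom(with_bom) == without_bom)     # True
print(strip_utf8_bom(without_bom) == without_bom)  # True

# Python's "utf-8-sig" codec drops a leading BOM while decoding.
print(with_bom.decode("utf-8-sig") == without_bom.decode("utf-8"))  # True
```

Only a BOM at the very start is removed; U+FEFF anywhere else in the
data is left alone.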

>> going to allow multiple root elements we could also allow whitespace in 
>> the prolog?   People often put it there, and it seems like something 
>> that could be tolerated easily enough.
> I have always felt it was a bug in the XML spec that the XML declaration
> becomes a regular processing instruction if there's a blank line in
> front of it.

It makes it usable as a file signature for the OS.  (If "<?xml" seems a
bit much, try EPUB, where you have to read the first 50+ bytes of a Zip
archive file [1].)
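
As a rough sketch of the file-signature idea, the Python below checks
whether a byte string starts with "<?xml" followed by whitespace at
offset 0.  The function name is hypothetical, the check assumes an
ASCII-compatible encoding with no BOM, and the sample inputs are made up.

```python
XML_DECL_PREFIX = b"<?xml"
XML_WHITESPACE = b" \t\r\n"  # the whitespace characters XML 1.0 allows


def starts_with_xml_declaration(data: bytes) -> bool:
    """True if data begins with '<?xml' followed by whitespace at byte 0.

    Assumes an ASCII-compatible encoding and no BOM; UTF-16 input would
    need a different byte pattern.
    """
    n = len(XML_DECL_PREFIX)
    return (
        data.startswith(XML_DECL_PREFIX)
        and len(data) > n
        and data[n] in XML_WHITESPACE
    )


print(starts_with_xml_declaration(b'<?xml version="1.0"?><doc/>'))      # True
# A blank line in front means the signature is no longer at offset 0.
print(starts_with_xml_declaration(b'\n<?xml version="1.0"?><doc/>'))    # False
# A different processing instruction whose target merely begins with "xml".
print(starts_with_xml_declaration(b'<?xml-stylesheet href="a.xsl"?>'))  # False
```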

>> On restriction to UTF-8 (16 if we insist, but really do folks store 
>> *files* as UTF-16?)
> Yes. Frequently.
>> : is this really a problem for non-western 
>> languages?
> If you manufacture memory and hard drives, then utf-8 is truly
> delightful in countries where most characters will be 3 or more
> bytes/octets in length in utf-8.

Liam's roundabout way of saying YMMV.
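
To put numbers on that, a small Python sketch comparing encoded sizes.
The two sample strings are my own choices, and "utf-16-le" is used so
that no BOM is counted in the totals.

```python
def encoded_sizes(text: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count without a BOM)."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Three CJK characters (U+65E5 U+672C U+8A9E): 3 bytes each in UTF-8,
# 2 bytes each in UTF-16.
print(encoded_sizes("\u65e5\u672c\u8a9e"))  # (9, 6)

# Three ASCII characters: 1 byte each in UTF-8, 2 bytes each in UTF-16.
print(encoded_sizes("XML"))  # (3, 6)
```

Real documents mix markup (mostly ASCII) with text, so the overall ratio
depends on the content.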

> It's also a common misconception that Unicode is a 16-bit character set;
> it defines more than 65536 characters, and "surrogate pairs" in
> languages like Java make utf16 as complex as utf8; processing characters

UTF-8 is easier, probably, since you don't have surrogate pairs in UTF-8.

> in either utf-8 or ucs-32 are the most common choices outside the Java
> world as far as I can tell.
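
A short Python sketch of the surrogate-pair point, using U+1D11E (a
character outside the first 65536 code points) as an arbitrary example.
The helper to_surrogate_pair is hypothetical and written out only to
show the arithmetic.

```python
def to_surrogate_pair(code_point: int) -> tuple[int, int]:
    """Split a code point above U+FFFF into its UTF-16 surrogate pair."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point does not need a surrogate pair")
    offset = code_point - 0x10000
    high = 0xD800 + (offset >> 10)   # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)  # bottom 10 bits
    return high, low


clef = "\U0001D11E"  # one character, U+1D11E

high, low = to_surrogate_pair(ord(clef))
print(hex(high), hex(low))             # 0xd834 0xdd1e

# UTF-16 needs two 16-bit code units for this single character ...
print(clef.encode("utf-16-be").hex())  # d834dd1e
# ... while UTF-8 uses one four-byte sequence and no surrogates.
print(clef.encode("utf-8").hex())      # f09d849e
```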


Tony Graham                         Tony.Graham@MenteithConsulting.com
Director                                  W3C XSL FO SG Invited Expert
Menteith Consulting Ltd                               XML Guild member
XML, XSL and XSLT consulting, programming and training
Registered Office: 13 Kelly's Bay Beach, Skerries, Co. Dublin, Ireland
Registered in Ireland - No. 428599   http://www.menteithconsulting.com
  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --  --
xmlroff XSL Formatter                               http://xmlroff.org
xslide Emacs mode                  http://www.menteith.com/wiki/xslide
Unicode: A Primer                               urn:isbn:0-7645-4625-2

[1] Section 4 in http://www.idpf.org/ocf/ocf1.0/download/ocf10.htm
[3] http://inasmuch.as/2007/10/03/bom-in-utf-8-good-bad-or-ugly/

  • References:
    • nextml
      • From: Amelia A Lewis <amyzing@talsever.com>
    • Re: nextml
      • From: Michael Sokolov <sokolov@ifactory.com>
    • Re: nextml
      • From: Liam R E Quin <liam@w3.org>
