Re: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8")

  • From: David Carlisle <davidc@n...>
  • To: costello@m...
  • Date: Thu, 20 Sep 2007 14:08:11 +0100


> These are all ASCII characters. Thus, an XML parser opens the
> document, interprets the bit strings as ASCII characters up to the
> first ">" 

No, as was said earlier, the first few bytes of the file do not need to be
read as ASCII (and must not be for several popular encodings, such as
UTF-16).

It's true that the characters that appear in an encoding declaration
all have an ASCII encoding, but there is no requirement that the byte
sequence representing the encoding declaration uses the ASCII
encoding.

  These are all ASCII characters. Thus, an XML parser opens the document,
  interprets the bit strings as ASCII characters up to the first ">"
  character. From then on, it interprets the rest of the document using
  the encoding it finds in the XML declaration. 

The entire document, including the encoding declaration, is read
using the same encoding.
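
To make that concrete, here is a minimal sketch (my own illustration, not
anything from the XML specification) that encodes one declaration two ways.
The helper name ascii_prefix_found is hypothetical; it stands in for a
parser that naively looks for ASCII bytes at the start of the file.

```python
# The same declaration text, serialised in two different encodings.
decl = '<?xml version="1.0" encoding="UTF-16"?>'

utf8_bytes = decl.encode("utf-8")
utf16_bytes = decl.encode("utf-16-le")  # no BOM, to keep the comparison direct

# In UTF-8 the declaration's bytes happen to coincide with ASCII.
assert utf8_bytes == decl.encode("ascii")

# In UTF-16 every character takes two bytes, so the stream starts 3C 00 3F 00.
first_four = utf16_bytes[:4]


def ascii_prefix_found(data: bytes) -> bool:
    """Hypothetical naive check: does the stream start with ASCII '<?xml'?"""
    return data.startswith(b"<?xml")


print(ascii_prefix_found(utf8_bytes))   # True
print(ascii_prefix_found(utf16_bytes))  # False
print(first_four.hex(" "))              # 3c 00 3f 00
```

The UTF-16 bytes only turn back into the declaration when the whole stream,
declaration included, is decoded as UTF-16.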



> Algorithm for Detection of the Character Encoding when there is no
> Internal Encoding Label

That isn't the same as the algorithm given in the XML specification.
There, if there is no external metadata and no XML declaration, the file
has to be in UTF-16 or UTF-8, and the BOM is optional for UTF-8. So if the
file has no BOM, the parser does not "give up": the file is treated as if
UTF-8 were specified.
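
A simplified sketch of that fallback, loosely following the (non-normative)
autodetection appendix of XML 1.0. The function name and the returned labels
are my own assumptions, and several cases from the appendix (EBCDIC, UTF-32
without a BOM) are left out to keep it short.

```python
import codecs


def guess_encoding_family(data: bytes) -> str:
    """Guess an encoding family from the first bytes, with no external metadata.

    Hypothetical helper for illustration; not a complete implementation.
    """
    # 1. Byte order marks. UTF-32 is tested first because its little-endian
    #    BOM (FF FE 00 00) begins with the UTF-16 little-endian BOM (FF FE).
    boms = [
        (codecs.BOM_UTF32_LE, "utf-32-le"),
        (codecs.BOM_UTF32_BE, "utf-32-be"),
        (codecs.BOM_UTF8, "utf-8"),
        (codecs.BOM_UTF16_LE, "utf-16-le"),
        (codecs.BOM_UTF16_BE, "utf-16-be"),
    ]
    for bom, name in boms:
        if data.startswith(bom):
            return name

    # 2. No BOM: see how "<?" would be laid out if a declaration is present.
    if data.startswith(b"\x00<\x00?"):
        return "utf-16-be"
    if data.startswith(b"<\x00?\x00"):
        return "utf-16-le"
    if data.startswith(b"<?xml"):
        # ASCII-compatible bytes: read the declaration for the exact name
        # (which is UTF-8 if the declaration names no encoding).
        return "ascii-compatible"

    # 3. No BOM and no declaration: do not give up, treat the file as UTF-8.
    return "utf-8"


print(guess_encoding_family(b"<doc/>"))  # utf-8
```

The last return is the point being made here: the absence of both a BOM and
a declaration is a defined case, not a failure.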

Recommendation 3
  HTTP Header: specifying the encoding in an HTTP header is
  unreliable. When exchanging XML or HTML documents using the HTTP
  protocol, don't specify the Content-Type in the HTTP header. This will
  force applications to look inside the document for encoding
  information. 

is explicitly the opposite of the RFC that defines the XML MIME
types, so while there are arguments on both sides, I think it's dangerous
to state it as such a clear recommendation. In the case of text/* MIME
types (at least), I believe that the default charset is Latin-1, so
effectively you _can't_ omit the charset: even if you don't specify it
explicitly, the receiver is supposed to act as if ISO-8859-1 is specified.
That means that if you don't specify a charset in the MIME headers,
any UTF-8 document that has a non-ASCII character in it will be
parsed as ISO-8859-1 and generate a fatal encoding error.
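
A small sketch of the misreading, at the byte level only. The helper
decode_body and its default are assumptions made for illustration (they
model a receiver that falls back to ISO-8859-1 when no charset is given);
how a particular parser reports the mismatch is not shown.

```python
from typing import Optional


def decode_body(body: bytes, charset: Optional[str] = None) -> str:
    """Hypothetical receiver: fall back to ISO-8859-1 when no charset is sent."""
    return body.decode(charset or "iso-8859-1")


text = "caf\u00e9"          # "café": one non-ASCII character
wire = text.encode("utf-8")  # the sender writes UTF-8 bytes

with_charset = decode_body(wire, "utf-8")  # charset stated in the headers
without_charset = decode_body(wire)        # charset omitted

print(with_charset == text)     # True
print(without_charset == text)  # False: the two UTF-8 bytes of "é" are read as two characters
```

A pure ASCII document survives either way, which is why the problem tends to
show up only once real non-ASCII data arrives.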

David


