
RE: [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8")

  • From: "Rudick, Tom" <tmrudick@m...>
  • To: <xml-dev@l...>
  • Date: Thu, 20 Sep 2007 11:36:23 -0400

Hello All,
 
I have been following this thread regarding XML documents and their
character encodings.  I still don't quite understand how to tell what
the encoding of an XML document is when there is no external
information to go on.  
 
As discussed, you can either specify an encoding via HTTP headers
(externally), or in the XML document instead (internally).
 
If the HTTP headers do not indicate what the encoding of the document
is, you must read the document (at least the first line) and figure out
what the encoding is.  However, how is this accomplished?  If you don't
know the encoding of the document to begin with, how can you read even
the first line?
 
After reading http://www.w3.org/TR/REC-xml/#sec-guessing, it seems
that instead of reading what <?xml encoding="utf-8"?> has to say,
parsers simply look at the first few octets of the document and compare
them to several known encodings of the text <?xml.  Then, they just
continue to read the rest of the document.  If parsers never actually
use the encoding attribute, is there any reason to have it other than
for human-readability?
 
Are there any encodings that have the same encoding of <?xml but
completely different encodings for other characters?

Does anyone have any further information on how exactly XML parsers
auto-detect character encodings within XML documents?
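
For reference, the procedure in the appendix linked above can be sketched roughly as follows. This is only an illustration, not the code of any particular parser: the function name sniff_encoding, the 200-byte window, and the tables are assumptions made for the example. The first octets only narrow the document down to a family (UTF-32, UTF-16, ASCII-compatible, EBCDIC). Within the ASCII-compatible family, encodings such as UTF-8 and ISO-8859-1 all encode <?xml identically, so the encoding attribute is what tells them apart.

```python
import codecs
import re

# Byte order marks. UTF-32 comes first because the UTF-32-LE mark
# begins with the same two bytes as the UTF-16-LE mark.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]

# How the start of "<?xml" looks in each encoding family without a BOM.
PREFIXES = [
    (b"\x00\x00\x00<", "utf-32-be"),
    (b"<\x00\x00\x00", "utf-32-le"),
    (b"\x00<\x00?", "utf-16-be"),
    (b"<\x00?\x00", "utf-16-le"),
    (b"<?xm", "ascii"),              # any ASCII-compatible encoding
    (b"\x4c\x6f\xa7\x94", "cp037"),  # some flavour of EBCDIC
]

DECLARED = re.compile(
    r'''<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']'''
)


def sniff_encoding(data: bytes) -> str:
    """Guess the encoding of an XML document from its leading bytes."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    for prefix, family in PREFIXES:
        if data.startswith(prefix):
            break
    else:
        return "utf-8"  # no BOM and no XML declaration: the default
    if family not in ("ascii", "cp037"):
        return family  # width and byte order already identify the encoding
    # The declaration only uses characters the whole family agrees on,
    # so it can be read before the exact encoding is known.
    head = data[:200].decode(family, errors="replace")
    match = DECLARED.match(head)
    if match:
        return match.group(1).lower()
    return "utf-8" if family == "ascii" else family
```

For example, sniff_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><a/>') returns "iso-8859-1", which the leading bytes alone could not have revealed.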
 
Thanks,
-Tom

-----Original Message-----
From: David Carlisle [mailto:davidc@n...] 
Sent: Thursday, September 20, 2007 10:03 AM
To: Costello, Roger L.
Cc: xml-dev@l...
Subject: Re:  [Summary] Why is Encoding Metadata (e.g. encoding="UTF-8")
put Inside the XML Document?



> 
> An XML Parser will make an initial "guess" of the encoding based upon
> the presence or absence of a Byte Order Mark (BOM). The XML parser then
> interprets the bit strings using that guess up to the first ">"
> character (the end of the XML declaration).
> 

If the encoding isn't known in advance then (in theory) you don't know
where the first > is (as you don't know how > is encoded).


> Now that it knows the "real" encoding it interprets the rest of the
> document using the encoding it found in the XML declaration.

That still makes it sound as if the encoding declaration is read using a
different encoding from the rest of the document. Once an encoding has
been determined then the encoding declaration line itself must be
consistent with that encoding. You can't use one-byte-per-character
ASCII
<?xml version="1.0" encoding="utf-16"?>
and then read the rest of the file using two (or four) bytes per
character.
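
A small sketch of that inconsistency, assuming Python's built-in codecs stand in for the parser's decoders (the variable names are made up for the example): the declaration is stored one byte per character, so decoding the same bytes as UTF-16 yields no declaration at all.

```python
# The declaration claims utf-16 but is stored as one-byte-per-character ASCII.
raw = b'<?xml version="1.0" encoding="utf-16"?>\n<foo/>'

# Read as ASCII, the declaration is visible.
as_ascii = raw.decode("ascii")

# Read as UTF-16, every pair of ASCII bytes fuses into one unrelated
# character, so there is no "<?xml" (and no "<" at all) to be found.
as_utf16 = raw.decode("utf-16-be", errors="replace")

print(as_ascii.startswith("<?xml"))  # True
print("<" in as_utf16)               # False
```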

Suppose I have an encoding "my-encoding" that's the same as ASCII
except that > and < are swapped round. Then the following is a
well-formed document:

>?xml version="1.0" encoding="my-encoding"<
>foo<hello>/foo<


The parser knows it's been handed an XML file and can tell that it's not
going to parse as UTF-8, so there must be an XML declaration, so the first
few bytes must encode "<?xml". It sees the bytes it sees, and the only
encoding it knows about in which that sequence encodes "<?xml" is the
"my-encoding" encoding, so it proceeds on that basis, which means it
successfully finds encoding="my-encoding" and knows all is well...
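
A toy version of that reasoning, as a sketch only: "my-encoding" is the made-up encoding from the example above, and the helper names (decode_my_encoding, guess_from_prefix, KNOWN_PREFIXES) are hypothetical.

```python
# "my-encoding": identical to ASCII except that "<" and ">" are swapped.
SWAP = bytes.maketrans(b"<>", b"><")


def decode_my_encoding(data: bytes) -> str:
    """Decode bytes written in the hypothetical my-encoding."""
    return data.translate(SWAP).decode("ascii")


# The byte sequences this toy parser recognises as encoding "<?xml".
KNOWN_PREFIXES = {
    b"<?xml": "ascii",
    b">?xml": "my-encoding",
}


def guess_from_prefix(data: bytes) -> str:
    """Pick the only known encoding in which the first bytes spell '<?xml'."""
    for prefix, name in KNOWN_PREFIXES.items():
        if data.startswith(prefix):
            return name
    return "utf-8"  # no declaration recognised: fall back to the default


raw = b'>?xml version="1.0" encoding="my-encoding"<\n>foo<hello>/foo<\n'

guessed = guess_from_prefix(raw)
text = decode_my_encoding(raw)
print(guessed)  # my-encoding
print(text)
```

The decoded text begins with <?xml version="1.0" encoding="my-encoding"?>, so the declared name agrees with the guess made from the first few bytes.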

David
