
Re: encoding problem fixed

  • From: "David Brownell" <david-b@p...>
  • To: "John Cowan" <cowan@l...>, "XML Dev" <xml-dev@i...>
  • Date: Fri, 30 Jul 1999 09:43:57 -0700

----- Original Message -----
From: John Cowan <cowan@l...>
To: XML Dev <xml-dev@i...>
Sent: Friday, July 30, 1999 7:59 AM
Subject: Re: encoding problem fixed


> James Tauber wrote:
>
> > In other words, rather than creating an InputSource using a FileReader,
> > I used James Clark's "fileInputSource" method in XT to make a URL out
> > of a file and create the InputSource from the URL string.
>
> Yes, indeed.  You should never use a Reader of any sort when processing
                           ^^^^^ wrong !!!
> XML (unless you have a non-standard Reader class that understands the
> XML declaration).  Always use an InputSource so that the parser can
> install its own bytes-to-chars converter based on the declaration.

Actually, that's not correct either.  My general advice is to pass a
URI to the parser -- which is required to do the correct thing! -- and
in those rare cases that can't be done:

    * If the data is externally typed according to character set,
      you MUST use some Reader ... e.g. given a MIME type of
      "application/xml;charset=Big5", use a Reader set up for the
      "Big5" encoding (a Chinese encoding).  There isn't much choice
      of classes: InputStreamReader, or a custom reader that
      understands that encoding.

    * If the data is NOT externally typed, then you MUST rely on
      the XML parser's autodetection ... pass an InputStream.

Remember, with external typing (e.g. MIME objects) the MIME type is
authoritative.  And XML/text declarations are optional; for the top
level document, the "encoding=..." is also optional.  Autodetection
will not work in all cases ... which is why the notion of "always
use an InputStream" is incorrect.
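The two cases above can be sketched roughly as follows. This is only an illustration, not code from any particular parser: the class name EncodingAwareSource, the method forStream, and the convention that a null charset means "no external typing" are all assumptions made for the example.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import org.xml.sax.InputSource;

// Hypothetical helper: picks Reader vs. InputStream depending on whether
// the data came with an external charset label (e.g. from a MIME type).
public class EncodingAwareSource {

    /**
     * @param in              raw bytes of the XML entity
     * @param externalCharset charset name taken from external typing,
     *                        or null when there is none (assumed convention)
     */
    public static InputSource forStream(InputStream in, String externalCharset) {
        if (externalCharset != null) {
            // Externally typed: the label is authoritative, so decode the
            // bytes ourselves and hand the parser characters.
            Reader reader = new InputStreamReader(in, Charset.forName(externalCharset));
            return new InputSource(reader);
        }
        // Not externally typed: hand over raw bytes so the parser can
        // autodetect from the byte order mark and XML declaration.
        return new InputSource(in);
    }
}
```

For a body labelled "application/xml;charset=Big5" the caller would pass "Big5" as the second argument, provided the JVM in use supports that charset; extracting the charset parameter from the MIME type is left out of the sketch.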

Those using Sun's parser will notice a "Resolver" class that has a
static method accepting a MIME type, which is interpreted according to
the relevant RFC, and another static method taking a "File", which
ignores the JVM's default file encoding and autodetects instead --
better than any system default in that case!


> > The culprit is FileReader. It is the one doing the strange "read UTF-8
> > as Windows code page".
>
> Actually, it's doing what it's expected to: reading the native charset,
> CP-1252.  (Unix JVMs use 8859-1 instead.)

Those are actually system-specific defaults ... many localized versions
of those environments work differently.  For example UNIX JVMs may well
use the "EUC-JP" coding in Japan, or MS-Windows the "Shift_JIS".


>     It has no way of knowing that
> *you* think the document charset is UTF-8.

The InputStreamReader class can be told such stuff, and you can create
one from a FileInputStream.
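As a minimal sketch of that point, assuming the file is known to be UTF-8: wrap a FileInputStream in an InputStreamReader with an explicit charset instead of using FileReader, which uses the platform default. The class name Utf8FileReading and the method readAll are made up for the example, and StandardCharsets requires a much newer JDK than the ones discussed in this thread.

```java
import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical example class: decodes a file as UTF-8 regardless of the
// platform default encoding.
public class Utf8FileReading {

    public static String readAll(File file) throws IOException {
        // The charset is stated explicitly, so the result does not depend
        // on the locale or operating system the JVM runs on.
        try (Reader reader = new InputStreamReader(
                new FileInputStream(file), StandardCharsets.UTF_8)) {
            StringBuilder text = new StringBuilder();
            char[] buffer = new char[4096];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
            return text.toString();
        }
    }
}
```

The same Reader could be passed to an InputSource, but only when the encoding really is known in advance.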

Another fix, for JDK 1.2 conformant JVMs, is to construct the URI for
the relevant file and construct the InputSource like this:

    new InputSource (new File (path).toURL ().toString ())

In fact, my own basic guidance is never to pass any sort of I/O stream
(InputStream -or- Reader!) to the parser; let the parser work from the
URI, if at all possible.  It's normally quite possible, and the parser
is a lot less likely to handle the encodings wrong than application
code!!
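A rough sketch of the URI-only approach, using the JAXP SAX API that ships with current JDKs (it postdates this thread). The class name ParseByUri and the method textOf are invented for the example. It uses File.toURI() because File.toURL() is deprecated on current JDKs.

```java
import java.io.File;
import java.io.IOException;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class: the application never opens the file itself;
// the parser resolves the system identifier and chooses the decoder.
public class ParseByUri {

    /** Returns the concatenated character data of the document. */
    public static String textOf(File file)
            throws IOException, SAXException, ParserConfigurationException {
        // Only a system identifier (a file: URI) is supplied, with no
        // InputStream and no Reader.
        InputSource source = new InputSource(file.toURI().toString());
        final StringBuilder text = new StringBuilder();
        SAXParserFactory.newInstance().newSAXParser().parse(source, new DefaultHandler() {
            @Override
            public void characters(char[] ch, int start, int length) {
                text.append(ch, start, length);
            }
        });
        return text.toString();
    }
}
```

Because no stream is supplied, the encoding decision (byte order mark, XML declaration, or the UTF-8 default) is left entirely to the parser.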

- Dave





xml-dev: A list for W3C XML Developers. To post, mailto:xml-dev@i...
Archived as: http://www.lists.ic.ac.uk/hypermail/xml-dev/ and on CD-ROM/ISBN 981-02-3594-1
To (un)subscribe, mailto:majordomo@i... the following message;
(un)subscribe xml-dev
To subscribe to the digests, mailto:majordomo@i... the following message;
subscribe xml-dev-digest
List coordinator, Henry Rzepa (mailto:rzepa@i...)


