Re: An XML document is not well-formed if encoding="..."does n
When a BOM occurs inside a file or stream of text, it's supposed to be treated as a zero-width non-breaking space; i.e., a "no-op" character. But you're right that this is another compatibility gotcha. The most recent time I was bitten by BOMs was when trying to use a JavaScript minifier that first concatenated a bunch of JS files together, and it was not happy about the BOMs that ended up in the middle of that stream of code. At the time, I think I found a place in the ECMAScript standard that suggested the BOM was legal and should be considered whitespace, but I just looked again and can't find it.

Still, it seems to me that in most cases where you have multilingual text-based documents (Markdown, to take one example), the benefits of using a BOM are significant.

On Sat, Dec 29, 2012 at 9:53 PM, David Lee <dlee@calldei.com> wrote:
> I'm curious ...
> Considering that UTF16 is a dangerous file format, (I agree it is ... )
> For people who use languages which have predominantly non-latin codepoints ...
> Is UTF8 actually worse than UTF32 - file size wise ?
> And does it matter much ?
>
> When Java was introduced with 16 bit chars I remember the huge debate about how wasteful that was ... but now rarely hear it,
> (except that handling > 16 bit codepoint chars is still difficult).
>
> What about UTF8 vs UTF32 ?
>
> There definitely is an advantage to a fixed byte-per-char format ... But if someone had the Iron Fist to Declare "Thou Shalt Use ..."
> Would UTF8 be that bad ? Consider that very often when filesize is an issue compression is used ... so the "raw" file size is not nearly as important as it used to be.
>
> As for BOM's ... I personally am not fond of them. On first glance they seem great ... like the "File Types" of Yore ... (which thank goodness Unix got rid of ...)
>
> But the problem with BOM's IMHO, like file types, ... is that they assume that you are dealing with files, and/or that all sequences of bytes have a known start ... aka "The Beginning", where you would put a BOM. I suggest that is a historical oddity, and/or too small a subset of real use that it is impractical to count on. What about say blob records in a database ? Streams of data with no beginning or end ?
> I don't think any convention that requires you to have read "the Beginning" will consistently work with text ...
> XML suffers with this assumption as well with the XML declaration declaring the encoding.
> That only works when you have an entire document to look at. Until we can come up with a universal encoding format we have to suffer with out-of-band information to inform a decoder.
>
> -David
>
> -----Original Message-----
> From: Chris Maloney [mailto:voldrani@gmail.com]
> Sent: Saturday, December 29, 2012 9:27 PM
> To: Costello, Roger L.
> Cc: xml-dev@lists.xml.org
> Subject: Re: An XML document is not well-formed if encoding="..." does not match the actual encoding of the characters in the document, right?
>
> Roger wrote:
> >> I would advocate using UTF-8 exclusively
>
> That's what I do with my own files, and what I advocate whenever I have any input to design decisions, but as Liam and others have said, it's not practical to expect everyone to adopt this convention.
>
> What I really want to know is, when can we start freely using BOMs in UTF-8? I really like this idea, because it is a simple, easy way for a text file to "declare" that it is in UTF-8, and eliminate the ambiguity when the text files are passed around. Unfortunately, a lot of software, especially on Linux, still chokes on these.
>
> On a slightly different topic (UTF-16), this discussion reminded me of something else I read a while back, a technical note from the Unicode Consortium advocating for the use of UTF-16 for internal processing (as opposed to file interchange):
> http://unicode.org/notes/tn12/tn12-1.html.
> On the other hand, I just found from a Google search this recent thread on StackExchange, where several people argue that UTF-16 should be considered harmful:
> http://programmers.stackexchange.com/questions/102205/should-utf-16-be-considered-harmful.
> I guess the debate will rage on, but interoperability, on the whole, does seem to be getting better.
>
> Chris
>
> On Sat, Dec 29, 2012 at 2:36 PM, Costello, Roger L. <costello@mitre.org> wrote:
>> Hi Folks,
>>
>> I spoke with George Cristian Bina from oXygen XML and he gave me the scoop on how things work inside oXygen.
>>
>> George told me to do this:
>>
>> 1. Create an iso-8859-1 encoded XML file.
>>
>> 2. Using a hex editor, change encoding="iso-8859-1" to encoding="utf-8"
>>
>> 3. Drag and drop the file into oXygen.
>>
>> 4. oXygen will generate an encoding exception:
>>
>>      Cannot open the specified file. Got a character
>>      encoding exception [snip]
>>
>> Next, here is something George told me. It is mind-blowing:
>>
>>      If you have an iso-8859-1 encoded XML file loaded into oXygen
>>      and change encoding="iso-8859-1" to encoding="utf-8" then
>>      oXygen will automatically change the encoding of every character
>>      in the document to UTF-8.
>>
>> Wow!
>>
>> That is so fantastic, I jumped out of my chair when I read it.
>>
>> I just received this additional information from George:
>>
>>      Please note that the encoding is important only when the file is loaded
>>      and saved. When the file is loaded the bytes are converted to characters
>>      and then the application works only with characters. When the file is
>>      saved then those characters need to be converted to bytes and the
>>      encoding used will be determined from the XML header with a default to
>>      UTF-8 if no encoding can be detected.
>>
>> /Roger
>>
>> _______________________________________________________________________
>>
>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS to
>> support XML implementation and development. To minimize spam in the
>> archives, you must subscribe before posting.
>>
>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>> subscribe: xml-dev-subscribe@lists.xml.org
>> List archive: http://lists.xml.org/archives/xml-dev/
>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
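
For what it's worth, the minifier problem described at the top of this message (BOMs ending up in the middle of a concatenated stream) can usually be sidestepped by dropping a leading BOM from each file before joining them. Here is a minimal Python sketch of that idea. It assumes every input is already known to be UTF-8; the function name `concat_without_boms` and the two sample JS snippets are made up for illustration and do not come from any particular minifier.

```python
import codecs


def concat_without_boms(chunks):
    """Join UTF-8 byte strings with newlines, dropping a leading BOM from each."""
    bom = codecs.BOM_UTF8  # the three bytes EF BB BF
    cleaned = []
    for chunk in chunks:
        # Only a BOM at the very start of a chunk is removed;
        # a U+FEFF elsewhere in the data is left untouched.
        if chunk.startswith(bom):
            chunk = chunk[len(bom):]
        cleaned.append(chunk)
    return b"\n".join(cleaned)


# Two hypothetical JS sources, each saved with a UTF-8 BOM.
a = codecs.BOM_UTF8 + "var a = 1;".encode("utf-8")
b = codecs.BOM_UTF8 + "var b = 'é';".encode("utf-8")

# Naive concatenation: decoding with "utf-8-sig" strips only the first BOM,
# so the second one survives as U+FEFF in the middle of the text.
naive_text = (a + b).decode("utf-8-sig")

# BOM-aware concatenation: no U+FEFF is left anywhere in the result.
merged_text = concat_without_boms([a, b]).decode("utf-8")
```

If the merged output should itself announce that it is UTF-8, a single `codecs.BOM_UTF8` can be written once at the very start of the result, which keeps the BOM at "The Beginning" where consumers expect it.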