
Re: RE: There is a serious amount of character encodingconvers

  • From: John Cowan <johnwcowan@gmail.com>
  • To: Chris Maloney <voldrani@gmail.com>
  • Date: Fri, 28 Dec 2012 19:30:27 -0500

Argh.  Let's try that again:

> I'd be very interested to hear if any of the XML / character
> encoding gurus on this list have any comments
> or links to updates to this article (which was written in 2004).
>  I am not sure if the issues the author describes have
> been remedied or not.

In 2004, UTF-8 was a noise encoding on the Web: see <http://googleblog.blogspot.com/2012/02/unicode-over-60-percent-of-web.html>.  As of the beginning of 2012, it was more than 60% of the documents visible to Google.  If you count pure ASCII documents as UTF-8, which you can do, it's up at 80%.

If the trend line continues, which is of course not something you can count on, I'd expect to see UTF-8 rise by another 5% or so, though perhaps pure ASCII will drop by about half the same amount, leaving the total situation nearly unchanged.  In short: more than 80% of the Web is now UTF-8 one way or another, and less than 10% is Latin-1 and related encodings, leaving just about 10% for all the rest. (UTF-16 is less than 0.1%, according to Mark Davis.)  Not exactly a ringing endorsement for "publish in any encoding you want" (per the article), is it?


On Fri, Dec 28, 2012 at 2:45 PM, Chris Maloney <voldrani@gmail.com> wrote:
Roger,

Here is a classic post from XML.com that is right in line with the
topic of character encodings that you have been posting about
recently, titled "XML on the web has failed":
http://www.xml.com/pub/a/2004/07/21/dive.html

It takes some work to really grok the problems the author is
describing, but it is well worth it, I think, and may make your head
spin (or hurt, depending).

I'd be very interested to hear if any of the XML / character encoding
gurus on this list have any comments or links to updates to this
article (which was written in 2004).  I am not sure if the issues the
author describes have been remedied or not.

Chris


On Fri, Dec 28, 2012 at 12:17 PM, David Lee <dlee@calldei.com> wrote:
> ---------
>
> You are writing about character encoding conversions as text moves from
> point to point to point.
>
> Is there a parallel with markup? Are there markup conversions as XML moves
> from point to point to point?
>
> Are there lessons learned in the character encoding community that could be
> applied to the XML community?
>
> --------
>
> Markup is text and has the same problems (and solutions).
>
> If we could start over from scratch with what we know now, there would be
> fewer problems.
>
> IMHO, my preferred solution is to stick to a single encoding everywhere (I
> vote for UTF-8 ... as it handles all Unicode code points).
>
> The next step is to make sure *every single link in the chain* uses that
> encoding.
>
> This is amazingly difficult even in "modern" languages like Java, where the
> default behavior when converting bytes to strings is to use the *system
> default encoding*, which is always an unknown.  Even in pure Java you have
> to track every single point where a byte array is converted to a String
> and vice versa, and explicitly set the encoding (or guarantee the system
> encoding is correct).
>
> Then you have to manage all the places where the data enters and leaves
> the program and make sure it's in the right encoding.
>
> Then you have to make sure all the places that *store* the data (like a
> database) don't muck with it.
>
> XML itself cannot solve this problem alone, as an XML document is only the
> payload ...  However, the XML tools tend to be a bit more mature about
> dealing with this.
>
> But not always.
>
> Maybe in another 30 years we will have migrated all our tools to be
> consistent about encodings.
>
> ----------------------------------------
> David A. Lee
> dlee@calldei.com
> http://www.xmlsh.org
>
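
For what it's worth, the Java point in David's message (name the encoding at every byte/String boundary instead of trusting the system default) can be sketched roughly as below. This is only an illustration, not code from anyone on the thread: the class name ExplicitCharset and its two helper methods are made up for the example, and it assumes Java 7 or later for java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class ExplicitCharset {
    // Hypothetical helper: decode bytes as UTF-8 explicitly,
    // instead of new String(bytes), which uses the platform default.
    static String decodeUtf8(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_8);
    }

    // Hypothetical helper: encode a String as UTF-8 explicitly,
    // instead of text.getBytes(), which uses the platform default.
    static byte[] encodeUtf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "caf\u00e9"; // "cafe" with an acute accent on the e

        // U+00E9 takes two bytes in UTF-8, so 4 characters become 5 bytes.
        byte[] utf8 = encodeUtf8(original);
        System.out.println(utf8.length); // prints 5

        // Round trip with the same encoding on both sides is lossless.
        System.out.println(decodeUtf8(utf8).equals(original)); // prints true

        // One link in the chain using a different encoding (here Latin-1)
        // silently corrupts the text rather than failing.
        String wrong = new String(utf8, StandardCharsets.ISO_8859_1);
        System.out.println(wrong.equals(original)); // prints false
    }
}
```

The same idea applies at I/O boundaries: the InputStreamReader and OutputStreamWriter constructors that take a Charset avoid the platform default in the same way.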

_______________________________________________________________________

XML-DEV is a publicly archived, unmoderated list hosted by OASIS
to support XML implementation and development. To minimize
spam in the archives, you must subscribe before posting.

[Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
subscribe: xml-dev-subscribe@lists.xml.org
List archive: http://lists.xml.org/archives/xml-dev/
List Guidelines: http://www.oasis-open.org/maillists/guidelines.php




--
GMail doesn't have rotating .sigs, but you can see mine at http://www.ccil.org/~cowan/signatures

