
Fwd: RE: Fwd: RFC 3548 on The Base16, Base32, and Base64 Data Encodings

  • To: xml-dev@l...
  • Subject: Fwd: RE: Fwd: RFC 3548 on The Base16, Base32, and Base64 Data Encodings
  • From: "Perry A. Caro" <caro@a...>
  • Date: Thu, 17 Jul 2003 11:11:30 -0700

Alessandro wrote:
> I guess you mean choosing 32768 + 1 Unicode characters, assigning a
> numeric (digit) value to each of them (usually different from the code
> point value of the character), expressing a binary block as a sequence
> of such "digits", and encoding the resulting character string in UTF-16.
> If the Unicode characters chosen are all below code point 65536, using
> UTF-16 will yield 2 octets per base-32768 "digit", with one bit lost out
> of 16.  Right?

Right. I experimented with such an encoding for binary in XML earlier this
year, and found it to be both feasible and very efficient, with the very
important caveat that this encoding only makes sense if you are committed to
using UTF-16 encoding only for your XML.  Binary encoded into Base32k text
has the potential to achieve an expansion of only 16/15, or 6%, compared to
the 33% for base64 in the equivalent UTF-8.

A "dumb" transcoding of UTF-16 with Base32k encoded binary to UTF-8 will
result in a cumulative 59% expansion over the original binary data size,
thus the requirement to stick with UTF-16.
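
The expansion figures above can be checked with a little arithmetic. This is only a sketch: it assumes every Base32k "digit" carries 15 bits and is a BMP character at or above U+0800, so it occupies 2 octets in UTF-16 and 3 octets in UTF-8. The function name overhead_percent is made up for illustration. Computed exactly, the ratios come out at about 6.7% and 60%, close to the rounded figures quoted above.

```python
from fractions import Fraction

BITS_PER_DIGIT = 15  # payload bits carried by one Base32k character

# Sanity check of the assumption: a character from the proposed alphabet
# takes 2 octets in UTF-16 and 3 octets in UTF-8.
sample = "\u3400"
assert len(sample.encode("utf-16-be")) == 2
assert len(sample.encode("utf-8")) == 3

# Encoded bits per payload bit for each representation.
base32k_utf16 = Fraction(16, BITS_PER_DIGIT)  # 2 octets per 15 bits
base32k_utf8 = Fraction(24, BITS_PER_DIGIT)   # 3 octets per 15 bits
base64_utf8 = Fraction(4, 3)                  # 4 octets per 3 input octets


def overhead_percent(ratio: Fraction) -> float:
    """Return the size increase over the raw binary, in percent."""
    return float((ratio - 1) * 100)


if __name__ == "__main__":
    for name, ratio in [
        ("Base32k in UTF-16", base32k_utf16),
        ("Base32k in UTF-8", base32k_utf8),
        ("base64 in UTF-8", base64_utf8),
    ]:
        print(f"{name}: {overhead_percent(ratio):.1f}% expansion")
```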

Because of Unicode normalization requirements, it is important to pick an
alphabet of codepoints that are unaffected by normalization, composition, or
decomposition, and that are legal XML of course. I used the following
ranges:

U+3400 thru U+4DB5	for 15-bit values of 0 thru 6581
U+4E00 thru U+9FA5	for 15-bit values of 6582 thru 27483
U+E000 thru U+F4A5	for 15-bit values of 27484 thru 32767

[The above ranges may be off-by-one, I'm typing this off of old notes.]
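
As a rough illustration of the mapping above, here is a minimal sketch of an encoder and decoder. It is not the code I actually used. The names RANGES, digit_to_char, char_to_digit, encode and decode are invented for this example, and the padding rule (zero bits appended to the last digit, with the decoder told the original byte length) is an assumption, since the notes above do not say how a trailing partial digit is handled. Counting the ranges as listed, the third one holds slightly more code points than are needed, so this sketch stops at value 32767, which lands on U+F4A3.

```python
# Code point ranges as listed above (inclusive); treat the end points as
# approximate, per the off-by-one caveat.
RANGES = [(0x3400, 0x4DB5), (0x4E00, 0x9FA5), (0xE000, 0xF4A5)]
DIGIT_BITS = 15
DIGIT_COUNT = 1 << DIGIT_BITS  # 32768 distinct digit values


def digit_to_char(value: int) -> str:
    """Map a 15-bit value to its character by walking the ranges in order."""
    if not 0 <= value < DIGIT_COUNT:
        raise ValueError(f"digit out of range: {value}")
    for first, last in RANGES:
        size = last - first + 1
        if value < size:
            return chr(first + value)
        value -= size
    raise ValueError("ranges do not cover 32768 values")


def char_to_digit(ch: str) -> int:
    """Inverse of digit_to_char."""
    cp = ord(ch)
    base = 0
    for first, last in RANGES:
        if first <= cp <= last:
            value = base + (cp - first)
            if value >= DIGIT_COUNT:
                break  # inside a listed range but beyond the 32768 used here
            return value
        base += last - first + 1
    raise ValueError(f"not a Base32k character: U+{cp:04X}")


def encode(data: bytes) -> str:
    """Pack the bytes into 15-bit digits, zero-padding the final digit."""
    nbits = len(data) * 8
    pad = -nbits % DIGIT_BITS
    bits = int.from_bytes(data, "big") << pad
    ndigits = (nbits + pad) // DIGIT_BITS
    mask = DIGIT_COUNT - 1
    return "".join(
        digit_to_char((bits >> (DIGIT_BITS * (ndigits - 1 - i))) & mask)
        for i in range(ndigits)
    )


def decode(text: str, length: int) -> bytes:
    """Unpack digits into bytes; `length` is the original byte count."""
    bits = 0
    for ch in text:
        bits = (bits << DIGIT_BITS) | char_to_digit(ch)
    pad = DIGIT_BITS * len(text) - 8 * length
    if not 0 <= pad < DIGIT_BITS:
        raise ValueError("length does not match the encoded text")
    return (bits >> pad).to_bytes(length, "big")


if __name__ == "__main__":
    payload = bytes(range(256))
    text = encode(payload)
    assert decode(text, len(payload)) == payload
    print(len(payload), "bytes ->", len(text.encode("utf-16-be")), "UTF-16 octets")
```

The byte length has to travel alongside the text because the digit count alone is ambiguous: two bytes and three bytes both encode to two digits. A reserved padding character would be another way to settle that.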

I was a little worried about using the private use area, since there are no
guarantees about how an XML processor will report them, but there is no
other contiguous range of Unicode codepoints of that size that avoid
normalization issues.

Perry A. Caro
Adobe Systems Incorporated
