Re: Copying text (curly quotes) from Word into an XML document

  • From: "G. Ken Holman" <gkholman@C...>
  • To: xml-dev@l...
  • Date: Sun, 02 Sep 2007 18:33:22 -0400

At 2007-09-02 22:46 +0100, Pete Kirkham wrote:
>Windows-1252 single byte character codes have curly quotes at 147 and
>148 decimal  <http://en.wikipedia.org/wiki/Windows-1252>. These are
>10010011 and 10010100 binary.
>
>UTF-8 multibyte characters start with as many repeated 1s in their
>most significant bits as there are bytes in the sequence, then a zero,
>then data bits. For example, 147 would be 110/00010 10/010011 (slash
>splits control bits from data bits). UTF-8 single byte sequences
>always have 0 as most significant bit. So 10010011 cannot be a single
>byte UTF-8 character (msb is not zero) or the first byte of a
>multi-byte sequence (10 would indicate only one byte, which is not
>valid).
>
> > From: Costello, Roger L. [mailto:costello@m...]
> > QUESTIONS
> >
> > 1. Is the curly quote a valid UTF-8 character?
>Yes, it has the byte sequence hex C2 93

1100 0010 1001 0011 in UTF-8 is Unicode U+0093 which is a control character:

  Unicode character data base:
  0093;<control>;Cc;0;BN;;;;;N;SET TRANSMIT STATE;;;;

The "right single quotation mark" is U+2019:

  2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;

Which would translate into the following three bytes of UTF-8:

E2 80 99

1110 0010 1000 0000 1001 1001
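
The two byte sequences above can be checked with a few lines of Python. This is only a sketch using the language's built-in UTF-8 codec; the helper names as_hex and as_bits are made up for the illustration:

```python
# Decode the two bytes C2 93 as UTF-8 and see which code point comes out.
control = bytes([0xC2, 0x93]).decode("utf-8")

# Encode the real RIGHT SINGLE QUOTATION MARK and keep its UTF-8 bytes.
quote_bytes = "\u2019".encode("utf-8")


def as_hex(data: bytes) -> str:
    """Format bytes as space-separated two-digit hex values."""
    return " ".join(f"{b:02X}" for b in data)


def as_bits(data: bytes) -> str:
    """Format bytes as space-separated groups of four bits."""
    return " ".join(f"{b >> 4:04b} {b & 0x0F:04b}" for b in data)


print(f"C2 93  -> U+{ord(control):04X}")
print(f"U+2019 -> {as_hex(quote_bytes)}")
print(f"          {as_bits(quote_bytes)}")
```

The first line reports U+0093 and the other two reproduce E2 80 99 and its bit pattern.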

BTW, you ask here about the singular "curly quote", yet above you 
ask about "curly quotes" ... the entries in Unicode are:

  2018;LEFT SINGLE QUOTATION MARK;Pi;0;ON;;;;;N;SINGLE TURNED COMMA QUOTATION MARK;;;;
  2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;
  201C;LEFT DOUBLE QUOTATION MARK;Pi;0;ON;;;;;N;DOUBLE TURNED COMMA QUOTATION MARK;;;;
  201D;RIGHT DOUBLE QUOTATION MARK;Pf;0;ON;;;;;N;DOUBLE COMMA QUOTATION MARK;;;;
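
If you don't have the character data base file to hand, the names and general categories can be looked up from Python's unicodedata module. A rough sketch follows; the function describe is a hypothetical helper, and the "<control>" fallback is my own stand-in because the module has no name for control characters:

```python
import unicodedata


def describe(code_point: int) -> tuple:
    """Return (name, general category) for a code point.

    unicodedata.name() has no entry for control characters, so a
    placeholder label is supplied as the default (an assumption made
    only for this illustration).
    """
    ch = chr(code_point)
    return (unicodedata.name(ch, "<control>"), unicodedata.category(ch))


for cp in (0x0093, 0x2018, 0x2019, 0x201C, 0x201D):
    name, category = describe(cp)
    print(f"{cp:04X};{name};{category}")
```

The category column should match the third field of the entries quoted above (Cc, Pi, Pf).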

> > 2. Word uses Windows-1252 encoding, correct?
>Pass

You get your choice ... when you save a text file you can specify 
"Other encoding" and select Unicode, UTF-8, UTF-7, or many others.

> > 3. The curly quote in Windows-1252 has a specific binary sequence, correct?
>Yes, hex 93

From the Wikipedia citation above, I see the following (though I 
don't see formal character names, so I'm guessing these are the Unicode names):

hex 91 is left single quotation mark
hex 92 is right single quotation mark
hex 93 is left double quotation mark
hex 94 is right double quotation mark
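
Rather than guessing, the mapping can be confirmed by decoding each byte with a Windows-1252 codec. Here is a small sketch assuming Python's "cp1252" codec; cp1252_to_unicode is a name invented for the example:

```python
import unicodedata


def cp1252_to_unicode(byte_value: int) -> str:
    """Decode a single Windows-1252 byte into its Unicode character."""
    return bytes([byte_value]).decode("cp1252")


for byte_value in (0x91, 0x92, 0x93, 0x94):
    ch = cp1252_to_unicode(byte_value)
    print(f"hex {byte_value:02X} -> U+{ord(ch):04X} {unicodedata.name(ch)}")
```

Each byte comes back as one of the four U+20xx quotation marks, with the formal Unicode name beside it.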

> > 4. When I copy the curly quote from Word into Notepad, the operating
> > system does a straight 1-1 copy of the binary sequence, correct?
>I believe the encoding of data on the clipboard is indicated by a
>mechanism similar to mimetype and its up to source and target
>applications to set the data and interpret it correctly.

Pass.  It depends on whether the copy works with abstract characters 
or with raw bytes.

> > 5. When I copy the curly quote from Word into Notepad, there is no
> > conversion or translation of the binary sequence by the operating
> > system, correct?
>It's up to the application.

Pass.  I thought the clipboard was Unicode based, so if by "copy" you 
mean using the clipboard, I would assume it would work.  I just copied 
curly quotes from Word to Notepad: when saving as UTF-8 I get the 
Unicode characters, and when saving as "ANSI" I get the Windows-1252 
characters.

So you can experiment likewise with the clipboard and get these results.

> > 6. Assuming the curly quote is a valid UTF-8 character, is the
> > Windows-1252 curly quote binary sequence the same as the UTF-8 curly
> > quote binary sequence?
>No.

Agree.  As shown above, the values hex 91 through 94 are control 
characters when read as Unicode code points (U+0091 through U+0094), 
and displayable characters when read as Windows-1252 bytes.
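
To make the difference concrete, here is a sketch that reads the same single byte, hex 93, under three encodings. ISO-8859-1 ("latin-1") is brought in by me only because it maps each byte straight onto the code point of the same value; it is not part of the original question:

```python
raw = bytes([0x93])

# Windows-1252 treats the byte as a displayable character.
as_cp1252 = raw.decode("cp1252")

# ISO-8859-1 maps the byte onto the code point of the same value.
as_latin1 = raw.decode("latin-1")

# UTF-8 rejects it: 10010011 is a continuation byte with no lead byte.
try:
    raw.decode("utf-8")
    utf8_ok = True
except UnicodeDecodeError:
    utf8_ok = False

print(f"cp1252:  U+{ord(as_cp1252):04X}")
print(f"latin-1: U+{ord(as_latin1):04X}")
print(f"utf-8:   {'decoded' if utf8_ok else 'rejected'}")
```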

> > 7. Is the Windows-1252 curly quote binary sequence illegal in UTF-8,
> > i.e. the Windows-1252 curly quote binary sequence doesn't correspond to
> > any UTF-8 character?
>Yes.

UTF-8 isn't designed for Windows-1252 ... I think you are conflating 
character sets with character encodings.

The test I just did appears to indicate that the abstract character at 
Windows-1252 position 146 is Unicode RIGHT SINGLE QUOTATION MARK, as 
that is what is saved as UTF-8; so it is being translated to the 
proper Unicode value.
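
The same translation can be sketched in code: decode the bytes as Windows-1252 to get abstract characters, then encode those characters as UTF-8. I am only assuming this is roughly what the applications do between the "ANSI" and UTF-8 save options; ansi_to_utf8 is a made-up name:

```python
def ansi_to_utf8(data: bytes) -> bytes:
    """Re-encode Windows-1252 bytes as UTF-8 via abstract characters."""
    text = data.decode("cp1252")   # bytes -> Unicode characters
    return text.encode("utf-8")    # Unicode characters -> UTF-8 bytes


ansi = bytes([146])                # Windows-1252 position 146 (hex 92)
utf8 = ansi_to_utf8(ansi)
print(" ".join(f"{b:02X}" for b in utf8))
```

Note that the bytes are never compared or copied across encodings directly; the abstract character is the only thing the two encodings share.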

> > 8. Suppose I save the Word document as XML, and then I open the XML
> > using Notepad. The curly quotes no longer appear as curly quotes;
> > instead they appear as a bizarre character.  Why does the curly quote
> > now look like a bizarre character in Notepad, whereas when I copied the
> > curly quote from Word to Notepad it looked fine in Notepad?
>Notepad doesn't understand UTF-8 encoded files.

False ... I just opened Notepad and wrote out a file using UTF-8 and 
opened it up again and it was preserved.  An XML processor read the 
file and didn't complain about the encoding.  I'm running XP.
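
A similar experiment can be run without Notepad. The sketch below uses Python's xml.etree.ElementTree as the XML processor (my choice, not the processor I used above) and feeds it two documents that both declare UTF-8: one really encoded as UTF-8, the other carrying Windows-1252 bytes. The element name and the parses helper are invented for the example:

```python
import xml.etree.ElementTree as ET

DECL = b'<?xml version="1.0" encoding="UTF-8"?>'
BODY = "<p>\u201cquoted\u201d</p>"

good = DECL + BODY.encode("utf-8")    # bytes match the declaration
bad = DECL + BODY.encode("cp1252")    # bytes hex 93 and 94, not UTF-8


def parses(document: bytes) -> bool:
    """Report whether the XML processor accepts the document."""
    try:
        ET.fromstring(document)
        return True
    except ET.ParseError:
        return False


print("UTF-8 bytes accepted:       ", parses(good))
print("Windows-1252 bytes accepted:", parses(bad))
```

When the bytes on disk agree with the declared encoding the curly quotes survive; when they don't, the processor complains.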

I don't know a lot about Windows applications' understanding of code 
set 1252, but I think you need to be a bit more precise when talking 
about characters in the abstract and their byte sequences in 
different encodings.  Some simple experimentation with different 
applications should answer your question, as I just did above with 
Word and Notepad.

I hope this helps.

. . . . . . . . . . . . . Ken

--
Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds:     publicly-available developer resources and training
G. Ken Holman                 mailto:gkholman@C...
Crane Softwrights Ltd.          http://www.CraneSoftwrights.com/x/
Box 266, Kars, Ontario CANADA K0A-2E0    +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07  http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers:  http://www.CraneSoftwrights.com/legal


