At 2007-09-02 22:46 +0100, Pete Kirkham wrote:
>Windows-1252 single byte character codes have curly quotes at 147 and
>148 decimal <http://en.wikipedia.org/wiki/Windows-1252>. These are
>10010011 and 10010100 binary.
>
>UTF-8 multibyte characters start with as many repeated 1s in their
>most significant bits as there are bytes in the sequence, then a zero,
>then data bits. For example, 147 would be 110/00010 10/010011 (slash
>splits control bits from data bits). UTF-8 single byte sequences
>always have 0 as most significant bit. So 10010011 cannot be a single
>byte UTF-8 character (msb is not zero) or the first byte of a
>multi-byte sequence (10 would indicate only one byte, which is not
>valid).
>
> > From: Costello, Roger L. [mailto:costello@m...]
> > QUESTIONS
> >
> > 1. Is the curly quote a valid UTF-8 character?
>Yes, it has the byte sequence hex C2 93

1100 0010 1001 0011 in UTF-8 is Unicode U+0093, which is a control character:

Unicode character data base:

0093;<control>;Cc;0;BN;;;;;N;SET TRANSMIT STATE;;;;

The "right single quotation mark" is U+2019:

2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;

Which translates into the following UTF-8 byte sequence:

E2 80 99
1110 0010 1000 0000 1001 1001

BTW, you are asking here in the singular "curly quote" yet above you are asking "curly quotes" ... the entries in Unicode are:

2018;LEFT SINGLE QUOTATION MARK;Pi;0;ON;;;;;N;SINGLE TURNED COMMA QUOTATION MARK;;;;
2019;RIGHT SINGLE QUOTATION MARK;Pf;0;ON;;;;;N;SINGLE COMMA QUOTATION MARK;;;;
201C;LEFT DOUBLE QUOTATION MARK;Pi;0;ON;;;;;N;DOUBLE TURNED COMMA QUOTATION MARK;;;;
201D;RIGHT DOUBLE QUOTATION MARK;Pf;0;ON;;;;;N;DOUBLE COMMA QUOTATION MARK;;;;

> > 2. Word uses Windows-1252 encoding, correct?
>Pass

You get your choice ... when you save a text file you can specify "Other encoding" and select Unicode, UTF-8, UTF-7, or many others.

> > 3. The curly quote in Windows-1252 has a specific binary sequence, correct?
>Yes, hex 93

From the Wikipedia citation above, I see the following (though I don't see formal character names, so I'm guessing these are the Unicode names):

hex 91 is left single quotation mark
hex 92 is right single quotation mark
hex 93 is left double quotation mark
hex 94 is right double quotation mark

> > 4. When I copy the curly quote from Word into Notepad, the operating
> > system does a straight 1-1 copy of the binary sequence, correct?
>I believe the encoding of data on the clipboard is indicated by a
>mechanism similar to mimetype and it's up to source and target
>applications to set the data and interpret it correctly.

Pass. It depends if it is working in the abstract or not w.r.t. characters.

> > 5. When I copy the curly quote from Word into Notepad, there is no
> > conversion or translation of the binary sequence by the operating
> > system, correct?
>It's up to the application.

Pass. I thought the clipboard was Unicode based, so when you use the word "copy", if you are using the clipboard, I would assume it would work.

I just copied curly quotes from Word to Notepad and when saving using UTF-8 I get the Unicode characters, and when saving to "ANSI" I get Windows-1252 characters. So you can experiment likewise with the clipboard and get these results.

> > 6. Assuming the curly quote is a valid UTF-8 character, is the
> > Windows-1252 curly quote binary sequence the same as the UTF-8 curly
> > quote binary sequence?
>No.

Agree. As shown above, the values hex 91 through 94 are control characters in Unicode and displayable characters in Windows-1252.

> > 7. Is the Windows-1252 curly quote binary sequence illegal in UTF-8,
> > i.e. the Windows-1252 curly quote binary sequence doesn't correspond to
> > any UTF-8 character?
>Yes.

UTF-8 isn't designed for Windows-1252 ... I think you are conflating character sets with character encodings.
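The distinction between the abstract character and its encoded bytes can be checked with a minimal sketch. Python is an assumption here (the thread itself contains no code), its "cp1252" codec is taken as a stand-in for Windows-1252, and the helper names `byte_forms` and `is_valid_utf8` are hypothetical, introduced only for illustration:

```python
# Illustrative only: Python's "cp1252" codec stands in for Windows-1252.
CURLY_QUOTES = "\u2018\u2019\u201c\u201d"


def byte_forms(ch):
    """Return (Windows-1252 bytes, UTF-8 bytes) for one abstract character."""
    return ch.encode("cp1252"), ch.encode("utf-8")


def is_valid_utf8(data):
    """True if the byte string decodes as UTF-8 without error."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# Same abstract character, different byte sequences per encoding.
for ch in CURLY_QUOTES:
    cp, u8 = byte_forms(ch)
    print(f"U+{ord(ch):04X}  cp1252={cp.hex().upper()}  utf-8={u8.hex().upper()}")

# A lone 0x93 byte is a continuation byte, so it is not valid UTF-8 by itself.
print(is_valid_utf8(b"\x93"))

# C2 93 is valid UTF-8, but it decodes to the control character U+0093,
# not to a quotation mark.
print(b"\xc2\x93".decode("utf-8") == "\u0093")
```

Under these assumptions, each quotation mark occupies one byte (hex 91 through 94) in Windows-1252 and three bytes in UTF-8, which matches the values quoted in this message.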
The test I just did appears to indicate the abstract character in Windows-1252 position 146 is Unicode RIGHT SINGLE QUOTATION MARK, as that is what is saved as UTF-8, so it is translating it to the proper Unicode value.

> > 8. Suppose I save the Word document as XML, and then I open the XML
> > using Notepad. The curly quotes no longer appear as curly quotes;
> > instead they appear as a bizarre character. Why does the curly quote
> > now look like a bizarre character in Notepad, whereas when I copied the
> > curly quote from Word to Notepad it looked fine in Notepad?
>Notepad doesn't understand UTF-8 encoded files.

False ... I just opened Notepad and wrote out a file using UTF-8 and opened it up again and it was preserved. An XML processor read the file and didn't complain about the encoding. I'm running XP.

I don't know a lot about Windows applications' understanding of code set 1252, but I think you need to be a bit more precise when talking about characters in the abstract and their character encoding in different encodings.

Some simple experimentation should answer your question with different applications, as I just did above with Word and Notepad.

I hope this helps.

. . . . . . . . . . . . . Ken

--
Upcoming public training: XSLT/XSL-FO Sep 10, UBL/code lists Oct 1
World-wide corporate, govt. & user group XML, XSL and UBL training
RSS feeds: publicly-available developer resources and training
G. Ken Holman mailto:gkholman@C...
Crane Softwrights Ltd. http://www.CraneSoftwrights.com/x/
Box 266, Kars, Ontario CANADA K0A-2E0 +1(613)489-0999 (F:-0995)
Male Cancer Awareness Jul'07 http://www.CraneSoftwrights.com/x/bc
Legal business disclaimers: http://www.CraneSoftwrights.com/legal



