Windows-1252 single-byte character codes have curly quotes at 147 and 148 decimal <http://en.wikipedia.org/wiki/Windows-1252>. These are 10010011 and 10010100 binary.

The first byte of a UTF-8 multi-byte sequence starts with as many repeated 1s in its most significant bits as there are bytes in the sequence, then a zero, then data bits; every following byte starts with 10. For example, code point 147 would be 110/00010 10/010011 (slash splits control bits from data bits). UTF-8 single-byte sequences always have 0 as the most significant bit.

So 10010011 cannot be a single-byte UTF-8 character (msb is not zero) or the first byte of a multi-byte sequence (a leading 10 marks a continuation byte, which cannot start a sequence).

> From: Costello, Roger L. [mailto:costello@m...]
> QUESTIONS
>
> 1. Is the curly quote a valid UTF-8 character?

Yes. The left curly quote is U+201C, which has the byte sequence hex E2 80 9C. (Code point 147 itself would encode as hex C2 93, but in Unicode that code point is a C1 control character, not a quote.)

> 2. Word uses Windows-1252 encoding, correct?

Pass

> 3. The curly quote in Windows-1252 has a specific binary sequence, correct?

Yes, hex 93

> 4. When I copy the curly quote from Word into Notepad, the operating
> system does a straight 1-1 copy of the binary sequence, correct?

I believe the encoding of data on the clipboard is indicated by a mechanism similar to mimetype, and it's up to the source and target applications to set the data and interpret it correctly.

> 5. When I copy the curly quote from Word into Notepad, there is no
> conversion or translation of the binary sequence by the operating
> system, correct?

It's up to the application.

> 6. Assuming the curly quote is a valid UTF-8 character, is the
> Windows-1252 curly quote binary sequence the same as the UTF-8 curly
> quote binary sequence?

No.

> 7. Is the Windows-1252 curly quote binary sequence illegal in UTF-8,
> i.e. the Windows-1252 curly quote binary sequence doesn't correspond to
> any UTF-8 character?

Yes. On its own, the byte hex 93 is not a valid UTF-8 sequence.

> 8. Suppose I save the Word document as XML, and then I open the XML
> using Notepad. The curly quotes no longer appear as curly quotes;
> instead they appear as a bizarre character. Why does the curly quote
> now look like a bizarre character in Notepad, whereas when I copied the
> curly quote from Word to Notepad it looked fine in Notepad?

Notepad doesn't understand UTF-8 encoded files.

Pete
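
The byte-level points in this reply can be checked with a short Python sketch. It assumes the quote in question is the left double quotation mark (U+201C) and that Python's built-in "cp1252" codec stands in for Windows-1252. The helper names bits and is_valid_utf8 are illustrative only and do not come from the thread.

```python
# Sketch: compare Windows-1252 and UTF-8 encodings of the left curly quote.
# Assumption: the "curly quote" is U+201C (LEFT DOUBLE QUOTATION MARK).
LEFT_QUOTE = "\u201c"

cp1252_bytes = LEFT_QUOTE.encode("cp1252")  # single byte, decimal 147
utf8_bytes = LEFT_QUOTE.encode("utf-8")     # three-byte UTF-8 sequence

# Code point 147 (U+0093) is a C1 control character in Unicode, not a quote.
c1_bytes = chr(147).encode("utf-8")


def bits(data: bytes) -> str:
    """Render each byte as 8 binary digits, separated by spaces."""
    return " ".join(f"{b:08b}" for b in data)


def is_valid_utf8(data: bytes) -> bool:
    """Return True if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


print("cp1252:", cp1252_bytes.hex(), bits(cp1252_bytes))
print("utf-8: ", utf8_bytes.hex(), bits(utf8_bytes))
print("U+0093:", c1_bytes.hex(), bits(c1_bytes))
print("0x93 alone valid UTF-8?", is_valid_utf8(cp1252_bytes))

# Reading the UTF-8 bytes as if they were Windows-1252 yields three
# unrelated characters, one per byte.
print("UTF-8 read as cp1252:", utf8_bytes.decode("cp1252"))
```

The last line illustrates one likely source of the "bizarre character" in question 8: a program that interprets UTF-8 bytes as Windows-1252 shows one character per byte instead of a single quote.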