Re: There is a serious amount of character encoding conversion
On 12/28/2012 9:01 AM, Costello, Roger L. wrote:
> How did it find a match?
>
> The underlying byte sequence for the iso-8859-1 López is: 4C F3 70 65 7A (one byte -- F3 -- is used to encode ó).
>
> The underlying byte sequence for the UTF-8 López is: 4C C3 B3 70 65 7A (two bytes -- C3 B3 -- are used to encode ó).
>
> The search application cannot be doing a byte-for-byte match, else it would find no match.
>
> The codepoint for the UTF-8 ó character is F3.
>
> Hey, iso-8859-1 uses F3 to encode ó.
>
> So perhaps the search application is converting the UTF-8 bytes to codepoints and then comparing those codepoints to the iso-8859-1 bytes. That would result in a match.

One point of comparison: Lucene used to use Java characters internally (which are much like UTF-16), and now uses UTF-8 internally (not codepoints). I think it's unlikely that your search application is using iso-8859-1 internally, although it might be using codepoints, as you suggest.

Of course it's no accident that iso-8859-1=Unicode codepoint; that was one sensible thing done by the character encoding gurus.

-Mike
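
For anyone who wants to check the byte sequences quoted above, here is a minimal Python sketch (not from the original thread). It assumes the ó is the precomposed character U+00F3, and it only illustrates the codepoint comparison Roger hypothesizes; it says nothing about what his search application actually does internally.

```python
# "López" with a precomposed ó (U+00F3); written as an escape so the
# result does not depend on how this source file itself is encoded.
name = "L\u00f3pez"

latin1_bytes = name.encode("iso-8859-1")
utf8_bytes = name.encode("utf-8")

print(latin1_bytes.hex(" ").upper())  # 4C F3 70 65 7A
print(utf8_bytes.hex(" ").upper())    # 4C C3 B3 70 65 7A

# A byte-for-byte comparison fails, as noted in the quoted message.
bytes_match = latin1_bytes == utf8_bytes

# Decoding the UTF-8 bytes yields codepoints; each iso-8859-1 byte value
# is already the Unicode codepoint of the character it encodes.
codepoints_from_utf8 = [ord(ch) for ch in utf8_bytes.decode("utf-8")]
codepoints_from_latin1 = list(latin1_bytes)
codepoints_match = codepoints_from_utf8 == codepoints_from_latin1

print(bytes_match, codepoints_match)  # False True
```

The same result follows from decoding each byte sequence with its own encoding and comparing the resulting strings, which is the more usual way to normalize input before indexing or searching.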