This is basically a question of the encoding. If you use UTF-16 then it's
the parser's job to take &#xD800;&#xDC00; and map it into the encoding of
the target environment. If you use UCS-4 as the encoding, then you probably
did not generate &#xD800;&#xDC00; in the first place but &#x10000;...

Best regards
Michael

> -----Original Message-----
> From: James Clark [mailto:jjc@j...]
> Sent: Tuesday, March 26, 2002 18:37 PM
> To: Michael Rys; Julian Reschke; xml-dev@l...
> Subject: RE: MSXML DOM Special Chars Less Than 32
>
> What would you do about surrogates? In Java (and I think C#) the string
> datatype allows an arbitrary sequence of 16-bit values. In particular, it
> doesn't constrain high-low surrogates to occur as part of valid surrogate
> pairs. How would you serialize a C# string that contains the sequence
> 0xD800,0xD800? If you serialize it as &#xD800;&#xD800;, then what happens
> if somebody writes &#xD800;&#xDC00;? Is that equivalent to &#x10000;?
>
> James
>
> --On 26 March 2002 18:13 -0800 Michael Rys <mrys@m...> wrote:
>
> > To give you a non-MS area where occasional non XML characters may appear
> > inside strings: Look at the current ANSI/ISO proposals for serializing
> > relational data into XML. None of the database companies (Oracle, IBM,
> > Sybase, us etc) want to encode strings as base64.
> >
> > To answer your question below: Assuming that we could at least allow to
> > use a char entity for an invalid XML char. That would already help.
> >
> > Best regards
> > Michael
> >
> > PS: Please cc me directly. Otherwise I will not see the answer until
> > several weeks later...
> >
> >> OK, assuming the data type *can* be changed: what encoding would you
> >> suggest for encoding arbitrary Unicode data (where control characters
> >> may appear, but only occasionally)?
> >>
> >> Surely not base64 (it's for byte streams, adds a lot of overhead and
> >> makes your XML unreadable to humans).
> >>
> >> BTW: another side of this problem is DOM's current approach.
> >> createText() doesn't have to throw an exception when the string
> >> contains forbidden characters. There is no standard method to test for
> >> XML character code compliance (note that there's also an issue
> >> regarding Java characters not being valid Unicode characters in all
> >> cases). DOM level 2 doesn't describe serialization, so current
> >> serializers in the best case throw an exception (which is pretty
> >> late...) or ignore the issue at all (producing broken XML).
> >>
> >> -----------------------------------------------------------------
> >> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> >> initiative of OASIS <http://www.oasis-open.org>
> >>
> >> The list archives are at http://lists.xml.org/archives/xml-dev/
> >>
> >> To subscribe or unsubscribe from this list use the subscription
> >> manager: <http://lists.xml.org/ob/adm.pl>
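
The thread notes that there is no standard method to test a string for XML character compliance, and that a 16-bit string type can hold an unpaired surrogate such as 0xD800,0xD800. The following is only a rough sketch of such a check, written in Python over a list of 16-bit code units to mimic a Java or C# string. The function names (`is_xml_char`, `code_points`, `find_invalid`) are made up for illustration and are not part of DOM, MSXML, or any library; the ranges are taken from the Char production in XML 1.0.

```python
def is_xml_char(cp: int) -> bool:
    """True if cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def code_points(units):
    """Yield (value, well_formed) pairs for a sequence of 16-bit code units.

    A high surrogate followed by a low surrogate is combined into one
    supplementary code point; any other surrogate is yielded as-is and
    flagged as not well formed.
    """
    i = 0
    while i < len(units):
        u = units[i]
        has_next = i + 1 < len(units)
        if 0xD800 <= u <= 0xDBFF and has_next and 0xDC00 <= units[i + 1] <= 0xDFFF:
            yield 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00), True
            i += 2
        else:
            yield u, not (0xD800 <= u <= 0xDFFF)
            i += 1


def find_invalid(units):
    """Return the values that cannot appear as characters in XML 1.0 text."""
    return [cp for cp, ok in code_points(units) if not ok or not is_xml_char(cp)]
```

Under these assumptions, `find_invalid([0xD800, 0xDC00])` is empty because the pair combines to U+10000, while `find_invalid([0xD800, 0xD800])` reports both units. A caller could run such a check when a text node is created rather than waiting for the serializer to fail.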