  • To: "James Clark" <jjc@j...>,"Julian Reschke" <julian.reschke@g...>,<xml-dev@l...>
  • Subject: RE: MSXML DOM Special Chars Less Than 32
  • From: "Michael Rys" <mrys@m...>
  • Date: Tue, 26 Mar 2002 19:50:14 -0800
  • Thread-index: AcHVN/uV40/oD/2oTa+v4oD+ugC3vgACkVfw
  • Thread-topic: MSXML DOM Special Chars Less Than 32

This is basically a question of the encoding. If you use UTF-16 then
it's the parser's job to take &#xD800;&#xDC00; and map it into the
encoding of the target environment. If you use UCS-4 as the encoding,
then you probably did not generate &#xD800;&#xDC00; in the first place
but &#x10000;...

Best regards
Michael
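
The equivalence asked about in the quoted message below (is a high/low surrogate pair the same as `&#x10000;`?) comes down to the UTF-16 surrogate arithmetic. Here is a minimal sketch in Python; the function names are hypothetical and chosen only for illustration. Note that XML 1.0 excludes surrogate code points from its `Char` production, so a conforming parser should reject `&#xD800;` as a character reference; the sketch only shows the mapping between the two forms.

```python
def combine_surrogates(high: int, low: int) -> int:
    """Map a UTF-16 surrogate pair to the code point it encodes."""
    if not 0xD800 <= high <= 0xDBFF:
        raise ValueError(f"not a high surrogate: {high:#06x}")
    if not 0xDC00 <= low <= 0xDFFF:
        raise ValueError(f"not a low surrogate: {low:#06x}")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


def split_code_point(cp: int) -> tuple[int, int]:
    """Split a supplementary code point into its UTF-16 surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError(f"not a supplementary code point: {cp:#x}")
    offset = cp - 0x10000
    return 0xD800 + (offset >> 10), 0xDC00 + (offset & 0x3FF)


if __name__ == "__main__":
    # The pair from the thread maps to the first supplementary code point.
    print(hex(combine_surrogates(0xD800, 0xDC00)))
    # The unpaired sequence 0xD800,0xD800 has no code point to map to.
    try:
        combine_surrogates(0xD800, 0xD800)
    except ValueError as exc:
        print("rejected:", exc)
```

The second call illustrates the problem case: two high surrogates in a row do not form a pair, so there is nothing a serializer could emit in their place without inventing an escaping convention.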

> -----Original Message-----
> From: James Clark [mailto:jjc@j...]
> Sent: Tuesday, March 26, 2002 18:37 PM
> To: Michael Rys; Julian Reschke; xml-dev@l...
> Subject: RE:  MSXML DOM Special Chars Less Than 32
> 
> What would you do about surrogates?  In Java (and I think C#) the string
> datatype allows an arbitrary sequence of 16-bit values.  In particular, it
> doesn't constrain high-low surrogates to occur as part of valid surrogate
> pairs. How would you serialize a C# string that contains the sequence
> 0xD800,0xD800?  If you serialize it as &#xD800;&#xD800;, then what happens
> if somebody writes &#xD800;&#xDC00;? Is that equivalent to &#x10000;?
> 
> James
> 
> --On 26 March 2002 18:13 -0800 Michael Rys <mrys@m...> wrote:
> 
> >
> >
> > To give you a non-MS area where occasional non XML characters may appear
> > inside strings: Look at the current ANSI/ISO proposals for serializing
> > relational data into XML. None of the database companies (Oracle, IBM,
> > Sybase, us etc) want to encode strings as base64.
> >
> > To answer your question below: if we were at least allowed to use a
> > character reference for an invalid XML char, that would already help.
> >
> > Best regards
> > Michael
> >
> > PS: Please cc me directly. Otherwise I will not see the answer until
> > several weeks later...
> >
> >> OK, assuming the data type *can* be changed: what encoding would you
> >> suggest for encoding arbitrary Unicode data (where control characters may
> >> appear, but only occasionally)?
> >>
> >> Surely not base64 (it's for byte streams, adds a lot of overhead and makes
> >> your XML unreadable to humans).
> >>
> >> BTW: another side of this problem is DOM's current approach.
> >> createTextNode() doesn't have to throw an exception when the string
> >> contains forbidden characters. There is no standard method to test for XML
> >> character code compliance (note that there's also an issue regarding Java
> >> characters not being valid Unicode characters in all cases). DOM level 2
> >> doesn't describe serialization, so current serializers in the best case
> >> throw an exception (which is pretty late...) or ignore the issue
> >> altogether (producing broken XML).
> >>
> >>
> >>
> >>
> >>
> >> -----------------------------------------------------------------
> >> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> >> initiative of OASIS <http://www.oasis-open.org>
> >>
> >> The list archives are at http://lists.xml.org/archives/xml-dev/
> >>
> >> To subscribe or unsubscribe from this list use the subscription
> >> manager: <http://lists.xml.org/ob/adm.pl>
> >
> 
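
The deepest quoted message notes that there is no standard method to test a string for XML character compliance before it reaches the serializer. A minimal sketch of such a check in Python follows, assuming XML 1.0 and its `Char` production (tab, LF, CR, U+0020 to U+D7FF, U+E000 to U+FFFD, U+10000 to U+10FFFF). The function names are hypothetical and not part of any DOM API.

```python
def is_xml_char(cp: int) -> bool:
    """Return True if code point cp matches the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)
        or 0x20 <= cp <= 0xD7FF
        or 0xE000 <= cp <= 0xFFFD
        or 0x10000 <= cp <= 0x10FFFF
    )


def first_invalid_xml_char(text: str) -> int:
    """Return the index of the first character XML 1.0 forbids, or -1."""
    for index, ch in enumerate(text):
        if not is_xml_char(ord(ch)):
            return index
    return -1


if __name__ == "__main__":
    # A control character below 32 (other than tab, LF, CR) is rejected.
    print(first_invalid_xml_char("a\x01b"))
    # Tab and newline are allowed, so this string passes.
    print(first_invalid_xml_char("ok\t\n"))
    # A lone surrogate is rejected as well.
    print(first_invalid_xml_char("\ud800"))
```

Running a check like this when the text node is created, rather than at serialization time, would surface the error at the point the bad data enters the tree.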

