
  • From: "Costello, Roger L." <costello@m...>
  • To: <xml-dev@l...>
  • Date: Wed, 5 Sep 2007 11:10:39 -0400

Hi Folks,
 
I am compiling a list of well-formedness problems that may arise from
copying text from one document and pasting it into an XML document. 
 
For example, consider this XML document:
 
<?xml version="1.0" encoding="UTF-8"?>
<Document>
      <Para id="...">...</Para>
</Document>
 
Suppose that text is copied from a document and pasted into the XML
document, either as the content of the <Para> element or as the value
of the id attribute.
 
Here is my current list of problems:
 
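1. The text may contain these XML markup characters: {<, >, ', ", &}.
The characters < and & must always be escaped (as &lt; and &amp;). A
quote character needs escaping only inside an attribute value that is
delimited by the same quote, and > needs escaping only in the sequence
]]>. Left unescaped, these characters may introduce syntax errors into
the XML document.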
 
2. The editor that was used to create the text may use a different
encoding than the XML document's encoding. A byte sequence that
represents a character in one encoding may represent a different
character, or no character at all, in another encoding.  Consequently,
if the bytes are carried over without being converted to the XML
document's encoding, the characters that result from pasting the text
may not be the ones that were copied.  Example: suppose the text comes
from Word in Windows-1252 encoding. There the left curly (a.k.a. smart)
quote is the single byte x93. The Unicode code point for the left curly
quote is U+201C, which UTF-8 encodes as the three bytes xE2 x80 x9C. In
UTF-8 a lone x93 byte is not a valid character at all, and in
ISO-8859-1 it is the control character U+0093.  Copying a left curly
quote from a Word document and pasting it into a UTF-8 XML document
without conversion may therefore leave the XML document with an invalid
byte sequence or a control character rather than a left curly quote.
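
Both problems in the list above can be reproduced in a few lines. The
following is only a rough sketch in Python, using the standard library's
xml.sax.saxutils and xml.etree.ElementTree modules. The helper names
(build_document, is_well_formed) and the sample strings are made up for
illustration, and the Windows-1252 source is an assumption taken from the
example in item 2.

```python
from xml.sax.saxutils import escape, quoteattr
import xml.etree.ElementTree as ET


def build_document(para_text, para_id):
    """Hypothetical helper: paste text into <Para> content and its id attribute."""
    # escape() replaces &, < and > in element content.
    # quoteattr() chooses a quote delimiter and escapes what clashes with it.
    return "<Document><Para id=%s>%s</Para></Document>" % (
        quoteattr(para_id),
        escape(para_text),
    )


def is_well_formed(xml_bytes):
    """Hypothetical helper: True if the parser accepts the bytes."""
    try:
        ET.fromstring(xml_bytes)
        return True
    except ET.ParseError:
        return False


# Problem 1: markup characters in the pasted text.
pasted = 'AT&T said "x < y" isn\'t true'
naive = '<Document><Para id="p1">%s</Para></Document>' % pasted
escaped = build_document(pasted, pasted)
round_trip = ET.fromstring(escaped).find("Para")

# Problem 2: the same character has different bytes in different encodings.
left_quote = "\u201c"                        # LEFT DOUBLE QUOTATION MARK
cp1252_bytes = left_quote.encode("cp1252")   # one byte in Windows-1252
utf8_bytes = left_quote.encode("utf-8")      # three bytes in UTF-8

# Raw Windows-1252 bytes dropped into a document that claims to be UTF-8.
header = b'<?xml version="1.0" encoding="UTF-8"?>'
raw_paste = header + b"<Para>" + cp1252_bytes + b"Hi</Para>"
converted = header + b"<Para>" + utf8_bytes + b"Hi</Para>"

# Read as ISO-8859-1, the same byte becomes the control character U+0093.
as_latin1 = cp1252_bytes.decode("latin-1")
```

In this sketch the naive paste and the raw-byte paste are both rejected
by the parser, while the escaped and the converted versions parse and
give back the original text.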
 
Can you think of other problems that may result from copying text from
one document and pasting it into an XML document?
 
/Roger

