Hi Folks,
I am compiling a list of well-formedness problems that may arise from
copying text from one document and pasting it into an XML document.
For example, consider this XML document:
<?xml version="1.0" encoding="UTF-8"?>
<Document>
<Para id="...">...</Para>
</Document>
Suppose that text is copied from a document and pasted into the XML
document, either as the content of the <Para> element or as the value
of the id attribute.
Here is my current list of problems:
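1. The text may contain these reserved characters: {<, >, ', ", &}.
These characters may introduce syntax errors into the XML document. In
both element content and attribute values, < and & must always be
escaped; in an attribute value, the quote character that delimits the
value must be escaped as well.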
2. The editor that was used to create the text may use a different
encoding than the XML document's encoding. A byte sequence that
represents a character in one encoding may represent a different
character, or no valid character at all, in another encoding.
Consequently, if the bytes of the text are carried into the XML
document without being converted to the document's encoding, the
characters that result may not be the same. Example: suppose the text
comes from Word and was saved as Windows-1252. In Windows-1252 the left
curly (a.k.a. smart) double quote is the single byte x93. In Unicode
that character is the code point U+201C, which UTF-8 encodes as the
three bytes xE2 x80 x9C. In UTF-8 a lone byte x93 is not a valid
character (it is a continuation byte), and read as ISO-8859-1 it is the
control character U+0093. Carrying a left curly quote from such a
source into a UTF-8 XML document may therefore result in the XML
document receiving an invalid byte or a control character rather than a
left curly quote.
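
To make the two problems above concrete, here is a minimal Python
sketch using only the standard library. The function name
build_document and the sample strings are hypothetical, chosen for
illustration; the sketch assumes the pasted text arrives as a Python
string (for problem 1) or as raw Windows-1252 bytes (for problem 2).

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape, quoteattr


def build_document(para_text: str, id_value: str) -> str:
    """Hypothetical helper: wrap pasted text in the sample document.

    escape() replaces &, < and > in element content.
    quoteattr() chooses a delimiter for the attribute value and escapes
    whatever would otherwise break it.
    """
    return (
        '<?xml version="1.0" encoding="UTF-8"?>\n'
        "<Document>\n"
        f"<Para id={quoteattr(id_value)}>{escape(para_text)}</Para>\n"
        "</Document>\n"
    )


# Problem 1: reserved characters in pasted text.
pasted_text = "AT&T says 1 < 2 and \"quotes\" aren't a problem"
pasted_id = "a\"b'c"
document = build_document(pasted_text, pasted_id)

# The parser accepts the escaped document and returns the original text.
para = ET.fromstring(document.encode("utf-8")).find("Para")
round_trip_ok = (para.text == pasted_text and para.get("id") == pasted_id)

# Pasting the same text without escaping is not well-formed.
try:
    ET.fromstring(f"<Document><Para>{pasted_text}</Para></Document>")
    unescaped_is_well_formed = True
except ET.ParseError:
    unescaped_is_well_formed = False

# Problem 2: the same character has different bytes in each encoding.
left_quote = "\u201c"
cp1252_bytes = left_quote.encode("cp1252")  # single byte 0x93
utf8_bytes = left_quote.encode("utf-8")     # bytes 0xE2 0x80 0x9C

# Reading the Windows-1252 byte as UTF-8 fails outright.
try:
    cp1252_bytes.decode("utf-8")
    lone_byte_valid_utf8 = True
except UnicodeDecodeError:
    lone_byte_valid_utf8 = False

# Reading it as ISO-8859-1 silently yields the control character U+0093.
misread = cp1252_bytes.decode("latin-1")

# Decoding with the source encoding first recovers the curly quote.
recovered = cp1252_bytes.decode("cp1252")
```

The point of the second half is that the fix is a conversion step:
decode with the source's encoding, then encode with the XML document's
encoding, rather than copying bytes across unchanged.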
Can you think of other problems that may result from copying text from
one document and pasting it into an XML document?
/Roger