RE: nbsp is not that hard, folks
Hi there.

So, what you are saying is that &nbsp; is to XML and HTML as "#define nbsp" is to C??

-----Original Message-----
From: owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx [mailto:owner-xsl-list@xxxxxxxxxxxxxxxxxxxxxx] On Behalf Of Mike Brown
Sent: Friday, November 08, 2002 7:13 AM
To: xsl-list@xxxxxxxxxxxxxxxxxxxxxx
Subject: nbsp is not that hard, folks

Brian Grainger wrote:
> If you're trying to escape &nbsp; in a document encoded as UTF-8, you
> have to use Unicode escaping of the UTF-8 representation of the
> entity. In this case, &nbsp; is equal to &#160;, and &#160; encoded as
> UTF-8 is \u00A0.

Good grief. No, you have your terminology badly mixed up, and you're throwing in an irrelevant notation. "&nbsp;", "&#160;" and "\u00A0" have nothing, NOTHING to do with UTF-8.

There is something about nbsp that just confuses the heck out of people. I think it must be the fact that it looks like a space, and that you don't have an nbsp key on your keyboard.

OK, read this.

1. There is a character -- an abstract unit in a "script" (a writing system; we are using Latin right now) -- called NO-BREAK SPACE by the Unicode Standard and ISO/IEC 10646. Unicode and ISO/IEC 10646 assign this character an integer number, 160, which is A0 in hex. We say Unicode all the time around here, but we mean ISO/IEC 10646 because that's what the XML and HTML specs reference. The two standards share the same character repertoire and numbering, so there's no harm.

2. UTF-8 is an encoding scheme that provides a way of representing any of the approximately 1.1 million possible abstract characters in Unicode as a sequence of 1 to 4 bytes. The UTF-8 representation of the Unicode character 160 (no-break space) is the pair of bytes C2 A0, in that order. In contrast, iso-8859-1 is a character map that provides a way of representing each of the first 256 Unicode characters as a single byte. us-ascii is an even more limited set of just the first 128, each mapped to a single byte.

3. This thing: \u00A0
   - is a sequence of 6 bytes (ASCII bytes for backslash, u, zero, zero, A, zero);
   - has special meaning in a programming language like Java or Python, where it is essentially a macro for the no-break space character;
   - is used when representing the character directly as encoded bytes is impractical or impossible.

4. This thing: &#160; or this thing: &#xA0;
   - is to SGML applications like HTML and XML what \u00A0 is to Java & Python;
   - is called a character reference (or "numeric character reference").

5. This thing: &nbsp;
   - is to SGML applications like HTML and XML an "entity reference";
   - refers to an entity (a separate collection of information) named nbsp;
   - depending on the circumstances, is intended to be treated by the XML parser or HTML user agent as equivalent to the entity's "replacement text";
   - is, in HTML, predefined to have the replacement text of just one character, the no-break space;
   - is not defined by default in XML.

6. The thing here in between the quotes: " "
   - is byte 0xA0;
   - is intended to be a no-break space because this email is iso-8859-1 encoded;
   - has exactly the same meaning in an XML document as &#160;.

- Mike
____________________________________________________________________________
mike j. brown                   |  xml/xslt: http://skew.org/xml/
denver/boulder, colorado, usa   |  resume: http://skew.org/~mike/resume/

XSL-List info and archive: http://www.mulberrytech.com/xsl/xsl-list
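
For anyone who wants to check the six points above rather than take them on faith, here is a minimal sketch in Python 3 using only the standard library. It is not from the original post, and the variable names are purely illustrative. It assumes that `xml.etree.ElementTree` stands in for "an XML parser" and `html.unescape` for "an HTML user agent", which is a simplification.

```python
import html
import xml.etree.ElementTree as ET

NBSP = "\u00A0"  # point 3: string escape for the NO-BREAK SPACE character

# Point 1: the abstract character is number 160 (hex A0).
code_point = ord(NBSP)

# Point 2: encodings map that one character to different byte sequences.
utf8_bytes = NBSP.encode("utf-8")         # two bytes, C2 A0
latin1_bytes = NBSP.encode("iso-8859-1")  # one byte, A0

# Point 3: as typed in source, the escape is six ASCII characters.
escape_as_typed = r"\u00A0"  # raw string, so the escape is NOT interpreted

# Point 4: numeric character references resolve to the same character.
from_decimal = ET.fromstring("<p>&#160;</p>").text
from_hex = ET.fromstring("<p>&#xA0;</p>").text

# Point 5: &nbsp; is predefined in HTML but not in XML.
from_html_entity = html.unescape("&nbsp;")
try:
    ET.fromstring("<p>&nbsp;</p>")
    xml_accepts_nbsp = True
except ET.ParseError:
    xml_accepts_nbsp = False

# Point 6: a raw byte A0 in iso-8859-1 text is the same character.
from_raw_byte = bytes([0xA0]).decode("iso-8859-1")

print(code_point, utf8_bytes.hex(), latin1_bytes.hex(), len(escape_as_typed))
print(from_decimal == from_hex == from_html_entity == from_raw_byte == NBSP)
print(xml_accepts_nbsp)
```

The point of the comparison on the second-to-last line is that every notation ends up as the same single character. Only the `encode` calls have anything to do with UTF-8 or iso-8859-1.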