Re: Random Access XML
On Sat, 19 Feb 2011 15:36:46 -0500, John Cowan <cowan@mercury.ccil.org> wrote:
> rjelliffe scripsit:
>
>> 1) For a start, we need to be able to know whether "<" "</" and ">" are
>> tag delimiters without knowing context. So we must ban direct use of "<"
>> and ">" in attributes and also get rid of CDATA sections. We should get
>> rid of comments and PIs too, for the same reasons. (Actually, we only
>> need to ban comments and PIs from after the first start tag. For other
>> reasons, we might like to treat the first start-tag and before it
>> specially.)
>
> Of course, random < is already banned everywhere, so if you ban > in
> character content as well as attribute values, you get full reversibility:
> each of <, </, <?, <!--, >, />, and --> is guaranteed to be the open or
> close delimiter of a markup construct.

Yes, if people are happy to keep comments and PIs after the prolog, I don't mind. (But I thought James' idea was to reduce the number of different node types in the parse tree, because multiple node types apparently freak programmers out?)

> MicroXML already bans > in character content so that it doesn't have to
> special-case ]]>, as required for full XML compatibility. The only reason
> it doesn't ban > in attribute values is that they are required for
> compatibility with Canonical XML.

Oh, is that a requirement?

>> 3) The generic identifier would have to be more like an XPath.
>
> This could be achieved by convention, using a legal but rarely
> employed delimiter like U+00B7 MIDDLE DOT, or any of the vast number of
> delimiters allowed by XML 1.0 Fifth Edition.

Yes, let's make the 5th edition useful! :-) Using special characters ad hoc in names may be bad, but using them for systematic delimiters could be good. (I think using non-ASCII characters for token separators won't get any traction, unless encodings are restricted to UTF-*. Or allow a built-in entity reference for the delimiter chosen.)

For the sake of argument, say we use ‣ [triangle], e.g. <book‣section‣personalName>, which is like a breadcrumb bar notation.

A SAX processor for Random Access XML would plug in after a normal SAX parser and replace element names like 'book‣section‣personalName' or 'section‣personalName' with 'personalName'. (I.e. report back just the element name, the last item. If sections only appear in books, then the start tags <book‣section‣personalName> and <section‣personalName> should not alter the infoset.)

If we wanted to reduce name lengths, we could allow simple wildcards or ellipsis too, e.g. <b…‣s…‣personalName>

Cheers
Rick Jelliffe

BTW, the idea of using paths in names to allow random access is not new or mine. IIRC the Dynatext readers indexed their SGML into a one-element-per-line format, with a long path name at the beginning of each line. This allowed fast contextual searches using normal line-oriented text matching. I think Steve deRose had the patent on this, but I'd think it would be expired by now.
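
To make the SAX filter idea concrete, here is a rough sketch in Python, not a worked-out design. It assumes the path steps are joined with U+00B7 MIDDLE DOT (the delimiter John suggested) rather than the triangle, on the assumption that a stock parser is more likely to accept it in names. The names leaf_name, LeafNameFilter, NameCollector and reported_names are made up for illustration. The filter sits after an ordinary SAX parser and passes on only the last step of each element name.

```python
import io
import xml.sax
from xml.sax.saxutils import XMLFilterBase

# Assumption: U+00B7 MIDDLE DOT separates the path steps in element names.
PATH_DELIMITER = "\u00b7"


def leaf_name(path_name, delimiter=PATH_DELIMITER):
    """Return the last step of a path-style element name."""
    return path_name.rsplit(delimiter, 1)[-1]


class LeafNameFilter(XMLFilterBase):
    """Hypothetical filter: reports only the last step of each element name."""

    def startElement(self, name, attrs):
        super().startElement(leaf_name(name), attrs)

    def endElement(self, name):
        super().endElement(leaf_name(name))


class NameCollector(xml.sax.ContentHandler):
    """Records the start-tag names that reach the application."""

    def __init__(self):
        super().__init__()
        self.names = []

    def startElement(self, name, attrs):
        self.names.append(name)


def reported_names(xml_text):
    """Parse xml_text through the filter and list the start-tag names seen."""
    collector = NameCollector()
    pipeline = LeafNameFilter(xml.sax.make_parser())
    pipeline.setContentHandler(collector)
    pipeline.parse(io.BytesIO(xml_text.encode("utf-8")))
    return collector.names


if __name__ == "__main__":
    doc = (
        "<book><book\u00b7section>"
        "<book\u00b7section\u00b7personalName>Rick"
        "</book\u00b7section\u00b7personalName>"
        "</book\u00b7section></book>"
    )
    print(reported_names(doc))
```

Because the application only ever sees the last step, a long path name and a shortened one for the same element look identical downstream, which is the "should not alter the infoset" property described above. Wildcards and ellipsis in the path steps are not handled in this sketch.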