A syntax for locators (WAS Re: more QName madness)
John Cowan wrote:
> Joe English scripsit:
>
> > Sarcasm aside, I could devise one. So could you, and so could any of
> > the individual members of the Linking WG. Not trivially easy,
> > but a good deal simpler than the proposed framework and with
> > most of the expressive power.
>
> No sarcasm intended. I would be most interested in such a proposal
> or even a sketch.

Here's a method I've been using on a few internal projects.
(Caveat: following the YAGNI principle [1] I've only implemented
the bits of this that I've actually needed to date, but implementing
the full thing looks to be pretty straightforward.)

Syntax:

    locator  ::= /* empty */
              |  locator '/' step
              |  locator '//' step
              ;
    step     ::= selector
              |  NCName '(' selector ')'
              ;
    selector ::= Ordinal    /* = [1-9][0-9]*, interpreted as an integer */
              |  '@' NCName '=' Literal
              ;
    NCName   ::= /* ... the usual */ ;
    Literal  ::= /* ... the usual */ ;

Semantics:

A _locator_ takes as input a single XML node and returns
at most one XML node. A _step_ takes a list of nodes and
returns at most one element node.

The base case, an empty locator, returns the input node
(typically the document root).

"loc / step" evaluates _loc_, then applies _step_ to the list
of child element nodes of the result. "loc // step" applies
_step_ to the list of proper descendants (in document order).
In the latter two cases, if _loc_ fails then the expression
as a whole fails.

"NCName(selector)" selects only those element nodes in the
input list which have a matching local-name, then applies
_selector_ to the filtered list.

An ordinal number _n_ returns the _n_th node in the input list
(starting from 1); it fails if the list has fewer than _n_ elements.

"@name='value'" selects the first element in the input list
with an attribute having a local-name of _name_ and a matching
value, failing if there is no such element.

That's about it.
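
[ Not part of the original post: a minimal Python sketch of the
semantics above, built on xml.etree.ElementTree. The names parse,
local_name, apply_selector and locate are made up for illustration,
and NCName and Literal are approximated by simple regexps rather than
"the usual" productions. ElementTree has no separate document node,
so the input node here is assumed to be an element (for instance the
document element). ]

```python
import re
import xml.etree.ElementTree as ET

# Assumption: a rough stand-in for NCName, not the full XML production.
NAME = r"[A-Za-z_][\w.-]*"
# selector ::= Ordinal | '@' NCName '=' Literal  (Literal = quoted string)
SEL = rf"""(?:[1-9][0-9]*|@{NAME}=(?:'[^']*'|"[^"]*"))"""
# step, preceded by its '/' or '//' operator.
STEP_RE = re.compile(rf"(//?)(?:({NAME})\(({SEL})\)|({SEL}))")


def parse(locator):
    """Split a locator string into (axis, name-or-None, selector) steps."""
    steps, pos = [], 0
    while pos < len(locator):
        m = STEP_RE.match(locator, pos)
        if not m:
            raise ValueError(f"bad locator at offset {pos}: {locator!r}")
        axis, name, inner, bare = m.groups()
        steps.append((axis, name, inner or bare))
        pos = m.end()
    return steps


def local_name(tag):
    # ElementTree spells namespaced names as '{uri}local'; keep the local part.
    return tag.rsplit("}", 1)[-1]


def apply_selector(selector, nodes):
    """Return at most one node from the list, or None on failure."""
    if selector[0] != "@":
        n = int(selector)                      # ordinals count from 1
        return nodes[n - 1] if n <= len(nodes) else None
    attr, literal = selector[1:].split("=", 1)
    value = literal[1:-1]                      # strip the quotes
    for node in nodes:                         # first match wins
        for key, val in node.attrib.items():
            if local_name(key) == attr and val == value:
                return node
    return None


def locate(locator, node):
    """Evaluate a locator against an input element; None means failure."""
    for axis, name, selector in parse(locator):
        if axis == "/":
            candidates = list(node)            # child elements
        else:
            candidates = list(node.iter())[1:] # proper descendants, doc order
        if name is not None:
            candidates = [e for e in candidates if local_name(e.tag) == name]
        node = apply_selector(selector, candidates)
        if node is None:
            return None                        # a failed step fails the whole
    return node
```

[ With the document element as input, locate("/chapter(2)/section(1)", root)
walks two child steps, and locate("//@id='c2'", root) searches all
proper descendants of root. A failed locator is reported as None. ]
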

Notes:

The syntax is simple enough that it can be parsed with regexps,
and it can be implemented with a streaming processor (e.g.,
a SAX Filter) without lookahead or backtracking. The target
element can be identified as soon as its start tag is seen.

The notation covers a broad range of use cases. It can address
any element in the tree using only ordinal selectors and the "/"
operator (like the XPointer "element" scheme or HyTime treelocs).

The "NCName(ordinal)" form allows for more human-readable and
human-writable locators, e.g., "/document(1)/chapter(2)/section(1)".

"//@name=value" can be used to locate elements by ID (XPointer
"shorthand pointer" or HyTime "nameloc"). Since the (local-)name
of the ID-bearing attribute is specified in the locator itself,
the consumer doesn't need to know about schema-determined,
DTD-determined, or externally-determined IDs. I haven't come up
with a use case where the producer of a link (a) knows the ID of
the desired element but (b) doesn't know the name of the ID-bearing
attribute, so (with the exception of a few namespace-related
pathologies) there is no loss of expressivity.

The "@name=value" form can also be used for attributes that have
ID- or key-like semantics but aren't defined as IDs in a schema
or DTD. For example, in HTML two <input>s in different <form>s
can have the same @name, so the name attribute has declared value
CDATA. These are addressable with locators like:

    //form(3)//input(@name='credit_card_number')

[ Hm... two <input>s in the _same_ form can also have the same
name... this scheme won't work to locate those. ]

Lastly, this allows you to write very compact (but of course very
fragile!) locators: "//N" selects the Nth element in document order.

Locators have a nice associative property: if _loc1_ and _loc2_
are locators and _node_ is the input node, then:

    locate (loc2, locate(loc1, node)) = locate (loc1 ++ loc2, node)

where ++ is string concatenation.
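
[ Not part of the original post: a rough sketch of the streaming claim
in Python's xml.sax, restricted to locators that use only the "/"
operator (the "//" operator and the empty locator are rejected).
The class and function names are invented for illustration, the
regexps only approximate NCName and Literal, and local names are taken
by cutting a QName at the colon. The input node is assumed to be the
document root, so the first step selects among top-level elements. ]

```python
import re
import xml.sax

NAME = r"[A-Za-z_][\w.-]*"   # assumption: rough stand-in for NCName
SEL = rf"""(?:[1-9][0-9]*|@{NAME}=(?:'[^']*'|"[^"]*"))"""
STEP_RE = re.compile(rf"/(?:({NAME})\(({SEL})\)|({SEL}))")


def parse_child_steps(locator):
    """Parse a '/'-only locator into (name-or-None, selector) pairs."""
    steps, pos = [], 0
    while pos < len(locator):
        m = STEP_RE.match(locator, pos)
        if not m:
            raise ValueError(f"unsupported locator at offset {pos}")
        name, inner, bare = m.groups()
        sel = inner or bare
        if sel[0] == "@":
            attr, literal = sel[1:].split("=", 1)
            steps.append((name, (attr, literal[1:-1])))
        else:
            steps.append((name, int(sel)))
        pos = m.end()
    if not steps:
        raise ValueError("this sketch needs at least one step")
    return steps


def local(qname):
    return qname.rsplit(":", 1)[-1]


class LocatorHandler(xml.sax.ContentHandler):
    def __init__(self, steps):
        super().__init__()
        self.steps = steps
        self.depth = 0       # depth of the innermost open element
        self.matched = 0     # steps satisfied so far (= depth of last hit)
        self.count = 0       # candidates seen so far for the next step
        self.done = False    # True once the target is found or has failed
        self.result = None   # (qname, attributes) of the target element

    def startElement(self, name, attrs):
        self.depth += 1
        # Only children of the most recently matched element are candidates.
        if self.done or self.depth != self.matched + 1:
            return
        want_name, selector = self.steps[self.matched]
        if want_name is not None and local(name) != want_name:
            return
        self.count += 1
        if isinstance(selector, int):
            hit = self.count == selector
        else:
            attr, value = selector
            hit = any(local(k) == attr and v == value
                      for k, v in attrs.items())
        if hit:
            self.matched += 1
            self.count = 0
            if self.matched == len(self.steps):
                # Decided from the start tag alone, with no lookahead.
                self.result = (name, dict(attrs.items()))
                self.done = True

    def endElement(self, name):
        # Each step picks exactly one element, so if that element closes
        # before the next step has matched, the locator as a whole fails.
        if not self.done and self.depth == self.matched:
            self.done = True
        self.depth -= 1


def stream_locate(locator, xml_bytes):
    handler = LocatorHandler(parse_child_steps(locator))
    xml.sax.parseString(xml_bytes, handler)
    return handler.result
```

[ The handler keeps only a depth, a count of satisfied steps and one
counter, which is what makes it work without backtracking. A real
filter would probably stop or start forwarding events once the
target is found; this sketch just records it and reads on. ]
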

Locators only return element nodes, so they don't meet all the
XPointer requirements [2]. They can be used as a prefix in a more
general pointer scheme though, something like:

    pointer ::= locator
             |  locator '/' '@' Name              /* select an attribute */
             |  locator '/' '$' ...something...   /* select text nodes */
             |  locator '/' '?' ...something...   /* select PIs */
             |  ... other stuff ...
             ;

(I've implemented the first one, but haven't needed the others yet
so haven't given them much thought -- YAGNI again.)

I can think of a good reason _not_ to support ranges though.
A good way to implement bidirectional links is to annotate each
node with a list of all the locators that point to it, so it's
easy to traverse back and forth across arcs. Things get hairier
if a locator can point to a range of nodes or to a character span.

On the QName problem:

The most radical (over-?)simplification is that it only examines
the local-name, not the expanded name or QName. Putting namespace
names in locators is way too verbose, and using QNames leads to
the usual problem of how to determine the namespace context.

The solution I like best would be to use the namespace context of
the _target_ document. Amy Lewis makes a compelling argument for
this approach in [3]. The only real drawback is that it's only
reliable if the target document is sane [4]; otherwise you can end
up counting more elements than you intended (neurosis) or skipping
ones that should be counted (borderline). Further, the possibility
of psychosis means that you have to do the full-blown
QName-to-expanded name processing and compare URIs instead of doing
a simple lexical comparison against the original QName.

Given all that, the marginal benefit of being able to match
expanded-names instead of just local-names didn't seem worth
the added effort.

The syntax is intentionally incompatible with XPath (parentheses
instead of square brackets) because the semantics are different.
The main differences are that 'foo' matches 'pfx:foo' in locators
but not in XPath, and '//foo[@name="value"]' can match multiple
elements in XPath whereas '//foo(@name="value")' only matches the
first one. (The main reason though is that square brackets are
magic characters in my Language of Choice for XML processing, and
parentheses don't need to be escaped).

[1] YAGNI:
    <URL: http://www.xprogramming.com/Practices/PracNotNeed.html >
[2] XPointer requirements:
    <URL: http://www.w3.org/TR/NOTE-xptr-req >
[3] compelling argument: <20021114182802.GA6480@t...>,
    <URL: http://lists.xml.org/archives/xml-dev/200211/msg00549.html >
[4] sane:
    <URL: http://www.flightlab.com/~joe/sgml/sanity.txt >

--Joe English

  jenglish@f...