|
[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: XPath calculation
From: Vakulenko, Andrey V <andrey.vakulenko@e...> >I am looking for Java implementation of an algorithm for finding the most >generic and short XPath expression for a selected element in DOM (XML >document). Any ideas and recommendations are welcomed. Sorry, no code, but here is a possible algorithm, no flames please. I think the problem comes down to seeing the document as a graph, with the root element, each element with an ID, and each unique set of vertices as potential anchor points. We want to find the shortest distance to a potential anchor point, then the simplest way to express the path between the anchor and the element. You start only considering the element: is it the root element or an element with an ID or the only element with that name? If not you expand the search one out: does it have parent or children or siblings that are a root element or an element wth an ID, or is it uniquely named? Next go to grandparent, cousins and grandchildren; at this stage you also consider whether any internal (within the current border) paths of distance 1 are unique. Next to the great-granchildren etc, and seeing whether any paths of distance 2 within the current border are unique. Continue, you will always stop because a document always has a root. (You would presumably handle the effect of longer uniquess patterns incrementally coming into play by giving every element a number, indicating the length of the shortest unique path in which it participates. This number would be added to the distance from the element to see if it is within the current boundary as a candidate IYSWIM) Now you have found the point you are going to anchor your XPath from. We now have to work backwards from that anchor to our element. Treat the anchor as the root of a new tree (dont allow up access from it, or rather, don' allow acess from the branch that contains both it and your element, because the anchor may not be an ancestor). Procede downwards in a similar way to the upwards way, but this time the alternatives are to find the element by a direct path or the nearest similar-sized set of vertices. If you find a unique set of vertices before you find a path, that is your second anchor. Trim the tree to fit both the anchor and your element and repeat. Something like that. If you wanted "shortness" to be strict string shortness, you would also have to factor in the sizes of element names and id values, I guess. A simpler heuristic might be just to go up ancestor-or-self::* until you find an element with an id or the root. Given that element names are often longer than 6 characters, you may as well just use *[n] as use the element name unless they are unusually small. Then attempt to make this path even smaller by checking if it has any unique subpaths in it (i.e. if it is *[@id="x1223"]/*[2]/*[5]/*[1] and the *[2] is an x and the *[5] is a y and there is only one x/y in the document, it is shorter to use that x/y/*[1]. This is not optimal but does not involve the long searches of the previous method. There would, of course, be lots of intermediate methods for better results. Cheers Rick Jelliffe
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|
|||||||||

Cart








