[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Schema ambiguity detection algorithm for RELAX (2/4)
In this post, We will make a closer look to blanks in the score matrix, which I proposed in the previous post. In that post, I said when blanks are left, they can safely be assumed as unambiguous. But why? this post answers this question. To explain the proof clearly, I need a help of maneki-neko, our well-known mascot cat. (BTW, say hello to him at http://www.xml.gr.jp/relax/) Let G be an ambiguous grammar. More precisely, we don't know that G is ambiguous. But since our cat is almighty, he knows it is. And he knows the nice XML instance X, which can be interpreted in two ways I and I', whereas we only know the existence of X. Let me ask him a favor. Since he knows X and I and I'. I ask him to pick up all elements that are given the different label by I and I'. Since we have no idea about X, we cannot know these nodes. However, we do know that such elements are disjoint, and therefore can be considered as a set of interconnected parts. And each interconnected part forms a tree of its own. Ask him another favor. I want him to select one interconnected part(we know it's a tree) from nodes he just picked up. We have absolutely no idea of the tree he selected, but we know that that tree is finite in terms of size. One more favor. I want him to look at labels of the root node of the tree he selected. Since we first ask him to pick up only those elements that have two different labels, this root node has two different labels, too. Name them Lr1 and Lr2, respectively. We don't know what Lr1 and Lr2 actually are (only he knows), but we do know that those labels are in the grammar G, and Lr1!=Lr2. What I'd like to prove is, for these Lr1 and Lr2, our score matrix always have "#" mark, which indicates they are in the ambiguous relation, even when we don't know Lr1 and Lr2. (**) We never know what Lr1 and Lr2 are, but for every possible such Lr1 and Lr2, we have "#" mark on our score matrix. That's what I will prove. Let's denote the tree, which is selected by our cat, by T. All we know about T is its size is finite. And this the only thing we need to prove. When the size of tree is finite, we always have at least one "leaf", which is a node without any children (it may have children that do not have different labels). Consider the set of all such nodes and name it N0. Again we don't know what nodes are in N0 at all, but we do know that such a node actually exists. So name it n. n has two different labels, so name it I(n) and I'(n) respectively. We do know that both I(n) and I'(n) are members of the grammar, but that's the only thing we know. n may have some children. Even so, All of its children have only one label assigned to it, since n is a leaf of T. Although we don't know I(n) and I'(n), since our algorithm tests every possible pair of two labels, we test I(n) and I'(n) in this process, without knowing so. And our algorithm that tests the ambiguity of two labels (detectTwoLabelsAmbiguity, dTLA in short ) is capable of detecting ambiguity of these two even if all the score matrix are blank. dTLA algorithm will be explained in the next post in better detail. In short, the fact that each children of n has only one label, while n has two different labels, are so decisive that dTLA cannot miss this situation. In summary, although only the cat knows the exact members of N0, we know that for every such node n in N0 and for the two different labels I(n) and I'(n), our score matrix always have "#" mark for them. So if the height of T is 1, (**) is proved. Do you think he may select a tree of height 2 or more? No problem. If the height of T is 2 or more, we always have at least one node that satisfies the following requirement. * every children of n is either (1) a member of N0, or (2) has only one label. There may be more than one nodes that satisfy this requirement. But there always at least one such node. Consider a set of all such nodes and name it N1. Again, we don't know anything about N1, except its size is not zero. So let him pick one and name it n. And again we name n's two different labels I(n) and I'(n) respectively. We only know that these are the members of the grammar G. However, again we tests every possible pair in the algorithm, we tests I(n) and I'(n) without knowing. And (again), I'll prove that our dTLA algorithm is wise enough to decide these two labels are in the ambiguous relation if only one condition is met. That condition is, we have # mark for every label pair that can be assigned to nodes in N0. As I mentioned earlier, those label pairs are so apparently ambiguous that dTLA cannot miss it. So after the first round of testing, we have # mark for every such possible label pairs. So, in the second round of testing, this condition is always met. What happens to dTLA if some information are missing? I'll prove in the next post that dTLA never makes mistake. So the possible result from dTLA under insufficient information is either (1) it cannot decide ambiguity (2) it detects these two are in the ambiguous relation. And as you see, both are fine. Some of you may think why dTLA can decide those two labels are in the ambiguous relation if this condition is met. But to cut the long story short, I'll skip this proof. (It's trivial, isn't it?) As a whole, if the tree selected by him has 2 or less height, we know that our score matrix always have "#" mark for every possible Lr1 and Lr2, after certain iterative testing. Now I can say that I proved (**) for height 2 or less. In general, when the height of T is h or more, we can predict the existence of n(h-1) such that * every child of n(h-1) is either (1) a member of N0, or (2) a member of N1, or (3) ... (4) a member of N(h-2), or (5) has only one label And by the induction on the height of tree, I prove that for every tree of finite height, (**) holds. We don't know what he selects as T, but for whatever T he selects, (**) is guaranteed. Next, consider the parent of T. Because we've "normalized" the grammar, the top level label is always S, the "virtual" root of XML instance may not be a member of T. Therefore, T has always a parent node. So name it p. Again (oh, again!) we have absolutely no idea about what p is like. Its existence is the only knowledge that we have. p may have children. Name it c1,c2, ... We don't know if it exists, nor how many. The cat knows everything, but we don't. However, we can say that c1 has * assigned the same label by both I and I' * assigned two different label by I and I' If the second is the case, such a child can be a root of T, so our score matrix already has "#". And since by the definition of tree T, p has only one label. Name it Lp. We know that Lp is by definition a catalyst. The algorithm that checks if a label is a catalyst (name it detectCatalyst "dC" which I'll explain in the fourth post) is capable of telling that such Lp is a catalyst. Intuitively this is because our score matrix has enough information (for every possible label pairs that can be assigned to children of p, our score matrix has a "#".) As you see, the cat didn't tell me anything about X or T or p from beginning to end. But the algorithm nonetheless detects that the grammar is ambiguous. So, in short I proved "if the grammar is in fact ambiguous, the algorithm always detects it". Furthermore, it is trivial to prove that the grammar is ambiguous if the algorithm detects the ambiguity. By these two proof, I can say that this algorithm is a perfect (I mean, complete as well as sound) algorithm to detect the ambiguity. regards, ---------------------- K.Kawaguchi E-Mail: k-kawa@b...
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|