[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Schema ambiguity detection algorithm for RELAX (4/4)
This is the final leg of our long journey. And fortunately, this is the easiest of the four. In this post, the algorithm that detects if a label is a catalyst is proposed. This algorithm is named "detectCatalyst" and "dC" in short. In the first mail, the definition of catalyst is somewhat abstract. (for your reminder) > So, what did we learn? We learned that for a grammar to be ambiguous, we > need the third label, which acts as catalyst. This is the key factor > that makes a grammar ambiguous. > From now on, I call this third label as "catalyst". First, I'll define catalyst in more formal way. Then, I'll prove that this definition complies the following requirement of catalyst. > the grammar is ambiguous if and only if it has at least one catalyst. After that, I'll propose an algorithm to detect a catalyst. Finally, I'll prove that the algorithm is correct. 1. Formal definition of catalyst OK. The first part will be the formal definition of catalysts. Say we have two DIFFERENT sequences of labels, and say their lengths are the same. { s1 , s2, ... , sn } { s1', s2', ... , sn' } Say for every pair (si,si'), either si=si' or si and si' are in the ambiguous relation. Since they are two different sequences, we have at least one pair such that si!=si' (but they are in the ambiguous relation). The formal definition of catalysts follows: The label L is a catalyst if and only if (1) there exists two such sequences, and (2) label L can accepts both of them 2. Consistency of the formal definition with respect to the requirement The second part is to prove that this definition satisfies the following requirement. > the grammar is ambiguous if and only if it has at least one catalyst. Say the label L is a catalyst. From the definition, we know that there actually exists two such sequences. Every pair (si,si') of two sequences, there exists a tree that can be interpreted by both si and si' (if si=si' then this tree is trivial. if si!=si' then the existence is guaranteed by the ambiguous relation). By horizontally combining those trees, we can say that there exists a hedge that can be interpreted by both sequences. And by definition of catalyst, label L can accept both interpretation. Therefore, the entire grammar is ambiguous, by the definition of grammar ambiguity. So, I just proved that if there exists a catalyst in a grammar, it is an ambiguous grammar. The rest is to prove that if a grammar is ambiguous, there exists a catalyst in it. Say the grammar G is ambiguous. In the first post, I proved that if G is ambiguous, there exists at least one self-ambiguous label. Name it L. By definition of self-ambiguous label, we can say that there exists a hedge X such that X can be interpreted in more than one way (I and I'), and both I and I' are acceptable by L. These I and I' satisfies the two requirements of being "two sequences" that I described in the first part of this post. Therefor, L is a catalyst. As a whole, I proved that the formal definition of catalyst is in fact well-defined one, and consistent with the requirement of the first post. 3. "detectCatalyst" algorithm. The third part proposes actual algorithm to detect if a label is a catalyst or not. I name it "detectCatalyst" or "dC" in short. dC receives the followings as input. (1) Label L, which is going to be inspected. (2) Score matrix A, which is (maybe not-fully) completed by iterative execution of dTLA. We know that some fields of the score matrix can be still blank. dC produces one of the following output and always stops. (1) Label L is a catalyst (2) Label L is not a catalyst Intuitively, this algorithm finds the "two sequences" that I described in the previous section. Let the content model automaton of L be M as follows: M = ( Q, S, q0, F ) This algorithm constructs the following automaton M* from M. M* = ( QxQ, S*, T*, (q0,q0), FxF ) S* = { (s,s') | s,s' are members of S, and ( s=s' or s and s' are in the ambiguous relation) } T*( (q,q'), (s,s') ) = ( T(q,s), T(q',s') ) And let us write this M* down to some big whiteboard. When writing a edge of (s,s), use a black pen please. When writing a edge of the form of (s,s'), please use a red pen. This algorithm says "L is a catalyst" if there exists a path from (q0,q0) to a final state that uses at least one red edge. If no such path exists, the algorithm says "L is not a catalyst" and stops. Many algorithm can be possible to detect the existence of such a path. One way to do is minimize M* and see if we have any red edge in the result. If such an edge exists, such a path exists. If the edge doesn't exist, neither does the path. 4. Why does this algorithm works? Say dC said "L is a catalyst" and stopped. This only occurs if there exists a sequence or label pair { (s1,s1') , (s2,s2') , (s3,s3') , ... } And this sequence is accepted by M*. And we know that at least one of the pair is a red edge. That means si!=si' for a certain i but si and si' are in the ambiguous relation. By "unzipping" this sequence, we obtain two sequences of the same length. { s1, s2, ... } { s1', s2', ... } From the way I defined T*, we can say that both sequences can be accepted by M. So these two sequences are the "two sequences" that defines L being in fact a catalyst. In summary, I proved that if dC says "it's a catalyst", then it actually is. Next, I'll prove that if it is actually a catalyst, the algorithm always says "it's a catalyst". If L is actually a catalyst, there exists the two sequences { s1, s2, ... } { s1', s2', ... } Every pair (si,si') is either equal or in the ambiguous relation. And at least one pair is not equal. Furthermore, both of the sequence can be accepted by M by the following transition sequences. q0 --s1---> q1 --s2---> q2 .... qf q0 --s1'--> q1' --s2'--> q2' .... qf' Since both are accepted by M, both qf and qf' are members of F. Let's "zip" those two sequences into one sequence. { (s1,s1'), (s2,s2'), ... } What happens to M* if it receives this sequence? The transition sequence of M* will be as follows: (q0,q0) --(s1,s1')--> (q1,q1') --(s2,s2')--> (q2,q2') ... --> (qf,qf') As I mentioned earlier, both qf and qf' are members of F. So (qf,qf') is the final state of M*, which means this sequence is accepted by M*. And the fact that at least one pair (si,si') is not equal guarantees that this transition uses a red edge for such pair. From the second post, we know that our score matrix is never blank for those (si,si'). So, in cases like this, the algorithm always stops by saying "L is a catalyst". This proves that if it is actually a catalyst, the algorithm always says "it's a catalyst". As a whole, I proved to thing * dC says it's a catalyst -> It actually is. * It actually is a catalyst -> dC says so. And these two guarantees that dC is a correct (sound as well as complete) algorithm. regards, ---------------------- K.Kawaguchi E-Mail: k-kawa@b...
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|