# Re: Re: Regular expression functions (Was: Re: comment

 Subject: Re: Re: Regular expression functions (Was: Re: comments on December F&O draft) From: Jeni Tennison Date: Sat, 12 Jan 2002 22:49:50 +0000
Hi Dimitre,

> YACC is a parser generator for LALR(1) grammars (a subset of
> context-free grammars), which describe languages much more powerful
> than what regular expressions can describe.

Yep. For David's example, I think that's what we need.

(By the way, obviously I'm not a computer scientist and don't grasp
if you could explain terminology and acronyms like "LALR(1)" and
"context-free grammars" - all the reading that I did seemed to assume
you knew what they meant before hand :( Also, I might be using
terminology wrongly, such that you misunderstand what I'm trying to
say.)

>> Something along the lines of:
>>
>>   $row => ($frac)|($sqrt)|($expr)
>>   $frac => \\frac\{($row)\}\{($row)\} >>$sqrt     =>  \\sqrt\{($row)\} >>$expr     => ($times)|(($operand)?(\s*($operator)\s*($operand))+)
>>   $times => ($operand){2,}
>>   $operand => ($row)|($sup)|($number)|($ident) >>$sup      => ($operand)\^($operand)
>>   $operator => \\pm|\- >>$number   => -?[0-9]+(\.[0-9]+)?
>>   $ident => [a-z][a-z0-9]* > > This is not a regular grammar but a context-free grammar (although > it is not formally defined and I have to guess which exactly is the > set of terminal and non-terminal symbols, as well as which is the > starting symbol). The regular expression syntax that I was aiming for here is an extension of the regular expression syntax from XML Schema. The only extension in evidence here is that it allows you to insert named subexpressions. For that, I was using the syntax that I introduced earlier in this thread: "(" "$" subexpression-name ")" (so ($times) is a reference to a named subexpression (or non-terminal in other words) that's bound to the name 'times'). Perhaps I shouldn't be calling these "regular expressions", but they seemed roughly equivalent to Lex regular expressions from what I read at http://dinosaur.compilertools.net/lex/index.html (though there they use the syntax "{" non-terminal "}" to refer to a non-terminal). I was aiming for something where the terminals are regular expressions that don't contain references (so$operator, $number and$ident).

> It seems that $row,$frac, $sqrt,$expr and $operand are the > non-terminal symbols. In this case this is a context-free grammar > that does not describe a regular language (a language that can be > described/generated by a set of regular expressions). This is so, > because in a set of rules describing a regular language the rules > can only have terminal symbols and then one variable. The rule for >$expr is in clear violation of this principle.

Sorry, I'm having difficulty keeping up. Do you mean that you cannot
articulate something like, for example:

AndExpr  :=  Expr "and" Expr

if you want to describe a regular language, because it contains more
than one variable (which I take to mean a reference to a
non-terminal)?  (Again, finding a readable description of a "regular
language" is hard to do.)

>> I am fairly convinced that it's possible to create an
>> implementation to do this, based on the fact that there are plenty
>> of lexers out there that do roughly the same thing.
>
> As shown above, what would be needed is not a "lexer" but a parser
> for CF grammars.
>
> In case a parser generator is used, it has to be fed with the rules
> of the grammar. It then generates a table that guides the parsing
> process. This table typically shows the action to be taken and the
> next state of the parser, if it is in a given state, the top of the
> stack contains a given symbol and the input contains a given
> terminal symbol. This table-driven parser will need a lexical
> analizer to call when each next non-terminal from the input is
> needed.
>
> You're suggesting this to be done dynamically at transformation time.

Or that the table is built up during compilation of the stylesheet
from static declarations in the stylesheet.

> Not that this is impossible, but this is something much more complex
> and less efficient than matching regular expressions.

For sure. I was trying to give a possible method of handling David's
example, in which there's a language that has nesting structures,
which can't, I think, be handled by normal regular expressions.

If there's a better alternative, that'd be great. If there's an
alternative that would mean a bit more work on the user side, but
better performance, then that'd be great too. It it's just totally
impractical whatever way we look at it, I guess David will have to
stick with using XSLT string manipulation functions to parse his LaTeX
mathematical equations.

Cheers,

Jeni

---
Jeni Tennison
http://www.jenitennison.com/

XSL-List info and archive:  http://www.mulberrytech.com/xsl/xsl-list



### PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!