[XSL-LIST Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message]

Re: Backtracking and eternal loops caused by regular

Subject: Re: Backtracking and eternal loops caused by regular expressions matching: what to expect from implementations?
From: Abel Braaksma <abel.online@xxxxxxxxx>
Date: Wed, 24 Jan 2007 11:43:41 +0100
Re:  Backtracking and eternal loops caused by regular
Michael Kay wrote:
(b) The specs have nothing to say about performance.

I understand. Sounds reasonable and logical.


(c) Saxon relies entirely on the regex engines in the underlying platform
(Java or .NET). I dare say there are cases where one regex engine will find
an optimization that another one misses.

I cannot speak for .NET, but in the thread on the Saxon list I go deeper into this. It's not just about a spurious optimization feat or mismatch, it is about quite a common case.


(d) I was under the impression that regular expression evaluation will
always terminate, though of course it's possible to construct cases where
that might require extreme patience to observe.

In technical terms, you are right, however, not if you include the definition the astronomers have given about time: it ends some day. Jeffrey Friedl gives a calculation on page 227 of his 2nd ed book that a 50 char string with a bad regular expression, may run for a years. Here's the same with Java:


25 char:  2.5 seconds
26 char: 5.3 seconds
27 char: 10.7 seconds
28 char: 21.4 seconds
35 char: 2739 seconds
40 char: 24.4 hours
50 char: almost 3 years
85 char: 97 * 10^9 years (97 billion years)

Meaning that if I apply such a regex to a corpus of over 85 characters, the process will stop long after the the whole universe has died out (current thesis: it will last about 30 billion years at most). Which is pretty long ;)

Calculation of this performance penalty: 2^(stringlen) * (time-one-char-match)

Java (Sun) is just a little behind with adding this optimization, because really, there is no need in testing any string more than string-len times, usually much less even.

-- Abel Braaksma

Current Thread

PURCHASE STYLUS STUDIO ONLINE TODAY!

Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced!

Buy Stylus Studio Now

Download The World's Best XML IDE!

Accelerate XML development with our award-winning XML IDE - Download a free trial today!

Don't miss another message! Subscribe to this list today.
Email
First Name
Last Name
Company
Subscribe in XML format
RSS 2.0
Atom 0.3
Site Map | Privacy Policy | Terms of Use | Trademarks
Free Stylus Studio XML Training:
W3C Member
Stylus Studio® and DataDirect XQuery ™are products from DataDirect Technologies, is a registered trademark of Progress Software Corporation, in the U.S. and other countries. © 2004-2013 All Rights Reserved.