Subject: Re: why matches($title,'.*?(\.|,)\s*$')) can perform so much worse than matches($title,'(\.|,)\s*$'))
From: Oliver Hallam <oliver@xxxxxxxxxxx>
Date: Wed, 13 Jul 2011 13:33:53 +0100
The way that fn:matches usually works is that it attempts to apply the regular expression starting from each character in the string.

So, looking at the regular expression ".*?Expr" where Expr is another regular expression:

.*? is a reluctant match and so we try to match the shortest expression possible first.

Starting from the first character:
Try to match the empty string for .*? and then try to match Expr starting from the first character.
Try to match a single character for .*? and then try to match Expr starting from the second character.
Try to match two characters for .*? and then try to match Expr starting from the third character.
...


Thus, in trying to find a match for ".*?Expr" starting at a particular character, we try to match Expr starting at that character and at every later character in the string, until a match is found.
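The steps above can be sketched in Python. This is only a model of the behaviour described here, not the implementation of any XPath processor: the function name lazy_prefix_attempts is made up for illustration, and Python's re module stands in for the XPath regular expression engine (the two differ in details, for example in how "$" treats a trailing newline).

```python
import re

def lazy_prefix_attempts(text, expr, start=0):
    """Model '.*?Expr' from one start position.

    The reluctant '.*?' first consumes nothing, then one character,
    then two, and so on; after each step Expr is tried at the
    resulting offset. Returns (matched, number_of_Expr_attempts).
    """
    tail = re.compile(expr)
    attempts = 0
    for skipped in range(len(text) - start + 1):
        attempts += 1
        # Pattern.match(text, pos) tries Expr exactly at offset pos.
        if tail.match(text, start + skipped):
            return True, attempts
    return False, attempts

if __name__ == "__main__":
    expr = r"(\.|,)\s*$"
    print(lazy_prefix_attempts("abc,", expr))
    print(lazy_prefix_attempts("abcd", expr))
```

For a string with no match, Expr is tried once at every offset from the start position to the end of the string before this one start position is given up.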

fn:matches attempts to find a match for the regular expression starting from each character in the string. If a match isn't found starting at the first character then it will try again starting at the second, and so on. So if there are no matches then Expr is tested on the order of n^2 times, where n is the number of characters in the string, whereas matches applied to Expr alone would test it only about n times.
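To make the n^2 versus n comparison concrete, here is a small counting sketch in Python. It assumes the naive strategy described above (try every start position, and for ".*?Expr" try Expr at every offset from that start onwards); the helper count_expr_tests is hypothetical, and real engines may count differently or optimise some of this away.

```python
import re

def count_expr_tests(text, expr, lazy_prefix):
    """Count how often Expr is tried under the naive strategy.

    lazy_prefix=False models matches(text, 'Expr').
    lazy_prefix=True  models matches(text, '.*?Expr').
    """
    tail = re.compile(expr)
    tests = 0
    for start in range(len(text) + 1):          # each start position
        if lazy_prefix:
            offsets = range(start, len(text) + 1)  # '.*?' grows step by step
        else:
            offsets = [start]
        for offset in offsets:
            tests += 1
            if tail.match(text, offset):
                return tests                     # stop at the first match
    return tests

if __name__ == "__main__":
    title = "a" * 100                            # no '.' or ',' anywhere
    print(count_expr_tests(title, r"(\.|,)\s*$", lazy_prefix=False))
    print(count_expr_tests(title, r"(\.|,)\s*$", lazy_prefix=True))
```

Under this model a 100-character string with no match costs 101 tests without the prefix and 5151 with it, that is (n+1)(n+2)/2, which grows like n^2.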

Now, if your query was (note the '^'):

matches($title,'^.*?(\.|,)\s*$')

then I would expect performance to be very similar to:

matches($title,'(\.|,)\s*$')


It would be perfectly valid (and sensible) for a query processor to realise that the two expressions you gave were equivalent and so not perform n^2 tests, but I am unaware of a processor that makes these kinds of optimizations to regular expressions.
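As a quick sanity check of that equivalence, the following Python sketch compares the three patterns on a few sample titles. It uses Python's re.search in place of fn:matches, which is an assumption: the two regex dialects are close for these patterns but not identical, so the samples avoid newlines. The sample strings and the helper name patterns_agree are made up for illustration.

```python
import re

PATTERNS = [
    r".*?(\.|,)\s*$",    # original slow form
    r"^.*?(\.|,)\s*$",   # anchored form
    r"(\.|,)\s*$",       # short form
]

def patterns_agree(title):
    """True if all three patterns give the same yes/no answer for title."""
    results = {bool(re.search(p, title)) for p in PATTERNS}
    return len(results) == 1

if __name__ == "__main__":
    samples = ["A title.", "A title,   ", "No punctuation", "", "a.b"]
    for s in samples:
        print(repr(s), patterns_agree(s))
```

Agreement on a handful of samples is not a proof, but it illustrates why the leading ".*?" adds cost without changing the result.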



Oliver Hallam XQSharp


On 12/07/2011 17:26, Alex Muir wrote:
Hi,

I'm wondering why matches($title,'.*?(\.|,)\s*$')) can perform so much
worse than matches($title,'(\.|,)\s*$'))

I found at least in one file out of thousands that I process the first
one can take a good 30 minutes to complete and the second is quick.

No doubt the .*? causes some problem but what exactly is the problem
that it causes?

Regards

--
Alex Muir
Instructor | Program Organizer - University Technology Student Work
Experience Building
University of the Gambia
http://sites.utg.edu.gm/alex/

Low budget software development benefiting development in the Gambia,
West Africa
Experience of a lifetime, come to Gambia and Join UTSWEB -
http://sites.utg.edu.gm/utsweb/
