
Re: Wide Finder in XSLT --> deriving new requirements

Subject: Re: Wide Finder in XSLT --> deriving new requirements for efficiency in XSLT processors.
From: "Dimitre Novatchev" <dnovatchev@xxxxxxxxx>
Date: Sat, 10 Nov 2007 17:42:29 -0800
Thanks to Dr. Kay and Colin Adams for their comments.

> I couldn't find a clear description of the problem - your link named
> "problem" seems to lead to a book.

One needs access to the "Beautiful Code" book. Tim Bray is one of its
authors and obviously wishes that the book sells well.

I read it through a Safari (O'Reilly) subscription. While reproducing
the contents of Safari books is most likely illegal, here is the
problem in my own words:

Tim has a very big file consisting of web-server log records for
www.tbray.org accumulated for four years.

A fragment (from the 2MB, 10,000-record file he made available at
  http://www.tbray.org/tmp/o10k.ap )
looks like this:

h-68-166-233-4.cmbrmaor.dynamic.covad.net - - [01/Oct/2006:06:34:11
-0700] "GET /ongoing/ongoing.atom
HTTP/1.1" 304 - "-" "NetNewsWire/2.1 (Mac OS X;
http://ranchero.com/netnewswire/)"

rev1.bancoestado.cl - - [01/Oct/2006:06:34:14 -0700] "GET
/tag/rddl4.html HTTP/1.1" 200 11086 "-" "Mozilla/4.0 (compatible; MSIE
5.5; Windows NT 5.0)"

163.247.45.205 - - [01/Oct/2006:06:34:16 -0700] "GET /tag/rddl4.html
HTTP/1.1" 200 11086 "-" "Mozilla/4.0 (compatible; MSIE 5.5; Windows NT
5.0)"

215.red-81-44-140.dynamicip.rima-tde.net - - [01/Oct/2006:06:34:17
-0700] "GET /ongoing/When/200x/2004/10/01/AutumnLeaves HTTP/1.1" 200
4839 "http://images.google.es/imgres?imgurl=http://www.tbray.org/ongoing/When/200x/2004/10/01/-big/IMG_2686.jpg&imgrefurl=http://www.tbray.org/ongoing/When/200x/2004/10/01/AutumnLeaves&h=720&w=960&sz=1328&hl=es&start=17&tbnid=87JPYkrmqS9cGM:&tbnh=111&tbnw=148&prev=/images%3Fq%3Dautumn%2Bleaves%26imgsz%3Dxxlarge%26svnum%3D10%26hl%3Des%26hs%3DFPg%26lr%3D%26client%3Dfirefox-a%26rls%3Dorg.mozilla:es-ES:official_s%26sa%3DG"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.8.0.7)
Gecko/20060909 Firefox/1.5.0.7"

215.red-81-44-140.dynamicip.rima-tde.net - - [01/Oct/2006:06:34:17
-0700] "GET /ongoing/sans.css HTTP/1.1" 200 4843
"http://www.tbray.org/ongoing/When/200x/2004/10/01/AutumnLeaves"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.8.0.7)
Gecko/20060909 Firefox/1.5.0.7"

215.red-81-44-140.dynamicip.rima-tde.net - - [01/Oct/2006:06:34:18
-0700] "GET /ongoing/ongoing.js HTTP/1.1" 200 6685
"http://www.tbray.org/ongoing/When/200x/2004/10/01/AutumnLeaves"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.8.0.7)
Gecko/20060909 Firefox/1.5.0.7"

215.red-81-44-140.dynamicip.rima-tde.net - - [01/Oct/2006:06:34:18
-0700] "GET /ongoing/serif.css HTTP/1.1" 200 4829
"http://www.tbray.org/ongoing/When/200x/2004/10/01/AutumnLeaves"
"Mozilla/5.0 (Windows; U; Windows NT 5.1; es-ES; rv:1.8.0.7)
Gecko/20060909 Firefox/1.5.0.7"

He wants to collect all parts of a log record that are of the type:

   200x/2004/10/01/AutumnLeaves

knowing that this is a suffix of:

  "GET /ongoing/When/200x/2004/10/01/AutumnLeaves "  (1)

and that this string always ends with a space.

Also, the name of the article ("AutumnLeaves" in the above records)
should not contain a dot (only picture files and other auxiliary files
have a dot and an extension).

Some records, such as the first four listed above, may not contain the
pattern of type (1) at all.


So, the problem is to find all such strings as

  "200x/2004/10/01/AutumnLeaves"

containing the date of the visit and the article name, and to display
the 10 most frequent such strings.
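The extraction rule can be illustrated with a minimal Ruby sketch run
against one of the sample records quoted above (the line is abridged
here; the regular expression is the one Tim Bray uses):

```ruby
# One sample record, abridged from the o10k.ap fragment quoted above.
line = '215.red-81-44-140.dynamicip.rima-tde.net - - [01/Oct/2006:06:34:17 ' \
       '-0700] "GET /ongoing/When/200x/2004/10/01/AutumnLeaves HTTP/1.1" 200 4839'

# The trailing space anchors the end of the article name; [^ .]+ excludes
# dots, so picture files and other auxiliary files (e.g. sans.css) never match.
pattern = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

puts $1 if line =~ pattern  # prints "2004/10/01/AutumnLeaves"
```

Note that the capture group starts after the "200x/" part, so the
collected key is the date plus the article name.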

Tim Bray's Ruby program that does this is the following:

counts = {}
counts.default = 0

ARGF.each_line do |line|
  if line =~ %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }
    counts[$1] += 1
  end
end

keys_by_count = counts.keys.sort { |a, b| counts[b] <=> counts[a] }
keys_by_count[0 .. 9].each do |key|
  puts "#{counts[key]}: #{key}"
end

and produces the following result:

	~/dev/bc/ 548> zcat ~/ongoing/logs/2006-12-17.log.gz | \
                                   time ruby code/report-counts.rb
     4765: 2006/12/11/Mac-Crash
     3138: 2006/01/31/Data-Protection
     1865: 2006/12/10/EMail
     1650: 2006/03/30/Teacup
     1645: 2006/12/11/Java
     1100: 2006/07/28/Open-Data
     900: 2006/11/27/Choose-Relax
     705: 2003/09/18/NXML
     692: 2006/07/03/July-1-Fireworks
     673: 2006/12/13/Blog-PR

           13.54 real         7.49 user                      0.73 sys

when run against a 245MB file of 1.2 million records on a 1.67 GHz
Apple PowerBook.

Tim Bray also goes on to explain why he considers the above Ruby code
"beautiful".

I hope this is a precise description of the problem as specified in
chapter 4, "Finding Things", of the "Beautiful Code" book.


I followed Dr. Kay's advice to move the definition of the $vLines
variable inside the template, making it local. This did indeed result
in some speed-up: the processing time for the 200MB file dropped from
36.175 sec. to 34.697 sec. The RAM used (as reported by Saxon)
increased from 929MB to 969MB.

The stylesheet also benefited from Dr. Kay's suggestion to replace the
concat() function with a plain sequence in the "select" attribute of
<xsl:value-of ... />

The whole transformation now is the following:

<xsl:stylesheet version="2.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 xmlns:xs="http://www.w3.org/2001/XMLSchema">
  <xsl:output method="text"/>

  <xsl:variable name="vRegEx" as="xs:string" select=
   "'^.*?GET /ongoing/When/[0-9]{3}x/([0-9]{4}/[0-9]{2}/[0-9]{2}/[^ .]+) .*$|^.+$'"/>

  <xsl:template match="/">
    <xsl:variable name="vLines" as="xs:string*" select=
     "tokenize(unparsed-text('file:///C:/Log1000000.txt'),'\n')"/>

    <xsl:for-each-group group-by="." select=
    "for $line in $vLines
         return replace($line,$vRegEx,'$1')[.]">
      <xsl:sort select="count(current-group())" order="descending" />
      <xsl:if test="not(position() > 10)">
       <xsl:value-of select=
          "count(current-group()),':',current-grouping-key(),'&#xA;'"/>
      </xsl:if>
    </xsl:for-each-group>
  </xsl:template>
</xsl:stylesheet>

I also checked, with another transformation, that just reading the
file takes 11.310 seconds. This means that all the rest of the
processing takes about 23 seconds: tokenizing 1 million lines,
scanning each of them against the above regular expression and
retrieving the matched string, then grouping all matched strings,
sorting them by group count, and producing the result.
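For comparison, that pipeline (tokenize, match, group, sort by group
size, truncate to ten) can be restated as an in-memory Ruby sketch; the
miniature log below is made-up data, purely for illustration:

```ruby
# A hypothetical Ruby restatement of the XSLT pipeline above, assuming the
# whole log fits in memory.
pattern = %r{GET /ongoing/When/\d\d\dx/(\d\d\d\d/\d\d/\d\d/[^ .]+) }

log = <<~LOG
  a - - [..] "GET /ongoing/When/200x/2006/12/11/Mac-Crash HTTP/1.1" 200 1
  b - - [..] "GET /ongoing/When/200x/2006/12/11/Mac-Crash HTTP/1.1" 200 1
  c - - [..] "GET /ongoing/When/200x/2006/12/10/EMail HTTP/1.1" 200 1
  d - - [..] "GET /ongoing/sans.css HTTP/1.1" 200 1
LOG

keys   = log.lines.filter_map { |l| $1 if l =~ pattern } # match + keep non-empty
groups = keys.group_by(&:itself)                         # xsl:for-each-group
top    = groups.sort_by { |_, g| -g.size }.first(10)     # xsl:sort + position()

top.each { |key, group| puts "#{group.size}: #{key}" }
```

Each step maps onto one construct of the stylesheet: filter_map plays
the role of replace()[.], group_by that of xsl:for-each-group, and
sort_by plus first(10) that of xsl:sort with the position() test.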

Could this be done better?



-- 
Cheers,
Dimitre Novatchev
---------------------------------------
Truly great madness cannot be achieved without significant intelligence.
---------------------------------------
To invent, you need a good imagination and a pile of junk
-------------------------------------
You've achieved success in your field when you don't know whether what
you're doing is work or play



On Nov 10, 2007 2:15 AM, Michael Kay <mike@xxxxxxxxxxxx> wrote:
>
> >
> > I have published this on my blog:
> >
> >
> > http://dnovatchev.spaces.live.com/Blog/cns!44B0A32C2CCF7488!385.entry
> >
> > There are two areas on which I would appreciate any feedback:
> >
> >    1. Finding a more efficient solution (there are such
> > RegExp gurus here!)
> >
> >    2. Discussing the ideas for lazy evaluation/streaming and
> > on constructs (a single extension function exslt:parMap() is
> > proposed) hinting possibilities for parallelization
>
> I managed to guess a username/password that worked(!) and made a comment,
> but it appears anonymously.
>
> I couldn't find a clear description of the problem - your link named
> "problem" seems to lead to a book.
>
> Most of the discussion suggests that the performance is going to be
> dominated by time taken to read the data off the disk. So I wouldn't have
> thought there is an enormous win for parallelization here. There's certainly
> more that can be done to reduce memory requirements by pipelining, though.
>
> I think you're right that parallelizing probably needs some kind of user
> hint in the stylesheet, but my instinct would be to make it an extension
> attribute so that the code will still work on any processor. There are
> certainly lots of opportunities. I had been thinking that probably the first
> thing to try would be
>
> <xsl:for-each select=...." xx:threads="4">
>
> and allocate the processing of the items in the input sequence to the N
> threads in a round-robin fashion, the challenge being of course how to
> marshal the output of the N threads, stitching it back together as a
> sequence in the right order, without using a lot of extra memory and
> creating a lot of extra coordination overhead. The ideal would be that if
> the input sequence is streamed, the output sequence should be streamed too.
>
> Michael Kay
> http://www.saxonica.com/
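The round-robin scheme Dr. Kay describes can be sketched in Ruby,
purely as an illustration of dealing items out to N workers and
stitching the results back into input order (in MRI the global
interpreter lock would limit any real speed-up for CPU-bound work, so
this only shows the coordination pattern, not a performance win):

```ruby
N = 4                  # number of worker threads (xx:threads="4")
items = (1..10).to_a   # stand-in for the input sequence

threads = (0...N).map do |t|
  Thread.new do
    # Worker t processes items t, t+N, t+2N, ... in input order;
    # x * x stands in for the per-item work.
    items.each_slice(N).filter_map { |slice| slice[t] }.map { |x| x * x }
  end
end
results = threads.map(&:value)  # one ordered result list per worker

# Stitch: item i was dealt to worker i % N as that worker's (i / N)-th
# item, so reading the lists back round-robin restores the input order.
output = (0...items.size).map { |i| results[i % N][i / N] }
puts output.inspect  # prints the squares of 1..10, in original order
```

The stitching step is cheap here because everything is materialized;
the hard part Dr. Kay points at is doing the same merge while both the
input and the output remain streamed.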
