
Re: combining XMLEvent lists

  • From: David <dlee@calldei.com>
  • To: xml-dev@lists.xml.org
  • Date: Tue, 28 Sep 2010 13:55:15 -0400

  I have such a beast if you're interested, or pieces of it.
In xmlsh I've experimented with StAX pipelines, which are queues of StAX 
events.
These are pretty much exactly as Michael describes below, although in my 
case they are multi-threaded, with a reader on one end and a writer on 
the other. The underlying techniques work well.
The source is available at SourceForge (follow the link from www.xmlsh.org).
I can help you locate the relevant files if you're interested.
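
To give a rough idea of the shape of such a pipeline, here is a minimal sketch 
using only the JDK. It is not the xmlsh code; the class name EventPipe, the 
method copyThroughQueue, the queue size and the timeout are all assumptions 
made for illustration. One thread parses text into XMLEvents and puts them on 
a bounded queue, and the other end takes them off and serializes them again.

```java
import java.io.StringReader;
import java.io.StringWriter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.TimeUnit;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLEventWriter;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLOutputFactory;
import javax.xml.stream.events.XMLEvent;

// Hypothetical illustration: a two-thread StAX event pipeline over a queue.
final class EventPipe {

    static String copyThroughQueue(String xml) throws Exception {
        // Bounded queue so a fast producer cannot run arbitrarily far ahead.
        BlockingQueue<XMLEvent> queue = new ArrayBlockingQueue<>(64);

        // Producer end: parse the text and enqueue every event.
        Thread producer = new Thread(() -> {
            try {
                XMLEventReader in = XMLInputFactory.newInstance()
                        .createXMLEventReader(new StringReader(xml));
                while (in.hasNext()) {
                    queue.put(in.nextEvent());
                }
            } catch (Exception e) {
                throw new RuntimeException(e);
            }
        });
        producer.start();

        // Consumer end: dequeue events and write them until END_DOCUMENT.
        StringWriter out = new StringWriter();
        XMLEventWriter writer =
                XMLOutputFactory.newInstance().createXMLEventWriter(out);
        XMLEvent event;
        do {
            // The timeout keeps the consumer from hanging if the producer
            // dies on malformed input before END_DOCUMENT is queued.
            event = queue.poll(5, TimeUnit.SECONDS);
            if (event == null) {
                throw new IllegalStateException("no event from producer");
            }
            writer.add(event);
        } while (!event.isEndDocument());

        writer.flush();
        writer.close();
        producer.join();
        return out.toString();
    }
}
```

A real pipeline would hand the queue to an arbitrary downstream stage rather 
than writing straight back to text; the copy here is only to keep the sketch 
self-contained.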

Interestingly though, I have found that using StAX in a pipeline has 
*more overhead* than text serialization/deserialization.  Your case may 
differ, but it is something to consider.  After about a month of (pt) hard 
work to get this magic binary StAX pipeline working, imagining it would be 
something like 10x faster than text ... I was disheartened to discover it 
was about 20% *slower*.
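
For reference, the approach Michael describes in the quoted message below 
(presenting a List<XMLEvent> via an XMLEventReader and wrapping it in a 
StAXSource) can be sketched roughly as follows with the plain JDK. The class 
name ListEventReader and the helper asSource are made up for illustration; 
this is an assumption about how one might do it, not code from xmlsh or 
Saxon. It also assumes the list holds a complete event sequence that begins 
with a START_DOCUMENT or START_ELEMENT event, which is what the StAXSource 
constructor requires.

```java
import java.util.List;
import java.util.NoSuchElementException;
import javax.xml.stream.XMLEventReader;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.events.XMLEvent;
import javax.xml.transform.stax.StAXSource;

// Hypothetical illustration: replay an in-memory list of StAX events.
final class ListEventReader implements XMLEventReader {
    private final List<XMLEvent> events;
    private int pos = 0;

    ListEventReader(List<XMLEvent> events) {
        this.events = events;
    }

    // Wrap the list so it can be passed to any API that accepts a Source.
    static StAXSource asSource(List<XMLEvent> events) throws XMLStreamException {
        return new StAXSource(new ListEventReader(events));
    }

    @Override public boolean hasNext() {
        return pos < events.size();
    }

    @Override public XMLEvent nextEvent() {
        if (!hasNext()) {
            throw new NoSuchElementException();
        }
        return events.get(pos++);
    }

    @Override public Object next() {
        return nextEvent();
    }

    // Look at the next event without consuming it; null at end of list.
    @Override public XMLEvent peek() {
        return hasNext() ? events.get(pos) : null;
    }

    // Called after a START_ELEMENT: gather text up to the matching END_ELEMENT.
    @Override public String getElementText() throws XMLStreamException {
        StringBuilder text = new StringBuilder();
        while (hasNext()) {
            XMLEvent e = nextEvent();
            if (e.isEndElement()) {
                return text.toString();
            }
            if (e.isStartElement()) {
                throw new XMLStreamException("element content is not text-only");
            }
            if (e.isCharacters()) {
                text.append(e.asCharacters().getData());
            }
        }
        throw new XMLStreamException("unexpected end of event list");
    }

    // Skip whitespace, comments and PIs until the next start or end tag.
    @Override public XMLEvent nextTag() throws XMLStreamException {
        while (hasNext()) {
            XMLEvent e = nextEvent();
            if (e.isStartElement() || e.isEndElement()) {
                return e;
            }
            if (e.isCharacters() && !e.asCharacters().isWhiteSpace()) {
                throw new XMLStreamException("non-whitespace text before tag");
            }
        }
        throw new XMLStreamException("unexpected end of event list");
    }

    @Override public Object getProperty(String name) {
        throw new IllegalArgumentException("unsupported property: " + name);
    }

    @Override public void close() {
        // Nothing to release: the events live in memory.
    }
}
```

The resulting StAXSource is an ordinary javax.xml.transform.Source, so in 
principle it can go wherever a Source is accepted; whether a given processor 
handles event-based StAXSource input well is something to check against that 
processor's documentation.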



David A. Lee
dlee@calldei.com
http://www.xmlsh.org


On 9/28/2010 1:46 PM, Michael Kay wrote:
>
>  On 28/09/2010 6:24 PM, David wrote:
>>  My guess would be "XMLEvent" is referring to StAX Events.
>>
>> http://woodstox.codehaus.org/javadoc/stax-api/1.0/javax/xml/stream/events/XMLEvent.html 
>>
>
> Ah yes, you're probably right. I forgot that's what they were called...
>
> If that's the case it looks fairly easy to present a List<XMLEvent> 
> via an XMLEventReader, which can be wrapped in a StaxSource and 
> supplied to any Saxon interface that expects a Source, for example a 
> DocumentBuilder.
>
> Michael Kay
> Saxonica
>
>>
>> which is a parsed XML event (startDocument, startElement  , 
>> characters ... )
>>
>>
>> David A. Lee
>> dlee@calldei.com
>> http://www.xmlsh.org
>>
>>
>> On 9/28/2010 1:17 PM, Michael Kay wrote:
>>>
>>>  On 28/09/2010 4:13 PM, Johannes.Lichtenberger wrote:
>>>> On 09/28/2010 04:33 PM, Michael Kay wrote:
>>>>> Sounds fascinating, and I wish I had time to get involved. It would
>>>>> certainly be elegant if you could have both the productivity of 
>>>>> writing
>>>>> this declaratively in XSLT and the performance of running it on 
>>>>> Hadoop
>>>>> MapReduce. Intrinsically, the two seem to fit together hand in glove,
>>>>> but I suspect some engineering effort is needed to make it work.
>>>> Hello Michael,
>>>>
>>>> I think it would be too complicated to achieve the desired grouping 
>>>> with
>>>> Java. Do you think it makes sense to first serialize the results and
>>>> then use Saxon's XSLT 2.0 processor to achieve the results? Or do you
>>>> have any direct input from a List of XMLEvents to Saxon's XSLT
>>>> processor? I assume it reads XML-data from an InputSource or some kind
>>>> of a stream.
>>>
>>> I'm not sure whether "XMLEvent" is something I'm expected to know 
>>> about: you said earlier "
>>>
>>> I've got an Iterator with Lists (Java) out of XMLEvents, which are
>>> serialized fragments
>>>
>>> so I assume they are just strings containing unparsed XML. That's 
>>> not going to be a particularly efficient representation for 
>>> processing, so the first step will be to parse each one to a tree 
>>> (for example, a Saxon TinyTree).
>>>
>>> You then said,
>>>
>>> I want to find and combine Lists which have the same page id and the same
>>> revision timestamp
>>>
>>> but you left out the critical information as to whether this would 
>>> always combine elements
>>> that were adjacent in the list. If the groups are adjacent then you 
>>> could potentially devise
>>> a strategy that avoids holding all the trees in memory at the same time.
>>>
>>> Supplying a sequence of trees as input to Saxon grouping is not a 
>>> problem. Using the s9api interface,
>>> you can use a DocumentBuilder to build each tree as an XdmNode, then 
>>> a sequence can be constructed using
>>> the constructor public XdmValue(Iterable<XdmItem> items), and then 
>>> this XdmValue can be passed as a parameter
>>> to an XsltTransformer, and a reference to the parameter can be used 
>>> in <xsl:for-each-group select="$param">.
>>> Using this approach the whole structure will be held in memory, but 
>>> there are ways of avoiding that by going
>>> to lower-level interfaces.
>>>
>>> Michael Kay
>>> Saxonica
>>>
>>>
>>>> It's a special case, where two or more revisions of one article are 
>>>> made
>>>> at the same time (in the same second). I would have to look through 
>>>> the
>>>> XML file with BaseX or Saxon, but I'm pretty sure such cases exist
>>>> somewhere in the huge file (as of now I've only extracted a small 
>>>> subset
>>>> of articles and replaced WikiText inside text-elements with XML).
>>>>
>>>> The whole task is to sort the revisions to shred them into our XML
>>>> datastorage system (the deltas of the revisions), which has the
>>>> capability to store and retrieve revisions compactly and 
>>>> efficiently. In
>>>> parallel I'm currently writing the import of a sorted XML file.
>>>>
>>>> My main task (master project and thesis) is or will be the 
>>>> visualization
>>>> of temporal tree structured data to gain further insights into the
>>>> evolution of the data, which are otherwise very difficult to realize.
>>>>
>>>> regards,
>>>> Johannes
>>>>
>>>
>>>
>>> _______________________________________________________________________
>>>
>>> XML-DEV is a publicly archived, unmoderated list hosted by OASIS
>>> to support XML implementation and development. To minimize
>>> spam in the archives, you must subscribe before posting.
>>>
>>> [Un]Subscribe/change address: http://www.oasis-open.org/mlmanage/
>>> Or unsubscribe: xml-dev-unsubscribe@lists.xml.org
>>> subscribe: xml-dev-subscribe@lists.xml.org
>>> List archive: http://lists.xml.org/archives/xml-dev/
>>> List Guidelines: http://www.oasis-open.org/maillists/guidelines.php
>>
>>
>>
>
>

