
Re: MarkMail: now archiving xml-dev

  • From: "Edward C. Zimmermann" <edz@b...>
  • To: Jason Hunter <jhunter@a...>
  • Date: Wed, 28 Nov 2007 12:46:20 +0100

Quoting Jason Hunter <jhunter@a...>:

> Edward C. Zimmermann wrote:
> > Quoting Elliotte Rusty Harold <elharo@m...>:
> > 
> >> Jason Hunter wrote:
> >>
> >>> What if they start consuming 
> >>> disk or thrashing the disk IO?  When you query against hundreds of gigs 
> >>> of content, you don't have to be malicious to mess things up.
> > 
> > It's not 100s of GB. Mailing lists are not that large.
> 
> Apache's messages in raw mbox format weigh in just shy of 60 Gigs. 

If you say so, although I'm really quite amused that there could
be 60 GB of text in their lists.

> Converting mbox emails to enriched XML involves an expansion.

When I index mail I don't bother. Why parse and tag mail, only to parse it
again as XML, when one can parse it directly into the "internal" structures
that one is using anyway? Parsing directly also makes a lot of sense given that
mail contains overlapping context structures such as lines and sentences, and
given that one wants to see the mail as written, since physical position is
sometimes used to convey meaning, as if by e.e. cummings.
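To illustrate what I mean by overlapping structures, here is a minimal Python
sketch. It is not my indexer, just an assumption-laden toy: the sample message,
the spans() helper and the naive sentence regex are all made up for the
example. It records lines and sentences as two independent sets of character
offsets over the untouched body, so a sentence may cross a line boundary,
which a single tree of XML elements cannot express directly.

```python
import re
from email import message_from_string

# Hypothetical sample message, only for illustration.
RAW = """\
From: a@example.org
Subject: test

First sentence spans
two lines. Second one.
"""

def spans(body):
    """Return (lines, sentences) as lists of (start, end) character offsets."""
    lines = []
    pos = 0
    for line in body.splitlines(keepends=True):
        text = line.rstrip("\r\n")
        lines.append((pos, pos + len(text)))  # offsets exclude the newline
        pos += len(line)
    # Naive sentence splitter (an assumption): text up to the next . ! or ?
    sentences = [(m.start(), m.end())
                 for m in re.finditer(r"[^.!?\s][^.!?]*[.!?]", body)]
    return lines, sentences

msg = message_from_string(RAW)
body = msg.get_payload()  # the body is kept exactly as written
lines, sentences = spans(body)

# Sentences that do not fit inside any single line, i.e. that overlap lines.
crossing = [s for s in sentences
            if not any(l0 <= s[0] and s[1] <= l1 for l0, l1 in lines)]
```

Since both span lists index into the same unmodified text, the original layout
stays available for display and neither structure has to nest inside the other.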

There is, of course, the context of one message within the larger context,
but that too is more complex. One thread may be part of another thread,
and bits split off, going partially or completely off-topic before becoming
part of a topic again alongside some other grand-siblings. Part of IR should be
to distinguish between the declared parts of threads (declared with Message-ID
and References, or even subject content) and information threads. Even declared
threads overlap.
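The declared side of this is the easy part. As a rough sketch, assuming the
standard Message-ID, References and In-Reply-To headers, the following groups
messages under the root they declare. The function names are hypothetical, and
it says nothing about information threads, which need the content itself.

```python
from collections import defaultdict
from email import message_from_string

def declared_parent(msg):
    """Return the declared parent id: last References entry, else In-Reply-To."""
    refs = (msg.get("References") or "").split()
    if refs:
        return refs[-1]
    irt = (msg.get("In-Reply-To") or "").split()
    return irt[0] if irt else None

def build_threads(raw_messages):
    """Group Message-IDs under the root of their declared reply chain."""
    parent = {}
    for raw in raw_messages:
        msg = message_from_string(raw)
        mid = (msg.get("Message-ID") or "").strip()
        if mid:
            parent[mid] = declared_parent(msg)

    def root(mid):
        seen = set()  # guards against looping forever on cyclic headers
        while parent.get(mid) and mid not in seen:
            seen.add(mid)
            mid = parent[mid]
        return mid

    threads = defaultdict(list)
    for mid in parent:
        threads[root(mid)].append(mid)
    return dict(threads)

# Hypothetical messages, only for illustration.
messages = [
    "Message-ID: <1@x>\nSubject: a\n\nroot",
    "Message-ID: <2@x>\nReferences: <1@x>\n\nreply",
    "Message-ID: <3@x>\nReferences: <1@x> <2@x>\n\nreply to reply",
    "Message-ID: <4@x>\nSubject: b\n\nunrelated",
]
threads = build_threads(messages)
```

A tree like this gives each message exactly one parent, which is precisely why
it cannot capture a message that belongs to several discussions at once.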


> 
> So, in fact, it's 100+ Gigs of XML content.

Do you index it in one big lump or is it segmented?

> 
> -jh-
> 


-- 
  E. Zimmermann, BSn/Munich R&D Unit
  Leopoldstrasse 53-55, D-80802 Munich,
  Federal Republic of Germany
  http://www.nonmonotonic.net

