
Re: MarkMail: now archiving xml-dev

  • From: Jason Hunter <jhunter@a...>
  • To: "Edward C. Zimmermann" <edz@b...>
  • Date: Wed, 28 Nov 2007 11:35:07 -0800

Edward C. Zimmermann wrote:
> Quoting Jason Hunter <jhunter@a...>:
> 
>> Edward C. Zimmermann wrote:
>>> Quoting Elliotte Rusty Harold <elharo@m...>:
>>>
>>>> Jason Hunter wrote:
>>>>
>>>>> What if they start consuming 
>>>>> disk or thrashing the disk IO?  When you query against hundreds of gigs 
>>>>> of content, you don't have to be malicious to mess things up.
>>> It's not 100s of GB. Mailing lists are not that large.
>> Apache's messages in raw mbox format weigh in just shy of 60 Gigs. 
> 
> If you say so--- although I'm really quite amused that there could
> be 60 GB of text in their lists..

If you divide 60 Gigs by 4,000,000 emails that's 15k per email.  That's 
bigger than I would have guessed an average email to be, but you have to 
take into account the full headers and the influence of the (relatively 
few) binary attachments.
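The division above can be checked with a quick sketch. It assumes "Gigs" means decimal gigabytes and takes the 60 and 4,000,000 figures exactly as quoted; the variable names are only illustrative.

```python
# Back-of-the-envelope check of the average message size quoted above.
total_bytes = 60 * 10**9      # "just shy of 60 Gigs", assumed to be decimal gigabytes
message_count = 4_000_000     # email count used in the message

average_bytes = total_bytes / message_count
print(f"{average_bytes / 1000:.0f} kB per email")  # prints "15 kB per email"
```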

>> Converting mbox emails to enriched XML involves an expansion.
> 
> When I index mail I don't bother.

Well, we probably have different goals and infrastructure technologies. 
  I want to have access to the hierarchical internal structure of each 
email body, and to help me accomplish that I have a tool that thinks in 
XML so it's a natural representation.

Of course with MarkLogic you don't store XML files on disk, any more 
than Oracle stores CSV files on disk.  XML is just the data model used 
to represent the content.

> Why parse and tag mail to then parse it as
> XML when one can parse it directly (which makes also a lot of sense given the
> observation that mail contains overlapping context structures such as lines
> and sentences) into the "internal" structures that one is using anyway
> (especially given that one wants to see the mail as given, noting the
> use of physical position to convey meaning as-if ee.cummings)? 

If I only fetched mail by id, then I could parse it on the fly for 
rendering.  But if I'm to use the structure in the query, it needs to 
exist in the database in its enriched format.
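As a rough illustration of why the structure has to exist before query time, here is a minimal sketch using Python's standard xml.etree.ElementTree. The element names (message, quote, para), the enrich and ids_where helpers, and the in-memory list standing in for a database are all hypothetical; this is not MarkMail's actual schema or MarkLogic's API.

```python
import xml.etree.ElementTree as ET

def enrich(msg_id, body):
    """Turn a plain-text body into a small XML tree (hypothetical element names)."""
    message = ET.Element("message", id=msg_id)
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            # Count the leading '>' characters to record the quoting depth.
            depth = len(stripped) - len(stripped.lstrip(">"))
            el = ET.SubElement(message, "quote", depth=str(depth))
            el.text = stripped.lstrip(">").strip()
        elif stripped:
            el = ET.SubElement(message, "para")
            el.text = stripped
    return message

def ids_where(store, tag, word):
    """Return ids of messages with a <tag> child whose text contains word."""
    return [m.get("id") for m in store
            if any(word in (el.text or "") for el in m.findall(tag))]

# A plain list stands in for the database of enriched messages.
store = [
    enrich("m1", "> Do you index it in one big lump?\nIt operates like a database."),
    enrich("m2", "No quoting here."),
]

print(ids_where(store, "quote", "index"))  # matches only inside quoted text
print(ids_where(store, "para", "index"))   # matches only inside new text
```

A query that separates quoted text from new text only works here because the enrichment happened when the message was stored, not when it was rendered.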

>> So, in fact, it's 100+ Gigs of XML content.
> 
> Do you index it in one big lump or is it segmented?

It operates in many ways like a database.  Every new email that arrives 
is incorporated into the index immediately.  The index is able to do 
that while also keeping performance up, using an index merging model.
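One generic way to picture an index merging model is sketched below. This is an assumption-laden toy, not a description of MarkLogic's internals: the SegmentedIndex class, the max_segments threshold, and the whitespace tokenizer are all invented for illustration. Each new message becomes its own small, immediately searchable segment, and segments are merged once there are too many of them.

```python
from collections import defaultdict

class SegmentedIndex:
    """Toy inverted index: one small segment per new document, merged periodically."""

    def __init__(self, max_segments=4):
        self.segments = []            # each segment maps term -> set of doc ids
        self.max_segments = max_segments

    def add(self, doc_id, text):
        # Index the new document as its own segment so it is searchable at once.
        segment = defaultdict(set)
        for term in text.lower().split():
            segment[term].add(doc_id)
        self.segments.append(dict(segment))
        if len(self.segments) > self.max_segments:
            self._merge()

    def _merge(self):
        # Combine all segments into one so searches touch fewer structures.
        merged = defaultdict(set)
        for segment in self.segments:
            for term, ids in segment.items():
                merged[term] |= ids
        self.segments = [dict(merged)]

    def search(self, term):
        # A search consults every live segment and unions the hits.
        hits = set()
        for segment in self.segments:
            hits |= segment.get(term.lower(), set())
        return hits

idx = SegmentedIndex(max_segments=2)
idx.add("m1", "XML is the data model")
idx.add("m2", "mbox to XML conversion")
idx.add("m3", "index merging model")   # third add exceeds the limit and triggers a merge
print(sorted(idx.search("xml")))
```

In this toy, writes stay cheap because they never rewrite the large merged segment, and reads stay bounded because merging caps the number of segments a search must visit.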

-jh-

