Re: MarkMail: now archiving xml-dev
Edward C. Zimmermann wrote:
> Quoting Jason Hunter <jhunter@a...>:
>
>> Edward C. Zimmermann wrote:
>>> Quoting Elliotte Rusty Harold <elharo@m...>:
>>>
>>>> Jason Hunter wrote:
>>>>
>>>>> What if they start consuming
>>>>> disk or thrashing the disk IO? When you query against hundreds of gigs
>>>>> of content, you don't have to be malicious to mess things up.
>>> Its not 100s of GB. Mailing lists are not that large.
>> Apache's messages in raw mbox format weigh in just shy of 60 Gigs.
>
> If you say so--- although I'm really quite amused that the there could
> be 60 GB of text in their lists..

If you divide 60 Gigs by 4,000,000 emails that's 15k per email. That's bigger than I would have guessed an average email to be, but you have to take into account the full headers and the influence of the (relatively few) binary attachments.

>> Converting mbox emails to enriched XML involves an expansion.
>
> When I index mail I don't bother.

Well, we probably have different goals and infrastructure technologies. I want to have access to the hierarchical internal structure of each email body, and to help me accomplish that I have a tool that thinks in XML so it's a natural representation. Of course with MarkLogic you don't store XML files on disk, any more than Oracle stores CSV files on disk. XML is just the representation data model.

> Why parse and tag mail to then parse it as
> XML when one can parse it directly (which makes also a lot of sense given the
> observation that mail contains overlapping context structures such as lines
> and sentences) into the "internal" structures that one is using anyway
> (especially given that one wants to see the mail as given, noting the
> use of physical position to convey meaning as-if ee.cummings)?

If you only fetched mail by id, then I could parse it on the fly for rendering. But if I'm to use the structure in the query, it needs to exist in the database in its enriched format.

>> So, in fact, it's 100+ Gigs of XML content.
>
> Do you index it in one big lump or is it segmented?

It operates in many ways like a database. Every new email that arrives is incorporated into the index immediately. The index is able to do that while also keeping performance up, using an index merging model.

-jh-
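
For readers unfamiliar with the idea, the "index merging model" mentioned above can be sketched roughly as follows. This is only a toy illustration of the general technique (small immutable index segments that are merged in the background), not MarkLogic's actual implementation. The names `Segment`, `MergingIndex`, `add` and `search`, and the "merge when the newest segment is as large as its neighbour" policy, are all assumptions made up for this example.

```python
from collections import defaultdict


class Segment:
    """An immutable inverted index over a batch of documents (hypothetical)."""

    def __init__(self, postings, doc_count):
        # postings: dict mapping term -> sorted list of document ids
        self.postings = postings
        self.doc_count = doc_count

    def lookup(self, term):
        return self.postings.get(term, [])


def build_segment(doc_id, text):
    """Index one document into its own tiny segment."""
    terms = set(text.lower().split())
    return Segment({term: [doc_id] for term in terms}, 1)


def merge_segments(a, b):
    """Combine two segments into one larger segment."""
    merged = defaultdict(set)
    for seg in (a, b):
        for term, ids in seg.postings.items():
            merged[term].update(ids)
    postings = {term: sorted(ids) for term, ids in merged.items()}
    return Segment(postings, a.doc_count + b.doc_count)


class MergingIndex:
    """New documents are searchable at once; segments are merged as they grow."""

    def __init__(self):
        self.segments = []

    def add(self, doc_id, text):
        # The new document becomes its own segment, so it is visible immediately.
        self.segments.append(build_segment(doc_id, text))
        # Assumed merge policy: fold the newest segment into its neighbour
        # whenever it has grown at least as large, keeping the segment count low.
        while (len(self.segments) >= 2
               and self.segments[-1].doc_count >= self.segments[-2].doc_count):
            newest = self.segments.pop()
            older = self.segments.pop()
            self.segments.append(merge_segments(older, newest))

    def search(self, term):
        # A query consults every live segment and unions the results.
        hits = set()
        for seg in self.segments:
            hits.update(seg.lookup(term.lower()))
        return sorted(hits)


index = MergingIndex()
index.add(1, "xml index merging")
index.add(2, "mbox to xml")
index.add(3, "mail archive")
print(index.search("xml"))
print(len(index.segments))
```

The point of the design is that an insert only ever writes a small new segment, while queries stay fast because merging keeps the number of segments they must consult small. A real system would presumably merge on disk and in the background rather than inline in `add`.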