Quoting Jason Hunter <jhunter@a...>:

> If you divide 60 Gigs by 4,000,000 emails that's 15k per email. That's
> bigger than I would have guessed an average email to be, but you have to
> take into account the full headers and the influence of the (relatively
> few) binary attachments.

Even with "full headers" I think a 15k average message size (excluding
attachments) is suspect. A chunk of the email headers could -- if one is
bothering to clean things up -- be excluded, since they describe the path
of the email's transmission rather than its content. In a service it's
not really of interest to anyone how the mail arrived and got bounced
around in one's own network -- and often we don't even want to publish
such information.

> > >> Converting mbox emails to enriched XML involves an expansion.
> >
> > When I index mail I don't bother.
>
> Well, we probably have different goals and infrastructure technologies.
> I want to have access to the hierarchical internal structure of each
> email body, and to help me accomplish that I have a tool that thinks in
> XML so it's a natural representation.

Who says that one can't have access to the "hierarchical internal
structure of each email body"? When I parse emails I identify on the fly
(and model as internal structure) the header meta-data -- including
parsing the special types such as dates, arrival times, content length,
priority, etc. -- and the typical body structure: lines, sentences,
paragraphs (and pages). Mailing lists are a bit more hierarchical --
especially digest formats -- with sub-messages that in turn carry their
own meta-data bits and their own lines, sentences, and paragraphs.
Through the same process by which one would auto-tag a mail folder to
create a glob of XML, one could just as well go directly to the internal
data representation and save a parse-puke-parse round trip (aside from
the observation that some of the structures in mail overlap and have
other characteristics that demand much "arm twisting" in XML).
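As a purely illustrative sketch of this direct-parse approach -- assuming
Python and its stdlib `email` package; the function names and record
layout here are invented, not anyone's actual API -- one can go straight
from a raw message to an internal structure of typed header meta-data
plus paragraphs/sentences/lines, and emit XML from that structure only on
demand:

```python
# Illustrative only: parse a raw RFC 822 message straight into an
# internal structure (typed header meta-data plus paragraphs, sentences,
# lines), then emit XML from that structure on demand.
import re
import xml.etree.ElementTree as ET
from email import message_from_string
from email.utils import parsedate_to_datetime

RAW = """\
From: alice@example.org
Subject: test
Date: Mon, 05 Jan 2004 10:00:00 +0000

First paragraph, sentence one. Sentence two.

Second paragraph.
"""

def parse_message(raw):
    msg = message_from_string(raw)
    meta = {
        "from": msg["From"],
        "subject": msg["Subject"],
        # Special-typed headers are parsed up front, e.g. the date:
        "date": parsedate_to_datetime(msg["Date"]).isoformat(),
    }
    paragraphs = []
    for block in re.split(r"\n\s*\n", msg.get_payload().strip()):
        paragraphs.append({
            # Lines and sentences overlap; both live side by side here,
            # which is exactly what a single XML hierarchy struggles with.
            "lines": block.splitlines(),
            "sentences": re.split(r"(?<=[.!?])\s+", block.replace("\n", " ")),
        })
    return {"meta": meta, "paragraphs": paragraphs}

def to_xml(record):
    # "Puke" XML on the fly from the internal structure, choosing one
    # of the overlapping views (sentences) for this serialization.
    root = ET.Element("message")
    head = ET.SubElement(root, "head")
    for key, value in record["meta"].items():
        ET.SubElement(head, key).text = value
    body = ET.SubElement(root, "body")
    for para in record["paragraphs"]:
        p = ET.SubElement(body, "p")
        for sentence in para["sentences"]:
            ET.SubElement(p, "s").text = sentence
    return ET.tostring(root, encoding="unicode")

doc = parse_message(RAW)
print(len(doc["paragraphs"]))   # 2
print(to_xml(doc))
```

Note that the XML serialization is derived output, not the storage
format: any unit of retrieval (a sentence, a paragraph, a message) can be
serialized the same way.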
--- and since I have the structure I can, should I so desire, puke out
on-the-fly XML (and since we can select at search time whatever unit of
retrieval we desire, not just the message, I think we have even more to
gain). Mail is constantly in flux as new messages define new ....

> Of course with MarkLogic you don't store XML files on disk, any more
> than Oracle stores CSV files on disk. XML is just the representation
> data model.

My philosophy is to try to tackle whatever representation model is
thrown at me. Mail is a model. This way I can throw XML, mail, and all
kinds of other inputs into a big heap, search them (exploiting their
structure), retrieve bits (exploiting their structure for the unit of
retrieval) and, should I desire, convert on the fly into other
representations. With a semantic crosswalk one can do some really,
really wacky things :-)

> > Why parse and tag mail to then parse it as XML when one can parse it
> > directly (which also makes a lot of sense given the observation that
> > mail contains overlapping context structures such as lines and
> > sentences) into the "internal" structures that one is using anyway
> > (especially given that one wants to see the mail as given, noting the
> > use of physical position to convey meaning as-if e.e. cummings)?
>
> If you only fetched mail by id, then I could parse it on the fly for
> rendering. But if I'm to use the structure in the query, it needs to
> exist in the database in its enriched format.

Absolutely not. I do have records (an email message, for example) but
I'm not bound to them as the unit of retrieval. I can fetch mail by
context. One might want to fetch mail by id, as that's a legitimate
activity, but one might equally be interested in a single message within
a digest as the "relevant" bit of information for a query -- or, for
that matter, the relevant "bit" might be a whole thread or a locus of
messages.

> > >> So, in fact, it's 100+ Gigs of XML content.
> >
> > Do you index it in one big lump or is it segmented?
>
> It operates in many ways like a database. Every new email that arrives
> is incorporated into the index immediately. The index model is able
> to do that while also keeping performance up, using an index merging
> model.

Sure. Mailing lists are easy since it's add/merge (we throw things into
a queue so as not to start an index pass for each and every mail that
arrives, keeping system impact to a minimum without loss of effective
functionality). We're even doing the same with RSS/Atom/CAP feeds, and
keeping things synchronized there is a bit wackier since it's
delete/add/merge plus garbage collection. We're indexing about 600
active news feeds, and many update or change their stories at very
frequent rates (we also keep track of these changes, since they can be
interesting over time).

> -jh-

--
E. Zimmermann, BSn/Munich R&D Unit
Leopoldstrasse 53-55, D-80802 Munich, Federal Republic of Germany
http://www.nonmonotonic.net
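The add/merge queueing idea above -- batching arrivals so that a burst of
mail costs one index merge rather than one index pass per message -- can
be sketched minimally as follows. The class, its flush policy, and the
toy in-memory index are invented for illustration; they are not any
system's actual implementation.

```python
# Minimal sketch of queued add/merge indexing: arriving messages are
# queued and folded into the index in batches, so N near-simultaneous
# arrivals trigger one merge, not N.  Batch size and index layout are
# illustrative only.
from collections import deque

class BatchIndexer:
    def __init__(self, batch_size=3):
        self.queue = deque()
        self.batch_size = batch_size
        self.index = {}          # term -> set of message ids
        self.merges = 0          # how many merge passes we paid for

    def enqueue(self, msg_id, text):
        self.queue.append((msg_id, text))
        if len(self.queue) >= self.batch_size:
            self.flush()

    def flush(self):
        if not self.queue:
            return
        # One merge pass for the whole batch, not one per message.
        while self.queue:
            msg_id, text = self.queue.popleft()
            for term in text.lower().split():
                self.index.setdefault(term, set()).add(msg_id)
        self.merges += 1

idx = BatchIndexer(batch_size=3)
for i, text in enumerate(["hello list", "hello again", "new thread"]):
    idx.enqueue(i, text)
print(idx.merges)                   # 1
print(sorted(idx.index["hello"]))   # [0, 1]
```

A real system would also flush on a timer so a quiet period does not
leave mail unindexed, and the delete/add/merge case for changing feeds
would additionally remove a document's old postings before re-adding.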