[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: [OT] Looking for a text algorithm
David Megginson wrote, > The problem I was considering (in the shower) was detecting spam > messages with minor variations, such as the insertion of the > recipient's e-mail address in the body or the substitution of > Zimbabwe for Nigeria. Assume that I have a database containing many > millions of known spam messages, and that I want to check an incoming > e-mail message against it. If I can narrow the field down to, say, > 50 candidates after a very inexpensive operation, then my system will > be much more efficient; I can then use edit-distance against the > closest matches to see if the message really is likely spam. OK, that makes more sense ... but it's a different problem from the one suggested by your original question. Rather than generic closeness you're after a witness to closeness relative to a given corpus. > That said, based on private e-mail from another list member, I suspect > that there may be nothing original about the algorithm I came up with; > nevertheless, here it is for anyone who would care to take a peek: > > http://www.megginson.com/private/megginson-index-00.zip Interesting. Ignoring the thresholding, I think your pre-test is roughly equivalent to edit distance over the strings considered as unordered sets of characters. A small string edit distance will imply a small unordered character set edit distance, so I guess the latter could be a useful witness to the former. Cheers, Miles
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|