|
[XML-DEV Mailing List Archive Home] [By Thread] [By Date] [Recent Entries] [Reply To This Message] Re: [OT] Looking for a text algorithm
Miles Sabin writes: > Judging from your examples it looks like you're after a closeness > criterion derived from longest common subsequences. But I don't see how > you could use that to usefully construct a single characteristic number > for _any_ string of _any_ length: with only 32 or 64 bits to play with, > many many completely unrelated (on any criterion) strings will collide > on the same code. Quite right. However, if the alternative is a linear search (say, using an edit-distance algorithm), then reducing the number of candidates by a few orders of magnitude would not necessarily be a bad thing. The problem I was considering (in the shower) was detecting spam messages with minor variations, such as the insertion of the recipient's e-mail address in the body or the substitution of Zimbabwe for Nigeria. Assume that I have a database containing many millions of known spam messages, and that I want to check an incoming e-mail message against it. If I can narrow the field down to, say, 50 candidates after a very inexpensive operation, then my system will be much more efficient; I can then use edit-distance against the closest matches to see if the message really is likely spam. That said, based on private e-mail from another list member, I suspect that there may be nothing original about the algorithm I came up with; nevertheless, here it is for anyone who would care to take a peek: http://www.megginson.com/private/megginson-index-00.zip All the best, David -- David Megginson, david@m..., http://www.megginson.com/
|
PURCHASE STYLUS STUDIO ONLINE TODAY!Purchasing Stylus Studio from our online shop is Easy, Secure and Value Priced! Download The World's Best XML IDE!Accelerate XML development with our award-winning XML IDE - Download a free trial today! Subscribe in XML format
|
|||||||||

Cart








