Re: [OT] Looking for a text algorithm
Don't know about the numeric approach to your problem, but it sounds a lot like matching DNA strings while accounting for frameshift errors. Dan Gusfield's book, Algorithms on Strings, Trees, and Sequences devotes a lot of space to that and similar problems.

- Mitch

David Megginson wrote:
> I'm looking for references to a specific kind of text algorithm -- the
> algorithm should generate a number (say, 32 or 64 bits) for any text
> string of any length, similar to a hash. However, it should be
> possible to compare the numbers for different strings to tell how
> close they are to each other. For example, the numbers for
>
> 1. To be or not to be.
>
> 2. Two bees or not two bees.
>
> 3. I don't know whether to be or not to be.
>
> should indicate that three strings are relatively close to each other
> (while a hash number would give no indication at all).
>
> I'm asking only out of interest, because I came up with a simple
> algorithm to do this while I was in the shower yesterday, and it would
> be fun to see how close it is to what the pros use for spam detection
> and so on.
>
> Note that I'm not looking for algorithms based on edit-distance,
> bag-of-words, and so on.
>
>
> Thanks in advance,
>
>
> David
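
For what it's worth, one family of techniques that seems to match the description in the quoted message (a fixed-size number whose bits can be compared to estimate closeness) is locality-sensitive hashing. The sketch below is a SimHash-style fingerprint over character trigrams. It is not the algorithm either poster had in mind, and the names `shingles`, `simhash` and `hamming`, the trigram size, and the use of MD5 as the per-feature hash are all assumptions made for illustration.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption, the question mentions 32 or 64


def shingles(text, n=3):
    """Return the set of character n-grams after lowercasing and
    collapsing runs of whitespace (hypothetical helper)."""
    cleaned = " ".join(text.lower().split())
    if len(cleaned) < n:
        return {cleaned}
    return {cleaned[i:i + n] for i in range(len(cleaned) - n + 1)}


def simhash(text, n=3):
    """Compute a 64-bit SimHash-style fingerprint of text."""
    counts = [0] * BITS
    for gram in shingles(text, n):
        # Take the first 8 bytes of MD5 as a 64-bit hash of the feature.
        digest = hashlib.md5(gram.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        # Each feature votes +1 or -1 on every bit position.
        for b in range(BITS):
            counts[b] += 1 if (h >> b) & 1 else -1
    # A bit is set in the fingerprint when its votes are positive.
    value = 0
    for b in range(BITS):
        if counts[b] > 0:
            value |= 1 << b
    return value


def hamming(a, b):
    """Number of differing bits; smaller suggests more similar texts."""
    return bin(a ^ b).count("1")


if __name__ == "__main__":
    texts = [
        "To be or not to be.",
        "Two bees or not two bees.",
        "I don't know whether to be or not to be.",
    ]
    prints = [simhash(t) for t in texts]
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            print(i + 1, j + 1, hamming(prints[i], prints[j]))
```

Two fingerprints are compared by counting differing bits, so texts that share many trigrams tend to end up a small Hamming distance apart, while unrelated texts tend to differ in roughly half the bits. On strings as short as the three examples the estimate is noisy, so treat the distances as a rough signal only.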