RE: [OT] Looking for a text algorithm

Cart

XML Editor - Download a Free Trial >

See What's New >

Buy Now >

[Home] [By Thread] [By Date] [Recent Entries]

To: <xml-dev@l...>
Subject: RE: [OT] Looking for a text algorithm
From: "Don Park" <donpark@d...>
Date: Sun, 9 Mar 2003 14:52:45 -0800
Importance: Normal
In-reply-to: <15979.19755.44779.643100@m...>
Organization: Docuverse

This is an interesting problem.  One obvious yet pretty much useless
algorithm is the length() function and its variations: word count, unique
word count, etc.  It works similar to the genome size: measure of
information entropy potential.  All the song lyrics will be pretty close to
each other and so will novels.  While useless, it can handle all the
languages. <g>

Don Park
Docuverse

> -----Original Message-----
> From: David Megginson [mailto:david@m...]
> Sent: Sunday, March 09, 2003 6:18 AM
> To: xml-dev@l...
> Subject: Re:  [OT] Looking for a text algorithm
> 
> Miles Sabin writes:
> 
>  > Judging from your examples it looks like you're after a closeness
>  > criterion derived from longest common subsequences. But I don't see how
>  > you could use that to usefully construct a single characteristic number
>  > for _any_ string of _any_ length: with only 32 or 64 bits to play with,
>  > many many completely unrelated (on any criterion) strings will collide
>  > on the same code.
> 
> Quite right.  However, if the alternative is a linear search (say,
> using an edit-distance algorithm), then reducing the number of
> candidates by a few orders of magnitude would not necessarily be a bad
> thing.
> 
> The problem I was considering (in the shower) was detecting spam
> messages with minor variations, such as the insertion of the
> recipient's e-mail address in the body or the substitution of Zimbabwe
> for Nigeria.  Assume that I have a database containing many millions
> of known spam messages, and that I want to check an incoming e-mail
> message against it.  If I can narrow the field down to, say, 50
> candidates after a very inexpensive operation, then my system will be
> much more efficient; I can then use edit-distance against the closest
> matches to see if the message really is likely spam.
> 
> That said, based on private e-mail from another list member, I suspect
> that there may be nothing original about the algorithm I came up with;
> nevertheless, here it is for anyone who would care to take a peek:
> 
>   http://www.megginson.com/private/megginson-index-00.zip
> 
> 
> All the best,
> 
> 
> David
> 
> --
> David Megginson, david@m..., http://www.megginson.com/
> 
> -----------------------------------------------------------------
> The xml-dev list is sponsored by XML.org <http://www.xml.org>, an
> initiative of OASIS <http://www.oasis-open.org>
> 
> The list archives are at http://lists.xml.org/archives/xml-dev/
> 
> To subscribe or unsubscribe from this list use the subscription
> manager: <http://lists.xml.org/ob/adm.pl>

References:
- Re: [OT] Looking for a text algorithm
  - From: David Megginson <david@m...>

Prev by Date: RE: Schemas as Promises and Expectations
Next by Date: Re: [OT] Looking for a text algorithm
Previous by thread: Re: [OT] Looking for a text algorithm
Next by thread: Re: [OT] Looking for a text algorithm
Index(es):
- Date
- Thread

XML Editor - Download a 15 Day Free Trial Now >

See What's New in Stylus Studio >

Buy Stylus Studio - XML Editor - Now >