RE: Something altogether different?
Salton's approach makes it easy to know when things are similar. Then the human sorts out the noise. That is fine for things that run at human speed. That is not fine for things that run at machine speed and have quick system-wide effects. The power laws of crap feeding back to crap have not been suspended.

Take the vector measures and tie them together with URIs across multiple notations for the same observations and that is an interesting system for machine learning, as has been shown time and time again. They aren't as useful for targeting munitions; they can be useful for fusing multiple systems and giving a human a short list, or better, a space of solutions, and that is what we see from Google et al. The web works because human smarts take up the slack for computer dumb.

Google is fine until you try to dispatch an emergency system based on its addresses and maps. Three problems:

1. Locations can be off by half a mile or more.
2. Satellite photos are stale (by as much as 18 months) and vary in resolution across adjacent areas less than ten miles apart.
3. In the investigation that follows, one isn't allowed to mix unvetted data with vetted data (by policy, the name of the neighbor can't be entered without the neighbor having a defined role in the event, e.g., a witness).

Dumb things done with dumb data are fine until you need something smart and accurate fast.

Relaxing reliability to get deployment scale does work. Ask any driver of a T-34. Massed deployment always beats high potential assets in smaller numbers if you can sustain high initial casualty rates.

len

From: Ken North [mailto:kennorth@s...]

Len Bullard wrote:

2) Where one can establish a similarity metric, is that good enough, as Bosworth is claiming for human processes, for machine-processes? Bosworth is playing fast and loose with the noise problems.
Cohen and Fan discuss the noise issue in the paper about the CF spider, which uses a variant of the cosine distance measure of textual similarity (used in WHIRL):

"However, although the data is noisy, it seems reasonable to believe metrics based on it can be used for comparative purposes. We note also that CF systems which can learn from this sort of noisy "observational" data (e.g., [Liebermann, 1995; Perkowitz & Etzioni, 1997]) are potentially far more valuable than CF systems that require explicit noise-free ratings."

The solution to the semantic web might be millions of people creating Atom/RSS, but I'm more optimistic about applying machine learning with enough hardware. Google has already shown an array of processors can crunch the web's content. If you embark on creating Google++ using technologies such as WHIRL and the CF spider, you'll need a large array of hardware. But as Bosworth noted in the PowerPoint presentation, hardware is cheap.
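For readers who have not met the cosine measure mentioned above, here is a minimal sketch of cosine similarity over TF-IDF weighted term vectors, in the spirit of Salton's vector space model. It is not the WHIRL or CF spider implementation; the whitespace tokenizer, the particular weighting formula, and the names `tokenize`, `tfidf_vectors`, and `cosine` are all assumptions made for illustration.

```python
import math
from collections import Counter


def tokenize(text):
    # Assumption: naive lowercase/whitespace tokenization, no stemming.
    return text.lower().split()


def tfidf_vectors(docs):
    """Return one sparse {term: weight} vector per document."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    # Document frequency: in how many documents each term appears.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    vectors = []
    for toks in tokenized:
        tf = Counter(toks)
        # Assumption: log-scaled tf times a smoothed idf; other variants exist.
        vectors.append({
            t: (1.0 + math.log(c)) * math.log(1.0 + n / df[t])
            for t, c in tf.items()
        })
    return vectors


def cosine(u, v):
    """Cosine of the angle between two sparse vectors (0.0 if either is empty)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


docs = [
    "salton vector space model",
    "vector space similarity model",
    "emergency dispatch address maps",
]
vecs = tfidf_vectors(docs)
sim_related = cosine(vecs[0], vecs[1])    # shares three terms
sim_unrelated = cosine(vecs[0], vecs[2])  # shares no terms
```

A score like `sim_related` only says two texts use similar vocabulary. It carries no notion of whether either text is accurate, which is the noise problem the thread is arguing about.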