Re: Something altogether different?
Michael Champion wrote:

>> I dunno ... I can't say this vision appeals to me, but I can see the momentum for RSS and microformats converging to produce this kind of thing more easily than I can envision the Semantic Web

Resurrecting a topic from five years ago (http://lists.xml.org/archives/xml-dev/200004/msg00092.html): domain vocabularies and search engines such as WHIRL have a lot of potential for moving us to a Semantic Web.

Overview:
http://www.sqlsummit.com/SearchEngineResearch.htm

More technical:
http://citeseer.ist.psu.edu/cache/papers/cs/26982/http:zSzzSzwww.ai.mit.eduzSzpeoplezSzjimmylinzSzpaperszSzCohen00.pdf/cohen99whirl.pdf

WHIRL implemented a measure of textual similarity that permitted similarity searching. The measure used in WHIRL (1998) was also used for the collaborative filtering (CF) spider described in this paper:

Web-Collaborative Filtering: Recommending Music by Crawling The Web
http://www9.org/w9cdrom/266/266.html

"We show that it is possible to collect data that is useful for collaborative filtering (CF) using an autonomous Web spider. In CF, entities are recommended to a new user based on the stated preferences of other, similar users. We describe a CF spider that collects from the Web lists of semantically related entities. These lists can then be used by existing CF algorithms by encoding them as "pseudo-users". Importantly, the spider can collect useful data without pre-programmed knowledge about the format of particular pages or particular sites. Instead, the CF spider uses commercial Web-search engines to find pages likely to contain lists in the domain of interest, and then applies previously-proposed heuristics [Cohen, 1999] to extract lists from these pages. We show that data collected by this spider is nearly as effective for CF as data collected from real users, and more effective than data collected by two plausible hand-programmed spiders. In some cases, autonomously spidered data can also be combined with actual user data to improve performance."

Len Bullard wrote:

> 2) Where one can establish a similarity metric, is that good enough, as Bosworth is claiming for human processes, for machine-processes?

Bosworth is playing fast and loose with the noise problems. Cohen and Fan discuss the noise issue in the paper about the CF spider, which uses a variant of the cosine distance measure of textual similarity (used in WHIRL):

"However, although the data is noisy, it seems reasonable to believe metrics based on it can be used for comparative purposes. We note also that CF systems which can learn from this sort of noisy "observational" data (e.g., [Liebermann, 1995; Perkowitz & Etzioni, 1997]) are potentially far more valuable than CF systems that require explicit noise-free ratings."

The solution to the semantic web might be millions of people creating Atom/RSS, but I'm more optimistic about applying machine learning with enough hardware. Google has already shown that an array of processors can crunch the web's content. If you embark on creating Google++ using technologies such as WHIRL and the CF spider, you'll need a large array of hardware. But as Bosworth noted in the PowerPoint presentation, hardware is cheap.
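
To make the mechanics concrete, here is a rough sketch in Python of the two ideas above: a TF-IDF/cosine textual similarity measure of the kind WHIRL uses for similarity searching, and the encoding of spidered lists as "pseudo-users" that existing CF algorithms can consume. This is not code from either paper; the function names and the toy artist data are invented for illustration.

    # Sketch only: WHIRL-style TF-IDF cosine similarity, plus encoding
    # spidered entity lists as "pseudo-users" for collaborative filtering.
    # Names and data are illustrative assumptions, not from the papers.

    import math
    import re
    from collections import Counter, defaultdict

    def tokenize(text):
        """Lowercase a short text fragment and split it into word tokens."""
        return re.findall(r"[a-z0-9]+", text.lower())

    def tfidf_vectors(fragments):
        """Build length-normalized TF-IDF weight vectors for text fragments."""
        docs = [Counter(tokenize(f)) for f in fragments]
        df = Counter()
        for d in docs:
            df.update(d.keys())
        n = len(docs)
        vectors = []
        for d in docs:
            v = {t: (1 + math.log(tf)) * math.log(n / df[t]) for t, tf in d.items()}
            norm = math.sqrt(sum(w * w for w in v.values())) or 1.0
            vectors.append({t: w / norm for t, w in v.items()})
        return vectors

    def cosine(u, v):
        """Cosine similarity between two normalized sparse weight vectors."""
        if len(u) > len(v):
            u, v = v, u
        return sum(w * v.get(t, 0.0) for t, w in u.items())

    # A WHIRL-style "soft join": rank noisy name variants against a reference
    # list by textual similarity instead of requiring exact string matches.
    artists = ["Miles Davis", "John Coltrane", "Bill Evans Trio"]
    mentions = ["the miles davis quintet", "coltrane, john", "bill evans"]
    vecs = tfidf_vectors(artists + mentions)
    ref, queries = vecs[:len(artists)], vecs[len(artists):]
    for m, qv in zip(mentions, queries):
        best = max(range(len(artists)), key=lambda i: cosine(qv, ref[i]))
        print(f"{m!r} -> {artists[best]!r} (sim={cosine(qv, ref[best]):.2f})")

    # Pseudo-users: each extracted list becomes a synthetic user who "likes"
    # every entity on that list, so unmodified CF algorithms can use the data.
    spidered_lists = [
        ["Miles Davis", "John Coltrane", "Bill Evans Trio"],  # e.g. one "favorite jazz" page
        ["John Coltrane", "Charles Mingus"],
    ]
    ratings = defaultdict(dict)  # pseudo-user id -> {entity: rating}
    for i, entities in enumerate(spidered_lists):
        for e in entities:
            ratings[f"pseudo-user-{i}"][e] = 1.0  # implicit positive rating; value chosen for illustration
    print(dict(ratings))

The brute-force pairwise comparison above is just for readability; as I recall, WHIRL itself answers similarity joins efficiently with an inverted index and an A*-style search rather than scoring every pair.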