off-topic -- search engines
Greetings,

I realise this is slightly off-topic. However:

A) I can't find a search engine mailing list (know of any?)
B) I knew I could count on my knowledgeable XML brothers. :)

The topic is indexing content stored in XML for a content-rich site -- many articles, many white papers, etc.

Should the "crawler" have access to the data layer, with rules and exceptions applied much as you would in a "normal" query, e.g. only crawl the <content> nodes with a value of "article" for the "type" attribute? Or should it access the content at a much higher level of abstraction, say through HTTP GET, like a GoogleBot or an AltaVistaBot?

My concerns are about granularity, exclusivity, and accuracy. If an article is rendered on a page with navigation items, footer, copyright, etc., will that "skew" the results or, even worse, actually return a record for "copyright mycompany"?

What about an article called "How to Buy a Search Engine"? This article is linked many, many times throughout the site. If I search on "Search Engine", what will the results return? All the pages that contain the title text/link?

I realise that these search engines have built-in exceptions, but my concern is that they operate at a high level (post HTML rendering), not at the data layer where more specific, "limitless" control is available.

Thanks for humoring me.

Jason Kohls

The xml-dev list is sponsored by XML.org <http://www.xml.org>, an initiative of OASIS <http://www.oasis-open.org>
The list archives are at http://lists.xml.org/archives/xml-dev/
To subscribe or unsubscribe from this list use the subscription manager: <http://lists.xml.org/ob/adm.pl>
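
For illustration, the data-layer option raised in the message (crawl only the <content> nodes whose "type" attribute is "article") might look roughly like the sketch below. It uses Python's standard xml.etree.ElementTree. The sample document, the "id" attribute, the <site>, <title> and <body> elements, and the extract_articles function are all assumptions made up for this example; the message does not describe the actual schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample data. Only <content type="article"> comes from the
# message; every other element and attribute name is an assumption.
SAMPLE = """\
<site>
  <content type="article" id="a1">
    <title>How to Buy a Search Engine</title>
    <body>Compare features before you buy.</body>
  </content>
  <content type="navigation" id="n1">
    <title>How to Buy a Search Engine</title>
  </content>
  <content type="footer" id="f1">
    <body>Copyright mycompany</body>
  </content>
</site>
"""


def extract_articles(xml_text):
    """Return one indexable record per <content type="article"> node."""
    root = ET.fromstring(xml_text)
    records = []
    # The predicate restricts the crawl to article nodes, so navigation
    # links and footer text never reach the index.
    for node in root.iterfind(".//content[@type='article']"):
        # Join all non-blank text inside the node into one string.
        text = " ".join(t.strip() for t in node.itertext() if t.strip())
        records.append({"id": node.get("id"), "text": text})
    return records


records = extract_articles(SAMPLE)
```

With this sample, the navigation entry that repeats the article title and the footer carrying the copyright line are both skipped, which is the kind of exclusivity the message is asking about. A crawler fetching rendered pages over HTTP GET would see all three pieces of text mixed together unless it applied its own page-level exclusions.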