XML-DEV Mailing List Archive

Re: web crawling (was: Re: HGRAB. Syndication. Google. Greyare
> > I don't understand your point. Could you please
> > explain?
> >
> > Because HGRAB, for example, is
> > usually polling only home page of the website,
> > they are all allowed for polling.
>
> Not all. Some sites "Disallow: /".

None of the sites syndicated by HGRAB has such a robots.txt.

> > Also, I'm not sure if search engines do
> > really care about the robots.txt, but that's another
> > story.
>
> Googlebot does [1], and that answers your question about the difference
> between it and HGRAB.

So HGRAB 'cares' as well. But you are right, HGRAB should read the robots.txt file in the polling script. Just in case.

After that, to be exactly like Google, the *only* thing I need is to write on the HGRAB website: "if you want me not to grab your content, protect yourself with a tricky robots.txt". Because *that* is what Google does (according to your URLs).

But it would not change the actual situation. The actual situation is that HGRAB is like Google, and I think that both 'look illegal'.

> > Also, the interesting twist is that when the
> > robot encounters the website with *no*
> > robots.txt ( most of the sites have no robots.txt )
> > the robot assumes that it is *safe* for him to
> > 'steal' the content.
>
> No twist here; "if it [robots.txt] was not present [then] all robots
> will consider themselves welcome" [2].

Sure. I think that's until the first lawsuit.

> > I think this is really gray area and
> > robots.txt is not a solution.
> > At the moment, at least.
>
> It isn't.

It is. That's what *you* write yourself. See below.

<aside>
I've looked at many robots.txt files and nobody disallows the /. Maybe there are some special websites that *do* that, but http://www.metasystema.org/terms.mhtml looks like a *very* rare example to me. But that's irrelevant, because you make a stronger point.
</aside>

> It is just a machine-readable version of [3], kindly provided
> by Google for your crawling convenience. robots.txt has no legal meaning
> [4]; you probably can't be sued for disregarding it. But you can for
> breaking sites' TOS agreements.

Exactly! So *can* Google. That was my point. There is no significant difference between HGRAB and Google.

Thank you for the URLs. I'll put a few lines on the website and a few lines into the polling script. Like Google did.

Rgds.Paul.

PS. This all means that some company may write some very tricky TOS (that a crawler would not understand), feed poor Google's robot with some pages, then just wait for those pages to become available in Google's cache, and then start the game.

PPS. I already had a situation where Google composed a short description that looked like 'Paul Tchistopolskii said: "All W3C members are morons"'. The problem was that:

1. There was a thread on some web forum that had the subject "All W3C members are morons".
2. I participated in that thread, saying that *this is not true* and then explaining why I think some things may look strange to a W3C outsider.
3. Google composed it *wrong* (because it just put together the title of the thread and my name).
4. The original web forum thread *has been removed*.
5. So, for a couple of months, a person who typed my name into Google's search engine might think that that's what I said.

Not nice. So how much should Google pay me for this glitch in their software?

I think that lawyers will have plenty of food in the coming years. The ownership of Web content and operations on it are tricky things.

> [1] http://www.google.com/webmasters/faq.html#nocrawl
> [2] http://www.robotstxt.org/wc/norobots.html#format
> [3] http://www.google.com/terms_of_service.html
> [4] http://www.robotstxt.org/wc/norobots.html#status
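
For illustration only, the robots.txt check discussed above for the polling script might look roughly like the sketch below. It is not HGRAB's actual code: the function name `may_poll`, the user-agent string "HGRAB" and the example.org URLs are assumptions made up for this example. It uses Python's standard `urllib.robotparser` and takes the robots.txt text as a string, so fetching the file is left to the caller.

```python
from urllib.robotparser import RobotFileParser


def may_poll(robots_txt: str, url: str, agent: str = "HGRAB") -> bool:
    """Return True if `agent` may fetch `url` under the given robots.txt text.

    An empty string stands for a site with no robots.txt at all; in that
    case no rule applies and the page is treated as allowed, matching the
    "all robots will consider themselves welcome" convention quoted in [2].
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical site that disallows everything for every robot.
    closed = "User-agent: *\nDisallow: /"
    print(may_poll(closed, "http://example.org/"))
    # Hypothetical site with no robots.txt.
    print(may_poll("", "http://example.org/"))
```

A polling script would call something like this before each fetch and skip the site when the result is False. As the thread notes, this only covers robots.txt; it says nothing about a site's TOS.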