This appeared in one of my web site’s server logs:
38.98.19.116 – – [10/Sep/2006:02:33:46 -0400] “GET /2005/08/28/postname/feed:http://www.example.com/comments/feed/ HTTP/1.0” 404 12824 “-” “Snapbot/1.0”
Ugh! Tons of 404s, because this badly behaved spider appended the site’s comments feed URL to the end of each URL it requested. SnapBot is apparently related to Snap.com. I’ve emailed Snap.com asking them to clarify their spider’s intentions and to tell me which identifier to use for it in my robots.txt file.
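To get a sense of how many of these mangled requests are in a log, something like the following rough sketch could help. It assumes the raw log is in Apache's "combined" format with plain ASCII quotes and hyphens (the excerpts in this post have been typographically prettified), and the names `LOG_PATTERN`, `parse_line`, and `find_mangled_feed_requests` are made up for illustration.

```python
import re

# Apache "combined" log format. Assumption: the raw log uses straight
# quotes and plain hyphens, unlike the prettified excerpts in this post.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    return match.groupdict() if match else None


def find_mangled_feed_requests(lines):
    """Collect 404 requests whose path has a 'feed:http...' URL glued on."""
    hits = []
    for line in lines:
        entry = parse_line(line)
        if entry and entry["status"] == "404" and "feed:http" in entry["path"]:
            hits.append(entry)
    return hits
```

Feeding it the lines of an access log returns one dict per offending request, so counting them per user agent or per IP address is a one-liner from there.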
The other related issue is that they are using multiple servers to do their spidering. One IP address requests robots.txt, another requests a page, then yet another requests the content of the page (CSS and images). That feels wrong to me, but I’m still pondering it. I’ll need to look at how Google handles images. I think Google doesn’t do anything with images; it only wants to read the text content.
38.98.19.105 – – [07/Sep/2006:07:18:22 -0400] “GET /robots.txt HTTP/1.0” 200 – “-” “Snapbot/1.0”
38.98.19.121 – – [07/Sep/2006:07:18:22 -0400] “GET / HTTP/1.0” 200 25168 “-” “Snapbot/1.0”
ip.add.re.ss – – [07/Sep/2006:07:18:47 -0400] “GET / HTTP/1.1” 200 7399 “-” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:48 -0400] “GET /images/filename1.jpg HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:48 -0400] “GET /images/photos/filename2.jpg HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:48 -0400] “GET /wp-content/themes/sitename/style.css HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:48 -0400] “GET /wp-content/themes/sitename/images/kubrickbgcolor.jpg HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:48 -0400] “GET /wp-content/themes/sitename/images/kubrickbg.jpg HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:48 -0400] “GET /wp-content/themes/sitename/images/kubrickheader.jpg HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
ip.add.re.ss – – [07/Sep/2006:07:18:49 -0400] “GET /wp-content/themes/sitename/images/kubrickfooter.jpg HTTP/1.1” 304 – “http://www.example.com/” “Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)”
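One way to see how widely a spider spreads its requests is to group the log by user agent and list the distinct IP addresses behind each one. This is only a sketch under the same assumption as before: a raw Apache "combined" log with straight quotes and hyphens. The names `LOG_PATTERN` and `ips_by_agent` are hypothetical.

```python
import re
from collections import defaultdict

# Only the fields needed here: client IP and the final quoted user agent.
# Assumption: raw Apache "combined" format with ASCII quotes and hyphens.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "[^"]*" \d{3} \S+ '
    r'"[^"]*" "(?P<agent>[^"]*)"'
)


def ips_by_agent(lines):
    """Map each user-agent string to the sorted list of IPs that sent it."""
    seen = defaultdict(set)
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            seen[match.group("agent")].add(match.group("ip"))
    return {agent: sorted(ips) for agent, ips in seen.items()}
```

A user agent that shows up from many addresses within seconds is probably a distributed crawler rather than a single visitor.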
It also seems very wrong to continue to spider and index the page when the server returns a 404 error. And note the multiple IP addresses.
38.98.19.80 – – [17/Sep/2006:00:28:51 -0400] “GET /2006/02/26/bad-file-name/ HTTP/1.1” 404 3552 “-” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
38.98.19.68 – – [17/Sep/2006:00:28:52 -0400] “GET /wp-content/themes/example/style.css HTTP/1.1” 200 9814 “http://www.example.com/2006/02/26/bad-file-name/” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
38.98.19.69 – – [17/Sep/2006:00:28:52 -0400] “GET /wp-content/themes/example/images/kubrickbg.jpg HTTP/1.1” 200 875 “http://www.example.com/2006/02/26/bad-file-name/” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
38.98.19.67 – – [17/Sep/2006:00:28:52 -0400] “GET /wp-content/themes/example/images/kubrickbgcolor.jpg HTTP/1.1” 200 353 “http://www.example.com/2006/02/26/bad-file-name/” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
38.98.19.84 – – [17/Sep/2006:00:28:52 -0400] “GET /favicon.ico HTTP/1.1” 200 10134 “-” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
38.98.19.85 – – [17/Sep/2006:00:28:52 -0400] “GET /wp-content/themes/example/images/kubrickheader.jpg HTTP/1.1” 200 29681 “http://www.example.com/2006/02/26/bad-file-name/” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
38.98.19.81 – – [17/Sep/2006:00:28:52 -0400] “GET /wp-content/themes/example/images/kubrickfooter.jpg HTTP/1.1” 200 3439 “http://www.example.com/2006/02/26/bad-file-name/” “Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1”
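The pattern above (a 404 on the page, then asset requests from neighbouring addresses that name the missing page as referrer) can be picked out of a log with a short script. What follows is a sketch, not a finished tool: it assumes a raw Apache "combined" log with straight quotes and hyphens, takes the site root `http://www.example.com` as a parameter, and uses the hypothetical names `follow_ups_to_404` and `same_slash24`.

```python
import ipaddress
import re

# Assumption: raw Apache "combined" format with ASCII quotes and hyphens.
LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[[^\]]+\] '
    r'"\S+ (?P<path>\S+) [^"]*" (?P<status>\d{3}) \S+ '
    r'"(?P<referer>[^"]*)" "[^"]*"'
)


def follow_ups_to_404(lines, site="http://www.example.com"):
    """Map each 404 page URL to the IPs that later sent it as a referrer."""
    entries = [m.groupdict() for m in map(LOG_PATTERN.match, lines) if m]
    missing = {site + e["path"] for e in entries if e["status"] == "404"}
    report = {}
    for e in entries:
        if e["referer"] in missing:
            report.setdefault(e["referer"], set()).add(e["ip"])
    return report


def same_slash24(ips):
    """True if all the given IPv4 addresses share one /24 network."""
    nets = {ipaddress.ip_network(ip + "/24", strict=False) for ip in ips}
    return len(nets) == 1
```

Note that `same_slash24` expects real dotted-quad addresses; anonymized placeholders such as `ip.add.re.ss` would raise a `ValueError`.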
For more information on SnapBot, see “Snapbot and Snap.com – My Last Word and Verdict” and “SnapBot and the Linux Firefox Revelation”.