Spider writers beware of other protocols

While looking through my web server logs, I see a fair number of web spiders making a very bad assumption: that http: is the only protocol ever linked from a web page. The obvious counterexample is https:. A web spider should only follow links that are http://. You can't simply assume that if the first 5 characters of a link aren't http:, it must be a relative page link. I see lots of errors in my server log asking for pages like /blog/2004/
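
For spider writers, here is a minimal sketch of the check in Python, using the standard library's urllib.parse. The function name resolve_link and the FETCHABLE_SCHEMES set are hypothetical names chosen for illustration, and including https in the set is an assumption; trim it to just http if your spider can't speak anything else. The idea is to resolve the link against the page URL first, then look at the scheme of the result, rather than comparing the first few characters of the raw link.

```python
from urllib.parse import urljoin, urlsplit

# Assumption: the schemes this hypothetical spider knows how to fetch.
FETCHABLE_SCHEMES = {"http", "https"}


def resolve_link(base_url, href):
    """Return an absolute URL to crawl, or None if the link should be skipped."""
    # urljoin leaves links that carry their own scheme (mailto:, ftp:, ...)
    # untouched and only resolves genuinely relative links against the base.
    absolute = urljoin(base_url, href.strip())
    # Decide based on the parsed scheme, not on a prefix comparison.
    if urlsplit(absolute).scheme.lower() not in FETCHABLE_SCHEMES:
        return None
    return absolute
```

With this, a link such as mailto:someone@example.com found on a page is skipped instead of being requested as if it were a file in the page's directory, while a plain relative link like page.html is still resolved against the page it came from.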