OpenOffice 2.04 is out

OpenOffice 2.04 is out, and they finally have a version for OS X. The download page for OpenOffice OS X is much more nicely designed than the old international listing. I really like the question and answer process. It asks whether you have an Intel Mac or a PowerMac, with a nice link to a page telling you how to tell the difference. There is still an above average level of geekiness on that page though: “I would like a legacy build for Mac OS X 10.2”. Why don’t they simply ask which OS you have: 10.2, 10.3, or 10.4? Then they would only need to give the options that are relevant. For example, if you are running 10.2, you can’t have an Intel Mac. They could even have used browser detection to guess what the OS is. Safari versions correspond pretty closely to the operating system version.
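As a rough sketch of the browser-detection idea, the Python below guesses the processor and OS version from a Safari user-agent string. It assumes that Safari reports either "Intel Mac OS X" or "PPC Mac OS X" in its user agent, and it uses build-number thresholds that are my own approximation of how Safari versions line up with 10.2, 10.3, and 10.4. The function name and the sample strings are illustrative, not anything taken from the OpenOffice download page.

```python
import re

def guess_mac_platform(user_agent):
    """Guess (cpu, os_version) from a Safari-style user-agent string.

    Either value is None when it cannot be guessed.
    """
    if "Mac OS X" not in user_agent:
        return (None, None)

    # Assumption: Safari states the processor family next to "Mac OS X".
    if "Intel Mac OS X" in user_agent:
        cpu = "Intel"
    elif "PPC Mac OS X" in user_agent:
        cpu = "PowerPC"
    else:
        cpu = None

    os_version = None
    match = re.search(r"Safari/(\d+)", user_agent)
    if match:
        build = int(match.group(1))
        # Assumed thresholds, for illustration only.
        if build >= 400:
            os_version = "10.4"
        elif build >= 100:
            os_version = "10.3"
        else:
            os_version = "10.2"
    return (cpu, os_version)

# Illustrative user-agent strings in the style Safari sends.
intel_ua = ("Mozilla/5.0 (Macintosh; U; Intel Mac OS X; en) "
            "AppleWebKit/418.9 (KHTML, like Gecko) Safari/419.3")
ppc_ua = ("Mozilla/5.0 (Macintosh; U; PPC Mac OS X; en) "
          "AppleWebKit/312.8 (KHTML, like Gecko) Safari/312.6")

print(guess_mac_platform(intel_ua))
print(guess_mac_platform(ppc_ua))
```

A guess like this should only preselect a download option. The visitor still needs a way to pick a different build, since user-agent strings can be wrong or missing.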

Regardless, I’m looking forward to trying out OpenOffice. I’m really looking forward to an Aqua version of OpenOffice from NeoOffice when they release a final version.

Eudora is dead

Qualcomm is releasing Eudora to become a collaboration with the open source project Thunderbird. Read the press release from Qualcomm and the press release from Mozilla.

So much for the exalted Real Soon update of Eudora. Now to figure out how to migrate tons of ancient mailboxes to Thunderbird.

Verizon’s Policy Blog, the “PoliBlog”

Verizon is now blogging at PoliBlog. The URL is a really nice one, very easy to remember. http://poliblog.verizon.com/PoliBlog/blogs/poliblog/default.aspx. What is it with using technology or a design that requires three directories to get to the real content?

Don’t click on the blog author’s name. They are using some kind of JavaScript abomination to clear out your history, so you can’t hit the back button. Actually it looks like any link on the site disables the back button. Bad move, Verizon. Why would you want to disable a user’s browser buttons? I really have to wonder about the tech guys at big companies. Did no one there try to explain to the bosses how things work?

Bitacle Appears to Be Editing Posts

I believe that Bitacle is editing posts before they are “archived” onto their web site. It looks like the Bitacle scraper drops anything in an RSS feed that comes after a horizontal rule. Hmmm, why would they do that? Oh yeah, the copyright feed plugin adds an HR before the copyright info. Simple enough to test: I removed the HR tag from the plugin. I guess we’ll have to wait a couple of days for them to scrape PlanetMike illegally again.

Added Akismet Spam Count

I’ve just added the Akismet Spam Count plugin to my WordPress. It shows how many spam comments have been caught by Akismet. Very nice.

SnapBot Appears to be a Broken, Bad Spider

This appeared in one of my web sites’ server logs:

38.98.19.116 - - [10/Sep/2006:02:33:46 -0400] "GET /2005/08/28/postname/feed:http://www.example.com/comments/feed/ HTTP/1.0" 404 12824 "-" "Snapbot/1.0"

Ugh! Tons of 404s as this badly behaved spider bot added the site’s feed URL for comments to the end of each URL. SnapBot is apparently related to Snap.com. I’ve emailed Snap.com asking for clarification of their spider’s intentions, and what user-agent identifier I should use for it in my robots.txt file.
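To see how many of these mangled requests a spider has made, a small script can pick them out of the access log. This is a minimal sketch, assuming the log is in Apache's combined format like the line above. The function names are made up for illustration. The translation table is only there because blog software often converts straight quotes and hyphens in pasted log lines into typographic ones.

```python
import re

# Apache "combined" format:
# host ident user [time] "request" status bytes "referer" "user-agent"
LOG_LINE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

# Map typographic quotes and en dashes back to plain ASCII before parsing.
TYPOGRAPHIC = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2013": "-"})

def parse_line(line):
    """Return a dict of fields for one log line, or None if it does not match."""
    match = LOG_LINE.match(line.strip().translate(TYPOGRAPHIC))
    return match.groupdict() if match else None

def is_mangled_feed_request(entry):
    """True for a 404 where a feed URL was glued onto the end of another path."""
    return entry["status"] == "404" and "feed:http://" in entry["path"]

sample = ('38.98.19.116 - - [10/Sep/2006:02:33:46 -0400] '
          '"GET /2005/08/28/postname/feed:http://www.example.com/comments/feed/ HTTP/1.0" '
          '404 12824 "-" "Snapbot/1.0"')

entry = parse_line(sample)
print(entry["ip"], entry["agent"], is_mangled_feed_request(entry))
```

Running `parse_line` over every line of a log file and counting the entries where `is_mangled_feed_request` is true gives a quick total of the broken requests per user agent.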

The other related issue is they are using multiple servers to do their spidering. One IP address requests robots.txt, another requests a page, then yet another requests the content of the page (CSS and images). I feel like that is bad, but I’m pondering it. I’ll need to look to see how Google handles images. I think Google doesn’t do anything with images, it wants to read only the text content.

38.98.19.105 - - [07/Sep/2006:07:18:22 -0400] "GET /robots.txt HTTP/1.0" 200 - "-" "Snapbot/1.0"
38.98.19.121 - - [07/Sep/2006:07:18:22 -0400] "GET / HTTP/1.0" 200 25168 "-" "Snapbot/1.0"
ip.add.re.ss - - [07/Sep/2006:07:18:47 -0400] "GET / HTTP/1.1" 200 7399 "-" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /images/filename1.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /images/photos/filename2.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/style.css HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/images/kubrickbgcolor.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/images/kubrickbg.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:48 -0400] "GET /wp-content/themes/sitename/images/kubrickheader.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"
ip.add.re.ss - - [07/Sep/2006:07:18:49 -0400] "GET /wp-content/themes/sitename/images/kubrickfooter.jpg HTTP/1.1" 304 - "http://www.example.com/" "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; SV1; .NET CLR 1.1.4322; .NET CLR 2.0.50727; InfoPath.1)"

It also seems very wrong to continue to spider and index the page when the server returns a 404 error. And note the multiple IP addresses.

38.98.19.80 - - [17/Sep/2006:00:28:51 -0400] "GET /2006/02/26/bad-file-name/ HTTP/1.1" 404 3552 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.68 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/style.css HTTP/1.1" 200 9814 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.69 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickbg.jpg HTTP/1.1" 200 875 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.67 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickbgcolor.jpg HTTP/1.1" 200 353 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.84 - - [17/Sep/2006:00:28:52 -0400] "GET /favicon.ico HTTP/1.1" 200 10134 "-" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.85 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickheader.jpg HTTP/1.1" 200 29681 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
38.98.19.81 - - [17/Sep/2006:00:28:52 -0400] "GET /wp-content/themes/example/images/kubrickfooter.jpg HTTP/1.1" 200 3439 "http://www.example.com/2006/02/26/bad-file-name/" "Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) Gecko/20060124 Firefox/1.5.0.1"
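Spotting the multiple-IP pattern by eye gets tedious in a big log. Here is a small sketch that groups requests by user-agent string and collects the distinct client IPs behind each one. The sample lines are the first three from the excerpt above, written with plain ASCII quotes and hyphens. The function name is hypothetical.

```python
import re
from collections import defaultdict

# Only two fields are needed: the client IP (first token on the line)
# and the user agent (the last quoted field on the line).
IP_AND_AGENT = re.compile(r'^(?P<ip>\S+) .* "(?P<agent>[^"]*)"$')

def ips_by_agent(lines):
    """Map each user-agent string to the set of client IPs that sent it."""
    seen = defaultdict(set)
    for line in lines:
        match = IP_AND_AGENT.match(line.strip())
        if match:
            seen[match.group("agent")].add(match.group("ip"))
    return seen

FIREFOX = ("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.0.1) "
           "Gecko/20060124 Firefox/1.5.0.1")
REFERER = "http://www.example.com/2006/02/26/bad-file-name/"

lines = [
    '38.98.19.80 - - [17/Sep/2006:00:28:51 -0400] '
    '"GET /2006/02/26/bad-file-name/ HTTP/1.1" 404 3552 "-" "%s"' % FIREFOX,
    '38.98.19.68 - - [17/Sep/2006:00:28:52 -0400] '
    '"GET /wp-content/themes/example/style.css HTTP/1.1" 200 9814 "%s" "%s"'
    % (REFERER, FIREFOX),
    '38.98.19.69 - - [17/Sep/2006:00:28:52 -0400] '
    '"GET /wp-content/themes/example/images/kubrickbg.jpg HTTP/1.1" 200 875 "%s" "%s"'
    % (REFERER, FIREFOX),
]

for agent, ips in ips_by_agent(lines).items():
    print(len(ips), "IPs for", agent)
    print(sorted(ips))
```

An ordinary visitor normally shows up as one IP per user agent within a few seconds. Several addresses from the same block sharing one user agent in the same second is the pattern described here.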

For more information on SnapBot, see “Snapbot and Snap.com – My Last Word and Verdict” and “SnapBot and the Linux Firefox Revelation”.

Spamhaus Lawsuit spam

I received some spam on Wednesday afternoon from someone really ticked at Spamhaus. It came to my abuse email address, sent from a throwaway Yahoo email address. Comments are supposed to go to yet another Yahoo email address. I reported the addresses to Yahoo. The last line of the message was “Committee to stop Spamhaus censorship and Blackmail”. If the CTSSCAB were so concerned by this, they’d include some real contact info. A domain name. A phone number. Typical spammer method of operation. Idiots. The message came from VS-516_VDS-102 ([64.46.36.172]).

Bitacle’s User-agent String

The Bitacle web thief was using this user-agent identifier up through 25/Sep/2006:04:05:00 -0400. After that point, they identify their RSS crawler as “Mozilla/5.0 (X11; U; Linux i686; en-EN; rv:1.8.0.4) Gecko/20060614 Fedora/1.5.0.4-1.2.fc5 Firefox/1.5.0.4 pango-text”. This is based on them using this IP address: 81.172.117.28.

Copyright Information

Just so everyone is clear, the blog entries and other content on PlanetMike.com are copyrighted by Michael Boyd Clark, and should not be posted to any other web site. The RSS feeds I provide are for the use of the readers of my blog, not to make it easier to steal my content and put it on your own web site.

Please look carefully at the web address in the URL field of your browser. It should read ‘http://www.planetmike.com’. In case you see a web address containing the word ‘bitacle’ or ‘bitacle.org’, you’re not looking at the original page on which this text was posted. If this is the case, the text you are reading right now might be incorrect or out of date. After I place a post on my weblog, I always try to keep published information up to date, or incorporate additional information, which I receive from readers. You will never find this information on bitacle.org.

Bitacle.org copies the content of weblogs without permission of the author, the holder of copyrights or the licensee. By visiting bitacle.org, you create income for the people who run bitacle.org, at the expense of me and other owners of a weblog, without permission and often without respecting copyrights and/or terms of use as in a license. So please, next time you want to view my posts, do so by using the web address of my weblog, which is ‘http://www.planetmike.com’. Please make a bookmark of my weblog’s address, if you would like to visit it again.

The Value of e-SocietyRobot?

One of my web sites has been spidered by the e-SocietyRobot spider. Its web site is at http://www.yama.info.waseda.ac.jp/~yamana/es/, slightly more legible using Babelfish. e-SocietyRobot is not a search engine; it is some unknown research project attempting to spider the web. It hit 4,549 pages (no MP3 files, luckily) but still used 51MB of traffic. They have no plans to make their indexed pages available. So should I try to block off that robot? I of course want the search engines to spider my sites. But I don’t want to help some anonymous “research” project. Maybe they are spammers. Maybe they are going to use my site to feed into some splogs.
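If I do block it, the rules are easy to test before they go live. The sketch below uses Python's standard `urllib.robotparser` to check a draft robots.txt that shuts out one robot and leaves everyone else alone. The user-agent token "e-SocietyRobot" is an assumption based on the spider's name, so the real token should be checked against the logs. A robots.txt rule also only works if the robot chooses to obey it.

```python
from urllib.robotparser import RobotFileParser

# Draft rules: block one named robot completely, allow all others.
# "e-SocietyRobot" is an assumed token, not a confirmed one.
ROBOTS_TXT = """\
User-agent: e-SocietyRobot
Disallow: /

User-agent: *
Disallow:
"""

def build_parser(robots_text):
    """Parse robots.txt content held in a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)
page = "http://www.example.com/2006/02/26/some-post/"

for agent in ("e-SocietyRobot", "Googlebot"):
    print(agent, "may fetch:", parser.can_fetch(agent, page))
```

The empty `Disallow:` under `User-agent: *` means nothing is off limits, so the search engines keep spidering as before.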

On a related issue, I wish that spiders would give an accurate referrer. Even if the referrer was another page in my own site, it would be useful to know where they are coming from. Does anyone know why they don’t?