Net: Search Engines – Page 2 – Richy's Random Ramblings

Search: New Microsoft Search

April 20, 2003

At the start of April, a Reuters news article came out which quoted Bob Visse, director of Marketing for MSN:

We do view Google more and more as a competitor. We believe that we can provide consumers with a better product and a better user experience.

Sounds ominous, doesn’t it? Many people therefore expected Microsoft to create its own search engine (instead of just using Looksmart, Inktomi and Direct Hit), but it seems things have happened a bit quicker than expected!

Yep, a few people have noticed a new robot or crawler indexing the internet and all signs point back to Microsoft at the moment.

Whilst it hasn’t yet hit my blog, I have been hit by it on one of my other sites with the following details:

131.107.163.49 - - [20/Apr/2003:12:54:56 +0100] "GET /robots.txt HTTP/1.1" 200 763 "-" "MicrosoftPrototypeCrawler (please report obnoxious behaviour to newbiecrawler@hotmail.com)"

The IP address 131.107.163.49 falls within the 131.107.0.0-131.107.255.255 netblock (in other words, 131.107.0.0/16), which is allocated to a certain Microsoft Corp of One Microsoft Way, Redmond, Washington, 98052, USA.
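For anyone wanting to repeat that netblock check, here is a minimal sketch using Python's standard `ipaddress` module. The address and the /16 range are the ones quoted above; the function name `in_netblock` is made up purely for illustration.

```python
import ipaddress

# The netblock allocated to Microsoft, as quoted in the post.
MICROSOFT_NETBLOCK = ipaddress.ip_network("131.107.0.0/16")


def in_netblock(address, netblock=MICROSOFT_NETBLOCK):
    """Return True if the dotted-quad address falls inside the netblock."""
    return ipaddress.ip_address(address) in netblock


if __name__ == "__main__":
    # The crawler's address from the log line.
    print(in_netblock("131.107.163.49"))
    # First and last addresses of the /16, i.e. the range in long form.
    print(MICROSOFT_NETBLOCK[0], MICROSOFT_NETBLOCK[-1])
```

A /16 fixes the first 16 bits (the "131.107" part) and leaves the last two octets free, which is why the range runs from 131.107.0.0 to 131.107.255.255.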

Using that information, I was then able to look at the logs again and saw quite a few page requests (I stopped counting after the 200th request made in the first 9 hours of today) from the IP address 131.107.65.225 (also owned by Microsoft) with the “Browser User-Agent” of “Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;+.NET+CLR+1.1.4322)”.
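That sort of log trawl can be sketched in a few lines of Python. This is only an illustration, assuming the server writes Apache's "combined" log format. The first sample line is the crawler request quoted earlier; the other sample lines (the `/example-page.html` path and the 192.0.2.10 visitor) are invented for the example and are not taken from my real logs. The names `LOG_PATTERN` and `count_requests` are likewise hypothetical.

```python
import ipaddress
import re
from collections import Counter

# Apache "combined" format:
# ip ident user [time] "request" status bytes "referer" "user-agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"$'
)

NETBLOCK = ipaddress.ip_network("131.107.0.0/16")

IE_AGENT = ("Mozilla/4.0+(compatible;+MSIE+6.0;+Windows+NT+5.2;"
            "+.NET+CLR+1.1.4322)")

# First line is the request quoted in the post; the rest are made-up samples.
SAMPLE_LINES = [
    '131.107.163.49 - - [20/Apr/2003:12:54:56 +0100] "GET /robots.txt HTTP/1.1" '
    '200 763 "-" "MicrosoftPrototypeCrawler (please report obnoxious behaviour '
    'to newbiecrawler@hotmail.com)"',
    '131.107.65.225 - - [20/Apr/2003:12:55:01 +0100] "GET /example-page.html HTTP/1.1" '
    '200 1024 "-" "' + IE_AGENT + '"',
    '131.107.65.225 - - [20/Apr/2003:12:55:03 +0100] "GET /example-page.html HTTP/1.1" '
    '200 1024 "-" "' + IE_AGENT + '"',
    '192.0.2.10 - - [20/Apr/2003:12:56:00 +0100] "GET / HTTP/1.1" '
    '200 512 "-" "ExampleBrowser/1.0"',
]


def count_requests(lines, netblock=NETBLOCK):
    """Count requests per (ip, user-agent) for addresses inside the netblock."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line.strip())
        if match is None:
            continue  # not a combined-format line
        try:
            address = ipaddress.ip_address(match["ip"])
        except ValueError:
            continue  # the host field holds a name rather than an IP address
        if address in netblock:
            counts[(match["ip"], match["agent"])] += 1
    return counts


if __name__ == "__main__":
    for (ip, agent), hits in count_requests(SAMPLE_LINES).most_common():
        print(hits, ip, agent)
```

Grouping by both IP address and user-agent is what makes the pattern visible: one address announces itself as the crawler, while another in the same netblock claims to be an ordinary browser.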

So, it would appear Microsoft has launched a new spider/robot out on the Internet and its name is MicrosoftPrototypeCrawler. Microsoft seems to want to keep it slightly quiet for now, though: most of its requests hide behind a user-agent string (which states what sort of computer and web browser you are using) claiming to be Microsoft Internet Explorer 6 on Windows NT 5.2. Windows XP claims to be Windows NT 5.1, so I would guess the new crawler is pretending to be on Windows .NET or 2003.

Whether the results from the crawler will be made public (or are just for internal Microsoft development for some reason), and what effect it’ll have on the Internet and the way people search, remains to be seen – especially considering that, according to Alexa Research, MSN.com is the 2nd most popular site worldwide (Google is only 5th). But I’m wondering why MSN/Microsoft is so concerned about trying to semi-hide the crawler for now, and why they are using a @hotmail.com address instead of a @microsoft.com one (the former doesn’t really command a lot of “respect” on the internet, since anybody can get one for free).

Search: Changes, Talks and Flames

January 30, 2003

Warning: extremely long post (2,838 words)!

Well, a lot has been happening in the “World of the Open Directory Project” (a.k.a. DMoz) in the last couple of weeks.

First of all, because of server load issues the internal editor forums have been moved to a new server (yippee!) that authenticates against the main server to ensure only valid users can log in. Good idea, but it’s had a few teething troubles (boo!).

Secondly, to help reduce the load on the main part of the ODP, editors have been given a “special” port number on which to connect to edit (hopefully reducing some of the overloading issues on the Apache webserver) – all good. Except if you are behind a corporate firewall and they block that port number 🙁

Thirdly, the “mirror server” at http://ch.dmoz.org/ (which is hosted by a fellow editor in Zurich, Switzerland) seems to have “taken off” a bit and is being used by a larger number of people now (mainly as it’s a lot faster) – it’s transferring around 19GB of data a month.

Fourthly, the ODP staff members have managed to produce “a” copy of the RDF dump. The RDF dump is, in fact, a very, very big file which contains the URLs, titles and descriptions of all the (nearly) 4 million sites listed in the ODP. Due to a large number of technical issues, this dump hasn’t been correctly produced since September last year. The RDF dump is usually downloaded by organisations such as Google to produce localised copies of the ODP (for instance the “PR enhanced” listings at the Google Directory). ODP staff have (this week in fact) managed to produce an RDF dump which is available via rdf.dmoz.org. There’s only one slight problem: it doesn’t contain “catid”s (unique category identifier numbers). These numbers got “clobbered” during the technical problems, so ODP staff are having to correct the database manually. Hopefully they’ll be fixed soon – but at least the ODP search has now been updated (since that uses the RDF dump) and there is an RDF dump for others to download and play with (which I’m intending to do this weekend).

Search: Do The Google Dance

January 28, 2003

Kuro5hin has an interesting article about what the “Google Dance” is and how it affects your ranking on the world’s most popular search engine.

Long story short: Dance equals Data. Servers. Moving. New Results.

A more complete answer is that the “Google Dance” is the nickname that has been given to the time of the month (usually around the 28th) when the data collected by “GoogleBot” (Google’s little spider/robot that goes round ‘reading’ the web) is introduced into the system. However, since Google has over 10,000 servers, it does take some time for the data to propagate around (“propagate” has now become my favourite and most used word for some reason). It has long been known that the start of the “Dance” can be found by watching for when the data on the www2 and www3 servers starts ‘reading differently’ from that on the main Google server (an illustration in the Kuro5hin article is to do a query for links to Yahoo!).

Search: Meetup Arrangements Progressing

January 14, 2003

As hinted at in a previous entry, the arrangements for the 2003 Open Directory Project UK editors’ real life meetup (phew – what a mouthful!) are progressing. We’ve already got a likely date (that I suggested 🙂 ) and voting has already started on the likely locations – Cambridge, Oxford, Leeds and Bournemouth are the most likely – Bradford, Gloucester, Exeter and Leicester are next. But it looks like no one wants to be sent to Coventry – it’s only got one vote so far.

I’ve invited the other 6 editors of Regional/Europe/United_Kingdom/England/Leicestershire/ (and, yes, I have memorised that nice long URL – hence why I keep on quoting it on places such as Resource-Zone) to the internal forum thread discussing it – bringing the total of editors contacted regarding it to over 100. The attendance last year was reasonable at 13 editors (location was the Briar Rose pub in Birmingham), but hopefully we’ll have a few more come this year.

If you are a UK ODP editor reading this (I know at least 3 people have found this blog via my private ODP editor profile – the private ones are accessible to ‘editors only’ and only contain a few little extra nuggets of information), then log in to the ODP (forgot your password? Then get a password reminder!) and pop over to the internal forum “Penguin Cafe” and have a read of “A new year, a new UK editor get-together”.

Yes, there are only 6 listed editors for the whole of Leicestershire – over 1,180 sites – but there are fewer than half-a-dozen unreviewed sites in the whole of Leicestershire: the majority of those are “dead” sites (i.e. sites that are returning 404 errors, that DNS is currently unable to resolve, etc.) that have been moved to unreviewed until they come “alive” again. Of course, any editors of England/, UK/, Europe/ and Regional/ can also edit there – along with any editall or meta editor.
