A fellow blogger has suggested introducing a tag which would stop search engines such as Google from indexing certain sections of web pages. This would be extremely handy against all the blog comment spam currently going around (I’m personally using a combination of IP blocking [like Neil] and a modification of /lib/MT/App/Comments.pm to block certain words in submitted URLs), but instead of
<!-- SearchEngine: Begin Anonymous Comment --> / <!-- SearchEngine: End Anonymous Comment -->
I would recommend something a bit more generalised such as:
<!-- robots:noindex --> / <!-- /robots:noindex -->
This tag would be used to mark sections of web page content as “not to index/search”: if a spammer does manage to add their URL to a website, but the URL appears between the <!-- robots:noindex --> and <!-- /robots:noindex --> tags, then search engines will ignore the link, making the spam useless for search engine placement/promotion.
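To make the idea concrete, here is a minimal sketch of what an indexer might do with these tags before it looks at a page. This is purely my own illustration: the function name strip_noindex and the sample page are made up, and as far as I know no search engine actually honours these markers.

```python
import re

# Hypothetical markers from the proposal above; the whitespace inside the
# comment is allowed to vary and matching is case-insensitive.
NOINDEX_BLOCK = re.compile(
    r"<!--\s*robots:noindex\s*-->.*?<!--\s*/robots:noindex\s*-->",
    re.DOTALL | re.IGNORECASE,
)


def strip_noindex(html: str) -> str:
    """Return html with every properly closed noindex region removed."""
    return NOINDEX_BLOCK.sub("", html)


page = (
    "<p>My post</p>"
    "<!-- robots:noindex -->"
    '<a href="http://spam.example/">cheap pills</a>'
    "<!-- /robots:noindex -->"
    "<p>Footer</p>"
)

print(strip_noindex(page))  # prints <p>My post</p><p>Footer</p>
```

Note that in this sketch an opening tag without a matching closing tag is left alone, so a stray marker cannot hide the rest of the page from the index. A real crawler might choose differently.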
However, there are a number of drawbacks that I can see with introducing this to the search engine world:
- First thing is backwards compatibility. It’s conceivable that several Content Management Systems (CMS) may use comment tags starting robots: for internal markup purposes. In theory, these should be parsed out before the content is sent to the end user, but in practice that’s another matter. That said, I expect the number of sites currently using something like <!-- robots… to be extremely low.
- Second thing is also backwards compatibility, but this time relating to the existing sites that should use the tag. I estimate there are somewhere in the region of 270,000 Movable Type blog sites currently online (which compares well with SixLog’s own figure of a quarter of a million downloads), but then you’ve got to take into account all the other sites which allow third party comments, or which have other sections they may not want search engines indexing. For example, a major news site may prefer to let search engines index/cache only the headline and the first paragraph, since after a few days the article may become “pay to read” and the publisher may not want it archived. Getting nearly 1 million webmasters to integrate the new tag into their sites (and rebuild the entire site) could be problematic.
- Third factor is the “take up rate”. It would be good if a major search engine such as Google were to use this tag, but ideally we need widespread adoption: Altavista/AllTheWeb (both owned by Overture, which is now owned by Yahoo Inc) also need to support it, as do “non search engines” such as The Web Archive.
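On the integration point (the second drawback), the change on the publishing side could be fairly small. Below is a rough sketch, assuming a hypothetical wrap_comment helper that a blog tool would call when it renders each third party comment. It also removes any copies of the markers the commenter typed themselves, since otherwise a spammer could simply close the noindex region early.

```python
import re

# Hypothetical markers from the proposal; not an existing standard.
OPEN_TAG = "<!-- robots:noindex -->"
CLOSE_TAG = "<!-- /robots:noindex -->"

# Matches either marker, however it is spaced or cased.
MARKER = re.compile(r"<!--\s*/?robots:noindex\s*-->", re.IGNORECASE)


def wrap_comment(comment_html: str) -> str:
    """Wrap untrusted comment HTML in noindex markers."""
    cleaned = comment_html
    # Repeat until no marker is left, in case removing one marker
    # joins the text around it into another one.
    while MARKER.search(cleaned):
        cleaned = MARKER.sub("", cleaned)
    return f"{OPEN_TAG}{cleaned}{CLOSE_TAG}"


spam = 'Nice post!<!-- /robots:noindex --><a href="http://spam.example/">pills</a>'
print(wrap_comment(spam))
```

The site would still need rebuilding so that existing comments pick up the wrapper, which is the expensive part for a large archive.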
But it’s a good idea and I do hope that it’s implemented in one manner or another in the very near future…