A fellow blogger has suggested that a tag be introduced which would stop search engines such as Google from indexing certain sections of web pages. This would be extremely handy given all the blog comment spam currently going around (I’m personally using a combination of IP blocking [like Neil] and a modification of /lib/MT/App/Comments.pm to block certain words in submitted URLs). But instead of
<!-- SearchEngine: Begin Anonymous Comment --> / <!-- SearchEngine: End Anonymous Comment -->
I would recommend something a bit more generalised such as:
<!-- robots:noindex --> / <!-- /robots:noindex -->
This would fit in with the already existing robots.txt and robots meta tag (and it could also be extended to things like <!-- robots:nofollow --> for sections of content).
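Incidentally, the word-blocking I mentioned above can be sketched roughly like this. This is a Python illustration of the idea only, not the actual Perl patch to Movable Type’s Comments.pm, and the blocklist words are made up:

```python
# Illustrative sketch: the real change is a Perl patch inside Movable Type's
# lib/MT/App/Comments.pm; this just shows the idea of rejecting a comment
# when its submitted URL contains a blocked word.

BLOCKED_WORDS = ["casino", "viagra", "texas-holdem"]  # hypothetical blocklist


def comment_allowed(submitted_url: str) -> bool:
    """Return False if the comment's URL contains any blocked word."""
    url = submitted_url.lower()
    return not any(word in url for word in BLOCKED_WORDS)


print(comment_allowed("http://example.com/my-blog"))       # True
print(comment_allowed("http://cheap-casino.example/win"))  # False
```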
This tag would be used to mark sections of web page content as “not to index/search”: so if a spammer does manage to add their URL to a website, but the URL appears between the <!-- robots:noindex --> tags, then the search engines will ignore it, making the spam useless in terms of search engine placement/promotion.
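To make that concrete, here’s a minimal sketch (in Python, purely for illustration — no search engine necessarily works this way) of how a crawler could strip the marked sections out of a page before indexing it:

```python
import re

# Match everything between <!-- robots:noindex --> and <!-- /robots:noindex -->,
# including the comment markers themselves. DOTALL lets a section span lines.
NOINDEX_RE = re.compile(
    r"<!--\s*robots:noindex\s*-->.*?<!--\s*/robots:noindex\s*-->",
    re.DOTALL,
)


def strip_noindex(html: str) -> str:
    """Return only the page content a crawler would actually index."""
    return NOINDEX_RE.sub("", html)


page = """<p>My post about widgets.</p>
<!-- robots:noindex -->
<p>Comment: buy cheap pills at http://spam.example/</p>
<!-- /robots:noindex -->
<p>More indexable content.</p>"""

print(strip_noindex(page))
```

The spammer’s URL never reaches the index, while the surrounding article text is untouched.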
However, there are a number of drawbacks that I can see to introducing this to the search engine world:
- First is backwards compatibility. It’s conceivable that several Content Management Systems (CMS) may use comment tags starting with robots: for internal markup purposes. In theory, these should be parsed out before the content is sent to the end user, but in practice that’s another matter. That said, I expect the number of sites currently using something like <!-- robots… to be extremely low.
- Second is also backwards compatibility, but this time relating to existing sites that should use the tag. I estimate there are somewhere in the region of 270,000 Movable Type blog sites currently online (which fits with SixLog’s own figure of a quarter of a million downloads), and then you’ve got to take into account all the other sites which allow third-party comments and may not want search engines indexing sections of their pages (for example, a major news site may prefer to let search engines index/cache only the headline and first paragraph, since after a few days the article may become “pay to read” and the publisher may not want it archived). Getting nearly 1 million webmasters to integrate the new tag into their sites (and rebuild those sites) could be problematic.
- Third is the potential abuse factor. As a search engine optimiser, I know full well how existing HTML tags can be abused to make certain parts of web pages “invisible” to web spiders/robots/search engines (and, on the flip side, how to make content visible only to them and not to ‘normal’ browsers). I can see how a <!-- robots:noindex --> tag could easily be abused: think of JavaScript redirects hidden in that section, or of ‘hiding’ the bulk of the page so the keyword density on the rest stays ‘just right’.
- Fourth is the “take-up rate”. It would be good if a major search engine such as Google were to adopt this tag, but ideally we need widespread saturation: AltaVista and AllTheWeb (both owned by Overture, which is now owned by Yahoo Inc.) would also need to support it, as would “non search engines” such as The Web Archive.
But it’s a good idea and I do hope that it’s implemented in one manner or another in the very near future…
2 Comments
It’s definitely a nice idea. The other way of implementing it would be via an XHTML namespace, thus creating a new tag: you would declare the namespace in the header, then wrap the new element in the content around the markup you didn’t want indexing.
Anyway, I trackbacked you with some further thoughts on the matter so I’ll let you read those :).
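The namespace approach suggested in the comment above might look something like the following sketch. The namespace URI and the noindex element name here are entirely invented for illustration, since the comment doesn’t pin them down:

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI -- no official one exists for this proposal.
NS = "http://example.org/robots-markup"

doc = f"""<div xmlns:robots="{NS}">
  <p>Indexable article text.</p>
  <robots:noindex>
    <p>Visitor comment with a spammy link.</p>
  </robots:noindex>
</div>"""

root = ET.fromstring(doc)

# Remove every element in the robots namespace before indexing,
# the XML equivalent of stripping the comment-delimited sections.
for parent in root.iter():
    for child in list(parent):
        if child.tag == f"{{{NS}}}noindex":
            parent.remove(child)

print(ET.tostring(root, encoding="unicode"))
```

One advantage of this over comment markers is that the sections nest properly and can be processed with any standard XML parser, rather than with ad-hoc string matching.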
Stopping crawlers from indexing your comments
Richy has started an interesting discussion about marking out blocks of text so that robots do not index them, say comments sections on weblogs, to reduce the effects of comment spam.