Templateengine and crawlers/spiders/bots

Problem

Web crawlers have two negative effects on our site:

  • each page is indexed a dozen times, once for every possible open-parameter (impressum, feedback ...) and every possible language (cz, hu, es, sk ...), even though those languages fall back to English.
  • most pages are also reachable via searches, so our search engine is on the edge of collapsing because bots run searches like “*:za” a few times per second.

Solution

  • we have a sitemap that contains all pages we want to have indexed. This sitemap is part of the Google sitemap program.
  • bots are not allowed to run any search. A bot receives an error that links to the sitemap instead.
  • bots always get the pretext shown (if there is any), so they can index it
  • bots will not see any link to the same page with different open-parameters
  • for bots, sublinks revert any language other than de and en to en.
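
The link rules in the list above can be sketched roughly as follows. This is only an illustration in Python, not the Perl code in funcs.pm: the function name link_for, the query parameter names page, lang and open, and the is_startpage flag are all assumptions made for the example. The start page exception comes from the note that follows.

```python
from urllib.parse import urlencode

# Languages that bots may keep in links (from the list above).
INDEXED_LANGUAGES = {"de", "en"}
FALLBACK_LANGUAGE = "en"


def link_for(page, lang, open_params, is_bot, is_startpage=False):
    """Build a link query string (hypothetical helper, not the real url())."""
    if is_bot and not is_startpage:
        # Bots get no links that differ only in open-parameters ...
        open_params = []
        # ... and every language except de and en reverts to en.
        if lang not in INDEXED_LANGUAGES:
            lang = FALLBACK_LANGUAGE
    query = [("page", page), ("lang", lang)]
    query += [("open", name) for name in open_params]
    return "?" + urlencode(query)
```

With this sketch a normal visitor still gets every variant of a link, while a bot gets a single link per page and per indexed language.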

Note that on the start page the open-pages and the language-pages are shown even to bots, so that the multilanguage main pretext and the impressum get indexed too.

How can I be a bot

http://en.wikipedia.org/wiki/User_agent

If you want to see what a bot sees, you can set the user-agent in your browser. I only recommend this for Firefox:

  1. enter about:config as the URL
  2. enter agent as the filter
  3. if it does not exist yet, create a new entry called general.useragent.override: right-click in the empty area, select new -> string and enter general.useragent.override as the name
  4. apply the filter again to see the new entry
  5. change the value to bot or I am bot or the botsman call or fabot or ...
  6. enjoy being bot
  7. when you have finished, change the value back to empty so that your Firefox sends its normal, correct user-agent string again
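
If you would rather not touch your browser settings, the same effect can probably be had from a small script that sends a bot-like user-agent. The following is a minimal sketch using Python's standard library; the function names and the default value "I am bot" are just choices made for this example, and the URL you pass in is up to you.

```python
from urllib.request import Request, urlopen


def build_bot_request(url, user_agent="I am bot"):
    """Prepare a request that announces itself with a bot-like user-agent."""
    return Request(url, headers={"User-Agent": user_agent})


def fetch_as_bot(url, user_agent="I am bot"):
    """Fetch the page the way a bot would see it and return the HTML text."""
    with urlopen(build_bot_request(url, user_agent), timeout=10) as response:
        return response.read().decode("utf-8", errors="replace")
```

Calling fetch_as_bot with the address of one of our pages should return the markup that is served to bots, which can then be compared with what a normal browser receives.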

Reliability

As long as a visitor does not change their user-agent string or use a fancy, exotic browser like gzip, the method is very reliable.
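
To make the limits of this concrete, here is a minimal sketch of a user-agent check. It assumes that a visitor counts as a bot when the user-agent string contains "bot", which is only inferred from the sample values above (bot, I am bot, fabot); the case-insensitive match and the function name is_bot are assumptions as well, and the real check may differ.

```python
def is_bot(user_agent):
    """Return True if the user-agent string looks like a bot (assumed rule)."""
    # A missing header is treated like an empty string.
    return "bot" in (user_agent or "").lower()
```

Under this assumed rule a crawler whose user-agent does not contain "bot" would be treated as a normal visitor, and a visitor who overrides the user-agent with such a value would be treated as a bot.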

Implementation (tech stuff)

Bot detection is done in tt2_lib_m6, which sets the variable $var->{i_am_bot} accordingly. tt2_lib_m6 also sets open={1=>1} on entrance. Based on this variable, url() in funcs.pm restricts the variety of allowed URLs and search1() in funcs.pm restricts the search.
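
How the flag might gate the search can be sketched as follows. This is a Python illustration, not the Perl code of search1() in funcs.pm: the dictionary standing in for $var, the run_search callback, the shape of the return value and the sitemap path are all assumptions made for the example.

```python
SITEMAP_URL = "/sitemap.xml"  # hypothetical location of the sitemap


def search1(query, var, run_search):
    """Run a search unless the visitor was flagged as a bot (illustrative)."""
    if var.get("i_am_bot"):
        # Bots never reach the search engine; they get an error that
        # links to the sitemap instead.
        return {
            "error": "Searches are not available to crawlers.",
            "link": SITEMAP_URL,
            "results": [],
        }
    return {"error": None, "link": None, "results": run_search(query)}
```

The point of the design is that the expensive search backend is not called at all for bots, so queries like “*:za” no longer cost anything.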

 
kb/templateengine/bots.txt · Last modified: 2006/05/17 15:01 by peter