Template engine and crawlers/spiders/bots
Problem
Web crawlers have two negative effects on our site:
- each page is indexed a dozen times, once for every possible open-parameter (impressum, feedback ...) and every possible language (cz, hu, es, sk ...) that falls back to English.
- most pages are also reachable via searches, so our search engine is on the verge of collapsing because bots run searches like “*:za” a few times per second.
Solution
- we have a sitemap that contains all pages we want to have indexed. This sitemap is part of the Google sitemap program.
- bots are not allowed to run any search. A bot that tries receives an error that links to the sitemap.
- bots always get the pretext shown (if there is any), so they can index it
- bots will not see any link to the same page with different open-parameters
- for bots, any language other than de and en is reverted to en in the links to subpages.
Note that on the start page the open-pages and the language pages are shown even to bots, so that the multilingual main pretext and the impressum get indexed too.
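The link rules above could be sketched roughly as follows. This is an illustrative Python sketch, not the code from funcs.pm. The function name bot_link, the query parameter names open and lang, and the is_startpage flag are all assumptions made for the example, and hiding a link is modelled here simply by dropping the open-parameter.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Languages that bots may keep; every other language is reverted to en (from the text).
BOT_LANGUAGES = {"de", "en"}


def bot_link(url, is_startpage=False):
    """Return a link as a bot should see it.

    Assumed (hypothetical) query parameters: 'open' selects an open-page such
    as impressum or feedback, 'lang' selects the language.
    """
    if is_startpage:
        # On the start page, open-pages and language links stay visible to bots.
        return url
    parts = urlsplit(url)
    params = []
    for key, value in parse_qsl(parts.query, keep_blank_values=True):
        if key == "open":
            continue  # no link to the same page with a different open-parameter
        if key == "lang" and value not in BOT_LANGUAGES:
            value = "en"  # revert any language but de and en to en
        params.append((key, value))
    return urlunsplit(parts._replace(query=urlencode(params)))
```

With these assumptions, bot_link("/page?id=7&open=impressum&lang=hu") gives /page?id=7&lang=en, while the same link on the start page is returned unchanged.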
How can I be a bot
http://en.wikipedia.org/wiki/User_agent
If you want to see what a bot sees, you can set the user-agent in your browser. I only recommend this for Firefox:
- enter about:config as the URL
- enter agent as the filter
- if it does not exist yet, create a new entry called general.useragent.override: right-click in the empty area, select new -> string and enter general.useragent.override as the name
- apply the filter again to see the new entry
- change the value to bot or I am bot or the botsman call or fabot or ...
- enjoy being a bot
- reset the value to empty when you are finished, so that Firefox sends its normal, correct user-agent string again
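Outside the browser, the same check can be done by sending a request with a custom User-Agent header. A minimal Python sketch, assuming a placeholder URL (http://www.example.com/) and the value fabot from the list above; the helper name bot_request is made up for the example, and the request is only built here, not sent.

```python
from urllib.request import Request, urlopen


def bot_request(url, user_agent="fabot"):
    """Build a request that identifies itself with a bot-like user-agent string."""
    return Request(url, headers={"User-Agent": user_agent})


# Placeholder URL (an assumption); replace it with the page you want to inspect.
req = bot_request("http://www.example.com/")
print(req.get_header("User-agent"))  # the user-agent string that would be sent

# To actually fetch the page as a bot would see it:
# html = urlopen(req).read().decode("utf-8", errors="replace")
```

Unlike the Firefox setting, this affects only the single request, so there is nothing to reset afterwards.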
Reliability
As long as a visitor does not change their user-agent string or use an exotic browser like gzip, the method is very reliable.
Implementation (tech stuff)
The bot detection is done in tt2_lib_m6, which sets the variable $var->{i_am_bot} accordingly. tt2_lib_m6 also sets open={1=>1} on entrance. url() in funcs.pm restricts the set of allowed URLs based on this variable, and search1() in funcs.pm restricts the search.
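To make the flow concrete, here is a rough Python sketch of the detection and the search guard. The real code is Perl in tt2_lib_m6 and funcs.pm and is not reproduced here. The marker list, the function names detect_bot and search_for, the sitemap path and the shape of the error are all assumptions for illustration; only the i_am_bot flag and the idea of an error that links to the sitemap come from the text.

```python
# Hypothetical marker list; the substrings actually checked by tt2_lib_m6 are not documented here.
BOT_MARKERS = ("bot", "crawler", "spider")

SITEMAP_URL = "/sitemap.xml"  # assumed location of the sitemap


def detect_bot(user_agent):
    """Return True if the user-agent string looks like a crawler."""
    ua = (user_agent or "").lower()
    return any(marker in ua for marker in BOT_MARKERS)


def search_for(var, query, run_search):
    """Run the search for normal visitors; give bots an error that links to the sitemap."""
    if var.get("i_am_bot"):
        return {"error": "Search is disabled for bots.", "sitemap": SITEMAP_URL}
    return {"results": run_search(query)}


# Set the flag once on entrance, mirroring $var->{i_am_bot}.
var = {"i_am_bot": detect_bot("I am bot")}
```

Keeping the decision in a single flag means the link and search code only have to check var["i_am_bot"] and never look at the user-agent string themselves.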



