Mastering the Web

Detecting robots with FWTLogstat2

If you have ever wondered how search engines like Google and Yahoo find sites containing your search terms, the answer is: robots. These are not metallic robots with a human-like appearance. They are computer programs, also called spiders because they traverse the Web. They do this by requesting from Web servers every HTML document (page) whose existence they know of. A robot learns that a page exists when it finds a link to it, so any page needs at least one link pointing to it if robots are to discover it.

Once a robot gets a page from a server, it reads the page and extracts its links and its contents. The links are used to make new requests. The contents are analyzed, and relevant terms are added to indexes. The way a robot reads a page, a process known as parsing, is not the same as the one used by a normal browser like Internet Explorer or Firefox. Things that affect only the visual aspect of the page are of no interest to the robot, so style sheets, pictures, etc., are not requested. Before retrieving the first page of a hitherto unexplored domain, a well-behaved robot tries to find a special document named "robots.txt" that, if it exists, specifies what actions robots are allowed to perform in that domain.
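The "robots.txt" check can be sketched with Python's standard urllib.robotparser module. This is only an illustration of how a well-behaved robot might consult the file: the rules, the robot names "ExampleBot" and "BadBot", and the www.example.com URLs are made up for the example, and a real robot would download the file from the site instead of using an inline string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: BadBot
Disallow: /
"""

def build_parser(robots_txt):
    """Parse robots.txt text and return a parser that answers can_fetch()."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser

parser = build_parser(ROBOTS_TXT)

# An ordinary robot may read public pages but not the /private/ directory.
print(parser.can_fetch("ExampleBot", "http://www.example.com/index.html"))
print(parser.can_fetch("ExampleBot", "http://www.example.com/private/data.html"))

# A robot singled out by name is excluded from the whole site.
print(parser.can_fetch("BadBot", "http://www.example.com/index.html"))
```

Note that "robots.txt" is only a request: a badly behaved robot can ignore it.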

The server log file of your site records the requests made by humans and by robots, but these two kinds of requests are obviously not of the same interest to you. It is important to know that robots are visiting your site, because this means that the search engines are finding your pages and can include them in their indexes. Once you are certain of this fact, however, you do not care whether a robot comes one, two, or one hundred times to your site. Furthermore, you would prefer that these visits not be added to the human visits, because that would give inexact figures in your statistics.

FWTLogstat2 lets you remove robot visits from your log files. It works with a list of robot signatures that is provided with the program and that you are responsible for updating. A signature is the part of the log line where the user agent (in this case, the robot) declares who it is. If FWTLogstat2 finds a log line with a signature that is in the list, it deletes that line. The file with all the robot lines removed is saved under a new name.
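The idea of deleting lines by signature can be sketched in a few lines of Python. This is not FWTLogstat2's actual code: it assumes the log is in the common "combined" format, where the user agent is the last quoted field, and it assumes a line matches when a signature appears anywhere inside the user agent. The function names and the sample log lines are invented for the illustration.

```python
import re

# Assumption: combined log format, where the user agent is the last quoted field.
USER_AGENT_RE = re.compile(r'"([^"]*)"\s*$')

def extract_user_agent(line):
    """Return the last quoted field of a log line, or None if there is none."""
    match = USER_AGENT_RE.search(line)
    return match.group(1) if match else None

def remove_robot_lines(lines, signatures):
    """Keep only the lines whose user agent contains no known signature."""
    signatures = [s for s in signatures if s]  # an empty signature would match everything
    kept = []
    for line in lines:
        agent = extract_user_agent(line)
        if agent is not None and any(sig in agent for sig in signatures):
            continue  # robot line: drop it
        kept.append(line)
    return kept

# Made-up sample lines: one robot visit and one human visit.
SAMPLE_LOG = [
    '66.249.66.1 - - [10/Oct/2008:13:55:36 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [10/Oct/2008:13:56:01 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "http://www.example.com/" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"',
]

cleaned = remove_robot_lines(SAMPLE_LOG, ["Googlebot", "Slurp"])
print(len(cleaned), "line(s) kept")
```

To mirror the program's behavior of saving the result under a new name, the cleaned lines would be written to a different file rather than over the original log.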

New robots appear all the time, so you must keep the robot list up to date by detecting new robots in your log files. This is done by finding the requests made for the file "robots.txt", since normally only robots ask for it. The signatures of these requests can be added to those already in the robot list. The list file is "robotlist.txt", and a backup of it is made with the name "robotlist.txt.backup".
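Detection by "robots.txt" requests can be sketched as follows. Again this is a guess at the general technique, not the program's own code: it assumes combined log format, treats the whole user agent field as the signature, and considers a robot already known when any listed signature appears inside its user agent. The regular expression, the function name, and the sample lines are assumptions made for the example.

```python
import re

# Assumption: combined log format. Captures the requested path and the user agent.
LOG_RE = re.compile(r'"(?:GET|HEAD) (\S+)[^"]*" \d{3} \S+ "[^"]*" "([^"]*)"')

def detect_new_signatures(lines, known_signatures):
    """Return user agents that requested /robots.txt and match no known signature."""
    known = [s for s in known_signatures if s]
    found = []
    for line in lines:
        match = LOG_RE.search(line)
        if not match:
            continue
        path, agent = match.groups()
        if path.split("?")[0] != "/robots.txt":
            continue  # only requests for robots.txt identify a robot here
        if any(sig in agent for sig in known):
            continue  # already in the robot list
        if agent not in found:
            found.append(agent)
    return found

# Made-up sample lines: an unknown robot, a known robot, and a human visitor.
SAMPLE_LOG = [
    '198.51.100.9 - - [10/Oct/2008:14:01:10 -0700] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "NewCrawler/1.0 (+http://crawler.example.com/)"',
    '66.249.66.1 - - [10/Oct/2008:14:02:33 -0700] "GET /robots.txt HTTP/1.1" '
    '200 120 "-" "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)"',
    '192.0.2.7 - - [10/Oct/2008:14:03:47 -0700] "GET /index.html HTTP/1.1" '
    '200 2326 "-" "Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0"',
]

new_robots = detect_new_signatures(SAMPLE_LOG, ["Googlebot"])
print(new_robots)
```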

To perform the operations just described, use the Robots menu. It has four options: Detect, Save, Delete, and Options. "Detect" shows you the robots in the log file that are not yet in the robot list. The list of new signatures appears in the bottom field of the Robots tab and is editable. It is strongly advised that you go through this list and delete the entries that are too generic, like "-" or "Mozilla/4.0", because saving them will cause many valid lines to be deleted from the log file.
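Why generic entries are dangerous can be seen in a small sketch. It assumes, as an illustration only, that a signature matches whenever it appears inside the user agent; under that assumption "Mozilla/4.0" would also match ordinary Internet Explorer visitors. The set TOO_GENERIC holds only the two examples named above, and the function names and user agents are hypothetical.

```python
# The two generic entries named in the text; extend the set as your own logs require.
TOO_GENERIC = {"-", "Mozilla/4.0"}

def prune_generic(signatures, too_generic=TOO_GENERIC):
    """Drop empty signatures and those known to be too generic."""
    return [s for s in signatures if s.strip() and s.strip() not in too_generic]

def lines_matched(signature, user_agents):
    """Count the user agents that a signature would cause to be deleted."""
    return sum(1 for agent in user_agents if signature in agent)

# Made-up user agents: two human visitors and one robot.
USER_AGENTS = [
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)",
    "Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)",
    "NewCrawler/1.0 (+http://crawler.example.com/)",
]

detected = ["-", "Mozilla/4.0", "NewCrawler/1.0 (+http://crawler.example.com/)"]
safe = prune_generic(detected)

print(lines_matched("Mozilla/4.0", USER_AGENTS), "human visits would be lost")
print(safe)
```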

"Save" lets you merge the new signatures with the old ones. "Delete" deletes from the log file the lines with a signature found in the robot list. "Options" takes you to the Robots tab where, in addition to the New Robots List, you will find three check boxes that enable three different cleaning processes that are performed when you select the "Delete" option from the "Robots" menu. The first one enables the process just described. The other ones enable the deletion of two types of lines that are not produced by human visitors. One refers to what are called "feed aggregators" or "feed readers", that is, programs that search your site looking for RSS2 feeds. The other one is a kind of spam that works by including in your log file references to documents not belonging to your site.

© 1999-2008 Hector Castro -- All rights reserved

www.great-web-info.com