Dynamic Robots.txt Rules for Apache Security
By Ken Coar
February 5, 2009
In the previous article, we devised a way to gather intelligence via a dynamic robots.txt, but we're not (yet) taking any action. To act on that intelligence, within the limits of what a robots.txt file can do, we need some rules pertaining to each client.
I feel another MySQL table coming on:
mysql> explain rule;
+----------+-------------+------+-----+------------+----------------+
| Field    | Type        | Null | Key | Default    | Extra          |
+----------+-------------+------+-----+------------+----------------+
| xid      | int(11)     | NO   | PRI | NULL       | auto_increment |
| pattern  | blob        | YES  |     | NULL       |                |
| alias    | text        | YES  |     | NULL       |                |
| field    | varchar(16) | YES  |     | user-agent |                |
| access   | varchar(8)  | YES  |     | disallow   |                |
| caseful  | int(11)     | YES  |     | 0          |                |
| mode     | varchar(12) | YES  |     | substring  |                |
| flags    | text        | YES  |     | NULL       |                |
| enabled  | int(11)     | YES  |     | 1          |                |
| priority | int(11)     | YES  |     | 1000000    |                |
| path     | blob        | YES  |     | NULL       |                |
| comment  | text        | YES  |     | NULL       |                |
+----------+-------------+------+-----+------------+----------------+
This one is rather more complex, so let's go through it field by field, and follow that with some examples.
The value of the 'pattern' field is key; it is to this value that aspects of the client request are compared to see if a particular rule matches or not. It might contain a string or a regular expression; how it is interpreted is controlled by the 'mode' field.
Ordinarily, the value of the user-agent line of the robots.txt stanza will come directly from the request's 'User-agent' header field. The 'alias' field in the table provides a means to override this. For instance, the rule may actually have matched Firefox, but you can say that it matched Opera.
The awkwardly-named 'field' field specifies which aspect of the request is to be matched against the pattern. I have found use only for the user-agent and the IP address, but there is no reason others might not be used.
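As a rough illustration of how the 'field' value might select what gets compared, here is a minimal Python sketch. The function name `request_value`, the keyword `ip` for the client address, and the idea of treating any other value as a request header name are all assumptions made for this example; only `user-agent` appears in the table itself.

```python
def request_value(field, headers, remote_addr):
    """Pick the aspect of the request that a rule should be compared against."""
    # The table's default for 'field' is 'user-agent'.
    name = (field or "user-agent").lower()
    if name == "ip":  # hypothetical keyword for the client's IP address
        return remote_addr
    # Assumption: anything else names a request header (case-insensitive lookup).
    for key, value in headers.items():
        if key.lower() == name:
            return value
    return ""


headers = {"User-Agent": "ExampleBot/1.0"}
print(request_value("user-agent", headers, "192.0.2.10"))
print(request_value("ip", headers, "192.0.2.10"))
```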
The 'access' field indicates whether the output from this rule should go into an 'allow' line or a 'disallow' line.
The 'caseful' field, quite simply, says whether differences in case matter when matching against the pattern value.
'Mode' controls how the value is compared to the pattern; I use 'exact', 'substring', or 'regex'.
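The interplay of 'pattern', 'mode', and 'caseful' can be sketched in a few lines of Python. This is only an illustration, not the code behind the table: the function name `rule_matches` is invented here, and it assumes that a non-zero 'caseful' means the comparison is case-sensitive.

```python
import re


def rule_matches(pattern, value, mode="substring", caseful=False):
    """Return True if the request value satisfies the rule's pattern."""
    if mode == "regex":
        # Assumption: caseful=False maps to a case-insensitive regex.
        flags = 0 if caseful else re.IGNORECASE
        return re.search(pattern, value, flags) is not None
    if not caseful:
        pattern, value = pattern.lower(), value.lower()
    if mode == "exact":
        return value == pattern
    if mode == "substring":
        return pattern in value
    raise ValueError(f"unknown mode: {mode!r}")


agent = "Mozilla/5.0 (compatible; ExampleBot/2.1)"
print(rule_matches("examplebot", agent))                # substring, case ignored
print(rule_matches("examplebot", agent, caseful=True))  # substring, case matters
```

The defaults mirror the table's defaults: 'substring' for the mode and 0 for 'caseful'.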
The 'flags' field contains a comma-separated list of keywords that impose conditions on how the rule should be processed, such as “process this rule only if no others have matched yet”, or “if this rule matches, don't check any others”.
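To make the two example conditions concrete, here is a hedged Python sketch of a rule loop that honors them. The keyword spellings `unmatched-only` and `last` are hypothetical, as are the function names and the use of dictionaries for rows.

```python
def parse_flags(flags):
    """Split a comma-separated flags value into a set of lowercase keywords."""
    if not flags:
        return set()
    return {word.strip().lower() for word in flags.split(",") if word.strip()}


def apply_rules(rules, is_match):
    """Walk the rules in order and return those that matched."""
    matched = []
    for rule in rules:
        flags = parse_flags(rule.get("flags"))
        # Hypothetical keyword: process this rule only if nothing has matched yet.
        if "unmatched-only" in flags and matched:
            continue
        if is_match(rule):
            matched.append(rule)
            # Hypothetical keyword: if this rule matches, don't check any others.
            if "last" in flags:
                break
    return matched


rules = [
    {"xid": 1, "flags": None},
    {"xid": 2, "flags": "unmatched-only"},
    {"xid": 3, "flags": "last"},
    {"xid": 4, "flags": ""},
]
print([r["xid"] for r in apply_rules(rules, lambda rule: True)])
```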
'enabled' is simply a Boolean flag indicating whether this rule should be ignored or not.
The 'priority' field contains an integer value that is used to sort the enabled rules into a particular order before processing them.
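Taken together, 'enabled' and 'priority' amount to a filter followed by a sort. The Python sketch below shows one way that could look; it assumes lower priority numbers are processed first, which the table does not actually specify, and the function name `active_rules` is invented for the example.

```python
DEFAULT_PRIORITY = 1000000  # the column default shown in the table


def active_rules(rules):
    """Drop disabled rules, then order the remainder by priority."""
    def priority_of(rule):
        value = rule.get("priority")
        return DEFAULT_PRIORITY if value is None else value

    enabled = [rule for rule in rules if rule.get("enabled", 1)]
    # Assumption: ascending order, so a smaller number is handled earlier.
    return sorted(enabled, key=priority_of)


rules = [
    {"xid": 1, "enabled": 1, "priority": 500},
    {"xid": 2, "enabled": 0, "priority": 1},
    {"xid": 3, "enabled": 1},
    {"xid": 4, "enabled": 1, "priority": 10},
]
print([rule["xid"] for rule in active_rules(rules)])
```

In a real deployment the same effect would more likely come from the query itself, with a WHERE clause on 'enabled' and an ORDER BY on 'priority'.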
The contents of the 'path' field are appended to the 'allow' or 'disallow' line if this rule gets matched.
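The 'alias', 'access', and 'path' fields between them determine the text of the generated stanza. As a minimal sketch in Python (the function name `robots_stanza` is hypothetical, and letting the first alias found win is an assumption), the output might be assembled like this:

```python
def robots_stanza(user_agent, matched_rules):
    """Build one robots.txt stanza from the rules that matched a request."""
    # Assumption: the first matched rule that carries an alias overrides the
    # User-agent value taken from the request header.
    agent = next((r["alias"] for r in matched_rules if r.get("alias")), user_agent)
    lines = [f"User-agent: {agent}"]
    for rule in matched_rules:
        access = (rule.get("access") or "disallow").lower()
        directive = "Allow" if access == "allow" else "Disallow"
        path = rule.get("path") or ""
        lines.append(f"{directive}: {path}".rstrip())
    return "\n".join(lines) + "\n"


matched = [{"alias": None, "access": "disallow", "path": "/"}]
print(robots_stanza("ExampleBot/1.0", matched), end="")
```

Called as above, this prints a "User-agent: ExampleBot/1.0" line followed by "Disallow: /".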
The 'comment' field has no purpose other than to help you remember what this rule is supposed to be doing in the first place.