spider.pl - Example Perl program to spider web servers
spider.pl [<spider config file>] [<URL> ...]
# Spider using some common defaults and capture the output
# into a file
./spider.pl default http://myserver.com/ > output.txt
# or using a config file
spider.config:
@servers = (
{
base_url => 'http://myserver.com/',
email => 'me@myself.com',
# other spider settings described below
},
);
./spider.pl spider.config > output.txt
# or using the default config file SwishSpiderConfig.pl
./spider.pl > output.txt
# using with swish-e
./spider.pl spider.config | swish-e -c swish.config -S prog -i stdin
# or in two steps
./spider.pl spider.config > output.txt
swish-e -c swish.config -S prog -i stdin < output.txt
# or with compression
./spider.pl spider.config | gzip > output.gz
gzip -dc output.gz | swish-e -c swish.config -S prog -i stdin
# or having swish-e call the spider directly using the
# spider config file SwishSpiderConfig.pl:
swish-e -c swish.config -S prog -i spider.pl
# or the above but passing a parameter to the spider:
echo "SwishProgParameters spider.config" >> swish.config
echo "IndexDir spider.pl" >> swish.config
swish-e -c swish.config -S prog
Note: When running on some versions of Windows (e.g. Win ME and Win 98 SE)
you may need to tell Perl to run the spider directly:
perl spider.pl | swish-e -S prog -c swish.conf -i stdin
This pipes the output of the spider directly into swish.
spider.pl is a program for fetching documents from a web server; it outputs the documents to STDOUT in a special format designed to be read by Swish-e.
The spider can index non-text documents such as PDF and MS Word by use of filter (helper) programs. These programs are not part of the Swish-e distribution and must be installed separately. See the section on filtering below.
A configuration file is normally used to control what documents are fetched from the web server(s). The configuration file and its options are described below. There is also a "default" config suitable for spidering.
The spider is designed to spider web pages and fetch documents from one host at a time -- offsite links are not followed. But, you can configure the spider to spider multiple sites in a single run.
spider.pl is distributed with Swish-e and is installed in the swish-e library directory at installation time. This directory (libexecdir) can be seen by running the command:
swish-e -h
Typically on unix-type systems the spider is installed at:
/usr/local/lib/swish-e/spider.pl
This spider stores all links in memory while processing and does not do parallel requests.
The output from spider.pl can be captured to a temporary file which is then fed into swish-e:
./spider.pl > docs.txt
swish-e -c config -S prog -i stdin < docs.txt
or the output can be passed to swish-e via a pipe:
./spider.pl | swish-e -c config -S prog -i stdin
or swish-e can run the spider directly:
swish-e -c config -S prog -i spider.pl
One advantage of having Swish-e run spider.pl is that Swish-e knows where to locate the program (based on libexecdir compiled into swish-e).
When running the spider without any parameters it looks for a configuration file called SwishSpiderConfig.pl in the current directory. The spider will abort with an error if this file is not found.
A configuration file can be specified as the first parameter to the spider:
./spider.pl spider.config > output.txt
If running the spider via Swish-e (i.e. Swish-e runs the spider) then use the Swish-e config option SwishProgParameters to specify the config file:
In swish.config:
# Use spider.pl as the external program:
IndexDir spider.pl
# And pass the name of the spider config file to the spider:
SwishProgParameters spider.config
And then run Swish-e like this:
swish-e -c swish.config -S prog
Finally, by using the special word "default" on the command line the spider will use a default configuration that is useful for indexing most sites. It's a good way to get started with the spider:
./spider.pl default http://my_server.com/index.html > output.txt
There's no "best" way to run the spider. I like to capture to a file and then feed that into Swish-e.
The spider does require Perl's LWP library and a few other reasonably common modules. Most well maintained systems should have these modules installed. See REQUIREMENTS below for more information. It's a good idea to check that you are running a current version of these modules.
Note: the "prog" document source in Swish-e bypasses many Swish-e configuration settings. For example, you cannot use the IndexOnly directive with the "prog" document source. This is by design to limit the overhead when using an external program for providing documents to swish; after all, with "prog", if you don't want to index a file, then don't give it to swish to index in the first place.
So, for spidering, if you do not wish to index images, for example, you will need to either filter by the URL or by the content-type returned from the web server. See "CALLBACK FUNCTIONS" below for more information.
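For example, a minimal sketch of a test_response callback (described under the callback functions below) that rejects anything the server does not report as HTML or plain text; the set of accepted content types here is only an illustration:

test_response => sub {
    my ( $uri, $server, $response ) = @_;
    # Only hand HTML and plain text documents to swish-e;
    # images and other binary types are skipped.
    return $response->content_type =~ m!^text/(?:html|plain)!;
},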
By default, this script will not spider files blocked by robots.txt. In addition, the script will check for <meta name="robots"..> tags, which allows finer control over what files are indexed and/or spidered. See http://www.robotstxt.org/wc/exclusion.html for details.
This spider provides an extension to the <meta> tag exclusion by adding a NOCONTENTS attribute. This attribute turns on the no_contents setting, which asks swish-e to only index the document's title (or file name if no title is found).
For example:
<META NAME="ROBOTS" CONTENT="NOCONTENTS, NOFOLLOW">
says to just index the document's title, but don't index its contents, and don't follow any links within the document. Granted, it's unlikely that this feature will ever be used...
If you are indexing your own site, and know what you are doing, you can disable robot exclusion with the ignore_robots_file configuration parameter, described below. This disables both robots.txt and the meta tag parsing. You may disable just the meta tag parsing by using ignore_robots_headers.
This script only spiders one file at a time, so load on the web server is not that great. And with libwww-perl-5.53_91 HTTP/1.1 keep alive requests can reduce the load on the server even more (and potentially reduce spidering time considerably).
Still, discuss spidering with a site's administrator before beginning. Use the delay_sec setting to adjust how fast the spider fetches documents. Consider running a second web server with a limited number of children if you really want to fine tune the resources used by spidering.
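For example, a server entry that waits between requests might look like this sketch (the host name, email address, and five-second delay are only placeholders):

@servers = (
    {
        base_url  => 'http://myserver.com/',
        email     => 'me@myself.com',
        delay_sec => 5,   # wait five seconds between requests
    },
);
1;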
The spider program keeps track of URLs visited, so a document is only indexed one time.
The Digest::MD5 module can be used to create a "fingerprint" of every page indexed and this fingerprint is used in a hash to find duplicate pages. For example, MD5 will prevent indexing these as two different documents:
http://localhost/path/to/some/index.html
http://localhost/path/to/some/
But note that this may have side effects you don't want. If you want this file indexed under this URL:
http://localhost/important.html
But the spider happens to find the exact content in this file first:
http://localhost/development/test/todo/maybeimportant.html
Then only that URL will be indexed.
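A sketch of a configuration using the MD5 feature, assuming the option that enables it is use_md5 (see the configuration options below); the host name and email address are placeholders:

@servers = (
    {
        base_url => 'http://localhost/',
        email    => 'me@myself.com',
        use_md5  => 1,   # fingerprint content to skip duplicate pages
    },
);
1;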
Sometimes web page authors use too many /../ segments in relative URLs which reference documents above the document root. Some web servers such as Apache will return a 400 Bad Request when requesting a document above the root. Other web servers such as Microsoft IIS/5.0 will try and "correct" these errors. This correction will lead to loops when spidering.
The spider can fix these above-root links by placing the following in your spider config:
remove_leading_dots => 1,
It is not on by default so that the spider can report the broken links (as 400 errors on sane webservers).
If the Perl module Compress::Zlib is installed the spider will send the
Accept-Encoding: gzip
header and uncompress the document if the server returns the header
Content-Encoding: gzip
MD5 checksums are done on the compressed data.
MD5 may slow down indexing a tiny bit, so test with and without if speed is an issue (which it probably isn't since you are spidering in the first place). This feature will also use more memory.
Perl 5 (hopefully at least 5.00503) or later.
You must have the LWP Bundle on your computer. Load the LWP::Bundle via the CPAN.pm shell, or download libwww-perl-x.xx from CPAN (or via ActiveState's ppm utility). Also required is the HTML-Parser-x.xx bundle of modules, also from CPAN (and from ActiveState for Windows).
http://search.cpan.org/search?dist=libwww-perl
http://search.cpan.org/search?dist=HTML-Parser
You will also need Digest::MD5 if you wish to use the MD5 feature. HTML::Tagset is also required. Other modules may be required (for example, the pod2xml.pm module has its own requirements -- see perldoc pod2xml for info).
The spider.pl script, like everyone else, expects perl to live in /usr/local/bin. If this is not the case then either add a symlink at /usr/local/bin/perl to point to where perl is installed or modify the shebang (#!) line at the top of the spider.pl program.
Note that the libwww-perl package does not support SSL (Secure Sockets Layer) (https) by default. See README.SSL included in the libwww-perl package for information on installing SSL support.
The spider configuration file is read by the script as Perl code. This makes the configuration a bit more complex than simple text config files, but allows the spider to be configured programmatically.
For example, the config file can contain logic for testing URLs against regular expressions or even against a database lookup while running.
The configuration file sets an array called @servers. This array can contain one or more hash structures of parameters. Each hash structure is a configuration for a single server.
Here's an example:
my %main_site = (
base_url => 'http://example.com',
same_hosts => 'www.example.com',
email => 'admin@example.com',
);
my %news_site = (
base_url => 'http://news.example.com',
email => 'admin@example.com',
);
@servers = ( \%main_site, \%news_site );
1;
The above defines two Perl hashes (%main_site and %news_site) and then places a *reference* (the backslash before the name of the hash) to each of those hashes in the @servers array. The "1;" at the end is required at the end of the file (Perl must see a true value at the end of the file).
The config file path is the first parameter passed to the spider script:
./spider.pl config
If you do not specify a config file then the spider will look for the file SwishSpiderConfig.pl in the current directory.
The Swish-e distribution includes a SwishSpiderConfig.pl file with a few example configurations. This example file is installed in the prog-bin/ documentation directory (on unix often this is /usr/local/share/swish-e/prog-bin).
When the special config file name "default" is used:
SwishProgParameters default http://www.mysite/index.html [<URL>] [...]
Then a default set of parameters are used with the spider. This is a good way to start using the spider before attempting to create a configuration file.
The default settings skip any urls that look like images (well, .gif .jpeg .png), and attempt to filter PDF and MS Word documents IF you have the required filter programs installed (which are not part of the Swish-e distribution). The spider will follow "a" and "frame" type of links only.
Note that if you do use a spider configuration file, the default configuration will NOT be used (unless you set the "use_default_config" option in your config file).
This describes the required and optional keys in the server configuration hash, in random order...
base_url
This is the starting URL (or URLs) for spidering the server. More than one may be given as an array reference, and each base URL is checked by the test_url callback function just like extracted links:

base_url => [qw! http://swish-e.org/ http://othersite.org/other/index.html !],

A username and password may be embedded in the URL:

base_url => 'http://user:pass@swish-e.org/index.html',

If a URL is password protected and no password is supplied, the spider will prompt for one at the terminal. The max_wait_time setting controls how long to wait for user entry before skipping the current URL. See also credentials below.

same_hosts
This optional setting lists host names that should be treated as the same server as the one in base_url. For example, if your site is www.mysite.edu but also can be reached by mysite.edu (with or without www) and also web.mysite.edu, then:

$serverA{base_url} = 'http://www.mysite.edu/index.html';
$serverA{same_hosts} = ['mysite.edu', 'web.mysite.edu'];

With that setting a link found while spidering such as

http://web.mysite.edu/path/to/file.html

will be treated as being on the same site and indexed as

http://www.mysite.edu/path/to/file.html

The comparison is of host:port against the list of host names in same_hosts. So, if you specify a port in base_url you will want to specify the port in the list of hosts in same_hosts:

my %serverA = (
    base_url => 'http://sunsite.berkeley.edu:4444/',
    same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
    email => 'my@email.address',
);

link_tags
This optional setting is a reference to an array of HTML tags from which links are extracted. For example, to extract links from "a" tags and from "frame" tags:

my %serverA = (
    base_url => 'http://sunsite.berkeley.edu:4444/',
    same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
    email => 'my@email.address',
    link_tags => [qw/ a frame /],
);
use_default_config
The spider has a built-in default configuration that is used when it is run with the special config name "default":

./spider.pl default <url>

The default settings skip URLs whose path matches /\.(?:gif|jpeg|png)$/i and filter PDF and MS Word documents if the required filter programs are installed. The default configuration is:

@servers = (
    {
        email => 'swish@user.failed.to.set.email.invalid',
        link_tags => [qw/ a frame /],
        keep_alive => 1,
        test_url => sub { $_[0]->path !~ /\.(?:gif|jpeg|png)$/i },
        test_response => $response_sub,
        use_head_requests => 1, # Due to the response sub
        filter_content => $filter_sub,
    } );

Set use_default_config to a true value in your own configuration file to combine these defaults with your own settings. For example, if you run the spider as:

./spider.pl my_own_spider.config

then my_own_spider.config might contain:

@servers = (
    {
        email => 'my@email.address',
        use_default_config => 1,
        delay_sec => 0,
    },
);
1;
max_time
This optional setting limits how long, in minutes, to spider each server. Spidering stops after max_time minutes, and moves on to the next server, if any. The default is to not limit by time.

max_files
This optional setting limits how many URLs are fetched from a server (see also max_indexed). This count is displayed at the end of indexing as Unique URLs.

max_indexed
This optional setting limits how many documents are passed on to swish for indexing (this count is displayed as Total Docs when spidering ends).

keep_alive
If true, the spider uses HTTP/1.1 keep alive requests, which can reduce the load on the web server and speed up spidering. See also use_head_requests below. With keep alives enabled it is better to use test_url to look for files ending in .html instead of using test_response to look for a content type of text/html, if possible. Do note that aborting a request from test_response will break the current keep alive connection.

use_head_requests
This option is only used when you have a test_response callback function. This option is also only used when keep_alive is also enabled (although it could be debated that it's useful without keep alives). The test_response callback function is a good place to test the content-type header returned from the server and reject types that you do not want to index, but when using the keep_alive feature rejecting a document will often (always?) break the keep alive connection.

What the use_head_requests option does is issue a HEAD request for every document, check for a Content-Length header (to check if the document is larger than max_size), and then call your test_response callback function (if one is defined in your config file). If your callback function returns true then a GET request is used to fetch the document. Returning false from the test_response callback function (i.e. rejecting the document) will not break the keep alive connection.

Note: if you do not have a test_response callback AND max_size is set to zero then setting use_head_requests will have no effect.
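Here is a sketch of a server entry combining keep_alive and use_head_requests; the content-type test in the callback is only an illustration:

@servers = (
    {
        base_url          => 'http://myserver.com/',
        email             => 'me@myself.com',
        keep_alive        => 1,   # reuse the HTTP connection
        use_head_requests => 1,   # HEAD first, GET only if accepted
        test_response     => sub {
            my ( $uri, $server, $response ) = @_;
            # Rejecting here only discards the HEAD request, so the
            # keep alive connection is not broken.
            return $response->content_type =~ m!text/html!;
        },
    },
);
1;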
debug
This option enables printing of debugging information while spidering. The following debugging options are available: errors, failed, headers, info, links, redirect, skipped, url
errors => general program errors (not used at this time)
url => print out every URL processed
headers => prints the response headers
failed => failed to return a 200
skipped => didn't index for some reason
info => a little more verbose
links => prints links as they are extracted
redirect => prints out redirected URLs
Debugging can also be enabled by setting the SPIDER_DEBUG environment variable:

SPIDER_DEBUG=url,links spider.pl [....]

In the config file the DEBUG_* constants may be combined instead:

debug => DEBUG_URL | DEBUG_FAILED | DEBUG_SKIPPED,

Either way, the debug parameter is converted to a number.

quiet
If true, normal (non-error) status messages are suppressed. Quiet mode can also be set with the environment variable SPIDER_QUIET:

SPIDER_QUIET=1
use_md5
If true, the spider uses a Digest::MD5 fingerprint of each document to avoid indexing the same content fetched under different URLs (see the discussion of MD5 above). For example, these two URLs probably return the same document:

http://localhost/path/to/index.html
http://localhost/path/to/

max_depth
The max_depth parameter can be used to limit how deeply to recurse a web site. The depth is just a count of levels of web pages descended, and not related to the number of path elements in a URL. A max_depth of zero spiders only the page given as the base_url. A max_depth of one will spider the base_url page, plus all links on that page, and no more. The default is to spider all pages.
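For example, a sketch of a server entry that spiders only the starting page and the pages it links to directly (the host name and email address are placeholders):

@servers = (
    {
        base_url  => 'http://myserver.com/index.html',
        email     => 'me@myself.com',
        max_depth => 1,   # the base page plus its links, nothing deeper
    },
);
1;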
credentials
You may specify a username and password to be used automatically for every request to this server (that is, a site-wide password):

credentials => 'username:password',

See also the get_password callback function below. get_password, if defined, will be called when a page requires authorization.

credential_timeout
How long, in seconds, to wait for a username and password to be entered at the terminal when a page requires authorization. Set it to undef to wait forever:

credential_timeout => undef,
Callback functions can be defined in your parameter hash. These optional settings are callback subroutines that are called while processing URLs.
A little perl discussion is in order:
In perl, a scalar variable can contain a reference to a subroutine. The config example above shows that the configuration parameters are stored in a perl hash.
my %serverA = (
base_url => 'http://sunsite.berkeley.edu:4444/',
same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
email => 'my@email.address',
link_tags => [qw/ a frame /],
);
There are two ways to add a reference to a subroutine to this hash:
sub foo { return 1; }
my %serverA = (
base_url => 'http://sunsite.berkeley.edu:4444/',
same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
email => 'my@email.address',
link_tags => [qw/ a frame /],
test_url => \&foo, # a reference to a named subroutine
);
Or the subroutine can be coded right in place:
my %serverA = (
base_url => 'http://sunsite.berkeley.edu:4444/',
same_hosts => [ qw/www.sunsite.berkeley.edu:4444/ ],
email => 'my@email.address',
link_tags => [qw/ a frame /],
test_url => sub { return 1; },
);
The above example is not very useful as it just creates a user callback function that always returns a true value (the number 1). But, it's just an example.
The function calls are wrapped in an eval, so calling die (or doing something that dies) will just cause that URL to be skipped. If you really want to stop processing you need to set $server->{abort} in your subroutine (or send a kill -HUP to the spider).
The first two parameters passed are a URI object (to have access to the current URL), and a reference to the current server hash. The server
hash is just a global hash for holding data, and useful for setting flags as described below.
Other parameters may also be passed in, depending on the callback function, as described below. In perl, parameters are passed in an array called "@_". The first element (first parameter) of that array is $_[0], the second is $_[1], and so on. Depending on how complicated your function is you may wish to shift your parameters off of the @_ list to make working with them easier. See the examples below.
To make use of these routines you need to understand when they are called, and what changes you can make in your routines. Each routine deals with a given step, and returning false from your routine will stop processing for the current URL.
test_url
This callback allows you to skip processing of urls based on the url before the request to the server is made. This function is called for the base_url links (links you define in the spider configuration file) and for every link extracted from a fetched web page. For example, to skip requests for images:

test_url => sub {
    my $uri = shift;
    return 0 if $uri->path =~ /\.(gif|jpeg|png)$/;
    return 1;
},

or, more tersely:

test_url => sub { $_[0]->path !~ /\.(gif|jpeg|png)$/ },

The URI object may also be modified here before the link is added to the list of links to spider. For example, to lowercase all paths:

test_url => sub {
    my $uri = shift;
    return 0 if $uri->path =~ /\.(gif|jpeg|png)$/;
    $uri->path( lc $uri->path ); # make all path names lowercase
    return 1;
},

One thing to note about test_url (compared to the other callback functions) is that it is called while extracting links, not while actually fetching that page from the web server. Returning false from test_url simply says to not add the URL to the list of links to spider. You may also set the abort flag in the server hash to stop spidering entirely:

test_url => sub {
    my $server = $_[1];
    $server->{abort}++ if $_[0]->path =~ /foo\.html/;
    return 1;
},
test_response
This callback allows you to filter based on the response returned from the web server, and is a good place to test the content-type header and reject types that you do not want to index. If you are using use_head_requests then this function is called after the spider makes a HEAD request. Otherwise, this function is called while the web page is being fetched from the remote server, typically after just enough data has been returned to read the response from the web server. The callback is passed four parameters:

( $uri, $server, $response, $content_chunk )

Unless you are using use_head_requests the spider requests a document in "chunks" of 4096 bytes. 4096 is only a suggestion of how many bytes to return in each chunk. The test_response routine is called when the first chunk is received only. This allows ignoring (aborting) reading of a very large file, for example, without having to read the entire file. Although not much use, a reference to this chunk is passed as the fourth parameter. Aborting a GET request kills the keep-alive session, so consider the use_head_requests and keep_alive features together. For example, to accept only HTML:

test_response => sub {
    my $content_type = $_[2]->content_type;
    return $content_type =~ m!text/html!;
},

This callback may also set the following flags in the server hash:
no_contents -- index only the title (or file name), and not the contents
no_index -- do not index this file, but continue to spider if HTML
no_spider -- index, but do not spider this file for links to follow
abort -- stop spidering any more files
For example, to spider but not index a particular page:

test_response => sub {
    my $server = $_[1];
    $server->{no_index}++ if $_[0]->path =~ /private\.html$/;
    return 1;
},
Setting the abort server flag and returning false will abort spidering.

filter_content
This callback function is called right before the document is passed on to swish, and is also where you can set the no_contents flag. As with the other callbacks it is passed the URI object and the server hash, followed by the response object and a reference to a scalar containing the document's content, which may be modified in place. For example, to lowercase all of the content:

filter_content => sub {
    my $content_ref = $_[3];
    $$content_ref = lc $$content_ref;
    return 1;
},
The URI object may also be modified here, for example to change the host name that is passed on to swish with the document:

filter_content => sub {
    my $uri = $_[0];
    $uri->host('www.other.host');
    return 1;
},
A modular way to filter documents is to use the pair of subroutines returned by the swish_filter() helper:

my ($filter_sub, $response_sub) = swish_filter();

@servers = ( {
    test_response => $response_sub,
    filter_content => $filter_sub,
    [...],
} );

The $response_sub is not required, but is useful when using HEAD requests (use_head_requests): It tests the content type from the server to see if there's any filters that can handle the document. The $filter_sub does all the work of filtering a document.

output_function
If defined, this callback is called instead of printing each document to STDOUT. It is passed six parameters:

($server, $content, $uri, $response, $bytecount, $path);

For example:
output_function => sub {
    my ($server, $content, $uri, $response, $bytecount, $path) = @_;
    print STDERR "passed: uri $uri, bytecount $bytecount...\n";
    # no output to STDOUT for swish-e
},
get_password
If defined, this callback is called when a request requires authorization. It is passed the URI object, the server hash, the response object, and the authentication realm, and should return a "username:password" string:

get_password => sub {
    my ( $uri, $server, $response, $realm ) = @_;
    if ( $uri->path =~ m!^/path/to/protected! && $realm eq 'private' ) {
        return 'joe:secret931password';
    }
    return; # sorry, I don't know the password.
},
You can also use the credentials setting if you know the username and password and they will be the same for every request. That is, for a site-wide password.

Note that you can create your own counters to display in the summary list when spidering is finished by adding a value to the hash pointed to by $server->{counts}:
test_url => sub {
    my $server = $_[1];
    $server->{no_index}++ if $_[0]->path =~ /private\.html$/;
    $server->{counts}{'Private Files'}++;
    return 1;
},
Each callback function must return true to continue processing the URL. Returning false will cause processing of the current URL to be skipped.
Swish (not this spider) has a configuration directive NoContents
that will instruct swish to index only the title (or file name), and not the contents. This is often used when indexing binary files such as image files, but can also be used with html files to index only the document titles.
As shown above, you can turn this feature on for specific documents by setting a flag in the server hash passed into the test_response
or filter_content
subroutines. For example, in your configuration file you might have the test_response
callback set as:
test_response => sub {
    my ( $uri, $server, $response ) = @_;
    # tell swish not to index the contents if this is of type image
    $server->{no_contents} = $response->content_type =~ m[^image/];
    return 1; # ok to index and spider this document
}
The entire contents of the resource are still read from the web server and passed on to swish, but swish will also be passed a No-Contents
header which tells swish to enable the NoContents feature for this document only.
Note: Swish will index the path name only when NoContents
is set, unless the document's type (as set by the swish configuration settings IndexContents
or DefaultContents
) is HTML and a title is found in the html document.
Note: In most cases you probably would not want to send a large binary file to swish, just to be ignored. Therefore, it would be smart to use a filter_content
callback routine to replace the contents with a single character (you cannot use the empty string at this time).
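A sketch of such a filter_content callback; treating everything with an image/* content type this way is only an illustration:

filter_content => sub {
    my ( $uri, $server, $response, $content_ref ) = @_;
    if ( $response->content_type =~ m[^image/] ) {
        $server->{no_contents} = 1;   # index the title/file name only
        $$content_ref = ' ';          # a single character, not an empty string
    }
    return 1;
},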
A similar flag may be set to prevent indexing a document at all, but still allow spidering. In general, if you want to completely skip spidering a file you return false from one of the callback routines (test_url
, test_response
, or filter_content
). Returning false from any of those three callbacks will stop processing of that file, and the file will not be spidered.
But there may be some cases where you still want to spider (extract links) yet, not index the file. An example might be where you wish to index only PDF files, but you still need to spider all HTML files to find the links to the PDF files.
$server{test_response} = sub {
    my ( $uri, $server, $response ) = @_;
    $server->{no_index} = $response->content_type ne 'application/pdf';
    return 1; # ok to spider, but don't index
}
So, the difference between no_contents
and no_index
is that no_contents
will still index the file name, just not the contents. no_index
will still spider the file (if it's text/html
) but the file will not be processed by swish at all.
Note: If no_index
is set in a test_response
callback function then the document will not be filtered. That is, your filter_content
callback function will not be called.
The no_spider flag can be set to avoid spidering an HTML file. The file will still be indexed unless no_index is also set. But if you want neither to index nor spider a file, simply return false from one of the three callback functions.
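For example, a sketch of a test_response callback that indexes pages under a hypothetical /archive/ path but does not follow their links:

test_response => sub {
    my ( $uri, $server, $response ) = @_;
    # Index these pages, but do not extract links from them.
    $server->{no_spider}++ if $uri->path =~ m!^/archive/!;
    return 1;
},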
Sending a SIGHUP to the running spider will cause it to stop spidering. This is a good way to abort spidering, but let swish index the documents retrieved so far.
List of some of the changes
Code reorganization and a few new features. Updated docs a little tiny bit. Introduced a few spelling mistakes.
spider.pl default <some url>
Add a "get_document" callback that is called right before making the "GET" request. This would make it easier to use cached documents. You can do that now in a test_url callback or in a test_response when using HEAD request.
Save state of the spider on SIGHUP so spidering could be restored at a later date.
Copyright 2001 Bill Moseley
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.
Send all questions to the SWISH-E discussion list.
See http://sunsite.berkeley.edu/SWISH-E.