SWISH-CONFIG -- Configuration File Directives
A configuration file controls which files Swish-e indexes, how they are indexed, and where the index is written.
The configuration file is a text file composed of comments, blank lines, and configuration directives. The order of the directives is not important. Some directives may be used more than once in the configuration file, while others can only be used once (i.e. additional directives will overwrite preceding directives). Case of the directive is not important -- you may use upper, lower, or mixed case.
A comment is any line that begins with a "#".
# This is a comment
As of 2.4.3 lines may be continued by placing a backslash as the last character on the line:
IgnoreWords \
am \
the \
foo
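The continuation rule above can be modelled in a few lines of Python. This is only a sketch of the described behavior, not Swish-e's own parser; the function name join_continued_lines is hypothetical, and it assumes the backslash is the very last character on the line with no trailing whitespace.

```python
def join_continued_lines(text):
    """Join every line that ends in a backslash with the line that follows it."""
    logical_lines = []
    pending = ""
    for raw in text.splitlines():
        if raw.endswith("\\"):
            # Drop the trailing backslash and keep accumulating.
            pending += raw[:-1]
        else:
            logical_lines.append(pending + raw)
            pending = ""
    if pending:
        # A dangling continuation on the final line is kept as-is.
        logical_lines.append(pending)
    return logical_lines


config = "IgnoreWords \\\nam \\\nthe \\\nfoo"
print(join_continued_lines(config))
```

The four physical lines of the example collapse into the single directive "IgnoreWords am the foo".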
Directives may take more than one parameter. Enclose single parameters that include whitespace in quotes (single or double). Inside of quotes the backslash escapes the next character.
ReplaceRules append "foo bar" <- define "foo bar" as a single parameter
If you need to include a quote character in the value either use a backslash to escape it, or enclose it in quotes of the other type.
For example, under unix you can use quotes to include white space in a single parameter. Here, to protect against path names (%p) that might have white space embedded use single quotes (this also protects against shell expansion or metacharacters):
FileFilter .foo foofilter "'%p'" <- parameter passed through the shell in single quotes
FileFilter .foo foofilter '"%p"' <- windows uses double-quotes
FileFilter .foo foofilter '\'%p\'' <- silly example
Backslashes also have special meaning in regular expressions.
FileFilterMatch pdftotext "'%p' -" /\.pdf$/
This says that the dot is a real dot (instead of matching any character). If you place the regular expression in quotes then you must use double-backslashes.
FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"
Swish-e will convert the double backslash into a single backslash before passing the parameter to the regular expression compiler.
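To make the quoting rules concrete, here is a small Python model of how a directive line could be split into parameters. It is a sketch under the rules stated above (whitespace separates parameters, either quote type groups text, a backslash inside quotes escapes the next character, a backslash outside quotes is left alone). The function name split_parameters is hypothetical and this is not Swish-e's actual parser.

```python
def split_parameters(line):
    """Split one directive line into its parameters (sketch, not Swish-e's parser)."""
    params, current = [], []
    in_param = False   # True once the current parameter has started
    quote = None       # the quote character we are inside of, if any
    i = 0
    while i < len(line):
        ch = line[i]
        if quote:
            if ch == "\\" and i + 1 < len(line):
                # Inside quotes a backslash escapes the next character.
                current.append(line[i + 1])
                i += 1
            elif ch == quote:
                quote = None
            else:
                current.append(ch)
        elif ch in ("'", '"'):
            quote, in_param = ch, True
        elif ch.isspace():
            if in_param:
                params.append("".join(current))
                current, in_param = [], False
        else:
            # Outside quotes a backslash is kept as-is.
            current.append(ch)
            in_param = True
        i += 1
    if in_param:
        params.append("".join(current))
    return params


unquoted = split_parameters(r'''FileFilterMatch pdftotext "'%p' -" /\.pdf$/''')
quoted = split_parameters(r'''FileFilterMatch pdftotext "'%p' -" "/\\.pdf$/"''')
print(unquoted[-1])
print(quoted[-1])
```

Both calls yield the same final parameter, with a single backslash before the dot, which is what the regular expression compiler needs to see.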
Commented example configuration files are included in the conf directory of the Swish-e distribution.
Some command line arguments can override directives specified in the configuration file. Please see also the SWISH-RUN page for instructions on running Swish-e, and the SWISH-SEARCH page for information and examples on how to search your index.
The configuration file is specified to Swish-e by the -c switch. For example,
swish-e -c myconfig.conf
You may also split your directives up into different configuration files. This allows you to have a master configuration file used for many different indexes, and smaller configuration files for each separate index. You can specify the different configuration files when running from the command line with the -c switch (see SWISH-RUN), or you may include other configuration files with the IncludeConfigFile directive below.
Typically, in a configuration file the directives are grouped together in some logical order -- that is, directives that control the source of the documents would be grouped together first, and directives that control how each document is filtered or its words indexed in another group of directives. (The directives listed below are grouped in this order).
The configuration file directives are listed below in these groups:
AbsoluteLinks [yes|NO]
BeginCharacters *string of characters*
BumpPositionCounterCharacters *string*
Buzzwords [*list of buzzwords*|File: path]
ConvertHTMLEntities [YES|no]
DefaultContents [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*]
Delay *seconds*
DontBumpPositionOnEndTags *list of names*
DontBumpPositionOnStartTags *list of names*
EnableAltSearchSyntax [yes|NO]
EndCharacters *string of characters*
EquivalentServer *server alias*
ExtractPath *metaname* [replace|remove|prepend|append|regex]
FileFilter *suffix* *program* [options]
FileFilterMatch *program* *options* *regex* [*regex* ...]
FileInfoCompression [yes|NO]
FileMatch [contains|is|regex] *regular expression*
FileRules [contains|is|regex] *regular expression*
FuzzyIndexingMode [NONE|Stemming|Soundex|Metaphone|DoubleMetaphone]
FollowSymLinks [yes|NO]
HTMLLinksMetaName *metaname*
IgnoreFirstChar *string of characters*
IgnoreLastChar *string of characters*
IgnoreLimit *integer integer*
IgnoreMetaTags *list of names*
IgnoreNumberChars *list of characters*
IgnoreTotalWordCountWhenRanking [YES|no]
IgnoreWords [*list of stop words*|File: path]
ImageLinksMetaName *metaname*
IncludeConfigFile
IndexAdmin *text*
IndexAltTagMetaName *tagname*|as-text
IndexComments [yes|NO]
IndexContents [TXT|HTML|XML|TXT2|HTML2|XML2|TXT*|HTML*|XML*] *file extensions*
IndexDescription *text*
IndexDir [URL|directories or files]
IndexFile *path*
IndexName *text*
IndexOnly *list of file suffixes*
IndexPointer *text*
IndexReport [0|1|2|3]
MaxDepth *integer*
MaxWordLimit *integer*
MetaNameAlias *meta name* *list of aliases*
MetaNames *list of names*
MinWordLimit *integer*
NoContents *list of file suffixes*
obeyRobotsNoIndex [yes|NO]
ParserWarnLevel [0|1|2|3]
PreSortedIndex *list of property names*
PropCompressionLevel [0-9]
PropertyNameAlias *property name* *list of aliases*
PropertyNames *list of meta names*
PropertyNamesCompareCase *list of meta names*
PropertyNamesIgnoreCase *list of meta names*
PropertyNamesNoStripChars *list of meta names*
PropertyNamesDate *list of meta names*
PropertyNamesNumeric *list of meta names*
PropertyNamesMaxLength integer *list of meta names*
PropertyNamesSortKeyLength integer *list of meta names*
ReplaceRules [replace|remove|prepend|append|regex]
ResultExtFormatName name -x format string
SpiderDirectory *path*
StoreDescription [XML <tag>|HTML <meta>|TXT size]
SwishProgParameters *list of parameters*
SwishSearchDefaultRule [<AND-WORD>|<or-word>]
SwishSearchOperators <and-word> <or-word> <not-word>
TmpDir *path*
TranslateCharacters [*string1 string2*|:ascii7:]
TruncateDocSize *number of characters*
UndefinedMetaTags [error|ignore|INDEX|auto]
UndefinedXMLAttributes [DISABLE|error|ignore|index|auto]
UseStemming [yes|NO]
UseSoundex [yes|NO]
UseWords [*list of words*|File: path]
WordCharacters *string of characters*
XMLClassAttributes *list of XML attribute names*
These configuration directives control the general behavior of Swish-e.
IncludeConfigFile /usr/local/swish/conf/site_config.config
-v switch (see SWISH-RUN).
0 = no report
1 = fatal errors
2 = errors
3 = warnings
IndexFile /usr/local/swish/site.index
<meta name="robots" content="noindex">
<!-- SwishCommand noindex -->
<!-- SwishCommand index -->
<!-- noindex -->
<!-- index -->
NOTE: The following items are currently not available. These items require Swish-e to parse the configuration file while searching.
swish-e -w "+word1 +word2 -word3 word4 word5"
"+" = following word has to be in all found documents
"-" = following word may not be in any document found
" " = following word will be searched in documents
SwishSearchOperators UND ODER NICHT
SwishSearchDefaultRule defines the default Boolean operator to use if none is specified between words or phrases. The default is AND.
SwishSearchOperators.
SwishSearchOperators UND ODER NICHT
# Make it act like a web search engine
SwishSearchDefaultRule ODER
-x command line argument. Using ResultExtFormatName you can assign a predefined format string to a name.
ResultExtFormatName moreinfo "%c|%r|%t|%p|<author>|<publishyear>\n"
swish-e ... -x moreinfo ...
-x switch in SWISH-RUN for more information about output formats.
Swish-e stores configuration information in the header of the index file. This information can be retrieved while searching or by functions in the Swish-e C library. There are a number of fields available for your own use. None of these fields are required:
IndexName "Linux Documentation"
IndexDescription "This is an index of /usr/doc on our Linux machine."
IndexPointer http://localhost/swish/linux/index.html
IndexAdmin webmaster
These directives control what documents are indexed and how they are accessed. See also Directives for the File Access method only and Directives for the HTTP Access Method Only for directives that are specific to those access methods.
-S command line argument is used to select the file access method.
swish-e -c swish.config -S fs - file system
swish-e -c swish.config -S http - internal http spider
swish-e -c swish.config -S prog - external program of any type
IndexDir or -i.
# Index this directory and any subdirectories
IndexDir /usr/local/home/http
# Index the docs directory in current directory
IndexDir ./docs
# Index these files in the current directory
IndexDir ./index.html ./page1.html ./page2.html
# and index this directory, too
IndexDir ../public_html
IndexDir http://www.my-site.com/index.html
IndexDir http://localhost/index.html
IndexDir ./myprogram.pl
IndexContents or DefaultContents) then the file will be parsed for an HTML title and that title will be indexed. Note that you must set the file's type with IndexContents or DefaultContents: .html and .htm are NOT type HTML by default. For example:
IndexContents HTML* .htm .html
FileRules title, and the file will be skipped if a match is found. See FileRules.
NoContents .gif .xbm .au .mov .mpg .pdf .ps
IndexOnly to limit the types of files that are indexed, then you must specify in IndexOnly the same suffixes listed in NoContents.
# Wrong!
IndexOnly .htm .html
NoContents .gif .xbm .au .mov .mpg .pdf .ps
-S prog program may set the No-Contents: header to enable this feature for a specific document (although it would be smarter for the -S prog program to simply only send the pathname or title to be indexed).
replace "the string you want replaced" "what to change it to"
remove "a string to remove"
prepend "a string to add before the result"
append "a string to add after the result"
regex "/search string/replace string/options"
\$0 the entire matched (sub)string
\$1-\$9 returns patterns captured in "(" ")" pairs
$` the string before the matched pattern
$' the string after the matched pattern
i ignore the case when matching
g repeat the substitution for the entire pattern
ReplaceRules replace testdir/ anotherdir/
ReplaceRules replace [a-z_0-9]*_m.*\.html index.html
ReplaceRules remove testdir/
ReplaceRules prepend http://localhost/
ReplaceRules append .html
ReplaceRules regex !^/web/([^/]+)/!http://\$1.domain.com/!
replaces a file path:
/web/search/foo/index.html
with
http://search.domain.com/foo/index.html
ReplaceRules regex #^#http://localhost/www#
ReplaceRules prepend http://localhost/www (same thing)
# Remove all extensions from C source files
ReplaceRules remove .c # ERROR! That "." is *any char*
ReplaceRules remove \.c # much better...
ReplaceRules remove "\\.c" # if in quotes you need double-backslash!
ReplaceRules remove "\.c" # ERROR! "\." -> "." and is *any char*
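The regex method of ReplaceRules can be illustrated with a short Python model. This is a sketch only: the helper apply_regex_rule is hypothetical, it writes back-references as $1 rather than \$1, and it uses Python's re module, whose syntax is not guaranteed to match the regular expression library Swish-e is built with. The pattern here uses ([^/]+) so that only the first path segment is captured; a greedy (.+) would also swallow later segments.

```python
import re


def apply_regex_rule(rule, path):
    """Apply a rule written as <delim>search<delim>replace<delim>options."""
    delim = rule[0]
    search, replacement, options = rule[1:].split(delim)
    flags = re.IGNORECASE if "i" in options else 0
    count = 0 if "g" in options else 1  # for re.sub, 0 means "every match"
    # Translate $0..$9 into Python's \g<n> back-reference syntax.
    replacement = re.sub(r"\$(\d)", r"\\g<\1>", replacement)
    return re.sub(search, replacement, path, count=count, flags=flags)


path = "/web/search/foo/index.html"
print(apply_regex_rule("!^/web/([^/]+)/!http://$1.domain.com/!", path))
print(apply_regex_rule("!^/web/(.+)/!http://$1.domain.com/!", path))
print(apply_regex_rule("#^#http://localhost/www#", "/index.html"))
```

The first call produces http://search.domain.com/foo/index.html. The second shows the greedy pitfall, where the capture group takes "search/foo". The third is the prepend-equivalent rule from the examples above.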
IndexContents directive assigns one of Swish-e's document parsers to a document, based on its extension. Swish-e currently knows how to parse TXT, HTML, and XML documents.
IndexContents will, by default, use the HTML2 parser if libxml2 is installed, otherwise will use Swish-e's internal HTML parser. The DefaultContents directive may be used to assign a parser to documents that do not match a file extension defined with the IndexContents directive.
IndexContents HTML* .htm .html .shtml
IndexContents TXT* .txt .log .text
IndexContents XML* .xml
-t in SWISH-RUN).
FileFilter directive) to convert documents you should include those extensions, too. For example, if using a filter to convert .pdf to .html, you need to tell Swish-e that .pdf should be indexed by the internal HTML parser:
FileFilter .pdf pdf2html
IndexContents HTML .pdf
DefaultContents HTML
DefaultContents directive should be used when spidering, as HTML files may be returned without a file extension (such as when requesting a directory and the default index.html is returned).
yes will compress the index file to save disk space. This may result in longer indexing times. The default is no.
-e switch in SWISH-RUN for saving RAM during indexing.
These directives control what information is extracted from your source documents, and how that information is made available during searching.
no if your documents do not contain HTML entities. The default is yes.
ConvertHTMLEntities is set to no the entities will be indexed without conversion.
subjects and then you can search your documents for the word "foo" but only return documents where "foo" is within the subjects META tag.
swish-e -w subjects=foo
-t switch in SWISH-RUN for information about context searching in HTML documents.)
MetaNames meta1 meta2 keywords subjects
UndefinedMetaTags to specify automatic extraction of meta names from your HTML and XML documents, and also to ignore indexing content of meta tags.
<META NAME="meta1" CONTENT="some content">
<meta1>
some content
</meta1>
<meta1>
Some Content
</meta1>
swish-e -w 'meta1=(apples or oranges)'
<keywords>
<tag1>
some content
</tag1>
<tag2>
some other content
</tag2>
</keywords>
swish-e -w 'keywords=(query words)'
swishdefault, so these two queries are the same:
swish-e -w foo
swish-e -w swishdefault=foo
swish-e -w foo
UndefinedMetaTags for how to control the indexing of meta tags.
MetaNames swishtitle
swish-e -w foo -- search for "foo" in title, body (and undefined meta tags)
swish-e -w swishtitle=foo -- search for "foo" in title only
MetaNames swishdocpath
swish-e -w foo swishdocpath=(manual or tutorial)
ExtractPath.
MetaNames summary
MetaNameAlias summary description overview
-w summary=foo
-w description=foo
-w overview=foo
MetaNamesRank 4 subject
MetaNamesRank 3 swishdefault
MetaNamesRank 2 author publisher
MetaNamesRank -5 wrongwords
HTMLLinksMetaName links
-w links='"home.html"'
HTMLLinksMetaName swishdefault
ImageLinksMetaName images
-w images='beach'
ImageLinksMetaName swishdefault
IndexAltTagMetaName bar
<foo>
<img src="/someimage.png" alt="Alt text here">
</foo>
<foo>
<bar>Alt text here</bar>
</foo>
MetaNames and PropertyNames) apply to how that text is indexed.
<foo>
<img src="/someimage.png" alt="Alt text here">
</foo>
<foo>
Alt text here
</foo>
HTMLLinksMetaName and ImageLinksMetaName into absolute URIs. Swish-e will use any <BASE> tag found in the document, otherwise it will use the file's pathname. The pathname used will be the pathname *after* ReplaceRules has been applied to the document's pathname.
ImageLinksMetaName images
AbsoluteLinks is set to no, then an image within that document:
<img src="beach.jpeg">
AbsoluteLinks and Swish-e will index "http://localhost/vacations/france/beach.jpeg". You can then look for images of beaches, but only in France:
-w images=(beach and france)
-w images=(france)
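The link resolution described above (use the <BASE> tag if the document has one, otherwise the document's own pathname) behaves like ordinary relative-URL resolution. The sketch below uses Python's urllib.parse.urljoin to illustrate it. The function absolute_link, the document URL, and the base href are all assumptions made for the example, and Swish-e's own resolution code may differ in edge cases.

```python
from urllib.parse import urljoin


def absolute_link(src, document_path, base_href=None):
    """Resolve a link against the <BASE> href if given, else the document path."""
    return urljoin(base_href or document_path, src)


# Assumed document location (after any ReplaceRules have been applied).
doc = "http://localhost/vacations/france/index.html"
print(absolute_link("beach.jpeg", doc))
# Assumed <BASE href="..."> value found in the document.
print(absolute_link("beach.jpeg", doc, base_href="http://example.org/img/"))
```

With the assumed document path, the first call gives the URL quoted in the text, so a search on the images metaname can match both "beach" and "france".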
MetaNames directive.
UndefinedMetaTags, but only applies to XML documents (parsed with libxml2). This allows indexing of attribute content, and provides a way to index the content under a metaname. For example, UndefinedXMLAttributes can make
<person age="23">
John Doe
</person>
<person>
<person.age>
23
</person.age>
John Doe
</person>
UndefinedXMLAttributes:
MetaNames directive.
XMLClassAttributes.
XMLClassAttributes class
<person class="first">
John
</person>
<person class="last">
Doe
</person>
<person>
<person.first>
John
</person.first>
</person>
<person>
<person.last>
Doe
</person.last>
</person>
MetaNames and UndefinedMetaTags.
XMLClassAttributes and UndefinedXMLAttributes.
XMLClassAttributes class
UndefinedMetaTags auto
UndefinedXMLAttributes auto
IndexContents XML2 .xml
<xml> <person class="student" phone="555-1212" age="102"> John </person>
<person greeting="howdy">Bill</person> </xml>
./swish-e -c 2 -i 1.xml -T parsed_tags parsed_text -v 0
Indexing Data Source: "File-System"
<xml> (MetaName)
<person> (MetaName)
<person.student> (MetaName)
<person.student.phone> (MetaName)
555-1212
</person.student.phone>
<person.student.age> (MetaName)
102
</person.student.age>
John
</person>
<person> (MetaName)
<person.greeting> (MetaName)
howdy
</person.greeting>
Bill
</person>
</xml>
Indexing done!
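The metanames in the trace above follow a simple naming pattern: the tag name, then the value of any class attribute, then each remaining attribute name, joined with dots. The Python sketch below reproduces that pattern for the sample document. It only models the names shown in the trace; the function metanames_for is hypothetical and says nothing about how Swish-e's libxml2 parser is implemented.

```python
import xml.etree.ElementTree as ET


def metanames_for(element, class_attributes=("class",)):
    """Return the dotted metanames an element would produce in the trace above."""
    base = element.tag
    names = [base]
    # A class attribute extends the base name with its *value*.
    for attr in class_attributes:
        if attr in element.attrib:
            base = f"{base}.{element.attrib[attr]}"
            names.append(base)
    # Every other attribute contributes its *name* under the base.
    for attr in element.attrib:
        if attr not in class_attributes:
            names.append(f"{base}.{attr}")
    return names


doc = ET.fromstring(
    '<xml><person class="student" phone="555-1212" age="102">John</person>'
    '<person greeting="howdy">Bill</person></xml>'
)
for person in doc:
    print(metanames_for(person))
```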
ReplaceRules for a description of the various pattern replacement methods, but you will use the regex method.
/web/sales/foo...
/web/parts/foo...
/web/accounting/foo...
ExtractPath department regex !^/web/([^/]+)/.*$!\$1!
department. Then to limit a search to the sales department:
swish-e -w foo AND department=sales
regex method uses a substitution pattern, so to index only a sub-string match the entire document path in the regular expression, as shown above. Otherwise any part that is not matched will end up in the substitution pattern.
ExtractPathDefault option for a way to set a value if no patterns match.
ExtractPath directive. More than one directive of the same meta name will operate successively (in order listed in the configuration file) on the path. This allows you to use regular expressions on the results of the previous pattern substitution (as if piping the output from one expression to the pattern of the next).
ExtractPath foo regex !^(...).+$!\$1!
ExtractPath foo regex !^.+(.)$!\$1!
ExtractPath foo regex !^X(...).+$!\$1!
ExtractPath foo regex !^.+(.)$!\$1!
ExtractPath directive with the same metaname.
ReplaceRules directive has no effect on the path used with ExtractPath.
ExtractPath directive is used. That is, changes to the path used in ExtractPath foo do not affect the path used by ExtractPath bar.
ExtractPath to set a default string to index under the given metaname if none of the ExtractPath patterns match.
/web/sales/foo...
/web/parts/foo...
/web/accounting/foo...
ExtractPath department regex !^/web/([^/]+)/.*$!\$1!
ExtractPathDefault department other
-w foo department=(sales) - limit searches to the sales documents
-w foo department=(parts) - limit searches to the parts documents
-w foo department=(accounting) - limit searches to the accounting documents
-w foo department=(other) - everything but sales, parts, and accounting.
-w foo not department=(sales or parts or accounting)
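The interaction between successive ExtractPath patterns and ExtractPathDefault can be sketched in Python. This is a model of the behavior described above, not Swish-e code: extract_path is a hypothetical helper, the sample paths are made up, patterns are given as (search, replacement) pairs with Python's \1 back-reference syntax, and the default is returned only when no pattern matched.

```python
import re


def extract_path(path, patterns, default=None):
    """Apply each (search, replacement) pair in order to the running value."""
    value, matched = path, False
    for search, replacement in patterns:
        new_value, hits = re.subn(search, replacement, value, count=1)
        if hits:
            # Later patterns see the output of earlier ones.
            value, matched = new_value, True
    return value if matched else default


department = [(r"^/web/([^/]+)/.*$", r"\1")]
print(extract_path("/web/sales/foo/index.html", department, default="other"))
print(extract_path("/home/docs/readme.html", department, default="other"))

# Two patterns for the same metaname: first three characters, then the last of those.
chained = [(r"^(...).+$", r"\1"), (r"^.+(.)$", r"\1")]
print(extract_path("/web/sales/foo/index.html", chained))
```

The first path indexes "sales" under the department metaname, the second falls back to "other", and the chained example reduces the path to "/we" and then to "e".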
-p and -x switches in SWISH-RUN).
-s switch).
PropertyNames author subjects
PropertyNamesCompareCase and PropertyNamesIgnoreCase. These tell Swish-e to either ignore or compare case when sorting results. The default for PropertyNames is to ignore the case.
PropertyNamesIgnoreCase subject
PropertyNamesCompareCase keyword
swishtitle -- ignore the case
swishdocpath -- compare case
swishdescription -- compare case
PropertyNamesCompareCase and PropertyNamesIgnoreCase.
PropertyNamesCompareCase swishtitle
PropertyNames, but it flags the property as being a string of digits (integer value) that will be stored as binary data instead of a string. This allows sorting with -s and limiting with -L to sort and limit the property correctly.
strtoul(3) to convert the string into an unsigned long integer. Therefore, only positive integers can be stored.
PropertyNamesNumeric, but it also flags the number as a machine timestamp (seconds since Epoch), and will print a formatted date when returning this property. See -x in SWISH-RUN.
PropertyNameAlias swishtitle title titel título titulo
PropertyNamesMaxLength 1000 swishdescription
PropertyNameAlias swishdescription body
StoreDescription HTML <body> 1000
StoreDescription XML <body> 1000
StoreDescription HTML2 <body> 1000
StoreDescription XML2 <body> 1000
PropertyNamesMaxLength 1000 headings
PropertyNameAlias headings h1 h2 h3 h4
PreSortedIndex
If it is not present in the config file (default action), all the properties will be presorted at indexing time. If it is present without any parameter, no properties will be presorted. Otherwise, only the property names specified will be presorted.
title:
PropertyNames title age time
PreSortedIndex title
-x switch is used to include the swishdescription for extended results, or by using -p swishdescription.
IndexContents or DefaultContents. See those directives for possible values. A common problem is using StoreDescription yet not setting the document's type with IndexContents or DefaultContents. Another problem is different types:
IndexContents HTML2 .html
StoreDescription HTML <body>
StoreDescription TXT 20
StoreDescription HTML <body> 20000
StoreDescription XML <desc> 40
IndexContents or DefaultContents to use this feature.
StoreDescription to store a large amount of text in properties (or if using PropertyNames with large property sizes).
TruncateDocSize 10000000
prog input source method.
swish-e -T LIST_FUZZY_MODES
swish-e -T LIST_FUZZY_MODES
FuzzyIndexingMode Stemming_es
The Porter stemming algorithm (or Porter stemmer) is a
process for removing the commoner morphological and inflexional
endings from words in English. Its main use is as part of a
term normalisation process that is usually done when setting up
Information Retrieval systems.
Lawrence Philips' Metaphone Algorithm is an algorithm which returns
the rough approximation of how an English word sounds.
DoubleMetaphone mode will sometimes generate two different metaphones for the same word. This is supposed to be useful when a word may be pronounced more than one way.
UseStemming no
UseStemming yes
yes every word is stemmed before placing it into the index.
FuzzyIndexingMode.
yes every word is converted to a Soundex code before placing it into the index.
FuzzyIndexingMode.
IgnoreTotalWordCountWhenRanking no
MinWordLimit 5
WordCharacters abde
-v 4, -D and -k searching switches).
WordCharacters .abcdefghijklmnopqrstuvwxyz
BeginCharacters abcdefghijklmnopqrstuvwxyz
EndCharacters abcdefghijklmnopqrstuvwxyz
IgnoreFirstChar .
IgnoreLastChar .
Please visit http://www.example.com/path/to/file.html.
please
visit
http
www.example.com
path
to
file.html
www.example.com as a single word, but searching for just example will not find the document.
0123456789&#; to index entities. See also ConvertHTMLEntities
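The word-splitting example earlier in this section (the sentence "Please visit http://www.example.com/path/to/file.html." and the resulting word list) can be reproduced with a short Python model. This is a sketch that assumes the order of operations implied by that example: split on anything not in WordCharacters, strip IgnoreFirstChar and IgnoreLastChar characters, then require the word to begin and end with BeginCharacters and EndCharacters. The function split_words is hypothetical and Swish-e's real tokenizer has further rules (such as word length limits) that are not modelled here.

```python
import string


def split_words(text, word_chars, begin_chars, end_chars,
                ignore_first="", ignore_last=""):
    """Split text into indexable words using the character-class settings."""
    words, current = [], []
    for ch in text.lower() + " ":  # trailing space flushes the last word
        if ch in word_chars:
            current.append(ch)
        elif current:
            word = "".join(current).lstrip(ignore_first).rstrip(ignore_last)
            current = []
            if word and word[0] in begin_chars and word[-1] in end_chars:
                words.append(word)
    return words


letters = string.ascii_lowercase
words = split_words(
    "Please visit http://www.example.com/path/to/file.html.",
    word_chars="." + letters,   # WordCharacters .abcdefghijklmnopqrstuvwxyz
    begin_chars=letters,        # BeginCharacters
    end_chars=letters,          # EndCharacters
    ignore_first=".",           # IgnoreFirstChar .
    ignore_last=".",            # IgnoreLastChar .
)
print(words)
```

Because the dot is a word character, www.example.com stays one word, which is why a search for just "example" does not match.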
File:filename is used then the Buzzwords will be read from an external file during indexing.
Buzzwords C++ TCP/IP
Buzzwords File: ./buzzwords.lst
Buzzwords C++ TCP/IP web=http
./swish-e -w 'web\=http'
IgnoreFirstChar and IgnoreLastChar characters from the word, and then comparing with the list of Buzzwords. Therefore, if adding Buzzwords to an index you will probably want to define IgnoreFirstChar and IgnoreLastChar settings.
IgnoreFirstChar and IgnoreLastChar may be used in the future.
File:filename is used then the stop words will be read from an external file during indexing.
IgnoreWords swishdefault - obsolete!
IgnoreWords www http a an the of and or
IgnoreWords File: ./stopwords.de
UseWords directive in a config file), and/or use the File: form to specify a path to a file containing the words:
UseWords perl python pascal fortran basic cobal php
UseWords File: /path/to/my/wordlist
IgnoreLimit 50 1000
IgnoreMetaTags defines a list of metatags to ignore while indexing XML files (and HTML files if using libxml2 for parsing HTML). All text within the tags will be ignored -- both for indexing (MetaNames) and properties (PropertyNames). To still parse properties, yet not index the text, see UndefinedMetaTags.
<person>
<first_name>
William
</first_name>
<last_name>
Shakespeare
</last_name>
<updated_date>
April 25, 1999
</updated_date>
</person>
-w 'person=(April)'
IgnoreMetaTags updated_date
UndefinedMetaTags.
WordCharacters settings. In other words, the "word" checked is a word that Swish-e would otherwise index.
IgnoreNumberChars 0123456789\$.,
123
123,456.78
\$123.45
IgnoreNumberChars 0123456789abcdef
IndexComments yes
# This will index a_b as a-b and ámo as amo
TranslateCharacters _á -a
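The two-string form of TranslateCharacters maps each character in the first string to the character at the same position in the second. The Python sketch below models that with str.maketrans. The helper name make_translator is hypothetical, and the requirement that both strings have the same length is an assumption of this model.

```python
def make_translator(source, target):
    """Build a character-for-character translation table."""
    if len(source) != len(target):
        raise ValueError("source and target must have the same length")
    return str.maketrans(source, target)


# Equivalent of: TranslateCharacters _á -a   (\u00e1 is "á")
table = make_translator("_\u00e1", "-a")
print("a_b".translate(table))
print("\u00e1mo".translate(table))
```

The two calls print a-b and amo, matching the comment in the example above.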
TranslateCharacters :ascii7: is a predefined set of characters that will translate eight bit characters to ascii7 characters. Using the :ascii7: rule will translate "Ääç" to "aac". This means: searching "Çelik", "çelik" or "celik" will all match the same word.
<subjects>
computer programming | apple computers
</subjects>
BumpPositionCounterCharacters |
DontBumpPositionOnEndTags and DontBumpPositionOnStartTags disable this feature for the listed metanames.
<person>
<first_name>
William
</first_name>
<last_name>
Shakespeare
</last_name>
<updated_date>
April 25, 1999
</updated_date>
</person>
DontBumpPositionOnEndTags first_name
DontBumpPositionOnStartTags last_name
-w 'person=("william shakespeare")'
-w 'person=("shakespeare april")'
Some directives have different uses depending on the source of the documents. These directives are only valid when using the File system method of indexing.
# Only index .html .htm and .q files
IndexOnly .html .htm .q
IndexOnly checks that the file name ends in the characters listed. It does not check "extensions". IndexOnly is tested right before FileRules is processed.
FollowSymLinks no
FollowSymLinks yes
no extra stat(2) system calls must be made for each file. For a large number of files you may see a small reduction in indexing time by setting this to yes.
-l switch in SWISH-RUN.
FollowSymLinks) you will typically only use FileRules to exclude files or directories. FileMatch is useful in a few cases, for example, to override the behavior of IndexOnly. Some examples are included below.
FileRules title ..., this feature is only available for file access method (-S fs), which is the default indexing mode. Also, any pathname modification with ReplaceRules happens after the check for FileRules. (It's unlikely that you would exclude files with FileRules based on text you added with ReplaceRules!)
contains or is. is simply forces the regular expression to match at the start and end of the string (by internally prepending "^" and appending "$" to the regular expression).
regex option requires delimiter characters:
FileRules title regex /^private/i
regex is if you want to do case insensitive matches, or simply like your regular expressions to look like perl regular expressions. You must use matching delimiters; (), {}, and [], are not currently supported for no good reason other than laziness.
FileRules title is hello
FileRules title contains ^hello$
FileRules title regex /^hello$/
FileRules title is "hello there"
FileRules title contains "^hello there$"
FileRules title regex "!^hello there$!"
FileRules filename regex /\.pdf/
FileRules filename regex "/\\.pdf/"
FileRules filename regex !hello\\there! # need double for real backslash
FileRules filename regex "!hello\\\\there!" # need double-double inside of quotes
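The three match types differ only in how the pattern is wrapped before it is compiled. The Python sketch below models that: contains compiles the pattern as-is, is adds "^" and "$", and regex strips the delimiters and honors a trailing i option. The helper compile_rule is hypothetical, and Python's re syntax is used as a stand-in for the regular expression library Swish-e is built with.

```python
import re


def compile_rule(kind, pattern):
    """Compile a FileRules/FileMatch style pattern of the given match type."""
    if kind == "contains":
        return re.compile(pattern)
    if kind == "is":
        # Anchor at both ends, as described for the "is" type.
        return re.compile("^" + pattern + "$")
    if kind == "regex":
        delim = pattern[0]
        body, _, options = pattern[1:].rpartition(delim)
        return re.compile(body, re.IGNORECASE if "i" in options else 0)
    raise ValueError(f"unknown match type: {kind}")


print(bool(compile_rule("is", "hello").search("hello there")))
print(bool(compile_rule("contains", "hello").search("hello there")))
print(bool(compile_rule("regex", "/^private/i").search("Private area")))
```

The first check fails because "is" must match the whole string, while the other two succeed.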
FileRules pathname
FileRules dirname
FileRules filename
FileRules directory
FileRules title
FileMatch pathname
FileMatch filename
FileMatch dirname
FileMatch directory
IndexDir.
# Don't index paths that contain private or hidden
FileRules pathname contains (private|hidden)
# Same thing
FileRules pathname regex /(private|hidden)/
# Don't index exe files
FileRules pathname contains \.exe$
# Same as last example - don't index *.exe files.
FileRules filename contains \.exe$
# Don't index any file called test.html files
FileRules filename contains ^test\.html$
# Same thing
FileRules filename is test\.html
# Don't index any directories that contain "old" (/usr/local/myold/docs)
FileRules dirname contains old
# Don't index any directories that contain the path segment "old" (/usr/local/old/foo)
FileRules dirname contains /old/
# Index only .htm, .html, plus any all-digit file names
IndexOnly .htm .html
FileMatch filename contains ^\d+$
# Same as previous, but maybe a little slower
FileRules filename regex not !\.(htm|html)$!
FileMatch filename contains ^\d+$
pathname, dirname, and filename, and FileMatch patterns are checked before FileRules, in general. This allows you to exclude most files with FileRules, yet allow in a few special cases with FileMatch. For example:
# Exclude all files of .exe, .bin, and .bat
FileRules filename contains \.(exe|bin|bat)$
# But, let these two in
FileMatch filename is baseball\.bat incoming_mail\.bin
# Same, but as a single pattern
FileMatch filename is (baseball\.bat|incoming_mail\.bin)
directory type is somewhat unique. When Swish-e recurses into a directory it will compare all the files in the directory with the pattern and then decide if that entire directory should or should not be indexed (or recursed). Note that you are matching against file names in a directory -- and some of those names may be directory names.
FileRules directory match will cause Swish-e to ignore all files and sub-directories in the current directory.
FileMatch directory says to index everything in the *current* directory and ignore any FileRules for this directory.
# Don't index any directories (and sub directories) that contain
# a file (or sub-directory) called "index.skip"
FileRules directory contains ^index\.skip$
# Don't index directories that contain a .htaccess file.
FileRules directory contains ^\.htaccess
IndexDir or -i.
title checks for a pattern match in an HTML title.
FileRules title contains construction example pointers
# This example says to ignore case
FileRules title regex "/^Internal document/i"
FileRules title works for any input method (fs, prog, or http) that is parsed as HTML, and where a title was found in the document.
FileRules dirname - reject entire directory if matches
FileRules directory - reject entire dir if *any* files match
FileMatch directory - accept entire dir if *any* files match
FileMatch directory matched, each file is tested with FileMatch. A match says to index the file without further testing (i.e. overrides FileRules and IndexOnly):
FileMatch pathname \
FileMatch dirname - file is accepted if any match
FileMatch filename /
IndexOnly - file is checked for the correct file extension
FileRules pathname \
FileRules dirname - file is rejected if any match
FileRules filename /
IndexDir or -i are processed in a similar way:
FileMatch pathname \
FileMatch dirname - file is accepted if any match
FileMatch filename /
IndexOnly - file is checked for the correct file extension
FileRules pathname \
FileRules dirname - file is rejected if any match
FileRules filename /
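The per-file test order listed above (FileMatch accepts, then IndexOnly checks the suffix, then FileRules rejects) can be sketched as a small decision function. This is a simplified model that looks at file names only and ignores the directory-level tests; the function should_index and the sample file names are hypothetical.

```python
import re


def should_index(filename, file_match=(), index_only=(), file_rules=()):
    """Decide whether a file is indexed, following the order described above."""
    # 1. A FileMatch hit accepts the file without further testing.
    if any(re.search(p, filename) for p in file_match):
        return True
    # 2. IndexOnly: the name must end in one of the listed suffixes.
    if index_only and not filename.endswith(tuple(index_only)):
        return False
    # 3. A FileRules hit rejects the file.
    if any(re.search(p, filename) for p in file_rules):
        return False
    return True


# IndexOnly .htm .html  +  FileMatch filename contains ^\d+$
settings = dict(index_only=[".htm", ".html"], file_match=[r"^\d+$"])
print(should_index("page.html", **settings))
print(should_index("12345", **settings))
print(should_index("notes.txt", **settings))
```

The all-digit name is let in by FileMatch even though it fails IndexOnly, which is the override behavior the text describes.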
-T regex trace option to see how file names are checked. Start with very simple tests!
The HTTP Access method is enabled by the "-S http" switch when indexing. It works by running a Perl program called SwishSpider which fetches documents from a web server.
Only text files (content-type of "text/*") are indexed directly with the HTTP Access Method. Other document types (e.g. PDF or MSWord) may be indexed as well: the SwishSpider will attempt to make use of the SWISH::Filter module (included with the Swish-e distribution) to convert documents into a format that Swish-e can index.
Note: The -S prog method of spidering (using spider.pl) can be a replacement for the -S http method. It offers more configuration options and better spidering speed.
The directives below are available when using the HTTP Access Method of indexing.
MaxDepth 5
Delay 1
-e switch causes Swish-e to use this directory while indexing. There is no default.
TmpDir /tmp/swish
TMPDIR, TMP, and TEMP (in that order) will override this setting.
SpiderDirectory /usr/local/swish
EquivalentServer http://library.berkeley.edu http://www.lib.berkeley.edu
EquivalentServer http://sunsite.berkeley.edu:2000 http://sunsite.berkeley.edu
This section details the directives that are only available for the "prog" document source feature of Swish-e. The "prog" access method runs an external program that "feeds" documents to Swish-e. This allows indexing and filtering of documents from any source.
See prog - general purpose access method in the SWISH-RUN man page for more information.
A number of example programs for use with the "prog" access method are provided in the prog-bin directory. Please see those examples if you have questions about implementing a "prog" input program.
SwishProgParameters /path/to/config hello there
IndexDir /path/to/program.pl
swish-e -c config -S prog
/path/to/program.pl and pass /path/to/config hello there as three command line arguments to the program. This directive makes it easy to pass settings from the Swish-e configuration file to the external program.
spider.pl program (included in the prog-bin directory) uses the SwishProgParameters to specify what file to read for configuration information.
SwishProgParameters spider.config
IndexDir ./spider.pl
spider.pl program also has a default action so you can avoid using a configuration file:
SwishProgParameters default http://www.swishe.org/ http://some.other.site/
IndexDir ./spider.pl
./spider.pl spider.conf | ./swish-e -S prog -i stdin
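The values given in SwishProgParameters arrive in the external program as ordinary command line arguments. The sketch below shows one way a program might tell a configuration file name apart from a "default" keyword followed by URLs, in the spirit of the spider.pl examples above. The function name and the returned tuple are hypothetical, and this is not how spider.pl itself is implemented.

```python
from typing import List, Tuple


def parse_parameters(args: List[str]) -> Tuple[str, List[str]]:
    """Classify arguments as ("default", urls) or ("config", [file])."""
    if not args:
        raise ValueError("expected a config file name, or 'default' plus URLs")
    if args[0] == "default":
        if len(args) < 2:
            raise ValueError("'default' needs at least one URL")
        return "default", args[1:]
    # Assumption: anything else is treated as a configuration file name.
    return "config", [args[0]]


# Arguments as they would appear for the two configurations shown above.
print(parse_parameters(["spider.config"]))
print(parse_parameters(["default", "http://www.swishe.org/", "http://some.other.site/"]))
```

In a real program the list would come from sys.argv[1:].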
Notes when using MS Windows
You should use unix style path separators to specify your external program. Swish will convert forward slashes to backslashes before calling the external program. This is only true for the program name specified with IndexDir or the -i command line option.
In addition, Swish-e will make sure the program specified actually exists, which means you need to use the full name of the program.
For example, to run the perl spider program spider.pl you would need a Swish-e configuration file such as:
IndexDir e:/perl/bin/perl.exe
SwishProgParameters prog-bin/spider.pl default http://swish-e.org
and run indexing with the command:
swish-e -c swish.cfg -S prog -v 9
The IndexDir command tells Swish-e the name of the program to run. Under unix you can just specify the name of the script, since unix will figure out the program from the first line of the script.
The SwishProgParameters are the parameters passed to the program specified by IndexDir (perl.exe in this case). The first parameter is the perl script to run (prog-bin/spider.pl). Perl passes the rest of the parameters directly to the perl script. The second parameter, default, tells the spider.pl program to use default settings for spidering (or you could specify a spider config file -- see perldoc spider.pl for details), and lastly, the URL is passed into the spider program.
Internally, Swish-e knows how to parse only text, HTML, and XML documents. With "filters" you can index other types of documents. For example, if all your web pages are in gzip format a filter can uncompress these on the fly for indexing.
You may wish to read the Swish-e FAQ question on filtering, "How Do I filter documents?", before continuing here.
There are two suggested methods for filtering.
The Swish-e distribution includes a Perl module called SWISH::Filter and individual filters located in the filters directory. This system uses plug-in filters to extend the types of documents that Swish-e can index. The plug-in filters do not actually do the filtering, but rather provide a standard interface for accessing programs that can filter or convert documents. The programs that do the filtering are not part of the Swish-e distribution; they must be downloaded and installed separately.
The advantage of this method is that new filtering methods can be installed easily.
This system is designed to work with the -S http and -S prog methods, but may also be used with the FileFilter feature and the -S fs indexing method. See $prefix/share/doc/swish-e/examples/filter-bin/swish_filter.pl for an example.
See the filters/README file for more information.
A filter is an external program that Swish-e executes while processing a document of a given type. Swish-e will execute the filter program for each file that matches the file suffix (extension) set in the FileFilter or FileFilterMatch directives. FileFilterMatch matches using regular expressions and is described below.
Filters may be used with any type of input method (i.e. -S fs, -S http, or -S prog). But because
Swish-e calls the external program passing as default arguments:
Swish-e can also pass other parameters to the filter program. These parameters can be defined using the FileFilter or FileFilterMatch directives. See Filter Options below.
The filter program must open the file, process its contents, and return it to Swish-e by printing to STDOUT.
Note that this can add a significant amount of time to the indexing process if your external program is a perl or shell script. If you have many files to filter you should consider writing your filter in C instead of a shell or perl script, or using the "prog" Access Method.
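To make the contract concrete, here is a minimal sketch of a filter program in Python that decompresses a gzip file and prints the result to STDOUT. It assumes the work file path arrives as the first command line argument, and the name filter_file is hypothetical. As the note above warns, starting a script interpreter for every document is slow, so treat this as an illustration of the interface rather than a recommended setup.

```python
import gzip
import shutil
import sys


def filter_file(work_path: str, out) -> None:
    """Decompress work_path and copy the result to the binary stream out."""
    with gzip.open(work_path, "rb") as src:
        shutil.copyfileobj(src, out)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Assumption: argv[1] is the work file; any further arguments
    # (such as the original document path) are ignored here.
    filter_file(sys.argv[1], sys.stdout.buffer)
    sys.stdout.buffer.flush()
```

The filter writes bytes to sys.stdout.buffer so that the converted document reaches Swish-e without any encoding or newline translation.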
FileFilterMatch directive.
FilterDir /usr/local/swish/filters
Default: "'%p' '%P'"
Which means: pass "workfile path" and "documentfile path" to filter (each quoted).
%% = %
%P = Full document pathname (e.g. URL, or path on filesystem)
%p = Full pathname to work file (maybe a tmpfile or the real document path on filesystem)
%F = Filename stripped from full document pathname
%f = Filename stripped from "work" pathname
%D = Directoryname stripped from full document pathname
%d = Directoryname stripped from full "work" pathname
%P = document pathname: http://myserver/path1/mydoc.txt
%p = work pathname: /tmp/tmp.1234.mydoc.txt
%F = mydoc.txt
%f = tmp.1234.mydoc.txt
%D = http://myserver/path1
%d = /tmp
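The substitutions listed above can be modelled with a short Python sketch. This is not Swish-e's own code: the function name is hypothetical, and leaving unknown sequences such as %x untouched is an assumption made for the illustration. It reproduces the example values shown above.

```python
import posixpath


def expand_filter_options(template: str, work_path: str, doc_path: str) -> str:
    """Expand %-variables in a filter option string."""
    values = {
        "%": "%",
        "P": doc_path,                      # full document pathname
        "p": work_path,                     # full pathname of the work file
        "F": posixpath.basename(doc_path),  # filename from document path
        "f": posixpath.basename(work_path), # filename from work path
        "D": posixpath.dirname(doc_path),   # directory from document path
        "d": posixpath.dirname(work_path),  # directory from work path
    }
    out = []
    i = 0
    while i < len(template):
        ch = template[i]
        if ch == "%" and i + 1 < len(template) and template[i + 1] in values:
            out.append(values[template[i + 1]])
            i += 2
        else:
            # Assumption: anything that is not a known variable is copied as-is.
            out.append(ch)
            i += 1
    return "".join(out)


work = "/tmp/tmp.1234.mydoc.txt"
doc = "http://myserver/path1/mydoc.txt"
print(expand_filter_options("'%p' '%P'", work, doc))
print(expand_filter_options("-d '%d' -example -url '%P' '%f'", work, doc))
```

The first call expands the default option string; the second expands the options used in the mydocfilter example.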
e.g. "'%f'" --> 'file name with spaces.doc'
'"%f"' --> "file name with spaces.doc"
FileFilter .mydoc c:/some/path/mydocfilter.exe '-d "%d" -example -url "%P" "%f"'
FileFilter .doc /usr/local/bin/catdoc "-s8859-1 -d8859-1 '%p'"
FileFilter .pdf pdftotext "'%p' -"
FileFilter .html.gz gzip "-c '%p'"
FileFilter .mydoc "/some/path/mydocfilter" "-d '%d' -example -url '%P' '%f'"
FileFilter .pdf pdf2html.sh
FileFilter .ps ghostscript-filter.pl
-S prog access method where the script will only be compiled once, instead of for each document.
-S prog program. Which you decide to use depends on your requirements. Examples of filter scripts can be found in the filter-bin directory, and examples of -S prog programs can be found in the prog-bin directory.
FileMatch except uses regular expressions to match against the file name. *filter-prog* is the path to the program. Unlike FileFilter this does not use the FilterDir option. Also unlike FileFilter you must specify the *filter-options*.
FileFilterMatch ./pdftotext "'%p' -" /\.pdf$/
FileFilterMatch ./pdftotext "'%p' -" /.\.pdf$/
FileFilterMatch ./check_title.pl "%p" /\.html$/ /\.htm$/
FileFilterMatch ./check_title.pl %p /\.(html|htm)$/
FileFilterMatch ./check_title.pl %p /\.html?$/
FileFilterMatch ./check_title.pl %p /\.html?$/i
FileFilterMatch ./convert "%p %P" not /\..+$/
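The patterns in the examples above can be tried out with a small Python sketch. Swish-e uses its own regular expression library, so Python's re module is only a stand-in here; the function names are hypothetical, and treating "not" as inverting the combined result of all patterns is an assumption.

```python
import re
from typing import List


def compile_pattern(spec: str) -> re.Pattern:
    """Turn a /regex/ or /regex/i spec into a compiled pattern."""
    end = spec.rfind("/")
    if not spec.startswith("/") or end == 0:
        raise ValueError(f"pattern must look like /regex/: {spec!r}")
    body, flags = spec[1:end], spec[end + 1:]
    return re.compile(body, re.IGNORECASE if "i" in flags else 0)


def filter_applies(filename: str, specs: List[str]) -> bool:
    """Return True if the filter would be selected for filename."""
    negate = bool(specs) and specs[0] == "not"
    patterns = specs[1:] if negate else specs
    matched = any(compile_pattern(s).search(filename) for s in patterns)
    # With "not", the filter applies only when nothing matched.
    return matched != negate


print(filter_applies("report.pdf", [r"/\.pdf$/"]))
print(filter_applies("INDEX.HTM", [r"/\.html?$/i"]))
print(filter_applies("README", ["not", r"/\..+$/"]))
```

The last call mirrors the final example above, where the filter is applied to files whose names have no extension.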
$Id: SWISH-CONFIG.pod,v 1.81 2004/10/25 13:57:17 karman Exp $