quaneko User Manual

# 4 Filters

Before you start using quaneko, you will most likely want to add support for your favorite file formats (e.g. doc, html, pdf). quaneko lets you install and configure individual filters for the file types you want to index. This section describes how to do this. If you only want to index plain text files with the extension "txt", you can skip this section.

## Adding Support For Various File Formats

Adding support for a new file format means configuring a new filter in quaneko. By default, quaneko only supports plain text files with the extension "txt". If you want to index other file types, you need to configure filters for them.
If you want to index a file format that is not listed here, the end of this section gives a generic description of how to add support for any file format.

The example screenshot in figure 1 shows how to configure a utility called "gettext.exe" under Windows for parsing Word documents. The exact settings may vary on your system. These settings assume that you have installed "gettext.exe" in "C:\Program Files\gettext\" and that the file "wordpad.exe" is located in "C:\Program Files\Windows NT\Accessories":

• Filter extensions: doc
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Windows NT\Accessories\wordpad.exe

Once configured as described here, the Word doc filter is ready to use.

Figure 1: Configuration of the Word doc filter.

You can configure the other format filters in the same way. The following sections list some possible filter configuration settings for Windows and Linux.

### Configuration Under Windows

For Windows there is one filter available which works for many common formats:
• doc: Microsoft Word document format
• xls: Microsoft Excel spreadsheet format
• ppt: Microsoft PowerPoint presentation format
• pdf: Adobe Portable Document Format
• html, htm: Hypertext Markup Language
• txt: Plain text
• rtf: Rich Text Format
• wpd: Corel WordPerfect® document format
• hlp: Microsoft Help format

The utility to convert all these formats into plain text is called "GetText" and is available from:
http://www.kryltech.com/freestf.htm.

Word
• Filter extensions: doc
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Windows NT\Accessories\wordpad.exe

Excel
• Filter extensions: xls
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Microsoft Office\Office10\excel.exe

PowerPoint
• Filter extensions: ppt
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Microsoft Office\Office10\powerpnt.exe

PDF
• Filter extensions: pdf
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Adobe\Acrobat 5.0\Reader\AcroRd32.exe

Hypertext Markup Language (HTML)
• Filter extensions: html htm
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Internet Explorer\iexplore.exe

Plain Text
• Filter extensions: txt
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Windows NT\Accessories\wordpad.exe

Rich Text (RTF)
• Filter extensions: rtf
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: C:\Program Files\Windows NT\Accessories\wordpad.exe

WordPerfect®
• Filter extensions: wpd
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command:

Help Files
• Filter extensions: hlp
• Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
• Open Command: winhlp32

### Configuration Under Linux

Under Linux, numerous filters are available to convert file formats into plain text. Some may already come with your distribution; for others you may have to download the sources and compile them.

Word
• Filter extensions: doc
• Parse Command: antiword "%f"
• Open Command: OOo
• Download Filter: http://www.winfield.demon.nl/

MP3 Description (ID3 Tags)
• Filter extensions: mp3
• Parse Command: id3info "%f"
• Open Command: xmms
• Download Filter: http://id3lib.sourceforge.net/

Hypertext Markup Language (HTML)
• Filter extensions: htm html xml
• Parse Command: html2text "%f"
• Open Command: konqueror
• Download Filter: http://www.linux.org/apps/AppId_7912.html

## Adding Support for Other File Formats

If you want to index file types that are not mentioned in the previous sections, you need to configure your own filters for them. The following steps are required to configure a new filter:

1. Download an appropriate converter application. The utility must be able to produce a plain text file from a file in another format. It should also neither show a GUI nor require any user interaction.
2. Install the application on your system.
3. Configure the converter as a filter in quaneko.
4. After this procedure, the filter is ready to be used with quaneko.
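To illustrate what step 1 asks of a converter, here is a minimal sketch of a command-line tool that meets the requirements: it reads one input file, writes plain text to an output file, shows no GUI, and asks no questions. The script name `myhtml2text.py` and the function names are hypothetical and not part of quaneko; a real converter would handle encodings and script or style blocks more carefully.

```python
import sys
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text content of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self.parts.append(data)


def html_to_text(markup):
    """Return the text content of `markup` with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.parts)


def convert_file(input_path, output_path):
    """Read an HTML file and write its plain text to output_path."""
    with open(input_path, encoding="utf-8", errors="replace") as src:
        text = html_to_text(src.read())
    with open(output_path, "w", encoding="utf-8") as dst:
        dst.write(text)


# Runs only when called as: myhtml2text.py INPUT OUTPUT
if __name__ == "__main__" and len(sys.argv) == 3:
    convert_file(sys.argv[1], sys.argv[2])
```

Under these assumptions, a matching Parse Command could look like `python myhtml2text.py "%f" "%o"`.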

### What Is A Filter?

A filter configuration consists of:

• A list of file types this filter supports (e.g. "htm html").
• A string which specifies the application call for converting those types into plain text (e.g. `html2text "%f" "%o"`). We refer to this string as the 'Filter Conversion String'.
• Optionally, the name of the application which can be used to open that document type (e.g. "mozilla").
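The three parts of a filter configuration can be pictured as a small record. The following sketch is only an illustration: the names `FilterConfig` and `find_filter` are hypothetical, and quaneko's internal representation may look different.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class FilterConfig:
    extensions: List[str]                # supported file types, e.g. ["htm", "html"]
    parse_command: str                   # the Filter Conversion String
    open_command: Optional[str] = None   # optional application to open the file


def find_filter(filters, filename):
    """Return the first filter whose extension list matches filename, or None."""
    ext = os.path.splitext(filename)[1].lstrip(".").lower()
    for candidate in filters:
        if ext in candidate.extensions:
            return candidate
    return None


filters = [
    FilterConfig(["htm", "html"], 'html2text "%f" "%o"', "mozilla"),
    FilterConfig(["doc"], 'antiword "%f"'),
]
```

With this list, `find_filter(filters, "index.html")` returns the HTML filter, while a file with an unconfigured extension yields `None`.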

### Filter Conversion Strings

The command that parses a file and converts it into plain text is configured as a string in which %f stands for the data file that quaneko hands to the converter and %o stands for the output file.
Example: The string
`pdf2text "%f" "%o"`
is converted at runtime to:
`pdf2text "/home/tux/file.pdf" "/home/tux/.qnk_tmp.txt"`
If %o is omitted, quaneko assumes that the filter streams the plain text to standard output (internally it adds ">%o" to the command).
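The substitution described above can be sketched in a few lines. This is not quaneko's actual code: the function name `expand_filter_command` is hypothetical, and the exact form of the appended redirection (here a space followed by `>%o`) is an assumption.

```python
import re


def expand_filter_command(template, data_file, output_file):
    """Replace %f and %o in a Filter Conversion String.

    Assumption: when %o is missing, the converter writes to standard
    output, so a redirection into the output file is appended.
    """
    if "%o" not in template:
        template = template + " >%o"
    values = {"%f": data_file, "%o": output_file}
    # A single pass, so a file name containing "%o" is not expanded again.
    return re.sub(r"%[fo]", lambda match: values[match.group(0)], template)


command = expand_filter_command(
    'pdf2text "%f" "%o"', "/home/tux/file.pdf", "/home/tux/.qnk_tmp.txt"
)
```

Here `command` holds the runtime string from the example above, and a template without %o, such as `antiword "%f"`, gets the redirection appended before the placeholders are filled in.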

It is usually advisable to put quotation marks around %f and %o. Otherwise, file names that contain spaces will cause problems.
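The effect of the quotation marks can be seen by splitting an expanded command the way a POSIX shell would. The sketch below uses Python's `shlex.split` for this; the helper name `split_expanded` and the file names are made up, and it is an assumption that quaneko passes the command to a shell-like parser.

```python
import shlex


def split_expanded(template, data_file, output_file):
    """Fill in %f and %o, then split the result into shell-style arguments."""
    command = template.replace("%f", data_file).replace("%o", output_file)
    return shlex.split(command)


data_file = "/home/tux/my report.pdf"   # note the space in the file name
output_file = "/home/tux/.qnk_tmp.txt"

unquoted_args = split_expanded("pdf2text %f %o", data_file, output_file)
quoted_args = split_expanded('pdf2text "%f" "%o"', data_file, output_file)
```

Without quotation marks the file name is split into two arguments (`/home/tux/my` and `report.pdf`), so the converter receives three file arguments instead of two. With quotation marks it arrives as one argument.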

This Manual was created with ManStyle.