quaneko User Manual
 

4 Filters

Before you start using quaneko, you will most likely want to add support for your favorite file formats (e.g. doc, html, pdf). quaneko lets you install and configure an individual filter for each file type you want to index. This section describes how to do this. If you only want to index plain text files with the extension "txt", you can skip this section.

Adding Support For Various File Formats

Adding support for a new file format means configuring a new filter in quaneko. By default, quaneko only supports plain text files with the extension "txt"; to index other file types, you need to configure filters for them.
If the file format you want to index is not listed here, see the generic description of how to add support for any file format at the end of this section.

The example screenshot in figure 1 shows how to configure a utility called "gettext.exe" under Windows for parsing Word documents. The exact settings may vary for your system. In these settings we assume that you've installed "gettext.exe" in "C:\Program files\gettext\" and that the file "wordpad.exe" is located in "C:\Program files\Windows NT\Accessories":
Filter extensions: doc
Parse Command: C:\Program files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program files\Windows NT\Accessories\wordpad.exe
Once configured as described here, the Word doc filter is ready to use.

Figure 1: Configuration of the Word doc filter.

You can configure the other format filters in the same way. The following sections list some possible filter configuration settings for Windows and Linux.

Configuration Under Windows

For Windows there is one filter available which works for many common formats:
doc         Microsoft Word document format
xls         Microsoft Excel spreadsheet format
ppt         Microsoft PowerPoint presentation format
pdf         Adobe Portable Document Format
html, htm   Hypertext Markup Language
txt         Plain text
rtf         Rich Text Format
wpd         Corel WordPerfect® document format
hlp         Microsoft Help format

The utility to convert all these formats into plain text is called "GetText" and is available from:
http://www.kryltech.com/freestf.htm.

Word
Filter extensions: doc
Parse Command: C:\Program files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program files\Windows NT\Accessories\wordpad.exe

Excel
Filter extensions: xls
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program Files\Microsoft Office\Office10\excel.exe

PowerPoint
Filter extensions: ppt
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program Files\Microsoft Office\Office10\powerpnt.exe

Adobe Portable Document Format (PDF)
Filter extensions: pdf
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program Files\Adobe\Acrobat 5.0\Reader\AcroRd32.exe

Hypertext Markup Language (HTML)
Filter extensions: html htm
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program Files\Internet Explorer\iexplore.exe

Plain Text
Filter extensions: txt
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program files\Windows NT\Accessories\wordpad.exe

Rich Text (RTF)
Filter extensions: rtf
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: C:\Program files\Windows NT\Accessories\wordpad.exe

WordPerfect®
Filter extensions: wpd
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command:

Help Files
Filter extensions: hlp
Parse Command: C:\Program Files\gettext\gettext.exe "%f" "%o"
Open Command: winhlp32

Configuration Under Linux

Numerous filters are available for converting file formats into plain text. Some may already come with your distribution; for others you may have to download the sources and compile them.

Word
Filter extensions: doc
Parse Command: antiword "%f"
Open Command: OOo
Download Filter: http://www.winfield.demon.nl/

Adobe Portable Document Format (PDF)
Filter extensions: pdf
Parse Command: pdftotext "%f" "%o"
Open Command: acroread
Download Filter: http://www.foolabs.com/xpdf/

MP3 Description (ID3 Tags)
Filter extensions: mp3
Parse Command: id3info "%f"
Open Command: xmms
Download Filter: http://id3lib.sourceforge.net/

Hypertext Markup Language (HTML)
Filter extensions: htm html xml
Parse Command: html2text "%f"
Open Command: konqueror
Download Filter: http://www.linux.org/apps/AppId_7912.html

Adding Support for Other File Formats

If you want to index file types that are not mentioned in the previous sections, you need to configure your own filters for them. The following steps are required to configure a new filter:

  1. Download an appropriate converter application. The utility must be able to produce a plain text file from a file in another format. Furthermore, it should neither show a GUI nor require any user interaction.
  2. Install the application on your system.
  3. Configure the converter as a filter in quaneko.
  4. After this procedure, the filter is ready to be used with quaneko.
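
To illustrate the requirements from step 1, here is a minimal sketch of a converter written in Python. It is not part of quaneko; the script name "converter.py" and its command-line convention (input file first, optional output file second) are assumptions made for this example. It strips the markup from an HTML file, shows no GUI, and needs no user interaction.

```python
import sys
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(markup):
    """Return the text content of `markup` with all tags removed."""
    extractor = TextExtractor()
    extractor.feed(markup)
    extractor.close()
    return "".join(extractor.chunks)


def main(argv):
    # argv[1] is the input file; argv[2], if given, is the output file.
    with open(argv[1], encoding="utf-8", errors="replace") as src:
        text = html_to_text(src.read())
    if len(argv) > 2:
        with open(argv[2], "w", encoding="utf-8") as dst:
            dst.write(text)
    else:
        # Without an output file, write the plain text to standard output.
        sys.stdout.write(text)
    return 0


# Do nothing when the script is started without arguments.
if __name__ == "__main__" and len(sys.argv) > 1:
    sys.exit(main(sys.argv))
```

Assuming a Python interpreter is on the search path, such a script could be entered in step 3 with a Parse Command along the lines of: python converter.py "%f" "%o"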

What Is A Filter?

The configuration of a filter consists of:

  • A list of the file types this filter supports (e.g. "htm html").
  • A string which specifies the application call that converts those types into plain text (e.g. html2text "%f" "%o"). We refer to this string as the 'Filter Conversion String'.
  • Optionally, the name of the application which can be used to open that document type (e.g. "mozilla").
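
The three parts of a filter configuration can be pictured as a small record. The following Python sketch is only an illustration and does not show how quaneko stores filters internally; the names Filter and find_filter, and the case-insensitive matching of extensions, are assumptions made for this example.

```python
import os
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Filter:
    extensions: List[str]               # file types, e.g. ["htm", "html"]
    parse_command: str                  # the Filter Conversion String
    open_command: Optional[str] = None  # optional viewer application


def find_filter(filters, file_name):
    """Return the first filter that lists the extension of file_name, or None."""
    ext = os.path.splitext(file_name)[1].lstrip(".").lower()
    for candidate in filters:
        if ext in candidate.extensions:
            return candidate
    return None


filters = [
    Filter(["htm", "html"], 'html2text "%f" "%o"', "mozilla"),
    Filter(["pdf"], 'pdftotext "%f" "%o"', "acroread"),
    Filter(["mp3"], 'id3info "%f"'),  # no Open Command configured
]
```

Looking up "report.pdf" in this list returns the pdf filter, while a file with an unlisted extension returns None and would therefore not be indexed in this model.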

Filter Conversion Strings

The filter command that parses a file and converts it into plain text is configured as a string in which %f stands for the data file that quaneko hands to the converter and %o stands for the output file.
Example: The string
pdftotext "%f" "%o"
is converted at runtime to:
pdftotext "/home/tux/file.pdf" "/home/tux/.qnk_tmp.txt"
If %o is omitted, quaneko assumes that the filter streams the plain text to standard output (internally it adds ">%o" to the command).

It is usually advisable to put quotation marks around %f and %o. Otherwise, file names that contain spaces will cause problems.
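
The substitution described in this section can be sketched in a few lines of Python. This is a model of the documented behaviour, not quaneko's actual code; the function name expand_conversion_string and the file paths are made up for this example.

```python
import re


def expand_conversion_string(template, data_file, output_file):
    """Replace %f and %o in a Filter Conversion String with real file names."""
    # A filter without %o is assumed to write to standard output,
    # so a redirection to the output file is appended first.
    if "%o" not in template:
        template += " >%o"
    values = {"%f": data_file, "%o": output_file}
    # Substitute in a single pass so that a file name which happens to
    # contain "%o" or "%f" is not expanded a second time.
    return re.sub(r"%[fo]", lambda match: values[match.group(0)], template)


with_output = expand_conversion_string(
    'html2text "%f" "%o"', "/home/tux/my page.html", "/home/tux/.qnk_tmp.txt")
to_stdout = expand_conversion_string(
    'antiword "%f"', "/home/tux/letter.doc", "/home/tux/.qnk_tmp.txt")
```

Here with_output becomes html2text "/home/tux/my page.html" "/home/tux/.qnk_tmp.txt"; thanks to the quotation marks, the space in the file name stays inside a single argument. to_stdout becomes antiword "/home/tux/letter.doc" >/home/tux/.qnk_tmp.txt.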


This Manual was created with ManStyle.