watching the world go by

Alternative to Index Server

Thursday 10th April, 2008

Some time ago I put up a simple guide to getting PHP talking to Microsoft Index Server on a windows machine. Index server is pretty neat, it ranks results, indexes common document types and generally just works but most servers I work with now are some form of *nix so a re-think was needed. Really I wanted to be able to index document contents, to have some better form of searching than simply doing WHERE data LIKE '%[mykeyword]%' and to provide some kind of score to how accurate a match a document might be.

A bunch of great tools available (mostly as opensource projects which run under *nix, cygwin or sometimes windows) which provide a means of extracting plain text from various document formats, the ones I've used to date are

  • catdoc, xls2csv and catppt
    A collection of simple utilities which extract text from Microsoft Word, Excel and Powerpoint documents
  • pdftotext
    Supplied as part of Xpdf this application is able to extract text from Adobe Acrobat files

Although I didn't spend long looking, I'm pretty sure similar tools can be found for extracting text from HTML or RTF files without all the formatting.

Now we have content, we need to search it. The plain text content of these files can be stored in a PostgreSQL text column, for example a simple table such as

CREATE TABLE fulltext
(
  filename varchar(100),
  contents text
)

would permit a very basic level of searching using the SQL LIKE operator which isn't great. Fortunately there is an extension for PostgreSQL called TSearch2 which ships with the official distributions and adds full text searching (FTS), much like SQL Server and mySQL. Adding TSearch2 is pretty simple and covered on the official TSearch2 site for *nix, Windows users get the option when installing PostgreSQL to automatically enable the extension for each new database created or you can simply run share\contrib\tsearch2.sql to enable the extension on a database by database basis.

TSearch2 needs to know how language works so it's important to configure the system properly, I've opted to edit the default configuration for testing instead of setting something new up. To do this, grab a copy of the ISpell British dictionaries (see official TSearch2 site for links to ISpell dictionaries). In order to tell TSearch to use this dictionary you need to add a row into the pg_ts_dict table, assuming the med and aff files are stored in /test/

INSERT INTO pg_ts_dict (dict_name, dict_init, dict_initoption, 
  dict_lexize, dict_comment) 
VALUES ('en_ispell', 'spell_init(internal)', 
  'DictFile="/test/english.med",'
    || 'AffFile="/test/english.aff",'
    || 'StopFile="/test/english.stop"', 
  'spell_lexize(internal,internal,integer)', 
  'English ISpell interface');

Now TSearch2 needs to be told to look at the dictionary to parse querries so edit the contents of pg_ts_cfgmap and set lhword, lpart_hword and lword to {en_ispell,en_stem} for the default confirmation.

You can test this is working running

SELECT lexize('en_ispell', 'conditionally');
which should give you
           lexize
        -----------------------------
         {conditionally,conditional}
        (1 row)

I'm not giving a detailed guide to querying under TSearch, however to extend our example above

CREATE TABLE fulltext
(
  filename varchar(100),
  contents text,
  idx tsvector 
) 
CREATE INDEX ftidx ON fulltext USING gist(idx);
CREATE TRIGGER tsvectorupd BEFORE UPDATE OR INSERT ON fulltext
  FOR EACH ROW EXECUTE PROCEDURE
    tsearch2(idx, contents, filename);
gives us a table with a column idx which contains the TSearch index holding data from the contents of the document and the filename, we've also created a trigger which updates the column when a row is changed.

Now we can query using something such as

SELECT * FROM fulltext WHERE idx @@ 
  to_tsquery('default', 'keyword');

Providing rankings is a bit more involved, however lets assume we want to give more prominence to keywords in the filename to the content of the document, we need to weight the results in our index

UPDATE fulltext SET idx = 
  (setweight(to_tsvector(filename), 'A') 
  || setweight(to_tsvector(contents), 'B'));

We can query using something like

SELECT 
  filename, 
  ts_rank('{0,0,0.10,0.90}', idx, q) as therank, q
FROM 
  fulltext, 
  to_tsquery(' keyword ') as q
WHERE 
  idx @@ q
ORDER BY 
  therank DESC
where 0.1 and 0.9 providing weighting to results from B and A respectively.

Sources -

Add a comment | No comments | View entry