Some time ago I put up a simple guide to getting PHP talking to Microsoft Index Server on a windows machine. Index server is pretty neat, it ranks results, indexes common document types and generally just works but most servers I work with now are some form of *nix so a re-think was needed. Really I wanted to be able to index document contents, to have some better form of searching than simply doing WHERE data LIKE '%[mykeyword]%' and to provide some kind of score to how accurate a match a document might be.
A bunch of great tools available (mostly as opensource projects which run under *nix, cygwin or sometimes windows) which provide a means of extracting plain text from various document formats, the ones I've used to date are
Although I didn't spend long looking, I'm pretty sure similar tools can be found for extracting text from HTML or RTF files without all the formatting.
Now we have content, we need to search it. The plain text content of these files can be stored in a PostgreSQL text column, for example a simple table such as
CREATE TABLE fulltext ( filename varchar(100), contents text )
would permit a very basic level of searching using the SQL LIKE operator which isn't great. Fortunately there is an extension for PostgreSQL called TSearch2 which ships with the official distributions and adds full text searching (FTS), much like SQL Server and mySQL. Adding TSearch2 is pretty simple and covered on the official TSearch2 site for *nix, Windows users get the option when installing PostgreSQL to automatically enable the extension for each new database created or you can simply run share\contrib\tsearch2.sql to enable the extension on a database by database basis.
TSearch2 needs to know how language works so it's important to configure the system properly, I've opted to edit the default configuration for testing instead of setting something new up. To do this, grab a copy of the ISpell British dictionaries (see official TSearch2 site for links to ISpell dictionaries). In order to tell TSearch to use this dictionary you need to add a row into the pg_ts_dict table, assuming the med and aff files are stored in /test/
INSERT INTO pg_ts_dict (dict_name, dict_init, dict_initoption,
dict_lexize, dict_comment)
VALUES ('en_ispell', 'spell_init(internal)',
'DictFile="/test/english.med",'
|| 'AffFile="/test/english.aff",'
|| 'StopFile="/test/english.stop"',
'spell_lexize(internal,internal,integer)',
'English ISpell interface');
Now TSearch2 needs to be told to look at the dictionary to parse querries so edit the contents of pg_ts_cfgmap and set lhword, lpart_hword and lword to {en_ispell,en_stem} for the default confirmation.
You can test this is working running
SELECT lexize('en_ispell', 'conditionally');
which should give you
lexize
-----------------------------
{conditionally,conditional}
(1 row)
I'm not giving a detailed guide to querying under TSearch, however to extend our example above
CREATE TABLE fulltext
(
filename varchar(100),
contents text,
idx tsvector
)
CREATE INDEX ftidx ON fulltext USING gist(idx);
CREATE TRIGGER tsvectorupd BEFORE UPDATE OR INSERT ON fulltext
FOR EACH ROW EXECUTE PROCEDURE
tsearch2(idx, contents, filename);
gives us a table with a column idx which contains the TSearch index holding data from the contents of the document and the filename, we've also created a trigger which updates the column when a row is changed.
Now we can query using something such as
SELECT * FROM fulltext WHERE idx @@
to_tsquery('default', 'keyword');
Providing rankings is a bit more involved, however lets assume we want to give more prominence to keywords in the filename to the content of the document, we need to weight the results in our index
UPDATE fulltext SET idx = (setweight(to_tsvector(filename), 'A') || setweight(to_tsvector(contents), 'B'));
We can query using something like
SELECT
filename,
ts_rank('{0,0,0.10,0.90}', idx, q) as therank, q
FROM
fulltext,
to_tsquery(' keyword ') as q
WHERE
idx @@ q
ORDER BY
therank DESC
where 0.1 and 0.9 providing weighting to results from B and A respectively.
Sources -