How can I convert a Google search query into something I can feed to PostgreSQL's to_tsquery()?
If there's no existing library out there, how should I go about parsing a Google search query in a language like PHP?
For example, I'd like to take the following Google-ish search query:
("used cars" OR "new cars") -ford -mitsubishi
And turn it into a to_tsquery()-friendly string:
('used cars' | 'new cars') & !ford & !mitsubishi
I can fudge this with regexes, but that's the best I can do. Is there some robust lexical analysis method of going about this? I'd like to be able to support extended search operators too (like Google's site: and intitle:) that will apply to different database fields, and thus would need to be separated from the tsquery string.
UPDATE: I realize that with special operators this becomes a Google to SQL WHERE-clause conversion, rather than a Google to tsquery conversion. But the WHERE clause may contain one or more tsqueries.
For example, the Google-style query:
((color:blue OR "4x4") OR style:coupe) -color:red used
Should produce an SQL WHERE-clause like this:
WHERE to_tsvector(description) @@ to_tsquery('used')
AND color <> 'red'
AND ( (color = 'blue' OR to_tsvector(description) @@ to_tsquery('4x4'))
OR style = 'coupe'
);
I'm not sure whether the above is possible with regex alone.
Best Solution
Honestly, I think regular expressions are the way to go with something like this. Just the same, this was a fun exercise. The code below is very much a prototype; in fact, you'll see that I didn't even implement the lexer itself and just faked its output. I'd like to continue it, but I don't have more spare time today.
Also, there is definitely a lot more work to be done here in terms of supporting other types of search operators and the like.
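For what it's worth, the lexer step that the prototype fakes could be sketched roughly as follows. This is only an illustration in Python rather than PHP. The token kind names (LPAREN, NOT, FIELD and so on) and the tokenize() function are made up for this sketch, and it assumes that a leading "-" always means negation and that field operators look like name:value.

```python
import re

# Hypothetical token kinds for this sketch; one named group per kind.
TOKEN_RE = re.compile(r"""
    \s*(?:
        (?P<LPAREN>\()
      | (?P<RPAREN>\))
      | (?P<NOT>-)                                  # leading minus = negation
      | (?P<OR>OR\b)                                # uppercase OR keyword
      | (?P<FIELD>[A-Za-z_]+:(?:"[^"]*"|[^\s()]+))  # e.g. color:blue
      | "(?P<PHRASE>[^"]*)"                         # quoted phrase
      | (?P<WORD>[^\s()"]+)                         # bare word
    )
""", re.VERBOSE)


def tokenize(query):
    """Split a Google-style query into (kind, value) tuples."""
    tokens = []
    query = query.rstrip()
    pos = 0
    while pos < len(query):
        m = TOKEN_RE.match(query, pos)
        if m is None:
            # Typically an unbalanced double quote.
            raise ValueError(f"cannot tokenize at position {pos}: {query[pos:]!r}")
        kind = m.lastgroup
        text = m.group(kind)
        if kind == "FIELD":
            # Keep the field name and its value apart so a later stage can
            # route them to a column comparison instead of the tsquery.
            name, _, value = text.partition(":")
            tokens.append((kind, (name, value.strip('"'))))
        else:
            tokens.append((kind, text))
        pos = m.end()
    return tokens


print(tokenize('-color:red used'))
```

Keeping FIELD as its own token kind is what would later let field operators be separated from the full-text terms.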
Basically, the idea is that a certain type of query is lexed, then parsed into a common format (in this case, a QueryExpression instance), which is then rendered back out as another type of query.
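To make the parse-then-render idea concrete, here is a minimal sketch in Python (not the PHP prototype). It starts from faked lexer output, uses plain tuples as the common format instead of a QueryExpression class, and renders a to_tsquery()-style string. The Parser class and to_tsquery_text() are hypothetical names. The sketch assumes OR binds tighter than the implicit AND between adjacent terms, and it does not handle field operators such as color:blue.

```python
import re

# Faked lexer output for: ("used cars" OR "new cars") -ford -mitsubishi
TOKENS = [
    ("LPAREN", "("), ("PHRASE", "used cars"), ("OR", "OR"),
    ("PHRASE", "new cars"), ("RPAREN", ")"),
    ("NOT", "-"), ("WORD", "ford"),
    ("NOT", "-"), ("WORD", "mitsubishi"),
]


class Parser:
    """Recursive-descent parser producing nested tuples (the common format)."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos][0] if self.pos < len(self.tokens) else None

    def take(self):
        token = self.tokens[self.pos]
        self.pos += 1
        return token

    def parse(self):
        node = self.conjunction()
        if self.peek() is not None:
            raise ValueError(f"unexpected token {self.tokens[self.pos]!r}")
        return node

    def conjunction(self):
        # Adjacent operands are implicitly ANDed together.
        parts = [self.disjunction()]
        while self.peek() not in (None, "RPAREN"):
            parts.append(self.disjunction())
        return parts[0] if len(parts) == 1 else ("and", parts)

    def disjunction(self):
        parts = [self.unary()]
        while self.peek() == "OR":
            self.take()
            parts.append(self.unary())
        return parts[0] if len(parts) == 1 else ("or", parts)

    def unary(self):
        if self.peek() == "NOT":
            self.take()
            return ("not", self.unary())
        return self.primary()

    def primary(self):
        kind = self.peek()
        if kind in ("WORD", "PHRASE"):
            return ("term", self.take()[1])
        if kind == "LPAREN":
            self.take()
            node = self.conjunction()
            if self.peek() != "RPAREN":
                raise ValueError("missing closing parenthesis")
            self.take()
            return node
        raise ValueError(f"unexpected token kind {kind!r}")


def to_tsquery_text(node, parent=None):
    """Render the tuple tree as a string intended for to_tsquery()."""
    kind = node[0]
    if kind == "term":
        text = node[1]
        if re.fullmatch(r"\w+", text):
            return text
        # Quote anything that is not a plain word; double quotes and backslashes.
        escaped = text.replace("\\", "\\\\").replace("'", "''")
        return f"'{escaped}'"
    if kind == "not":
        return "!" + to_tsquery_text(node[1], "not")
    joiner = " & " if kind == "and" else " | "
    text = joiner.join(to_tsquery_text(child, kind) for child in node[1])
    # Parenthesize any nested group so precedence stays explicit.
    return f"({text})" if parent is not None else text


ast = Parser(TOKENS).parse()
print(to_tsquery_text(ast))  # ('used cars' | 'new cars') & !ford & !mitsubishi
```

The resulting string is best passed to to_tsquery() as a bound parameter rather than spliced into the SQL text. A second renderer over the same tuple tree could emit a WHERE clause instead, which is where field operators would be mapped to column comparisons.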