How to index and search .doc files


I have an application that needs to have .doc files uploaded to it. These documents should then be indexed, and the whole collection of documents should be searchable. This will run on a Windows Server, without Word installed, using IIS and SQL Server, but I'd rather not be tied to SQL Server's full-text indexing.

I was thinking of using Lucene.Net for the indexing part and was wondering what the best way to get the text out of the .doc files would be. I could probably extract the text by reading in the whole stream and then using a regex to pull out any runs of ordinary characters, but that seems heavy-handed and prone to error.
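To make the regex idea concrete, here is a minimal sketch of what that brute-force extraction might look like. It is written in Python purely for illustration, and the function names are hypothetical. It assumes the text in a .doc file appears either as runs of single-byte printable characters or as UTF-16LE, and it knows nothing about the actual Word binary format.

```python
import re


def extract_ascii_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII bytes (space to tilde)."""
    pattern = re.compile(rb"[\x20-\x7e]{%d,}" % min_len)
    return [m.group().decode("ascii") for m in pattern.finditer(data)]


def extract_utf16_runs(data: bytes, min_len: int = 4) -> list[str]:
    """Return runs of at least min_len printable ASCII characters stored as UTF-16LE.

    Each such character is a printable byte followed by a zero byte.
    Non-ASCII text is ignored entirely, which is one of the weaknesses
    of this approach.
    """
    pattern = re.compile(rb"(?:[\x20-\x7e]\x00){%d,}" % min_len)
    return [m.group().decode("utf-16-le") for m in pattern.finditer(data)]
```

Run against a real document, a scan like this would likely return font names, metadata, and mail-merge field codes mixed in with the body text, and it cannot tell them apart, which is why it is prone to error.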

I saw an article on using IFilters that sounds promising, but I thought I'd put the question out there since it's not something I'm familiar with.

P.S. If it matters, these .doc files will have mail-merge fields in them and there's no other current alternative for the .doc format.

Best Solution

If you want a solution that doesn't require an external program, the IFilter approach looks like the way to go (although you might count an IFilter as an external program too).

Here's a simple CodeProject article, with code, showing how it can be done: http://www.codeproject.com/KB/cs/IFilter.aspx