My application allows user to upload pdf files and store them on the webserver for later viewing. I store the name of the file, location, size, upload date, user name etc in an SQL server database.
I'd like to be able to programatically, just after a file is uploaded, generate a list of keywords (maybe everything except common words) and store them in the sql database as well so that subsequent users can do keyword searches…
Suggestions on how to approach this task? Does these type of routine already exist?
EDIT: Just to clarify my requirements, I wouldn't be concerned with doing OCR, I don't know the insides' of PDF's, but I understand that if it was generated by an app, such as Word->PDF Print, the text of the document is searchable…so really my first task, and the intent of my question is, how do I access the text of a PDF file from an asp.net app? OCR on scanned PDF's is probably beyond my requirements at this point.