How to spell check a website

spell-checking

I know that spellcheckers are not perfect, but they become more useful as the amount of text increases. How can I spell check a site that has thousands of pages?

Edit: Because of complicated server-side processing, the only way I can get the pages is over HTTP. Also, it cannot be outsourced to a third party.

Edit: I have a list of all of the URLs on the site that I need to check.

Best Solution

Lynx seems to be good at getting just the text I need (body content and alt text) and ignoring what I don't need (embedded JavaScript and CSS).

lynx -dump http://www.example.com

It also lists all of the URLs in the page (converted to their absolute form), which can be filtered out with grep:

lynx -dump http://www.example.com | grep -v "http"
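For a quick check of a single page, the filtered dump can be piped straight into a spellchecker. This is only a sketch and assumes aspell is installed; aspell's list mode reads text on stdin and prints the words it does not recognise:

lynx -dump http://www.example.com | grep -v "http" | aspell list | sort -u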

The URLs could also be local (file://) if I have used wget to mirror the site.
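If mirroring first is an option, something along these lines should produce a local copy that the same lynx command can read. This is a sketch; the exact wget options will depend on the site, and the file path below is just a placeholder for wherever wget puts the mirror:

wget --mirror --no-parent http://www.example.com/
lynx -dump file:///path/to/mirror/index.html | grep -v "http"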

I will write a script that will process a set of URLs using this method and output each page to a separate text file. I can then use an existing spellchecking solution to check the files (or a single large file combining all of the small ones). A rough sketch is below.
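A rough version of that script, assuming a plain text file urls.txt with one URL per line (the file and directory names here are placeholders):

#!/bin/sh
# Dump the body text of each URL to its own file, stripping link lines.
mkdir -p dumps
n=0
while IFS= read -r url; do
    n=$((n + 1))
    lynx -dump "$url" | grep -v "http" > "dumps/page-$n.txt"
done < urls.txt
# Optionally combine everything into one large file for a single pass.
cat dumps/*.txt > all-pages.txt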

This will ignore text in title and meta elements. These can be spellchecked separately.
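A separate pass over the raw HTML could handle those. As a sketch, the title text might be pulled out with curl and sed, assuming the title sits on a single line, which will not hold for every page:

curl -s http://www.example.com | sed -n 's:.*<title>\(.*\)</title>.*:\1:p' | aspell list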
