Javascript – Python library for rendering HTML and javascript


Is there any python module for rendering a HTML page with javascript and get back a DOM object?

I want to parse a page which generates almost all of its content using javascript.

Best Solution

The big complication here is emulating the full browser environment outside of a browser. You can use stand alone javascript interpreters like Rhino and SpiderMonkey to run javascript code but they don't provide a complete browser like environment to full render a web page.

If I needed to solve a problem like this I would first look at how the javascript is rendering the page, it's quite possible it's fetching data via AJAX and using that to render the page. I could then use python libraries like simplejson and httplib2 to directly fetch the data and use that, negating the need to access the DOM object. However, that's only one possible situation, I don't know the exact problem you are solving.

Other options include the selenium one mentioned by Ɓukasz, some kind of webkit embedded craziness, some kind of IE win32 scripting craziness or, finally, a pyxpcom based solution (with added craziness). All these have the drawback of requiring pretty much a fully running web browser for python to play with, which might not be an option depending on your environment.