Python – Filter out HTML tags and resolve entities in Python

html, python

Because regular expressions scare me, I'm trying to find a way to remove all HTML tags and resolve HTML entities from a string in Python.

Best Solution

Use lxml, which is arguably the best XML/HTML library for Python.

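import lxml.html

# Parse the markup, then collect the text of every element; entities are decoded during parsing.
t = lxml.html.fromstring("...")
print(t.text_content())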

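If you only want to sanitize the HTML rather than strip it, look at the lxml.html.clean module (in recent lxml releases this code lives in the separate lxml_html_clean package).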
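
If installing lxml is not an option, roughly the same result can be sketched with the standard library's html.parser. This is an alternative to the lxml approach, not part of it: the names TextExtractor and strip_html are made up for illustration, and the sample markup is just an assumed input. Like the lxml version, it uses no regular expressions.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects text nodes; tags are dropped and entities are decoded."""

    def __init__(self):
        # convert_charrefs=True makes the parser decode references such as
        # &amp; before the text reaches handle_data.
        super().__init__(convert_charrefs=True)
        self._chunks = []

    def handle_data(self, data):
        # Called for every run of text between tags.
        self._chunks.append(data)

    def text(self):
        return "".join(self._chunks)


def strip_html(markup):
    # Hypothetical helper: returns the text content of an HTML fragment.
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any text still buffered by the parser
    return parser.text()


print(strip_html("<p>Fish &amp; <b>chips</b> &copy; 2024</p>"))
```

Note that this sketch keeps the bodies of script and style elements as text, so filter those out separately if the input may contain them.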
