Html – Regex to select all image html tags conditionally on the src value

html regex

I need a regex to do the following (unfortunately it has to be a regex, I can't code this because it's working within a purchased product):

I'd like to select all image tags in a chunk of html where either the image tag does not contain a class attribute, or, if it does contain a class attribute, that attribute does not contain a specific string at the beginning. Basically, I want to strip (by matching) all image tags from a chunk of html EXCEPT for images with a particular class applied to them.

This could be two separate regular expressions – I just want to match them – not extract any data.

So, for example, let's say the class I want to keep is called Pretty.

I'd like the regex to match:

<img src="xx"/>
<img border="x" src="xx"/>
<img whatever other attributes src="xx"/>
<img class="ugly" src="xx"/>
<img whatever other attributes class="fugly" src="xx"/>

but not match

<img class="Pretty" src="xx"/>
<img whatever other attributes class="Pretty" src="xx"/>
<img class="Pretty subpretty" src="xx"/>

If it's easier to do in two regexes (one to match all image tags without a class attribute, and one to match those whose class attribute doesn't start with 'Pretty'), that's totally fine too.

Best Answer

Use XPath instead, as that's what it's for:

//img[not(contains(@class,'Pretty'))]

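This XPath expression selects every img element whose class attribute does not contain the string 'Pretty'. It should also match elements that lack the class attribute altogether, because XPath 1.0 converts a missing attribute to the empty string before contains() is evaluated. Note that contains() looks for the string anywhere in the attribute; if it must appear at the beginning, as in the question, starts-with(@class,'Pretty') is the stricter test.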
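To make the selection rule concrete, here is a minimal sketch in Python that mirrors the expression using only the standard library. The predicate is written as plain Python rather than passed as an XPath string, because the XPath subset in xml.etree.ElementTree does not support not() or contains(). The sample markup and the name images_without_class are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Made-up sample; it must be well-formed XML for ElementTree to parse it.
SAMPLE = """<div>
  <img src="a"/>
  <img class="ugly" src="b"/>
  <img class="Pretty" src="c"/>
  <img class="Pretty subpretty" src="d"/>
</div>"""


def images_without_class(root, needle="Pretty"):
    """Mirror //img[not(contains(@class, needle))] in plain Python."""
    selected = []
    for img in root.iter("img"):
        # A missing attribute becomes "", just as XPath 1.0 converts an
        # empty node-set to the empty string before calling contains().
        css_class = img.get("class", "")
        if needle not in css_class:
            selected.append(img)
    return selected


root = ET.fromstring(SAMPLE)
matched_srcs = [img.get("src") for img in images_without_class(root)]
print(matched_srcs)  # ['a', 'b']
```

The image with no class attribute ("a") is selected along with the "ugly" one, while both "Pretty" images are left alone.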

Parsing XML and HTML with regular expressions is usually a very bad idea. Of course, XPath only works if the HTML in question is well-formed XML (strict XHTML, for instance). If it isn't, you might have to fall back to something else, but even so, regex isn't the right tool for the job.

Addendum: I was wrong about getting back to this in 30 minutes. Something came up and I don't have the time to sort it out. If it doesn't work for elements lacking the class attribute, use the following expression:

//img[(not(@class)) or (not(contains(@class,'Pretty')))]
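For the situation described in the question, where only a regex can be used, a negative lookahead is one possible approach. The following is a rough sketch, not a robust solution: it assumes quoted attribute values, no ">" inside attribute values, and a lowercase img tag, and all the caveats above about parsing HTML with regular expressions still apply. KEEP_PREFIX, IMG_WITHOUT_CLASS and strip_images are hypothetical names, and Python's re module is used here only to exercise the pattern.

```python
import re

# Hypothetical class prefix to keep, taken from the question's example.
KEEP_PREFIX = "Pretty"

# Match an <img ...> tag unless a class attribute inside the same tag
# starts with KEEP_PREFIX. The lookahead cannot run past the first ">",
# so it only inspects the current tag.
IMG_WITHOUT_CLASS = re.compile(
    r"<img\b(?![^>]*\sclass\s*=\s*[\"']" + re.escape(KEEP_PREFIX) + r")[^>]*>"
)


def strip_images(html: str) -> str:
    """Remove every matched img tag from the given HTML fragment."""
    return IMG_WITHOUT_CLASS.sub("", html)


sample = (
    '<p><img src="a"/>'
    '<img border="0" class="ugly" src="b"/>'
    '<img class="Pretty subpretty" src="c"/></p>'
)
stripped = strip_images(sample)
print(stripped)  # <p><img class="Pretty subpretty" src="c"/></p>
```

With KEEP_PREFIX set to "Pretty", the bare pattern is <img\b(?![^>]*\sclass\s*=\s*["']Pretty)[^>]*>, which handles both cases from the question (no class attribute, or a class that does not start with Pretty) in a single expression.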