C# – Regex for HTML parsing


I'm trying to parse an HTML page and extract two values from a table row.
The HTML for the table row is as follows:

<td title="Associated temperature in (ºC)" class="TABLEDATACELL" nowrap="nowrap" align="Left" colspan="1" rowspan="1">Max Temperature (ºC)</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1">6</td>
<td class="TABLEDATACELLNOTT" nowrap="nowrap" align="Center" colspan="1" rowspan="1"> 13:41:30</td>

and the expression I have at the moment is:

<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]

However, I don't seem to be able to extract any matches.
Could anyone point me in the right direction? Thanks.

Best Solution

Parsing HTML reliably with regular expressions is notoriously difficult.
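To make that concrete, here is a minimal sketch that runs the pattern from the question against the three sample cells. As far as I can tell it fails for at least two reasons: the sample contains no `<tr>` (and `[\s]` only allows exactly one whitespace character after it), and `[\w\s]*` cannot get past the parentheses in "(ºC)". The `LoosenedPattern`, the class name and the `Extract` helper are my own assumptions for illustration, not something from the question. The loosened pattern does match this exact markup, but any change to cell order, nesting or an attribute containing `>` would break it again.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexBrittlenessDemo
{
    // The three cells from the question, joined with newlines.
    public const string SampleRow =
        "<td title=\"Associated temperature in (ºC)\" class=\"TABLEDATACELL\" nowrap=\"nowrap\" align=\"Left\" colspan=\"1\" rowspan=\"1\">Max Temperature (ºC)</td>\n" +
        "<td class=\"TABLEDATACELLNOTT\" nowrap=\"nowrap\" align=\"Center\" colspan=\"1\" rowspan=\"1\">6</td>\n" +
        "<td class=\"TABLEDATACELLNOTT\" nowrap=\"nowrap\" align=\"Center\" colspan=\"1\" rowspan=\"1\"> 13:41:30</td>";

    // Pattern from the question, unchanged.
    public const string OriginalPattern =
        @"<tr>[\s]<td[^<]+?>Max Temperature[\w\s]*</td>[\s]";

    // Hypothetical loosened pattern (an assumption, not the asker's):
    // no <tr> required, any amount of whitespace between cells,
    // and any non-'<' text after the label.
    public const string LoosenedPattern =
        @"<td[^>]*>\s*Max Temperature[^<]*</td>\s*" +
        @"<td[^>]*>\s*(?<reading>[^<]*?)\s*</td>\s*" +
        @"<td[^>]*>\s*(?<time>[^<]*?)\s*</td>";

    public static bool OriginalMatches(string html) =>
        Regex.IsMatch(html, OriginalPattern);

    // Returns the two cell values after the label, or null if nothing matched.
    public static (string Reading, string Time)? Extract(string html)
    {
        Match m = Regex.Match(html, LoosenedPattern, RegexOptions.IgnoreCase);
        if (!m.Success) return null;
        return (m.Groups["reading"].Value, m.Groups["time"].Value);
    }

    public static void Main()
    {
        Console.WriteLine(OriginalMatches(SampleRow)); // prints "False"

        var result = Extract(SampleRow);
        Console.WriteLine(result.HasValue
            ? $"{result.Value.Reading} at {result.Value.Time}" // prints "6 at 13:41:30"
            : "no match");
    }
}
```

Even with a `<tr>` and a newline put in front of the sample, the original pattern still does not match, which is the kind of fragility that makes a real parser the safer choice.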

I would look for an HTML parsing library or a "screen scraping" library instead ;)
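As one possible illustration, here is a rough sketch using Html Agility Pack, a third-party NuGet package for .NET (my choice for the example; the answer does not name a specific library). It assumes the label cell contains the text "Max Temperature" and that the two values sit in the next two `<td>` siblings, as in the sample row. The class and method names are hypothetical.

```csharp
using System;
using HtmlAgilityPack; // third-party NuGet package, not part of .NET itself

public static class ParserSketch
{
    // Returns the two cell values following the "Max Temperature" label,
    // or null if the expected cells are not found.
    public static (string Reading, string Time)? Extract(string html)
    {
        var doc = new HtmlDocument();
        doc.LoadHtml(html);

        // Locate the label cell by its text rather than by its attributes.
        HtmlNode label = doc.DocumentNode.SelectSingleNode(
            "//td[contains(., 'Max Temperature')]");
        if (label == null) return null;

        // SelectNodes returns null (not an empty collection) when nothing matches.
        HtmlNodeCollection cells = label.SelectNodes("following-sibling::td");
        if (cells == null || cells.Count < 2) return null;

        // Decode entities and trim the padding seen in the sample (" 13:41:30").
        string reading = HtmlEntity.DeEntitize(cells[0].InnerText).Trim();
        string time = HtmlEntity.DeEntitize(cells[1].InnerText).Trim();
        return (reading, time);
    }
}
```

Because the lookup goes through the parsed tree, changes in attribute order, whitespace or line endings should not affect it. Note that with nested tables the XPath above could select an outer cell first, so a real page may need a more specific expression.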

If the HTML comes from an untrusted source, take extra care to handle malformed or malicious markup correctly. Mishandled HTML is a common source of security vulnerabilities.