Html – Best way to fetch a varying HTML tag


I'm trying to fetch some HTML from various blogs and have noticed that different providers use the same tag in different ways.

For example, here are two major providers that use the meta name generator tag differently:

  • Blogger: <meta content='blogger' name='generator'/> (content first, name later and, yes, single quotes!)
  • WordPress: <meta name="generator" content="" /> (name first, content later)

Is there a way to extract the value of content for all cases (single/double quotes, first/last in the row)?

P.S. Although I'm using Java, the answer would probably help more people if it were for regular expressions generally.

Best Solution

The answer is: don't use regular expressions.

Seriously. Use an SGML parser, or an XML parser if you happen to know the page is valid XML (which is almost never true). With regular expressions you will absolutely screw up and waste tons of time trying to get it right. Just use what's already available.
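
Since the question mentions Java, here is a minimal sketch of the parser approach using the HTML parser that ships with the JDK (javax.swing.text.html.parser.ParserDelegator). It is only one of several parsers you could pick, and the class name GeneratorMeta and the method generatorOf are made up for this illustration. The parser hands you the attributes already separated, so quote style and attribute order stop mattering.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;
import javax.swing.text.MutableAttributeSet;
import javax.swing.text.html.HTML;
import javax.swing.text.html.HTMLEditorKit;
import javax.swing.text.html.parser.ParserDelegator;

// Hypothetical helper class, named for this example only.
final class GeneratorMeta {

    // Returns the content of the first <meta name="generator"> tag, or null if there is none.
    static String generatorOf(String html) {
        final String[] found = {null};

        HTMLEditorKit.ParserCallback callback = new HTMLEditorKit.ParserCallback() {
            @Override
            public void handleSimpleTag(HTML.Tag tag, MutableAttributeSet attrs, int pos) {
                if (tag != HTML.Tag.META || found[0] != null) {
                    return; // not a meta tag, or we already have an answer
                }
                Object name = attrs.getAttribute(HTML.Attribute.NAME);
                if (name != null && "generator".equalsIgnoreCase(name.toString())) {
                    Object content = attrs.getAttribute(HTML.Attribute.CONTENT);
                    found[0] = (content == null) ? null : content.toString();
                }
            }
        };

        try (Reader reader = new StringReader(html)) {
            // The third argument (ignoreCharSet = true) tells the parser not to
            // abort when a meta tag declares a character set.
            new ParserDelegator().parse(reader, callback, true);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return found[0];
    }

    public static void main(String[] args) {
        // The two variants from the question: different quotes, different attribute order.
        String blogger = "<html><head><meta content='blogger' name='generator'/></head><body></body></html>";
        String wordpress = "<html><head><meta name=\"generator\" content=\"\" /></head><body></body></html>";
        System.out.println(generatorOf(blogger));
        System.out.println(generatorOf(wordpress));
    }
}
```

The same idea carries over to any other HTML parser: ask for the meta elements, check the name attribute, read the content attribute.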