Java – Regular expression to extract image url from html code

html-parsingimageurljavaregex

I wanted to extract Url of image from html code, e.g. html code below:

<div class="imageContainer">
   <img src="http://ecx.images-amazon.com/images/I/41%2B7N48F7JL._SL135_.jpg"
      alt="" width="135" height="94"
      style="margin-top: 21px; margin-bottom:20px;" /></div>

And I got a code from net

String regexImage = "(?<=<img (*)src=\")[^\"]*";
Pattern pImage = Pattern.compile(regexImage);
Matcher mImage = pImage.matcher(elementString);
while (mImage.find()) {
   String imagePath = mImage.group();}

which is working and has re(regular expression)

"(?<=<img src=\")[^\"]*"

But now I want to extract image url from html code like below :

<img onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   src="http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg"
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>
<div class="bp-offer-image image-offer"></div>

where there is code between img and src=

I'm trying the regular expression as "(?<=<img (*)src=\")[^\"]*"
but its not working. So please give me regular expression so that i can extract image url i.e. http://ecx.images-amazon.com/images/I/61xqOQ3Sj8L._SL135_.jpg from above html code.

And, first I'm using Jsoup to parse html to extract tags containing img :

doc = Jsoup.connect(urlFromBrowse).get();
            Elements elements = doc.getElementsByTag("img");

            for (Element element : elements) {
                String elementString = element.toString();

and passed this elementString to matcher() meathod. And from the tag(element) that I'm getting, I'm using regular expression to parse image url, name etc things.

Best Answer

This post is an answer to the question, not a guideline.

The question was not "RegExp vs DOM", the question was "Regular expression to extract image url from html code".

Here it is:

String htmlFragment =
   "<img onerror=\"img_onerror(this);\" data-logit=\"true\" data-pid=\"MOBDDDBRHVWQZHYY\"\n" + 
   "   data-imagesize=\"thumb\"\n" + 
   "   data-error-url=\"http://img1a.flixcart.com/mob/thumb/mobile.jpg\"\n" + 
   "   src=\"http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg\"\n" + 
   "   alt=\"Samsung Galaxy S Duos S7562: Mobile\"\n" + 
   "   title=\"Samsung Galaxy S Duos S7562: Mobile\"></img></a>";
Pattern pattern =
   Pattern.compile( "(?m)(?s)<img\\s+(.*)src\\s*=\\s*\"([^\"]+)\"(.*)" );
Matcher matcher = pattern.matcher( htmlFragment );
if( matcher.matches()) {
   System.err.println(
      "OK:\n" +
      "1: '" + matcher.group(1) + "'\n" +
      "2: '" + matcher.group(2) + "'\n" +
      "3: '" + matcher.group(3) + "'\n" );
}

and the ouput:

OK:
1: 'onerror="img_onerror(this);" data-logit="true" data-pid="MOBDDDBRHVWQZHYY"
   data-imagesize="thumb"
   data-error-url="http://img1a.flixcart.com/mob/thumb/mobile.jpg"
   '
2: 'http://img8a.flixcart.com/image/mobile/h/y/y/samsung-galaxy-s-duos-s7562-125x125-imadddczzr4qhqnc.jpeg'
3: '
   alt="Samsung Galaxy S Duos S7562: Mobile"
   title="Samsung Galaxy S Duos S7562: Mobile"></img></a>'