I am scraping XML in R using xpathSApply (in the XML package) and having trouble pulling attributes out.
First, a relevant snippet of XML:
<div class="offer-name">
<a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
</div>
I have successfully pulled the 'Fancy Product' (i.e. element?) using:
Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue)
That took some time (I'm a n00b), but the documentation is good and there are several answered questions here I was able to leverage. I can't figure out how to pull the "http://www.somesite.com" out though (attribute?). I've speculated that it involves changing the 3rd term from 'xmlValue' to 'xmlGetAttr' but I could be totally off.
FYI: (1) there are 2 more parent <div> elements above the snippet I pasted, and (2) here is the abbreviated, complete-ish code (which I don't think is relevant, but I've included it for the sake of completeness):
library(XML)
library(httr)
content2 = paste(readLines(file.choose()), collapse = "\n") # User will select file.
parsedHTML = htmlParse(content2,asText=TRUE)
Products <- xpathSApply(parsedHTML, "//div[@class='offer-name']", xmlValue)
Best Solution
The href is an attribute. You can select the appropriate node, //div/a, and use the xmlGetAttr function with name = "href".
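A minimal sketch of that call, assuming the same structure as the snippet in the question. To keep it self-contained, the snippet is inlined as a string instead of being read from a file; the variable names html and Links are illustrative and not from the original post.

```r
library(XML)

# Stand-in for the user-selected file: the snippet from the question as text.
html <- '<div class="offer-name">
<a href="http://www.somesite.com" itemprop="name">Fancy Product</a>
</div>'
parsedHTML <- htmlParse(html, asText = TRUE)

# Select the <a> child of each offer-name div, then read its href attribute.
# Extra arguments after the function are passed on to xmlGetAttr, so "href"
# is used as its name argument.
Links <- xpathSApply(parsedHTML, "//div[@class='offer-name']/a", xmlGetAttr, "href")
print(Links)
```

With the real file, only the XPath and the last two arguments change relative to the xmlValue call in the question; the htmlParse step stays the same.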