Fun with hpricot

May 15th, 2007

Those who know me personally know that I’m a pretty huge video game nerd. I’ve been wanting to learn hpricot better, so I decided to combine the two interests. I give you vgdod.rb, which scrapes info about Amazon Video Games’ Deal of the Day. It could probably be optimized, and the XPath could likely be better, but it gets the job done!

It was interesting to try to write XPath for HTML that I didn’t control… this was the first time I’ve had that particular joy, and it was a good exercise. I think we (web developers, that is) sometimes forget that other people might want to manipulate our HTML at some point. Anyway, the code:

#!/usr/bin/env ruby

%w(rubygems hpricot open-uri shorturl).each { |g| require g }

# Fetch a path on amazon.com and parse the response with Hpricot.
def fetch(url = '')
  Hpricot(open("http://www.amazon.com" + url))
end

# Find the Video Games link on the home page, then the Deal of the Day image on that page.
vg_url = fetch.at("a[text()=Video Games]")[:href]
dotd_img = fetch(vg_url).search("img").select { |e| e[:src] =~ /deal-of-the-day/ }.first
# The image sits inside a link to the deal's product page.
dotd = fetch(dotd_img.parent[:href])

# title -- price -- platform -- url
printf "%s -- %s -- %s -- %s",
  dotd.at("//div.buying//b.sans").inner_text.strip,
  dotd.search("//b.price").last.inner_text.strip,
  dotd.search("//div[@class=buying]")[3].inner_text.gsub(/(\ |\s|Platform:)/, ''),
  WWW::ShortURL.shorten("http://www.amazon.com#{dotd_img.parent[:href]}")
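
The platform cleanup in the third printf argument is the fiddliest bit, so here is a rough sketch of it on its own, using only core Ruby. The sample string is made up (it is not real Amazon markup), and the sketch assumes the escaped character in the pattern is there to catch a non-breaking space. Stripping every whitespace character also glues multi-word platform names together; the second version is one possible alternative, not what vgdod.rb does.

```ruby
# Hypothetical sample of the text inside the platform <div>.
raw = "\n  Platform:\u00A0 Nintendo Wii \n"

# Approach in the spirit of the script: delete the label and all whitespace.
# Ruby's \s does not match U+00A0, so the non-breaking space is listed explicitly.
squashed = raw.gsub(/(\u00A0|\s|Platform:)/, '')

# Alternative: drop the label, turn non-breaking spaces into plain spaces,
# trim the ends, and collapse repeated spaces so inner spacing survives.
tidy = raw.sub('Platform:', '').tr("\u00A0", ' ').strip.squeeze(' ')

puts squashed  # NintendoWii
puts tidy      # Nintendo Wii
```

Which one is preferable probably depends on whether the output is meant for people or for another script.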

3 Responses to “Fun with hpricot”

  1. Steve Says:

    You know about the XPath discovery thingie in Firebug, right? It’s Firefox only, but it’s awesome…

    All you need to do to get the full XPath for any element is 1) fire up Firebug on the page, 2) hit Ctrl+Shift+C, 3) click the element on the page, 4) long click/right click the element in Firebug, 5) Copy XPath.

    And fuggetaboutit.

    :-) Steve

  2. Ben Bleything Says:

    Steve- .... no, I did not. Jeez, Firebug really does everything, doesn’t it? I’ve been using an extension called XPather, but it kinda sucks. I’ll check out the Firebug way of doing it.

    Thanks :D

  3. Steve Says:

    @Ben – One more thing. There are also the Hpricot methods .xpath and .css_path. Those return the fully qualified Hpricot paths for the currently selected element, which can be helpful, since Hpricot doesn’t like things like tbody and p tags. Unlike the Firebug trick, though, in this case you need to already know how to find the element.

    :-) Steve

    p.s. Never heard of XPather. Sounds like I’ll keep it that way, for now. :-)
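
The idea in that last comment, finding an element first and then asking it for its own path, can be sketched without Hpricot installed. The example below swaps in REXML, the XML parser distributed with Ruby, whose Element#xpath plays a similar role to the Hpricot .xpath method mentioned above. The markup is invented for illustration, and REXML needs well-formed input, so treat this as a stand-in rather than a drop-in replacement.

```ruby
require 'rexml/document'

# Made-up, well-formed sample markup. REXML is an XML parser, so unlike
# Hpricot it will not accept sloppy real-world HTML.
html = <<~XML
  <html><body><div>
    <p>one</p>
    <p><a href="/deal">Deal of the Day</a></p>
  </div></body></html>
XML

doc = REXML::Document.new(html)

# Step 1: find the element some other way (here, the first <a> in the document).
link = REXML::XPath.first(doc, '//a')

# Step 2: ask the element for its own absolute path.
path = link.xpath
puts path
```

As the comment points out, this only helps once the element has already been located; it reports a path rather than discovering one.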
