168: Feed Parsing 

Below is the home page of a Rails blogging application. We’d like to make the site a little more useful by integrating information from another site into it. To do this we’ll add a list of links on the page that link to another Rails-related site, say a list of the most recent ASCIIcasts.

The home page of our blogging application

If we visit asciicasts.com we’ll see a list of the most recent episodes on the homepage, but how should we get this data on to our site? We could use screen-scraping: grabbing the HTML from the page and parsing it to extract the data we want. This works, but it has disadvantages. For example, if the site owner changes the structure of the page then our parsing code could well stop extracting the right data from it.

A much better way, where it’s available, is to use an RSS feed. The ASCIIcasts site has an RSS feed containing a list of the episodes so instead of scraping the site, we can just pull the data we need from the feed.

Feedzirra

There are a number of ways of parsing an RSS feed in Ruby, but one of the best is a gem called Feedzirra. The main advantage of Feedzirra is its speed; it parses feeds very quickly, but it is also useful as it can parse many different types of feed.
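
For comparison, here is a minimal sketch of one of those other ways, using the rss library that ships with Ruby rather than Feedzirra. The feed content below is made up for illustration, and on recent Ruby versions rss is a bundled gem that may need to be installed separately.

```ruby
require "rss"

# A tiny, made-up RSS 2.0 document standing in for a real feed.
xml = <<-XML
<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0">
  <channel>
    <title>Example Episodes</title>
    <link>http://example.com/</link>
    <description>A made-up feed for illustration</description>
    <item>
      <title>Episode One</title>
      <link>http://example.com/episodes/1</link>
    </item>
    <item>
      <title>Episode Two</title>
      <link>http://example.com/episodes/2</link>
    </item>
  </channel>
</rss>
XML

# The second argument turns off strict validation of the feed.
feed = RSS::Parser.parse(xml, false)

titles = feed.items.map { |item| item.title }
links  = feed.items.map { |item| item.link }

feed.items.each { |item| puts "#{item.title} -> #{item.link}" }
```

This only handles the feed formats the rss library knows about; the rest of this episode uses Feedzirra.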

To install Feedzirra we first need to make sure that http://gems.github.com is in our list of gem sources. If not we’ll need to add it.

gem sources -a http://gems.github.com

Now we can install the gem:

sudo gem install pauldix-feedzirra

Several dependencies will be installed alongside the gem. Once everything’s installed we’ll need to add a reference to the gem in our application’s /config/environment.rb file.

config.gem "pauldix-feedzirra", :lib => "feedzirra", :source => "http://gems.github.com"

That’s it. We’re ready to start parsing RSS feeds in our application.

Getting The Feed

We’re going to show the feed on the home page of our site, but we don’t want to fetch it every time a user visits that page, as fetching and parsing the feed are expensive operations that take time to run. It would be better to cache the feed locally. There are various ways we could cache the feed’s data; we’re going to store it in the database and create a new model, FeedEntry, to represent an entry in the feed. The model will have five attributes to store the entry’s data: name to store the headline, summary to store the content, url to store the entry’s link, published_at for the time the entry was created, and guid to store the entry’s unique identifier so that we can check for duplicates.

We’ll generate our model with

script/generate model feed_entry name:string summary:text url:string published_at:datetime guid:string

Then migrate the database to generate the table.

rake db:migrate

The logic for parsing the feed and updating the entries will be added to the FeedEntry class. To start off we’ll need a method that parses the feed and adds any new entries to the database. For this we’ll write a class method called update_from_feed.

def self.update_from_feed(feed_url)
  feed = Feedzirra::Feed.fetch_and_parse(feed_url)
  feed.entries.each do |entry|
    unless exists? :guid => entry.id
      create!(
        :name         => entry.title,
        :summary      => entry.summary,
        :url          => entry.url,
        :published_at => entry.published,
        :guid         => entry.id
      )
    end
  end
end

The method takes one parameter: the URL of the feed that Feedzirra will parse. It fetches the feed and parses it, then loops through each entry and adds it to the database unless it’s already there. The method uses ActiveRecord’s exists? method to search for an entry by its guid to see if the entry is already in the database.
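
The guid check is what makes the method safe to run repeatedly. As a rough illustration, here is the same idea in plain Ruby with an in-memory hash standing in for the feed_entries table. Entry and EntryStore are hypothetical names used only for this sketch; they are not part of Feedzirra or Rails.

```ruby
# Entry stands in for a parsed feed entry.
Entry = Struct.new(:id, :title, :url)

# EntryStore stands in for the feed_entries table, keyed by guid.
class EntryStore
  def initialize
    @rows = {}
  end

  # Plays the role of `exists? :guid => ...` in the model.
  def exists?(guid)
    @rows.key?(guid)
  end

  # Adds only the entries whose guid has not been seen before
  # and returns the number of entries added.
  def add_entries(entries)
    added = 0
    entries.each do |entry|
      next if exists?(entry.id)
      @rows[entry.id] = { :name => entry.title, :url => entry.url }
      added += 1
    end
    added
  end

  def count
    @rows.size
  end
end

store = EntryStore.new
feed = [
  Entry.new("guid-1", "First",  "http://example.com/1"),
  Entry.new("guid-2", "Second", "http://example.com/2")
]

first_run  = store.add_entries(feed)
second_run = store.add_entries(feed + [Entry.new("guid-3", "Third", "http://example.com/3")])

puts "first run added #{first_run}, second run added #{second_run}, total #{store.count}"
```

The second run only adds the one entry it hasn’t seen, which is the behaviour we want from update_from_feed.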

We can now go into the console and try out our new method to get the entries from the ASCIIcasts feed into our database.

>> FeedEntry.update_from_feed("http://asciicasts.com/episodes.xml")

There will be a delay of a few seconds while the feed is fetched and parsed and then we should see a long array of Feedzirra objects returned. Once it’s finished we should have our entries in the database.

>> FeedEntry.count
=> 61

If we were to run the command again it would only add any new entries that had been created since the last time we ran it. To keep the feed up to date we could set up a cron job to fetch the feed at a regular interval. If we were to do this we could use the Whenever gem that was covered in episode 164.
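
With Whenever, the schedule could look something like the fragment below. This is only a sketch: the hourly interval is an assumption, and the file is read by the Whenever gem rather than run directly.

```ruby
# config/schedule.rb
# Read by the Whenever gem, which turns it into a crontab entry.
# The one-hour interval is an arbitrary choice for this example.
every 1.hour do
  runner "FeedEntry.update_from_feed('http://asciicasts.com/episodes.xml')"
end
```

Running whenever --update-crontab from the application’s directory should then write the matching entry to the crontab.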

Now that we have our feed entries in the database we’ll modify our view code to show the most recent entries. At the top of the articles index view we’ll add the following code to render the ten most recent entries.

<div id="recent_episodes">
  <h3>Recent ASCIIcasts Episodes</h3>
  <ul>
  <% for entry in FeedEntry.all(:limit => 10, :order => "published_at desc") %>
    <li><%= link_to h(entry.name), entry.url %></li>
  <% end %>
  </ul>
</div>

With a little CSS we can style the div and make the list appear on the right-hand side of the articles page.

#recent_episodes { float: right; border: solid 1px #666; margin: 8px 0 16px 16px; padding: 4px; background-color: #DDD; }
#recent_episodes h3 { margin: 0; font-size: 1em; }
#recent_episodes ul { list-style: none; margin-left: 8px; padding-left: 0; }
#recent_episodes a { font-size: 0.9em; }

Our page now has a panel on it showing the most recent episodes.

The home page now shows the data from the feed.

More Frequent Updates

The code we’ve written works well when we don’t need to check for updates to the feed very often, but if we had to check every ten minutes or so then this isn’t the most efficient way to do it. We have to get the full feed every time and most of the time the data in it won’t have changed from the last time, so we’re wasting time and bandwidth by always pulling the whole feed back.

Thankfully Feedzirra provides a way of getting updates for a feed. If you look at the example code for Feedzirra you can see that there is a method that will get only the entries for the feed that have been updated since the feed was last retrieved.

# updating a single feed
updated_feed = Feedzirra::Feed.update(feed)

The update method uses ETags to determine whether the feed has changed since it was last fetched and will only download and reparse it if it has. To get the new entries there’s a new_entries method that returns a collection of the entries added since the last fetch.
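
To make the ETag idea concrete, here is a rough sketch of a conditional GET written with Ruby’s Net::HTTP. It shows the general HTTP mechanism, not Feedzirra’s internals. The method names conditional_feed_request and fetch_if_changed are made up for this example.

```ruby
require "net/http"
require "uri"

# Builds a GET request that asks the server to send the feed only if its
# ETag differs from the one we saved from the previous response.
def conditional_feed_request(feed_url, etag = nil)
  request = Net::HTTP::Get.new(URI.parse(feed_url))
  request["If-None-Match"] = etag if etag
  request
end

# Returns [body, etag] when the feed has changed, or nil when the
# server answers 304 Not Modified.
def fetch_if_changed(feed_url, etag = nil)
  uri = URI.parse(feed_url)
  response = Net::HTTP.start(uri.host, uri.port, :use_ssl => uri.scheme == "https") do |http|
    http.request(conditional_feed_request(feed_url, etag))
  end
  return nil if response.is_a?(Net::HTTPNotModified)
  [response.body, response["ETag"]]
end

# Building the request needs no network access, so we can inspect it.
request = conditional_feed_request("http://example.com/feed.xml", '"abc123"')
puts request["If-None-Match"]
```

When the server replies with 304 there is no body to download or parse, which is where the saving in time and bandwidth comes from.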

I was unable to get this working while writing the test application, but I’ll show you the code that should work and enable you to get frequent updates from a feed. What we’re going to do is add another method to our FeedEntry class to go with the update_from_feed method we created earlier. This method will repeatedly poll the feed and add any updated entries to the database.

Our new method will reuse the code that adds the entries to the database, so we’ll start by extracting that code into its own method.

class FeedEntry < ActiveRecord::Base
  def self.update_from_feed(feed_url)
    feed = Feedzirra::Feed.fetch_and_parse(feed_url)
    add_entries(feed.entries)
  end
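
  # A bare "private" has no effect on methods defined with "def self.",
  # so the helper is marked private explicitly below.
  def self.add_entries(entries)
    entries.each do |entry|
      unless exists? :guid => entry.id
        create!(
          :name         => entry.title,
          :summary      => entry.summary,
          :url          => entry.url,
          :published_at => entry.published,
          :guid         => entry.id
        )
      end
    end
  end
  private_class_method :add_entries
end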

Now we can write our new method, which we’ll call update_from_feed_continuously.

def self.update_from_feed_continuously(feed_url, delay_interval = 15.minutes)
  feed = Feedzirra::Feed.fetch_and_parse(feed_url)
  add_entries(feed.entries)
  loop do
    sleep delay_interval.to_i
    feed = Feedzirra::Feed.update(feed)
    add_entries(feed.new_entries) if feed.updated?
  end 
end

This method is similar to the update_from_feed method but it takes an additional parameter that specifies how often the feed should be polled. It starts by getting the full feed and adding the entries then enters an endless loop that sleeps for the specified period before checking to see if the feed has been updated and, if so, adding any new entries to the database.
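
An endless loop is awkward to try out, so here is a small plain-Ruby sketch of the same polling pattern that stops after a fixed number of polls. The poll_feed helper and its arguments are hypothetical: fetch_new is any callable that returns the entries added since the last poll, standing in for the Feedzirra update call.

```ruby
# Calls fetch_new up to max_polls times, waiting `interval` seconds between
# polls, and yields each non-empty batch of new entries to the block.
# Returns the total number of entries yielded.
def poll_feed(fetch_new, interval, max_polls)
  total = 0
  max_polls.times do |i|
    sleep interval if i > 0        # no wait before the first poll
    new_entries = fetch_new.call
    next if new_entries.empty?     # nothing new since the last poll
    yield new_entries
    total += new_entries.size
  end
  total
end

# Three simulated polls: two new entries, then nothing, then one more.
batches = [["guid-1", "guid-2"], [], ["guid-3"]]
saved = []

total = poll_feed(lambda { batches.shift || [] }, 0, 3) do |entries|
  saved.concat(entries)
end

puts "saved #{total} entries: #{saved.inspect}"
```

Passing the fetching code in as an argument keeps the loop itself free of network calls, so it can be exercised with an interval of zero.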

So we now have two methods for getting entries from an RSS feed: one that is suitable for a cron job and another that can run in a daemonized process and is better suited to feeds that need to be checked for updates frequently.

It should be noted that a loop is not the best way to implement a daemonized process. For a better approach look at Episode 129, which shows how to use the daemons gem to create a background process.