LiveJournal recently released a streaming Atom feed of its latest entries. The feed lets you connect to LiveJournal's servers and receive content over a persistent connection as it is posted, which opens the door to all kinds of nifty tricks with the data. Given the huge number of postings that LiveJournal receives (several every second, sometimes up to 300 per minute), this streaming format is a boon for statisticians. For some people, however, it's just fun to watch the text flow by.
This hack uses Perl modules designed for parsing both Atom and the stream of data that LiveJournal provides. Using the streaming data as an example, it demonstrates how to work with Atom entries and obtain their content. The result is a command-line script that can deliver titles, links, and, if desired, content for the entries passing through the server.
This hack uses a number of rather esoteric Perl modules. The prerequisites are:
    XML::Atom
    XML::Atom::Stream
However, these pull in a significant number of dependencies, and the latest version of XML::Atom at the time of this writing (0.13) is required for XML::Atom::Stream. These modules can be installed via CPAN; for more information on how to install Perl modules, consult the CPAN documentation.
First, we'll demonstrate how to simply connect to the stream, and dump the data out using Perl's Data::Dumper. The stream runs on port 8081 under the domain updates.sixapart.com. To see the data as it's coming down the pipe, you can connect using telnet:
    $ telnet updates.sixapart.com 8081
    Trying 66.150.15.140...
    Connected to danga.com.
    Escape character is '^]'.
    GET /atom-stream.xml HTTP/1.0
Press Return once more after the GET line to send the blank line that ends the request, and the stream of entries will start flowing into the terminal. You can see that it is a series of Atom feed elements, each containing a single entry. Additionally, every second, a <time> element is sent with the current epoch time. However, the specifics of this format aren't important to us, because we will be using XML::Atom::Stream to connect, which will take care of this for us.
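For the curious, the telnet session above can be reproduced from Perl with the core IO::Socket::INET module. This is only a minimal sketch for peeking at the raw stream. The helper names build_request and dump_stream are hypothetical, and it assumes the server still answers a plain HTTP/1.0 GET the way the telnet session shows.

```perl
use strict;
use warnings;
use IO::Socket::INET;

# Build the request typed by hand in the telnet session: the GET line
# followed by the blank line that ends an HTTP/1.0 request.
# (build_request is a hypothetical helper name.)
sub build_request {
    my ($path) = @_;
    return "GET $path HTTP/1.0\r\n\r\n";
}

# Connect, send the request, and echo at most $max_lines lines of the
# raw response (HTTP headers first, then the Atom XML) to STDOUT.
# (dump_stream is a hypothetical helper name.)
sub dump_stream {
    my ($host, $port, $path, $max_lines) = @_;

    my $sock = IO::Socket::INET->new(
        PeerAddr => $host,
        PeerPort => $port,
        Proto    => 'tcp',
    ) or die "Cannot connect to $host:$port: $@\n";

    print {$sock} build_request($path);

    my $seen = 0;
    while (defined(my $line = <$sock>)) {
        print $line;
        last if ++$seen >= $max_lines;
    }
    close $sock;
    return $seen;
}
```

Calling dump_stream('updates.sixapart.com', 8081, '/atom-stream.xml', 40) would print the first 40 lines and then hang up, which is usually enough to see a few feed elements and a <time> element go by.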
Using XML::Atom::Stream, we define the URL from which we're going to fetch the Atom stream, and register a callback function to be called whenever a new entry is received. This callback receives an instance of XML::Atom::Feed as its argument, so it can fetch the entries, or any other aspect of the feed presented by LiveJournal, from that object.
The code below is the basic starting point, provided as an example by XML::Atom::Stream.
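    use strict;
    use warnings;
    use XML::Atom::Stream;

    my $url = "http://updates.sixapart.com:8081/atom-stream.xml";

    # Register the callback, then connect; connect() keeps reading the stream.
    my $client = XML::Atom::Stream->new(
        callback => \&callback,
    );
    $client->connect($url);

    sub callback {
        # The single argument is an XML::Atom::Feed object.
        my ($atom) = @_;
    }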
This is the initial connection work. From this point on, all work can take place in the callback subroutine, working from the Feed provided by the Stream.
The callback's argument, which we refer to as $atom, is an XML::Atom::Feed instance. These instances are generally generated from full feeds rather than from a single entry, so the way to retrieve the entry from the feed is to loop over $atom->entries. Each entry has title, content, and link properties we'll use:
    foreach my $entry ($atom->entries) {
        my $title   = $entry->title;
        my $content = $entry->content;
        my $link    = $entry->link;
From here, we want to obtain, and print, the content of each of these elements. However, many entries on LiveJournal don't have titles, so we'll want to do a bit of logic to ensure that we're displaying some kind of title:
        if (!$title) {
            $title = substr($content->body, 0, 50);
            $title =~ s/[\r\n]+/ /g;
        }
        print $title."\n";
Here, we print the first 50 characters of the body, with line breaks turned into spaces, if there is no title; otherwise we print the title of the post as assigned by the user.
The last step is to print the link, so we know where it's coming from:
        print $link->href."\n\n";
    }
Now we have a script that connects to the stream and prints the title and link of each LiveJournal entry as it flows by our client.
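The pieces can also be gathered into one command-line script that prints content on request. What follows is a sketch rather than a definitive version. The names format_entry, stream_entries, and parse_args, and the --content switch, are all hypothetical. It also treats a missing content element as an empty body, which the snippets above do not do.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Getopt::Long qw(GetOptionsFromArray);

# Build the text printed for one entry. $entry can be any object that
# offers title, content, and link methods, as XML::Atom::Entry does.
sub format_entry {
    my ($entry, $show_content) = @_;

    # Assumption: an entry may arrive without content, so guard for it.
    my $content = $entry->content;
    my $body    = defined $content ? $content->body : undef;
    $body = '' unless defined $body;

    # Fall back to the first 50 characters of the body for untitled posts.
    my $title = $entry->title;
    if (!defined $title || $title eq '') {
        $title = substr($body, 0, 50);
        $title =~ s/[\r\n]+/ /g;
    }

    my $link = $entry->link;
    my $out  = "$title\n";
    $out .= $link->href . "\n" if $link;
    $out .= "$body\n" if $show_content && length $body;
    return $out . "\n";
}

# Connect and print every entry until interrupted. The module is loaded
# here, at call time, so format_entry can be tried without it installed.
sub stream_entries {
    my ($url, $show_content) = @_;
    require XML::Atom::Stream;
    my $client = XML::Atom::Stream->new(
        callback => sub {
            my ($atom) = @_;
            print format_entry($_, $show_content) for $atom->entries;
        },
    );
    $client->connect($url);
}

# Parse the hypothetical --content switch from an argument list.
sub parse_args {
    my @args = @_;
    my $show_content = 0;
    GetOptionsFromArray(\@args, 'content!' => \$show_content)
        or die "usage: $0 [--content]\n";
    return $show_content;
}
```

Ending the script with stream_entries("http://updates.sixapart.com:8081/atom-stream.xml", parse_args(@ARGV)); would print titles and links by default, and the full body as well when the script is run with --content.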
One fun aspect of LiveJournal is the huge amount of content that comes down the pipe. With that many words coming at you, you can, for example, tally the content to find the most frequently used word. To access the content, use $content->body, as demonstrated in the no-title case above. There are plenty of other similar tricks you could play with the content as it comes in.
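As one way to approach the most-used-word idea, the sketch below keeps a running tally in a hash. The helper names count_words and top_words are hypothetical. The tag stripping is a deliberately crude regular expression rather than a real HTML parser, and no stop-word filtering is attempted.

```perl
use strict;
use warnings;

# Add the words of one entry body to a running tally held in %$counts.
sub count_words {
    my ($counts, $text) = @_;
    return unless defined $text;

    # Crude assumption: anything between < and > is markup, not prose.
    $text =~ s/<[^>]*>/ /g;

    # Count runs of letters, allowing inner apostrophes (e.g. "don't").
    while ($text =~ /([[:alpha:]]+(?:'[[:alpha:]]+)*)/g) {
        $counts->{ lc $1 }++;
    }
    return;
}

# Return the $n most frequent words, most common first.
# Ties are broken alphabetically so the order is stable.
sub top_words {
    my ($counts, $n) = @_;
    my @sorted = sort { $counts->{$b} <=> $counts->{$a} || $a cmp $b }
                 keys %$counts;
    $n = @sorted if $n > @sorted;
    return @sorted[0 .. $n - 1];
}
```

Inside the callback loop, count_words(\%counts, $content->body) would feed each entry into a %counts hash declared outside the callback, and top_words(\%counts, 10) could be printed every so often to watch the leaders change.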