January 13, 2005

(Not) Recovering Data From Advogato

A couple of days ago, zeenix mentioned he had overwritten his diary (and I would have responded sooner if not for the lost account issue).

The good news is that I archive the HTML pages I have rawdog generate from RSS and other feed formats. That gives me 88 MB of HTML from every feed I subscribe to, going back to last June. So I thought recovering the lost entries would be a fairly trivial operation with a splash of Perl.

#!/usr/bin/perl
use strict;
use warnings;
use HTML::Parser;
use HTML::TreeBuilder;
foreach my $file_name (@ARGV) {
        my $tree = HTML::TreeBuilder->new; # empty tree
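        # parse_file returns undef if the file cannot be opened
        unless (defined $tree->parse_file($file_name)) {
                warn "Cannot parse $file_name: $!\n";
                $tree->delete;
                next;
        }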
        my @elements = $tree->find_by_attribute("class", "item feed-6e2302e9");
        # feed-6e2302e9 is the class on items from a feed that
        # a third party scrapes from the Advogato recent log
        # and provides as RSS to a few people.
        # Since it has all Advogato entries in it, we need to
        # parse the HTML to look at the name of the person
        # who wrote each one. Advogato puts that in the title
        # of the entry.
        foreach my $node (@elements) {
                # Skip items with no h3 title instead of dying on undef
                my ($title) = $node->find_by_tag_name('h3');
                if (defined $title && $title->as_text =~ /zeenix/) {
                        print $node->as_HTML;
                }
        }
        $tree->delete;
}

It seemed like a good idea, but it looks like the missing entries never made it to my feedparser, and perhaps never even made it onto Advogato. So this is the best I can do.

Posted at January 13, 2005 03:36 PM