January 13, 2005
(Not) Recovering Data From Advogato
A couple of days ago, zeenix mentioned he had overwritten his diary (and I would have responded sooner if not for the lost account issue).
The good news is that I archive the HTML pages I have rawdog generate from RSS and other feeds. That gives me 88 MB of HTML from every feed I subscribe to, going back to last June. So I thought it would be a fairly trivial operation to recover the lost entries with a splash of Perl.
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TreeBuilder;

foreach my $file_name (@ARGV) {
    my $tree = HTML::TreeBuilder->new; # empty tree
    $tree->parse_file($file_name);
    # feed-6e2302e9 is the class on items from a feed that a third
    # party scrapes from the Advogato recent log and provides
    # as an RSS feed to a few people.
    my @elements = $tree->find_by_attribute("class", "item feed-6e2302e9");
    # Since it has all Advogato entries in it, we need to
    # parse the HTML to look at the name of the person
    # who wrote it. Advogato puts that in the title of the
    # entry.
    foreach my $node (@elements) {
        # In scalar context this returns the first h3, or undef if none.
        my $title = $node->find_by_tag_name('h3');
        next unless defined $title; # skip items without a title
        if ($title->as_text =~ /zeenix/) {
            print $node->as_HTML;
        }
    }
    $tree->delete;
}
It seemed like a good idea, but it looks like the missing entries never made it to my feedparser, perhaps not even onto Advogato. So this is the best I can do.
Posted at January 13, 2005 03:36 PM
