Recovering MT entries from HTML archives

This script might come in handy some day. There are always posts on the MT support forum from people who have lost their database, hadn't made a backup, and have only the archive HTML files generated by MT to restore from. And I'm always forgetting what search terms I used to find the original post.

It won't restore everything. You still need to rebuild your templates and modules. The script does try to pull the comments out of each page, though I'm not sure how well that part works, but at least you can recover your entries. Still, it's no substitute for having a good database backup or SQL dump of your MT database.

Don't thank me, thank Jason. He wrote it.

#!/usr/bin/perl
use Date::Manip;
# mtfix.pl - parse HTML files to import into Movable Type
# usage: mtfix.pl *html > import.mt
# Note: you _will_ need to adjust the regex
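# for each file on the command line, read the entire file into $content,
# line feeds and all, using slurp mode ($/ undefined).
# (Reading a line with while (<>) first and slurping afterwards would
# silently drop the first line of every file.)
while (defined($content = do { local $/; <> })) {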
# locate the fields we need using regex
# some matches may include newlines
($author) =($content =~ m|<div class="posted">\s+(.+?)\s+/|s);
($title) = ($content =~ m|<span class="title">(.+)</span>|);
($text) = ($content =~ m|<p>(.+)<a name="more">|s);
($more) = ($content =~ m|<a name="more">(.+)<span class="posted">|s);
($date) = ($content =~ m|<div class="date">(.+)</div>|);
($time) = ($content =~ m|Posted\sat\s(.+)<br />|);

# Read in all the comments as one big block of text
($comments) = ($content =~ m|</a>Comments</div>\n(.+)<div class="comments-head">|s);
# Break up each comment into its own element in an array
@comm = split (/<\/div>\n/, $comments);

# convert the date to MM/DD/YYYY hh:mm:ss
$datetime = "$date $time";
$parsed = ParseDate($datetime);
$datetime = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

# Strip out the paragraph tags, MT will add them later anyways.
$text =~ s|\<p\>||g;
$text =~ s|\</p\>||g;
$more =~ s|\<p\>||g;
$more =~ s|\</p\>||g;

# print out the fields in the proper format
print "AUTHOR: $author\n";
print "TITLE: $title\n";
print "DATE: $datetime\n";
print "-----\n";
print "BODY:\n$text\n";
print "-----\n";
print "EXTENDED BODY:\n$more\n";

foreach (@comm) { # For every comment in our array, print out the necessary formatting.
       if (length $_ > 7) { # this is here to ignore the last comment record.
               ($CText) = ($_ =~ m|<div class="comments-body">\n(.+)\n\<span|s);
               ($CDate) = ($_ =~ m|</a>\son\s(.+)</span>|);
               ($Ctemp) = ($_ =~ m|Posted\sby:\s(.+)\/a\>|);
               ($CAuthor) = ($Ctemp =~ m|\>(.+)\<|);
               ($CURL) = ($Ctemp =~ m|href=\"(.+)\"|);
               $CText =~ s|\<p\>||g;
               $CText =~ s|\</p\>||g;
               $parsed = ParseDate($CDate);
               $CDate = UnixDate($parsed,"%m/%d/%Y %H:%M:%S");

               print "-----\n";
               print "COMMENT:\n";
               print "AUTHOR: $CAuthor\n";
               print "URL: $CURL\n";
               print "DATE: $CDate\n";
               print "$CText\n\n";
       }
}
print "--------\n";
}
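
Since the note in the script says you will need to adjust the regex, it may help to try the patterns against a single page before running the whole archive through it. Here is a minimal sketch of that idea, not part of Jason's script. The sample markup, the %pattern table and the extract_fields helper are all made up for illustration, so paste in a fragment from one of your own archive files and adjust the patterns until nothing reports NO MATCH. It uses only core Perl (no Date::Manip).

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical archive page fragment; replace it with markup from your own files.
my $content = <<'HTML';
<div class="date">March 14, 2004</div>
<span class="title">Hello world</span>
<p>First paragraph.</p>
<a name="more"></a>
<p>The rest of the entry.</p>
<div class="posted">
Jane / Posted at 09:15 PM<br />
</div>
HTML

# Patterns in the same style as mtfix.pl, kept in one table so they are
# easy to adjust. Non-greedy (.+?) stops at the first closing tag.
my %pattern = (
    title => qr|<span class="title">(.+?)</span>|,
    date  => qr|<div class="date">(.+?)</div>|,
    time  => qr|Posted\sat\s(.+?)<br />|,
    body  => qr|<p>(.+?)<a name="more">|s,
);

# Return a hash of field name => captured text.
# A field is undef when its pattern did not match.
sub extract_fields {
    my ($html) = @_;
    my %field;
    for my $name (keys %pattern) {
        ($field{$name}) = $html =~ $pattern{$name};
    }
    return %field;
}

# Report each field so a pattern that needs adjusting is easy to spot.
my %field = extract_fields($content);
for my $name (sort keys %pattern) {
    printf "%-6s %s\n", $name,
        defined $field{$name} ? "[$field{$name}]" : "NO MATCH";
}
```

Once every field shows up the way you expect, copy the working patterns back into mtfix.pl and run it over the full set of files.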