(also in German at http://www.drupalcenter.de/handbuch/26659)

This description covers the steps taken to produce a copy of a live Drupal site that can be viewed "offline", e.g. from a CD, by just using a browser to access static HTML files, without the need for a web server or a database.

Such a copy can be used for archiving purposes, for situations where online access to the site isn't always possible, or - as in the case that triggered the creation of the script described here - as a portable "snapshot" of the site's content for review purposes.

Note that the approach described here is not a general-purpose mirroring technique. There may be easier ways to achieve this using tools such as wget (http://www.gnu.org/software/wget/), httrack (http://www.httrack.com/), or Drupal's Boost (http://drupal.org/project/boost) or HTML export (http://drupal.org/project/html_export) modules, but those may not give you as much control over the result and the mirroring process as the approach described below.

The situation:

  • an SSL-secured Drupal site that requires logging in to access any content
  • Drupal is installed in a subdirectory "secrets" of the webroot that the URL https://www.mysecrets.com points to
  • the site's content is organized as a Drupal "book"
  • file attachments are all placed in a "private" directory
  • the additional modules "Private Upload", "Private Download", and "Custom Filter" are used to prevent the "private" directory from being spied into
  • the result of the "mirroring" process should be a set of HTML, CSS, image, and attachment files that statically reflects the current contents of the Drupal site
  • all URLs, directory names, user IDs, passwords, etc. in this description are fictitious; any similarities to real domains are purely accidental and by no means intended
  • this description and the Perl script snippets are provided under the conditions of the GNU General Public License (GPL), which can be found in many places on the web and is not attached here for brevity.

Prerequisites:

To use the script developed in this description you need Perl plus a few Perl packages: HTML::TreeBuilder for parsing and manipulating HTML content, LWP as the "browser", and, in case HTTPS is used, libcrypt-ssleay-perl so that LWP can talk HTTPS.
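
If you are not sure whether these packages are already installed, a quick check could look like the following (a minimal sketch; Crypt::SSLeay is the Perl module contained in libcrypt-ssleay-perl and is only needed when HTTPS is involved):

# check_prereqs.pl - verify that the required Perl modules can be loaded
use strict;
use warnings;

foreach my $module ('LWP', 'HTML::TreeBuilder', 'Crypt::SSLeay') {
   if (eval "require $module; 1") {
      print "$module ... found\n";
   } else {
      print "$module ... MISSING - please install it, e.g. via CPAN\n";
   }
}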

Disclaimer:

This description is not intended to teach Perl; it rather assumes that the reader already has some Perl knowledge. It is also probably not the most elegant or efficient way to write Perl scripts - I'm more or less a Perl newbie myself. Apologies if the coding style offends - it's not meant to hurt anyone.
The method described below may not be applicable to every kind of Drupal site, but it should at least work for simple things such as Drupal books.

The script:

Let's start with a more or less typical declarative section of the Perl script containing the globally used variables:

#
use strict;
use warnings;
use LWP 5.64;
use HTML::TreeBuilder;

my $browser = LWP::UserAgent->new;
my %urls2retrieve;   # a hash to store what we still have to retrieve
my %urlsretrieved;   # another hash containing what we retrieved already (to avoid duplicate retrieval) 
my $outfile;
my @urlkeys;
my $nexturl2retrieve;
my $typeofnexturl;

Now, since we need to log on to the Drupal site and do the retrieval work while logged on, we have to keep the cookies the site sends us in order to maintain our "logged on" state for the duration of the script. Then we perform the log-on itself.

# the cookie store is in memory
$browser->cookie_jar({});
# Initial contact
my $host = 'https://www.mysecrets.com'; # the host
my $url = $host . '/secrets';           # the subdirectory Drupal is installed in
my $response = $browser->get($url);
  die "Can't get $url -- ", $response->status_line
   unless $response->is_success;
my $html = $response->content;
# now post the login
 $response = $browser->post( $url . "/user",
   [
     'name'    => 'JamesBond',    # the log-in ID
     'pass'    => 'agent-007',    # the password for the above ID
     'op'      => 'Log in',       # the button's label; LWP URL-encodes the form values itself
     'form_id' => 'user_login'
   ],
 );
$html = $response->content;
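
The script simply assumes that the log-in worked. If you'd like to be sure before starting the mirroring run, a small sanity check could be added right after the post - a sketch only, assuming the theme shows a "Log out" link for authenticated users, which is the Drupal default:

# re-fetch the start page and make sure it now shows a "Log out" link,
# i.e. that we really are logged in
my $check = $browser->get($url);
die "Log-in apparently failed -- ", $check->status_line
   unless $check->is_success && $check->decoded_content =~ /Log out/i;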

Now the script is logged in and can start retrieving the contents. The site is set up so that node 3 is the starting screen you get when logged in, so we begin by retrieving node 3 and work our way down through everything we can find from there.
The hash %urls2retrieve is used so that each key holds a partial URL and the corresponding value holds the type of what that URL points to. The types "node" and "css" mean we not only need to "get" the item from the site but also have to do some processing on it, while the type "file" describes something we just retrieve and store as it is.

# initialize the urls to retrieve
$urls2retrieve{"/secrets/?q=node/3"} = 'node';

# iteratively retrieve everything
while (%urls2retrieve) {
        @urlkeys = keys(%urls2retrieve);
        $nexturl2retrieve = $urlkeys[0];
        $typeofnexturl = $urls2retrieve{$nexturl2retrieve};
        if ($typeofnexturl eq "node") {
                RetrieveNode();
                ProcessNode();
        } elsif ($typeofnexturl eq "css") {
                RetrieveCss();
                ProcessCss();
        } elsif ($typeofnexturl eq "file") {
                RetrieveFile();
        }
        $urlsretrieved{$nexturl2retrieve} = 1; # record this url as retrieved
        delete $urls2retrieve{$nexturl2retrieve}; # don't handle this anymore
} # end while (%urls2retrieve)

That is basically it. We're done and have retrieved the entire site. The real work is done in the subroutines described below.

This routine simply retrieves a node and stores the entire HTML page in the $response variable.

sub RetrieveNode {
	print "retrieving: $nexturl2retrieve --------\n";
	my $url = $host . $nexturl2retrieve;
	$response = $browser->get($url);
	die "Can't get $url -- ", $response->status_line unless $response->is_success;
} # end of RetrieveNode()

Now for a somewhat more complicated routine that walks through the retrieved page and its HTML structure, extracts the things we will have to retrieve next, and modifies the related references to match our intended (flat) target structure.

sub ProcessNode {
print "processing: $nexturl2retrieve --------\n";
$nexturl2retrieve =~ /.*\/(.*)/;
$html = $response->decoded_content; # to get e.g. all the non-ASCII characters like German umlauts right
my $root = HTML::TreeBuilder->new();
$root->parse_content($html);

The entire page is now parsed, and we can start looking for things that point to something we also need to retrieve or modify. We start with the "link" tags.

my @links = $root->look_down('_tag', 'link'); # get a list of "links"

Now we have a list of all the "link" tags in this page, which we process one by one.
Three different types of links are recognized in the script below. If the "link" points to another node, we first check whether that node has already been retrieved or has already been noted for retrieval; if not, we add it to our hash %urls2retrieve. Then we change the reference from the ".../?q=node/" pattern to a simple "node<number>.html" filename. (The repeated "already seen?" check could also be factored into a small helper sub; see the sketch after the loop below.)
Link references to CSS files and other files are handled in a similar way.

foreach my $link (@links) { 
   my $linkhref = $link->attr('href');
   if ($link->attr('href') =~ /\?q=node/ ) {
      # push the link to the to-download list
      if (! defined $urlsretrieved{$linkhref} && ! defined $urls2retrieve{$linkhref}) { $urls2retrieve{$linkhref} = 'node'; }
      # change the link for local filesystem use
      $link->attr('href') =~ /\?q=node\/(.*)/;
      my $nodenum = $1;
      $link->attr('href', "node" . $nodenum . ".html" ); #set a new href
   } elsif ( $link->attr('href') =~ /\.css\?y/ ) {
      # push the link to the to-download list
      if (! defined $urlsretrieved{$linkhref} && ! defined $urls2retrieve{$linkhref}) { $urls2retrieve{$linkhref} = 'css'; }
      # change the link for local filesystem use
      $link->attr('href') =~ /.*\/(.*)\.css\?y/;
      my $css = $1;
      $link->attr('href', $css . ".css?y" ); #set a new href
   } else {
      # push the link to the to-download list
      if (! defined $urlsretrieved{$linkhref} && ! defined $urls2retrieve{$linkhref}) { $urls2retrieve{$linkhref} = 'file'; }
      # change the link for local filesystem use
      $link->attr('href') =~ /.*\/(.*)/;
      my $file = $1;
      $link->attr('href', $file ); #set a new href
   }
}
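
As a side note: the "has this URL been seen already?" test above is repeated for every tag type. It could be factored into a small helper subroutine (a sketch only, not part of the original script), so that each branch boils down to a single QueueUrl(...) call:

sub QueueUrl {
   my ($partialurl, $type) = @_;
   # remember the URL for later retrieval unless it is already known
   $urls2retrieve{$partialurl} = $type
      unless defined $urlsretrieved{$partialurl}
          || defined $urls2retrieve{$partialurl};
}

# usage, e.g. instead of the "if (! defined ...)" line for nodes:
# QueueUrl($linkhref, 'node');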

Now that we have handled all "link" tags, we do the same kind of processing for the anchor tags. Due to the use of the additional Drupal modules (Private Upload, Private Download, Custom Filter), attachment URLs pointing to the "private" folder need special handling. References to anything other than a node or a file in "private" are removed; this includes all the usual links to the administrative pieces of the Drupal site, as they don't make much sense in a "static" copy of the site's content. Furthermore, any reference to the "starting page" (in our example https://www.mysecrets.com/secrets), which is used e.g. in the site's header and in the breadcrumb navigation, is changed to point to our main page (node3.html).
We also don't follow any reference that points somewhere other than the site being mirrored - i.e. external links.

my @as = $root->look_down('_tag', 'a'); # get a list of "as"
foreach my $a (@as) { 
   my $ahref = $a->attr('href');
   next if (! defined $ahref);
   if ($a->attr('href') =~ /\?q=node/ ) {
      # push the ahref to the to-download list
      if (! defined $urlsretrieved{$ahref} && ! defined $urls2retrieve{$ahref}) { $urls2retrieve{$ahref} = 'node'; }
      # change the a for local filesystem use
      $a->attr('href') =~ /\?q=node\/(.*)/;
      my $nodenum = $1;
      $a->attr('href', "node" . $nodenum . ".html" ); #set a new href
   } elsif ( $a->attr('href') =~ /\?q=system\/files/ ) {
      # files in private 
      if (! defined $urlsretrieved{$ahref} && ! defined $urls2retrieve{$ahref}) { $urls2retrieve{$ahref} = 'file'; }
      # change the a for local filesystem use
      $a->attr('href') =~ /.*\/(.*)/;
      my $file = $1;
      $a->attr('href', $file ); #set a new href
    } elsif ( $a->attr('href') =~ /\?q=/ ) {
      # ?q=anythingotherthannode will be removed
      $a->delete();
    } else {
      # check for external link
      if ($ahref =~ /$host/ ) {
         # push the link to the to-download list
   	 if (! defined $urlsretrieved{$ahref} && ! defined $urls2retrieve{$ahref}) { $urls2retrieve{$ahref} = 'file'; }
   	 # change the link for local filesystem use
   	 $a->attr('href') =~ /.*\/(.*)/;
   	 my $file = $1;
   	 $a->attr('href', $file ); #set a new href
      } elsif ($ahref eq '/secrets/') {
         # redirect it to our starting page, which is node 3
         $a->attr('href', 'node3.html' ); #set a new href
      } else {
         print "--- external link: $ahref\n"; 
      }
   }
} # end foreach @as

Now that we have handled the anchor tags, we need to look at the "img" tags and process any image references we find there. Again, as this is a secured site where only logged-in users are allowed to access any content (including images), such content needs to be retrieved from the "private" folder.

my @imgs = $root->look_down('_tag', 'img'); # get a list of "img"
foreach my $img (@imgs) { 
   my $src = $img->attr('src');
   if ($src =~ /\?q=system\/files/ ) {
      # files in private
      if (! defined $urlsretrieved{$src} && ! defined $urls2retrieve{$src}) { $urls2retrieve{$src} = 'file'; }
      # change the src for local filesystem use
      $src =~ /.*\/(.*)/;
      my $file = $1;
      $img->attr('src', $file ); #set a new src
    } elsif ( $src =~ /\/secrets\/sites\/default\/files/ ) {
      if (! defined $urlsretrieved{$src} && ! defined $urls2retrieve{$src}) { $urls2retrieve{$src} = 'file'; }
      # change the src for local filesystem use
      $src =~ /.*\/(.*)/;
      my $file = $1;
      $img->attr('src', $file ); #set a new src
    } else {
      # check for external link
      if ($src =~ /$host/ ) {
         if (! defined $urlsretrieved{$src} && ! defined $urls2retrieve{$src}) { $urls2retrieve{$src} = 'file'; }
         # change the link for local filesystem use
         $src =~ /.*\/(.*)/;
         my $file = $1;
          $img->attr('src', $file ); #set a new src
      } else {
         print "--- unhandled img src: $src\n"; 
      }
   }
} # end foreach @imgs

Now that all the contents of this node/page have been scanned for references to other places and modified appropriately, we can save it to a file. The file name is chosen to be "node", followed by the node's number and the extension ".html".

$outfile = '';
#determine the name of the file to be written
if ($nexturl2retrieve =~ /\?q=node\//) {
   $nexturl2retrieve =~ /\?q=node\/(.*)/ ;
   $outfile = "node" . $1 . ".html";
   unlink($outfile);
   open (OUTPUT, ">$outfile") or die "cannot open $outfile for output \n";
   my $x = $root->as_HTML();
   print OUTPUT $x;
   close OUTPUT;	
}
$root->delete(); # clear this html tree
} # end of ProcessNode()

This concludes the ProcessNode routine. Next comes the RetrieveCss routine, which is similar to RetrieveNode except that it retrieves the CSS directly into a local file.

sub RetrieveCss {
   print "retrieving: $nexturl2retrieve --------\n";
   my $url = $host . $nexturl2retrieve;
   $nexturl2retrieve =~ /.*\/(.*)\.css/;
   my $cssfile = $1 . ".css";
   $response = $browser->get($url, ':content_file' => $cssfile  );
   die "Can't get $url -- ", $response->status_line unless $response->is_success;
} # end of RetrieveCss()

The retrieved CSS file may contain url() references to graphics (e.g. for menu bullets or theme-coloring), which need to be retrieved too. The processing has to take into account that such references are relative to the CSS file's location: if, for example, the CSS file came from /secrets/themes/mytheme/style.css and contains url(images/menu-leaf.gif), the image has to be queued as /secrets/themes/mytheme/images/menu-leaf.gif.

sub ProcessCss {
   print "processing: $nexturl2retrieve --------\n";
   my $line;
   my $cssurl;
   my $url = $host . $nexturl2retrieve;
   $nexturl2retrieve =~ /.*\/(.*)\.css/;
   my $cssfile = $1 . ".css";
   open (CSSFILE, $cssfile) or die "cannot open $cssfile for input \n";
   while (<CSSFILE>) {
      chomp; 
      $line = $_;
      if ($line =~ /url\(.*\)/ ) {
         $line =~ /url\((.*)\)/ ;
         $cssurl = $1;
         $nexturl2retrieve =~ /(\/.*\/)/ ;
         my $retfromdir = $1;
         $cssurl = $retfromdir . $cssurl;
         # push the link to the to-download list
         if (! defined $urlsretrieved{$cssurl} && ! defined $urls2retrieve{$cssurl}) { $urls2retrieve{$cssurl} = 'file'; }
      }	
   }
   close (CSSFILE);
} # end of ProcessCss()
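
The simple pattern above only handles unquoted, relative url(...) references. If your theme's CSS also contains quoted or absolute references (url('...') or url(/...)), the inner block could be made a bit more tolerant, for example like this (a sketch only, not part of the original script):

      if ($line =~ /url\(\s*['"]?([^'")]+)['"]?\s*\)/ ) {
         $cssurl = $1;
         $nexturl2retrieve =~ /(\/.*\/)/ ;
         my $retfromdir = $1;
         # only prepend the CSS file's directory for relative references
         $cssurl = $retfromdir . $cssurl unless $cssurl =~ /^\//;
         # push the link to the to-download list
         if (! defined $urlsretrieved{$cssurl} && ! defined $urls2retrieve{$cssurl}) { $urls2retrieve{$cssurl} = 'file'; }
      }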

The only remaining part is to retrieve plain attachment files and graphics files, which is handled in the following routine.

sub RetrieveFile {
   print "retrieving: $nexturl2retrieve --------\n";
   my $url;
   if ($nexturl2retrieve =~ /$host/) {
      $url = $nexturl2retrieve;
   } else {
      $url = $host . $nexturl2retrieve;
   }
   return if ($nexturl2retrieve eq "/secrets/");
   $nexturl2retrieve =~ /.*\/(.*)/;
   my $file = $1;
   $response = $browser->get($url, ':content_file' => $file  );
   print "Can't get $url -- \n", $response->status_line unless $response->is_success;	
} # end of RetrieveFile()

This concludes my description of how the contents of a Drupal site can be mirrored for offline viewing - or at least how I did it for some sites I take care of.

Feel free to adapt it to any similar situation where you'd like to have some control over the mirroring process. Any comments are welcome.
(Richard)