Creating a static archive of a Drupal site

Last modified: January 27, 2010 - 02:00

Over time, some Drupal sites stop needing user interactivity or content updates. They have essentially become static. Because such a site still requires security administration, an administrator must either keep applying patches or consider removing the site altogether. You may also simply want an offline copy for archiving, or for convenient reference when you don't have access to the Internet. Before removing the site, consider a third option: produce a static HTML mirror of all public pages on the site and replace the Drupal installation with it.

There are also publishing workflows (sometimes referred to as a DMZ) where a Drupal site is maintained inside the firewall and then the output of the site is periodically cached to static HTML files and copied to public servers.
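
As a rough illustration of such a workflow, the sketch below mirrors an internal site with wget and then copies the result to a public server. The host names, paths, the publish_static and run helpers, and the DRY_RUN switch are all hypothetical. Using rsync for the copy step is also an assumption; any copy tool would do.

```shell
#!/bin/sh
# Sketch of a "mirror inside the firewall, publish outside" job.
# All host names and paths are placeholders, not values from a real setup.

# Run a command, or just print it when DRY_RUN=1.
run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        printf '%s\n' "$*"
    else
        "$@"
    fi
}

# publish_static SOURCE_URL LOCAL_DIR RSYNC_DEST
publish_static() {
    # -nH keeps wget from creating a host-name directory under LOCAL_DIR.
    # wget exits non-zero if any request failed, which skips the copy step.
    # rsync --delete removes destination files that are gone from the mirror.
    run wget -q --mirror -p --html-extension -k -nH -P "$2" "$1" &&
    run rsync -a --delete "$2/" "$3"
}

# Example (prints the two commands without running them):
# DRY_RUN=1 publish_static http://intranet.example/ /tmp/static web@public.example:/var/www/html
```

Running the job with DRY_RUN=1 first lets you review both commands before anything is copied to the public server.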

  1. Create a custom block and/or post a node to the front page that notes that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving.
  2. Disable interactive elements which will be nonfunctional in the static HTML version. For instance, make sure the following are disabled:
    • login block
    • who's online block
    • registration
    • anonymous commenting
    • links to the search module and/or any search boxes in the header
    • comment controls which allow the user to select comment display format
  3. Update all nodes by setting their comments to read only (a value of 1 in the comment column). This eliminates the "login or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL command against the node table:
    update node set comment = '1';
  4. It is also a good idea to disable any dynamically generated third-party blocks. Once the site is archived, removing these blocks would be difficult if the third-party services later become unavailable.
  5. Use one of the following applications to produce the mirror:
    • HTTrack. The Windows GUI client version will produce the mirror with almost no configuration on your part.
    • SiteSucker. This is a Mac GUI option for downloading a site.
    • Wget. It is generally available on almost any *nix machine and can produce the mirror from the command line. However, wget seems to have trouble converting relative style sheet URLs on many Drupal pages. If you hit this, modify your theme template to output hardcoded absolute links to the stylesheets, then try the following command:
      wget -q --mirror -p --html-extension --base=./ -k -P ./ http://example.com
  6. Alternatively, look at the HTML Export project, a Drupal module that dumps a working HTML version of your site.
  7. Verify that the offline version of your site works in your browser. Make sure you turned off every interactive element in Drupal that would otherwise confuse site users.
  8. Remove your Drupal site (except for the style sheets, if you hardcoded links to them in step 5) and replace it with your archive.
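
For step 7, a quick scan of the mirror can complement the browser check. The sketch below lists mirrored HTML files that still link to Drupal paths that need a live site. The function name find_leftover_links and the path list (user/login, user/register, comment/reply) are assumptions; extend the pattern for your own site.

```shell
#!/bin/sh
# Sketch: list mirrored HTML files that still link to interactive
# Drupal paths. The path list is an assumption, not an exhaustive set.

# find_leftover_links MIRROR_DIR
find_leftover_links() {
    mirror_dir=$1
    # grep -l prints only the names of matching files; -E enables the
    # alternation in the pattern. The match covers both clean URLs
    # (/user/login) and ?q= style URLs (?q=user/login).
    find "$mirror_dir" -type f -name '*.html' \
        -exec grep -lE 'href="[^"]*(user/login|user/register|comment/reply)' {} + |
        sort
}

# Example: find_leftover_links ./example.com
```

An empty result does not prove the archive is clean, since forms and scripts are not checked, but any file it does list is worth opening in Drupal before you mirror again.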

HTTrack... and robots.txt pitfalls

HTTrack is much better at interpreting @import CSS directives, as well as images referenced in stylesheets, than wget. With HTTrack properly configured, you don't have to hack on common.inc to get all of your stylesheets to work correctly.

However, with the default robots.txt settings in Drupal 5 and the "good citizen" default HTTrack settings, you won't get any module or theme CSS files or JavaScript files.
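
To see which paths a well-behaved crawler is being told to skip, you can list the Disallow rules in your site's robots.txt. This is a minimal sketch: the list_disallowed helper is hypothetical, and it only extracts the rules. It does not evaluate which User-agent section they belong to.

```shell
#!/bin/sh
# Sketch: print the path from every Disallow line in a robots.txt file.

# list_disallowed ROBOTS_FILE
list_disallowed() {
    # Keep only the path after "Disallow:", dropping any trailing comment.
    sed -n 's/^[Dd]isallow:[[:space:]]*\([^[:space:]#]*\).*$/\1/p' "$1"
}

# Example: list_disallowed /path/to/drupal/robots.txt
```

If the directories that hold your module and theme files appear in the output, a crawler that honors robots.txt will skip the CSS and JavaScript stored there.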

So, if you're working from a local installation of Drupal and want to grab ALL of your files in a way that you can just copy them up to a server, try the following command:

httrack http://localhost/ -W -O "$HOME/static_cache" -%v --robots=0

Note: Of course, you should only use this on your own sites!
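
After the run, it is worth confirming that stylesheets and scripts actually made it into the mirror. A minimal sketch, assuming the static_cache output directory from the command above; the count_assets helper is hypothetical:

```shell
#!/bin/sh
# Sketch: count mirrored files with a given extension.

# count_assets MIRROR_DIR EXTENSION
count_assets() {
    # tr strips the padding that some wc implementations add.
    find "$1" -type f -name "*.$2" | wc -l | tr -d ' '
}

# Example:
# count_assets "$HOME/static_cache" css
# count_assets "$HOME/static_cache" js
```

A count of 0 for css or js suggests the crawler still skipped those files, for example because the robots.txt rules were honored.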

Where do I key the following

iceaxe - May 6, 2009 - 13:13

Where do I key the following command?

httrack http://localhost/ -W -O "~/static_cache"  -%v --robots=0

Interesting

mike stewart - September 1, 2009 - 18:39

Another way to accomplish this would be to use the Boost module: http://drupal.org/project/boost

In a nutshell, it'll turn your site into static HTML.

Michael Stewart
www.MediaDoneRight.com
