Creating a static archive of a Drupal site

Over time, some Drupal sites no longer have any user interactivity or need content updates. They have essentially become static. Because these sites still require security administration, an administrator has to keep patching the site or consider removing it altogether. Alternatively, you may want to produce an offline copy for archiving or for convenient reference when you don't have access to the Internet.

Before simply removing the site, consider a third alternative: produce a static HTML mirror of all public pages on the site and replace the Drupal installation with it:

  1. Create a custom block and/or post a node to the front page that notes that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving.
  2. Disable interactive elements which will be nonfunctional in the static HTML version. For instance, make sure the following are disabled:
    • login block
    • who's online block
    • registration
    • anonymous commenting
    • links to the search module and/or any search boxes in the header
    • comment controls which allow the user to select comment display format
  3. Update all nodes by setting their comments to read only. This will eliminate the "login or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL command against the node table:
    update node set comment = '1';
  4. It can also be a good idea to disable any dynamically generated third-party blocks; once the site is archived, it will be difficult to remove these blocks if the third-party services are no longer available.
  5. Use one of the following applications to produce the mirror:
    • HTTrack. The Windows GUI client version will produce the mirror with almost no configuration on your part.
    • SiteSucker. This is a Mac GUI option for downloading a site.
    • Wget is available on almost any *nix machine and can produce the mirror from the command line. However, wget seems to have problems converting relative style sheet URLs properly on many Drupal site pages. Modify your theme template to produce hardcoded absolute links to the style sheets and try the following command:
      wget -q --mirror -p --html-extension --base=./ -k -P ./ http://example.com
  6. Verify that the offline version of your site works in your browser. Check that you turned off every interactive element in Drupal that would otherwise confuse users of the static site.
  7. Remove your Drupal site (except for the style sheets, if they were hardcoded in step 5) and replace it with your archive.
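
Part of the check in step 6 can be scripted. The following is a minimal sketch, not a complete test: it lists the mirrored HTML files that still link to interactive Drupal pages. The function name find_interactive_leftovers is hypothetical, and the path patterns (user/login, user/register, comment/reply, search/node) are assumptions based on Drupal's default URLs, so adjust them to match your site.

```shell
#!/bin/sh
# Sketch: list mirrored HTML files that still reference interactive Drupal paths.
# The patterns below are assumptions based on Drupal's default URLs; edit as needed.
find_interactive_leftovers() {
  mirror_dir=$1
  # grep -l prints only the names of matching files; -E enables extended regex.
  find "$mirror_dir" -type f -name '*.html' \
    -exec grep -lE 'user/(login|register)|comment/reply|search/node' {} + | sort
}
```

Run it against the directory that holds the mirror, for example `find_interactive_leftovers ./example.com`. Any file it prints probably still contains a login, registration, comment, or search link that should have been disabled before mirroring.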

httrack... and robots.txt pitfalls

bollwyvl - July 19, 2007 - 20:52

httrack is much better at interpreting @import css directives, as well as images referenced in stylesheets, than wget. With httrack properly configured, you don't have to hack on common.inc to get all of your stylesheets to work correctly.

However, with the default robots.txt settings in Drupal 5 and the "good citizen" default httrack settings, you won't get any module or theme css files or javascript files.

So, if you're working from a local installation of Drupal and want to grab ALL of your files in a way that you can just copy them up to a server, try the following command:

httrack http://localhost/ -W -O "~/static_cache"  -%v --robots=0

Of course, you should only use this on your own sites!
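
Before deciding whether you need --robots=0, it may help to see which paths your site's robots.txt actually excludes. This is a rough sketch that assumes you have a copy of robots.txt in a local file; the function name list_disallowed_paths is made up for illustration.

```shell
#!/bin/sh
# Sketch: print the path from every Disallow rule in a robots.txt file.
list_disallowed_paths() {
  # Keep only lines starting with "Disallow:" (or "disallow:"),
  # drop the directive itself, and strip any carriage returns.
  sed -n 's/^[Dd]isallow:[[:space:]]*\(.*\)$/\1/p' "$1" | tr -d '\r'
}
```

For example, `list_disallowed_paths robots.txt | grep -E '^/(modules|themes)/'` shows any rules that would keep a robots-respecting mirror tool away from module and theme files.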

HTML Export module

scedwar - May 6, 2008 - 21:21

I've not personally tried this, but it looks promising:
http://drupal.org/project/html_export

 
 
