
How to produce a static mirror of a Drupal website?

Note: You should only use this on your own sites, or on sites you have permission to mirror.

Prepare the Drupal website

Create a custom block and/or post a node to the front page that notes that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving. Consider including a link to the future versions of the site (e.g. if you are archiving a 2008 event, link to the URL of the next event).

Disable interactive elements that will be nonfunctional in the static HTML version.

Use the Disable All Forms module to disable all forms. Other elements to disable or remove include:

  • the login block
  • the "Who's online" block
  • user registration
  • anonymous commenting
  • links to the Search module and/or any search boxes in the header
  • comment controls that allow the user to select the comment display format
  • Ajax requests such as Views pagers
  • Views exposed filters
  • Update all nodes by setting their comments to read only. This will eliminate the "log in or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL query against the node table (a drush-based alternative is sketched after this list):
    update node set comment = '1';
  • It is also a good idea to disable any third-party dynamically generated blocks; once the site is archived, it will be difficult to remove them if those third-party services are no longer available.
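
If you have drush available, a rough equivalent for closing comments and disabling modules is sketched below (an illustration only; it assumes a Drupal 6/7 schema and that drush can bootstrap your site):

# Close comments on all existing nodes (1 = read only / closed in Drupal 6/7)
drush sql-query "UPDATE node SET comment = 1;"
# Disable interactive modules you no longer need, for example the core Search module
drush pm-disable search -y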

Create a static clone

Wget (UNIX, Linux, OSX, ...)

Wget is generally available on almost any 'nix machine and can produce the mirror from the command line. However, wget seems to have problems converting relative stylesheet URLs properly on many Drupal pages. Modify your theme template to produce hardcoded absolute links to the stylesheets, then try the following command:

wget -q --mirror -p --html-extension -e robots=off --base=./ -k -P ./ http://example.com

By default, wget respects robots.txt, so it might not download some of the files in /sites/ or elsewhere. The -e robots=off option included in the command above disables this behavior.
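
To see in advance which paths robots.txt would otherwise block (often /sites/, /modules/, and /themes/ in older default Drupal robots.txt files), you can inspect it directly; the hostname below is a placeholder:

# List the paths disallowed by the site's robots.txt
curl -s http://example.com/robots.txt | grep -i disallow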

HTTrack (UNIX, Windows, and Mac via Homebrew)

HTTrack's Windows GUI client will produce the mirror with almost no configuration on your part. On the command line, one potential invocation is:

httrack http://2011.example.com -K -w -O . -%v --robots=0 -c1 -%e0

Note that the -K option creates absolute links; this is only useful in some cases, such as when you are hosting a public mirror on the same domain. Otherwise, omit -K to produce relative links.

The -c1 option makes only one request at a time, so this becomes rather slow. The default is -c10, so you might consider something closer to that value when archiving your own site.

With HTTrack properly configured, you don't have to hack on common.inc to get all of your stylesheets to work correctly. However, with the default robots.txt settings in Drupal 5 and the "good citizen" default HTTrack settings, you won't get any module or theme CSS files or JavaScript files.

If you're working from a local installation of Drupal and want to grab ALL of your files in a way that you can just copy them up to a server, try the following command:

httrack http://localhost/ -W -O "~/static_cache"  -%v --robots=0
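
Once the local mirror looks complete, copying it to a web server is an ordinary file transfer; one possibility (user, host, and paths are placeholders, and the exact subdirectory HTTrack creates depends on the URL you mirrored) is:

# Copy the generated static files to the web server's document root
rsync -avz ~/static_cache/localhost/ user@example.com:/var/www/html/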

SiteSucker (OSX)

SiteSucker is a Mac GUI option for downloading a site.

HTML Export module

Check out the HTML Export project, a Drupal module that dumps a working HTML version of your site.

Verify that the offline version of your site works

Open the offline version of your site in your browser and verify that it works. Test to make sure that you properly turned off any interactive elements in Drupal that would now confuse site visitors.
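
A couple of quick greps can help with this check; this is a rough sketch assuming the mirror was saved into a directory named example.com:

# Find pages that still link back to the live site
grep -rl 'http://example.com' example.com/ | head
# Find pages that still contain forms, which will not work in the static copy
grep -rl '<form' example.com/ | head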

Why create a static site archive?

  • Perhaps over time your website has essentially become static. Because such sites still require security administration, an administrator has to continue to upgrade the site with patches or consider removing the site altogether.
  • You want to ensure that the site is preserved on Drupal.org infrastructure (without direct cost to you)
  • Alternatively, you may want to produce an offline copy for archiving or for convenient reference when you don't have access to the Internet. Before simply removing a site, consider another alternative: maintain the Drupal site inside a firewall, and periodically cache its output to static HTML files that are copied to public servers.


Comments

I've always found wget pretty reliable when you figure out the right flags ;)

Before I found this documentation, I also had problems with site-relative stylesheet links, but not all of them. If there's a pattern, it seems wget successfully converted @href values in the "/sites/[mysite]" directory to "sites/...", but actually converted others to full URLs including protocol and domain (http://example.com/...). For example, "/modules/node/node.css?b" came out as "http://example.com/modules/node/node.css?b". Hmmph??! I guess the problem links came from sites/all on the filesystem, but I expect this to be irrelevant to a client (like wget).

Here are the flags I used; -Erkp are probably the only relevant ones:

wget -w 3 --random-wait --user-agent=hugh -Erkp http://example.com -o example.wget.log

I can work around this with other tools for now. Just on the off-chance, does anyone happen to know why this even happens? I think -np ("no parent") is default behaviour just in case this has anything to do with it, though it shouldn't.

wget uses the full URL to save the file -- e.g. modules/node/node.css?b becomes the literal filename, including the ?b. When you try to fetch the page that contains a reference to that file, the browser will stop before the ?, and request the file modules/node/node.css, which will not match with modules/node/node.css?b. Your options are:

  1. Alter Drupal (temporarily?) to not include the ?b in css links
  2. Post-process the files downloaded by wget, renaming all foo?bar to foo (a rough shell sketch follows after this list)
  3. Use a different tool, like httrack
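
For option 2, here is a minimal sketch (assuming bash and a standard find; run it on a copy of the mirror first, since two variants of the same file would collapse into one name):

# Rename files like node.css?b to node.css by stripping everything from the first '?'
find . -type f -name '*\?*' | while read -r f; do
  mv "$f" "${f%%\?*}"
done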

I think it's because the robots.txt file disallows crawlers from going into the /sites/ directory, among others. I included -e robots=off in my command line and that seemed to pull in all the expected files. I've edited the wget section above to add this info.

My full command line was wget -w 1 -Erkp -e robots=off [URL] -o wget.log

Thanks a lot for these command line options! This command works fine for me for delivering a local development version of the site as a zip archive to my client for viewing purposes only.

If, like me, you had the problem that SiteSucker was not downloading CSS, or the site was not displayed correctly, try enabling the "ignore robots exclusions" option in the settings. After that, the site was downloaded and shown as expected.

Does anyone have any suggestions of how to deal with filtered views on a site to archive?

I would imagine they would have to be disabled. Httrack handles sorted table views but doesn't handle the exposed filters as they require a form response.

--
G

There most likely won't be anything to handle the callback. Ensure you disable Views Ajax pagers in particular.

Is it possible to keep URLs unchanged? Currently, with the default httrack options, node/12 is changed to node/12.html, test is changed to test.html, the query string (the one added to JS files for versioning) is merged into the filename, etc.

Ideally, we should be able to:
- save node/12 as node/12/index.html instead of node/12.html
- strip simple query strings, so that misc/drupal.js?m4kqgj is kept as misc/drupal.js instead of misc/drupald4c4.js. Currently we can use -N "%h%p/%n.%t"

I also wanted to keep the URLs unchanged and I found this nice article by KarenS where she writes:

One of the biggest problems of transforming a dynamic site into static pages is that the urls must change. The 'real' url of a Drupal page is 'index.php?q=/news' or 'index.php?q=/about', i.e. there is really only one HTML page that dynamically re-renders itself depending on the requested path. A static site has to have one HTML page for every page of the site, so the new url has to be '/news.html' or '/news/index.html'. The good thing about the second option is that incoming links to '/news' will automatically be routed to '/news/index.html' if it exists, so that second pattern is the one I want to use.

The -N flag in the command will rewrite the pages of the site, including pager pages, into the pattern "/about/index.html". Without the -N flag, the page at "/about" would have been transformed into a file called "about.html".

I followed her instructions using
httrack http://example.com -O . -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0

And it worked, at least with the correction she also suggests:
find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe "s/\/index.html/\//g"

The downside: images and other files are also put to their respective [filename]/index.[filetype] directories, so their URLs do change.

I researched the wget command and preferred the following instead:

wget -q --mirror -p --no-check-certificate --html-extension -e robots=off --base=./ -nd -k -P ./ <URL>

Here's what each argument means:

-q                      Don't write any wget output messages
--mirror                Turn on options suitable for mirroring, i.e. -r -N -l inf --no-remove-listing
-p                      Download images, scripts, & stylesheets so that everything works offline
--no-check-certificate  Ignore certificate warnings
--html-extension        Append .html to any downloaded files so that they can be viewed offline. E.g. www.example.com/example becomes example.html
-e robots=off           Disable robot exclusion so that you get everything Drupal needs
--base=./               Set the base URL to best resolve relative links
-nd                     Do not create a hierarchy of directories
-k                      Convert links to make them suitable for local viewing
-P ./                   Download here