How to produce a static mirror of a Drupal website?

Note: You should only use this on your own sites, or on sites you are authorized to archive...

Prepare the Drupal website

Create a custom block and/or post a node to the front page noting that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving. Consider including a link to future versions of the site (e.g. if you are archiving a 2008 event, link to the URL of the next event).

Disable interactive elements which will be nonfunctional in the static HTML version.

Use the Disable All Forms module to disable all forms.

  • the login block
  • the "who's online" block
  • user registration
  • anonymous commenting
  • links to the search module and/or any search boxes in the header
  • comment controls which allow the user to select a comment display format
  • AJAX requests such as Views pagers
  • Views exposed filters
  • Update all nodes by setting their comments to read-only. This eliminates the "login or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL against the node_revision table:
    UPDATE node_revision SET comment = '1';
    Or with Drush:
    drush sql:query "UPDATE node_revision SET comment = '1';"
  • It is also a good idea to disable any dynamically generated third-party blocks; once the site is archived, it would be difficult to remove these blocks if the third-party services are no longer available.

Create a static clone

HTTrack (Linux, macOS, Windows)

KarenS has created a very helpful article Sending a Drupal Site Into Retirement Using HTTrack (2014, updated 2020) where she suggests the following code (on a Linux console):

httrack https://LOCALSITE -O DESTINATION -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0 --footer ''

(LOCALSITE is the URL of the site that's being copied, and DESTINATION is the path to the directory where the static pages should go)
In the same article, she further suggests running a regex on all files to fix link issues with index.html:

find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe "s/\/index\.html/\//g"

This leaves Drupal's non-trailing-slash path paradigm intact and avoids "duplicate content" issues while preserving absolute paths. Note that this only works with a web server configured to add the necessary trailing slashes again and resolve to the actual index.html file. Use, for example, DDEV or Lando to quickly spin up a LAMP instance to test it with:

lando init --recipe lamp --name lamp --source cwd --webroot .

You also need to fix the root index.html by moving /index/index.html to /index.html and removing ../ from the paths in that file:

mv index/index.html .
sed -i 's|\.\./||g' index.html
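To check what the sed cleanup does, here is a small self-contained example; the stylesheet path is made up. Note that this uses GNU sed; on macOS, `sed -i ''` is needed instead of `sed -i`:

```shell
# Strip leading ../ segments from paths in a scratch copy of index.html.
tmp=$(mktemp -d)
printf '<link href="../sites/all/themes/style.css">\n' > "$tmp/index.html"
sed -i 's|\.\./||g' "$tmp/index.html"
# "$tmp/index.html" now contains href="sites/all/themes/style.css"
```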

The final task is to search and replace the link to the root index.html in all files, updating the string href="../index/" to simply href="/":

find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe "s|\.\./index/|/|g"

For a walk-through, see Convert Drupal to static site using HTTrack and deploy to GitHub Pages and for more clean up tips there's Drupal on Mothballs - Convert Drupal 6 or 7 sites to static HTML.

Wget (UNIX, Linux, macOS, ...)

Wget is generally available on almost any 'nix machine and can produce the mirror from the command line. However, wget seems to have problems converting relative stylesheet URLs properly on many Drupal pages. Modify your theme template to produce hardcoded absolute links to the stylesheets and try the following command:

wget -q --mirror -p --adjust-extension -e robots=off --base=./ -k -P ./ http://example.com

By default, wget respects robots.txt, so it might not download some of the files in /sites/ or elsewhere; the -e robots=off option in the command above disables this behavior.
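After the mirror completes, a quick sanity check is to grep the downloaded pages for the live hostname; any hits are links wget failed to convert. A self-contained sketch, where example.com and the file contents stand in for your site:

```shell
# Build a scratch mirror with one unconverted link and one clean link,
# then list only the files still pointing at the live site.
tmp=$(mktemp -d)
printf '<a href="http://example.com/about">About</a>\n' > "$tmp/leftover.html"
printf '<a href="/about/">About</a>\n' > "$tmp/clean.html"
grep -rl "http://example.com" "$tmp" --include="*.html"
```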

wget keeps query strings in downloaded file names, such as "?itok=qRoiFlnG" on image files. Recursively remove all query strings with:

find . -type f -name "*\?*" | while read -r filename; do mv "$filename" "${filename%%\?*}"; done
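A throwaway demo of the rename loop, using a made-up file name:

```shell
# Create a file whose name still carries an itok query string,
# then strip everything from the first "?" onward.
tmp=$(mktemp -d)
touch "$tmp/logo.png?itok=qRoiFlnG"
find "$tmp" -type f -name "*\?*" | while read -r f; do mv "$f" "${f%%\?*}"; done
# "$tmp" now contains logo.png
```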

SiteSucker (macOS)

SiteSucker is a macOS GUI application for downloading a site.

Drupal modules

You can use a Drupal module to export some or all of your site as static HTML.

Verify that the offline version of your site works

Verify that the offline version of your site works in your browser. Test to make sure that you properly turned off any interactive elements in Drupal that will now confuse site users.

Why create a static site archive?

  • Perhaps over time your website has essentially become static. Because such sites still require security administration, an administrator has to continue upgrading the site with patches or consider removing the site altogether.
  • You want to ensure that the site is preserved on Drupal.org infrastructure (without direct cost to you)
  • Alternatively, you may want to produce an offline copy for archiving, or for convenient reference when you don't have access to the Internet. Before simply removing the site, consider another alternative: maintain the Drupal site inside a firewall, and periodically cache its output to static HTML files copied to public servers.

Comments

brentratliff’s picture

There most likely won't be anything to handle the callback. Ensure you disable Views Ajax pagers in particular.

chyatt’s picture

I researched the wget command and preferred the following instead:

wget -q --mirror -p --no-check-certificate --html-extension -e robots=off --base=./ -nd -k -P ./ <URL>

Here's what each argument means:

-q                      Don't write any wget output messages
--mirror                Turn on options suitable for mirroring, i.e. -r -N -l inf --no-remove-listing
-p                      Download images, scripts, & stylesheets so that everything works offline
--no-check-certificate  Ignore certificate warnings
--html-extension        Append .html to any downloaded files so that they can be viewed offline. E.g. www.example.com/example becomes example.html
-e robots=off           Disable robot exclusion so that you get everything Drupal needs
--base=./               Set the base URL to best resolve relative links
-nd                     Do not create a hierarchy of directories
-k                      Convert links to make them suitable for local viewing
-P ./                   Download here
bwooster47’s picture

This was a very helpful page with helpful comments; I have now completed the transition of a site to an archive.
Here are some other useful links:
Park your old Drupal site
and
Creating a static Drupal site
The latter shows how to handle the tricky Acidfree module, as well as shows .htaccess rules to keep Drupal archive in a sub-dir and still allow some other package to be used at the root URL.

doitDave’s picture

Since my actual workflow for a larger site (60k nodes) was a mix of multiple links/howtos I found here, I'd like to share a log of it. Maybe it helps one or two of you.

Thanks again to all helpers I already had!

jimafisk’s picture

Thanks for this guide! I used HTTrack to deploy a former D7 site to GitHub Pages. Here's a quick video tutorial I made in case it's helpful to someone: https://www.youtube.com/watch?v=SDEUW4UVS8c&list=UUpGmkFt8EgnMAaZ2eJ8mRi...

ressa’s picture

Thanks @jimafisk I added it under the HTTrack section. Have a great day!