How to produce a static mirror of a Drupal website?
Note: You should certainly only use this on your own sites...
Prepare the Drupal website
Create a custom block and/or post a node to the front page that notes that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving. Consider including a link to the future versions of the site (e.g. if you are archiving a 2008 event, link to the URL of the next event).
Disable interactive elements which will be nonfunctional in the static HTML version.
Use the Disable All Forms module to disable all forms.
- login block
- who's online block
- registration
- anonymous commenting
- links to the search module and/or any search boxes in the header
- comment controls which allow the user to select comment display format
- Disable Ajax requests such as Views pagers.
- Remove Views exposed filters
- Update all nodes by setting their comments to read only. This eliminates the "login or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL command against the node_revision table:
update node_revision set comment = '1';
Or with Drush:
drush sql:query "UPDATE node_revision SET comment = '1';"
- It can also be a good idea to disable any dynamically generated third-party blocks; once the site is archived, it would be difficult to remove these blocks if the third-party services are no longer available.
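Since the UPDATE above rewrites every row in node_revision, it may be worth dumping the database first. The following is a minimal sketch rather than a tested procedure: DRUSH, BACKUP_FILE, DRY_RUN and the helper names run and close_comments are hypothetical, and it assumes a Drush version that provides sql:dump and sql:query. By default it only prints the commands it would run.

```shell
# Sketch: dump the database, then close comments on every node revision.
# DRUSH, BACKUP_FILE and DRY_RUN are hypothetical names used for this example.
DRUSH=${DRUSH:-drush}
BACKUP_FILE=${BACKUP_FILE:-$HOME/pre-archive.sql}
DRY_RUN=${DRY_RUN:-1}   # 1 = only print the commands, 0 = execute them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+'; printf ' %s' "$@"; printf '\n'
  else
    "$@"
  fi
}

close_comments() {
  # Stop if the dump fails, so the UPDATE never runs without a backup.
  run "$DRUSH" sql:dump --result-file="$BACKUP_FILE" || return 1
  run "$DRUSH" sql:query "UPDATE node_revision SET comment = '1';"
}

close_comments
```

Set DRY_RUN=0 once the printed commands look right for your site.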
Create a static clone
HTTrack (Linux, macOS, Windows)
KarenS has created a very helpful article Sending a Drupal Site Into Retirement Using HTTrack (2014, updated 2020) where she suggests the following code (on a Linux console):
httrack https://LOCALSITE -O DESTINATION -N "%h%p/%n/index%[page].%t" -WqQ%v --robots=0 --footer ''
(LOCALSITE is the URL of the site that's being copied, and DESTINATION is the path to the directory where the static pages should go)
She further suggests running a regex over all files to fix link issues with index.html:
find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe 's|/index\.html|/|g'
This leaves Drupal's non-trailing-slash paradigm intact and avoids "duplicate content" issues while preserving absolute paths. Note that this only works with a web server configured to add the necessary trailing slashes again and resolve to the actual index.html file. To test it, use for example DDEV or Lando to quickly spin up a LAMP instance:
lando init --recipe lamp --name lamp --source cwd --webroot .
You also need to fix the root index.html by copying /index/index.html to /index.html, and remove ../ from paths in that file:
mv index/index.html .
sed -i 's|\.\./||g' index.html
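The in-place flag of sed behaves differently between GNU sed (Linux) and BSD sed (macOS), so the two commands above may need adjusting on a Mac. As a rough sketch, the same fix can be written without sed -i by going through a temporary file. The helper name fix_root_index is hypothetical.

```shell
# Sketch: portable version of the root index.html fix (no "sed -i").
# fix_root_index is a hypothetical helper name.
fix_root_index() {
  dir=${1:-.}
  if [ ! -f "$dir/index/index.html" ]; then
    echo "no index/index.html under $dir" >&2
    return 1
  fi
  # Strip every "../" while writing the new root file, then drop the old copy.
  sed 's|\.\./||g' "$dir/index/index.html" > "$dir/index.html.tmp" &&
    mv "$dir/index.html.tmp" "$dir/index.html" &&
    rm "$dir/index/index.html"
}

# Usage: run from the DESTINATION directory of the mirror.
# fix_root_index .
```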
The final task is to search and replace the link to the root index.html in all files, by updating the string href="../index/" to simply href="/":
find . -name "*.html" -type f -print0 | xargs -0 perl -i -pe "s|\.\./index/|/|g"
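After the rewrites above, it can be useful to check whether any page still links to index.html or ../index/. A minimal sketch, assuming the links appear inside double-quoted href attributes; the helper name list_leftover_index_links is hypothetical.

```shell
# Sketch: list mirrored pages that still link to index.html or ../index/.
# list_leftover_index_links is a hypothetical helper name.
list_leftover_index_links() {
  dir=${1:-.}
  # grep -l prints only the names of files that contain a match.
  find "$dir" -type f -name '*.html' \
    -exec grep -lE 'href="[^"]*(index\.html|\.\./index/)' {} +
  # grep exits non-zero when nothing matches; read the printed list instead.
  return 0
}

# Usage: an empty result means no leftover links were found.
# list_leftover_index_links .
```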
For a walk-through, see Convert Drupal to static site using HTTrack and deploy to GitHub Pages. For more cleanup tips, see Drupal on Mothballs - Convert Drupal 6 or 7 sites to static HTML.
Wget (UNIX, Linux, macOS, ...)
Wget is generally available on almost any 'nix machine and can produce the mirror from the command line. However, wget seems to have problems converting the relative style sheet URLs properly with many Drupal site pages. Modify your theme template to produce hardcoded absolute links to the stylesheets and try the following command:
wget -q --mirror -p --adjust-extension -e robots=off --base=./ -k -P ./ http://example.com
wget respects robots.txt files, so it might not download some of the files in /sites/ or elsewhere. To disable this, include the option -e robots=off in your command line.
wget keeps query strings in the saved file names, for example an image file ending in "?itok=qRoiFlnG". Recursively remove all query strings from the file names with:
find . -type f -name "*.*\?*" | while IFS= read -r filename; do mv "$filename" "${filename%%\?*}"; done
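One thing to watch with the rename above: if the mirror holds both style.css and style.css?v=1, the second would overwrite the first. The following sketch skips such collisions instead of overwriting. The helper name strip_query_strings is hypothetical, and it assumes file names do not contain newlines.

```shell
# Sketch: strip "?query" suffixes from mirrored file names without overwriting.
# strip_query_strings is a hypothetical helper name.
strip_query_strings() {
  dir=${1:-.}
  find "$dir" -type f -name '*\?*' | while IFS= read -r file; do
    base=${file##*/}                   # file name without its directory
    target=${file%/*}/${base%%\?*}     # same directory, name cut at the first "?"
    if [ -e "$target" ]; then
      echo "skipped, target exists: $file" >&2
    else
      mv -- "$file" "$target"
    fi
  done
}

# Usage: run from the directory that wget wrote the mirror into.
# strip_query_strings .
```

Skipped files are reported on standard error so they can be reviewed by hand.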
SiteSucker (macOS)
SiteSucker is a Mac GUI option for downloading a site.
Drupal modules
You can use a Drupal module to export some or all of your site as static HTML.
Verify that the offline version of your site works
Verify that the offline version of your site works in your browser. Test to make sure that you properly turned off any interactive elements in Drupal that will now confuse site users.
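Besides clicking through the pages, a quick text search of the mirror can hint at interactive elements that were missed. This is only a sketch: the helper name report_interactive_leftovers is hypothetical, and the patterns are examples (form tags plus the usual Drupal login, registration and comment reply paths) that you would adjust to your site.

```shell
# Sketch: count mirrored pages that still contain interactive leftovers.
# report_interactive_leftovers and the pattern list are illustrative only.
report_interactive_leftovers() {
  dir=${1:-.}
  for pattern in '<form' 'user/login' 'user/register' 'comment/reply'; do
    # Number of HTML files containing the fixed string at least once.
    count=$(find "$dir" -type f -name '*.html' \
      -exec grep -lF -- "$pattern" {} + | wc -l | tr -d ' ')
    printf '%s: %s\n' "$pattern" "$count"
  done
}

# Usage: every count should ideally be 0.
# report_interactive_leftovers .
```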
Why create a static site archive?
- Perhaps over time your website has essentially become static. Because such sites still require security administration, an administrator has to keep applying patches or consider removing the site altogether.
- You want to ensure that the site is preserved on Drupal.org infrastructure (without direct cost to you)
- Alternatively, you may want to produce an offline copy for archiving or convenient reference when you don't have access to the Internet. Before simply removing the site, consider another alternative: a Drupal site is maintained inside a firewall, and then the output of the site is periodically cached to static HTML files and copied to public servers.
Comments
Disable Ajax
There most likely won't be anything to handle the callback. Ensure you disable Views Ajax pagers in particular.
Twitter: http://twitter.com/brentratliff
Blog: http://laminarlogic.com
http://mediacurrent.com
I researched the wget command
I researched the wget command and preferred the following instead:
Here's what each argument means:
Even more details, with .htaccess examples
This was a very helpful page with helpful comments; I have now completed the transition of a site to an archive.
Here are some other useful links:
Park your old Drupal site
and
Creating a static Drupal site
The latter shows how to handle the tricky Acidfree module, and also shows .htaccess rules that keep the Drupal archive in a subdirectory while still allowing another package to be used at the root URL.
How I did it
Since my actual workflow for a larger site (60k nodes) was a mix of multiple links/howtos I found here, I'd like to share a log of it. Maybe it helps one or two of you.
Thanks again to all helpers I already had!
HTTrack worked well
Thanks for this guide! I used HTTrack to deploy a former D7 site to GitHub Pages. Here's a quick video tutorial I made in case it's helpful to someone: https://www.youtube.com/watch?v=SDEUW4UVS8c&list=UUpGmkFt8EgnMAaZ2eJ8mRi...
Thanks @jimafisk I added it
Thanks @jimafisk I added it under the HTTrack section. Have a great day!