Last updated May 9, 2013. Created by doomed on July 31, 2005.
Edited by forestmonster, drumm, greggles, coltrane.
How to produce a static mirror of a Drupal website?
Note: You should certainly only use this on your own sites...
Prepare the Drupal website
Create a custom block and/or post a node to the front page that notes that the site has been archived from Drupal to static HTML. Be sure to include the date of the archiving. Consider including a link to the future versions of the site (e.g. if you are archiving a 2008 event, link to the URL of the next event).
Disable interactive elements that will not function in the static HTML version.
The Disable All Forms module can be used to disable all forms at once. Elements to turn off include:
- login block
- who's online block
- registration
- anonymous commenting
- links to the search module and/or any search boxes in the header
- comment controls which allow the user to select comment display format
- Disable Ajax requests such as Views pagers.
- Remove Views exposed filters
- Update all nodes by setting their comments to read only. This will eliminate the "login or register to post comments" link that would otherwise accompany each of your posts. You can do this through phpMyAdmin by running the following SQL command against the node table:
  update node set comment = '1';
- It can also be a good idea to disable any third-party dynamically generated blocks; once the site is archived, it would be difficult to remove these blocks if the third-party services are no longer available.
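As a rough sketch of the SQL step above, the following shell function prints the update together with a before-and-after count, so the effect can be checked. The function name comments_readonly_sql and the optional table-prefix argument are assumptions for illustration; the statement itself is the one given above.

```shell
#!/bin/sh
# Sketch: print the SQL for switching every node to read-only comments,
# with a count of comment settings before and after the change.
# The optional first argument is a hypothetical database table prefix.

comments_readonly_sql() {
    prefix=${1:-}
    cat <<EOF
-- How many nodes currently use each comment setting?
SELECT comment, COUNT(*) FROM ${prefix}node GROUP BY comment;
-- Set every node to read-only comments (the statement from the step above).
UPDATE ${prefix}node SET comment = '1';
-- Afterwards, only the value 1 should remain.
SELECT comment, COUNT(*) FROM ${prefix}node GROUP BY comment;
EOF
}

# Print the statements; nothing is sent to a database here.
comments_readonly_sql "${1:-}"
```

The output can be pasted into phpMyAdmin or, after taking a database backup, piped to the mysql client, for example `sh comments_readonly.sh | mysql -u USER -p DATABASE`, where the script name, USER, and DATABASE are placeholders.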
Create a static clone
Wget (UNIX, Linux, OSX, ...)
Wget is generally available on almost any 'nix machine and can produce the mirror from the command line. However, wget seems to have problems converting the relative style sheet URLs properly with many Drupal site pages. Modify your theme template to produce hardcoded absolute links to the stylesheets and try the following command:
wget -q --mirror -p --html-extension -e robots=off --base=./ -k -P ./ http://example.com
wget respects robots.txt files, so it might not download some of the files in /sites/ or elsewhere. To disable this, include the option -e robots=off in your command line.
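The short flags in the wget command above can be hard to audit, so here is the same invocation sketched with wget's long option names. The function name mirror_site, the DRY_RUN variable, and the ./mirror output directory are assumptions for illustration. Newer wget releases call --html-extension by the name --adjust-extension.

```shell
#!/bin/sh
# Sketch: the wget mirror command from above, spelled out with long options.
#   --quiet              same as -q: no progress output
#   --mirror             recursive download with timestamping
#   --page-requisites    same as -p: also fetch CSS, JavaScript, and images
#   --html-extension     save HTML pages with an .html suffix
#   --execute robots=off same as -e robots=off: ignore robots.txt
#   --convert-links      same as -k: rewrite links for local viewing
#   --directory-prefix   same as -P: directory to save files under

mirror_site() {
    site_url=$1
    out_dir=$2

    # By default (DRY_RUN=1) the command is only printed, not executed.
    if [ "${DRY_RUN:-1}" = "1" ]; then
        runner=echo
    else
        runner=
    fi

    $runner wget \
        --quiet \
        --mirror \
        --page-requisites \
        --html-extension \
        --execute robots=off \
        --base=./ \
        --convert-links \
        --directory-prefix "$out_dir" \
        "$site_url"
}

# Example: print the command for a placeholder site and output directory.
mirror_site "${1:-http://example.com}" "${2:-./mirror}"
```

Run it with DRY_RUN=0 to perform the download once the printed command looks right.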
HTTrack (UNIX and Windows)
HTTrack. The Windows GUI client version will produce the mirror with almost no configuration on your part. One potential command to use is:
httrack http://2011.example.com -K -w -O . -%v --robots=0 -c1 %e0
With HTTrack properly configured, you don't have to hack on common.inc to get all of your stylesheets to work correctly. However, with the default robots.txt settings in Drupal 5 and the "good citizen" default HTTrack settings, you won't get any module or theme CSS files or JavaScript files.
If you're working from a local installation of Drupal and want to grab ALL of your files in a way that you can just copy them up to a server, try the following command:
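httrack http://localhost/ -W -O ~/static_cache -%v --robots=0
(The output path is left unquoted so that the shell expands ~ to your home directory; inside double quotes the tilde is taken literally.)
Sitesucker (OSX)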
Site Sucker. This is a Mac GUI option for downloading a site.
HTML Export module
Check out the HTML Export project - This is a Drupal module that dumps a working HTML version of your site.
Verify that the offline version of your site works
Verify that the offline version of your site works in your browser. Test to make sure that you properly turned off any interactive elements in Drupal that will now confuse site users.
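Browser testing can be supplemented with a quick scan of the mirrored files. The sketch below looks for three leftovers: HTML files that still contain a form, HTML files that still point at the live host, and saved files whose names contain a query string. The function name check_mirror and its two arguments (the mirror directory and the original host name) are assumptions for illustration.

```shell
#!/bin/sh
# Sketch: scan a mirrored directory for leftovers that tend to break a static copy.
# Usage (hypothetical script name): sh check_mirror.sh ./mirror example.com

check_mirror() {
    mirror_dir=$1
    original_host=$2

    echo "== HTML files that still contain a <form> element =="
    find "$mirror_dir" -type f -name '*.html' -exec grep -li '<form' {} +

    echo "== HTML files that still reference the live host =="
    find "$mirror_dir" -type f -name '*.html' -exec grep -lF "//$original_host" {} +

    echo "== Saved files with a query string in their name =="
    find "$mirror_dir" -type f -name '*[?]*'
}

# Run only when a directory is given on the command line.
if [ -n "${1:-}" ]; then
    check_mirror "$1" "${2:-example.com}"
fi
```

An empty section means nothing of that kind was found. Any file that is listed is a candidate for another pass through the preparation steps above.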
Why create a static site archive?
- Perhaps over time your website has essentially become static. Because such sites still require security administration, an administrator has to keep patching the site or consider removing it altogether.
- You want to ensure that the site is preserved on Drupal.org infrastructure (without direct cost to you)
- Alternatively, you may want to produce an offline copy for archiving or convenient reference when you don't have access to the Internet.
Before simply removing the site, consider another alternative: a Drupal site is maintained inside a firewall, and then the output of the site is periodically cached to static HTML files and copied to public servers.
Comments
curious why wget has issues
I've always found wget pretty reliable when you figure out the right flags ;)
Before I found this documentation, I also had problems with site-relative stylesheet links, but not all of them. If there's a pattern, it seems wget successfully converted @href values in the "/sites/[mysite]" directory to "sites/...", but actually converted others to full URLs including protocol and domain (http://example.com/...). For example, "/modules/node/node.css?b" came out as "http://example.com/modules/node/node.css?b". Hmmph??! I guess the problem links came from sites/all on the filesystem, but I expect this to be irrelevant to a client (like wget).
Here are the flags I used; -Erkp are probably the only relevant ones:
wget -w 3 --random-wait --user-agent=hugh -Erkp http://example.com -o example.wget.log
I can work around this with other tools for now. Just on the off-chance, does anyone happen to know why this even happens? I think -np ("no parent") is default behaviour, just in case this has anything to do with it, though it shouldn't.
wget file-saving behavior
wget uses the full URL to save the file -- e.g. modules/node/node.css?b becomes the literal filename, including the ?b. When you try to fetch the page that contains a reference to that file, the browser will stop before the ?, and request the file modules/node/node.css, which will not match with modules/node/node.css?b. Your options are:
Possibly due to robots.txt
I think it's because the robots.txt file disallows crawlers from going into the /sites/ directory, among others. I included -e robots=off in my command line and that seemed to pull in all the expected files. I've edited the wget section above to add this info.
My full command line was
wget -w 1 -Erkp -e robots=off [URL] -o wget.log
That's it
Thanks a lot for these command line options! The command works fine for me for delivering a local development version of the site as a zip archive to my client, for viewing purposes only.
Sitesucker ignores css
If, like me, you find that SiteSucker does not download the CSS or that the site is not displayed correctly, try enabling the "ignore robots exclusions" option in the settings. After that, the site was downloaded and displayed as expected.
Dealing with filtered views
Does anyone have any suggestions of how to deal with filtered views on a site to archive?
I would imagine they would
I would imagine they would have to be disabled. HTTrack handles sorted table views but doesn't handle the exposed filters, as they require a form response.
--
G
Disable Ajax
There most likely won't be anything to handle the callback. Ensure you disable Views Ajax pagers in particular.
Twitter: http://twitter.com/brentratliff
Blog: http://laminarlogic.com
Is it possible to keep url
Is it possible to keep URLs unchanged? Currently, with the default HTTrack options, node/12 is changed to node/12.html, test is changed to test.html, the query string (the one added to JS files for versioning) is merged into the filename, etc.
Ideally, we should be able to:
- node/12 is saved as node/12/index.html instead of node/12.html
- Remove simple query string: misc/drupal.js?m4kqgj is kept as misc/drupal.js instead of misc/drupald4c4.js. Currently we can use -N "%h%p/%n.%t"
www.thongtincongnghe.com is powered by Drupal.