Some sites just *won't* be updated to Drupal 7. In fact, some will never be updated to Drupal 5 in my case (!!). The solution is to "fossilize" them, to convert them to HTML and simple static files.

I believe there is room in Aegir to support that kind of functionality, since we want to provide long-term support for Drupal sites, often beyond what is really possible.

This is my first crude implementation:

wget --mirror -k -e robots=off --wait 1 http://example.com/

The robots part is actually *not* recommended; it was only required on one particular site I tested that had disabled all crawling entirely.

The wait part is important. The -k will convert absolute links (common in misconfigured sites) to relative links.

The mirror bit will make sure this can be run over and over again and that we will slurp *everything*.

Now, there are a couple of things needed to improve this:

1. maybe we could tap into the Drupal database to see *what* to mirror instead of just parsing the HTML, which is what wget does (where would we look? node? path? taxonomy? menu?); see the sketch after this list
2. feedback in the frontend of the created site
3. create a subdomain to host the static site
4. register that subdomain in the frontend (this would of course depend on #1132532: Create basic vhosts)
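
A minimal sketch of the first idea, assuming drush can bootstrap the site (via --uri or an alias) and assuming the Drupal 7 url_alias schema (Drupal 6 uses src/dst columns instead); it pulls the published node paths, aliased where available, out of the database and feeds them to wget as a seed list instead of relying only on HTML parsing:

# Hypothetical seed-list approach: list the URLs of all published nodes,
# preferring their aliases. Schema assumptions as noted above.
drush --uri=http://example.com sql-query "
  SELECT CONCAT('http://example.com/',
                COALESCE(ua.alias, CONCAT('node/', n.nid)))
  FROM node n
  LEFT JOIN url_alias ua ON ua.source = CONCAT('node/', n.nid)
  WHERE n.status = 1" > urls.txt
# A stray column-header line in urls.txt, if any, can simply be deleted.

# Mirror starting from the seed list instead of only the front page.
wget --mirror -k -K -E -p --wait=1 -e robots=off -i urls.txt

This only covers nodes; taxonomy, menu and views paths would need queries of their own, which is why parsing the HTML as wget does may still be the simpler answer.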

More things to be done to make this work better:

* disable comments for anonymous users
* disable content submission for anonymous users
* disable any dynamic pages (e.g. search forms, calendars, etc)
* make an SQL dump of the site and a tarball of its source code (aka one last backup), along with a makefile describing how to recreate the source code if necessary (a rough sketch follows this list)
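
A rough sketch of that last bullet, assuming drush is available for the site; every path below is a placeholder:

# One last backup before fossilizing: a database dump plus a tarball of the
# codebase. Paths are placeholders.
drush --uri=http://example.com sql-dump --result-file=/backups/example.com.sql
tar -czf /backups/example.com-codebase.tar.gz -C /var/aegir/platforms/example-platform sites/example.com

Recent drush versions also ship archive-dump, which bundles code, files and database into a single tarball and could serve as that "one last backup" on its own.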

Comments

anarcat’s picture

The command now is:

wget --mirror -k -K --html-extension -e robots=off --wait 1 http://example.com/

The --html-extension bit is important; otherwise the node/1 pages do not get served as HTML by Apache. -K is useful because it keeps a backup of each file before link conversion, which lets wget avoid re-downloading files that have merely been re-converted.

omega8cc’s picture

I typically use something like:

wget -b -m -k -p -E --random-wait --user-agent=iCab -erobots=off --tries=5 --exclude-directories=calendar,users,user --domains=domain.org http://domain.org/

It fetches the complete site with all CSS/JS, images, downloads, etc., converts all URLs and internal links to .html, and doesn't fetch any third-party content, so it will not try to download the whole Internet :)

Note that you can specify any aliases (if used), comma-separated, in --domains=.

anarcat’s picture

Thanks for the feedback!

-p seems to be necessary indeed; I missed that one. It's also nice to use the --domains= option for aliases; this would integrate well with Aegir! I don't see why retries should be configured, since wget retries by default (although maybe the default of 20 retries is too high?). Also, I don't feel that --random-wait is necessary unless you have a weird configuration; we're leeching off a friendly site here, after all.

I also found that -nv makes the whole thing much more readable. -b doesn't seem necessary; I like to run this in the foreground. So this would become:

wget -m -k -K -E -p -w 1 -nv -e robots=off -X calendar -D example.org http://example.com/

Or, in the long form:

wget --mirror --convert-links --backup-converted --html-extension --page-requisites --wait=1 --no-verbose -e robots=off --exclude-directories=calendar --domains=example.org http://example.com/

Next up is a drush command... I could very well see the "exclude-directories" and "domains" parameters being configurable, along with the "robots=off" setting (robots.txt should normally cover the excludes, so the default would be robots=on).
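
Until that drush command exists, a hypothetical shell wrapper shows how those three knobs could be exposed; the FOSSILIZE_* variable names are made up for illustration and are not existing Aegir or drush settings:

#!/bin/sh
# Hypothetical wrapper around the wget invocation above.
# Usage: fossilize.sh http://example.com/
FOSSILIZE_EXCLUDE="${FOSSILIZE_EXCLUDE:-calendar}"    # --exclude-directories
FOSSILIZE_DOMAINS="${FOSSILIZE_DOMAINS:-example.org}" # --domains (site aliases)
FOSSILIZE_ROBOTS="${FOSSILIZE_ROBOTS:-on}"            # robots=off only when needed

wget --mirror --convert-links --backup-converted --html-extension \
  --page-requisites --wait=1 --no-verbose \
  -e "robots=$FOSSILIZE_ROBOTS" \
  --exclude-directories="$FOSSILIZE_EXCLUDE" \
  --domains="$FOSSILIZE_DOMAINS" \
  "$1"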

helmo’s picture

There is a d.o documentation page about "fossilizing": http://drupal.org/node/27882

I've used httrack in the past to do this, and it did a nice job. Back then wget was not really working for me, although I was probably not using all the options described in this issue.

httrack http://example.com/ -W -O ~/static_cache -%v --robots=0

Guillaume Beaulieu’s picture

We've written some French documentation over here:
https://wiki.koumbit.net/Drupal/CommentFossiliser

There are issues regarding the Event module, among others.

The wget part is still the biggest and buggiest.
wget --mirror -e robots=off --page-requisites --html-extension -nv --base=./ --convert-links --directory-prefix=./ http://www.foo.org/

--random-wait is only useful when you don't control the environment.

Another step is to generate 301s for all of the site's old URLs (node/1 => node/1.html).
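
A sketch of how those redirects could be generated, assuming the static copy is served by Apache with mod_alias enabled and .htaccess overrides allowed; the paths are placeholders:

# Append a Redirect 301 rule to the static site's .htaccess for every
# fossilized node page (node/1 => node/1.html, and so on).
cd /var/www/example.com-static
for f in node/*.html; do
  echo "Redirect 301 /${f%.html} /$f" >> .htaccess
done

Other Drupal paths (taxonomy/term/N, aliased pages, etc.) would need the same treatment, or a single RedirectMatch rule per pattern.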

Guillaume Beaulieu’s picture

Also, Drupal tends to generate links to CSS files that wget does not handle well. You have to find the bogus ?j.css thing afterward (can anyone point me to where that comes from?) and then run, in the directory holding all the downloaded files:

for A in $(find . -iname '*css\?j.css'); do cp "$A" "$(dirname "$A")/$(basename "$A" '.css?j.css').css"; done

helmo’s picture

Version: 7.x-2.x-dev » 7.x-3.x-dev

Moving to 7.x-3.x version tag (7.x-2.x never existed, we went for 6.x-2.x).

helmo’s picture

Karen Stevenson has a nice article about this topic: http://www.lullabot.com/blog/article/sending-drupal-site-retirement

jmee’s picture

Re: #5, the link for that documentation has changed; you can now find it here: https://wiki.koumbit.net/Fossilisation
