Static mirror? Offline mirror?
earlax - March 2, 2005 - 07:01
I need to be able to mirror my site. Can I do this with Drupal?
I can use Drupal for the real site, but I want to be able to run mirrors with static HTML that update every ${so_often}. I would also like to be able to create an offline-browsable version.
How?
Thanks in advance.

An offline-browsable version
Would be quite amusing/useful. I'd like to be able to archive a site and have it be viewable even if I don't happen to have a server to run everything on. How feasible/ridiculously impossible would something like that be?
---
Govorite po-russkij? Want to?
http://www.sobrania.net/russian/
Try wget -r
Some variation of "wget -r" ought to work. Be sure to only run it against your own site; others may take offense at you scraping their site.
(wget is a versatile download utility provided by GNU)
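One possible "variation of wget -r" for an offline-browsable copy is sketched below. The function name, the URL and the output directory are placeholders, not values from this thread, and the exact set of flags is only a suggestion to adapt to your own site.

```shell
# Sketch: mirror your own site into a folder that can be opened from disk.
# mirror_site is a hypothetical helper; pass the site URL and an output directory.
mirror_site() {
  site_url=$1
  out_dir=${2:-./static-mirror}
  # -r   follow links recursively
  # -p   also fetch images, CSS and other page requisites
  # -k   rewrite links so they work in the local copy
  # -E   save HTML pages with an .html extension
  # -np  never climb above the starting directory
  # --wait=1  pause one second between requests to go easy on the server
  # -P   directory to save into
  # Set WGET=echo to preview the command instead of running it.
  "${WGET:-wget}" -r -p -k -E -np --wait=1 -P "$out_dir" "$site_url"
}

# Example call (placeholder URL):
#   mirror_site http://localhost/drupal/ ./static-mirror
```

The resulting directory can be copied to another box or burned to a CD as-is, since -k rewrites the links to be relative.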
Testing now (against my own)
Testing now against my own local test site to see what kind of output I get...
I've been away from computer work too long; I forget about the basic tools. =) wget never even occurred to me.
Update: 'wget -r' works quite well. Kind of disturbing how big my site became in the few months I was testing it, though...
---------
Govorite po-russkij? Want to?
http://www.sobrania.net/russian/
XML based mirror?
How about creating a Drupal site on another server that keeps track of your articles based on XML feeds created by your main server?
"offline"
That's a good idea, but it's not offline, which is one of my primary concerns. Can I run a static version on another box if I get slashdotted? Can I burn the whole site to a CD for local browsing, etc.?
Tatu Ylonen, SSH 1.2.12 README: "Beware that the most effective
way for someone to decrypt your data may be with a rubber hose."
I think you should use boost
I think you should use the Boost module.
sina.salek.ws
Feel freedom with open source softwares
Successful capture of site with form-based authentication
My challenge was to capture the contents of a site that required a login to access the content. I had admin-level control over the site, so access was not a concern, but I needed to archive a static "snapshot" of the site every once in a while for legal purposes.
I had been able to do this on a Drupal 4.7 site with HTTrack, using its form-based "Catch URL" function: you get ready to log in to the site, then point your browser's proxy setting at the program. You submit the login, and the program captures the login request and uses it to emulate a valid login when spidering the site. I was not able to get this to work with the Drupal 5.10 version of my site, so I pursued other freeware (Windows) avenues. I found no other freeware spider programs that handled form-based authentication, so I thought I was out of options until I discovered the IP Login module.
I installed it, assigned my IP to a specially created and permissioned "archiver" user, and was then able to use HTTrack. Since HTTrack comes from the same IP, Drupal lets it in as my "archiver" user without needing to actually log in: no more form-based authentication headaches. The only downside is temporarily opening up the site to my one IP, which IS a shared IP since I'm using DSL at home. For a private, unpublished site such as mine, this risk seems negligible, and on top of that the "archiver" user is pretty much "read only".
I'm not sure what was wrong with HTTrack and Drupal, but I think a new cookie kept being created, overwriting the "good" logged-in cookie. Logs indicated that the Login Destination module was properly routing the user to the right page (implying that they were actually getting logged in), but that page would come back as 403 Forbidden.
Cheers
There is a module called
There is a module called Boost, which you might already be familiar with. I guess you could tweak this module to work for authenticated content as well. If you do, you can easily copy the cached content, which consists of complete HTML pages, to your mirror servers using rsync or similar commands.
Regards
sina.salek.ws, CIO & Lead developer
Feel freedom with open source softwares
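The rsync step suggested above might look roughly like this. It is a sketch only: push_cache is a made-up helper, and both the cache directory and the mirror destination are assumptions, since where Boost writes its pages depends on how the module is configured.

```shell
# Sketch: push a directory of cached HTML pages to a mirror server.
# push_cache is a hypothetical helper; both arguments are placeholders.
push_cache() {
  cache_dir=$1   # local directory holding the cached pages
  target=$2      # rsync destination, e.g. user@mirror.example.org:/var/www/html/
  # -a  preserve permissions and modification times
  # -z  compress data in transit
  # --delete  remove files on the mirror that no longer exist locally
  # The trailing slash on the source copies the directory's contents,
  # not the directory itself.
  # Set RSYNC=echo to preview the command instead of running it.
  "${RSYNC:-rsync}" -az --delete "${cache_dir%/}/" "$target"
}

# Example call (placeholder paths):
#   push_cache /var/www/drupal/cache user@mirror.example.org:/var/www/html/
```

Because --delete removes anything on the mirror that is missing from the source, it is worth previewing with RSYNC=echo (or rsync's own --dry-run) before the first real run.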
Feature Request?
I'm very interested in this feature for the exact same reason. Should we get a feature request going?
Knowing which views to generate/crawl, how paging is handled and the rest is quite tricky for either of these suggestions, but I see two plans for this:
Plan 'A' 'wget --mirror':
This is discussed here: http://drupal.org/node/27882
Basically, you have to turn off any dynamic blocks the mirror process might see, since capturing them would make the mirror look ugly and confuse people. You would also have to embed a message in the page somewhere that says, "This is a read-only copy".
One could put something in settings.php to look at HTTP_USER_AGENT and set a global flag. Block on/off could be triggered by PHP inspection of this global. Not nice, but it would work.
When running wget, use the --user-agent option to send your 'magic' agent string, so that the mirror gets the read-only message and block configuration you require.
I'm not massively into this plan as the mirror would take a long time and lots of CPU to build. Refresh would not be much better.
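The wget side of Plan 'A' could be sketched as follows. The agent string "StaticMirrorBot/1.0" and the function name are invented for illustration; whatever string is used here, the check in settings.php would have to look for the same value.

```shell
# Sketch: crawl the site while announcing a 'magic' user agent.
# MIRROR_AGENT is a made-up string, not an established convention.
MIRROR_AGENT="${MIRROR_AGENT:-StaticMirrorBot/1.0}"

mirror_with_agent() {
  site_url=$1
  out_dir=${2:-./static-mirror}
  # --mirror turns on recursion, unlimited depth and timestamp checks.
  # -p, -k and -E fetch page requisites, rewrite links for local use
  # and add .html extensions, as in a plain recursive download.
  # Set WGET=echo to preview the command instead of running it.
  "${WGET:-wget}" --mirror -p -k -E --user-agent="$MIRROR_AGENT" \
    -P "$out_dir" "$site_url"
}

# Example call (placeholder URL):
#   mirror_with_agent http://localhost/drupal/ ./static-mirror
```

Anyone can send the same agent string, so the flag should only switch blocks and the notice on or off, never grant extra access.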
Plan 'B' 'drupal module':
This is much better.
Create a module that could run as part of cron and (re)generate pages created/changed since last run and write static files. It should only access content the anonymous user can see. It should add the "This is a read-only copy" message.
One could configure the module so it knew which blocks were dynamic and which static, so it would disable all dynamic blocks as part of the run.
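Independent of how the static files get produced, the "This is a read-only copy" message could also be stamped onto them as a post-processing step. The sketch below is not the proposed module, just a shell approximation; the function names and the "mirror-notice" class are invented, and it assumes the opening body tag is lowercase and sits on a single line.

```shell
# Sketch: add a read-only notice to already-generated static pages.
# NOTICE, add_notice and stamp_tree are hypothetical names.
NOTICE='<p class="mirror-notice">This is a read-only copy.</p>'

add_notice() {
  file=$1
  # Skip files that were already stamped on an earlier run.
  if grep -q 'class="mirror-notice"' "$file"; then
    return 0
  fi
  tmp="$file.tmp.$$"
  # Insert the notice right after the opening <body ...> tag.
  sed "s|<body[^>]*>|&$NOTICE|" "$file" > "$tmp" && mv "$tmp" "$file"
}

stamp_tree() {
  # Apply add_notice to every .html file under the given directory.
  find "$1" -type f -name '*.html' | while IFS= read -r f; do
    add_notice "$f"
  done
}

# Example call (placeholder directory):
#   stamp_tree ./static-mirror
```

Running stamp_tree again after a refresh is harmless, since pages that already carry the notice are left alone.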