Static mirror? Offline mirror?

By earlax on 2 Mar 2005 at 07:01 UTC

I need to be able to mirror my site, can I do this with Drupal?

I can use Drupal for the real site, but I want to be able to run mirrors with static HTML that update every ${so_often}. I would also like to be able to create an offline-browsable version.

How?

Thanks in advance.

Comments

An offline-browsable version

signal9 commented 15 March 2005 at 01:11

Would be quite amusing/useful. I'd like to be able to archive a site and have it be viewable even if I don't happen to have a server to run everything on. How feasible/ridiculously impossible would something like that be?

---
Govorite po-russkij? Want to?
http://www.sobrania.net/russian/

Try wget -r

ezheidtmann commented 15 March 2005 at 01:24

Some variation of "wget -r" ought to work. Be sure to only run it against your own site; others may take offense at you scraping their site.

(wget is a versatile download utility provided by GNU)
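One possible variation, sketched in Python so the command can be assembled and inspected before anything is downloaded. The site URL and output directory are placeholders, and the choice of flags (--mirror, --convert-links, --page-requisites, --adjust-extension, --no-parent, --wait) is an assumption about what an offline-browsable copy needs; nobody in this thread has tested this exact combination against Drupal.

```python
import subprocess


def build_wget_mirror_command(site_url, output_dir, wait_seconds=1):
    """Return the wget argument list for an offline-browsable mirror.

    site_url and output_dir are placeholders supplied by the caller.
    """
    return [
        "wget",
        "--mirror",            # recursive download with timestamping
        "--convert-links",     # rewrite links so they work from local files
        "--page-requisites",   # also fetch CSS, images and scripts
        "--adjust-extension",  # save HTML pages with an .html suffix
        "--no-parent",         # never climb above the starting URL
        f"--wait={wait_seconds}",            # pause between requests
        f"--directory-prefix={output_dir}",  # where the copy is written
        site_url,
    ]


def run_mirror(site_url, output_dir):
    """Run wget (must be installed and on PATH); raises if wget fails."""
    subprocess.run(build_wget_mirror_command(site_url, output_dir), check=True)
```

Calling build_wget_mirror_command("http://localhost/", "mirror") only returns the list, so it is safe to print and check before handing it to run_mirror.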

Testing now (against my own)

signal9 commented 15 March 2005 at 04:39

Testing now against my own local test site to see what kind of output I get...

I've been away from computer work too long; I forget about the basic tools. =) wget never even occurred to me.

Update: 'wget -r' works quite well. Kind of disturbing how big my site became in the few months I was testing it, though...

---------
Govorite po-russkij? Want to?
http://www.sobrania.net/russian/

XML based mirror?

radiobuzzer commented 26 April 2006 at 22:33

How about creating a drupal site on another server, which will keep track of your articles based on XML feeds created by your main server?
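As a rough illustration of the feed-reading half of that idea, here is a minimal Python sketch that pulls the items out of an RSS 2.0 document. It assumes the main server exposes a plain RSS 2.0 feed without XML namespaces; how the second site would store or display the items is left open.

```python
import xml.etree.ElementTree as ET


def parse_rss_items(rss_xml):
    """Return a list of dicts (title, link, description) from RSS 2.0 text."""
    root = ET.fromstring(rss_xml)
    items = []
    for item in root.iter("item"):
        items.append({
            "title": item.findtext("title", default=""),
            "link": item.findtext("link", default=""),
            "description": item.findtext("description", default=""),
        })
    return items
```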

"offline"

earlax commented 12 March 2007 at 08:58

That's a good idea, but it's not offline, which is one of my primary concerns. Can I run a static version on another box if I get slashdotted? Can I burn the whole site to a CD for local browsing, etc.?

Tatu Ylonen, SSH 1.2.12 README: "Beware that the most effective
way for someone to decrypt your data may be with a rubber hose."

I think you should use boost

sinasalek commented 9 February 2008 at 13:13

I think you should use the Boost module.

sina.salek.ws
Feel freedom with open source softwares

Successful capture of site with form-based authentication

nathan573 commented 22 August 2008 at 19:33

My challenge was to capture the contents of a site that required a login to access the content. I had admin-level control over the site, so access was not a concern, but I needed to archive a static "snapshot" of the site every once in a while for legal purposes.

I had been able to do this with a Drupal 4.7 site with HTTrack using the form-based "Catch URL" function, where you get ready to log in to the site but then point your browser's proxy address to the program. You submit the login and the program captures the login request, using it to emulate a valid login request when spidering the site. I was not able to get this to work with the Drupal 5.10 version of my site, so I pursued other freeware (Windows) avenues. I found no other freeware spider programs that did form-based authentication, so I thought I was out of options until I discovered the IP Login module.

I installed it, assigned my IP to my specially created and permissioned "archiver" user and then was able to use HTTrack. Since HTTrack is coming from the same IP, Drupal lets it in as my "archiver" user without needing to actually log in - no more form-based authentication headaches. The only downside to this is temporarily opening up the site to my one IP - which IS a shared IP since I'm using DSL at home. For a private, unpublished site such as mine, this risk seems negligible, and on top of that the "archiver" user is pretty much "read only".

I'm not sure what was wrong with HTTrack and Drupal, but I think that a new cookie kept being created, overwriting the logged-in "good" cookie. Logs indicated the login destination module was properly routing the user to the right page (implying that they were actually getting logged in), but that page would come back as 403 Forbidden.

Cheers

There is a module called

sinasalek commented 23 August 2008 at 08:07

There is a module called Boost, which you might already be familiar with. I guess you could tweak this module to work for authenticated content as well. If you do, you can easily copy the cached content, which consists of complete HTML pages, to your mirror servers using rsync or a similar command.
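A minimal Python sketch of the copy step, assuming the cached pages live in a local directory and the mirror is reachable over SSH. The cache path, host name and destination path are hypothetical placeholders; check where your cache is actually written before using anything like this.

```python
import subprocess


def build_rsync_command(cache_dir, mirror_host, mirror_path):
    """Return the rsync argument list that pushes cached pages to a mirror.

    All three arguments are placeholders chosen by the caller.
    """
    return [
        "rsync",
        "-az",       # archive mode, compress data in transit
        "--delete",  # remove files on the mirror that no longer exist locally
        cache_dir.rstrip("/") + "/",   # trailing slash: copy directory contents
        f"{mirror_host}:{mirror_path}",
    ]


def push_cache(cache_dir, mirror_host, mirror_path):
    """Run rsync (must be installed); raises if the transfer fails."""
    subprocess.run(build_rsync_command(cache_dir, mirror_host, mirror_path), check=True)
```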

Regards

sina.salek.ws, CIO & Lead developer
Feel freedom with open source softwares

Feature Request?

raintonr commented 2 October 2008 at 01:26

I'm very interested in this feature for the exact same reason. Should we get a feature request going?

Knowing which views to generate/crawl, how paging is handled and the rest is quite tricky for either of these suggestions, but I see two plans for this:

Plan 'A' 'wget --mirror':

This is discussed here: http://drupal.org/node/27882

Basically you have to turn off any dynamic blocks the mirror process might see as that will make your site look ugly and also confuse people. You would also have to embed a message in the page somewhere that says, "This is a read-only copy".

One could put something in settings.php to look at HTTP_USER_AGENT and set a global flag. Block on/off could be triggered by PHP inspection of this global. Not nice, but it would work.

When running wget, use the --user-agent option to give your 'magic' agent to have the read-only message and block configuration you require.
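The flag logic might look roughly like the sketch below. It is written in Python purely to illustrate the idea; the real check would live in PHP in settings.php. The agent string "static-mirror-bot", the block names and the "read_only_notice" block are all made up for the example.

```python
# Hypothetical "magic" agent string; pass the same value to wget via --user-agent.
MIRROR_USER_AGENT = "static-mirror-bot"


def is_mirror_request(environ):
    """True when the request carries the magic user agent."""
    return environ.get("HTTP_USER_AGENT", "") == MIRROR_USER_AGENT


def visible_blocks(blocks, environ):
    """Pick the blocks to render.

    blocks is a list of (name, is_dynamic) pairs. Normal visitors get every
    block; the mirror run gets only static blocks plus a read-only notice.
    """
    if not is_mirror_request(environ):
        return [name for name, _ in blocks]
    static = [name for name, is_dynamic in blocks if not is_dynamic]
    return static + ["read_only_notice"]
```

The matching wget call would then include --user-agent=static-mirror-bot, so only the mirror run sees the stripped-down pages.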

I'm not massively into this plan as the mirror would take a long time and lots of CPU to build. Refresh would not be much better.

Plan 'B' 'drupal module':

This is much better.

Create a module that could run as part of cron and (re)generate pages created/changed since last run and write static files. It should only access content the anonymous user can see. It should add the "This is a read-only copy" message.

One could configure the module so it knew which blocks were dynamic and which static, so it would disable all dynamic blocks as part of the run.
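To make Plan 'B' a little more concrete, here is a minimal Python sketch of the incremental export step. It is not a Drupal module: the page records (path, changed timestamp, rendered HTML, anonymous visibility) are a hypothetical stand-in for whatever the module would read from Drupal, and the notice markup is made up.

```python
from pathlib import Path

READ_ONLY_NOTICE = "<p>This is a read-only copy.</p>"


def export_changed_pages(pages, last_run, output_dir):
    """Write static files for pages changed since last_run.

    pages is an iterable of dicts with the keys "path", "changed" (timestamp),
    "html" and "anonymous_visible". Returns the list of files written.
    """
    out = Path(output_dir)
    written = []
    for page in pages:
        if not page["anonymous_visible"]:
            continue  # only export what the anonymous user can see
        if page["changed"] <= last_run:
            continue  # unchanged since the previous cron run
        target = out / page["path"].strip("/") / "index.html"
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(READ_ONLY_NOTICE + "\n" + page["html"], encoding="utf-8")
        written.append(target)
    return written
```

Each cron run would pass the timestamp of the previous run as last_run, so only new or edited pages are rewritten.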
