For those of you busting your brains trying to figure out why indexing does not go past n%...

If you are using any kind of access control (simple_access, taxonomy_access, etc.), indexing run from an anonymous cron.php request will not be able to see anything that the anonymous user cannot see; such content will NOT be indexed.
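To illustrate what I think is going on, here is a toy sketch in Python. It is not Drupal's actual code: the `Node` class, the role names, and the helper functions are all made up for illustration. It only assumes that the indexer can index what the requesting user is allowed to view.

```python
from dataclasses import dataclass


# Hypothetical toy model; none of these names come from Drupal itself.
@dataclass(frozen=True)
class Node:
    nid: int
    allowed_roles: frozenset  # roles that are allowed to view this node


def can_view(node, user_roles):
    """A user can view a node if they hold at least one allowed role."""
    return bool(node.allowed_roles & set(user_roles))


def index_run(nodes, user_roles):
    """Return the ids of the nodes an indexer running with user_roles can see."""
    return [n.nid for n in nodes if can_view(n, user_roles)]


def percent_indexed(nodes, user_roles):
    """Share of all nodes (in whole percent) that the indexer can reach."""
    if not nodes:
        return 100
    return 100 * len(index_run(nodes, user_roles)) // len(nodes)


nodes = [
    Node(1, frozenset({"anonymous", "authenticated"})),
    Node(2, frozenset({"authenticated"})),  # restricted by an access module
    Node(3, frozenset({"authenticated"})),  # restricted by an access module
    Node(4, frozenset({"anonymous", "authenticated"})),
]

# cron.php fetched with no login: only the public nodes are reachable.
print(percent_indexed(nodes, {"anonymous"}))
# cron.php run while logged in as a user who can see everything.
print(percent_indexed(nodes, {"authenticated"}))
```

With half of the nodes restricted, the anonymous run can never get past 50%, no matter how often cron runs, which matches the "stuck at n%" symptom.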

I ran cron.php as the admin user and things started moving forward again. I then ran into this thread which confirmed my suspicion:

http://drupal.org/node/5380 (look at comment #17)

...and this dates back to 2004!

Perhaps I missed it in the docs, but as far as I know, and as everyone told me, as long as you set up cron.php (via cron) properly, indexing should take care of itself. Not so. If you have your site wide open, yes, it works; once you start tagging things down and assigning access by role, indexing breaks down. Wow, and I did ask quite a few times on this same topic, and everyone kept pointing to just running cron.php.

To those that know... why the heck is something like a PHP script that's accessible to anyone over the web (cron.php) used for such critical maintenance of Drupal sites? Yes, I could rename the script, but still...

OK, so now I'll continue painstakingly searching through Google (Drupal.org can't seem to search itself...) for a way to make cron.php run through some other mechanism, or as an authenticated user that has access to all the content I want to index.

Thought I'd share this for those who are going through, or will at some point go through, my current frustrations with Drupal's search.module. (And no, Drupal 4.7 is not an option for me at this point; I tried it, and it fixes some problems but introduces others.)

ZoneV

Comments

sepeck’s picture

Please contribute your specific use case to the issue thread in question. It looks like after a certain point it was missed/not continued.

I believe Drupal.org uses .htaccess to prevent cron from being run from anywhere but specific IP addresses, if that helps. (No, I do not know what's in Drupal.org's .htaccess file, though.)

-Steven Peck
---------
Test site, always start with a test site.
Drupal Best Practices Guide -|- Black Mountain


ZoneV’s picture

I'll look at the .htaccess options and continue in the older thread.

ZoneV

styro’s picture

"Perhaps I missed it in the docs, but as far as I know, and as everyone told me, as long as you set up cron.php (via cron) properly, indexing should take care of itself. Not so. If you have your site wide open, yes, it works; once you start tagging things down and assigning access by role, indexing breaks down. Wow, and I did ask quite a few times on this same topic, and everyone kept pointing to just running cron.php."

Strange, before we opened our site up to anonymous users indexing was working ok (I wouldn't say brilliantly though) for us with node_privacy_byrole. Maybe it is more of an issue with other modules.

--
Anton
New to Drupal? | Forum posting tips | Troubleshooting FAQ
Example Knowledge Base built using Drupal

ZoneV’s picture

We're using Taxonomy_Access and/or Simple_Access. On all of them we're tagging our content so that only authenticated users can view content. As soon as we do that, indexing breaks.

ZoneV

grcm’s picture

For external, on-the-internet Drupal sites we just use Google to index pages for us; fast, free and easy.

The gsitemap module is excellent.

-- Version Control your Drupal web site with The File High Club's Free Trial!

ZoneV’s picture

...so I can't use Google for indexing these sites I'm working on. Another thing is that search.module integrates with Swish-e via swish.module, which creates a tab on the search page for searching inside uploaded files in common formats such as PDF, DOC, XLS, PPT, and others.

ZoneV

JJacobsson’s picture

I'm experiencing something similar... whenever I edit an AcidFree node, the picture disappears. It does not reappear until I run cron.php manually while logged in as an administrator.

Using poormanscron or wget from a crontab does not work.

ZoneV’s picture

http://drupal.org/node/5380 (see comment #17).

Perhaps not the most secure or elegant solution, but it ensures everything gets indexed.

ZoneV