I am using Search by Page to restrict search for certain users to one content type. It mostly works fine.

However, I notice that Search by Page sometimes fails to return a result that the main Drupal Search does find for the same search term (even when the main search is restricted by type in advanced search).

Any ideas where to start ?

Comment	File	Size	Author
#14	search_by_page.zip	8.05 KB	jhodgdon

Comments

jhodgdon’s picture

Category: bug » support

Search can only find things (pages for Search by Page, or nodes for default Search) that it has indexed.

It is quite possible for a given node to be indexed within the node/content search on the main search page, but not yet indexed within Search by Page (Search by Page renders the page differently, so it cannot use the main node/content search index). This would cause that node's page to come up in the main/advanced search results, but not in Search by Page results.

You can check the main Search settings page to see how much of the site is indexed. If it reads less than 100%, that means that not all of your site is indexed. If not all your site is indexed, then it is possible certain nodes are indexed within the main node search index, but not yet indexed within Search by Page. Running cron a few times will improve indexing -- each time you run cron, a certain amount of content gets indexed (or reindexed, if it has changed).

Hopefully this will clear up your issue...

sphopkins’s picture

Thanks. I am at 100% indexed, and I have seen the problem still.

I am only searching for the title. My application based on Drupal is basically an electronic health record for a group of breast cancer patients. The title is their chart number, and I am only allowing people to search for the one type of CCK content (= Patient).

I have a cron job running (I think every 15 minutes) to ensure that indexing is done. I am at several hundred thousand nodes and I am up to date in indexing.

I will keep watching to see if the problem persists.

jhodgdon’s picture

Well, that is indeed curious. Are you using any kind of permissions module that could be restricting access to certain nodes?

The other question is: is the node title actually displayed on the page? Search by Page indexes what is displayed on the page in your theme, so anything that isn't displayed isn't indexed.

jhodgdon’s picture

Another thing to check would be to search for a different piece of text that is definitely displayed on the page you aren't seeing in search results, and see if that page shows up in the search results.

sphopkins’s picture

I do have a permissions module in place, but the issue pre-dates that.

As well, as the administrator of the CMS I have full access to everything and it is happening to me ;-) Actually just confirmed a minute ago with a search.

The node title is displayed on the page - I am currently using the ZEN theme without any customizations.

I will try searching for other things and see what happens.

Thanks

jhodgdon’s picture

If you'd like another pair of eyes to take a quick look, send me a private message with the site URL and an example of a page/search term that is coming up in core Search but not Search by Page.

sphopkins’s picture

Unfortunately it is on an intranet and also has confidential patient information in it ;-) But thanks for the offer!

jhodgdon’s picture

Ah...

If you want to do some more investigation, and have PHPMyAdmin or a similar database tool and know how to use it (or some other means to run queries), here are some things you can look at in the Drupal MySQL database:

a) First, verify that Search by Page realizes it needs to index the node in question. Here's a query that will tell you this, assuming the node ID number is 12345:

SELECT * FROM sbp_path WHERE page_path="node/12345"

If you have a database prefix, you would need to replace sbp_path with drupal_sbp_path (or whatever your table prefix is) in that query.

If that query returns a result, then at least we know Search by Page knows it should be indexing that page. You can check the last index time field in the result, to verify that it has been indexed (any non-zero value).

If there is no result, then we've narrowed down the problem, which is to say that the Search by Page Nodes module for some reason decided not to tell the main Search by Page module to index that page.

b) If that query returns a result, you can note the "pid" field for that record/row, and then see what the search index looks like for that page (i.e. see what words are in the search index for that page). If the PID is, for instance, 38, your query would look like this:

  SELECT * FROM search_index WHERE sid="38" and type="search_by_page"

Again, if you have a database prefix, you would need to replace search_index with drupal_search_index (or whatever your table prefix is).

If there is no result for that query, then the page didn't get indexed correctly. If there is a result, you can browse through and see if the words you expect have been indexed.
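
As a rough sketch, checks (a) and (b) can be chained into one small script. This is only an illustration: it uses Python's sqlite3 with an in-memory stand-in for the Drupal MySQL database, the function name check_page and the status strings are made up for this example, and only the table and column names (sbp_path, search_index, pid, last_index_time, page_path, sid, type) come from the queries above.

```python
import sqlite3


def check_page(conn, node_id, prefix=""):
    """Classify one node as 'not_tracked', 'never_indexed', 'blank_index' or 'indexed'.

    `prefix` is the optional Drupal table prefix (e.g. "drupal_").
    """
    # Step (a): does Search by Page know about this path at all?
    row = conn.execute(
        f"SELECT pid, last_index_time FROM {prefix}sbp_path WHERE page_path = ?",
        (f"node/{node_id}",),
    ).fetchone()
    if row is None:
        return "not_tracked"
    pid, last_index_time = row
    if last_index_time == 0:
        return "never_indexed"
    # Step (b): are there any words in the search index for that pid?
    count = conn.execute(
        f"SELECT COUNT(*) FROM {prefix}search_index "
        "WHERE sid = ? AND type = 'search_by_page'",
        (pid,),
    ).fetchone()[0]
    return "indexed" if count else "blank_index"


# Tiny in-memory stand-in so the sketch runs on its own (hypothetical data).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sbp_path (pid INTEGER, last_index_time INTEGER, page_path TEXT);
CREATE TABLE search_index (word TEXT, sid INTEGER, type TEXT);
INSERT INTO sbp_path VALUES (38, 1255961712, 'node/12345');
INSERT INTO sbp_path VALUES (39, 1255961712, 'node/12346');
INSERT INTO search_index VALUES ('chartnumber', 38, 'search_by_page');
""")
print(check_page(conn, 12345))
print(check_page(conn, 12346))
```

A result of 'blank_index' corresponds to the case where query (a) returns a row with a non-zero index time but query (b) returns nothing.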

jhodgdon’s picture

Edited...
Ah... I have an idea. Search by Page is rendering each page before indexing it. What happens if you are not logged in at all, and you try to view one of your Patient pages? Do you see anything on the screen?

If you are getting a "403" error, then that is probably the problem -- Search by Page is not going to work for you, because it will only index what the generic user can see on the page. At least, I think that is the case -- it is requesting the page during the cron run, and I am pretty sure that is the same as what an anonymous user will see.

Then again, if it is just some of your pages that are having this trouble, then maybe that is not the problem... unless you indexed some of them before you put some permissions in place?

Actually, this is incorrect. I have a test site where certain users have permission to see certain pages. All the pages are indexed correctly, and when searching, the permissions are honored. So this should not be the problem.

sphopkins’s picture

I will check all of this out in the next little bit. I have phpMyAdmin and Sequel Pro to do queries directly on the database.

sphopkins’s picture

OK.

For (a)

Checked the database and every node that should be indexed has an entry in that table. Verified a node that does not come up in a search by page and does come up in a regular search.

For (b)

That returns an empty dataset. There are 2.5M rows in that table so I will pass on browsing ;-)

Any idea of where or what to do next ?

jhodgdon’s picture

If the page that showed up in (a) shows a last indexing time that is not zero in that row, that means Search by Page tried to index it....

My best guess right now is that when cron/search/search by page tries to render the page in order to index it for searching, the permissions are such that it's not getting any content, so it's not able to index the content.

What content permissions module / setup / permissions are you using? I can try to replicate a similar setup on my test box and see if I can get the same problem. ...

The odd thing is that on my test box, I have a setup with some content set to no public access, and I can see it if I search when I'm logged in, but not when I'm not logged in... On that box, I am using "Content Access", and that particular content type is set up for access on a per-node basis. I have some nodes set to anonymous and some set to no anonymous viewing. And this is working fine... I even blew out the search index completely and ran cron a few times to reindex, and it is still fine.

sphopkins’s picture

Yeah it shows up with a non-zero index time. Actually I did not find any that did not have an attempt at indexing by SBP.

It is funny that I had this problem both before and after introducing Content Access for Drupal. So I do not think that this is an access problem but you never know.

For reference I currently do not have per node access set up. I am only using per node type access control.

In my install, I have ~9300 "Patient" type CCK nodes. I only seem to have a problem with a percentage of the nodes. All nodes are indexed in the main Drupal search - if I run into a problem with SBP not returning a hit I know that I can go to the main search and get the hit (along with other node types that match - all nodes in my Drupal install use the patient's chart number as the title for each node that relates to a patient... hence the utility of SBP).

And as well I have blown out the search index and reindexed every node just in case that was the problem... took a while to get all of my nodes reindexed but I am now there ;-)

jhodgdon’s picture

Attachment: search_by_page.zip (new, 8.05 KB)

Well, it's very odd.

I really do want to get to the bottom of this, so I've put in a couple of changes to Search by Page. If you're willing to keep investigating...

a) Unzip this file and replace your existing search_by_page.module file with this one.
b) Make sure you have the "Database logging" core module enabled.
c) Visit the Performance screen under Site Configuration, scroll down to the bottom, and click the "clear cache" button.
d) Visit the Search by Page Settings screen under Site Configuration, and click the "Click to reset blank pages so they will reindex at next cron run" link. You should see a message at the top of the screen saying "Blank pages have been reset to index at next cron run (###)", where ### is the number of blank pages it found in the search index.
e) Run cron (there's now a link from the Search by Page settings page to the Status Report page, where you can click on Run Cron Manually). If there were a lot of pages showing up in (d) for reindexing, you might need to run it a couple of times.

Now you can check (a) and (b) in comment #8 above to see if the pages were indexed this time. And if they ended up in the same state as before [where they were in (a) but not (b)], then Search by Page should have added an entry to the Recent Log Entries report in the Reports section.

Let me know how this turns out...

jhodgdon’s picture

Category: support » bug
sphopkins’s picture

I will look into this probably on monday. Thanks for all the work. Should not be a problem testing.

sphopkins’s picture

Had a chance to quickly do this...

Here is the result: "Blank pages have been reset to index at next cron run (723)"

I will look in on (a) & (b) early next week as I have a bunch of pages to index by cron...

jhodgdon’s picture

Well, that's encouraging -- the query to find un-indexed pages found the un-indexed pages on your site (I had tested it on my site by removing a page from the index via PHPMyAdmin). I'll be holding my breath... er, patiently waiting to see what happens with the reindex attempt, and/or the log entries if the reindexing fails again. (We should be able to get some information from the log entries in that case -- you may need to click in the log to see the full message.)
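
For anyone who wants to list the un-indexed pages by hand, one way to express "in sbp_path but with no words in search_index" is a LEFT JOIN that keeps only the unmatched rows. This is a sketch and not the query the module itself runs (the thread does not show that); it uses sqlite3 with made-up sample rows, and find_blank_pages is a hypothetical helper name.

```python
import sqlite3

# Pages that Search by Page tracks but that have no rows in the search index.
BLANK_PAGES_SQL = """
SELECT p.pid, p.page_path
FROM sbp_path AS p
LEFT JOIN search_index AS i
  ON i.sid = p.pid AND i.type = 'search_by_page'
WHERE i.sid IS NULL
ORDER BY p.pid
"""


def find_blank_pages(conn):
    """Return a list of (pid, page_path) tuples for pages with an empty index."""
    return conn.execute(BLANK_PAGES_SQL).fetchall()


# In-memory stand-in data (hypothetical) so the sketch is runnable.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sbp_path (pid INTEGER, last_index_time INTEGER, page_path TEXT);
CREATE TABLE search_index (word TEXT, sid INTEGER, type TEXT);
INSERT INTO sbp_path VALUES (1, 1255961712, 'node/10');
INSERT INTO sbp_path VALUES (2, 1255961712, 'node/11');
INSERT INTO search_index VALUES ('chartnumber', 1, 'search_by_page');
INSERT INTO search_index VALUES ('chartnumber', 2, 'node');
""")
print(find_blank_pages(conn))
```

Note that the row with type 'node' for sid 2 belongs to the core node index, so it does not count as a Search by Page entry.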

Thanks for your patience! Hopefully we'll soon figure out what the problem really is. Hard to have a solution otherwise. :)

sphopkins’s picture

So far all I am getting for the pages that needed to be indexed is the following in the log file:

"Oct 19 10:00:13 localhost drupal: http://127.0.0.1|1255960813|search_by_page|127.0.0.1|http://127.0.0.1/cron.php||0||content for PID (13157), path (node/249313) was not indexed (3)"

So far no new SBP has been indexed.

sphopkins’s picture

And as SBP tries to index one of the pages I know is there and it did not return a result before, it again fails and has information in the database:

pid last_index_time page_path from_module modid
13121 1255961712 node/249277 sbp_nodes 249277
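
The last_index_time value is a Unix timestamp, so it can be turned into a readable date to see when the indexing attempt happened. A small sketch (index_time is a made-up helper name; 0 is treated as "never indexed", as described in step (a) earlier in the thread):

```python
from datetime import datetime, timezone


def index_time(last_index_time):
    """Return None for 0 (never indexed), else the time as an aware UTC datetime."""
    if last_index_time == 0:
        return None
    return datetime.fromtimestamp(last_index_time, tz=timezone.utc)


print(index_time(1255961712))  # 2009-10-19 14:15:12+00:00
print(index_time(0))           # None
```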

jhodgdon’s picture

That is what we needed to know, actually. That (3) is an error code, MENU_ACCESS_DENIED. Which tells us that for those pages, during cron, when Search by Page is trying to render page (your site URL)/node/249313, it is getting an Access Denied error.

So it looks like it is NOT getting Access Denied when it tries to render most of your patient records, but it is in the case of node/249313, node/249277, and those other 700 or so nodes. What is different in the permissions for those pages?
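
To go through many of these log entries at once, the message can be picked apart with a regular expression. The sketch below assumes the message format shown in the log line quoted earlier; parse_failure is a hypothetical helper name, and the numeric codes are the Drupal 6 menu status constants.

```python
import re

# Drupal 6 menu status codes (includes/menu.inc).
MENU_STATUS = {
    1: "MENU_FOUND",
    2: "MENU_NOT_FOUND",
    3: "MENU_ACCESS_DENIED",
    4: "MENU_SITE_OFFLINE",
}

LOG_RE = re.compile(
    r"content for PID \((\d+)\), path \(([^)]+)\) was not indexed \((\d+)\)"
)


def parse_failure(message):
    """Extract pid, path and status code from a 'was not indexed' log message."""
    match = LOG_RE.search(message)
    if match is None:
        return None
    code = int(match.group(3))
    return {
        "pid": int(match.group(1)),
        "path": match.group(2),
        "code": code,
        "status": MENU_STATUS.get(code, "UNKNOWN"),
    }


line = "content for PID (13157), path (node/249313) was not indexed (3)"
print(parse_failure(line))
```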

sphopkins’s picture

There should be nothing different in the permissions for those pages. I will investigate on my end, but the Content Access module was added a week or two ago and those nodes (and every other one) were in place long before that.

jhodgdon’s picture

Hmmm.... You might check the publication status -- make sure all of them are set to "published" (you don't have a workflow module that could be interfering, I'm assuming?). You might also look in the database at the records in the node, node revisions, and node access tables for one of the broken nodes as well as one of the working nodes, and see if there is something different.

Also, if you are using a custom module to define the content type, there could be some access checks being done in the content type module.

That's really all I can think of, but given that log message, the problem is definitely that cron is getting access denied to just those 723 nodes. So there must be something different about them. ???

sphopkins’s picture

They should be all published - the information on all of these nodes is imported using Node_Import and that is how we ensure all are available for viewing. Just checked and that one is published.

No custom modules - all are Drupal-submitted modules from Drupal.org

I do not see any difference in records between two records - one broken and one working.

Node Access table has this:

Broken:

nid	gid	realm	grant_view	grant_update	grant_delete
249277	6	content_access_rid	1	0	0
249277	3	content_access_rid	1	0	0
249277	4	content_access_rid	1	0	0

Working:

nid	gid	realm	grant_view	grant_update	grant_delete
3089	3	content_access_rid	1	0	0
3089	4	content_access_rid	1	0	0
3089	6	content_access_rid	1	0	0

Node table:
Broken:

nid	vid	type	language	title	uid	status	created	changed	comment	promote	moderate	sticky	tnid	translate
249277	252522	patient	en	CHARTNUMBER	1	1	1254165013	1255634689	0	0	0	0	0	0

Working:

nid	vid	type	language	title	uid	status	created	changed	comment	promote	moderate	sticky	tnid	translate
3089	92800	patient	en	CHARTNUMBER	1	1	1247766137	1251222387	0	0	0	0	0	0

Node Revisions:
Broken:

nid	vid	uid	title	body	teaser	log	timestamp	format
249277	250682	1	CHARTNUMBER	 	 	<p>Imported with node_import.</p>	1254165013	0
249277	252522	1	CHARTNUMBER	 	 	 	1255634689	0

Working:

nid	vid	uid	title	body	teaser	log	timestamp	format
3089	3169	1	CHARTNUMBER	 	 	Imported with node_import.	1247766137	0
3089	19619	4	CHARTNUMBER	 	 	 	1248973319	0
3089	92800	1	CHARTNUMBER	 	 	 	1251222387	0
jhodgdon’s picture

Hmmm... When I said "custom", I should have probably said "custom or contributed" Drupal modules.

Anyway, let me think this through. Here are some (technical) notes about what happens when cron is trying to index one of your patient nodes [skip this, except lines/sections starting with *** unless you are a PHP/Drupal programming geek or are interested]:

a) Search by Page calls http://api.drupal.org/api/function/menu_execute_active_handler/6 to render the page. As evidenced by the log shown in #19 above, this function is returning MENU_ACCESS_DENIED (3) on those problem nodes.

b) menu_execute_active_handler() only returns MENU_ACCESS_DENIED when the call to http://api.drupal.org/api/function/menu_get_item/6 returns a menu router item with its access element set to FALSE.

c) menu_get_item() finds the best matching menu router item from the menu router table. I have been assuming that this would be the "node/%" menu router item defined in the node module. But maybe somehow this is getting bypassed? OK, we can check this:

*** Can you run this query on your database (you may need to change the table name from menu_router to prefix_menu_router if you have a database prefix):

SELECT * FROM menu_router WHERE path LIKE "node/%"

You should see about 30 results, depending on how many content types you have defined:
- There should be exactly one with path = node/% -- it should have access_callback = node_access and the load_functions field should look like a:1:{i:1;s:9:"node_load";} and the access arguments field should look like a:2:{i:0;s:4:"view";i:1;i:1;} and the to_arg_functions field should be blank.
- Then there should be some other entries in the results like node/%/delete, node/%/edit, node/add/page, etc.
- There should NOT be any specific entries in the results with paths such as node/249277

*** End of *** section
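
If the a:1:{...} values in those fields look cryptic: they are PHP-serialized arrays. As an illustration, here is a minimal decoder for just the subset that appears above (arrays, integers and ASCII strings). It is not a full unserialize() replacement, and php_unserialize is a name made up for this sketch.

```python
def php_unserialize(data):
    """Decode a PHP-serialized value limited to a:, i: and s: (ASCII only)."""
    value, pos = _parse(data, 0)
    if pos != len(data):
        raise ValueError("trailing data after serialized value")
    return value


def _parse(s, pos):
    kind = s[pos]
    if kind == "i":  # i:<int>;
        end = s.index(";", pos)
        return int(s[pos + 2:end]), end + 1
    if kind == "s":  # s:<length>:"<chars>";
        colon = s.index(":", pos + 2)
        length = int(s[pos + 2:colon])
        start = colon + 2  # skip the colon and the opening quote
        return s[start:start + length], start + length + 2
    if kind == "a":  # a:<count>:{<key><value>...}
        colon = s.index(":", pos + 2)
        count = int(s[pos + 2:colon])
        pos = colon + 2  # skip the colon and the opening brace
        result = {}
        for _ in range(count):
            key, pos = _parse(s, pos)
            result[key], pos = _parse(s, pos)
        return result, pos + 1  # skip the closing brace
    raise ValueError(f"unsupported type marker {kind!r}")


print(php_unserialize('a:1:{i:1;s:9:"node_load";}'))
print(php_unserialize('a:2:{i:0;s:4:"view";i:1;i:1;}'))
```

Decoded, the load_functions value maps path position 1 to node_load, and the access arguments are "view" followed by 1, where the 1 refers to the second path component (the % in node/%).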

d) Assuming the check in (c) was correct, then the node/% path is the one getting matched and standard node loading/access is in effect. So, in http://api.drupal.org/api/function/_menu_translate/6 the node will get loaded and the access checks will be done. This will call both http://api.drupal.org/api/function/node_load/6 and http://api.drupal.org/api/function/node_access/6

e) If node_load() returns FALSE, then the access element will be set to false. But this shouldn't happen, given that the database looks the same in node_revisions and node for working and non-working nodes. I guess it is remotely possible that the access flag could be set to FALSE during something else in the node load (all the module_invoke stuff that node_load does to add information to the node). That would have to be from some contrib module that is doing something specific for some subset of your nodes. But this seems unlikely.

f) Access checks in node_access()... If there is no custom module associated with the 'patient' content type, then this will use node_content_access() to check the access rights, and then if that doesn't say anything, it will check the node access table (which we have already verified is the same for working and non-working nodes). node_content_access() never returns anything for 'view' access checks, so unless there is a custom module defined, this cannot be the problem.

*** Another thing to check:
Can you find the record for the 'patient' content type in your "node_type" table, and look at the "module" field? If it is something other than "node", then that module may have an access check that is coming into play.
*** End of this check

Well... There are two things to check, and I think both of them are long shots... Other than this, I think I am out of ideas, and I don't know what to tell you...

*** One more thing you could try...

You could try disabling custom or contrib (non-core) modules that might have some bearing on workflow (marking nodes draft/published/etc.), access control, the patient content type, or even other content types. If you want to try this, I would follow these steps:

a) Pick a single module that could potentially be causing trouble.
b) Disable this module.
c) Click the link described in comment #14 above to tell cron you want to reindex those empty nodes.
d) Run cron once.
e) Check the log and see if the nodes that were reindexed in that cron run still have (3) at the end of their log messages indicating access denied.

If that doesn't change anything, then re-enable that module and try a different one.
*** End of this one other thing to try.

sphopkins’s picture

OK I am not that much of a PHP/Drupal Geek but that is interesting to know ;-)

Here is what the table for the first query returns for the first few results:

node/%	a:1:{i:1;s:9:"node_load";}	 	node_access	a:2:{i:0;s:4:"view";i:1;i:1;}	node_page_view	a:1:{i:0;i:1;}	2	2	 	node/%	 	node_page_title	a:1:{i:0;i:1;}	4	 	 	 	0	 
node/%/access	a:1:{i:1;s:9:"node_load";}	 	content_access_node_page_access	a:1:{i:0;i:1;}	drupal_get_form	a:2:{i:0;s:19:"content_access_page";i:1;i:1;}	5	3	node/%	node/%	Access control	t	 	128	 	 	 	3	sites/all/modules/content_access/content_access.ad...

So that matches what you were hoping in the first part.

As for the Node_Type for "Patient" it is that - node.

I will try the disable and enable of modules. Who knows what it is now :)

I really appreciate your help. This is above and beyond and it is much appreciated!

sphopkins’s picture

On a positive note, only 703 nodes need to be re-indexed as opposed to 723 ;-)

This was after disabling Content Complete module - one that was recently updated on my site.

jhodgdon’s picture

Category: bug » support

That's interesting... but odd. I took a look at Content Complete and I didn't really see anything in there that would cause access troubles.

Anyway... I guess I'll set this back to a "support request" rather than bug, and I still don't have any other ideas.

sphopkins’s picture

Well, it was not Content Complete. Disabling the access control module now. But I really do not have any others that involve workflow... other than Views Bulk Operations.

sphopkins’s picture

Not Access Control either.

EDIT: Rebuilding Content Access Permissions. Maybe that will help.

jhodgdon’s picture

Hah! I should have thought of that. Good idea!

sphopkins’s picture

Unfortunately it did not work.

I will keep plugging away and see what comes up.

jhodgdon’s picture

Just as one more note: I added some new tests to the Search by Page / SBP Nodes module today to verify that "private" content can be indexed, and appears in search results for sufficiently privileged users.

I still don't know why those nodes of yours are getting "access denied".

Interestingly, I had to put in a line in the test to rebuild the content access permissions in order for the test to pass. I think this was mostly a SimpleTest framework artifact though.

jhodgdon’s picture

I just put out version 6.x-1.5 of Search by Page. The module is identical to the one I attached above in the zip file (with the extra debug logging added).

sphopkins’s picture

Still not working... cannot figure it out.... and the log file is filling up with the errors due to this....

Type	search_by_page
Date	Friday, November 6, 2009 - 11:30
User	Anonymous
Location	http://127.0.0.1/cron.php
Referrer	
Message	content for PID (1506), path (node/1934) was not indexed (3)
Severity	error
Hostname	127.0.0.1
Operations	
jhodgdon’s picture

Not sure what to tell you either. The (3) at the end of the message means that during search indexing, node/1934 had an access permissions error (i.e. access denied). And we cannot seem to figure out why that should be...

sphopkins’s picture

Not a complaint on my side for sure. Still trying to figure where the access problem is coming from....

Is that an error from the main Search function or just the SBP module ? Because it seems the main search module is indexing the node.

jhodgdon’s picture

That error is coming from the SBP module. See #25 above for an explanation of what it is doing and where the access checks are being performed.

The main search module doesn't work in the same way -- instead of calling menu_execute_active_handler() to render the theme's version of each page that is indexed (which is what SBP does), the main search module indexes the default rendering (theme-independent) from the node module. So it is not using the same permission checking that SBP is.

sphopkins’s picture

Not forgetting about this...

Any chance that SBP and Apache Solr Search Integration interfere with each other?

Removed Apache Solr Search and framework (added Lucene API as well), and the log file does not seem to be having issues... I will have to see if there are improvements in SBP.

jhodgdon’s picture

Oh, you're using Apache Solr Search? You never mentioned that before. I have no idea whether SBP and Solr work well together. SBP was meant to be used with the usual Drupal search module, not Solr. Solr has its own search index, so all of the queries we did above don't apply if you are using Solr, and the error messages in the log have to do with core Search putting things into its index, not Solr putting things into its index.

I also don't understand why you would use Lucene and Solr? My understanding is that both are integrations with 3rd-party search engines with their own code and index databases, so I would think you'd only use one or the other?

sphopkins’s picture

I had the Apache Solr module installed but was not using it. Lucene I added today..... and removed Solr.

jhodgdon’s picture

Well, I am not sure whether SBP will work well with Lucene or Solr. As far as I know, they have their own indexing mechanisms. SBP is designed to work with the core Search's indexing mechanism, and I just don't know much about Solr or Lucene to know whether they'd work with SBP. Sorry.

sphopkins’s picture

Core search is still there - Solr & Lucene make their own indexes. So SBP is not relying on them - so far SBP is working the same as before, but I am not certain if the unindexed nodes are being indexed just yet.

sphopkins’s picture

I had high hopes... unfortunately it has not fixed things. So it is not Solr or Lucene that affects things.

However the one thing that I will check on is if Lucene can find those problem nodes...

jhodgdon’s picture

It's quite possible that Lucene can -- didn't we decide that it was a problem in node access inside of Search by Page's attempt to render the page?

sphopkins’s picture

Trying to follow up here and see if this is fixable.

As a refresher, there was a problem getting all nodes indexed by SBP. There were many that were not being indexed for an unknown reason. I still have Content Access 6.x-1.2 providing the rights to load certain pages. I have still been getting the errors.

As some testing I allowed anonymous users to load the content type that I want indexed. And lo and behold, the errors stopped appearing in the log files. Not 100% sure things are all being indexed right now.... checking that out.

jhodgdon’s picture

Well.

I still don't understand why some of your content would be indexed and some not, given that you say that all of your content has the same permissions. Until I understand why that is, it will be difficult to fix...

sphopkins’s picture

I agree that this is perplexing.

The interesting thing is that I only enabled anonymous users. All other permissions stayed the same.

I have reset the blank pages in the SBP index, cleared the cached data and I will await the results of cron....

jhodgdon’s picture

I still need to understand what the difference is between the content that is and is not indexed. I don't think this is going to tell us anything, although it might fix your site. But periodically the content that was previously indexed will be reindexed, so once you set your permissions back to the correct values, the site will eventually revert to having this permissions problem, and the content won't be indexed again, I expect.

sphopkins’s picture

That is the weird thing... as far as I can see there is no difference. All the same node type. All created with the Node_import module.

I expect part of my testing to show me if there is any "reversion" when I reset the settings. I am stumped like you.

jhodgdon’s picture

Hah! I may have found an issue. I realized that in the automated tests I wrote for this module (as well as when I was trying to reproduce your problem manually), I (or the testing framework) was logged in as a privileged user when running cron. So I took that out of the automated tests, and I experienced some test failures that seem like they might be related to your problem.

I will investigate further and report back.
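
To illustrate why the user running cron could matter, here is a deliberately simplified model of a grant lookup against the node_access rows posted earlier for the broken node. It assumes that, for the content_access_rid realm, the gid column holds a role ID and that role ID 1 is the anonymous role in Drupal 6. It ignores everything else node_access() does (for example the "administer nodes" bypass), so treat it as a sketch only. It also does not explain why other nodes with the same grants were indexed.

```python
# Rows copied from the "Broken" node_access listing in this thread:
# (nid, gid, realm, grant_view)
NODE_ACCESS = [
    (249277, 6, "content_access_rid", 1),
    (249277, 3, "content_access_rid", 1),
    (249277, 4, "content_access_rid", 1),
]

ANONYMOUS_RID = 1  # assumption: Drupal 6 role ID of the anonymous role


def can_view(nid, role_ids, rows=NODE_ACCESS):
    """True if any of the given roles has a view grant on the node (simplified)."""
    return any(
        row_nid == nid
        and realm == "content_access_rid"
        and gid in role_ids
        and grant_view == 1
        for row_nid, gid, realm, grant_view in rows
    )


print(can_view(249277, {ANONYMOUS_RID}))  # False
print(can_view(249277, {2, 3}))           # True
```

Under this model, a cron run that executes as an anonymous user gets no view grant on the node, while the same run executed by a user holding role 3 does.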

sphopkins’s picture

Woo Hoo !

I really really really really really really really really really appreciate your dedication to looking at this.

jhodgdon’s picture

Many of us are working with all of our volunteer/spare time for Drupal development, trying to get Drupal 7 ready for alpha release, so it will be a couple of days before I can get back to this -- apologies in advance! But I haven't forgotten.

sphopkins’s picture

That is why I appreciate your efforts so much on this. If I was in Seattle I can guarantee I would be contracting you to do some work for me !

jhodgdon’s picture

I can work remotely... Actually, only about 1/2 to 1/3 of my clients, at any given time, are local to Seattle. :)

sphopkins’s picture

I am unfortunately on an intranet and I have confidential patient information so remote is hard ;-)

Though I may look at some theming and other ideas I see on your site.

sphopkins’s picture

OK I just verified that the nodes that were not indexed before are now indexing after allowing anonymous access to content. I will have to rebuild permissions and remove the anonymous access eventually but there is something going on there.

jhodgdon’s picture

Thanks for the information.

jhodgdon’s picture

Just to let you know, I'm working on fixing this in Search by Page now, and should have something for you to test in the next few days.

sphopkins’s picture

Excellent and thanks. In terms of availability, I am away Jan 22-29 so no rush in that timeline... I will be sipping margaritas on a beach somewhere hopefully warmer than Canada ;-)

I am eager to test something.

Thanks

jhodgdon’s picture

Status: Active » Needs review

I have just committed changes to the development version of Search by Page, for these issues:
#662282: Support for multilingual sites
#605458: Search by Page does not return all results
#492878: Multiple search types

Pertinent to this particular issue are access permission features:
- You can set a role to use when indexing nodes, users, etc.
- Also note that there is a new "search environments" feature, so the settings pages look a little different (see above issue)

I think it's working fine, and all of my automated tests also pass.

If anyone who's watching this issue wants to test, you can get the development version from CVS, or wait about 24 hours until it's updated on http://drupal.org/project/search_by_page (version 6.x-1.x-dev, but make sure its date/time is after the date/time of this comment here!).

You will need to run the update.php script after updating to this version of the module, and also visit the Search module settings page and tell it to redo the search index.

sphopkins’s picture

Will be downloading the dev version and testing in the next bit. Thanks for the work.

sphopkins’s picture

One error on running update.php

Table 'breastdb.sbpp_path' doesn't exist query: UPDATE sbpp_path SET languages='1' in /var/www/html/sites/all/modules/search_by_page/search_by_page.install on line 179.

Is this critical, can I fix it and can I try working with it now ?

Thanks

jhodgdon’s picture

You should be OK. If you are using Search by Page Paths, you might want to edit the paths you have created and make sure they have language(s) defined.

Thanks for letting me know about that error. I'll fix it.

jhodgdon’s picture

Ah, I see the problem, which is that the Search by Page update function assumed SBP Paths was turned on, and it shouldn't have. Thanks again for pointing that out, and I'll be curious to see your feedback on the other changes.

jhodgdon’s picture

Category: support » bug
jhodgdon’s picture

Any word on whether the development version fixes your issues?

jhodgdon’s picture

Status: Needs review » Fixed

I'm tentatively marking this issue "fixed", and it's out in version 6.x-1.8 (or should be shortly).

Please reopen if you are still having problems.

sphopkins’s picture

I am still having issues I guess, and I am not sure what to do.

I am getting these errors in the log:

Type search_by_page
Date Friday, February 12, 2010 - 11:33
User Anonymous
Location http://127.0.0.1/cron.php
Referrer
Message Content not rendered (access denied) - PID (32192), path (node/250435), realpath (node/250435), language (en)
Severity error
Hostname 127.0.0.1
Operations

The node type to be indexed is being indexed as an administrator. The administrator has access to that node type / content type via permissions and access control.

Not sure what I am doing wrong.

jhodgdon’s picture

Are you using the released 6.x-1.8 version now?

sphopkins’s picture

I have been using 6.x-1.x-dev (2010-Feb-01). I will update to 1.8 today.

jhodgdon’s picture

Status: Fixed » Needs work

Well, there were a few changes between Feb 1 and now, so I'll be interested to see if that changes anything...

One question: Are you running cron while logged in, from the Status Reports page, or from a cron job? Because that could affect permissions (although it should not with the new version of SBP). Are in-person cron jobs failing to index anything, and automatic cron jobs indexing everything, or some subset, or vice versa? Do you get the same errors in the log no matter how you run cron?

So I still don't fundamentally understand why some of the nodes of this content type are getting the access denied error and some aren't. What's the difference between the indexed and non-indexed nodes? Somewhere back up in the comment stream you had indicated there was no difference between these nodes that you were aware of, and that you weren't using any per-node content permissions, and presumably they were all ....

Hmmm....

So is there any difference in how these nodes were created? E.g. were some imported via a script or module, and some have later been edited from the user interface?

Or anything at all you can think of to distinguish the ones that are vs. are not indexed?

jhodgdon’s picture

Ah... I just fixed something that the Drupal 7 tests uncovered that might be an issue in Drupal 6 as well. Let me try something.

EDIT: Actually, no, that wouldn't cause those errors in the log. Forget that. Sorry.

sphopkins’s picture

I am running the cron.php from cron on the FC11 server I have. I will have to try a manual cron from the status page.

In the SBP tables, things look like this for unindexed nodes:

pid    last_index_time  page_path    from_module  modid   language  environment  role
32192  0                node/250435  sbp_nodes    250435  en        1            1
18708  0                node/5327    sbp_nodes    5327    en        1            3
18707  0                node/5321    sbp_nodes    5321    en        1            6
18706  0                node/5326    sbp_nodes    5326    en        1            3
18705  0                node/5320    sbp_nodes    5320    en        1            6
18704  0                node/5325    sbp_nodes    5325    en        1            3
18703  0                node/5319    sbp_nodes    5319    en        1            6
18702  0                node/5324    sbp_nodes    5324    en        1            3
18701  0                node/5318    sbp_nodes    5318    en        1            6
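
The rows above can be summarized per role with a short script. This is only a rough sketch in Python, not part of Search by Page: it assumes the rows are pasted in as whitespace-separated text, uses the column names from the header above, and treats last_index_time = 0 as "not indexed yet", which is how the rows are described here rather than something confirmed from the module code.

```python
from collections import Counter

# Column names copied from the header of the pasted table.
COLUMNS = ["pid", "last_index_time", "page_path", "from_module",
           "modid", "language", "environment", "role"]

# Rows copied from the pasted table (whitespace-separated).
ROWS = """\
32192 0 node/250435 sbp_nodes 250435 en 1 1
18708 0 node/5327 sbp_nodes 5327 en 1 3
18707 0 node/5321 sbp_nodes 5321 en 1 6
18706 0 node/5326 sbp_nodes 5326 en 1 3
18705 0 node/5320 sbp_nodes 5320 en 1 6
18704 0 node/5325 sbp_nodes 5325 en 1 3
18703 0 node/5319 sbp_nodes 5319 en 1 6
18702 0 node/5324 sbp_nodes 5324 en 1 3
18701 0 node/5318 sbp_nodes 5318 en 1 6
"""


def parse_rows(text):
    """Turn each non-empty line into a dict keyed by column name."""
    return [dict(zip(COLUMNS, line.split()))
            for line in text.splitlines() if line.strip()]


def unindexed_by_role(rows):
    """Count rows whose last_index_time is 0, grouped by role id.

    Assumption: 0 means the page has not been indexed yet.
    """
    return Counter(r["role"] for r in rows if r["last_index_time"] == "0")


rows = parse_rows(ROWS)
counts = unindexed_by_role(rows)
for role, n in sorted(counts.items()):
    print("role", role, "unindexed rows:", n)
```

For this sample the unindexed rows are spread over roles 1, 3 and 6, so the problem does not seem tied to a single indexing role.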

Basically all nodes are created via an automated script from Node_import; some have since been edited and some have not. All have very extensive view fields linked to them.

I will see what else I can dig up.

jhodgdon’s picture

What I'm trying to figure out is what is the difference in the history/values of the nodes that are indexed vs. the nodes that are unindexed. Not what is happening in the SBP tables, but what is happening in the node/cck/node_access/permissions tables; and also as a side-line or clue into the database tables, what is the history of the nodes.

For instance, if all of the imported nodes are not indexed, until they are edited in the UI, then perhaps the node importing script is not triggering the same kind of permissions setup as what you would get if you created/edited the nodes in the UI.

???

I could understand it if no nodes of this type on your system were being indexed, or if all nodes on your system of this type were being indexed, but if some are and some aren't, and we cannot figure out what is different between the nodes that are and aren't indexed (again, outside of the SBP tables), then I have nothing to go on.

sphopkins’s picture

Since the change I can say that there are fewer nodes indexed than before. But the chance that the module node_import is causing problems may have legs as I have run into problems with the "drupal-ness" of that node in the past.

jhodgdon’s picture

Yeah, if that module is, for instance, bypassing node_save() (the central Drupal function for saving nodes) and doing its own database update queries, it's quite possible it has screwed up your permissions.

jhodgdon’s picture

I made some changes to how SBP is dealing with roles during search indexing, which may help you. Can you try out version 6.x-1.9 (which I just released, should be available within 15 minutes) and see if it helps? You may need to rebuild your node access permissions (you can do this at (your URL)/admin/content/node-settings/rebuild).

sphopkins’s picture

Will do that tomorrow. I have lots of nodes so I will have to do it at the end of the day...

sphopkins’s picture

Have not rebuilt the content permissions but I am noticing the following in the logs:

Role 3 (administrator) could not be used to index PID (18264), path (node/5105)

jhodgdon’s picture

The new version is setting up dummy users for search indexing... Do you have user account creation blocked in some way on your site? It would be done at a low level, not via "users can create their own accounts", so the account settings on the usual account settings page would not block it...

So can you look back a bit in the logs and see if you can find a message that says something like

Unable to set up an indexing user for role 3 (administrator)

or

Created indexing user ## (sbp indexing administrator) for role 3 (administrator)

And do you see a blocked user in your Users page with the name "sbp indexing administrator"?

sphopkins’s picture

I do see the users have been added and the status is blocked. I will unblock them (did not think I set any restrictions on users but since I have created all of them myself it may be restricted).

sphopkins’s picture

Also noticing this error:

Duplicate entry 'sbp indexing administrator' for key 'name' query: INSERT INTO users (pass, name, mail, status, created) VALUES ('5fe37353c8ed541ed0707c2de8806237', 'sbp indexing administrator', 'ZsQUyA6a85@XAvUTDh2Bq.com', 0, 1266340513) in /var/www/html/modules/user/user.module on line 327.

jhodgdon’s picture

Huh. You don't have to unblock the dummy users...

That error in #83 makes sense to me, given the other errors you are seeing. The code tries to recreate the users if it cannot load them, and it's getting a dup error because the user name has already been used.

Ummm...

Did you run the update.php script after installing the new version of the module? If not, run it now, and then from your Users admin page, delete the sbp users and try running cron again. I bet that is the problem...
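
To illustrate the duplicate-entry failure described in this comment, here is a minimal sketch in Python with sqlite3. It is not the Search by Page code (which is PHP against MySQL); the table layout, the `load_or_create` helper and the "lost uid" are all hypothetical. It only shows the general pattern: if the stored user cannot be loaded, the code falls through to an insert, and a unique constraint on the name rejects it because the row already exists.

```python
import sqlite3


def make_db():
    """Create an in-memory users table with a unique name, as a stand-in."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE users ("
        "uid INTEGER PRIMARY KEY, name TEXT UNIQUE, status INTEGER)"
    )
    return conn


def load_or_create(conn, stored_uid, name):
    """Load the user by a stored uid; if that fails, insert a blocked user.

    Hypothetical helper: looking up by uid only means a lost or stale
    uid leads straight to an INSERT, even if the name is already taken.
    """
    row = conn.execute(
        "SELECT uid FROM users WHERE uid = ?", (stored_uid,)
    ).fetchone()
    if row is not None:
        return row[0]
    cur = conn.execute(
        "INSERT INTO users (name, status) VALUES (?, 0)", (name,)
    )
    return cur.lastrowid


conn = make_db()
name = "sbp indexing administrator"

# First run: no stored uid yet, so the user is created.
first_uid = load_or_create(conn, None, name)

# Later run: the stored uid is missing again (simulated here), so the
# load fails and the second INSERT hits the unique constraint on name.
try:
    load_or_create(conn, None, name)
    duplicate_error = None
except sqlite3.IntegrityError as err:
    duplicate_error = str(err)

print("first uid:", first_uid)
print("second attempt failed:", duplicate_error is not None)
```

Deleting the existing row before the second attempt, as suggested above for the sbp users, lets the insert succeed again in this sketch.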

sphopkins’s picture

I did run the update script but it may have been right when cron was about to run.... I will delete them and let cron do its job.

jhodgdon’s picture

OK. I can see I might need to make some updates to prevent this sort of thing.

EDIT: Filed a separate issue on this:
#716342: SBP users for indexing may have duplication problems

sphopkins’s picture

Still on this... made some changes and I will report back.

jhodgdon’s picture

Status: Needs work » Postponed (maintainer needs more info)

Any news?

jhodgdon’s picture

Any news?

jhodgdon’s picture

Category: bug » support

bump. Any news on this?

illepic’s picture

Just putting in my two cents here:

I too was having errors with Lucene indexing certain nodes on my site. We found out that our users were attaching cck images and including in the provided title and alt tags the abbreviation for inches as ". So we ended up with many attached title and alt tags like: Product Name 6"x6"x5".

Due to the positioning of the image on the page, this was preventing the rest of the body copy from being indexed by Lucene. Once we took out those quotation marks, Lucene indexed beautifully.

Just throwing it out there, don't know if it helps :)
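
For what it's worth, an alternative to removing the quotation marks is to escape them before they go into the attributes. A minimal sketch in Python (not Drupal/CCK code; the `image_tag` helper and the file name are made up for illustration), assuming the trouble came from unescaped double quotes ending the title and alt attributes early:

```python
import html


def image_tag(src, title):
    """Build an <img> tag with the title/alt text escaped for attributes."""
    safe_title = html.escape(title, quote=True)  # turns " into &quot;
    safe_src = html.escape(src, quote=True)
    return '<img src="{0}" alt="{1}" title="{1}" />'.format(
        safe_src, safe_title)


raw_title = 'Product Name 6"x6"x5"'
tag = image_tag("product.jpg", raw_title)  # hypothetical file name
print(tag)
```

With the quotes escaped, the only literal double quotes left in the tag are the ones delimiting the three attributes.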

jhodgdon’s picture

This issue has nothing to do with the Lucene module, and the problem was that certain nodes were not getting indexed at all, not that some content was being omitted from indexed nodes. So I don't think your note is related to this issue... but thanks for trying...

jhodgdon’s picture

Status: Postponed (maintainer needs more info) » Closed (fixed)

I've recently made some changes to Search by Page and how it decides to index content... and I haven't heard anything on this issue for several months. So at this point, I'm going to close this support request and hopefully it's been resolved.