Search by Page does not return all results
sphopkins - October 15, 2009 - 14:45
| Project: | Search by Page |
| Version: | 6.x-1.4 |
| Component: | Main Search by Page module |
| Category: | support request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Description
Using Search by page to restrict search for certain users to one content type. Mostly works fine.
However, I notice that at times Search by Page does not return a result that the main Drupal Search does find for the same query (even when the main search is restricted by type in advanced search).
Any ideas where to start?

#1
Search can only find things (pages for Search by Page, or nodes for default Search) that it has indexed.
It is quite possible for a given node to be indexed within the node/content search on the main search page, but not yet indexed within Search by Page (Search by Page renders the page differently, so it cannot use the main node/content search index). This would cause that node's page to come up in the main/advanced search results, but not in Search by Page results.
You can check the main Search settings page to see how much of the site is indexed. If it reads less than 100%, that means that not all of your site is indexed. If not all your site is indexed, then it is possible certain nodes are indexed within the main node search index, but not yet indexed within Search by Page. Running cron a few times will improve indexing -- each time you run cron, a certain amount of content gets indexed (or reindexed, if it has changed).
Hopefully this will clear up your issue...
#2
Thanks. I am at 100% indexed, and I have seen the problem still.
I am only searching for the title. My application based on Drupal is basically an electronic health record for a group of breast cancer patients. The title is their chart number, and I am only allowing people to search for the one type of CCK content (= Patient).
I have a cron job running (I think every 15 minutes) to ensure that indexing is done. I am at several hundred thousand nodes and I am up to date in indexing.
I will keep watching to see if the problem persists.
#3
Well, that is indeed curious. Are you using any kind of permissions module that could be restricting access to certain nodes?
The other question is: is the node title actually displayed on the page? Search by Page indexes what is displayed on the page in your theme, so anything that isn't displayed isn't indexed.
#4
Another thing to check would be to search for a different piece of text that is definitely displayed on the page you aren't seeing in search results, and see if that page shows up in the search results.
#5
I do have a permissions module in place, but the issue pre-dates that.
As well, as the administrator of the CMS I have full access to everything and it is happening to me ;-) Actually just confirmed a minute ago with a search.
The node title is displayed on the page - I am currently using the ZEN theme without any customizations.
I will try searching for other things and see what happens.
Thanks
#6
If you'd like another pair of eyes to take a quick look, send me a private message with the site URL and an example of a page/search term that is coming up in core Search but not Search by Page.
#7
Unfortunately it is on an intranet and also has confidential patient information in it ;-) But thanks for the offer!
#8
Ah...
If you want to do some more investigation, and have PHPMyAdmin or a similar database tool and know how to use it (or some other means to run queries), here are some things you can look at in the Drupal MySQL database:
a) First, verify that Search by Page realizes it needs to index the node in question. Here's a query that will tell you this, assuming the node ID number is 12345:
SELECT * FROM sbp_path WHERE page_path="node/12345"
If you have a database prefix, you would need to replace sbp_path with drupal_sbp_path (or whatever your table prefix is) in that query.
If that query returns a result, then at least we know Search by Page knows it should be indexing that page. You can check the last index time field in the result, to verify that it has been indexed (any non-zero value).
If there is no result, then we've narrowed down the problem, which is to say that the Search by Page Nodes module for some reason decided not to tell the main Search by Page module to index that page.
b) If that query returns a result, you can note the "pid" field for that record/row, and then see what the search index looks like for that page (i.e. see what words are in the search index for that page). If the PID is, for instance, 38, your query would look like this:
SELECT * FROM search_index WHERE sid="38" and type="search_by_page"
Again, if you have a database prefix, you would need to replace search_index with drupal_search_index (or whatever your table prefix is).
If there is no result for that query, then the page didn't get indexed correctly. If there is a result, you can browse through and see if the words you expect have been indexed.
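Checks (a) and (b) can also be combined into one query, so that each PID does not have to be looked up by hand. The following is only a minimal sketch, not part of Search by Page: it uses Python's built-in sqlite3 module with an in-memory database as a stand-in for the Drupal MySQL database, and the sample rows are made up. The table names, pid, page_path, sid and type come from the queries above; the column names last_index_time and word are assumptions.

```python
import sqlite3

# In-memory SQLite stands in for the Drupal MySQL database (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE sbp_path (pid INTEGER, page_path TEXT, last_index_time INTEGER);
    CREATE TABLE search_index (word TEXT, sid INTEGER, type TEXT);
    -- Made-up sample data: page 38 has index words, page 39 has none.
    INSERT INTO sbp_path VALUES (38, 'node/12345', 1255000000);
    INSERT INTO sbp_path VALUES (39, 'node/12346', 1255000000);
    INSERT INTO search_index VALUES ('chart', 38, 'search_by_page');
""")

# Pages that Search by Page has tried to index (non-zero index time)
# but that have no words at all in the search index.
BLANK_PAGES_SQL = """
    SELECT p.pid, p.page_path
    FROM sbp_path p
    LEFT JOIN search_index i
        ON i.sid = p.pid AND i.type = 'search_by_page'
    WHERE p.last_index_time <> 0 AND i.sid IS NULL
    ORDER BY p.pid
"""

blank_pages = conn.execute(BLANK_PAGES_SQL).fetchall()
for pid, page_path in blank_pages:
    print(pid, page_path)
```

The SELECT statement uses only standard SQL, so it should also run in PHPMyAdmin once any table prefix is added to the two table names.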
#9
Edited...
Ah... I have an idea. Search by Page is rendering each page before indexing it. What happens if you are not logged in at all, and you try to view one of your Patient pages? Do you see anything on the screen?
If you are getting a "403" error, then that is probably the problem -- Search by Page is not going to work for you, because it will only index what the generic user can see on the page. At least, I think that is the case -- it is requesting the page during the cron run, and I am pretty sure that is the same as what an anonymous user will see.
Then again, if it is just some of your pages that are having this trouble, then maybe that is not the problem... unless you indexed some of them before you put some permissions in place?
Actually, this is incorrect. I have a test site where certain users have permission to see certain pages. All the pages are indexed correctly, and when searching, the permissions are honored. So this should not be the problem.
#10
I will check all of this out in the next little bit. I have phpMyAdmin and Sequel Pro to do queries directly on the database.
#11
OK.
For (a)
Checked the database and every node that should be indexed has an entry in that table. Verified a node that does not come up in a search by page and does come up in a regular search.
For (b)
That returns an empty dataset. There are 2.5M rows in that table so I will pass on browsing ;-)
Any idea of where to look or what to do next?
#12
If the page that showed up in (a) shows a last indexing time that is not zero in that row, that means Search by Page tried to index it....
My best guess right now is that when cron/search/search by page tries to render the page in order to index it for searching, the permissions are such that it's not getting any content, so it's not able to index the content.
What content permissions module / setup / permissions are you using? I can try to replicate a similar setup on my test box and see if I can get the same problem. ...
The odd thing is that on my test box, I have a setup with some content set to no public access, and I can see it if I search when I'm logged in, but not when I'm not logged in... On that box, I am using "Content Access", and that particular content type is set up for access on a per-node basis. I have some nodes set to anonymous and some set to no anonymous viewing. And this is working fine... I even blew out the search index completely and ran cron a few times to reindex, and it is still fine.
#13
Yeah it shows up with a non-zero index time. Actually I did not find any that did not have an attempt at indexing by SBP.
It is funny that I had this problem both before and after introducing Content Access for Drupal. So I do not think that this is an access problem but you never know.
For reference I currently do not have per node access set up. I am only using per node type access control.
In my install, I have ~9300 "Patient" type CCK nodes. I only seem to have a problem with a percentage of the nodes. All nodes are indexed in the main Drupal search - if I run into a problem with SBP not returning a hit I know that I can go to the main search and get the hit (along with other node types that match - all nodes in my Drupal install use the patient's chart number as the title for each node that relates to a patient... hence the utility of SBP).
And as well I have blown out the search index and reindexed every node just in case that was the problem... took a while to get all of my nodes reindexed but I am now there ;-)
#14
Well, it's very odd.
I really do want to get to the bottom of this, so I've put in a couple of changes to Search by Page. If you're willing to keep investigating...
a) Unzip this file and replace your existing search_by_page.module file with this one.
b) Make sure you have the "Database logging" core module enabled.
c) Visit the Performance screen under Site Configuration, scroll down to the bottom, and click the "clear cache" button.
d) Visit the Search by Page Settings screen under Site Configuration, and click the "Click to reset blank pages so they will reindex at next cron run" link. You should see a message at the top of the screen saying "Blank pages have been reset to index at next cron run (###)", where ### is the number of blank pages it found in the search index.
e) Run cron (there's now a link from the Search by Page settings page to the Status Report page, where you can click on Run Cron Manually). If there were a lot of pages showing up in (d) for reindexing, you might need to run it a couple of times.
Now you can check (a) and (b) in comment #8 above to see if the pages were indexed this time. And if they ended up in the same state as before [where they were in (a) but not (b)], then Search by Page should have added an entry to the Recent Log Entries report in the Reports section.
Let me know how this turns out...
#15
#16
I will look into this probably on monday. Thanks for all the work. Should not be a problem testing.
#17
Had a chance to quickly do this...
Here is the result: "Blank pages have been reset to index at next cron run (723)"
I will look in on (a) & (b) early next week as I have a bunch of pages to index by cron...
#18
Well, that's encouraging -- the query to find un-indexed pages found the un-indexed pages on your site (I had tested it on my site by removing a page from the index via PHPMyAdmin). I'll be holding my breath... er, patiently waiting to see what happens with the reindex attempt, and/or the log entries if the reindexing fails again. (We should be able to get some information from the log entries in that case -- you may need to click in the log to see the full message.)
Thanks for your patience! Hopefully we'll soon figure out what the problem really is. Hard to have a solution otherwise. :)
#19
So far all I am getting for the pages that needed to be indexed is the following in the log file:
"Oct 19 10:00:13 localhost drupal: http://127.0.0.1|1255960813|search_by_page|127.0.0.1|http://127.0.0.1/cron.php||0||content for PID (13157), path (node/249313) was not indexed (3)"
So far no new SBP has been indexed.
#20
And as SBP tries to index one of the pages I know is there and it did not return a result before, it again fails and has information in the database:
pid last_index_time page_path from_module modid
13121 1255961712 node/249277 sbp_nodes 249277
#21
That is what we needed to know, actually. That (3) is an error code, MENU_ACCESS_DENIED. Which tells us that for those pages, during cron, when Search by Page is trying to render page (your site URL)/node/249313, it is getting an Access Denied error.
So it looks like it is NOT getting Access Denied when it tries to render most of your patient records, but it is in the case of node/249313, node/249277, and those other 700 or so nodes. What is different in the permissions for those pages?
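To see at a glance which pages are failing and why, the log messages can be parsed. Below is a rough sketch in Python, assuming the message format shown in #19. The helper name parse_sbp_message is made up for this example, and of the status codes in the table only 3 (MENU_ACCESS_DENIED) is confirmed in this thread; the other entries reflect the Drupal 6 menu constants as I understand them and should be treated as assumptions.

```python
import re

# Drupal 6 menu status codes. Only 3 (MENU_ACCESS_DENIED) is confirmed
# in this thread; the other entries are assumptions.
MENU_STATUS = {
    1: "MENU_FOUND",
    2: "MENU_NOT_FOUND",
    3: "MENU_ACCESS_DENIED",
    4: "MENU_SITE_OFFLINE",
}

# Matches the Search by Page message at the end of a log line.
MESSAGE_RE = re.compile(
    r"content for PID \((\d+)\), path \(([^)]+)\) was not indexed \((\d+)\)"
)

def parse_sbp_message(line):
    """Return (pid, path, status name), or None if the line does not match."""
    match = MESSAGE_RE.search(line)
    if match is None:
        return None
    pid, path, code = match.groups()
    return int(pid), path, MENU_STATUS.get(int(code), "UNKNOWN")

sample = ("Oct 19 10:00:13 localhost drupal: http://127.0.0.1|1255960813|"
          "search_by_page|127.0.0.1|http://127.0.0.1/cron.php||0||"
          "content for PID (13157), path (node/249313) was not indexed (3)")
print(parse_sbp_message(sample))
```

Running this over the whole log and counting the status names would show whether every failing page is an access-denied case or whether some fail for another reason.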
#22
There should be nothing different about the permissions for those pages. I will investigate on my end, but the Content Access module was added a week or two ago, and those nodes (and all the others) were in place long before that.
#23
Hmmm.... You might check the publication status -- make sure all of them are set to "published" (you don't have a workflow module that could be interfering, I'm assuming?). You might also look in the database at the records in the node, node revisions, and node access tables for one of the broken nodes as well as one of the working nodes, and see if there is something different.
Also, if you are using a custom module to define the content type, there could be some access checks being done in the content type module.
That's really all I can think of, but given that log message, the problem is definitely that cron is getting access denied to just those 723 nodes. So there must be something different about them. ???
#24
They should be all published - the information on all of these nodes is imported using Node_Import and that is how we ensure all are available for viewing. Just checked and that one is published.
No custom modules - all are Drupal-submitted modules from Drupal.org
I do not see any difference in records between two records - one broken and one working.
Node Access table has this:
Broken:
nid gid realm grant_view grant_update grant_delete
249277 6 content_access_rid 1 0 0
249277 3 content_access_rid 1 0 0
249277 4 content_access_rid 1 0 0
Working:
nid gid realm grant_view grant_update grant_delete
3089 3 content_access_rid 1 0 0
3089 4 content_access_rid 1 0 0
3089 6 content_access_rid 1 0 0
Node table:
Broken:
nid vid type language title uid status created changed comment promote moderate sticky tnid translate
249277 252522 patient en CHARTNUMBER 1 1 1254165013 1255634689 0 0 0 0 0 0
Working:
nid vid type language title uid status created changed comment promote moderate sticky tnid translate
3089 92800 patient en CHARTNUMBER 1 1 1247766137 1251222387 0 0 0 0 0 0
Node Revisions:
Broken:
nid vid uid title body teaser log timestamp format
249277 250682 1 CHARTNUMBER <p>Imported with node_import.</p> 1254165013 0
249277 252522 1 CHARTNUMBER 1255634689 0
Working:
nid vid uid title body teaser log timestamp format
3089 3169 1 CHARTNUMBER Imported with node_import. 1247766137 0
3089 19619 4 CHARTNUMBER 1248973319 0
3089 92800 1 CHARTNUMBER 1251222387 0
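Because the node access rows come back in a different order for the two nodes, it is easier to compare them as sets than by eye. A small sketch in Python, using the node_access rows pasted above; the function name grants_for is made up for this example.

```python
# node_access rows as pasted above:
# (nid, gid, realm, grant_view, grant_update, grant_delete)
rows = [
    (249277, 6, "content_access_rid", 1, 0, 0),
    (249277, 3, "content_access_rid", 1, 0, 0),
    (249277, 4, "content_access_rid", 1, 0, 0),
    (3089, 3, "content_access_rid", 1, 0, 0),
    (3089, 4, "content_access_rid", 1, 0, 0),
    (3089, 6, "content_access_rid", 1, 0, 0),
]

def grants_for(nid, rows):
    """Collect a node's grants as a set, ignoring the nid and the row order."""
    return {row[1:] for row in rows if row[0] == nid}

broken = grants_for(249277, rows)
working = grants_for(3089, rows)
print("Only on broken node:", sorted(broken - working))
print("Only on working node:", sorted(working - broken))
```

Both differences come out empty here, which matches the observation that the node access records are the same for the broken and the working node.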
#25
Hmmm... When I said "custom", I should have probably said "custom or contributed" Drupal modules.
Anyway, let me think this through. Here are some (technical) notes about what happens when cron is trying to index one of your patient nodes [skip this, except lines/sections starting with *** unless you are a PHP/Drupal programming geek or are interested]:
a) Search by Page calls http://api.drupal.org/api/function/menu_execute_active_handler/6 to render the page. As evidenced by the log shown in #19 above, this function is returning MENU_ACCESS_DENIED (3) on those problem nodes.
b) menu_execute_active_handler() only returns MENU_ACCESS_DENIED when the call to http://api.drupal.org/api/function/menu_get_item/6 returns a menu router item with its access element set to FALSE.
c) menu_get_item() finds the best matching menu router item from the menu router table. I have been assuming that this would be the "node/%" menu router item defined in the node module. But maybe somehow this is getting bypassed? OK, we can check this:
*** Can you run this query on your database (you may need to change the table name from menu_router to prefix_menu_router if you have a database prefix):
SELECT * FROM menu_router WHERE path LIKE "node/%"
You should see about 30 results, depending on how many content types you have defined:
- There should be exactly one with path = node/% -- it should have access_callback = node_access and the load_functions field should look like a:1:{i:1;s:9:"node_load";} and the access arguments field should look like a:2:{i:0;s:4:"view";i:1;i:1;} and the to_arg_functions field should be blank.
- Then there should be some other entries in the results like node/%/delete, node/%/edit, node/add/page, etc.
- There should NOT be any specific entries in the results with paths such as node/249277
*** End of *** section
d) Assuming the check in (c) was correct, then the node/% path is the one getting matched and standard node loading/access is in effect. So, in http://api.drupal.org/api/function/_menu_translate/6 the node will get loaded and the access checks will be done. This will call both http://api.drupal.org/api/function/node_load/6 and http://api.drupal.org/api/function/node_access/6
e) If node_load() returns FALSE, then the access element will be set to false. But this shouldn't happen, given that the database looks the same in node_revisions and node for working and non-working nodes. I guess it is remotely possible that the access flag could be set to FALSE during something else in the node load (all the module_invoke stuff that node_load does to add information to the node). That would have to be from some contrib module that is doing something specific for some subset of your nodes. But this seems unlikely.
f) Access checks in node_access()... If there is no custom module associated with the 'patient' content type, then this will use node_content_access() to check the access rights, and then if that doesn't say anything, it will check the node access table (which we have already verified is the same for working and non-working nodes). node_content_access() never returns anything for 'view' access checks, so unless there is a custom module defined, this cannot be the problem.
*** Another thing to check:
Can you find the record for the 'patient' content type in your "node_type" table, and look at the "module" field? If it is something other than "node", then that module may have an access check that is coming into play.
*** End of this check
Well... There are two things to check, and I think both of them are long shots... Other than this, I think I am out of ideas, and I don't know what to tell you...
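The two checks above (the node/% row in menu_router and the module field in node_type) could be scripted roughly as follows. This is only a sketch in Python: the expected values are the ones listed in (c), while the dictionary keys and the function name check_node_router are assumptions made for this example, so check them against your actual column names.

```python
# Expected values for the node/% router row, as listed in (c).
EXPECTED_NODE_ROUTER = {
    "access_callback": "node_access",
    "load_functions": 'a:1:{i:1;s:9:"node_load";}',
    "access_arguments": 'a:2:{i:0;s:4:"view";i:1;i:1;}',
    "to_arg_functions": "",
}

def check_node_router(router_rows, node_type_module):
    """Return a list of human-readable problems; empty means all checks pass."""
    problems = []
    generic = [r for r in router_rows if r["path"] == "node/%"]
    if len(generic) != 1:
        problems.append(f"expected exactly one node/% row, found {len(generic)}")
    for row in generic:
        for field, expected in EXPECTED_NODE_ROUTER.items():
            actual = row.get(field, "")
            if actual != expected:
                problems.append(f"node/% has unexpected {field}: {actual!r}")
    # Paths such as node/249277 would bypass the generic node/% handling.
    for row in router_rows:
        parts = row["path"].split("/")
        if len(parts) >= 2 and parts[0] == "node" and parts[1].isdigit():
            problems.append(f"node-specific router path: {row['path']}")
    # A content type owned by another module may add its own access checks.
    if node_type_module != "node":
        problems.append(f"content type is owned by module {node_type_module!r}")
    return problems

# Made-up sample rows, shaped like the menu_router query results.
sample_rows = [
    dict(EXPECTED_NODE_ROUTER, path="node/%"),
    {"path": "node/%/edit"},
    {"path": "node/add/page"},
]
print(check_node_router(sample_rows, "node"))
```

An empty list means neither long shot applies, and the cause has to be looked for elsewhere.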
*** One more thing you could try...
You could try disabling custom or contrib (non-core) modules that might have some bearing on workflow (marking nodes draft/published/etc.), access control, the patient content type, or even other content types. If you want to try this, I would follow these steps:
a) Pick a single module that could potentially be causing trouble.
b) Disable this module.
c) Click the link described in comment #14 above to tell cron you want to reindex those empty nodes.
d) Run cron once.
e) Check the log and see if the nodes that were reindexed in that cron run still have (3) at the end of their log messages indicating access denied.
If that doesn't change anything, then re-enable that module and try a different one.
*** End of this one other thing to try.
#26
OK I am not that much of a PHP/Drupal Geek but that is interesting to know ;-)
Here is what the table for the first query returns for the first few results:
node/% a:1:{i:1;s:9:"node_load";} node_access a:2:{i:0;s:4:"view";i:1;i:1;} node_page_view a:1:{i:0;i:1;} 2 2 node/% node_page_title a:1:{i:0;i:1;} 4 0
node/%/access a:1:{i:1;s:9:"node_load";} content_access_node_page_access a:1:{i:0;i:1;} drupal_get_form a:2:{i:0;s:19:"content_access_page";i:1;i:1;} 5 3 node/% node/% Access control t 128 3 sites/all/modules/content_access/content_access.ad...
So that matches what you were hoping in the first part.
As for the Node_Type for "Patient" it is that - node.
I will try the disable and enable of modules. Who knows what it is now :)
I really appreciate your help. This is above and beyond and it is much appreciated!
#27
On a positive note, only 703 nodes now need to be re-indexed, as opposed to 723 ;-)
This was after disabling Content Complete module - one that was recently updated on my site.
#28
That's interesting... but odd. I took a look at Content Complete and I didn't really see anything in there that would cause access troubles.
Anyway... I guess I'll set this back to a "support request" rather than bug, and I still don't have any other ideas.
#29
Well, it was not Content Complete. Disabling the access control module now. But I really do not have any others that involve workflow... other than Views Bulk Operations.
#30
Not Access Control either.
EDIT: Rebuilding Content Access Permissions. Maybe that will help.
#31
Hah! I should have thought of that. Good idea!
#32
Unfortunately it did not work.
I will keep plugging away and see what comes up.
#33
Just as one more note: I added some new tests to the Search by Page / SBP Nodes module today to verify that "private" content can be indexed, and appears in search results for sufficiently privileged users.
I still don't know why those nodes of yours are getting "access denied".
Interestingly, I had to put in a line in the test to rebuild the content access permissions in order for the test to pass. I think this was mostly a SimpleTest framework artifact though.
#34
I just put out version 6.x-1.5 of Search by Page. The module is identical to the one I attached above in the zip file (with the extra debug logging added).
#35
Still not working... cannot figure it out.... and the log file is filling up with the errors due to this....
Type search_by_page
Date Friday, November 6, 2009 - 11:30
User Anonymous
Location http://127.0.0.1/cron.php
Referrer
Message content for PID (1506), path (node/1934) was not indexed (3)
Severity error
Hostname 127.0.0.1
Operations
#36
Not sure what to tell you either. The (3) at the end of the message means that during search indexing, node/1934 had an access permissions error (i.e. access denied). And we cannot seem to figure out why that should be...
#37
Not a complaint on my side for sure. Still trying to figure where the access problem is coming from....
Is that an error from the main Search function or just the SBP module? Because it seems the main search module is indexing the node.
#38
That error is coming from the SBP module. See #25 above for an explanation of what it is doing and where the access checks are being performed.
The main search module doesn't work in the same way -- instead of calling menu_execute_active_handler() to render the theme's version of each page that is indexed (which is what SBP does), the main search module indexes the default rendering (theme-independent) from the node module. So it is not using the same permission checking that SBP is.
#39
Not forgetting about this...
Any chance that SBP and Apache Solr Search Integration interfere with each other?
Removed Apache Solr Search and framework (and added Lucene API as well), and the log file does not seem to be having issues... I will have to see if there are improvements in SBP.
#40
Oh, you're using Apache Solr Search? You never mentioned that before. I have no idea whether SBP and Solr work well together. SBP was meant to be used with the usual Drupal search module, not Solr. Solr has its own search index, so all of the queries we did above don't apply if you are using Solr, and the error messages in the log have to do with core Search putting things into its index, not Solr putting things into its index.
I also don't understand why you would use Lucene and Solr? My understanding is that both are integrations with 3rd-party search engines with their own code and index databases, so I would think you'd only use one or the other?
#41
I had the Apache Solr module installed but was not using it. Lucene I added today..... and removed Solr.
#42
Well, I am not sure whether SBP will work well with Lucene or Solr. As far as I know, they have their own indexing mechanisms. SBP is designed to work with the core Search's indexing mechanism, and I just don't know much about Solr or Lucene to know whether they'd work with SBP. Sorry.
#43
Core search is still there - Solr & Lucene make their own indexes. So SBP is not relying on them - so far SBP is working the same as before, but I am not certain if the unindexed nodes are being indexed just yet.
#44
I had high hopes... unfortunately it has not fixed things. So it is not Solr or Lucene that affects things.
However the one thing that I will check on is if Lucene can find those problem nodes...
#45
It's quite possible that Lucene can -- didn't we decide that it was a problem in node access inside of Search by Page's attempt to render the page?