Closed (fixed)
Project: Search by Page
Version: 6.x-1.4
Component: Main Search by Page module
Priority: Normal
Category: Support request
Assigned: Unassigned
Reporter:
Created: 15 Oct 2009 at 14:45 UTC
Updated: 11 Nov 2010 at 16:42 UTC
Comments
Comment #1
jhodgdon
Search can only find things (pages for Search by Page, or nodes for default Search) that it has indexed.
It is quite possible for a given node to be indexed within the node/content search on the main search page, but not yet indexed within Search by Page (Search by Page renders the page differently, so it cannot use the main node/content search index). This would cause that node's page to come up in the main/advanced search results, but not in Search by Page results.
You can check the main Search settings page to see how much of the site is indexed. If it reads less than 100%, that means that not all of your site is indexed. If not all your site is indexed, then it is possible certain nodes are indexed within the main node search index, but not yet indexed within Search by Page. Running cron a few times will improve indexing -- each time you run cron, a certain amount of content gets indexed (or reindexed, if it has changed).
Hopefully this will clear up your issue...
Comment #2
sphopkins commented
Thanks. I am at 100% indexed, and I have seen the problem still.
I am only searching for the title. My application based on Drupal is basically an electronic health record for a group of breast cancer patients. The title is their chart number, and I am only allowing people to search for the one type of CCK content (= Patient).
I have a cron job running (I think every 15 minutes) to ensure that indexing is done. I am at several hundred thousand nodes and I am up to date in indexing.
I will keep watching to see if the problem persists.
Comment #3
jhodgdon
Well, that is indeed curious. Are you using any kind of permissions module that could be restricting access to certain nodes?
The other question is: is the node title actually displayed on the page? Search by Page indexes what is displayed on the page in your theme, so anything that isn't displayed isn't indexed.
Comment #4
jhodgdon
Another thing to check would be to search for a different piece of text that is definitely displayed on the page you aren't seeing in search results, and see if that page shows up in the search results.
Comment #5
sphopkins commented
I do have a permissions module in place, but the issue pre-dates that.
As well, as the administrator of the CMS I have full access to everything and it is happening to me ;-) Actually just confirmed a minute ago with a search.
The node title is displayed on the page - I am currently using the ZEN theme without any customizations.
I will try searching for other things and see what happens.
Thanks
Comment #6
jhodgdon
If you'd like another pair of eyes to take a quick look, send me a private message with the site URL and an example of a page/search term that is coming up in core Search but not Search by Page.
Comment #7
sphopkins commented
Unfortunately it is on an intranet and also has confidential patient information in it ;-) But thanks for the offer!
Comment #8
jhodgdon
Ah...
If you want to do some more investigation, and have PHPMyAdmin or a similar database tool and know how to use it (or some other means to run queries), here are some things you can look at in the Drupal MySQL database:
a) First, verify that Search by Page realizes it needs to index the node in question. Here's a query that will tell you this, assuming the node ID number is 12345:
If you have a database prefix, you would need to replace sbp_path with drupal_sbp_path (or whatever your table prefix is) in that query.
If that query returns a result, then at least we know Search by Page knows it should be indexing that page. You can check the last index time field in the result, to verify that it has been indexed (any non-zero value).
If there is no result, then we've narrowed down the problem, which is to say that the Search by Page Nodes module for some reason decided not to tell the main Search by Page module to index that page.
b) If that query returns a result, you can note the "pid" field for that record/row, and then see what the search index looks like for that page (i.e. see what words are in the search index for that page). If the PID is, for instance, 38, your query would look like this:
Again, if you have a database prefix, you would need to replace search_index with drupal_search_index (or whatever your table prefix is).
If there is no result for that query, then the page didn't get indexed correctly. If there is a result, you can browse through and see if the words you expect have been indexed.
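For anyone following along, the two checks in (a) and (b) could be sketched roughly as below. This is an illustration under assumptions, not the original queries: the sbp_path column names (pid, last_index_time, page_path, from_module, modid) are the ones that appear in a row quoted later in this thread, the search_index columns (word, sid, type, score) are assumed to follow the Drupal 6 core schema, and the type value 'search_by_page' is a guess. An in-memory SQLite database stands in for the Drupal MySQL database so the sketch can be run anywhere; against a real site only the two SELECT statements would matter.

```python
import sqlite3

# Stand-in for the Drupal database. Table layouts and the sample rows are
# assumptions made for illustration only.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sbp_path (pid INTEGER PRIMARY KEY, last_index_time INTEGER,
                       page_path TEXT, from_module TEXT, modid INTEGER);
CREATE TABLE search_index (word TEXT, sid INTEGER, type TEXT, score REAL);
INSERT INTO sbp_path VALUES (38, 1255961712, 'node/12345', 'sbp_nodes', 12345);
INSERT INTO search_index VALUES ('chart', 38, 'search_by_page', 1.0);
""")

def sbp_path_row(conn, nid):
    """Check (a): does Search by Page know it should index this node?"""
    return conn.execute(
        "SELECT pid, last_index_time, page_path FROM sbp_path "
        "WHERE from_module = 'sbp_nodes' AND modid = ?", (nid,)).fetchone()

def indexed_words(conn, pid):
    """Check (b): which words are in the search index for this page?"""
    rows = conn.execute(
        "SELECT word FROM search_index "
        "WHERE type = 'search_by_page' AND sid = ? ORDER BY word", (pid,))
    return [word for (word,) in rows]

row = sbp_path_row(conn, 12345)
words = indexed_words(conn, row[0]) if row else []
print(row, words)
```

A row from the first function with a non-zero last_index_time, combined with an empty list from the second, corresponds to the "tried to index but got nothing" situation described above.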
Comment #9
jhodgdon
Edited...
Ah... I have an idea. Search by Page is rendering each page before indexing it. What happens if you are not logged in at all, and you try to view one of your Patient pages? Do you see anything on the screen?
If you are getting a "403" error, then that is probably the problem -- Search by Page is not going to work for you, because it will only index what the generic user can see on the page. At least, I think that is the case -- it is requesting the page during the cron run, and I am pretty sure that is the same as what an anonymous user will see.
Then again, if it is just some of your pages that are having this trouble, then maybe that is not the problem... unless you indexed some of them before you put some permissions in place?
Actually, this is incorrect. I have a test site where certain users have permission to see certain pages. All the pages are indexed correctly, and when searching, the permissions are honored. So this should not be the problem.
Comment #10
sphopkins commented
I will check all of this out in the next little bit. I have phpMyAdmin and Sequel Pro to do queries directly on the database.
Comment #11
sphopkins commented
OK.
For (a)
Checked the database and every node that should be indexed has an entry in that table. Verified a node that does not come up in a search by page and does come up in a regular search.
For (b)
That returns an empty dataset. There are 2.5M rows in that table so I will pass on browsing ;-)
Any idea of where or what to do next ?
Comment #12
jhodgdon
If the page that showed up in (a) shows a last indexing time that is not zero in that row, that means Search by Page tried to index it....
My best guess right now is that when cron/search/search by page tries to render the page in order to index it for searching, the permissions are such that it's not getting any content, so it's not able to index the content.
What content permissions module / setup / permissions are you using? I can try to replicate a similar setup on my test box and see if I can get the same problem. ...
The odd thing is that on my test box, I have a setup with some content set to no public access, and I can see it if I search when I'm logged in, but not when I'm not logged in... On that box, I am using "Content Access", and that particular content type is set up for access on a per-node basis. I have some nodes set to anonymous and some set to no anonymous viewing. And this is working fine... I even blew out the search index completely and ran cron a few times to reindex, and it is still fine.
Comment #13
sphopkins commented
Yeah it shows up with a non-zero index time. Actually I did not find any that did not have an attempt at indexing by SBP.
It is funny that I had this problem both before and after introducing Content Access for Drupal. So I do not think that this is an access problem but you never know.
For reference I currently do not have per node access set up. I am only using per node type access control.
In my install, I have ~9300 "Patient" type CCK nodes. I only seem to have a problem with a percentage of the nodes. All nodes are indexed in the main Drupal search - if I run into a problem with SBP not returning a hit I know that I can go to the main search and get the hit (along with other node types that match - all nodes in my Drupal install use the patient's chart number as the title for each node that relates to a patient... hence the utility of SBP).
And as well I have blown out the search index and reindexed every node just in case that was the problem... took a while to get all of my nodes reindexed but I am now there ;-)
Comment #14
jhodgdon
Well, it's very odd.
I really do want to get to the bottom of this, so I've put in a couple of changes to Search by Page. If you're willing to keep investigating...
a) Unzip this file and replace your existing search_by_page.module file with this one.
b) Make sure you have the "Database logging" core module enabled.
c) Visit the Performance screen under Site Configuration, scroll down to the bottom, and click the "clear cache" button.
d) Visit the Search by Page Settings screen under Site Configuration, and click the "Click to reset blank pages so they will reindex at next cron run" link. You should see a message at the top of the screen saying "Blank pages have been reset to index at next cron run (###)", where ### is the number of blank pages it found in the search index.
e) Run cron (there's now a link from the Search by Page settings page to the Status Report page, where you can click on Run Cron Manually). If there were a lot of pages showing up in (d) for reindexing, you might need to run it a couple of times.
Now you can check (a) and (b) in comment #8 above to see if the pages were indexed this time. And if they ended up in the same state as before [where they were in (a) but not (b)], then Search by Page should have added an entry to the Recent Log Entries report in the Reports section.
Let me know how this turns out...
Comment #15
jhodgdon
Comment #16
sphopkins commented
I will look into this probably on Monday. Thanks for all the work. Should not be a problem testing.
Comment #17
sphopkins commented
Had a chance to quickly do this...
Here is the result: "Blank pages have been reset to index at next cron run (723)"
I will look in on (a) & (b) early next week as I have a bunch of pages to index by cron...
Comment #18
jhodgdon
Well, that's encouraging -- the query to find un-indexed pages found the un-indexed pages on your site (I had tested it on my site by removing a page from the index via PHPMyAdmin). I'll be holding my breath... er, patiently waiting to see what happens with the reindex attempt, and/or the log entries if the reindexing fails again. (We should be able to get some information from the log entries in that case -- you may need to click in the log to see the full message.)
Thanks for your patience! Hopefully we'll soon figure out what the problem really is. Hard to have a solution otherwise. :)
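For reference, a query of the kind mentioned above (pages that Search by Page has attempted to index but that have no words in the search index) could be sketched as follows. This is not the module's actual code. It assumes the same table layouts as the checks in comment #8, assumes the search_index type value is 'search_by_page', and assumes that "resetting" a blank page simply means setting its last_index_time back to zero. SQLite is used as a stand-in so the example is runnable on its own.

```python
import sqlite3

# Stand-in database with three pages: one indexed, one blank, one never tried.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sbp_path (pid INTEGER PRIMARY KEY, last_index_time INTEGER,
                       page_path TEXT, from_module TEXT, modid INTEGER);
CREATE TABLE search_index (word TEXT, sid INTEGER, type TEXT, score REAL);
INSERT INTO sbp_path VALUES (1, 1255960000, 'node/10', 'sbp_nodes', 10);
INSERT INTO sbp_path VALUES (2, 1255960813, 'node/11', 'sbp_nodes', 11);
INSERT INTO sbp_path VALUES (3, 0,          'node/12', 'sbp_nodes', 12);
INSERT INTO search_index VALUES ('chart', 1, 'search_by_page', 1.0);
""")

BLANK_PAGES_SQL = """
SELECT p.pid, p.page_path
FROM sbp_path p
LEFT JOIN search_index i
       ON i.sid = p.pid AND i.type = 'search_by_page'
WHERE p.last_index_time <> 0   -- an indexing attempt was made
  AND i.sid IS NULL            -- but no words were stored
ORDER BY p.pid
"""

def blank_pages(conn):
    """Pages that were 'indexed' but ended up with nothing in the index."""
    return conn.execute(BLANK_PAGES_SQL).fetchall()

def reset_blank_pages(conn):
    """Mark blank pages as never indexed (assumed reset mechanism)."""
    pids = [pid for pid, _ in blank_pages(conn)]
    conn.executemany("UPDATE sbp_path SET last_index_time = 0 WHERE pid = ?",
                     [(pid,) for pid in pids])
    return len(pids)

blank = blank_pages(conn)
print(blank)
```

The count returned by the hypothetical reset_blank_pages() plays the role of the number shown in the "Blank pages have been reset" message.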
Comment #19
sphopkins commented
So far all I am getting for the pages that needed to be indexed is the following in the log file:
"Oct 19 10:00:13 localhost drupal: http://127.0.0.1|1255960813|search_by_page|127.0.0.1|http://127.0.0.1/cron.php||0||content for PID (13157), path (node/249313) was not indexed (3)"
So far no new SBP has been indexed.
Comment #20
sphopkins commented
And as SBP tries to index one of the pages I know is there and it did not return a result before, it again fails and has information in the database:
pid last_index_time page_path from_module modid
13121 1255961712 node/249277 sbp_nodes 249277
Comment #21
jhodgdon
That is what we needed to know, actually. That (3) is an error code, MENU_ACCESS_DENIED. Which tells us that for those pages, during cron, when Search by Page is trying to render page (your site URL)/node/249313, it is getting an Access Denied error.
So it looks like it is NOT getting Access Denied when it tries to render most of your patient records, but it is in the case of node/249313, node/249277, and those other 700 or so nodes. What is different in the permissions for those pages?
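If the log is large, the failing pages and their error codes can be pulled out mechanically. The sketch below parses the "was not indexed" message format quoted in #19 and maps the trailing number to a menu status name. The numeric values are the menu status constants as I recall them from Drupal 6's includes/menu.inc, so treat the table as an assumption to verify; the function name is made up for this example.

```python
import re

# Menu status codes as defined in Drupal 6 (includes/menu.inc), from memory.
MENU_STATUS = {1: "MENU_FOUND", 2: "MENU_NOT_FOUND",
               3: "MENU_ACCESS_DENIED", 4: "MENU_SITE_OFFLINE"}

# Matches the tail of the message shown in comment #19.
LINE_RE = re.compile(
    r"content for PID \((\d+)\), path \(([^)]+)\) was not indexed \((\d+)\)")

def parse_not_indexed(line):
    """Return pid, path and status for a 'was not indexed' log line."""
    m = LINE_RE.search(line)
    if m is None:
        return None
    code = int(m.group(3))
    return {"pid": int(m.group(1)), "path": m.group(2), "code": code,
            "status": MENU_STATUS.get(code, "unknown")}

sample = ("Oct 19 10:00:13 localhost drupal: http://127.0.0.1|1255960813|"
          "search_by_page|127.0.0.1|http://127.0.0.1/cron.php||0||"
          "content for PID (13157), path (node/249313) was not indexed (3)")
result = parse_not_indexed(sample)
print(result)
```

Running this over every log line and collecting the paths would give the full list of nodes that hit Access Denied during cron.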
Comment #22
sphopkins commented
There should be nothing different for the permissions for those pages. I will investigate on my end but the content access module was added a week or two ago and those nodes (and all the others) were in place long before that.
Comment #23
jhodgdon
Hmmm.... You might check the publication status -- make sure all of them are set to "published" (you don't have a workflow module that could be interfering, I'm assuming?). You might also look in the database at the records in the node, node revisions, and node access tables for one of the broken nodes as well as one of the working nodes, and see if there is something different.
Also, if you are using a custom module to define the content type, there could be some access checks being done in the content type module.
That's really all I can think of, but given that log message, the problem is definitely that cron is getting access denied to just those 723 nodes. So there must be something different about them. ???
Comment #24
sphopkins commented
They should be all published - the information on all of these nodes is imported using Node_Import and that is how we ensure all are available for viewing. Just checked and that one is published.
No custom modules - all are Drupal-submitted modules from Drupal.org
I do not see any difference in records between two records - one broken and one working.
Node Access table has this:
Broken:
Working:
Node table:
Broken:
Working:
Node Revisions:
Broken:
Working:
Comment #25
jhodgdon
Hmmm... When I said "custom", I should have probably said "custom or contributed" Drupal modules.
Anyway, let me think this through. Here are some (technical) notes about what happens when cron is trying to index one of your patient nodes [skip this, except lines/sections starting with *** unless you are a PHP/Drupal programming geek or are interested]:
a) Search by Page calls http://api.drupal.org/api/function/menu_execute_active_handler/6 to render the page. As evidenced by the log shown in #19 above, this function is returning MENU_ACCESS_DENIED (3) on those problem nodes.
b) menu_execute_active_handler() only returns MENU_ACCESS_DENIED when the call to http://api.drupal.org/api/function/menu_get_item/6 returns a menu router item with its access element set to FALSE.
c) menu_get_item() finds the best matching menu router item from the menu router table. I have been assuming that this would be the "node/%" menu router item defined in the node module. But maybe somehow this is getting bypassed? OK, we can check this:
*** Can you run this query on your database (you may need to change the table name from menu_router to prefix_menu_router if you have a database prefix):
SELECT * FROM menu_router WHERE path LIKE "node/%"
You should see about 30 results, depending on how many content types you have defined:
- There should be exactly one with path = node/% -- it should have access_callback = node_access and the load_functions field should look like a:1:{i:1;s:9:"node_load";} and the access arguments field should look like a:2:{i:0;s:4:"view";i:1;i:1;} and the to_arg_functions field should be blank.
- Then there should be some other entries in the results like node/%/delete, node/%/edit, node/add/page, etc.
- There should NOT be any specific entries in the results with paths such as node/249277
*** End of *** section
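The three expectations in the *** section above can also be checked programmatically. The sketch below is only an illustration: it uses an in-memory SQLite table with a handful of the menu_router columns (the real table has many more), the sample rows other than node/% are invented, and check_node_router() is a hypothetical helper name.

```python
import re
import sqlite3

# Stand-in for the Drupal database with a cut-down menu_router table.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE menu_router (path TEXT PRIMARY KEY, load_functions TEXT,
                          to_arg_functions TEXT, access_callback TEXT,
                          access_arguments TEXT);
INSERT INTO menu_router VALUES
  ('node/%', 'a:1:{i:1;s:9:"node_load";}', '', 'node_access',
   'a:2:{i:0;s:4:"view";i:1;i:1;}'),
  ('node/%/edit', 'a:1:{i:1;s:9:"node_load";}', '', 'node_access',
   'a:2:{i:0;s:6:"update";i:1;i:1;}'),
  ('node/add/page', '', '', 'node_access',
   'a:2:{i:0;s:6:"create";i:1;s:4:"page";}');
""")

EXPECTED_LOAD = 'a:1:{i:1;s:9:"node_load";}'
EXPECTED_ARGS = 'a:2:{i:0;s:4:"view";i:1;i:1;}'

def check_node_router(conn):
    """Return a list of deviations from the expected node/% router setup."""
    rows = conn.execute(
        "SELECT path, access_callback, load_functions, access_arguments, "
        "to_arg_functions FROM menu_router WHERE path LIKE 'node/%'"
    ).fetchall()
    problems = []
    generic = [r for r in rows if r[0] == "node/%"]
    if len(generic) != 1:
        problems.append("expected exactly one node/% entry")
    else:
        _, callback, load, args, to_arg = generic[0]
        if callback != "node_access":
            problems.append("node/% access_callback is not node_access")
        if load != EXPECTED_LOAD:
            problems.append("node/% load_functions differs")
        if args != EXPECTED_ARGS:
            problems.append("node/% access_arguments differs")
        if to_arg:
            problems.append("node/% to_arg_functions is not blank")
    # Paths such as node/249277 would take precedence over the generic item.
    specific = sorted(r[0] for r in rows
                      if re.fullmatch(r"node/\d+(/.*)?", r[0]))
    if specific:
        problems.append("unexpected node-specific paths: " + ", ".join(specific))
    return problems

print(check_node_router(conn))
```

An empty list means the router table looks the way (c) expects, so the standard node/% loading and access checks described in (d) onward are the ones in effect.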
d) Assuming the check in (c) was correct, then the node/% path is the one getting matched and standard node loading/access is in effect. So, in http://api.drupal.org/api/function/_menu_translate/6 the node will get loaded and the access checks will be done. This will call both http://api.drupal.org/api/function/node_load/6 and http://api.drupal.org/api/function/node_access/6
e) If node_load() returns FALSE, then the access element will be set to false. But this shouldn't happen, given that the database looks the same in node_revisions and node for working and non-working nodes. I guess it is remotely possible that the access flag could be set to FALSE during something else in the node load (all the module_invoke stuff that node_load does to add information to the node). That would have to be from some contrib module that is doing something specific for some subset of your nodes. But this seems unlikely.
f) Access checks in node_access()... If there is no custom module associated with the 'patient' content type, then this will use node_content_access() to check the access rights, and then if that doesn't say anything, it will check the node access table (which we have already verified is the same for working and non-working nodes). node_content_access() never returns anything for 'view' access checks, so unless there is a custom module defined, this cannot be the problem.
*** Another thing to check:
Can you find the record for the 'patient' content type in your "node_type" table, and look at the "module" field? If it is something other than "node", then that module may have an access check that is coming into play.
*** End of this check
Well... There are two things to check, and I think both of them are long shots... Other than this, I think I am out of ideas, and I don't know what to tell you...
*** One more thing you could try...
You could try disabling custom or contrib (non-core) modules that might have some bearing on workflow (marking nodes draft/published/etc.), access control, the patient content type, or even other content types. If you want to try this, I would follow these steps:
a) Pick a single module that could potentially be causing trouble.
b) Disable this module.
c) Click the link described in comment #14 above to tell cron you want to reindex those empty nodes.
d) Run cron once.
e) Check the log and see if the nodes that were reindexed in that cron run still have (3) at the end of their log messages indicating access denied.
If that doesn't change anything, then re-enable that module and try a different one.
*** End of this one other thing to try.
Comment #26
sphopkins commented
OK I am not that much of a PHP/Drupal Geek but that is interesting to know ;-)
Here is what the table for the first query returns for the first few results:
So that matches what you were hoping in the first part.
As for the Node_Type for "Patient" it is that - node.
I will try the disable and enable of modules. Who knows what it is now :)
I really appreciate your help. This is above and beyond and it is much appreciated!
Comment #27
sphopkins commented
On a positive note only 703 nodes need to be re-indexed as opposed to 723 ;-)
This was after disabling Content Complete module - one that was recently updated on my site.
Comment #28
jhodgdon
That's interesting... but odd. I took a look at Content Complete and I didn't really see anything in there that would cause access troubles.
Anyway... I guess I'll set this back to a "support request" rather than bug, and I still don't have any other ideas.
Comment #29
sphopkins commented
Well, it was not Content Complete. Disabling the access control module now. But I really do not have any others that involve workflow... other than Views Bulk Operations.
Comment #30
sphopkins commented
Not Access Control either.
EDIT: Rebuilding Content Access Permissions. Maybe that will help.
Comment #31
jhodgdon
Hah! I should have thought of that. Good idea!
Comment #32
sphopkins commented
Unfortunately it did not work.
I will keep plugging away and see what comes up.
Comment #33
jhodgdon
Just as one more note: I added some new tests to the Search by Page / SBP Nodes module today to verify that "private" content can be indexed, and appears in search results for sufficiently privileged users.
I still don't know why those nodes of yours are getting "access denied".
Interestingly, I had to put in a line in the test to rebuild the content access permissions in order for the test to pass. I think this was mostly a SimpleTest framework artifact though.
Comment #34
jhodgdon
I just put out version 6.x-1.5 of Search by Page. The module is identical to the one I attached above in the zip file (with the extra debug logging added).
Comment #35
sphopkins commented
Still not working... cannot figure it out.... and the log file is filling up with the errors due to this....
Comment #36
jhodgdon
Not sure what to tell you either. The (3) at the end of the message means that during search indexing, node/1934 had an access permissions error (i.e. access denied). And we cannot seem to figure out why that should be...
Comment #37
sphopkins commented
Not a complaint on my side for sure. Still trying to figure where the access problem is coming from....
Is that an error from the main Search function or just the SBP module ? Because it seems the main search module is indexing the node.
Comment #38
jhodgdon
That error is coming from the SBP module. See #25 above for an explanation of what it is doing and where the access checks are being performed.
The main search module doesn't work in the same way -- instead of calling menu_execute_active_handler() to render the theme's version of each page that is indexed (which is what SBP does), the main search module indexes the default rendering (theme-independent) from the node module. So it is not using the same permission checking that SBP is.
Comment #39
sphopkins commented
Not forgetting about this...
Any chance that SBP and Apache Solr Search Integration interferes with each other ?
Removed Apache Solr Search and framework (added Lucene API as well), and the log file does not seem to be having issues... I will have to see if there are improvements in SBP.
Comment #40
jhodgdon
Oh, you're using Apache Solr Search? You never mentioned that before. I have no idea whether SBP and Solr work well together. SBP was meant to be used with the usual Drupal search module, not Solr. Solr has its own search index, so all of the queries we did above don't apply if you are using Solr, and the error messages in the log have to do with core Search putting things into its index, not Solr putting things into its index.
I also don't understand why you would use Lucene and Solr? My understanding is that both are integrations with 3rd-party search engines with their own code and index databases, so I would think you'd only use one or the other?
Comment #41
sphopkins commented
I had the Apache Solr module installed but was not using it. Lucene I added today..... and removed Solr.
Comment #42
jhodgdon
Well, I am not sure whether SBP will work well with Lucene or Solr. As far as I know, they have their own indexing mechanisms. SBP is designed to work with the core Search's indexing mechanism, and I just don't know much about Solr or Lucene to know whether they'd work with SBP. Sorry.
Comment #43
sphopkins commented
Core search is still there - Solr & Lucene make their own indexes. So SBP is not relying on them - so far SBP is working the same as before but I am not certain if the unindexed nodes are being indexed just yet.
Comment #44
sphopkins commented
I had high hopes... unfortunately it has not fixed things. So it is not Solr or Lucene that affects things.
However the one thing that I will check on is if Lucene can find those problem nodes...
Comment #45
jhodgdon
It's quite possible that Lucene can -- didn't we decide that it was a problem in node access inside of Search by Page's attempt to render the page?
Comment #46
sphopkins commented
Trying to follow up here and see if this is fixable.
As a refresher, there was a problem getting all nodes indexed by SBP. There were many that were not being indexed for an unknown reason. I still have Content Access 6.x-1.2 providing the rights to load certain pages. I have still been getting the errors.
As some testing I allowed anonymous users to load the content type that I want indexed. And lo and behold, the errors stopped appearing in the log files. Not 100% sure things are all being indexed right now.... checking that out.
Comment #47
jhodgdon
Well.
I still don't understand why some of your content would be indexed and some not, given that you say that all of your content has the same permissions. Until I understand why that is, it will be difficult to fix...
Comment #48
sphopkins commented
I agree that this is perplexing.
The interesting thing is that I only enabled anonymous users. All other permissions stayed the same.
I have reset the blank pages in the SBP index, cleared the cached data and I will await the results of cron....
Comment #49
jhodgdon
I still need to understand what the difference is between the content that is and is not indexed. I don't think this is going to tell us anything, although it might fix your site. But periodically the content that was previously indexed will be reindexed, so once you set your permissions back to the correct values, the site will eventually revert to having this permissions problem, and the content won't be indexed again, I expect.
Comment #50
sphopkins commented
That is the weird thing... as far as I can see there is no difference. All the same node type. All created with Node_import module.
I expect part of my testing to show me if there is any "reversion" when I reset the settings. I am stumped like you.
Comment #51
jhodgdon
Hah! I may have found an issue. I realized that in the automated tests I wrote for this module (as well as when I was trying to reproduce your problem manually), I (or the testing framework) was logged in as a privileged user when running cron. So I took that out of the automated tests, and I experienced some test failures that seem like they might be related to your problem.
I will investigate further and report back.
Comment #52
sphopkins commented
Woo Hoo !
I really really really really really really really really really appreciate your dedication to looking at this.
Comment #53
jhodgdon
Many of us are working with all of our volunteer/spare time for Drupal development, trying to get Drupal 7 ready for alpha release, so it will be a couple of days before I can get back to this -- apologies in advance! But I haven't forgotten.
Comment #54
sphopkins commented
That is why I appreciate your efforts so much on this. If I was in Seattle I can guarantee I would be contracting you to do some work for me !
Comment #55
jhodgdon
I can work remotely... Actually, only about 1/2 to 1/3 of my clients, at any given time, are local to Seattle. :)
Comment #56
sphopkins commented
I am unfortunately on an intranet and I have confidential patient information so remote is hard ;-)
Though I may look at some theming and other ideas I see on your site.
Comment #57
sphopkins commented
OK I just verified that the nodes that were not indexed before are now indexing after allowing anonymous access to content. I will have to rebuild permissions and remove the anonymous access eventually but there is something going on there.
Comment #58
jhodgdon
Thanks for the information.
Comment #59
jhodgdon
Just to let you know, I'm working on fixing this in Search by Page now, and should have something for you to test in the next few days.
Comment #60
sphopkins commented
Excellent and thanks. In terms of availability, I am away Jan 22-29 so no rush in that timeline... I will be sipping margaritas on a beach somewhere hopefully warmer than Canada ;-)
I am eager to test something.
Thanks
Comment #61
jhodgdon
I have just committed changes to the development version of Search by Page, for these issues:
#662282: Support for multilingual sites
#605458: Search by Page does not return all results
#492878: Multiple search types
Pertinent to this particular issue are access permission features:
- You can set a role to use when indexing nodes, users, etc.
- Also note that there is a new "search environments" feature, so the settings pages look a little different (see above issue)
I think it's working fine, and all of my automated tests also pass.
If anyone who's watching this issue wants to test, you can get the development version from CVS, or wait about 24 hours until it's updated on http://drupal.org/project/search_by_page (version 6.x-1.x-dev, but make sure its date/time is after the date/time of this comment here!).
You will need to run the update.php script after updating to this version of the module, and also visit the Search module settings page and tell it to redo the search index.
Comment #62
sphopkins commented
Will be downloading the dev version and testing in the next bit. Thanks for the work.
Comment #63
sphopkins commented
One error on running update.php
Table 'breastdb.sbpp_path' doesn't exist query: UPDATE sbpp_path SET languages='1' in /var/www/html/sites/all/modules/search_by_page/search_by_page.install on line 179.
Is this critical, can I fix it and can I try working with it now ?
Thanks
Comment #64
jhodgdon
You should be OK. If you are using Search by Page Paths, you might want to edit the paths you have created and make sure they have language(s) defined.
Thanks for letting me know about that error. I'll fix it.
Comment #65
jhodgdon
Ah, I see the problem, which is that the Search by Page update function assumed SBP Paths was turned on, and it shouldn't have. Thanks again for pointing that out, and I'll be curious to see your feedback on the other changes.
Comment #66
jhodgdon
Comment #67
jhodgdon
Any word on whether the development version fixes your issues?
Comment #68
jhodgdon
I'm tentatively marking this issue "fixed", and it's out in version 6.x-1.8 (or should be shortly).
Please reopen if you are still having problems.
Comment #69
sphopkins commented
I am still having issues I guess, and I am not sure what to do.
I am getting these errors in the log:
Type search_by_page
Date Friday, February 12, 2010 - 11:33
User Anonymous
Location http://127.0.0.1/cron.php
Referrer
Message Content not rendered (access denied) - PID (32192), path (node/250435), realpath (node/250435), language (en)
Severity error
Hostname 127.0.0.1
Operations
The node type to be indexed is being indexed as an administrator. The administrator has access to that node type / content type via permissions and access control.
Not sure what I am doing wrong.
Comment #70
jhodgdon
Are you using the released 6.x-1.8 version now?
Comment #71
sphopkins commented
I have been using 6.x-1.x-dev (2010-Feb-01). I updated to 1.8 today.
Comment #72
jhodgdon
Well, there were a few changes between Feb 1 and now, so I'll be interested to see if that changes anything...
One question: Are you running cron while logged in, from the Status Reports page, or from a cron job? Because that could affect permissions (although it should not with the new version of SBP). Are in-person cron jobs failing to index anything, and automatic cron jobs indexing everything, or some subset, or vice versa? Do you get the same errors in the log no matter how you run cron?
So I still don't fundamentally understand why some of the nodes of this content type are getting the access denied error and some aren't. What's the difference between the indexed and non-indexed nodes? Somewhere back up in the comment stream you had indicated there was no difference between these nodes that you were aware of, and that you weren't using any per-node content permissions, and presumably they were all ....
Hmmm....
So is there any difference in how these nodes were created? E.g. were some imported via a script or module, and some have later been edited from the user interface?
Or anything at all you can think of to distinguish the ones that are vs. are not indexed?
Comment #73
jhodgdon
Ah... I just fixed something that the Drupal 7 tests uncovered that might be an issue in Drupal 6 as well. Let me try something.
EDIT: Actually, no, that wouldn't cause those errors in the log. Forget that. Sorry.
Comment #74
sphopkins commented
I am running the cron.php from cron on the FC11 server I have. I will have to try a manual cron from the status page.
In the SBP tables, things look like this for unindexed nodes:
Basically all nodes are created via an automated script from Node_import, and some are edited and some have not been. All have very extensive view fields linked to them.
I will see what else I can dig up.
Comment #75
jhodgdon
What I'm trying to figure out is what is the difference in the history/values of the nodes that are indexed vs. the nodes that are unindexed. Not what is happening in the SBP tables, but what is happening in the node/cck/node_access/permissions tables; and also as a side-line or clue into the database tables, what is the history of the nodes.
For instance, if all of the imported nodes are not indexed, until they are edited in the UI, then perhaps the node importing script is not triggering the same kind of permissions setup as what you would get if you created/edited the nodes in the UI.
???
I could understand it if no nodes of this type on your system were being indexed, or if all nodes on your system of this type were being indexed, but if some are and some aren't, and we cannot figure out what is different between the nodes that are and aren't indexed (again, outside of the SBP tables), then I have nothing to go on.
Comment #76
sphopkins commented
Since the change I can say that there are fewer nodes indexed than before. But the chance that the module node_import is causing problems may have legs as I have run into problems with the "drupal-ness" of that node in the past.
Comment #77
jhodgdon
Yeah, if that module is, for instance, bypassing node_save() (the central Drupal function for saving nodes) and doing its own database update queries, it's quite possible it has screwed up your permissions.
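To make that concrete, here is a rough toy model in Python, using an in-memory SQLite database as a stand-in for the site's MySQL database. It is only a sketch under loud assumptions: the table layouts are heavily simplified, the 'example' realm and the function names central_save, direct_insert and nodes_missing_access are made up for illustration, and none of this is node_import's or Drupal's real code. It also assumes something on the site writes per-node access rows at save time. The point is just that a writer that skips the central save routine can leave nodes without the side data that a normal save would create, and that a query can list which nodes those are.

```python
import sqlite3

# In-memory stand-in for the site database (simplified, hypothetical schema).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE node (nid INTEGER PRIMARY KEY, type TEXT, title TEXT);
    CREATE TABLE node_access (nid INTEGER, realm TEXT, grant_view INTEGER);
""")


def central_save(nid, title):
    """Toy central save routine: writes the node row AND its access row."""
    conn.execute("INSERT INTO node VALUES (?, 'patient', ?)", (nid, title))
    conn.execute("INSERT INTO node_access VALUES (?, 'example', 1)", (nid,))


def direct_insert(nid, title):
    """Toy import script: writes only the node row, skipping the side effects."""
    conn.execute("INSERT INTO node VALUES (?, 'patient', ?)", (nid, title))


def nodes_missing_access(node_type):
    """Return nids of the given type that have no view grant recorded."""
    rows = conn.execute(
        """
        SELECT n.nid
        FROM node n
        LEFT JOIN node_access na
               ON na.nid = n.nid AND na.grant_view = 1
        WHERE n.type = ? AND na.nid IS NULL
        ORDER BY n.nid
        """,
        (node_type,),
    ).fetchall()
    return [row[0] for row in rows]


central_save(1, "chart-0001")   # saved the "normal" way
direct_insert(2, "chart-0002")  # imported, never re-saved
direct_insert(3, "chart-0003")  # imported, never re-saved

missing = nodes_missing_access("patient")
print(missing)
```

If a query along these lines against the real tables turned up exactly the unindexed nodes, that would point at the import path rather than at Search by Page.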
Comment #78
jhodgdon
I made some changes to how SBP is dealing with roles during search indexing, which may help you. Can you try out version 6.x-1.9 (which I just released, should be available within 15 minutes) and see if it helps? You may need to rebuild your node access permissions (you can do this at (your URL)/admin/content/node-settings/rebuild).
Comment #79
sphopkins commented
Will do that tomorrow. I have lots of nodes so I will have to do it at the end of the day...
Comment #80
sphopkins commented
I have not rebuilt the content permissions yet, but I am noticing the following in the logs:
Role 3 (administrator) could not be used to index PID (18264), path (node/5105)
Comment #81
jhodgdon
The new version is setting up dummy users for search indexing... Do you have user account creation blocked in some way on your site? It would be done at a low level, not via "users can create their own accounts", so the account settings on the usual account settings page would not block it...
So can you look back a bit in the logs and see if you can find a message that says something like
or
And do you see a blocked user in your Users page with the name "sbp indexing administrator"?
Comment #82
sphopkins commented
I do see the users have been added and the status is blocked. I will unblock them (did not think I set any restrictions on users but since I have created all of them myself it may be restricted).
Comment #83
sphopkins commented
Also noticing this error:
Duplicate entry 'sbp indexing administrator' for key 'name' query: INSERT INTO users (pass, name, mail, status, created) VALUES ('5fe37353c8ed541ed0707c2de8806237', 'sbp indexing administrator', 'ZsQUyA6a85@XAvUTDh2Bq.com', 0, 1266340513) in /var/www/html/modules/user/user.module on line 327.
Comment #84
jhodgdon
Huh. You don't have to unblock the dummy users...
That error in #83 makes sense to me, given the other errors you are seeing. The code tries to recreate the users if it cannot load them, and it's getting a dup error because the user name has already been used.
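As a rough illustration of that failure mode (a sketch only, not the module's actual code), here it is in Python with an in-memory SQLite table standing in for the users table. The remembered_uid dictionary is a hypothetical stand-in for wherever the uid of the dummy user gets recorded, and the function names are made up. The naive version creates the user whenever it cannot load it, so if the record of the uid is lost while the row still exists, the unique constraint on the name rejects the insert. The safer version looks the user up by name before creating it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (uid INTEGER PRIMARY KEY, name TEXT UNIQUE, status INTEGER)"
)

# Hypothetical stand-in for wherever the dummy user's uid is remembered.
remembered_uid = {}


def create_user(name):
    """Insert a blocked (status 0) user; raises IntegrityError if the name exists."""
    cur = conn.execute("INSERT INTO users (name, status) VALUES (?, 0)", (name,))
    return cur.lastrowid


def naive_get_indexing_user(name):
    """Load by remembered uid; if that fails, blindly try to create the user."""
    uid = remembered_uid.get(name)
    if uid is None:
        uid = create_user(name)
        remembered_uid[name] = uid
    return uid


def safer_get_indexing_user(name):
    """Load by remembered uid, then by name, and only then create the user."""
    uid = remembered_uid.get(name)
    if uid is None:
        row = conn.execute(
            "SELECT uid FROM users WHERE name = ?", (name,)
        ).fetchone()
        uid = row[0] if row else create_user(name)
        remembered_uid[name] = uid
    return uid


name = "sbp indexing administrator"
first_uid = naive_get_indexing_user(name)

# Simulate the record of the uid being lost while the user row still exists.
remembered_uid.clear()

try:
    naive_get_indexing_user(name)
    duplicate_error = False
except sqlite3.IntegrityError:
    duplicate_error = True

recovered_uid = safer_get_indexing_user(name)
print(duplicate_error, recovered_uid == first_uid)
```

Deleting the existing dummy users, as suggested in the next paragraph, gets to the same place from the other direction: once the rows are gone, the create step no longer collides with an existing name.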
Ummm...
Did you run the update.php script after installing the new version of the module? If not, run it now, and then from your Users admin page, delete the sbp users and try running cron again. I bet that is the problem...
Comment #85
sphopkins commented
I did run the update script but it may have been right when cron was about to run.... I will delete them and let cron do its job.
Comment #86
jhodgdon
OK. I can see I might need to make some updates to prevent this sort of thing.
EDIT: Filed a separate issue on this:
#716342: SBP users for indexing may have duplication problems
Comment #87
sphopkins commented
Still on this... made some changes and I will report back.
Comment #88
jhodgdon
Any news?
Comment #89
jhodgdon
Any news?
Comment #90
jhodgdon
Bump. Any news on this?
Comment #91
illepic commented
Just putting in my two cents here:
I too was having errors with Lucene indexing certain nodes on my site. We found out that our users were attaching cck images and including in the provided title and alt tags the abbreviation for inches as ". So we ended up with many attached title and alt tags like: Product Name 6"x6"x5".
Due to the positioning of the image on the page, this was preventing the rest of the body copy from being indexed by Lucene. Once we took out those quotation marks, Lucene indexed beautifully.
Just throwing it out there, don't know if it helps :)
Comment #92
jhodgdon
This issue has nothing to do with the Lucene module, and the problem was that certain nodes were not getting indexed at all, not that some content was being omitted from indexed nodes. So I don't think your note is related to this issue... but thanks for trying...
Comment #93
jhodgdon
I've recently made some changes to Search by Page and how it decides to index content... and I haven't heard anything on this issue for several months. So at this point, I'm going to close this support request and hopefully it's been resolved.