Content Access + Domain Access - ApacheSolr returns all the results not respecting the node access permissions

duozersk - August 22, 2009 - 09:28
Project:Apache Solr Search Integration
Version:6.x-2.x-dev
Component:Code
Category:bug report
Priority:normal
Assigned:Unassigned
Status:needs work
Description

Looking for an advanced and fast search solution, I tried Apache Solr Search Integration on my test machine.

My site uses Content Access to limit access to content, plus Domain Access to bind content to one of two domains (forum.example.com and kb.example.com; more domains to be added). Domain Access is set up to publish content to all configured domains (called affiliates), and we use Domain Advanced, which changes Domain Access to use db_rewrite_sql() instead of node access (so the node_access table is governed only by the Content Access module).

I have installed two Lucene-based solutions in parallel - Apache Solr Search Integration and Search Lucene API. Then I ran the cron jobs to get the content indexed. The issue is that Search Lucene API handles restricted content just fine, but Apache Solr Search Integration returns restricted content in the results for unauthorized users (and when such a user tries to open one of them, an Access Denied message is shown).

Please advise, as Apache Solr Search Integration is much faster at returning results and filtering by facets. I would really like to get this sorted out.

#1

Scott Reynolds - August 22, 2009 - 15:42

This seems silly to ask, but you did enable the Apache Solr node access module, right?

#2

duozersk - August 22, 2009 - 17:57

Yes, I did enable the apachesolr_nodeaccess module... I didn't find any settings for it, so I assume it should just work once enabled. But it didn't produce the results I expected.

#3

duozersk - August 22, 2009 - 18:28

I'm not that proficient in D6 node access; the only thing I noticed is that Domain Access still adds one entry to the node_access table on install (even when using the Domain Advanced module):

+-----+-----+------------+------------+--------------+--------------+
| nid | gid | realm      | grant_view | grant_update | grant_delete |
+-----+-----+------------+------------+--------------+--------------+
|   0 |   0 | domain_all |          1 |            0 |            0 |
+-----+-----+------------+------------+--------------+--------------+

And then, from looking at the code of the apachesolr_nodeaccess module, I see that it tries to handle the node_access grants... probably the above entry somehow confuses the apachesolr_nodeaccess algorithm. But still, another module figured it out correctly...

Let me know if I can provide anything else to get to the root cause of this behavior.

#4

duozersk - August 22, 2009 - 21:38
Title:Content Access + Domain Access - ApacheSolr returns all the reslts not respecting the node access permissions» Content Access + Domain Access - ApacheSolr returns all the results not respecting the node access permissions
Category:support request» bug report

Changing to bug report, I believe it is appropriate.

#5

Scott Reynolds - August 24, 2009 - 02:36

The trick is actually really simple. Give each subdomain a different apachesolr_site_hash.

That will filter the results for you per domain, so your search from suba.domain.com will return only suba results,

and your search from subb.domain.com will return only subb results.

This is centered around the node_access module saying "Anything user 0 can see, it is node_access_all"
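
The per-domain filtering suggested in this comment can be pictured with a small standalone model. This is only a sketch in plain Python, not the module's PHP: the `hash` key, the `hash_a`/`hash_b` values and the `search()` helper are hypothetical names chosen for illustration, under the assumption that every indexed document records the hash of the site that indexed it.

```python
# Hypothetical index: each document carries the hash of the site that indexed it.
INDEX = [
    {"nid": 1, "hash": "hash_a"},  # indexed from suba.domain.com (assumed)
    {"nid": 2, "hash": "hash_b"},  # indexed from subb.domain.com (assumed)
    {"nid": 3, "hash": "hash_a"},
]


def search(index, site_hash):
    """Return the node ids whose stored hash matches the searching site's hash."""
    return [doc["nid"] for doc in index if doc["hash"] == site_hash]


suba_results = search(INDEX, "hash_a")
subb_results = search(INDEX, "hash_b")
```

Note that this kind of filter separates results by domain only; it says nothing about which roles may view a node.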

#6

duozersk - August 24, 2009 - 11:39

Thanks Scott. It seems I didn't describe it correctly. I do want to search the content of both domains from any domain (be it kb.example.com or forum.example.com).

And we restrict access to content not by domain but by role (using the Content Access module). So we have roleA, roleB and roleC, and want each role to get search results containing only the content items it has access to.

#7

duozersk - September 9, 2009 - 13:51

Any help with this one, please? Can someone point me in the right direction if I've got it wrong?...

#8

duozersk - November 10, 2009 - 00:08
Version:6.x-1.0-rc2» 6.x-1.0-rc3
Priority:normal» critical

Up... still can't figure it out.

#9

Scott Reynolds - November 10, 2009 - 00:11
Priority:critical» normal

...not critical.

#10

duozersk - November 12, 2009 - 12:32

Yes, it is not - the search still works :) People tend to raise the priority if the issue is not closed in the expected timeframe ;)

Anyway, I figured it out by writing a custom module based on apachesolr_nodeaccess. I removed some stuff from it and slightly modified the queries (basically, the removed parts dealt with multisite, which is not critical for me as I don't use it). Not sure what was causing the issue.

Attaching it here for reference.

AttachmentSize
acronissolr_nodeaccess.zip 1.99 KB

#11

Jody Lynn - November 23, 2009 - 21:08
Status:active» needs review

I had the same problem.

apachesolr_nodeaccess makes the assumption that if the anonymous user can access a node, there is no node access to worry about. This is not true when using the Domain module, because the anonymous user can access content on one domain but not on another.

This patch does the same thing as the .zip module above, removing that flawed assumption.
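
The assumption described in this comment can be illustrated with a toy model. It is a sketch in plain Python rather than the module's code, and it rests on two assumptions that the thread does not confirm: that the "anonymous can view" check is evaluated once at index time in the context of a single domain, and that node 2 is a hypothetical node visible to anonymous users on kb.example.com only.

```python
# Toy model, not the module's code: domains on which anonymous users may view each node.
ANON_VIEWABLE_ON = {
    1: {"forum.example.com", "kb.example.com"},
    2: {"kb.example.com"},  # hypothetical node published on one domain only
}


def index_node(nid, indexing_domain):
    """Index-time shortcut: if anonymous can view the node, flag it as public."""
    return {"nid": nid, "public": indexing_domain in ANON_VIEWABLE_ON[nid]}


def anonymous_search(index):
    """Query-time filter for anonymous users: return only documents flagged public."""
    return [doc["nid"] for doc in index if doc["public"]]


# Indexing runs in the context of kb.example.com (an assumption of this sketch).
index = [index_node(nid, "kb.example.com") for nid in ANON_VIEWABLE_ON]

# The same index then answers searches from any domain, including forum.example.com.
results = anonymous_search(index)
```

In this model node 2 ends up in `results` even though anonymous users cannot view it on forum.example.com, because the public flag was decided once and carries no domain information.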

AttachmentSize
apachesolr-556426.patch 1.84 KB

#12

robertDouglass - November 25, 2009 - 13:52
Version:6.x-1.0-rc3» 6.x-2.x-dev

A new version to test. I'm testing against 6.2 but it should apply to 6.1 as well.

AttachmentSize
node_access.patch 4.15 KB

#13

robertDouglass - November 25, 2009 - 13:53

Same patch with a corrected code comment.

AttachmentSize
node_access.patch 4.14 KB

#14

agentrickard - November 25, 2009 - 15:01

Note: The domain_all grant is put there by Domain Access for just the situation you describe. DA has a setting -- read the documentation -- for "Search content on active domain" or "search content from all domains." It also allows path-based registration of URLs on which you want to disable DA.

In those cases, we pass the 'domain_all' grant, which effectively removes DA from the node access query. However, I do not know if Domain Access Advanced (a separate module that I do not maintain), actually respects this setting.

See domain_grant_all() and domain_node_grants() in domain.module.

SOLR module + DA respects this setting properly. I do not know how the patch would affect that.

#15

robertDouglass - November 25, 2009 - 15:25

Yeah. It's complicated. Any direct testing is greatly appreciated. Thanks!

#16

agentrickard - November 25, 2009 - 17:19

The hard part is using multiple node access modules, which doesn't really work in Drupal anyway. And Domain Access Advanced, technically, is not a node access module. It replaces the node access elements with its own db_rewrite_sql.

#17

pwolanin - November 29, 2009 - 14:25

I think in some cases, people will have to write their own versions of the nodeaccess module - it's only ~100 lines...

#18

agentrickard - November 30, 2009 - 16:08

I'm basically with Peter on this. If ApacheSolr can support one node access module at a time (which is basically all core can do), then it's ok, and anything else requires custom code.

#19

robertDouglass - November 30, 2009 - 16:16

Good, we agree on that (#17 & #18)

Hopefully we can get some review of #13 http://drupal.org/node/556426#comment-2304968

#20

agentrickard - November 30, 2009 - 17:05

What are we actually testing in #13? I never had a DA+Solr problem. The issue may be DA Advanced and its use of hook_db_rewrite_sql().

#21

robertDouglass - November 30, 2009 - 18:13

Testing #13 for general compatibility with node_access. If you could test it with DA, that'd be great. I tested it with workflow_access. The current implementation doesn't work with workflow_access. I think this is more robust.

#22

pwolanin - December 1, 2009 - 14:17
Status:needs review» needs work

The main reason I had the anonymous check there and did not bypass all checks for "administer nodes" was to maintain the viability of doing multi-site searches. The proposed patch destroys that.

#23

duozersk - December 1, 2009 - 19:34

Ok, since I started it all - here is my take on this.

I'm using two node access modules because I need Content Access to control the actual access and Domain Access to have two different domains. Basically, all the content should be searchable from any domain. I'm using Domain Access Advanced because it is recommended not to have two modules that both use the node_access table.

I will refer to the situation before the patch. apachesolr_nodeaccess handles things quite well and is able to handle multi-site setups using the node_access('view', $node, drupal_anonymous_user()) check. This check runs fine and returns the correct information. I removed it only for the sake of simplicity, as I didn't need the multi-site options.

But when the above check returns FALSE, there is a query against the node_access table:

db_query('SELECT * FROM {node_access} WHERE (nid = 0 OR nid = %d) AND grant_view = 1', $node->nid)

which is then parsed and added as fields to the Apache SOLR index.

Even when using the Domain Access Advanced module there is one record in the node_access table:

+-----+-----+------------+------------+--------------+--------------+
| nid | gid | realm      | grant_view | grant_update | grant_delete |
+-----+-----+------------+------------+--------------+--------------+
|   0 |   0 | domain_all |          1 |            0 |            0 |
+-----+-----+------------+------------+--------------+--------------+

and if we apply the above SQL query to this row, we get the nodeaccess_sitehash_domain_all field added to the Apache SOLR index for every node that returns FALSE on the node_access('view', $node, drupal_anonymous_user()) check. For me this is where it all fails (though not confirmed): when the subquery is added, it effectively adds an OR condition against this nodeaccess_sitehash_domain_all field, which returns TRUE, and all the nodes are exposed to every user.

This is probably not 100% correct. But from looking at how the Lucene API module handles it: it doesn't query the node_access table for rows with nid=0, and it works just fine. Interestingly, Chris wrote in the comments for his Lucene API node access implementation that he got the idea for it from this project.
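
The suspected failure can be reproduced in a small standalone model. This is only a sketch of the hypothesis above, written in plain Python rather than Drupal PHP. The content_access_rid row for node 7, the `index_fields()` and `can_see()` helpers, and the premise that an unauthorized user still receives the domain_all grant are all assumptions made for illustration.

```python
from typing import NamedTuple


class Grant(NamedTuple):
    nid: int
    gid: int
    realm: str
    grant_view: int


# The first row is the one shown above; the second is a made-up Content Access grant.
NODE_ACCESS = [
    Grant(nid=0, gid=0, realm="domain_all", grant_view=1),
    Grant(nid=7, gid=3, realm="content_access_rid", grant_view=1),
]


def index_fields(nid, include_nid_zero):
    """Model of: SELECT * FROM node_access WHERE (nid = 0 OR nid = %d) AND grant_view = 1."""
    wanted = {0, nid} if include_nid_zero else {nid}
    return {
        (f"nodeaccess_sitehash_{row.realm}", row.gid)
        for row in NODE_ACCESS
        if row.nid in wanted and row.grant_view == 1
    }


def can_see(doc_fields, user_grants):
    """The access filter is an OR over the user's grants, so one match is enough."""
    return bool(doc_fields & user_grants)


# A user without role 3 who still receives the domain_all grant (assumed).
outsider = {("nodeaccess_sitehash_domain_all", 0)}
# A user who holds role 3 (hypothetical role id).
member = {("nodeaccess_sitehash_content_access_rid", 3)}

leaks = can_see(index_fields(7, include_nid_zero=True), outsider)
fixed = can_see(index_fields(7, include_nid_zero=False), outsider)
member_ok = can_see(index_fields(7, include_nid_zero=False), member)
```

In this model, dropping the nid = 0 rows at index time hides node 7 from the outsider while the role member still matches, which is consistent with the behaviour described for the Lucene API module.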

Hope this helps.

Thank you all for your hard work on this Apache SOLR initiative; it really saves time and money.

 
 
