Maybe I'm wrong, but it doesn't look like Solr is respecting the permissions of CCK fields as created by the Content Permissions module. Am I wrong? Is this planned? Is it impossible? My brief tests seemed to show that Solr was showing results to anonymous users that included a field that anonymous users weren't supposed to have access to view.

Thanks much,
-Joseph

Comments

damien tournoud’s picture

That would require quite a bit of work. Moreover we cannot do that without giving ApacheSolr intimate knowledge of the Content permissions module, and I'm not sure it would be very wise to do so.

janusman’s picture

Does the same thing happen with core search?

JacobSingh’s picture

This is pretty much impossible; node-level permissions are enough of a PITA. Perhaps if a field has restricted permissions it should be omitted from the index by default, but role-based perms are likely never to happen.

pwolanin’s picture

In general, the content access module is something I would avoid. However, if your indexing cron runs are only triggered by the anonymous user, then the indexed content should only be what's visible to anonymous users. Did you trigger cron manually?

anarchivist’s picture

Yeah, Solr isn't designed to support document or field level security in my understanding.

@jtbayly - One possible fix would be to add a custom module that post processes results based on the user's role.

pwolanin’s picture

Status: Active » Closed (won't fix)

robertdouglass’s picture

So: The solution to this is to have a second search approach that uses node_load instead of relying on the stuff Solr sends back. Then we could really lock everything down. Admins would have to know what kind of search they're dealing with and choose appropriately.

heacu’s picture

subscribing

aufumy’s picture

I will have to work on this sometime, so I am starting by creating the project Apache Solr Field Access:

http://drupal.org/project/apachesolr_fieldaccess

nally’s picture

I see the issue here, that anonymous may find out that a particular node is found via a search on "foo", where "foo" is in a hidden field, i.e. they never see 'foo' in the detail page for that node because Drupal prevents it from showing that field. Nevertheless, the user can see indirect information about the node based on the search result.

Rather than solving the general problem, with arbitrary numbers of roles and permission sets... just assuming for a moment that there are only two levels of permissions "anon" and "admin"...

... can you guys imagine how one might get this right by building two indexes? i.e. the "anon" index is built without using the hidden fields and the "admin" index IS built using the hidden fields?

Asked another way: can ApacheSolr be configured to build two indexes from the same website (set of nodes) and then provide search services for each permission set?

Scott Reynolds’s picture

Status: Closed (won't fix) » Active

Hmm, I think another solution would be this: when building the query fields and boost functions into the search string, only use query fields the current user can see.

I think that makes a lot of sense. Doing this will prevent matching a hidden field. I'm going to reopen and risk reprisal :-D.
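
To make the idea concrete, here is a minimal Python sketch of building the dismax `qf` parameter from only the fields the searching user may view. It is an illustration, not the module's code: the field names, boosts, and permission strings are all hypothetical, and it only helps if restricted fields are kept out of any shared field that everyone searches.

```python
from typing import Dict, Optional, Set, Tuple

# Hypothetical mapping: Solr field name -> (boost, permission needed to view it).
# A permission of None means the field is public. All names are illustrative.
QUERY_FIELDS: Dict[str, Tuple[float, Optional[str]]] = {
    "title": (5.0, None),
    "body": (1.0, None),
    "ts_cck_field_notes": (2.0, "view field_notes"),
    "ts_cck_field_salary": (2.0, "view field_salary"),
}


def build_qf(user_permissions: Set[str]) -> str:
    """Return a dismax-style qf string limited to fields the user may view."""
    parts = []
    for field, (boost, permission) in QUERY_FIELDS.items():
        if permission is None or permission in user_permissions:
            parts.append(f"{field}^{boost}")
    return " ".join(parts)


anonymous_qf = build_qf(set())
editor_qf = build_qf({"view field_notes"})
```

With these assumed mappings, an anonymous search only ever queries `title` and `body`, while a user holding the hypothetical "view field_notes" permission also matches against that one extra field.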

pwolanin’s picture

@Scott - the issue is that we render the full node body and that's almost always one of the searchable fields

A possible approach would be to exclude any fields with permissions on them when you build the node body, and only include them as separate fields in the index (or even 1 separate field?).

Doable, perhaps, but I wouldn't rate it a priority.

Scott Reynolds’s picture

Status: Active » Postponed

Ah, you're right. The main body of 'searchable' content is the node body built at index time. That might be a hairy mess to unwind.

You could do something extreme and index each field separately. I'll mark this as postponed, as it's a big thing for something little.

janusman’s picture

re #12: @pwolanin: But, isn't the full node rendered for indexing according to what an anonymous user sees? If that's the way core behaves I don't see a problem having the same thing.

If I'm correct, then I also agree with #11: @Scott's idea of limiting the query fields/boost functions to fields viewable by the current user.

robertdouglass’s picture

Peter, what do you think - is this a case where eDismax is going to help us? Should we start looking forward to indexing all fields separately?

jpmckinney’s picture

Status: Postponed » Active

nally’s picture

One "work around" that I'd love to get some opinions on is this:

Would building a separate index for each role and then guiding each user to their specific index when they search be a good way to achieve what we want?

jtbayly’s picture

I'm not sure that would work, nally, since users can be assigned multiple roles. Maybe there is a way to work around that problem, but it isn't obvious to me.

pwolanin’s picture

@robertDouglass - edismax doesn't make any difference in terms of this compared to dismax.

The only reasonable progress I can think of for 6.x-2.x is to make sure that all content is indexed with the node as viewed by an anonymous user - I think this will at least hide restricted fields from the search results, and then someone else can add on field-level indexing and searching.

robertdouglass’s picture

After thinking about this a bit, I think it is to our advantage to index all CCK fields separately and specify the query fields at search time. Then, when content permissions is in use, we can simply check at search time which fields the searching user has permission to view. I have to investigate this more, but it's actually a pretty common use case, and the content permissions module is fairly widely used, so there is a strong argument for solving this problem.

pwolanin’s picture

How much breakage are we willing to introduce into 6.x-2.x?

If we are going this route, we clearly need to index comments separately as a more urgent problem.

janusman’s picture

I think we're opening up a box full o'pandora:

Spellchecking:
- Would saying "did you mean: XXXX" be bad when XXX only exists in a field you currently cannot see?
- would have to "absorb" content from many different fields. (Which ones depends on the above).

Also, what would the snippet include? Public fields only? Or could we store separate snippets per role? (If so, ditto for spellchecking?)

I bet there are some performance/scalability issues here on the Solr end. Maybe.. querying many fields == worse response times? How about index size?

Yikes =)

pwolanin’s picture

Other than the comment problem, I really don't feel like we should try to solve permissions beyond node access. That's a perfect role for another contrib module to solve.

nally’s picture

What do you mean by "index them separately" ?

I'm curious because I need to build this, and I'd love to be in step with what people think ought to be done here.

pwolanin’s picture

Each CCK field would need to be a different field in the Solr document - one of the dynamic fields.
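
As a rough illustration of what that could look like, the Python sketch below turns a node into a flat Solr-style document in which every CCK field gets its own dynamic field and only unrestricted fields are folded into the shared `body`. The `ts_cck_` prefix, the node layout, and the function name are assumptions made for the example, not the module's actual naming.

```python
from typing import Dict, Set


def node_to_solr_doc(node: Dict, restricted: Set[str]) -> Dict[str, str]:
    """Build a flat document with one dynamic field per CCK field.

    Restricted fields are indexed only under their own field name, so they
    can be left out of the query fields for users who may not view them.
    """
    doc = {"nid": str(node["nid"]), "title": node["title"]}
    public_text = []
    for name, text in node["fields"].items():
        doc[f"ts_cck_{name}"] = text  # hypothetical dynamic-field prefix
        if name not in restricted:
            public_text.append(text)
    # The shared body holds only text that anyone is allowed to see.
    doc["body"] = " ".join(public_text)
    return doc


node = {
    "nid": 42,
    "title": "Alpha Node",
    "fields": {"field_summary": "Public summary", "field_notes": "Internal only"},
}
doc = node_to_solr_doc(node, restricted={"field_notes"})
```

The design choice here is that the restricted text still exists in the index, but only in a field that has to be named explicitly at query time.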

nally’s picture

Janusman wrote:

Start-Quote
I think we're opening up a box full o'pandora:

Spellchecking:
- Would saying "did you mean: XXXX" be bad when XXX only exists in a field you currently cannot see?
- would have to "absorb" content from many different fields. (Which ones depends on the above).

Also, what would the snippet include? Public fields only? Or could we store separate snippets per role? (If so, ditto for spellchecking?)
End-Quote

One possibility that I'd thought of was to have a separate search index, one for each kind of permission-set. (At first, I'd been thinking of using "role" for such a permission set, but it was pointed out that roles accumulate permissions, so that would be problematic.)

Imagine instead another designation... with one value per user. It might be called "search index". It could be a hidden field, that the admin would be able to see and set.

Then, when content is indexed, it is modified for each index... one index for each "permission set".

Some questions regarding Janusman's objections: would an index for each permission set solve those? i.e. is Did you Mean done on an index by index basis? Are snippets constructed only from one index?

The UI for field level permissions is pretty good already. Perhaps what's being discussed here is a contrib module that associates a "permission set" with a single role, and then the previously mentioned 'hidden field' would specify which actual Drupal role to use for the complete set of permissions associated with a given index. Each user would get only one permission set, not accumulated ones.

I bet this would solve most use cases, as most systems likely only have a few permission sets, and clearly defined indexes on a user by user basis.

pwolanin’s picture

I think there is some overlap with #551278: CCK mappings don't respect shared fields and #751004: Flag node for re-indexing if the 'exclude' setting is changed for one of its fields since at the least it's important that it be possible to exclude sensitive fields from the index.

Scott Reynolds’s picture

One possibility that I'd thought of was to have a separate search index, one for each kind of permission-set.

Ya this makes my skin crawl. You would have to index the same entity N times, where N is the number of 'permission sets' (by which I am assuming you mean roles).

Each user would get only one permission set, not accumulated ones.

But the Administrator would have the authenticated role. By what metric would you choose the permission set? User 1 generally doesn't have any roles because it's not needed for that user. Therefore, would user 1 not be able to use the special index?

nally’s picture

Thanks for the thoughtful reply. I'm soaking up everything said in this thread with great interest.

Re admin, each user has a setting that admin can play with to change which "permission set" (i.e. index) a given user uses... so, admin can set their index to be whichever they like.

Do you know whether DYM is done on an index by index basis?

nally’s picture

I need this to work, so I'm going to start writing something. I would really love some guidance, if there are opinions on how I should proceed.

This thread has discussed two options:

a) separate the CCK fields out and have them indexed as separate fields. then choose the 'field set' to search at search time based on accumulated permissions as recorded in the field level permission information. Pro: it works with the concept of "role", which is fundamental to Drupal and is also how the field level permissions work. Con: it's not yet clear whether DYM would work with this technique. (Is DYM done after the search results are assembled?)

b) build an index for each "permission set", and create a setting for every individual user to designate which index they will use. Only admin would get to set that field. (There is probably a little logic to figure out that would enable the system to set a default permission set based on role assignments, or something like that.)

Is it that the community wants one contrib module that attempts to go down each road? Would it even be possible to do a) without major changes to the main Solr integration modules? (if that's the case, then I need to start down road b) for that reason alone)

Does anyone have any warnings or advice before I dig in? (I've got a serious itch to scratch on this.)

pwolanin’s picture

Well, my suggestion is that we should start by rendering the node according to what an anonymous viewer can see. This is in line with what the nodeaccess module does already.

This would go in the apachesolr module, and would at least minimize information disclosure problems. It would also be a basis where you could add more fields to match (or show in the snippet) on top in your own sub-module.
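
A minimal Python sketch of that idea follows: temporarily switch the rendering context to anonymous while building the indexed body, then restore the real account. This is not the patch itself; the `Session` class, role names, and field-to-role mapping are all hypothetical stand-ins for whatever the real code does.

```python
from contextlib import contextmanager
from typing import Dict, Iterator, Optional, Set


class Session:
    """Holds the roles that rendering code treats as the current user's."""

    def __init__(self, roles: Set[str]) -> None:
        self.roles = roles

    @contextmanager
    def as_anonymous(self) -> Iterator[None]:
        # Swap in the anonymous role set and always restore the original,
        # even if rendering raises.
        saved = self.roles
        self.roles = {"anonymous user"}
        try:
            yield
        finally:
            self.roles = saved


# Hypothetical field permissions: field name -> role required to view it.
FIELD_ROLE: Dict[str, Optional[str]] = {"field_summary": None, "field_notes": "editor"}


def render_body(session: Session, fields: Dict[str, str]) -> str:
    """Concatenate the fields that the session's current roles may view."""
    visible = [
        text
        for name, text in fields.items()
        if FIELD_ROLE.get(name) is None or FIELD_ROLE[name] in session.roles
    ]
    return " ".join(visible)


session = Session({"authenticated user", "editor"})  # e.g. whoever triggered cron
fields = {"field_summary": "Public summary", "field_notes": "Internal only"}

with session.as_anonymous():
    indexed_body = render_body(session, fields)  # what would go into the index

editor_body = render_body(session, fields)  # normal rendering afterwards
```

The point of the context manager is that the indexed text no longer depends on who happened to trigger the indexing run.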

janusman’s picture

I agree with #31. This way the spellchecker would have the least-sensitive (although maybe not complete) set of tokens.

nally’s picture

With regard to the spell-checker... is the spell checker done on an "index-wide" basis? Or is it done across all indexes?

I ask, only because it appears that I'm definitely going to need more than one index.

(It also appears that since the UI is pretty much set up to run single indexes, I may need to go with a multi-core / multi-site set up if I want to admin more than one index through the UI, but would still like to set which index is being used on a user by user basis. Perhaps each multi-core/multi-site would deduce which index gets used by looking for roles starting from the least permissive, then working its way up to the most permissive.)

nally’s picture

And with respect to multi-core, multi-site solutions... can you envision a set of sites that share all tables except the ApacheSolr settings? Would that be a good way to get one UI per index?

robertdouglass’s picture

@nally - I *have* to work on this starting in August. I'll join your effort then. It's good that you're getting a head start.

pwolanin’s picture

Version: 6.x-2.x-dev » 6.x-1.x-dev
Status: Active » Needs review
Attached file (new): 897 bytes

as above I really think most of this needs to be a contrib module for 6.x

I suggest something like this patch as all that we should do for the main project.

nally’s picture

thanks @pwolanin i'll give this a test.

Having had yet another think on the issue, I'm running up against how much effort is required to maintain more than one index on a "single" website.

Further, I've started to peel back the possibilities with the DYM association, and I've got an argument put together that suggests that some or many clients might not care about this.

It goes something like this:

To create a first concrete example (there will be a second and third), let's take three nodes, with fields Title and Hidden-Field

Title: First Node
Hidden-Field: This is a secret.

Title: Second Node
Hidden-Field: This is a secrett.

Title: Third Node
Hidden-Field: This is also a secrett.

Let's say the anonymous user searches for the word secret.

In this (highly manufactured) case, the user's search would not return any results because there are no nodes that they have access to with a field they can search against that has the word 'secret' in it. But if DYM is enabled, the system might say "Did you mean 'secrett' ?" because there are actually more incorrect spellings of the word secret with an extra t. The user ends up learning that there are entries with the word spelled secrett, and this would divulge more than they were entitled to know.

That said, the real way around such a scenario is for the admins to get better at spelling, so let's investigate another case, where it's spelled properly in the data, but the user spells it incorrectly.

Title: Alpha Node
Hidden-Field: This is top secret.

Title: Beta Node
Hidden-Field: This is top secret.

Title: Gamma Node
Hidden-Field: This is also top secret.

In this case, if the user searches for 'secrett', the system will say "Did you mean 'secret'?" thereby divulging that the system has indexed data with the word 'secret' in it, even though the user doesn't have access to any fields with the word secret.

That might be an issue for words where many were held in hidden fields, with NO values for the user to find in the accessible fields.

However, by adding just one node to the above case, the system appears to divulge much less about those hidden fields.

Title: Alpha Node
Hidden-Field: This is top secret.

Title: Beta Node
Hidden-Field: This is top secret.

Title: Gamma Node
Hidden-Field: This is also top secret.

Title: This is not secret.
Hidden-Field: This is not a secret.

In this last case, if the anonymous user searches for 'secrett', and the system responds with "Did you mean 'secret'?" the user is pointed in the direction of a search for 'secret' which would return one node with the title "This is not secret", and the user hasn't really learned anything.

So... for a case where there is a lot of public data containing the words that we worry the user will learn about through crafty use of DYM, the risk of giving away information is reduced when those words appear many times in accessible fields. There's also the possibility that the admin configured the spellings.txt file to have words in it that don't appear in the data, and that could explain any empty results that the user sees.
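
The second example above can be sketched in a few lines of Python. This is a toy stand-in for DYM, not how Solr's spellchecker works internally: it uses `difflib.get_close_matches` as the similarity measure, and the function names and the 0.8 cutoff are assumptions chosen for the illustration.

```python
import re
from difflib import get_close_matches
from typing import Iterable, Optional, Set


def vocabulary(texts: Iterable[str]) -> Set[str]:
    """Lowercased word set from the texts that feed the spelling index."""
    words: Set[str] = set()
    for text in texts:
        words.update(re.findall(r"[a-z]+", text.lower()))
    return words


def did_you_mean(query: str, vocab: Set[str]) -> Optional[str]:
    """Return a close indexed word for a query word not in the index, if any."""
    if query in vocab:
        return None
    matches = get_close_matches(query, sorted(vocab), n=1, cutoff=0.8)
    return matches[0] if matches else None


titles = ["Alpha Node", "Beta Node", "Gamma Node"]
hidden = ["This is top secret.", "This is top secret.", "This is also top secret."]

# Spelling index built from everything, including the hidden fields.
leaky = did_you_mean("secrett", vocabulary(titles + hidden))
# Spelling index built only from what an anonymous user can see.
safe = did_you_mean("secrett", vocabulary(titles))
```

In this toy setup the first call suggests "secret" and so reveals that the word is indexed somewhere, while the second call has nothing to suggest.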

For the particular project I'm working on, it might be the case that discovering the existence of words in fields they don't have access to isn't a huge security risk. In such a case, it might turn out that the best solution for this project is to include all fields in a single index and suppress fields the user doesn't have access to on a query by query basis.

(In that regard, I do have a question that's more of an aside: when suppressing fields on a query-by-query basis, is there a significant performance hit? Or is it that there is something like a field bit-mask applied on every query anyway?)

It might be that the above patch should have a configuration parameter wrapped around it, so that the administrator would get to choose whether the system renders nodes as anonymous or admin. For the case where it's rendered as admin, admins could choose whether to deploy a contrib module that suppressed hidden fields on a query-by-query basis (which works better with roles and field level permissions, I think). It sounds like the ultimate-ultimate-ultimate solution for cases where DYM must be included and absolutely nothing is given away about the index unless you've got permissions would be one where DYM is permissions sensitive or multiple indexes are run.

Can anyone say whether field suppression on a query-by-query basis is a performance hit?

pwolanin’s picture

@nally - please study the indexing process in more detail. Your suggestions don't really make sense - we should always render as anonymous since all fields visible are added to the node body in the index and can be searched by everyone.

janusman’s picture

Quick review of patch in #36... looks fine by me. Thinking #751004: Flag node for re-indexing if the 'exclude' setting is changed for one of its fields should also get in ASAP so that fields are handled as expected, then along with this we'll get a sane (although maybe not all-encompassing) solution =)

jpmckinney’s picture

Status: Needs review » Reviewed & tested by the community

Integration with content_permissions should go in a contrib module. It is sufficient for the apachesolr module to render everything as the anonymous user.

pwolanin’s picture

Status: Reviewed & tested by the community » Needs review
Attached file (new): 832 bytes

Maybe we should actually put this code further up in the code path? That way all modules using the apachesolr framework get the fix.

pwolanin’s picture

Title: Solr should respect field-level permissions » Always index content as anonymous (so Solr can respect field-level permissions)

better title

kaakuu’s picture

Subscribed

pwolanin’s picture

Version: 6.x-1.x-dev » 7.x-1.x-dev
Status: Needs review » Patch (to be ported)

committed this to 6.x-1.x - patch applied with fuzz to 6.x-2.x, so applied there too.

note, in the schema.xml we have:

  <!-- This field is used to build the spellchecker index -->
  <field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>

  <copyField source="title" dest="spell"/>
  <copyField source="body" dest="spell"/>

so only the title and (rendered) node body go into the spell index.
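
To spell out the consequence, here is a small Python simulation of those two copyField rules. It only mimics the effect of the schema fragment above for illustration; the `ts_cck_field_notes` field name and the helper function are hypothetical.

```python
from typing import Dict, List, Tuple

# Mirrors the two copyField rules quoted from schema.xml: (source, dest).
COPY_FIELDS: List[Tuple[str, str]] = [("title", "spell"), ("body", "spell")]


def apply_copy_fields(doc: Dict[str, str]) -> Dict[str, object]:
    """Return the document plus a multi-valued dest field per copy rule."""
    out: Dict[str, object] = dict(doc)
    for source, dest in COPY_FIELDS:
        if source in doc:
            out.setdefault(dest, []).append(doc[source])
    return out


doc = {
    "title": "Alpha Node",
    "body": "Public summary",  # rendered as the anonymous user
    "ts_cck_field_notes": "This is top secret.",  # hypothetical separate field
}
indexed = apply_copy_fields(doc)
spell_values = indexed["spell"]
```

Because no rule copies the separately indexed field, its words never reach the spelling field, so with the body rendered as anonymous the suggestions can only draw on public text.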

pwolanin’s picture

Status: Patch (to be ported) » Fixed
Attached file (new): 1.42 KB

Also fixed 7.x (HEAD) with this patch, and 5.x-2.x with the patch in #41.

note function name change in 7.x.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.