Comments

David Lesieur’s picture

A brute-force solution that was mentioned in today's BoF is to have Solr return a list of nids, then query the node table with those, giving an opportunity to invoke all hook_db_rewrite_sql() implementations and check node access. However, this behavior would have to be optional, since it would incur a significant performance loss, and not all installations would need it.
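A rough sketch of that brute-force check (Drupal 5/6 style; the function name here is made up for illustration, not from any patch):

```php
<?php

/**
 * Illustrative only: re-check a list of Solr-returned nids against the node
 * table so that hook_db_rewrite_sql() implementations (i.e. node access
 * modules) get a chance to filter them out.
 */
function example_filter_solr_nids($nids) {
  $placeholders = implode(',', array_fill(0, count($nids), '%d'));
  $sql = "SELECT n.nid FROM {node} n WHERE n.status = 1 AND n.nid IN ($placeholders)";
  // db_rewrite_sql() invokes hook_db_rewrite_sql(), which is how node
  // access modules restrict queries in Drupal 5/6.
  $result = db_query(db_rewrite_sql($sql), $nids);
  $allowed = array();
  while ($row = db_fetch_object($result)) {
    $allowed[] = $row->nid;
  }
  return $allowed;
}
```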

JacobSingh’s picture

Sorry if this question sounds ignorant, but is it possible to provide access controls at the indexing phase? That is, the roles allowed to access a node would be provided as parameters along with the content, so results could be filtered on them at search time.

This, of course, does not allow for other types of node access controls, but perhaps it would be a good start.

robertdouglass’s picture

Role-based access control would be easy to implement at indexing time. Whether node-based access could be implemented in the same way is the real issue here.

I think the solution is something like the following:

1. Ask Solr for a complete result set (i.e. all the nodes that match).
2. Filter this result set based on a query against node_access.
3. Build the pager using a simple query, as is already done in the ApacheSolr module: db_query("SELECT 312"), where 312 is the number of nodes left after #2.

It's a bit heavy-handed, but I think it will work.

janusman’s picture

A similar issue is stickiness of nodes. "Sticky" items should bubble to the top of search results (IMO).

Is this a separate item? Or merge into this discussion? (I say merge, it has to do with checking node attributes like permissions, stickiness, and probably others to affect search results).

JacobSingh’s picture

Janusman:

Yes, this is really about result biasing. A very important topic indeed, but unrelated to the permissions discussion.

aaron1234nz’s picture

Hi Robert,

I can't think of a better way of doing this either. I have a couple of projects that need this kind of functionality: one using standard Drupal access controls, and one with custom access controls. I'll have a go at crafting some code, then let you know how I get on.

Aaron

aaron1234nz’s picture

Version: 5.x-1.x-dev » 6.x-1.0-alpha3
Status: Active » Needs review
File: new, 2.44 KB

I've made a patch that uses Robert's suggestion of doing access control by running the nodes of each search result through node_access().
The basic flow is:
1. Query Solr, asking for only the nid and type fields.
2. Build a thin node object (containing only the nid, type, and status fields).
3. Call node_access() on each of these objects.
4. Build a list of accessible nids that should be shown on the current search page, and decrease the $total variable by one for every node that is not accessible. There is also some logic to make sure the correct number of results is displayed on each page.
5. Perform the normal Solr search.
6. Using the list from 4, return only the accessible results.

Unfortunately this patch does not correct the numbers for faceting.
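As a hedged sketch (not the patch itself; function and field names are illustrative), steps 2-4 amount to something like:

```php
<?php

/**
 * Illustrative only: build thin node objects from Solr documents and keep
 * only the nids the current user may view.
 */
function example_accessible_nids($documents) {
  $allowed = array();
  foreach ($documents as $doc) {
    // Thin node object: nid, type and status are enough for node_access().
    $node = new stdClass();
    $node->nid = $doc->nid;
    $node->type = $doc->type;
    $node->status = 1;
    if (node_access('view', $node)) {
      $allowed[] = $node->nid;
    }
  }
  return $allowed;
}
```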

aaron1234nz’s picture

Version: 6.x-1.0-alpha3 » 5.x-1.0-alpha3

robertdouglass’s picture

The other approach would be to completely re-implement node access in the Solr index.

Here's the approach:

First, the node_access table, for reference:

mysql> describe node_access;
+--------------+---------------------+------+-----+---------+-------+
| Field        | Type                | Null | Key | Default | Extra |
+--------------+---------------------+------+-----+---------+-------+
| nid          | int(10) unsigned    | NO   | PRI | 0       |       | 
| gid          | int(10) unsigned    | NO   | PRI | 0       |       | 
| realm        | varchar(255)        | NO   | PRI |         |       | 
| grant_view   | tinyint(3) unsigned | NO   |     | 0       |       | 
| grant_update | tinyint(3) unsigned | NO   |     | 0       |       | 
| grant_delete | tinyint(3) unsigned | NO   |     | 0       |       | 
+--------------+---------------------+------+-----+---------+-------+

At index time we'd add multivalue fields. They'd be dynamic fields so that grants could be added. Let's say this is what is in the node_access table:

mysql> select * from node_access;
+-----+-----+--------+------------+--------------+--------------+
| nid | gid | realm  | grant_view | grant_update | grant_delete |
+-----+-----+--------+------------+--------------+--------------+
|   2 |   5 | groups |          1 |            0 |            1 | 
|   2 |   6 | groups |          1 |            1 |            0 | 
+-----+-----+--------+------------+--------------+--------------+

This basically says that for node 2, gids 5 and 6 are granted viewing rights (among others) by the 'groups' realm. This would translate into the following field when indexing node #2:
- nodeaccess_groups = {5,6}

Note that the nid is already known by Solr.

Note also that we only have to concern ourselves with the view grant because Solr doesn't afford the opportunity to update or delete nodes.

Then, at search time, we could programmatically enforce node access. Let's say user 5 is doing a search. Before we send the query to Solr, we add this bit:

+nodeaccess_groups:5
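A sketch of the indexing side of this idea (hook and method names follow the apachesolr module's conventions, but this is illustrative, not a patch):

```php
<?php

/**
 * Illustrative only: copy each view grant from {node_access} into a dynamic
 * multi-valued Solr field, e.g. nodeaccess_groups = {5, 6} for node 2.
 */
function example_apachesolr_update_index(&$document, $node) {
  $result = db_query("SELECT gid, realm FROM {node_access} WHERE nid = %d AND grant_view = 1", $node->nid);
  while ($grant = db_fetch_object($result)) {
    // One dynamic field per realm; schema.xml would declare nodeaccess_*.
    $document->setMultiValue('nodeaccess_' . $grant->realm, $grant->gid);
  }
}
```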

robertdouglass’s picture

Priority: Normal » Critical
aaron1234nz’s picture

This is a good suggestion. It would certainly make searching faster, as only one query needs to be performed. It would also get around the issue of the facet results returning incorrect numbers (as per my patch above). The dynamic field idea is really clever.

However, I have a couple of concerns about this approach:
* Every time the permissions are changed for a node, the appropriate index entries would need to be rebuilt (I don't know if there is a way of detecting this - I'm not too familiar with the node access system).
* Permission changes would not take effect until the next cron run (or runs, depending on the number of nodes affected).
* When node_access_rebuild() is called, the entire index would have to be completely rebuilt.
* I'm unsure if this approach is compatible with the dismax query handler (it might be if you use fq).

Thoughts??

robertdouglass’s picture

@aaron1234nz: Very good points, all of them. I think it means we need to support an instant indexing mechanism for when node access is in use to avoid waiting for cron. The other side effects you mention are all completely true and need to be documented so that people who want to use Solr in this way know what they're dealing with. Can you elaborate on the last point some?

I'm unsure if this approach is compatible with the dismax query handler (it might be if you use fq)

JacobSingh’s picture

I realize this is a separate issue, but it is something I've been thinking about on node deletes and it applies here.

As much as possible, we should not rely on "on demand" index updates. Here is the scenario:

I unpublish a node or change access. The Solr server has a problem handling my request: maybe it has a locked file, maybe it is down, maybe there is a network issue. Perhaps another module uses nodeapi and redirects or errors out... I know it's not going to happen 95% of the time, but it's quite possible.

How can we then update the Solr index to reflect this change? There is no way to know that the operation failed (without looking in watchdog, and even then the error is totally ambiguous as to what the user was doing). The next time we index the content, it won't be removed.

I can only see a couple ways to solve this:

(along with all of these, we explicitly give the user an error message so they know)

1). We actually block the operation with a confirmation dialog if Solr rejects the request
- I realize this will require a menu or init hack and some hacking to figure out if solr is supposed to be acting to preempt node.module, and probably impossible.
2). We store a list of pending index updates which failed, and give the user a place to process the list in a batch mode.
3). We provide a "node sync" operation (which will be expensive and must be run in batches). This would just remove the variable entries for the last node updated and re-index from scratch - not using cron, but providing an interface to run it in an emergency (this would also be useful in the case of a schema change/reset).

Am I just being paranoid? I know it seems like work for something which is "not supposed to happen", however I feel that the consequences of an unavailable Solr server exposing unpublished or access-restricted content to the public, with NO way to fix it or even know about it, are pretty bad.

robertdouglass’s picture

@JacobSingh: good points. Off topic to the thread. Please open a new one or take the discussion to the thread that currently discusses indexing.

aaron1234nz’s picture

Sorry, I should have been more clear. If the request handler ('qt') is changed from "standard" to something handled by the dismax handler (e.g. partitioned), then you cannot add "nodeaccess_groups:5" to the query (q), as it is interpreted literally. I think you would need to do the node access filtering with a filter query (fq):

$params = array(
  'qt' => 'partitioned',
  'fq' => 'nodeaccess_groups:5',
  'fl' => '*,score',
  'rows' => $rows,
);

robertdouglass’s picture

@aaron1234nz: ah, now I'm tracking what you're saying. That shouldn't be an insurmountable obstacle to implementation, though, should it?

aaron1234nz’s picture

Hi Robert,

I'm kinda attached to method 1 (filtering the results using node_access), even though it's a bit inefficient. I've had some thoughts about how to correct the faceting numbers. I might craft some code over the next couple of days and try it out. Basically, what I plan to do is retrieve the tid field in the initial query, calculate an offset for the number of terms that belong to nodes where access is denied, and then subtract this offset from the returned number at rendering time.
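A minimal sketch of that offset correction (purely illustrative; it assumes we already know which returned documents the user may not access, each carrying its tid values):

```php
<?php

/**
 * Illustrative only: subtract, from each term's facet count, the number of
 * access-denied documents that carried that term.
 */
function example_adjust_term_counts(array $facet_counts, array $denied_docs) {
  foreach ($denied_docs as $doc) {
    foreach ($doc->tid as $tid) {
      // Only adjust terms that actually appear in the facet counts.
      if (isset($facet_counts[$tid])) {
        $facet_counts[$tid]--;
      }
    }
  }
  return $facet_counts;
}
```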

Three reasons for liking this method:
1. I am currently working on a website that does not use the standard access control because the permissions structure is too complex, so I can't store any of this information in the index. I think this is the only way forward for me.
2. After the initial query, I think the Solr cache might be primed, making the second query reasonably fast. I can't be sure until I've done some testing.
3. It avoids the need to repeatedly re-index. Depending on the site, it is conceivable that indexing will be a bigger bottleneck than searching.

Aaron

robertdouglass’s picture

Aaron, glad you're going to take this approach and I look forward to the code. This is a tough nut to crack and we will at least have to compare the methods. Cheers!

How much of your plan has changed since the patch in #7?

JacobSingh’s picture

Hi Aaron,

How are you going to fix facet counts for all types of facets? What you described might work for terms, but then this logic (and pretty massive overhead) has to go into every facet used on the site, right?

Not saying it won't work, I'm just not understanding probably.

Best,
Jacob

robertdouglass’s picture

The node_access problem really really sucks. Really.

Node access is a two-step problem, and my proposed solution only addresses one part. The first part is the module-based hook_access invocation:

  // Can't use node_invoke(), because the access hook takes the $op parameter
  // before the $node parameter.
  $module = node_get_types('module', $node);
  if ($module == 'node') {
    $module = 'node_content'; // Avoid function name collisions.
  }
  $access = module_invoke($module, 'access', $op, $node, $account);
  if (!is_null($access)) {
    return $access;
  }

Only if all the hook_access calls return NULL for a node will the node_access SQL table be queried, and that's the part I was designing to solve.

I don't see any way we can guarantee node access rights and maintain pager and facet counts. We're in "good" company, though: core's search module doesn't do this either. Its node access is simply broken.

aaron1234nz’s picture

StatusFileSize
new4.89 KB

I agree, this problem sucks; mind you, I don't know of many other search engines that do access control. I've just been looking at Nutch, and it's got the same problem. People have suggested taking the approach you suggested in #9.

I have a patch that now takes care of correcting the faceting numbers. I've also put some microtime() statements in the code so we can measure the performance hit. With my test database of 50 nodes, I get a performance hit ranging from -2% to 140%, but mostly ~20%.

The approach is basically the same as in #7, except that I return all fields for which facets are requested, and process them accordingly with lots of foreach loops and array comparisons.

I expect there is a bit of cleanup/optimization that can be done (by someone who is more into algorithms than I), but the code does work.
Using some sort of short-term cache might also be useful in upping the performance.

janusman’s picture

Since this is a brainstorm... I'll go ahead with some musings =)

How about precalculating node_access('view', $node) and adding the values to each node in Solr (integer field, multi-valued)?

Precalculate:

  • per-user (exact, but takes more time, more index space in solr). We would add the uids that have access to the node being added to Solr, and at query time add the current uid to the query (access_uid:XX).
  • per-role (faster, would be all that's needed for simple sites). As above, but with role ids.

It might turn out during the per-user precalc (testing access control for all nodes, for all users) that some (roles?) or all users share the same access. Or perhaps we could look at whether hook_access functions are present in the installed modules.

This means we could have some algorithm simplify (reduce?) to per-role, or to "none required". The admin could have the final decision as to which access control method to enforce -- perhaps choosing the less exact (but quicker?) one. And since we are not showing full nodes in search results, it might be *ok* for access control to be less strict...

robertdouglass’s picture

@janusman: I think we'd run into indexing problems if we had to run node_access() $user_count times. It's scary to me. node_access() is also user-specific, so there is no way we could accurately do a role-based call on it.

puregin’s picture

Could we not enable document level access control in Solr (in Java) using the SolrRequestHandler framework? Our handler could call back to Drupal to do actual checking of permissions.

puregin’s picture

The point of this being that many different policies could be implemented - user-based, group-based, ACL, role, etc. - each of which could be implemented on the Drupal side.

aaron1234nz’s picture

@puregin

Interesting thought. I'd not even contemplated that request handlers could be used for this purpose until now.

I think this is an avenue worth exploring further. I don't know the internals of Solr, so I'm not sure how this would work.
Here are a few questions that come to mind, though:
1. What would be the benefit of determining access on the solr side vs the drupal side?
2. Would the handler need to hit the drupal database to determine access, if so what would the likely performance hit be?
3. Can a new handler be implemented without recompiling solr?
4. Could handlers be chained together? eg, can the dismax and standard handlers still be used?

JacobSingh’s picture

Responses to questions above:

1. What would be the benefit of determining access on the solr side vs the drupal side?

The implementation would not have to reside in the index, but rather could be coded in the response handler. This would allow the index to remain "pure", and therefore make changing perms easier and allow for more complex rules.

2. Would the handler need to hit the drupal database to determine access, if so what would the likely performance hit be?

Yes; very high, IMO. I don't see how we pull this off without basically getting a list of all nodes the user can access.

3. Can a new handler be implemented without recompiling solr?

Yes - it would require adding a new class and re-bundling, then modifying solrconfig.xml to use the handler.

4. Could handlers be chained together? eg, can the dismax and standard handlers still be used?
Not 100% sure, I think so.

I have a couple of objections to doing this on the Java side:

1. It would be code we would have to maintain, and this is a community of PHP developers.
2. How will we find out the access rights of the user? We would need Solr to call back to Drupal via XML-RPC or something... this sounds brittle.
3. It makes installation a lot more complicated.

My vote is that we ditch this idea here. If someone feels really compelled to go ahead anyway, I'll review it, but I have my doubts of this working in the long term.

JacobSingh’s picture

I've looked at the patch in #21, and I don't think it is very practical. Here are my reasons:

1. If the index actually contains 100,000 nodes and the Solr server has any latency with the site, fetching every record to test for facets will make the thing crawl.
2. If there are too many facets, or too many facet values (I work with one index containing 4,000 terms), it will be murder on the server. There was an earlier patch (I don't recall the number) to change the facet count from -1 (all) to 200, and it gave a massive performance boost. If Solr has to compute counts for 4,000 facet values and return a dataset with 250,000 nodes (I have such a client), that's going to be a massive hit. And I don't think extreme examples like that are even necessary to make the point.

My KISS proposal

I don't think we need to support every complex access configuration out there. As Robert mentioned, Drupal core search is totally broken in this regard, and our 1.0 goal is just to replicate Drupal search using Solr, IIRC. So to that end, I propose we start with role-based access: just add an rid_access field to the schema, and then query against it if node_access is enabled. This will be fast, easy to code, easy to test, and will cover 95% of use cases.
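A hedged sketch of the rid_access idea (the field name comes from the comment above; the helper and the query-side parameter handling are assumptions, not verified against the module):

```php
<?php

/**
 * Illustrative only: index the role IDs allowed to view the node, then
 * filter every search on the current user's roles.
 */
function example_index_node_roles(&$document, $node) {
  // example_viewing_roles() is a made-up helper that would return the rids
  // allowed to view $node.
  foreach (example_viewing_roles($node) as $rid) {
    $document->setMultiValue('rid_access', $rid);
  }
}

function example_filter_query_by_role(&$params) {
  global $user;
  // e.g. fq=rid_access:(2 OR 3) for an authenticated user with role 3.
  $params['fq'][] = 'rid_access:(' . implode(' OR ', array_keys($user->roles)) . ')';
}
```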

Is everyone cool with that?

Best,
Jacob

janusman’s picture

I agree.

Perhaps alternative approaches might help, like offering an option to filter results post-search, checking permissions for just those N nodes shown in a search results page. Again: simple, doable, and helpful. Although perhaps cheating/lying/scamming a bit =)

aufumy’s picture

If this approach solves this issue: #255796: Solr + organic groups (og), then I am all for the KISS approach.

drunken monkey’s picture

Perhaps alternative approaches might help, like offering an option to filter results post-search, checking permissions for just those N nodes shown in a search results page.

But then, if permissions were denied for some nodes, there would be different numbers of results per page, possibly even some pages with no results at all. Plus the facet counts would be wrong, too.

Personally I'd go with the simple approach, it's still way better than the status quo.

moshe weitzman’s picture

Interesting discussion. Glad I decided to stroll in here.

The KISS suggestion in #28 does not achieve the goal of replicating core search. Core search supports all sorts of interesting node access schemes, such as 'by organic group'. By-role is perhaps not even the most popular method of controlling view access. It is OK to go ahead with mere role-based support, but let's not say that we replicate core search.

I'm pretty intrigued by #9. I think the obstacles there are surmountable (i admit i know little about solr), and it would be a perfect match with core node access.

aaron1234nz’s picture

Perhaps alternative approaches might help, like offering an option to filter results post-search, checking permissions for just those N nodes shown in a search results page.

This is similar to what I proposed in #7; however, the issue there was that the facets reported incorrectly.

I agree that my patch in #21 would be slow but it does cover all the bases. I'm voting for #28

JacobSingh’s picture

Okay,

Here is the patch to implement #9. I had a really bad time getting SimpleTest 6.2.5 to behave, so I hope everything still works as it did before the test. This patch includes:

1. The apachesolr_nodeaccess module.

This module implements the apachesolr_update_index and apachesolr_modify_query hooks. It takes the grants offered in the node_access table and puts them into a field in Solr. This new field is a dynamic field prefixed with nodeaccess_.

2. The new schema.xml which provides the nodeaccess_* field. Remember, you will need to remove your index directory and re-index if you change your schema!

3. A unittest which:

  • Tests creating a node, assigning role- and author-based perms to it, and ensuring that the Document object has the proper grants.
  • Tests that the function that modifies the query based on the user's roles adds the expected access controls.

Please give this a shot! It is a crucial feature and needs many eyes. If someone has a more complex setup using og or something else, I would really appreciate your feedback.

drunken monkey’s picture

The patch didn't attach.

JacobSingh’s picture

File: new, 9.58 KB

D'oh!

Just saw this, sorry.

pwolanin’s picture

Version: 5.x-1.0-alpha3 » 6.x-1.x-dev
File: new, 9.06 KB

Since this seems pretty central, why have it under /contrib? Since I don't have CVS access to the project, here's an svn diff (hope this works).

There was some dead code in the patch, AFAICT. Re-rolled here with some minor code cleanup, and the rebuild now only occurs on submit of the confirm form.

Important note: You need this patch for nodeaccess.module to make the test work: http://drupal.org/node/323977#comment-1090923

pwolanin’s picture

File: new, 9.96 KB

Jacob and I discussed the problem today of global grants being changed without this module knowing about it.

Here's a new patch with an implementation of hook_node_access_records() which tries to address this concern. This is not really tested yet.

In general, maybe for 6.x Solr should use a custom table to track which nodes need to be re-indexed, rather than running the horrifically slow queries in apachesolr_search_search() and ApacheSolrUpdate::getNodesToIndex(). Basically, the module could use something like the schema and code from tracker2. In that case, the timestamp in the table could be touched without actually changing the {node} table.
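For illustration, the tracking table described above could be as small as this (table and column names are made up; Drupal 6 Schema API):

```php
<?php

/**
 * Illustrative only: one row per node that needs (re-)indexing; touch the
 * timestamp instead of scanning {node} with slow queries.
 */
function example_schema() {
  return array(
    'example_solr_queue' => array(
      'fields' => array(
        'nid' => array('type' => 'int', 'unsigned' => TRUE, 'not null' => TRUE),
        'changed' => array('type' => 'int', 'not null' => TRUE, 'default' => 0),
      ),
      'primary key' => array('nid'),
    ),
  );
}
```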

robertdouglass’s picture

pwolanin, I really like the idea of ditching the slow query to figure out what needs updating. The indexing speed of ApacheSolr is unacceptably slow (and this has nothing to do with the Solr part of it), so that would be another benefit.

aufumy’s picture

* Downloaded the latest CVS checkout (DRUPAL-6--1) of the apachesolr module.
* Patched with http://drupal.org/node/330079.
* Patched with the above patch (or edited schema.xml and created the apachesolr_nodeaccess module).
* Enabled the apachesolr_nodeaccess module.

On a Drupal site with Organic Groups installed, the super admin saw all search results. A user that belonged to one organic group but not another saw no search results, even though there were discussion and group results that user should have seen.

JacobSingh’s picture

@aufumy:

can you confirm that simple role and author based node access (per content type) works for you?

This will help us rule out an environment issue. Either way, I'll look into tomorrow.

Best,
Jacob

aufumy’s picture

Hi Jacob

afaics, there is no simple view permission per content type - only create, edit, and delete.

So here are my steps to test create permissions:
* Create roles 'can create' and 'cannot create'.
* Take the permissions away from the authenticated user and place them in the 'can create' role.
* Leave 'cannot create' empty.
* Change user 4 to the 'cannot create' role.
* Log in as user 4 and browse to '/node/add/discussion'; it shows access denied.
* Double-check by changing user 4's role to 'can create'.
* Log in again as user 4 and browse to 'node/add/discussion'; it shows the create discussion node form.

Let me know if this is helpful.

Thanks
Audrey

aufumy’s picture

Apologies, it appears to be working after all. I thought I had updated schema.xml, but had not. So I re-indexed apachesolr and ran cron.php,

and using the http://localhost:8983 interface I can see results for nodeaccess_og_public:0.

The first test with the admin user versus a user belonging to some groups looks good. The user sees the discussions they have access to, and sees the group they are not part of, but does not see that group's discussions.

Thanks
Audrey

JacobSingh’s picture

Status: Needs review » Patch (to be ported)

Committed to DRUPAL-6.

Needs a D5 port. janusman, are you interested?

aufumy’s picture

On CVS, I noticed that it has been committed at the same directory level as contrib, rather than under contrib.

I am curious about the reason for that.

Thanks
Audrey

pwolanin’s picture

@aufumy - Robert expressed a preference for moving all sub-modules to directories at the top level, so this is the first such and the others will likely follow.

Note also, this patch left an outstanding bug that Jacob found - see: http://drupal.org/node/332971

janusman’s picture

@JacobSingh : re: porting to D5, my "job" (I think!) was to check that branches are in sync, not actually backport stuff =)

For now I'm swamped so I'd move for the original patch submitter or another volunteer to actually backport it.

PS: Not that I've had a good chance to actually do said job for a while =)

robertdouglass’s picture

Ok, looking for a volunteer.

robertdouglass’s picture

@pwolanin: Actually, I don't remember expressing that preference (wrt top level vs. the contrib directory). In any case, I've moved the nodeaccess module to the contrib directory for now so that it's all tidy.

pwolanin’s picture

Status: Patch (to be ported) » Closed (fixed)

damien_vancouver’s picture

Version: 6.x-1.x-dev » 5.x-1.x-dev
Assigned: Unassigned » damien_vancouver
Status: Closed (fixed) » Patch (to be ported)

I will volunteer to backport this to Drupal 5, I need it for a production rollout ASAP.

damien_vancouver’s picture

Status: Patch (to be ported) » Needs review
File: new, 11.65 KB

Here is the Drupal 5 port of the node_access patch from #38. It's working for me to protect private Organic Group content.

No real surprises backporting it; I had to update a few Drupal API calls where they differ between 5 and 6. I had more trouble with the SimpleTest... I did my best getting it to work, but it is still failing a couple of the tests, and I'm not really sure why.

The following test plan should work (I was testing with Organic Groups DRUPAL-5--3):
1. apply patch to 5.x-1.x-dev apachesolr module.
2. Turn on Apache Solr Node Access in admin/build/modules. Also ensure that og and og access control are turned on and configured.
3. Create two private groups, Group1 and Group2.
4. Create two users, User1 and User2. Place User1 in Group1, and User2 in Group 2.
5. Create three posts, one visible in Group1, one visible in Group2, and one that is public and not in a group. Use three different words in each of the three posts (like apple, grape, orange or something easy to search for).
6. Run cron.php to get the documents into the Solr index, then revisit admin/settings/search and ensure the site is re-indexed.
7. login as User1. You should be able to search for and find the post in your Group1, and the public post only when searching for the keywords.
8. Login as User2. The same behavior should occur, but with the opposite group post.
9. Login as admin again. You should be able to search for and find all three posts.

There's only one spot where I'm not sure if I did a bad thing. In hook_node_access_records():

function apachesolr_nodeaccess_node_access_records($node) {
  // node_access_needs_rebuild() will usually be TRUE during a
  // full rebuild.
  if (empty($node->apachesolr_nodeaccess_ignore) && !node_access_needs_rebuild()) {
    // Only one node is being changed - mark for re-indexing.
    apachesolr_mark_node($node->nid);
  }
}

I removed the !node_access_needs_rebuild() call, as this function doesn't exist in Drupal 5.
My question is: how can we detect the index rebuild in Drupal 5? Do we have to? I'm not really clear on why this is in the code, but I'm sure there must be a good reason :-P If someone could provide a hint, that would be good.

Otherwise, if someone could try it on a Drupal 5 site, that would also be good! Tomorrow I'll promote it off my development machine onto a near-production staging site with lots of content and groups, and test it some more.

robertdouglass’s picture

Thanks for this Damien, I'll get the word out to others who can look.

damien_vancouver’s picture

File: new, 12.01 KB

OK I tested this out on a real site tonight with lots of content, and most importantly a bunch of attachments indexed in ApacheSolr.

I quickly discovered that attachment content was no longer returned in searches (except for group admins). This is because apachesolr_attachments fires a different update hook, which also includes a reference to the $file object. This hook is named hook_apachesolr_attachments_update_index (as opposed to the main module's hook_apachesolr_update_index). So the extra nodeaccess fields weren't being indexed on the attachment documents in Solr... making them invisible to searches.

Adding a wrapper function that just calls the existing apachesolr_update_index implementation solves the problem. Indexed data inside attachments then returns as one might expect, according to node access permissions.

This problem also exists in the D6 version, but is solved with the exact same addition - to the top of apachesolr_nodeaccess.module:


/**
 * Implementation of hook_apachesolr_attachments_update_index().
 */
function apachesolr_nodeaccess_apachesolr_attachments_update_index(&$document, $node, &$file) {
  // A wrapper around the regular hook_apachesolr_update_index() implementation.
  return apachesolr_nodeaccess_apachesolr_update_index($document, $node);
}

The attached patch is the same as #52's but with this fix. The test plan from #52 should extend to attachment contents.

Should I make an issue /patch for D6, or is it easier for someone to just paste in the code above?

JacobSingh’s picture

The attachments module is broken in other ways AFAIK.

I recommend you put this in a separate issue which is specifically about the attachments module as it is sorta being maintained separately.

You could try to follow up with Frank Febraro about it as well.

damien_vancouver’s picture

File: new, 56.32 KB

OK, I'll file it as a separate issue against the D6 branch.

Broken, you say? apachesolr_attachments is working awesomely for me - instant search results from the bodies of thousands of existing attachments on a client site... it totally rocks my world (and the customer's!) - see attached screenshot.

pwolanin’s picture

@damien - it works if you are smart enough to configure it correctly with the helper apps, but many users were not, and were flooding their Solr servers with endless delete requests since they could never get the attachments to index right. Basically, that module needs more error checking before it starts to operate, and so it is not really suitable for general use at the moment.

In any case, we want to upgrade it to use Tika, to avoid the need for helper apps.

damien_vancouver’s picture

@pwolanin, OK, that makes sense... I got my apachesolr attachments stuff working by adding lots of extra debugging and a bit of help from the maintainers :) For apachesolr_attachments, I was actually going to propose a patch that changes the helper apps to be in better default locations (i.e. where the proper deb or rpm packages put them on popular modern Linux distributions), and provides some extra instructions on that page (including the install commands and correct package names for apt-get on debian/ubuntu or yum on redhat/centos/fc).

I think that with some simple instructions on that module config page, people could get it to work. The hardest part of getting this module production-ready was the lack of good installation docs. That's no surprise given the early, active development, but now that the 5.x branch is falling well behind, it would be nice to stabilize it for people like me who still run production clients on Drupal 5 and want a stable-enough version that isn't going to change.

I took some notes during my own installation odyssey and would be happy to help revamp the docs (at least for the 5.x version, which is likely to stay as it is from what I can tell). I think that with this node_access fix, working attachments, the changes to make that part more intuitive, and better install docs overall (including how to set up Solr in Tomcat, again with simple step-by-step commands for Debian/Ubuntu and Red Hat/CentOS/Fedora), the 5.x branch would have all it needs to call a release. Meanwhile, new features and such can continue to go into the 6.x branch; other than security fixes and really missing features like this node_access one, 5.x-1.0 is good enough and done, IMO. Anyway, I've proposed some of this to Robert in PMs and plan to get around to actually making the patches after I'm fully deployed and everything is working. We just did another big test deployment and I'm at T-2 days or so, so it won't be much longer :)

While I've already hijacked this thread a bit to talk about general module stuff and you guys are reading, I've also found a couple of other things I may as well post here, since I've run into them during this deployment:

- Squirrelly behavior with multiple sites in one Solr index: indexing itself works fine thanks to the multisite changes, but reindexing one site seems to kill (or somehow damage) the whole index. If we could tell Solr to delete only the documents where site=www.example.com, instead of using the overall reset, that would be handy. Right now, if you are using multisite and reset the index, does it clobber all the sites' indexes? A workaround could be multiple Solr cores, which looks easy as pie (I'm going to do that when I deploy), but this should be clearly and obviously documented. Again, I'm willing to help.

- (This one really wasted a few hours last night.) If cron.php is called via http://127.0.0.1/cron.php, then that is the site recorded in the index, and you won't find any documents when hitting the site on another IP or DNS name. Maybe the "site" used for Solr indexing and searching should be based on something constant in the variables table, rather than on whatever URL happens to be used to access the site at indexing time. Is this even worth a patch? I only think so because it wasted a few hours of my life, but maybe not, at least not before fixing other stuff.
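On the multisite point above: Solr does support delete-by-query, so in principle a per-site reset could remove only one site's documents instead of clobbering the shared index. A minimal sketch of building such a request — the "site" field name and the update endpoint shown in the comment are assumptions about the setup, not verified module behavior:

```php
<?php
// Hypothetical sketch: build a Solr delete-by-query message that removes
// only the documents indexed for one site in a shared multisite index.
// Assumes each document carries a "site" field (an assumption here).
function build_site_delete_query($site) {
  // Quote the value so hostnames with dots survive Solr query parsing,
  // and escape it so the XML message stays well-formed.
  $escaped = htmlspecialchars($site, ENT_QUOTES);
  return '<delete><query>site:"' . $escaped . '"</query></delete>';
}

// This XML would then be POSTed to the Solr update handler, e.g. via
// drupal_http_request() against http://localhost:8983/solr/update
// (endpoint shown for illustration only).
```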

But yeah, I'm really interested in the (as far as I can see) small effort needed to get the 5.x branch into a stable release state, so that everyone can enjoy it without having to pick the code apart to get it going! Then whatever turns out particularly useful for the 6.x branch as well (the doc improvements) can be ported up as you guys see fit.

So I'm going to get these patches organized as described here, but I'll of course need help testing :)

somebodysysop’s picture

Is this patch now implemented in 6.x? If not, is there a version of this patch available for apache_solr 6.x?

pwolanin’s picture

The apachesolr_nodeaccess module in the 6.x branches should be working fine.

robertdouglass’s picture

Category: support » feature
Status: Needs review » Fixed

I have no idea what the last patches in this thread pertain to, or whether they're still relevant.

@damien_vancouver - there is some potentially interesting code in your last patch. If this is something you need, or if it's still relevant, please open a new issue with a new title and a good description of what the patch does. Thanks.

damien_vancouver’s picture

Hi Robert, yes, there are outstanding fixes in that last patch. Without the fixes in #54's patch, ApacheSolr 5.x is broken in at least two ways:
- attachments don't index (doh!)
- node_access permissions are not respected: anyone can search any indexed content (oops!)
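The query-time counterpart of the node_access fix is turning the current user's grants into a Solr filter query, so that only documents whose indexed access fields match are returned. A self-contained sketch — the `access_<realm>` field naming is my own assumption for illustration, not the module's actual schema:

```php
<?php
// Hypothetical sketch: build a Solr filter query (fq) from the current
// user's node_access grants, keyed by realm. A document is visible if
// any one of the user's (realm, gid) grants matches its access fields.
// The "access_<realm>" field naming is an assumption, not the real schema.
function build_access_filter(array $user_grants) {
  $clauses = array();
  foreach ($user_grants as $realm => $gids) {
    foreach ($gids as $gid) {
      $clauses[] = 'access_' . $realm . ':' . (int) $gid;
    }
  }
  return implode(' OR ', $clauses);
}
```

Appended as an fq parameter on the search request, a filter like this pushes the access check into Solr itself, avoiding the heavy post-query filtering against the node_access table discussed earlier in the thread.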

At Affinity Bridge, we've now been running ApacheSolr 5.x-1.x-dev with my #54 patch for over six months on production sites with no problems. I'd really like to see this module go from useless/broken (its current state) to fixed (just this one patch and better docs). Being hard to make work and/or poorly documented is not going to stop people from downloading the broken version and trying. So let's fix these tiny remaining issues; I even volunteer to do (most of) the work.

Here are the patches I want to make and see committed, in order of importance:

#1: The fix described in #54 here: proper node_access indexing, and a proper update hook so attachment contents show up. I'll re-post this as a new (clean) issue, as Robert suggested.

#2: Support for indexing documents in .docx, .xlsx, and similar formats. This mostly means finding the right helper app. I'd also like to make a patch so that the defaults are sensible for common (Red Hat/Debian) install locations and point to a helper that reads the new formats. The current helper defaults and instructions are wrong, with invalid paths for modern UNIX.

#3: Documentation improvements. I'd like to distill my notes from setting these up in real production environments into better install instructions. Nothing too serious, just repeatable step-by-step guides, including everything you need to know when starting out: how to get Solr to work under Tomcat, the implications of fighting vs. bypassing the security manager, and testing/troubleshooting steps to bring the entire system to life in the right order.

I already have time allocated for #2 next week (I need it for my client), and I could get all three done as well if you guys are willing to review, test, and commit. The #2 changes probably apply to 6.x as well, and I'll port those fixes up to the current version when we move to 6 (shortly). That will also give me time to try the fix on the 5.x production sites and make sure the docx-to-ASCII conversion is happening as it should.

Please let me know if any of this sounds agreeable, or if I am taking crazy pills here. Thanks!!

PS - we are moving to Drupal 6 soon enough... then I will be able to help on the current version of the module.

heacu’s picture

subscribing

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.