Hi, our system needs unpublished nodes to be indexed so they can be searched. I've been looking through the code for the apachesolr.module and it appears it does not index unpublished nodes. From line 312 in apachesolr.module (6.x-1.0-rc3):
$result = db_query_range("SELECT asn.nid, asn.changed FROM {apachesolr_search_node} asn ". $join_sql ."WHERE (asn.changed > %d OR (asn.changed = %d AND asn.nid > %d)) AND asn.status = 1 ". $exclude_sql ."ORDER BY asn.changed ASC, asn.nid ASC", $args, 0, $limit);
I was thinking of just removing the filter, asn.status = 1, which seems like it should solve it. But I wanted to see if there is something I might be missing.
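For illustration, here is a minimal sketch of what making that filter optional might look like. The helper name `example_index_query_sql()` and its `$index_unpublished` flag are hypothetical and not part of apachesolr.module; the SQL fragments are copied from the query quoted above.

```php
<?php
// Hypothetical helper (not part of apachesolr.module): builds the SQL that
// selects the next batch of nodes to index, with the status filter optional.
function example_index_query_sql($join_sql = '', $exclude_sql = '', $index_unpublished = FALSE) {
  // Keep the "published only" condition unless unpublished nodes are wanted.
  $status_sql = $index_unpublished ? '' : 'AND asn.status = 1 ';
  return "SELECT asn.nid, asn.changed FROM {apachesolr_search_node} asn " . $join_sql
    . "WHERE (asn.changed > %d OR (asn.changed = %d AND asn.nid > %d)) "
    . $status_sql . $exclude_sql
    . "ORDER BY asn.changed ASC, asn.nid ASC";
}
```

The resulting string would be passed to db_query_range() exactly as in the original call. This only covers which rows are selected for indexing, nothing else in the module.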
Thanks
Comments
Comment #1
pwolanin commented
There is also code to remove nodes from the index when they become unpublished.
Comment #2
droberge commented
Good point pwolanin. I did notice this code while looking through the module but didn't think of that.
Any recommendations on how I could implement this?
I was thinking possibly adding a new setting, "Index Unpublished Nodes", that could be selected in apachesolr settings.
Comment #3
pwolanin commented
The reason we decided not to put unpublished nodes in the index is that, to avoid showing them, you have to add an extra filter query to every query by a normal user. Not a big deal, but something to consider, especially since we sometimes run up against the max URL length limit when making complex queries. Making it toggle-able might be ideal, but it's a bit of extra complexity and not something I have time to work on.
Comment #4
janusman commented
We switched our workflow to making all nodes published, then use the workflow, workflow_access and apachesolr_nodeaccess modules to make nodes appear in search results and/or be editable depending on the user's role.
In your case, you could replace workflow and workflow_access module with whatever node access module you want. Just make all the nodes published so that apachesolr can index them, then remove access to certain users/roles using your node access module (for instance, Organic Groups, or Taxonomy Access Control Lite, etc)
Comment #5
anarchivist commented
Seconding @janusman's comment - we implemented this in the same way.
Comment #6
Scott Reynolds commented
@pwolanin seems like people are finding workarounds that are probably worse than indexing unpublished nodes.
(i.e. node_access facets have a much longer strlen() than status). And as a side effect, I think indexing unpublished nodes would simplify the queries in the Apachesolr module, making it simpler to maintain the state of what gets indexed.
Updated to a feature request.
Edit: *sigh* that was a silly comment. See my next comment.
Comment #7
anarchivist commented
@Scott Reynolds I don't think these "workarounds" are necessarily worse. I don't think that unpublished nodes should be indexed.
Comment #8
Scott Reynolds commented
My metric:
Your method adds a filter query to every query for both normal users and admins. Hence, worse.
And why shouldn't unpublished nodes be indexed? And why should unpublished nodes be deleted from the index?
I don't think indexing unpublished nodes is a bad thing, and it makes the code simpler. The SQL to determine which nodes to index gets smaller, and less is sent to the Solr server.
It would also allow Apache Solr Views to create a View for admins that is searchable for all the unpublished nodes.
Sounds like a good idea to me, as it doesn't harm anything (and in fact is better than your workaround).
What is the rationale behind not indexing unpublished nodes? Peter's is to keep the number of fq's down. Makes sense, but I'm not sure it's a deal breaker.
Comment #9
anarchivist commented
Well, I wasn't recommending @janusman's suggestion to @droberge - sorry if that wasn't clear. What I meant to emphasize is that we use workflow and nodeaccess instead.
"Unpublished" content in our system is really more of a personal draft state - I know our users wouldn't be comfortable knowing that their drafts were being indexed. I think I'd like to see this as an option if implemented...
Comment #10
droberge commented
Thanks for the helpful suggestions. Maybe I'm missing something here, but if you publish all nodes, how can you differentiate who can view them and who can't? For example, in our system, an unpublished node means it is not ready for the public to see; only internal people working on the node can see it. So if all nodes were published, how could our system determine the "true" status of the node? In other words, a published node may not necessarily be ready to be seen by the public. It seems like you would need to add another flag for the node, essentially what the status field is used for, to flag whether the node can be viewed by the public or not. If this is the case, then why not use the status field? Why add another field that serves the same purpose as the status field?
@janusman and @anarchivist, is this a similar situation for your deployments? If so, can you explain how you differentiate between a node in "draft" vs a node ready to be viewed?
Thanks again for the helpful input.
Comment #11
anarchivist commented
@droberge - We're using Workflow to determine who can see the nodes that need those sorts of restrictions. We treat "unpublished" as an internal state for admins/individual content editors only.
Comment #12
pwolanin commented
@Scott - I am considering a baseline site to be one with no node access module, and a limited number of unpublished nodes that are spam, or drafts, or otherwise not useful content. e.g. drupal.org or groups.drupal.org
Sites that are using a node access module are already sending extra filter queries, so using that system to hide content should not add any more overhead.
We could certainly look at various ways to toggle the behavior - but I guess it's useful to know what our model site looks like. Really the main reason for this would be to enable admin searches of unpublished content, as far as I can tell.
Comment #13
robertdouglass commented
I'm for changing to indexing unpublished nodes. I can see very useful cases, especially in conjunction with Views, where we'd want this. I'm +0.5 for a toggle. I don't think we really need it.
Comment #14
pwolanin commented
Here's a mildly amusing thought: we could see if we can add a default fq to exclude unpublished nodes from the normal search, and then add a second named handler to allow searches without that filter being enforced.
Comment #15
pwolanin commented
From the example solrconfig.xml:
Comment #16
robertdouglass commented
Interesting technique. I wonder if this might be a nice way to do other things, too, like comment search, or a search for all the nodeaccess_all_0 (visible to everybody) content?
Comment #17
droberge commented
@pwolanin: I'm not an apache solr expert, but that looks like a promising approach.
@robertDouglass: Is this going to be included in a future release of apache solr module?
For a quick and dirty solution, could we:
1. Change the query to retrieve nodes to index to ignore status field
2. Disable the query to remove unpublished nodes
3. Implement the modify query hook to filter for whether we want to retrieve only published nodes.
Is there anything I'm missing? I'm more than happy to submit the patch.
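As a rough sketch of step 3, the filtering could amount to appending a filter query for users who should only see published content. Everything here is an assumption for illustration: the helper name, the shape of the `$params` array, and the Solr field name `status` would all need checking against the module's query hook and schema.xml.

```php
<?php
// Hypothetical helper: appends a "published only" filter query (fq) for users
// who may not see unpublished content. The field name 'status' is an assumption.
function example_restrict_to_published(array $params, $can_see_unpublished) {
  if (!$can_see_unpublished) {
    // Each fq entry further restricts the result set without affecting scoring.
    $params['fq'][] = 'status:true';
  }
  return $params;
}
```

A modify-query hook implementation could call this with the current user's permission check, leaving admin searches unfiltered.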
Comment #18
pwolanin commented
I'm a little reluctant to add this to 1.x, since it would require people to update their solrconfig to avoid exposing unpublished content.
Still... we'd like to make some other updates there and to the schema.
Comment #19
robertdouglass commented
Agree that this is not appropriate for 6.1. Absolutely the realm of 6.2.
Comment #20
deverman commented
Using Nodeaccess can really slow down a server, so I try to avoid all modules that use that table. Sometimes it gets corrupted and we might accidentally expose data. Hope that this change can get into 6.2.
Comment #21
Scott Reynolds commented
Well that puts Apache Solr Views and any other module wanting to expose unpublished nodes in the search results in a bad corner. How would you expose that? fq[]=(published:true OR published:false) so that we override it?
Big -1 for making it very un-api.
Comment #22
pwolanin commented
Well, I said it was amusing, not that it was right...
Comment #23
justindodge commented
Hi guys. Came across the same problem of needing unpublished nodes to be indexed. Unfortunately the other workarounds won't work for me - it's too late to go back and alter all my data, and combinations of moderated/published have special meaning to sections of the site.
My approach: the 'apachesolr_search_node' table has a column 'status' that indicates to the indexer whether or not a node has been published. During indexing, the module looks at this table and decides if there is any indexing action that should happen to that node based on its status.
My plan is to 'trick' the search engine by setting up my own cron task that will set every node to published inside the search node table. By doing this, every node will be indexed, but Drupal will still preserve the status of the node as published or not, and the apachesolr_nodeaccess module will determine whether the user searching will be able to view it in searches or not.
When nodes are submitted, the solr module will attempt to update the DB table in question (during nodeapi), and it is possible that if the solr hook_cron runs before my module's hook_cron, some unintended action (un-index) could happen; however, the next cron would quickly correct this. If I want to be really crafty, I can set my module's weight lower in the system table to safeguard against this, but I'm not too worried.
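A minimal sketch of the cron task described above, assuming a hypothetical module named `index_unpublished`. The table and column names come from the description; db_query() is Drupal 6's database function and is only available inside a bootstrapped site. Whether the `changed` column also needs updating for the indexer to revisit a row is not covered here and is untested.

```php
<?php
// Hypothetical module "index_unpublished": on every cron run, mark all rows in
// the search node table as published so the indexer picks them up.
function index_unpublished_cron() {
  // db_query() is provided by Drupal 6; it does not exist outside Drupal.
  db_query(index_unpublished_mark_all_sql());
}

// Returns the SQL used above; {braces} are Drupal's table-prefix placeholders.
function index_unpublished_mark_all_sql() {
  return "UPDATE {apachesolr_search_node} SET status = 1 WHERE status = 0";
}
```

The real node status in the node table is left untouched, so access checks at search time still see the node as unpublished.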
If anyone thinks this approach is worthwhile, I can publish the module.
Comment #24
drewish commented
Subscribing. We need to be able to search for unpublished nodes.
Comment #25
janusman commented
See comments #4 and #5 for a possible solution.
Comment #26
drewish commented
janusman, I'd read those, but as with others our workflow is too baked in at this point. I'm working on a patch to add an option to index unpublished nodes.
Comment #27
fp commented
Subscribing.
Comment #28
damienmckenna
+1 for a workable solution for sites using Workflow, seems like a rather much-needed item.
Comment #29
jpmckinney commented
Comment #30
solquimpo commented
Subscribing
Comment #31
pwolanin commented
In the absence of my need for this feature and no patches forthcoming for 7.x, marking "won't fix".
Comment #32
mrharolda commented
I have a couple hours to burn and I'd like this feature too!
We're on a project where content owners can search their own content and publish/unpublish it from the search results. Indexing unpublished content is a must here.
@pwolanin: if you don't mind I'll have a go at creating a patch for Solr/D7.
Comment #33
justindodge commented
I made a module that does this a while back.
It seems to do the trick, but I suspect there may be better ways of handling the task, and with my limited knowledge of apachesolr, I don't know if it affects anything else adversely.
I've seen the index page look confused about how many nodes are left to index, but in the end it always gets to 100%. I'm not sure if that is a product of this module or not. No bugs that I've encountered otherwise.
It hasn't been published on drupal.org, so take a look at the attachment.
Comment #34
justindodge commented
I should be clear and mention that I developed it with the 6.x branch, specifically I'm using 1.2, but I'm sure any 1.x would work.
My guess is that the update_index hook probably hasn't changed too drastically in subsequent versions, but can't say for sure.
Comment #35
mrharolda commented
@justindodge: any plans of porting it to 7.x? We don't use D6 for new projects.
I'll have a look at your solution. I had to patch apachesolr at 4 places to really support indexing unpublished nodes, so your solution may be a lot better, or incomplete ;)
Comment #36
mrharolda commented
Setting this back to 7.x
Comment #37
mrharolda commented
This is the patch I'm testing right now. It introduces the 'apachesolr_index_unpublished' variable into the apachesolr module.
I'm testing it together with apachesolr_access and content_access (patched) and so far, it looks like it's working fine! I patched against 7.x-v1.0-beta8 since that's what we're using in our current project.
Add this to your settings.php file to enable indexing unpublished nodes:
$conf['apachesolr_index_unpublished'] = 1;
Comment #38
mrharolda commented
There are some remaining issues with my patch, but they are present in the grants system of Drupal, apachesolr_access and/or content_access.
Authors of unpublished nodes with the 'view own unpublished content' permission never get a grant to view the node but they do have node_access('view') and thus are able to view their own node. IIUC, apachesolr_access only stores realms and grants and thus discards default Drupal permissions. This causes the node to be omitted from the search results, even if the user has a view right for that node.
I'm not quite sure where this feature request/bug report belongs to...
Comment #39
pwolanin commented
Right, the module has no way of knowing the View grants on the individual nodes except via the node access table. As a possible feature/fix, we could add a condition of the node uid matching the user uid in the access module in addition to what comes out of the table.
Comment #40
mrharolda commented
@pwolanin,
I also logged a feature request against content_access: #1225520: Add 'View own unpublished content' setting
What do you think of my patch so far? I'll add an admin interface option for the unpublished variable as soon as all issues are resolved, in this case listing the users' own unpublished content, either by adding grants for that in content_access or in apachesolr itself...
Comment #41
mrharolda commented
Ok, my guess is that I should add an extra exception in apachesolr_access_build_subquery() that adds a subquery that shows all content he/she owns.
Now I only have to figure out which query to use ;)
Comment #42
mrharolda commented
Wow, it now seems I was already really close, but too dumb to use the indexed field name: 'is_uid'. ;)
I'm testing this right now and it looks promising:
Comment #43
brianV commented
Bumping this as we have an interest in possibly seeing this as well. Any forward progress on this patch?
Changing version, because that's the version where the missing feature is. Also setting to 'needs review' as there is a patch under consideration.
Comment #44
pwolanin commented
Patch fails because is_uid is not guaranteed to be meaningful to identify content in a multi-site situation.
This might work if you also filter by site hash.
I don't know why the variable about indexing unpublished content is there.
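To make the site-hash suggestion concrete, here is a hedged sketch of an "own content" clause that combines the author uid with a site hash. The helper name is hypothetical, 'is_uid' is the field named earlier in the thread, and 'hash' as the site-hash field name is an assumption to verify against schema.xml.

```php
<?php
// Hypothetical helper: builds a Solr query clause matching content authored by
// one user on one site. The 'hash' field name is an assumption.
function example_own_content_clause($uid, $site_hash) {
  // Escape backslashes and quotes, then wrap the hash as a quoted phrase.
  $escaped = str_replace(array('\\', '"'), array('\\\\', '\\"'), $site_hash);
  // Casting the uid to int keeps arbitrary input out of the query string.
  return '(is_uid:' . (int) $uid . ' AND hash:"' . $escaped . '")';
}
```

A clause like this could be OR-ed with the grant-based subquery so that authors see their own content only on the site where the uid is meaningful.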
Comment #45
nick_vh
#1442358: General Mock object for simpletests (expected value/real value object model) solves this; the status callback is easy to override.
The access callback might be handled differently, but all in all this is a task for a custom module.
Comment #46
teknic commented
Can you expand on this Nick? I don't see where I can implement this callback to override the unpublish / publish status check.
For example, is there a hook to grab the document and set its status to 1 for the search indexer? I'm just unclear how #1442358: General Mock object for simpletests (expected value/real value object model) provides a solution to this issue.
Thank you.
Comment #47
Zombocom123 commented
Any news on this issue? What's the best solution at the moment?
Comment #48
jared_sprague commented
We need to index all content regardless of publication state. I noticed that this issue was closed as working as designed. But I still think there is a need for an option to index unpublished content, as we had this need and it's clear other people have and will have this need in the future. I'm submitting the patch that we are using, because the current patch on this issue is 4 years old as of Feb 2015. This patch is current with the latest dev version.
To use this patch add the following to your settings.php:
$conf['apachesolr_index_unpublished'] = TRUE;
What this patch does is make the indexer status-insensitive by making it think all content is published. This patch only affects indexing, NOT search. So if you need to index unpublished nodes, you could use this patch as a starting point.
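The idea of "making the indexer think all content is published" can be sketched roughly as below. This is not the patch itself: the helper name is hypothetical, and in a real module the flag would presumably be read with variable_get('apachesolr_index_unpublished', FALSE).

```php
<?php
// Hypothetical helper: returns the status the indexer should see for a node.
// With the flag on, every node is reported as published (1).
function example_effective_index_status($node_status, $index_unpublished) {
  return $index_unpublished ? 1 : (int) $node_status;
}
```

Because only the indexer's view of the status changes, search-time access still has to be restricted separately.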
Comment #49
torgospizza
We also have a need for this, as we use Apachesolr to index a list of Commerce Products and their purchasers, and we use those values to alter the display of e.g. our catalog pages. However sometimes we want to allow access to unpublished nodes for some premium members (Kickstarter backers) - keeping their "exclusive" products hidden from other users, but allowing those products to still be indexed.
This patch looks like a good method, although I think ideally we should be allowed to drupal_alter() the reindexing callback. I do realize that you can use a hook_entity_info_alter to add or replace the node reindex callback specified for a bundle, but that seems like using a tank to hit a thumbtack.
EDIT: It looks like we can use hook_apachesolr_exclude() to do this very thing.
Comment #50
roynilanjan commented
Following @torgosPizza's comment, it would be a good idea to write a custom implementation using hook_apachesolr_entity_info_alter.
We can write a workaround as,
Comment #51
roynilanjan commented
Comment #54
greentee commented
Hi,
I successfully use this patch, thank you.
But I found one case that needs to be improved.
When I create a new node and it is not in Published status, it won't be indexed by Solr.
It's because we have an additional check on entity status in hook_entity_insert().