Searching unpublished nodes
droberge - November 13, 2009 - 20:03
| Project: | Apache Solr Search Integration |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Description
Hi, our system needs unpublished nodes to be indexed so they can be searched. I've been looking through the code for the apachesolr.module and it appears it does not index unpublished nodes. From line 312 in apachesolr.module (6.x-1.0-rc3):
$result = db_query_range("SELECT asn.nid, asn.changed FROM {apachesolr_search_node} asn ". $join_sql ."WHERE (asn.changed > %d OR (asn.changed = %d AND asn.nid > %d)) AND asn.status = 1 ". $exclude_sql ."ORDER BY asn.changed ASC, asn.nid ASC", $args, 0, $limit);
I was thinking of just removing the filter, asn.status = 1, which seems like it should solve it. But I wanted to see if there was something I might be missing.
Thanks

#1
There is also code to remove nodes from the index when they become unpublished.
#2
good point pwolanin. I did notice this code while looking through the module but didn't think of that.
Any recommendations on how I could implement this?
I was thinking possibly adding a new setting, "Index Unpublished Nodes", that could be selected in apachesolr settings.
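As a rough sketch of how such a setting could gate the status condition in the query quoted above: the function name and the $index_unpublished flag are hypothetical, not part of apachesolr.module. In the module the flag would presumably come from a stored setting, and the returned SQL would still be passed to db_query_range() with the same arguments as before.

```php
<?php
/**
 * Hypothetical helper, not part of apachesolr.module: builds the indexing
 * query quoted in the original post, with the status condition optional.
 */
function example_build_index_sql($join_sql, $exclude_sql, $index_unpublished) {
  // Keep the published-only condition unless the (assumed) setting is on.
  $status_sql = $index_unpublished ? '' : 'AND asn.status = 1 ';
  return "SELECT asn.nid, asn.changed FROM {apachesolr_search_node} asn " . $join_sql
    . "WHERE (asn.changed > %d OR (asn.changed = %d AND asn.nid > %d)) "
    . $status_sql . $exclude_sql
    . "ORDER BY asn.changed ASC, asn.nid ASC";
}
```

As noted in #1, the code that removes nodes from the index when they become unpublished would need to respect the same flag, otherwise the two would work against each other.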
#3
The reason we decided not to put unpublished nodes in the index is that, to avoid showing them, you have to add an extra filter query to every query by a normal user. Not a big deal, but something to consider, especially since we sometimes run up against the max URL length limit when making complex queries. Making it toggle-able might be ideal, but it's a bit of extra complexity and not something I have time to work on.
#4
We switched our workflow to making all nodes published, then use the workflow, workflow_access module and apachesolr_nodeaccess to make nodes appear in search results and/or be editable depending on the user's role.
In your case, you could replace workflow and workflow_access module with whatever node access module you want. Just make all the nodes published so that apachesolr can index them, then remove access to certain users/roles using your node access module (for instance, Organic Groups, or Taxonomy Access Control Lite, etc)
#5
Seconding @janusman's comment - we implemented this in the same way.
#6
@pwolanin seems like people are finding workarounds that probably are worse than indexing unpublished nodes.
(i.e. node_access facets have a much longer strlen() than status). And as a side effect, I think indexing unpublished nodes would simplify the queries in the Apachesolr module, making it simpler to maintain the state of what gets indexed.
Updated to a feature request.
Edit: *sigh* that was a silly comment. See my next comment.
#7
@Scott Reynolds I don't think these "workarounds" are necessarily worse. I don't think that unpublished nodes should be indexed.
#8
My metric:
Your method adds a filter query to every query for both normal users and admins. Hence, worse.
And why shouldn't unpublished nodes be indexed? And why should unpublished nodes be deleted from the index?
I don't think indexing unpublished nodes is a bad thing and it makes the code simpler. The SQL to determine what nodes to index gets smaller, and less is sent to the Solr server.
It would also allow Apache Solr Views to create a View for admins that is searchable for all the unpublished nodes.
Sounds like a good idea to me, as it doesn't harm anything (and in fact is better than your workaround).
What is the rationale behind not indexing unpublished nodes? Peter's is to keep the number of fq's down. Makes sense, but not sure it's a deal breaker.
#9
Well, I wasn't recommending @janusman's suggestion to @droberge - sorry if that wasn't clear. What I meant to emphasize is that we use workflow and nodeaccess instead.
"Unpublished" content in our system is really more of a personal draft state - I know our users wouldn't be comfortable knowing that their drafts were being indexed. I think I'd like to see this as an option if implemented...
#10
Thanks for the helpful suggestions. Maybe I'm missing something here, but if you publish all nodes, how can you differentiate who can view them and who can't? For example, in our system, an unpublished node means it is not ready for the public to see; only internal people working on the node can see it. So if all nodes were published, how could our system determine the "true" status of the node? In other words, a published node may not necessarily be ready to be seen by the public. It seems like you would need to add another flag for the node, essentially what the status field is used for, to flag whether the node can be viewed by the public or not. If this is the case, then why not use the status field? Why add another field that serves the same purpose as the status field?
@janusman and @anarchivist, is this a similar situation for your deployments? If so, can you explain how you differentiate between a node in "draft" vs a node ready to be viewed?
Thanks again for the helpful input.
#11
@droberge - We're using Workflow to determine who can see the nodes that need those sort of restrictions. We treat "unpublished" as an internal state for admins/individual content editors only.
#12
@Scott - I am considering a baseline site to be one with no node access module, and a limited number of unpublished nodes that are spam, or drafts, or otherwise not useful content. e.g. drupal.org or groups.drupal.org
Sites that are using a node access module are already sending extra filter queries, so using that system to hide content should not add any more overhead.
We could certainly look at various ways to toggle the behavior - but I guess it's useful to know what our model site looks like. Really the main reason for this would be to enable admin searches of unpublished content, as far as I can tell.
#13
I'm for changing to indexing unpublished nodes. I can see very useful cases, especially in conjunction with Views, where we'd want this. I'm +0.5 for a toggle. I don't think we really need it.
#14
Here's a mildly amusing thought: we could see if we can add a default fq to exclude unpublished nodes from the normal search, and then add a second named handler to allow searches without that filter being enforced.
#15
from the example solrconfig.xml:
<!-- In addition to defaults, "appends" params can be specified
     to identify values which should be appended to the list of
     multi-val params from the query (or the existing "defaults").
     In this example, the param "fq=instock:true" will be appended to
     any query time fq params the user may specify, as a mechanism for
     partitioning the index, independent of any user selected filtering
     that may also be desired (perhaps as a result of faceted searching).
     NOTE: there is *absolutely* nothing a client can do to prevent these
     "appends" values from being used, so don't use this mechanism
     unless you are sure you always want it.
  -->
<lst name="appends">
  <str name="fq">inStock:true</str>
</lst>
#16
Interesting technique. I wonder if this might be a nice way to do other things, too, like comment search, or a search for all the nodeaccess_all_0 (visible to everybody) content?
#17
@pwolanin: I'm not an apache solr expert, but that looks like a promising approach.
@robertDouglass: Is this going to be included in a future release of apache solr module?
For a quick and dirty solution, could we:
1. Change the query to retrieve nodes to index to ignore status field
2. Disable the query to remove unpublished nodes
3. Implement the modify query hook to filter for whether we want to retrieve only published nodes.
Is there anything I'm missing? I'm more than happy to submit the patch.
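Step 3 might look roughly like the following. This is a sketch under assumptions and has not been tested against the module: example_index_all is a made-up module name, "status" is a hypothetical index field that would have to be populated at index time, and the add_filter() call assumes the query object handed to hook_apachesolr_modify_query() offers that method, as I understand it does in the 6.x-1.x branch.

```php
<?php
/**
 * Implementation of hook_apachesolr_modify_query().
 *
 * Sketch only: restricts searches to published nodes for users who are
 * not allowed to administer nodes.
 */
function example_index_all_apachesolr_modify_query(&$query, &$params, $caller) {
  // Users with the core 'administer nodes' permission see everything.
  if (user_access('administer nodes')) {
    return;
  }
  // 'status' is a hypothetical index field; the real field name depends
  // on how the node status ends up being written to the index.
  $query->add_filter('status', 'true');
}
```

This is the extra filter query per normal-user search that #3 describes, so the URL length concern raised there applies to it as well.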
#18
I'm a little reluctant to add this to 1.x, since it would require people to update their solrconfig to avoid exposing unpublished content.
Still... we'd like to make some other updates there and to the schema
#19
Agree that this is not appropriate for 6.1. Absolutely the realm of 6.2.
#20
Using Nodeaccess can really slow down a server; I try to avoid all modules that use that table. Sometimes it gets corrupted and we might accidentally expose data. Hope that this change can get into 6.2.
#21
Well that puts Apache Solr Views and any other module wanting to expose unpublished nodes in the search results in a bad corner. How would you expose that? fq[]=(published:true OR published:false) so that we override it?
Big -1 for making it very un-api.
#22
Well, I said it was amusing, not that it was right...
#23
Hi guys. Came across the same problem of needing unpublished nodes to be indexed. Unfortunately the other workarounds won't work for me - it's too late to go back and alter all my data, and combinations of moderated/published have special meaning to sections of the site.
My approach: the 'apachesolr_search_node' table has a column 'status' that indicates to the indexer whether or not a node has been published. During indexing, the module looks at this table and decides if there is any indexing action that should happen to that node based on its status.
My plan is to 'trick' the search engine by setting up my own cron task that will set every node to published inside the search node table. By doing this, every node will be indexed, but Drupal will still preserve the status of the node as published or not, and the apachesolr_nodeaccess module will determine whether the user searching will be able to view it in searches or not.
When nodes are submitted, the solr module will attempt to update the DB table in question (during nodeapi), and it is possible that if the solr hook_cron runs before my module's hook_cron, some unintended action (un-index) could happen; however, the next cron would quickly correct this. If I want to be really crafty, I can set my module's weight lower in the system table to safeguard against this, but I'm not too worried.
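For anyone wanting to try the same trick, a minimal sketch of what such a module might contain, assuming Drupal 6. The module name example_index_all is made up, and the weight of -1 assumes apachesolr keeps the default weight of 0; none of this has been run against a real index.

```php
<?php
/**
 * Implementation of hook_cron().
 */
function example_index_all_cron() {
  // Mark every row as published in the indexer's own tracking table only;
  // the status column of the {node} table itself is left untouched.
  db_query("UPDATE {apachesolr_search_node} SET status = 1 WHERE status = 0");
}

/**
 * Implementation of hook_install().
 */
function example_index_all_install() {
  // A lower weight makes this module's hooks run before apachesolr's,
  // assuming apachesolr stays at the default weight of 0.
  db_query("UPDATE {system} SET weight = %d WHERE name = '%s' AND type = 'module'", -1, 'example_index_all');
}
```

Note that the indexing query quoted in the original post also compares asn.changed, so rows flipped this way would presumably only be picked up if their changed value is still ahead of the indexer's position.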
If anyone thinks this approach is worthwhile, I can publish the module.