Index unpublished nodes [#632282]

Comment	File	Size	Author
#54	apachesolr-index-unpublished-nodes.patch	619 bytes	greentee
#48	apachesolr-index-unpublished-status-insensitive-632282-48.patch	3.79 KB	jared_sprague
#42	632282-apachesolr_access.patch	639 bytes	mrharolda
#37	632282.patch	3.44 KB	mrharolda
#33	apachesolr_index_unpublished.zip	2.16 KB	justindodge

Comment #1

pwolanin commented 14 November 2009 at 03:11

there is also code to remove nodes from the index when they become unpublished.

Comment #2

droberge commented 15 November 2009 at 17:58

good point pwolanin. I did notice this code while looking through the module but didn't think of that.

Any recommendations on how I could implement this?

I was thinking possibly adding a new setting, "Index Unpublished Nodes", that could be selected in apachesolr settings.

Comment #3

pwolanin commented 16 November 2009 at 00:49

The reason we decided not to put unpublished nodes in the indexes is that to avoid showing them you have to add an extra filter query to every query by a normal user. Not a big deal, but something to consider, especially since we sometimes run up against the max URL length limit when making complex queries. Making it toggle-able might be ideal, but it's a bit of extra complexity and not something I have time to work on.
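
To make that cost concrete, here is a minimal sketch in Python (a language-neutral model, not the module's PHP) of how one extra fq lengthens the request URL. The function name, the `status:true` field/value pair, and the localhost URL are assumptions made up for illustration; the real schema may name the field differently.

```python
from urllib.parse import urlencode


def build_search_url(base_url, keywords, filters, is_admin, index_has_unpublished):
    """Build a Solr select URL, adding a status filter only when it is needed."""
    params = [("q", keywords)]
    params += [("fq", fq) for fq in filters]
    if index_has_unpublished and not is_admin:
        # Hypothetical field name: hides unpublished nodes from normal users.
        params.append(("fq", "status:true"))
    return base_url + "?" + urlencode(params)


base = "http://localhost:8983/solr/select"  # placeholder URL
facets = ["type:story", "uid:42"]

# Index holds only published nodes: no status filter is required.
without_status = build_search_url(base, "drupal", facets, False, False)
# Index also holds unpublished nodes: every normal-user query grows by one fq.
with_status = build_search_url(base, "drupal", facets, False, True)

print(len(with_status) - len(without_status), "extra characters per query")
```

The extra parameter is small on its own, but it is paid on every non-admin request, which matters most for queries that are already close to the URL length limit.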

Comment #4

janusman commented 17 November 2009 at 15:43

We switched our workflow to making all nodes published, then use the workflow, workflow_access module and apachesolr_nodeaccess to make nodes appear in search results and/or be editable depending on the user's role.

In your case, you could replace workflow and workflow_access module with whatever node access module you want. Just make all the nodes published so that apachesolr can index them, then remove access to certain users/roles using your node access module (for instance, Organic Groups, or Taxonomy Access Control Lite, etc)

Comment #5

anarchivist commented 19 November 2009 at 11:01

Seconding @janusman's comment - we implemented this in the same way.

Comment #6

Scott Reynolds commented 19 November 2009 at 22:24

@pwolanin seems like people are finding workarounds that are probably worse than indexing unpublished nodes. ~~(i.e. node_access facets have a much longer strlen() than status)~~

And as a side effect, I think indexing unpublished nodes would simplify the queries in the Apachesolr module, making it simpler to maintain the state of what gets indexed.

Updated to a feature request.

Edit: *sigh* that was a silly comment. See my next comment.

Comment #7

anarchivist commented 19 November 2009 at 22:05

Version:	6.x-1.0-rc3	» 6.x-1.x-dev
Category:	support	» feature

@Scott Reynolds I don't think these "workarounds" are necessarily worse. I don't think that unpublished nodes should be indexed.

Comment #8

Scott Reynolds commented 19 November 2009 at 22:29

My metric:

The reason we decided not to put unpublished nodes in the indexes is that to avoid showing them you have to add an extra filter query to every query by a normal user.

Your method adds a filter query to every query for both normal and admins. Hence, worse.

And why shouldn't unpublished nodes be indexed? And why should unpublished nodes be deleted from the index?

I don't think indexing unpublished nodes is a bad thing, and it makes the code simpler. The SQL to determine what nodes to index gets smaller, and less is sent to the Solr server.

It would also allow Apache Solr Views to create a View for admins that is searchable for all the unpublished nodes.

Sounds like a good idea to me, as it doesn't harm anything (and in fact is better than your workaround).

What is the rationale behind not indexing unpublished nodes? Peter's is to keep the number of fq's down. Makes sense, but I'm not sure it's a deal breaker.

Comment #9

anarchivist commented 20 November 2009 at 04:30

Well, I wasn't recommending @janusman's suggestion to @droberge - sorry if that wasn't clear. What I meant to emphasize is that we use workflow and nodeaccess instead.

"Unpublished" content in our system is really more of a personal draft state - I know our users wouldn't be comfortable knowing that their drafts were being indexed. I think I'd like to see this as an option if implemented...

Comment #10

droberge commented 20 November 2009 at 19:39

Thanks for the helpful suggestions. Maybe I'm missing something here, but if you publish all nodes, how can you differentiate who can view them and who can't? For example, in our system, an unpublished node means it is not ready for the public to see; only internal people working on the node can see it. So if all nodes were published, how could our system determine the "true" status of the node? In other words, a published node may not necessarily be ready to be seen by the public. It seems like you would need to add another flag for the node, essentially what the status field is used for, to flag whether the node can be viewed by the public or not. If this is the case, then why not use the status field? Why add another field that serves the same purpose as the status field?

@janusman and @anarchivist, is this a similar situation for your deployments? If so, can you explain how you differentiate between a node in "draft" vs a node ready to be viewed?

Thanks again for the helpful input.

Comment #11

anarchivist commented 20 November 2009 at 20:46

@droberge - We're using Workflow to determine who can see the nodes that need those sort of restrictions. We treat "unpublished" as an internal state for admins/individual content editors only.

Comment #12

pwolanin commented 20 November 2009 at 22:19

@Scott - I am considering a baseline site to be one with no node access module, and a limited number of unpublished nodes that are spam, or drafts, or otherwise not useful content. e.g. drupal.org or groups.drupal.org

Sites that are using a node access module are already sending extra filter queries, so using that system to hide content should not add any more overhead.

We could certainly look at various ways to toggle the behavior - but I guess it's useful to know what our model site looks like. Really the main reason for this would be to enable admin searches of unpublished content, as far as I can tell.

Comment #13

robertdouglass commented 25 November 2009 at 13:58

I'm for changing to indexing unpublished nodes. I can see very useful cases, especially in conjunction with Views, where we'd want this. I'm +0.5 for a toggle. I don't think we really need it.

Comment #14

pwolanin commented 25 November 2009 at 19:13

Here's a mildly amusing thought: we could see if we can add a default fq to exclude unpublished nodes from the normal search, and then add a second named handler to allow searches without that filter being enforced.

Comment #15

pwolanin commented 25 November 2009 at 19:38

from the example solrconfig.xml:

    <!-- In addition to defaults, "appends" params can be specified
         to identify values which should be appended to the list of
         multi-val params from the query (or the existing "defaults").

         In this example, the param "fq=instock:true" will be appended to
         any query time fq params the user may specify, as a mechanism for
         partitioning the index, independent of any user selected filtering
         that may also be desired (perhaps as a result of faceted searching).

         NOTE: there is *absolutely* nothing a client can do to prevent these
         "appends" values from being used, so don't use this mechanism
         unless you are sure you always want it.
      -->
    <lst name="appends">
      <str name="fq">inStock:true</str>
    </lst>

Comment #16

robertdouglass commented 25 November 2009 at 21:49

Interesting technique. I wonder if this might be a nice way to do other things, too, like comment search, or a search for all the nodeaccess_all_0 (visible to everybody) content?

Comment #17

droberge commented 27 November 2009 at 18:38

@pwolanin: I'm not an apache solr expert, but that looks like a promising approach.

@robertDouglass: Is this going to be included in a future release of apache solr module?

For a quick and dirty solution, could we:

1. Change the query to retrieve nodes to index to ignore status field

2. Disable the query to remove unpublished nodes

3. Implement the modify query hook to filter for whether we want to retrieve only published nodes.
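
The three steps above can be modelled roughly as follows. This is a Python sketch of the logic only, not Drupal code; the `Node` class, the function names, and the `status:true` filter string are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Node:
    nid: int
    status: int   # 1 = published, 0 = unpublished
    changed: int  # last-modified timestamp


def nodes_to_index(nodes, last_indexed):
    """Step 1: select by change time only; status is no longer checked."""
    return [n.nid for n in nodes if n.changed > last_indexed]


def nodes_to_remove(nodes):
    """Step 2: removal of unpublished nodes is disabled, so nothing is removed."""
    return []


def modify_query(filters, may_see_unpublished):
    """Step 3: restrict everyone else to published content at query time."""
    if may_see_unpublished:
        return list(filters)
    # "status:true" is a hypothetical field/value pair.
    return list(filters) + ["status:true"]


nodes = [Node(1, 1, 100), Node(2, 0, 150), Node(3, 0, 50)]
indexed = nodes_to_index(nodes, last_indexed=90)
visitor_filters = modify_query(["type:page"], may_see_unpublished=False)
editor_filters = modify_query(["type:page"], may_see_unpublished=True)
```

In this model the unpublished node 2 is indexed along with node 1, and only the query-time filter keeps it away from ordinary visitors, so step 3 becomes the single place where exposure is controlled.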

Is there anything I'm missing? I'm more than happy to submit the patch.

Comment #18

pwolanin commented 27 November 2009 at 20:37

I'm a little reluctant to add this to 1.x, since it would require people to update their solrconfig to avoid exposing unpublished content.

Still... we'd like to make some other updates there and to the schema

Comment #19

robertdouglass commented 27 November 2009 at 22:45

Version:	6.x-1.x-dev	» 6.x-2.x-dev

Agree that this is not appropriate for 6.1. Absolutely the realm of 6.2.

Comment #20

deverman commented 30 November 2009 at 06:45

Using Nodeaccess can really slow down a server, so I try to avoid all modules that use that table. Sometimes it gets corrupted and we might accidentally expose data. Hope that this change can get into 6.2.

Comment #21

Scott Reynolds commented 30 November 2009 at 17:22

Here's a mildly amusing thought: we could see if we can add a default fq to exclude unpublished nodes from the normal search, and then add a second named handler to allow searches without that filter being enforced.

Well that puts Apache Solr Views and any other module wanting to expose unpublished nodes in the search results in a bad corner. How would you expose that? fq[]=(published:true OR published:false) so that we override it?

Big -1 for making it very un-api.
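
The override question can be checked with a small model. Solr applies every fq as a further restriction, so the result set is the intersection of all filter queries, and a broader fq sent by the client cannot undo one added through "appends". The sketch below is Python with made-up function names, and it uses the `published` field from the example above; each predicate merely stands in for an fq string.

```python
def search(index, filter_queries):
    """Return ids of documents that pass every filter query (an intersection)."""
    return sorted(
        nid for nid, doc in index.items()
        if all(fq(doc) for fq in filter_queries)
    )


def published_only(doc):
    # Stands in for an fq of published:true added through "appends".
    return doc["published"]


def published_or_not(doc):
    # Stands in for fq=(published:true OR published:false).
    return doc["published"] or not doc["published"]


index = {1: {"published": True}, 2: {"published": False}, 3: {"published": True}}

# Default handler: the appended filter is always present, so the "override" has no effect.
normal_handler = search(index, [published_only, published_or_not])
# A second handler without the appended filter is the only way to reach node 2.
second_handler = search(index, [published_or_not])
```

Under these assumptions, any module that wants unpublished results has to target the second handler explicitly rather than adjust its filters.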

Comment #22

pwolanin commented 2 December 2009 at 02:31

Well, I said it was amusing, not that it was right...

Comment #23

justindodge commented 4 December 2009 at 12:26

Hi guys. Came across the same problem of needing unpublished nodes to be indexed. Unfortunately the other workarounds won't work for me - it's too late to go back and alter all my data, and combinations of moderated/published have special meaning to sections of the site.

My approach: the 'apachesolr_search_node' table has a column 'status' that indicates to the indexer whether or not a node has been published. During indexing, the module looks at this table and decides whether any indexing action should happen to that node based on its status.

My plan is to 'trick' the search engine by setting up my own cron task that will set every node to published inside the search node table. By doing this, every node will be indexed, but Drupal will still preserve the status of the node as published or not, and the apachesolr_nodeaccess module will determine whether the user searching is able to view it in searches or not.

When nodes are submitted, the solr module will attempt to update the DB table in question (during nodeapi), and it is possible that if the solr hook_cron runs before my module's hook_cron, some unintended action (un-index) could happen; however, the next cron run would quickly correct this. If I want to be really crafty, I can set my module's weight lower in the system table to safeguard against this, but I'm not too worried.
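
A rough model of that cron task, written in Python against an in-memory SQLite database rather than as a Drupal hook_cron. The table name and the 'status' column come from the description above; the `nid` and `changed` columns and the function name are assumptions for illustration.

```python
import sqlite3


def mark_all_published(conn):
    """Flag every tracked node as published so the indexer will not skip it."""
    cur = conn.execute(
        "UPDATE apachesolr_search_node SET status = 1 WHERE status <> 1"
    )
    conn.commit()
    return cur.rowcount  # number of rows that were flipped


conn = sqlite3.connect(":memory:")
# Assumed layout of the tracking table (only 'status' is confirmed above).
conn.execute(
    "CREATE TABLE apachesolr_search_node "
    "(nid INTEGER PRIMARY KEY, status INTEGER, changed INTEGER)"
)
conn.executemany(
    "INSERT INTO apachesolr_search_node VALUES (?, ?, ?)",
    [(1, 1, 100), (2, 0, 110), (3, 0, 120)],
)

updated = mark_all_published(conn)
```

The node table itself is never touched in this model, which is why the site still knows the real published state while the indexer sees every row as published.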

If anyone thinks this approach is worthwhile, I can publish the module.

Comment #24

drewish commented 8 March 2010 at 18:54

subscribing. we need to be able to search for unpublished nodes.

Comment #25

janusman commented 8 March 2010 at 20:47

See comments #4 and #5 for a possible solution.

Comment #26

drewish commented 8 March 2010 at 22:25

janusman, i'd read those but as with others our workflow is too baked in at this point. i'm working on a patch to add an option to index unpublished nodes.

Comment #27

fp commented 14 April 2010 at 23:50

subscribing.

Comment #28

damienmckenna commented 18 January 2011 at 20:55

+1 for a workable solution for sites using Workflow, seems like a rather much-needed item.

Comment #29

jpmckinney commented 20 March 2011 at 04:12

Title:	Searching unpublished nodes	» Index unpublished nodes
Version:	6.x-2.x-dev	» 7.x-1.x-dev

Comment #30

solquimpo commented 29 June 2011 at 06:01

Subscribing

Comment #31

pwolanin commented 18 July 2011 at 01:44

Status:	Active	» Closed (won't fix)

In the absence of my need for this feature and no patches forthcoming for 7.x, marking "won't fix".

Comment #32

mrharolda commented 20 July 2011 at 13:31

Status:	Closed (won't fix)	» Active

I have a couple hours to burn and I'd like this feature too!

We're on a project where content owners can search their own content and publish/unpublish it from the search results. Indexing unpublished content is a must here.

@pwolanin: if you don't mind I'll have a go at creating a patch for Solr/D7.

Comment #33

justindodge commented 20 July 2011 at 17:55

Version:	7.x-1.x-dev	» 6.x-1.2

Status	File	Size
new	apachesolr_index_unpublished.zip	2.16 KB

I made a module that does this a while back.

It seems to do the trick, but I suspect there may be better ways of handling the task, and with my limited knowledge of apachesolr, I don't know if it affects anything else adversely.

I've seen the index page look confused about how many nodes are left to index, but in the end it always gets to 100%. I'm not sure if that is a product of this module or not. No bugs that I've encountered otherwise.

It hasn't been published on drupal.org, so take a look at the attachment.

Comment #34

justindodge commented 20 July 2011 at 17:58

I should be clear and mention that I developed it with the 6.x branch, specifically I'm using 1.2, but I'm sure any 1.x would work.

My guess is that the update_index hook probably hasn't changed too drastically in subsequent versions, but can't say for sure.

Comment #35

mrharolda commented 21 July 2011 at 07:42

@justindodge: any plans of porting it to 7.x? We don't use D6 for new projects.

I'll have a look at your solution. I had to patch apachesolr in 4 places to really support indexing unpublished nodes, so your solution may be a lot better, or incomplete ;)

Comment #36

mrharolda commented 21 July 2011 at 07:43

Version:	6.x-1.2	» 7.x-1.x-dev

Setting this back to 7.x

Comment #37

mrharolda commented 21 July 2011 at 08:35

Version:	7.x-1.x-dev	» 7.x-1.0-beta8

Status	File	Size
new	632282.patch	3.44 KB

This is the patch I'm testing right now. It introduces the 'apachesolr_index_unpublished' variable into the apachesolr module.

I'm testing it together with apachesolr_access and content_access (patched) and so far, it looks like it's working fine! I patched against 7.x-v1.0-beta8 since that's what we're using in our current project.

Add this to your settings.php file to enable indexing unpublished nodes: $conf['apachesolr_index_unpublished'] = 1;

Comment #38

mrharolda commented 21 July 2011 at 12:46

There are some remaining issues with my patch, but they are present in the grants system of Drupal, apachesolr_access and/or content_access.

Authors of unpublished nodes with the 'view own unpublished content' permission never get a grant to view the node, but they do have node_access('view') and thus are able to view their own node. IIUC, apachesolr_access only stores realms and grants and thus discards default Drupal permissions. This causes the node to be omitted from the search results, even if the user has view rights for that node.

I'm not quite sure where this feature request/bug report belongs to...

Comment #39

pwolanin commented 21 July 2011 at 21:10

Right, the module has no way of knowing the view grants on the individual nodes except via the node access table. As a possible feature/fix, we could add a condition of the node uid matching the user uid in the access module, in addition to what comes out of the table.

Comment #40

mrharolda commented 21 July 2011 at 21:44

@pwolanin,

I also logged a feature request against content_access: #1225520: Add 'View own unpublished content' setting

What do you think of my patch so far? I'll add an admin interface option for the unpublished variable as soon as all issues are resolved, in this case listing the users' own unpublished content, either by adding grants for that in content_access or in apachesolr itself...

Comment #41

mrharolda commented 22 July 2011 at 08:56

OK, my guess is that I should add an extra exception in apachesolr_access_build_subquery() that adds a subquery showing all content the user owns.

Now I only have to figure out which query to use ;)

function apachesolr_access_build_subquery($account) {
  if (!user_access('access content', $account)) {
    throw new Exception('No access');
  }
  $node_access_query = new SolrFilterSubQuery();
  if (user_access('bypass node access', $account)) {
    // Access all content from the current site, or public content.
    $node_access_query->addFilter('access__all', 0);
    $node_access_query->addFilter('hash', apachesolr_site_hash());
  }
  else {
    // Get node access grants.
    $grants = node_access_grants('view', $account);
    foreach ($grants as $realm => $gids) {
      foreach ($gids as $gid) {
        $node_access_query->addFilter('access_node_' . apachesolr_site_hash() . '_' . $realm, $gid);
      }
    }
    $node_access_query->addFilter('access__all', 0);
  }
  if (user_access('view own unpublished content', $account)) {
    // Access all owned content.
    $node_access_query->addFilter('OR uid query/filter', $account->uid); // ??? @TODO: add an OR that adds all owned content ???
  }
  return $node_access_query;
}

Comment #42

mrharolda commented 25 July 2011 at 11:33

Status	File	Size
new	632282-apachesolr_access.patch	639 bytes

Wow, it now seems I was already really close, but too dumb to use the indexed field name: 'is_uid'. ;)

I'm testing this right now and it looks promising:

function apachesolr_access_build_subquery($account) {
  if (!user_access('access content', $account)) {
    throw new Exception('No access');
  }
  $node_access_query = new SolrFilterSubQuery();
  if (user_access('bypass node access', $account)) {
    // Access all content from the current site, or public content.
    $node_access_query->addFilter('access__all', 0);
    $node_access_query->addFilter('hash', apachesolr_site_hash());
  }
  else {
    // Get node access grants.
    $grants = node_access_grants('view', $account);
    foreach ($grants as $realm => $gids) {
      foreach ($gids as $gid) {
        $node_access_query->addFilter('access_node_' . apachesolr_site_hash() . '_' . $realm, $gid);
      }
    }
    $node_access_query->addFilter('access__all', 0);
  }
  if (variable_get('apachesolr_index_unpublished', 0) && user_access('view own unpublished content', $account)) {
    // Access all owned content regardless of status.
    $node_access_query->addFilter('is_uid', $account->uid);
  }
  return $node_access_query;
}

Comment #43

brianV commented 26 October 2011 at 20:36

Version:	7.x-1.0-beta8	» 7.x-1.x-dev
Status:	Active	» Needs review

Bumping this as we have an interest in possibly seeing this as well. Any forward progress on this patch?

Changing version, because that's the version where the missing feature is. Also setting to 'needs review' as there is a patch under consideration.

Comment #44

pwolanin commented 28 October 2011 at 19:10

Status:	Needs review	» Needs work

Patch fails because is_uid is not guaranteed to be meaningful to identify content in a multi-site situation.

This might work if you also filter by site hash.
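
As a sketch of what that could look like, the owner clause can be tied to the site hash so that a matching uid from another site in the same index never qualifies. This is a Python model of the resulting filter string, not the module's code; it assumes the subquery joins its filters with OR, and the function name and sample values are hypothetical. The field names (`access__all`, `access_node_<hash>_<realm>`, `is_uid`, `hash`) are taken from the patch above.

```python
def build_access_filter(grants, uid, site_hash, can_view_own_unpublished):
    """Return a Solr filter string; clauses are alternatives joined with OR."""
    clauses = ["access__all:0"]
    for realm, gids in sorted(grants.items()):
        for gid in gids:
            clauses.append(f"access_node_{site_hash}_{realm}:{gid}")
    if can_view_own_unpublished:
        # The uid alone is ambiguous across sites, so require the site hash too.
        clauses.append(f"(is_uid:{uid} AND hash:{site_hash})")
    return " OR ".join(clauses)


owner_filter = build_access_filter({"og_subscriber": [5]}, 42, "abc123", True)
plain_filter = build_access_filter({"og_subscriber": [5]}, 42, "abc123", False)
print(owner_filter)
```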

I don't know why the variable is there about indexing unpublished content?

Comment #45

nick_vh commented 17 February 2012 at 11:44

Status:	Needs work	» Closed (works as designed)

#1442358: General Mock object for simpletests (expected value/real value object model) solves this, the status callback is easy to override.

The access callback might be handled differently but all in all this is a task for a custom module

Comment #46

teknic commented 21 August 2013 at 20:53

Category:	feature	» support

Can you expand on this Nick? I don't see where I can implement this callback to override the unpublish / publish status check.

For example, is there a hook to grab the document and set its status to 1 for the search indexer? I'm just unclear how #1442358: General Mock object for simpletests (expected value/real value object model) provides a solution to this issue.

Thank you.

Comment #47

Zombocom123 commented 26 July 2014 at 08:24

Issue summary:	View changes

Any news on this issue? What's the best solution at the moment?

Comment #48

jared_sprague commented 14 February 2015 at 00:08

Status	File	Size
new	apachesolr-index-unpublished-status-insensitive-632282-48.patch	3.79 KB

We need to index all content regardless of publication state. I noticed that this issue was closed as "works as designed", but I still think there is a need for an option to index unpublished content, as we had this need and it's clear other people have and will have this need in the future. I'm submitting the patch that we are using, because the current patch on this issue is 4 years old as of Feb 2015. This patch is current with the latest dev version.

To use this patch add the following to your settings.php:
$conf['apachesolr_index_unpublished'] = TRUE;

What this patch does is make the indexer status-insensitive by making it think all content is published. This patch covers indexing only, NOT search. So if you need to index unpublished nodes, you could use this patch as a starting point.

Comment #49

torgospizza commented 21 September 2015 at 19:53

We also have a need for this, as we use Apachesolr to index a list of Commerce Products and their purchasers, and we use those values to alter the display of e.g. our catalog pages. However sometimes we want to allow access to unpublished nodes for some premium members (Kickstarter backers) - keeping their "exclusive" products hidden from other users, but allowing those products to still be indexed.

This patch looks like a good method, although I think ideally we should be allowed to drupal_alter() the reindexing callback. I do realize that you can use a hook_entity_info_alter to add or replace the node reindex callback specified for a bundle, but that seems like using a tank to hit a thumbtack.

EDIT: It looks like we can use hook_apachesolr_exclude() to do this very thing.

Comment #50

roynilanjan commented 25 January 2016 at 06:49

Following @torgosPizza's suggestion, a custom implementation of hook_apachesolr_entity_info_alter seems like a good idea.
A workaround could look like this:

function [MODULE]_apachesolr_entity_info_alter(&$entity_info) {
  // Append a custom status callback for nodes.
  $entity_info['node']['status callback'][] = 'apachesolr_node_status_callback';
}

function apachesolr_node_status_callback($entity_id, $entity_type) {
  $node = node_load($entity_id);
  if (!$node) {
    // Nothing to index if the node cannot be loaded.
    return FALSE;
  }
  if ($node->type == 'specific') {
    // This callback accepts the 'specific' type whether it is published or not.
    return TRUE;
  }
  // All other types: accept only published nodes.
  return $node->status == NODE_PUBLISHED;
}

Comment #51

roynilanjan commented 25 January 2016 at 06:23

Status:	Closed (works as designed)	» Needs review

Comment #52

25 January 2016 at 06:30

The last submitted patch, 37: 632282.patch, failed testing.

Comment #53

25 January 2016 at 06:30

The last submitted patch, 42: 632282-apachesolr_access.patch, failed testing.

Comment #54

greentee commented 1 June 2018 at 09:37

Status	File	Size
new	apachesolr-index-unpublished-nodes.patch	619 bytes

Hi,
I successfully use this patch, thank you.
But I found one case that needs to be improved.
When I create a new node and it is not in Published status, it won't be indexed by Solr.
That's because there is an additional check on entity status in hook_entity_insert().
