.. even those that are set not to be indexed.

A site we're building has around 10000 nodes, and almost 6000 of them are profile nodes that we don't index, as set in the Solr settings. These nodes are still added to the apachesolr_search_node table, which makes Solr think that it has 10000 nodes to index, not 4000. As it turns out, when I had to rebuild the index, Solr went through all 10000 nodes; the first 6000 added nothing to the index but still took up cron time from the nodes that should be indexed.

Is this intended behaviour?

Comments

pwolanin’s picture

Probably not ideal. I think we need to add some extra logic to the admin form so we can track when there is a state change.

blackdog’s picture

Title: apachesolr_search_node table conatins all nodes... » apachesolr_search_node table contains all nodes...
Version: 6.x-1.0-beta5 » 6.x-1.0-beta7
Status: Active » Needs review
Attached file (status: new, size: 1.6 KB)

I've added some checks to apachesolr.module to make sure that nodes that are not intended to be indexed aren't added to the table.

pwolanin’s picture

Status: Needs review » Needs work

If you change the settings, the formerly excluded nodes will never get into the index.

It might be better to either add another DB column or use the status column.

blackdog’s picture

Status: Needs work » Needs review

@pwolanin - isn't this snippet in function apachesolr_search_type_boost_form_submit doing exactly that - adding nodes if settings are changed:

foreach ($old_excluded_types as $type => $excluded) {
  // Set no longer omitted node types for reindexing.
  if (empty($new_excluded_types[$type]) && !empty($old_excluded_types[$type])) {
    db_query("UPDATE {apachesolr_search_node} SET changed = %d WHERE nid IN (SELECT nid FROM {node} WHERE type = '%s')", time(), $type);
  }
}

pwolanin’s picture

UPDATE

blackdog’s picture

Status: Needs review » Needs work

Ahh.

I'm not really following why a new DB column would be needed for this. Wouldn't it work to just rewrite the above submit function to INSERT instead?

pwolanin’s picture

In that case, you need to either delete or insert them when the setting is changed. Either approach could work.

blackdog’s picture

Version: 6.x-1.0-beta7 » 6.x-1.x-dev
Status: Needs work » Needs review
Attached file (status: new, size: 3.51 KB)

Updated patch adds logic to the submit function to add and delete nodes from apachesolr_search_node when settings change.

blackdog’s picture

Any reviewers on this?

pwolanin’s picture

+  $exclude = array();
+  foreach ($excluded_types AS $type) {
+    if ($type != '0') {
+      $exclude[] = $type;
+    }
+  }

'AS' -> 'as'

Use !empty() rather than != '0', which is too implementation-specific.

You might also iterate through all existing types and check if they are in the excluded list.
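
The review points above could be applied roughly as follows. This is a minimal sketch in plain PHP, not the code from the patch: the function name and the sample data are hypothetical, and it assumes the setting is keyed by type name with an empty value (0, '0', '' or NULL) meaning "not excluded". In Drupal 6 the list of existing types would presumably come from node_get_types().

```php
<?php
/**
 * Hypothetical helper: returns the machine names of excluded node types.
 *
 * Iterates over all existing types and checks each one against the
 * excluded-types setting, instead of comparing values with '0'.
 */
function example_excluded_type_names(array $all_types, array $excluded_types) {
  $exclude = array();
  foreach ($all_types as $type) {
    // !empty() treats 0, '0', '' and NULL (or a missing key) as "not excluded".
    if (!empty($excluded_types[$type])) {
      $exclude[] = $type;
    }
  }
  return $exclude;
}

// Sample data (assumed shape of the stored setting).
$all_types = array('page', 'profile', 'story');
$settings = array('page' => 'page', 'profile' => 'profile', 'story' => 0);
print_r(example_excluded_type_names($all_types, $settings));
```

Iterating over the known types also means a stale key in the setting for a type that no longer exists is simply ignored.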

pwolanin’s picture

Status: Needs review » Needs work

+  $total = db_result(db_query("SELECT COUNT(*) FROM {node} WHERE type NOT IN('".implode(',', $exclude)."') AND status = 1"));

This has code-style problems and should use placeholders and pass the arguments to the query rather than directly imploding into the query.
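
To illustrate the placeholder approach: imploding with ',' inside a single pair of quotes, as in the line quoted above, produces one string such as 'page,profile' rather than a list of values. A minimal sketch of the alternative is below, in plain PHP so it runs on its own; the function name is hypothetical. In Drupal 6 the placeholder string could instead be generated with db_placeholders($exclude, 'varchar'), and the result passed to db_query().

```php
<?php
/**
 * Hypothetical helper: builds the count query with one '%s' placeholder per
 * excluded type, returning the SQL and its arguments separately.
 */
function example_build_count_query(array $exclude) {
  $sql = "SELECT COUNT(*) FROM {node} WHERE status = 1";
  $args = array();
  if (!empty($exclude)) {
    // One quoted placeholder per value, e.g. "'%s', '%s'".
    $placeholders = implode(', ', array_fill(0, count($exclude), "'%s'"));
    $sql .= " AND type NOT IN ($placeholders)";
    $args = array_values($exclude);
  }
  return array($sql, $args);
}

list($sql, $args) = example_build_count_query(array('page', 'profile'));
// In Drupal 6 these would then be handed over together, for example:
// $total = db_result(db_query($sql, $args));
echo $sql, "\n";
```

Skipping the NOT IN clause when nothing is excluded avoids generating an empty "NOT IN ()" list.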

pwolanin’s picture

Looking at this - it might just be simpler to join to the node table and add an extra WHERE clause, rather than doing all this inserting/deleting.
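
A rough sketch of what such a JOIN could look like, assuming only the nid and type columns that appear in the queries quoted earlier in this thread. The function name and table aliases are hypothetical, and this is not the query from the attached patch.

```php
<?php
/**
 * Hypothetical helper: counts rows in {apachesolr_search_node} whose node
 * type is not excluded, by joining to {node} instead of inserting or
 * deleting rows in the tracking table.
 */
function example_build_joined_count_query(array $exclude) {
  $sql = "SELECT COUNT(*) FROM {apachesolr_search_node} asn"
    . " INNER JOIN {node} n ON n.nid = asn.nid";
  $args = array();
  if (!empty($exclude)) {
    // The extra WHERE clause filters on the joined node type.
    $placeholders = implode(', ', array_fill(0, count($exclude), "'%s'"));
    $sql .= " WHERE n.type NOT IN ($placeholders)";
    $args = array_values($exclude);
  }
  return array($sql, $args);
}

list($sql, $args) = example_build_joined_count_query(array('profile'));
echo $sql, "\n";
```

With this approach the tracking table keeps a row for every node, and changing the excluded types only changes which rows the query selects.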

pwolanin’s picture

Status: Needs work » Needs review
Attached file (status: new, size: 5.81 KB)

jody lynn’s picture

Looks useful. Read through the code and can test tomorrow.
The function _apachesolr_exclude_types needs a code comment.

blackdog’s picture

Sorry I haven't reviewed this yet, will get to it asap!

blackdog’s picture

Patch works as intended.

Applied patch, set node type Page as excluded. At next Solr commit, nodes of type Page are deleted, and no new Pages are added. Unsetting Page as excluded adds the nodes back to the index.

Awaiting Jody Lynn's review to RTBC this.

Thanks for looking into this pwolanin!

pwolanin’s picture

Slight enhancement - add a daily check that we did not fail to delete any excluded nodes. Though I'm a little unsure about the namespaces thing - maybe this check should be in apachesolr_search for just its excluded types?

JacobSingh’s picture

Looks good.

pwolanin’s picture

Status: Needs review » Needs work

Thinking more about this - the code to catch any failed delete should not be in the framework. Our one main use case for namespaces has been node attachments. So, for example, if I exclude attachments on 'story' nodes, that does NOT mean that all 'story' nodes should be deleted from the index.

jody lynn’s picture

We tested the patch in #13:
It worked, but the apache solr index table never seems to get cleaned out, so it still has info from all the node types that have been excluded but were previously indexed (even after index deletion).

pwolanin’s picture

@Jody - that's by design. The latest patch leaves all the nodes in the table, but just excludes them from indexing via the SQL JOIN.

pwolanin’s picture

Title: apachesolr_search_node table contains all nodes... » excluded node types should be skipped during counting and indexing from {apachesolr_search_node}

better title

pwolanin’s picture

Status: Needs work » Needs review

Actually, I think the patch in #13 might be good enough. Perhaps we could remove the check empty($old_excluded_types[$type]) here so that admins can re-submit the form to resend delete queries if they fail.

  foreach ($new_excluded_types as $type => $excluded) {
    // Remove newly omitted node types.
    if (!empty($new_excluded_types[$type]) && empty($old_excluded_types[$type])) {
      $solr = apachesolr_get_solr();
      $solr->deleteByQuery("type:$type");
    }
  }
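
For illustration, the loop with that check removed might look like the sketch below. It is written against a small stand-in class so it can run outside Drupal: ExampleRecordingSolr and the function name are hypothetical, and only the deleteByQuery() call is taken from the snippet above. This is not necessarily what was committed.

```php
<?php
/**
 * Stand-in for the Solr service object: records delete queries instead of
 * sending them, so the sketch runs without a Solr server.
 */
class ExampleRecordingSolr {
  public $deleted = array();

  public function deleteByQuery($query) {
    $this->deleted[] = $query;
  }
}

/**
 * Hypothetical helper: sends a delete query for every excluded type.
 */
function example_delete_excluded_types($solr, array $new_excluded_types) {
  foreach ($new_excluded_types as $type => $excluded) {
    // No comparison with the old setting: every currently excluded type gets
    // a delete query on each submit, so a failed delete can be retried.
    if (!empty($excluded)) {
      $solr->deleteByQuery("type:$type");
    }
  }
}

$solr = new ExampleRecordingSolr();
example_delete_excluded_types($solr, array('page' => 'page', 'story' => 0));
print_r($solr->deleted);
```

The trade-off is that every submit repeats the delete for types that were already removed, which is redundant but harmless if the delete is idempotent.
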
pwolanin’s picture

ok, with updated README too.

pwolanin’s picture

Status: Needs review » Fixed

committed to 6.x

pwolanin’s picture

Status: Fixed » Closed (fixed)