From the drupal.org upgrade sprint in Boston.

One thing we need for d.o is a bias on the node type so that project nodes appear first in the search results (have you tried recently to search for "views" or "cck" here?).

This can't really be done at query time (we can't map types to weights using any Solr function), so we may need a small schema modification to do that.

Comments

damien tournoud’s picture

Status: Active » Needs review
Attachment: status new, size 5.43 KB

Here is a first patch for this. Do we need a steepness, in addition to the boost?

pwolanin’s picture

Not sure why a steepness would be relevant here if you can set the boost per content type.

Why this instead of just saving an array and letting it be serialized/unserialized?

list($type_steepness, $type_boost) = explode(':', $type_boost);
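For comparison, here is a minimal sketch of the two storage approaches in plain PHP. The setting names and values are hypothetical and are not taken from the patch; the point is only that an array survives a serialize/unserialize round trip without any manual packing or splitting.

```php
<?php
// Hypothetical settings: node type => boost settings (sample values only).
$settings = array(
  'project_project' => array('steepness' => '2', 'boost' => '5.0'),
);

// The packed-string form has to be split by hand on every read.
list($type_steepness, $type_boost) = explode(':', '2:5.0');

// The array form round-trips unchanged through serialize()/unserialize().
$restored = unserialize(serialize($settings));
print $restored['project_project']['boost'] . "\n";
```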

Also, per Jacob we need a setting (potentially) to totally exclude certain node types from indexing.

damien tournoud’s picture

Attachment: status new, size 5.73 KB

Following an IRC discussion with Peter, here is a new version:

- use bf queries to set the type-specific boosts
- allow nodes to be omitted from indexing completely
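
As a rough illustration of the first point, type-specific boosts can be expressed as boost terms that are sent along with each search, so no re-indexing is needed when a boost changes. The sketch below assumes Solr's dismax `bq` parameter with terms of the form `type:name^boost` and a field named `type`; the patch itself may build its parameters differently. The function name and the sample boost values are hypothetical.

```php
<?php
/**
 * Builds boost-query terms from a map of node type => boost.
 * Hypothetical helper for illustration; not part of the patch.
 */
function example_type_boost_terms(array $type_boosts) {
  $terms = array();
  foreach ($type_boosts as $type => $boost) {
    // A boost of 0 means "no bias": send nothing for this type.
    if ((float) $boost > 0) {
      $terms[] = 'type:' . $type . '^' . $boost;
    }
  }
  return $terms;
}

// Sample values only.
$terms = example_type_boost_terms(array(
  'project_project' => '5.0',
  'forum' => '0',
  'book' => '2.0',
));
// Each term would be sent as a separate boost parameter with the query.
print implode("\n", $terms) . "\n";
```

Types with a boost of 0 produce no term at all, which keeps the query unchanged for sites that never configure a bias.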

damien tournoud’s picture

Attachment: status new, size 5.77 KB

And a fixed version.

JacobSingh’s picture

nice implementation! This feature will be killer.

I feel there is a usability issue here though. It's certainly good to remove node types that are not going to be queried; however, we need to warn users that nodes are being removed from the index when they omit the boost. I imagine many users would assume that "Omit" means "do not boost it", not "remove it".

Also, when they turn on a node type which they had previously not been using, we should warn them and/or force a re-index so they are not confused.

What do you think? How should we make this clear?

JacobSingh’s picture

Wait, I just reviewed again. I didn't notice you set the previously omitted nodes to be re-indexed.

Sorry :/

I'll go back through and do a proper review with the patched code later on today.

pwolanin’s picture

I think that a setting to omit from the index should really be a separate settings form from the boost. We should, by default, not add any bq for content types if we can avoid it.

damien tournoud’s picture

This still needs review, but this is wrong:

+      $solr->deleteByQuery("type:$type");

This should take the site hash into account too.
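
To illustrate the concern, a delete query scoped to one site could combine both conditions. This is only a sketch: the `hash` field name and the helper function are assumptions for illustration, and the sample hash value is made up.

```php
<?php
/**
 * Builds a delete query limited to one site and one node type.
 * Hypothetical helper; assumes the index stores the site hash in a
 * field named "hash" alongside the "type" field.
 */
function example_type_delete_query($site_hash, $type) {
  return 'hash:' . $site_hash . ' AND type:' . $type;
}

// Made-up values for illustration.
$query = example_type_delete_query('abc123', 'forum');
print $query . "\n";
```

The resulting string would then be passed to `$solr->deleteByQuery()` in place of the bare `"type:$type"` query, so that documents of the same type from other sites sharing the index are left alone.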

pwolanin’s picture

Ah, sure, at least if you think you might have a multi-site index.

Note, however, that the delete index operation is not limited currently to the current site - so we could go with this for now, but handle it better when multi-site support goes back in.

pwolanin’s picture

Attachment: status new, size 6.93 KB

Here's a better patch that separates boosts from exclusion - also correctly handles the case where we 'Reset to defaults'

dww’s picture

Title: Add a biais on node type » Add a bias on node type
Issue tags: +drupal.org upgrade

I'll see if I can make time to review/test this, but I can't promise I will with all the other d6 upgrade issues on my plate... ;)

pwolanin’s picture

Title: Add a bias on node type » Add a bias on node type (and node-type exclusion)

pwolanin’s picture

Status: Needs review » Fixed

committed to 6.x

dreed47’s picture

I installed this patch and I see two issues with the node exclusion part.

First, the admin page at /admin/settings/apachesolr/index shows a count of all nodes in the system as though they are all to be indexed, even though I've set some node types to be excluded.

Second (and much more important), it seems as though the cron job is pulling nodes that should be excluded. For example, I have it set to process 50 nodes per cron run, and it pulls the first 50 that it comes to. They are all excluded node types, so it indexes nothing and waits until the next cron run. For many people this may not be an issue, but for my current use case it is. Say I have 10k nodes of a type that I don't want to index and 1k nodes of a type that I do. The cron indexing should not have to loop through all 11k nodes.

dww’s picture

Status: Fixed » Active

Haven't confirmed myself, but sounds like #14 brings up an important bug in how this works. ;)

pwolanin’s picture

@dww - yes, I'm aware of those issues - already had imagined we might need a follow-on patch. I'm not convinced that the node_load of non-indexed nodes is a problem, but much more serious is that indexing may hang forever if all the nodes selected for indexing on a given cron run are excluded.
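
A minimal sketch of how a cron batch could avoid the hang, assuming the indexer remembers a "last processed" position between runs. The function name, the row layout, and the position format are all hypothetical and are not taken from the module; the point is that the position advances past every row examined, even when none of them is indexed.

```php
<?php
/**
 * Splits one cron batch into node IDs to index and a position to resume from.
 * Hypothetical helper. $rows is a list of arrays with 'nid', 'type' and
 * 'changed' keys, already ordered the way the indexer walks the nodes.
 */
function example_process_batch(array $rows, array $excluded_types) {
  $to_index = array();
  $position = NULL;
  foreach ($rows as $row) {
    if (!in_array($row['type'], $excluded_types, TRUE)) {
      $to_index[] = $row['nid'];
    }
    // Advance past every row examined, whether it was indexed or skipped.
    $position = array('changed' => $row['changed'], 'nid' => $row['nid']);
  }
  return array('to_index' => $to_index, 'position' => $position);
}

// A batch made up entirely of an excluded type (sample data).
$batch = array(
  array('nid' => 1, 'type' => 'forum', 'changed' => 100),
  array('nid' => 2, 'type' => 'forum', 'changed' => 101),
);
$result = example_process_batch($batch, array('forum'));
print count($result['to_index']) . ' node(s) to index, resume after nid ' . $result['position']['nid'] . "\n";
```

With this shape, a batch of excluded nodes yields nothing to send to Solr, but the next cron run starts after the last row examined instead of selecting the same nodes again.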

pwolanin’s picture

Status: Active » Needs review
Attachment: status new, size 721 bytes

This might be a sufficient fix to prevent the really critical bug.

pwolanin’s picture

Attachment: status new, size 2.86 KB

A little better refactoring - a separate hook for node exclusion.

damien tournoud’s picture

Looks like a good idea at first sight.

I commented (on IRC) on the previous version of the patch that we probably don't want to output:

watchdog('Apache Solr', 'Adding @count documents.', array('@count' => count($documents)));

When count($documents) == 0 ;)

dww’s picture

Yup, +1 on the concept here. Code appears good on visual inspection, though I haven't tested it. I guess I should really set up a solr instance on my laptop to test stuff like this. ;)

pwolanin’s picture

@Damien - is it bad to watchdog that we sent 0? might help with debugging. Committing as is - we can revisit the watchdog call if needed.

@dww - it's really easy to run the example (Jetty) server locally. grab me in IRC if you want assistance.

pwolanin’s picture

Status: Needs review » Fixed

see follow-up patch: http://drupal.org/node/370796

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

kentr’s picture

One thing we need for d.o is a bias on the node type so that project nodes appear first in the search results (have you tried recently to search for "views" or "cck" here?).

This can't really be done at query time (we can't map types to weights using any Solr function)

Just want to confirm: the content type bias is only applied at index time, not at query time (so I must re-index to see the effect)?

Thanks.

[Edit] Found the answer: Content type bias is done at query time.