refactor search node_rank with hook_node_rank scoring factors

douggreen - May 19, 2007 - 11:36
Project:Drupal
Version:7.x-dev
Component:node.module
Category:feature request
Priority:normal
Assigned:douggreen
Status:fixed
Description

See http://groups.drupal.org/node/4105. The summary is:

  • modules such as voting modules or other metrics systems can inject their own scoring factors. Wouldn't it be cool if the votes a node has received could improve its positioning in the search results?
  • it cleans up node.module code

The technical details are:

I've create a new hook_node_rank() function which returns an array of ranks which allows each hook_node_rank to define multiple rankings. The lower level array defines title, join, score, and terms; terms is an array; join and terms are optional. For example, hook_node_rank for comment.module:

function comment_node_rank() {
  return array(
    'node_rank_comments' => array(
      'title' => t('Number of comments'),
      'join' => 'LEFT JOIN {node_comment_statistics} c ON c.nid = i.sid',
      'score' => '2.0 - 2.0 / (1.0 + c.comment_count * %f)',
      'terms' => array(variable_get('node_cron_comments_scale', 0)),
    ),
  );
}

I moved the existing code out of node_search and implemented three hook_node_rank functions (node_, comment_, and statistics_). I think that this refactoring makes more sense, moving the code to the module that actually defines it. Thus, this patch impacts node.module, comments.module, and statistics.module.

You might notice that the weights are not included in the definition. I was able to easily extend the admin/settings/search page with all of the 'node_rank' variables, and then automatically add these to the generated query. This makes it a lot easier to add new node rankings, since the system handles the UI.

Lastly, I wrapped the score values with the SQL COALESCE(..., 0) because during testing of a 5.x system, I encountered NULL score values (I think related to comments) that we're messing up the results.

AttachmentSize
node_rank.patch8.46 KB

#1

nedjo - May 23, 2007 - 14:10

+1 in principle (haven't tested). This is part of what we need to do all over core: remove the hard-coded special treatment given to core modules (in this case, comment and statistics). Enabling contrib modules to affect scoring will indeed be a significant benefit as we get e.g. voting and other user-driven ranking parameters.

This also opens the door to other uses of node ranking outside search.

#2

douggreen - June 7, 2007 - 02:01

Updated patch, fixing 'field' reference that should of been 'score'...

AttachmentSize
145242.patch8.27 KB

#3

Scott Reynolds - September 8, 2007 - 04:04

subscribing. Great idea

#4

Zothos - September 10, 2007 - 15:23

definitely +1 for the idea. :)

#5

douggreen - September 11, 2007 - 06:33

I'm updating patch to current Drupal head. I think we're probably too late to get this into 6.x, but it doesn't break any api's. Anyone care to test and mark it RTBC?

AttachmentSize
145242_0.patch8.32 KB

#6

nedjo - September 12, 2007 - 20:30

Minor comments on the code, I haven't tested.

Looks very good overall.

The fact we're attaching this new hook to node rather than search reflects a general weakness in search module (the need to search separately for different types of objects, the relatively full implementation for node and lack thereof for other object types like user, taxonomy term, etc.). It'd be nice to see a solution that moved us toward general search support rather than overdeveloped node support, but this is at least a good step toward generalizing the hard-codedness.

Code comments should be full sentences (capitalized, periods or other punctuation).

I find the use of the word 'terms' here confusing, e.g.,

<?php
+      $rankings['terms'][] = $node_rank;
?>

They are fed into a function argument called $argument2. These seem to be arguments fed into the SQL? If so, let's call them 'arguments'. If not, they need some code comment to explain.

#7

nedjo - September 12, 2007 - 22:03

A couple of possible longer term directions:

1. Move ranking from node to search module, generalizing to accept a 'type' argument. Search is the only place I know of in core where we have built-in support for multiple object types (node, user, etc.). Compare e.g. the node_access system, hard-coded to just nodes. On the surface, generalizing hook_node_rank() to be hook_search_rank() would seem fairly simple: add a $type argument (e.g., 'node'), add a switch ($type) test to implementations. But adding e.g. a user ranking implementation would of course be more work.

2. Factor out the ranking into e.g. rank.module so it can be used independently of search. There are many other cases in which we wish to rank content (or users) by a synthesis of multiple criteria, e.g., a "recommended content" block.

#8

douggreen - September 12, 2007 - 22:40

This is definitely an incremental improvement. $arguments2 is part of the current core search code. When submitting patches, I frequently try to change as little as possible. Are suggesting that I/we discuss more significant changes.

I agree that hook_search_rank is a better name.

I don't know that it could be generalized, however, because the rank methods actually return SQL that is related to the search. For example, the 'join' array element.

You say that you find the use of the word "terms" confusing in one place. I've used the array keys of 'title', 'join', 'score', and 'terms'. I agree that 'terms' should probably be renamed to something like 'join args'.

#9

nedjo - September 12, 2007 - 23:29

My feeling is, if this goes in for 6.x, great, it's definitely an incremental improvement (though it's hard to say if it's enough of a priority to make it in at this point). If not, we might consider some more generalization. We have a core search system that's well developed for nodes and almost nonexistant for anything else. User searching is a skeletal "SELECT name, uid FROM {users} WHERE LOWER(name) LIKE LOWER('%%%s%%')". There's zip AFAIK for taxonomy terms. Moving logic from node to search may be a step toward more balanced implementations.

I don't know that it could be generalized, however, because the rank methods actually return SQL that is related to the search. For example, the 'join' array element.

If we wanted to consider moving ranking to a separate implementation rather than tying it to search, I guess we'd need to do something similar to what we do with db_rewrite_sql(), that is, feed in a table alias and field name. In the case of node ranking, that would be 'i', 'sid' for search and 'n', 'nid' for node.

#10

nedjo - September 18, 2007 - 22:57

I've taken a first crack at sketching in a generalized ranking system based on this patch. See this issue http://drupal.org/node/173593 on Content Recommendation Engine (CRE) and this module in my sandbox http://cvs.drupal.org/viewvc.py/drupal/contributions/sandbox/nedjo/modul.... I'm hoping this might go in the short term into CRE.

In the longer term, based on this experience, I am thinking we could probably factor out search's ranking into a small, general core ranking API that search and other modules could implement and extend.

Whether or not we do so, this patch on its is a positive step in the right direction.

#11

catch - October 26, 2007 - 16:13
Status:patch (code needs review)» patch (code needs work)

This is a very nice idea but it no longer applies.

#12

douggreen - November 27, 2007 - 13:18
Version:6.x-dev» 7.x-dev
Assigned to:Anonymous» douggreen

This issue will be one of my focus areas during the 7.x release cycle.

#13

dropcube - February 11, 2008 - 22:10

Subscribing.

#14

robertDouglass - April 15, 2008 - 08:22
Status:patch (code needs work)» patch (code needs review)

Rerolled against HEAD, tested, fixed a bug with an unbalanced (, and somewhat addressed Nedjo's concern about the variable name (I renamed term into factor since these are referred to scoring factors elsewhere).

AttachmentSize
hook_node_rank.patch8.31 KB

#15

douggreen - May 9, 2008 - 14:54
Status:patch (code needs review)» patch (code needs work)

LIVE FROM THE MINNESOTA SEARCH SPRINT, the attached patch does:

  1. rerolled to current Drupal HEAD
  2. changes the new hook to be named hook_ranking which is still consistent with the original verbage in the code, but is a better hook name without an underscore
  3. enforce the variable names to always be prepended as 'node_rank_', and then removes these from the new hook_ranking array index. This change will make it so that a hook_ranking implementor can't pollute other variable name spaces

This is still a work in progress, but something we should finish this morning.

Next, we need to review all of the existing core modules, and see if there is any low hanging fruit to further extend the search rankings.

AttachmentSize
145242.patch8.27 KB

#16

BlakeLucchesi - May 9, 2008 - 15:30

* Tests Needed
- Relevancy: Two nodes with similar nodes, one node has more instances of the same search keyword than the other.
- Recency: Two nodes with the exact same content; different timestamp.
- Statistics: Two nodes with the exact same content; different number of views.
- Comments: Two nodes with the exact same content; different number of comments.
- Promoted: Two nodes with the exact same content; one node is promoted.
- Sticky: Two nodes with the exact same content; one node is marked sticky.

#17

douggreen - May 9, 2008 - 16:01

LIVE FROM THE MINNESOTA SEARCH SPRINT, this fixes the patch, which was not searching properly. If the hook_ranking 'score' uses SQL arguments, this is now renamed to 'arguments' and the code is fixed to build the proper SQL arguments. It also removes factors with a value of 0 from the SQL.

AttachmentSize
145242.patch8.53 KB

#18

BlakeLucchesi - May 10, 2008 - 00:55

Re-rolled to make sure that there is no division by zero error when normalizing the score factors.

AttachmentSize
145542_3.patch9.32 KB

#19

cwgordon7 - May 10, 2008 - 15:22

Aren't we setting ourselves up for conflicts if we use table abbreviations such as {node_comment_statistics} c? Perhaps the hook should be refactored to force node_comment_statistics.nid notation? Something like:


<?php
function comment_ranking() {
  return array(
   
'comments' => array(
     
'title' => t('Number of comments'),
     
'join' => array(
       
'table' => 'node_comment_statistics',
       
'type' => 'left',
       
'on' => 'node_comment_statistics.nid = search_index.sid'
     
),
     
'score' => '2.0 - 2.0 / (1.0 + node_comment_statistics.comment_count * %f)',
     
'arguments' => array(variable_get('node_cron_comments_scale', 0)),
    ),
  );
}
?>

#20

douggreen - May 10, 2008 - 16:39

Rerolled with a simpletest (original version courtesy of Blake and Chad). I think that the test case has uncovered a problem in the code... it's either that or the test.

AttachmentSize
145242.patch9.55 KB

#21

douggreen - May 10, 2008 - 17:31

This update restores part of the original patch that was lost. The test is still failing the "recent", "comments", and "views" (statistics) tests.

AttachmentSize
145242.patch12.58 KB

#22

douggreen - May 10, 2008 - 18:55

This patch fixes the "recent" test failure, which was only a problem with the test case, and not the patch code.

AttachmentSize
145242.patch12.52 KB

#23

douggreen - May 10, 2008 - 19:55

And now, a finally working version, complete with simpletest. I'm marking as CNR, but I think cwgordon7 has a good point that we should consider before RTBC.

To cwgordon7's point, the existing search joins do use aliases, and one just needs to know to work around them. However, since we're opening this up to contrib, you could work around the core aliases, but you'll never know what another contrib module might alias. In this case, I think that standard should be to not use aliases at all. And if that's the standard for contrib, we should probably do this in core too.

AttachmentSize
145242.patch12.96 KB

#24

douggreen - May 10, 2008 - 19:55
Status:patch (code needs work)» patch (code needs review)

#25

BlakeLucchesi - May 10, 2008 - 20:04

agree to to reroll without aliases.

#26

douggreen - May 10, 2008 - 20:17

Agreed, the solution then is to just not introduce SQL table aliasing in hook_ranking. While we could do this in core (because everyone will know that core implements certain aliases and will avoid them), core is also an example to contrib. We don't want contrib doing this, so we just shouldn't do it in core. I've retested and rerolled without the aliases in comment_ranking and statistics_ranking.

AttachmentSize
145242.patch13.05 KB

#27

robertDouglass - May 10, 2008 - 20:37
Status:patch (code needs review)» patch (reviewed & tested by the community)

Nice work. Great new opportunities for contrib modules to interact with search.

#28

BlakeLucchesi - May 11, 2008 - 15:29
Status:patch (reviewed & tested by the community)» patch (code needs work)

Changing to "needs work" because we are going to add another ranking factor which increases the score of nodes with incoming links.

#29

douggreen - May 11, 2008 - 16:27
Status:patch (code needs work)» patch (reviewed & tested by the community)

And after further discussion, the new ranking factor is really a separate patch that is going to have to wait until this one commits.

#30

coupet - May 11, 2008 - 16:55

Highly interesting. subscribing.

#31

douggreen - May 14, 2008 - 01:36

Dries commented on #256792, when I think he intended to comment here.

  • We don't use the word 'Node' in output.
  • We write 'Node is sticky' not 'Node is Sticky'.
  • Code comments are inconsistent.
  • It would be useful to document the score functions (i.e. the math).

So, I re-rolled this patch changing "Node is Sticky" to "Content is sticky at top of lists" and changing "Node is Promoted" to "Content is promoted to the front page". I made sure that all of the ranking scores had comments, which were accidentally lost in the refactoring. I made sure that all of the new comments were sentences. I also tried to improve the comments in the new _node_rankings() function. And lastly I diffed the old patch with the new one, just to make sure that only comments and strings were changed. I did this hoping we could leave it RTBC.

AttachmentSize
145242.patch13.67 KB

#32

Dries - May 14, 2008 - 11:11
Status:patch (reviewed & tested by the community)» fixed

Committed to CVS. Thanks Doug!

 
 

Drupal is a registered trademark of Dries Buytaert.