Existing system

Search.module currently tracks links to nodes, and indexes the words in the link's text as part of the target node. Thus, if someone links to the magic quotes FAQ with <a>more info about magic quotes</a> then the target node will get a higher ranking for a query containing any of the words "more info about magic quotes". However, there are two problems with this:

1) Most people don't provide a good caption for their link ("click here" or even just the URL).

2) Sometimes there are a bunch of names for a particular subject and the user used a less common name (e.g. teaser, excerpt, summary).

This means that the link hint of "this page is important" often does not come into play simply because the search query doesn't match any linked keywords.

The search query goes like this:
SELECT i.sid, i.type, SUM(i.score / t.total) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word WHERE (...keyword matching...) GROUP BY i.sid, i.type ORDER BY score DESC

Thus it selects all matching keywords for the query, groups them by source (a sid/type pair, e.g. "node" + nid) and sums the (relative) scores per source.

For example when searching for "Drupal 4.6.0" one might get:

# word sid fromsid type score total sum (per sid/type)
#1 drupal 1 NULL node 6 10 6/10 + 2/5 + 1/10 = 1.1
#2 460 1 NULL node 2 5
#3 drupal 1 3 node 1 10
#4 drupal 2 NULL node 3 10 2/10 = 0.2

#3 is a link from node 3 to node 1 with the word "drupal". The other rows represent words that appeared literally in the result node.

Note that these are the rows before grouping and summation. MySQL will only return two rows, where the first row combines #1,#2,#3 (node 1 with score 1.1) and the second row is only for #4 (node 2 with score 0.2).

Changes

This patch improves the results by assigning a de-facto score boost to nodes which are linked to. Only the node itself still has to match one of the given keywords, not the link texts. The node receives a boost proportional to how many times it has been linked to.

It fits well within the existing search.module idea of trying to ensure the top results are relevant rather than focusing on making all results relevant. It does not obsolete the old mechanism, but complements it.

Implementation-wise, this is a bit tricky.

Issues

The new, generic link boost should however only be applied once per item, not once per keyword. And if an item does not match any keywords, we don't want it to show up in the results (even if it has links pointing to it).

There are two approaches to this:

- Store the link boost among the regular words with an empty word as a special marker. When selecting the results, we assume the empty word is implicitly part of the search query, but we need to only keep those sid/type pairs that have at least one non-empty word in them. The only way I can think of is to require that each indexed item has a matching row (with a 0 score if there are no links). Then we can do a COUNT(*) per item and require that each sid/type pair has at least two matching rows (one link boost + one real word). This would of course be quite wasteful as there are nowhere near as much links as there are nodes.

SELECT i.sid, i.type, COUNT(*) AS count, SUM(i.score / t.total) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word WHERE (i.word = '') OR (...keyword matching...) GROUP BY i.sid, i.type ORDER BY score DESC HAVING count > 1

- Store the link boost in a separate table to join on. As far as I know it's not possible to do one join per GROUP BY item, so I'd have to use a MAX() statement to reduce the link boost which is fetched per keyword to one score per sid/type pair. If we don't want an entry in search_links per item, then we have to use an if() statement to see if the join succeeded (0.02 is the score boost per link):

SELECT i.type, i.sid, (SUM(i.score/t.count) + MAX(if(j.sid IS NULL, 0, j.score))*0.02) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word LEFT JOIN search_links j ON i.sid = j.sid AND i.type = j.type WHERE (...keyword matching..) GROUP BY i.type, i.sid ORDER BY score DESC

In this patch I implemented a hybrid option: the link boosts are stored in the search_index table with an empty word, like in option 1. But like option 2, we use a join (on search_index itself) to fetch the boost and we don't store link counts for items without links.

SELECT i.type, i.sid, (SUM(i.score/t.count) + MAX(if(j.word IS NULL, 0, j.score))*0.02) AS score FROM search_index i INNER JOIN search_total t ON i.word = t.word LEFT JOIN search_index j ON i.sid = j.sid AND j.word = '' WHERE (...keyword matching...) GROUP BY i.type, i.sid ORDER BY score DESC

The reason I used the same table is because the links table would otherwise simply have the same DB structure as the search_index table. The word column is keyed, so restricting to j.word = "" should be fast.

While this patch works, I'm not too sure about the implementation. I'm not an SQL guru, so feedback is certainly needed on all aspects.

Results

I can't really try it out on Drupal.org because it requires re-indexing, but I do have some clues that it would provide an improvement: of all the linked keywords in the search index now, 1977 are for the word "http", which almost certainly means a literal URL. Thus these records are almost never used, unless someone actually searched for "http", even though they tell us something useful (the linked nodes were interesting).

CommentFileSizeAuthor
search.link.ranking.patch2.87 KBSteven

Comments

moshe weitzman’s picture

I suspect that mysql is going to slow down with the IF clauses in 2,3. They prevent DBMS from using indexes. I'm not sure about this though. Benchmarking of just queries would tell you.

Steven’s picture

Status: Needs review » Closed (won't fix)

Reconsidering this patch in a future incarnation.