We no longer need to use temporary tables. Avoiding them lets us generate more optimal SQL that can also be cached; the advantage of caching the SQL is that subsequent page requests are faster. With the current use of temporary tables, this is not possible.
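Roughly, the difference in query shape is this (purely illustrative; the real queries also carry scoring, the type column, and pager logic, and 'foo'/'bar' stand in for parsed search terms):

  -- Current approach (sketch): results are staged in a temporary table,
  -- so the work cannot be reused by the query cache.
  CREATE TEMPORARY TABLE temp_search_results
    SELECT i.sid, SUM(i.score) AS relevance
    FROM search_index i
    WHERE i.word IN ('foo', 'bar')
    GROUP BY i.sid;

  -- Proposed approach (sketch): one self-contained statement, using a
  -- subselect where an intermediate result is needed, which the MySQL
  -- query cache can serve directly on repeat requests.
  SELECT i.sid,
         SUM(i.score) /
         (SELECT SUM(t.count) FROM search_total t
          WHERE t.word IN ('foo', 'bar')) AS relevance
  FROM search_index i
  WHERE i.word IN ('foo', 'bar')
  GROUP BY i.sid
  ORDER BY relevance DESC;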

See http://groups.drupal.org/node/4106 for some history.

The attached patch makes several changes:

  1. adds a unique key to {search_index} - this is required for the search logic to work
  2. converts the mysql {search_index} table to MyISAM - my test results were much worse with InnoDB
  3. doesn't use {search_dataset} - I'm still not sure why it was needed, or whether I'm missing something by not using it
  4. changes search_parse_query() to return different $query[0], $query[1], and $query[3] values, as needed by the new search logic
  5. changes the negative logic
  6. adds i.fromsid=0 to the query - this is needed for proper results, but I'm not sure what the impact is
  7. uses an SQL IN clause instead of repetitive ORs, and builds this slightly differently (see the sketch after this list)
  8. removes the score normalization - this was harder to do without the intermediate temporary tables, and I'm not sure it is really necessary
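For item 7, the before/after shape (the word list is just an example):

  -- Repetitive ORs:
  SELECT sid FROM search_index
  WHERE word = 'foo' OR word = 'bar' OR word = 'baz';

  -- Equivalent IN clause:
  SELECT sid FROM search_index
  WHERE word IN ('foo', 'bar', 'baz');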

Questions:

  1. is this functionally equivalent?
  2. are subselects available in mysql 4.1, as I think they are? and are they available in pgsql? is this a reasonable requirement for Drupal 6.x?
  3. did removing search_dataset lose any functionality?
  4. is forcing MyISAM versus InnoDB for this one table acceptable?
  5. is there any negative side effect to using i.fromsid=0?
  6. is the SQL IN clause available in pgsql? MySQL EXPLAIN implied that this was identical and the code looks cleaner to me. Are there any issues with using the IN clause?
  7. do we need the score normalization?
  8. does update.php need to happen incrementally to help on large sites?

Some benchmarking is needed. I'd like to benchmark and publish the results against the d.o database. I'd also like some independent benchmarking and verification. My results on two large databases show significant speed improvement. What "significant" is depends on the number of query terms, and the size of the result set. Many tests show a 3x - 5x speed improvement.

Still to do:

  1. benchmarks and peer review
  2. add pgsql code
  3. update comment above do_search
Attachment: coresearch.patch (12.15 KB, by douggreen)

Comments

chx’s picture

All your questions re. subselects and IN -- no problems. Forcing MyISAM -- I can't really imagine this being a problem, but killes can answer that better. Reading system update 6018, I do not see anything that needs to be incremental.

moshe weitzman’s picture

subscribe

nedjo’s picture

Hey, thanks for giving some attention to this critical area. The use of subqueries makes good sense here and appears to be another area where MySQL 4.1 enables us to leave dated approaches behind.

Owen Barton’s picture

Subscribing

inforeto’s picture

subscribing

Steven’s picture

How are you doing phrase searching if you don't use search_dataset?

douggreen’s picture

Thanks Steven - I couldn't figure out why {search_dataset} was needed. I didn't really have a problem with it, except that it seemed unnecessary. I think it was always used, even when not searching for phrases. If it's only needed for phrase searching, can we just join with it when searching for phrases? I'll re-roll with the {search_dataset} restored and conditional logic for including it in the join only when phrase searching.

Steven’s picture

Also:

is there any negative side effect to using i.fromsid=0?

The goal of fromsid is to make inter-node links affect scoring. Words that are part of a link are counted as part of the target node's body. If the link text is just a URL, the target node's title is used instead. This means that often-linked pages tend to have a higher score.

However, this mechanism is ultimately flawed since it still relies on keyword matching. I've long wanted to replace it with a PageRank-like system, but there is no code for that yet. This would simplify a lot of code, and probably speed things up quite a bit too. It would simply be a LEFT JOIN on a {node_rank} table or somesuch.

Now, while this old mechanism isn't perfect, it did have a noticeable effect on the quality of results when I developed this (using a snapshot of drupal.org as a test base). So I'd prefer to keep this in, if possible.

However, if you can't get rid of the fromsid=0 restriction, then a lot more of the fromsid code and columns needs to be snipped.

do we need the score normalization?

Absolutely, for two reasons.

First, the literal TF/IDF score (relevancy) is measured on an arbitrary positive scale. The exact range depends on the language used, on the site's specific content and on the specific search query entered, and is hard to predict since there are logarithms involved. In order for the scoring factors to be weighted meaningfully, each must be normalized to 0..1. Imagine what would happen if one of the factors was in the range 0..2 instead: it would be as if its weight was doubled. Obviously, without normalization, the weighting feature will not work well.

Secondly, the actual final score, which is the weighted sum of all ranking factors, is passed along with the results. These scores can be shown in the theme, or exported e.g. through OpenSearch. The OpenSearch Aggregator in contrib uses this to multiplex an arbitrary number of search feeds together into one ordered result set. For this to work, it must be a (relatively) uniform score from 0 to 100%. If the individual scoring factors are not uniformly distributed or don't cover the entire range, the combined score will be useless for such purposes.

If you remove relevancy normalization, this would mean that depending on the query, the relevancy score would have more or less weight in the ordering of the results.

The old system solved the "arbitrary scale" problem by simply assuming that the most relevant result was 100% relevant.

While the possibility exists that you can predict the range of relevancy scores, I haven't looked into it myself, and I think it would be very hard for practical situations, since the corpora are usually small. And, you would need to do it before the query, since the relevancy weight is only one coefficient in the total score and cannot be corrected afterwards.

does update.php need to happen incrementally to help on large sites?

I don't see how you could split up those queries, since they affect the entire table. However, I always read that mysql execution time was not counted towards the PHP time limit.

Steven’s picture

Actually, here's an idea for normalization. Background in the TF/IDF ranking method is needed. Note that we use a variation on the classic implementation:

  • Instead of normalizing term scores by document length, additional terms in a document simply count for less (with an exponential decay). This model better suits the "page + comments" flow that we index and penalizes the common 1 line forum post.
  • Instead of using the ratio of documents containing a term as IDF, we use Zipf's law:

    Classic IDF:
    N = SELECT COUNT(sid) FROM search_index;
    IDF(T) = log(N / SELECT COUNT(sid) FROM search_index WHERE word = T)

    Zipf's law:
    IDF(T) = log10(1 + 1 / SELECT SUM(score) FROM search_index WHERE word = T)

See:
http://en.wikipedia.org/wiki/Zipf's_law
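To make the Zipf variant concrete, the per-word IDF can be computed straight from the index like this (illustrative only; in practice the value would be cached in search_total rather than recomputed per query):

  -- Zipf-style IDF for one word (sketch):
  SELECT LOG10(1 + 1 / SUM(score)) AS idf
  FROM search_index
  WHERE word = 'drupal';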

Neither affects the normalization, though, since all it cares about is the resulting TF and IDF factors. The scoring works as follows. I use the same notation as Bayesian probabilities: P(X|Y) means the chance of X, given (a certain) Y.

The relevancy of a matching item X for a query Q is the sum of the relevancy of all matching terms Ti for that query:

R(X|Q) = Σ_{Ti ∈ Q} R(Ti|X)

Note that the relevancy for an individual matching term T in an item X does not depend on the specific query Q. It is equal to:

R(T|X) = TF(T|X) * IDF(T)

Where:
(we ignore search_index.type for readability, and identify items purely by search_index.sid)

IDF(T) = (see above)
TF(T|X) = SELECT score FROM search_index WHERE word = T AND sid = X
(without taking into account inter-node word exchange)
TF(T|X) = SELECT SUM(score) FROM search_index WHERE word = T AND (sid = X OR fromsid = X)
(with inter-node word exchange)

Based on the above queries, we can see that each term score TF(Ti|X) has a maximum value, and by extension, so does R(Ti|X):

max_X[ TF(Ti|X) ] = SELECT MAX(score) FROM search_index WHERE word = Ti
max_X[ R(Ti|X) ] = max_X[ TF(Ti|X) ] * IDF(Ti)

This means that for a given query:

max_X[ R(X|Q) ] = max_X[ Σ_{Ti ∈ Q} R(Ti|X) ]

Now, because all numbers involved are strictly positive:

max_X[ R(X|Q) ] <= Σ_{Ti ∈ Q} max_X[ R(Ti|X) ]

We now have an upper limit for the total relevancy for a particular query Q which can be efficiently calculated by caching MAX(score|T) * IDF(T) in search_total.

We can calculate a non-uniform, normalized relevancy simply by dividing by this upper limit. The resulting relevancy numbers are guaranteed to lie in the range 0..1.
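As a sketch of what the query could look like, assuming a hypothetical {search_total} that caches a max_score and an idf column per word (core's search_total has neither today):

  -- Normalized relevancy in one statement (sketch): per-item score divided
  -- by the query-wide upper limit, guaranteeing a result in 0..1.
  SELECT i.sid,
         SUM(i.score * t.idf) /
         (SELECT SUM(t2.max_score * t2.idf)
          FROM search_total t2
          WHERE t2.word IN ('foo', 'bar')) AS normalized_relevance
  FROM search_index i
  INNER JOIN search_total t ON t.word = i.word
  WHERE i.word IN ('foo', 'bar')
  GROUP BY i.sid;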

At this point, we need to look at the actual statistical distribution of the relevancy in these results. Given that the above holds true regardless of the actual TF/IDF scores (which define the language model), it is not unreasonable to expect some consistent, relatively constant distribution here (similar to Zipf). If found, we can easily approximate an inverse transform, and get (somewhat) uniform, normalized relevancy scores with minimal extra effort.

Steven’s picture

Actually I made a notation error above: there is an assumption that the maximum score for a term T in a given result set will roughly be the same as the maximum score of that term across all items, whether they match or not.

The correction is:

max_X[ TF(Ti|X) ] <= max[ TF(Ti) ] = SELECT MAX(score) FROM search_index WHERE word = Ti
And thus assuming:
max_X[ TF(Ti|X) ] ≈ SELECT MAX(score) FROM search_index WHERE word = Ti

This is why the non-uniform numbers need to be processed further to be useful.

douggreen’s picture

Conclusions from above:

  1. i.fromsid=0 affects the scoring order.
  2. score normalization affects the scoring and the optional theming of the value, and at least one contrib module relies on it.
  3. I'm having a Good Will Hunting moment. Steven Wittens is either far smarter, much funnier, or both.

Does fromsid affect results or ranking? For discussion purposes, assume node/1 contains <a href="node/2">a b</a> and node/2 contains "a b c d". When searching for "c", does node/1 come up in the results? Or is it that when searching for "a", node/1 has a higher score because "a" occurs in both node/1 and node/2?

I think that the {node_rank} table might work somewhat with my proposed #145242 "refactor search node_rank with hook_node_rank scoring factors". I think that we need both a score for the term within the node and a general score for the node. I think that, other than the SUM(i.score), all of the node rank values are node related.

I have no solution to the relevancy. My solution was to push this responsibility into the hook_node_rank implementations. I'm very interested in understanding the TF/IDF algorithm. Forgive me, but doesn't the entire TF/IDF algorithm forget to normalize the other node rankings? Is this just a solution for normalizing the search scores?

Steven’s picture

I used the term 'relevance' to denote the result of the TF/IDF calculation (i.e. pure text-based relevance). This is usually only a part of the final combined score (which you could also call 'relevance').

You do seem to be confusing a couple of things.

TF/IDF is just a scoring method for ranking items according to a keyword query. The only 'matching' property is that results which would not show up in a pure OR query will have score 0.

In search.module 5.0, we use this implicitly as part of the first pass, to eliminate a bunch of possible nodes. We also count the number of matching unique words per item to discard items that can't possibly satisfy the query (using the HAVING clause).
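Schematically, the first pass for a two-word AND query looks like this (simplified; the real query also folds in fromsid and the scoring expression):

  -- First pass for "foo bar" (sketch): an item must match at least two
  -- of the words to possibly satisfy the AND query.
  SELECT i.sid, COUNT(DISTINCT i.word) AS matches
  FROM search_index i
  WHERE i.word IN ('foo', 'bar')
  GROUP BY i.sid
  HAVING COUNT(DISTINCT i.word) >= 2;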

The second pass then joins onto the dataset table, checks for phrases and negatives, calculates the total score and sorts the results.

The effect of the fromsid mechanism is that text is seemingly transplanted from the source node (containing the link) to the target node (being linked to). It tends to improve the recall of the results, since it adds synonyms of words used on the target page. This property would be lost if we replaced it with a simple "per node" score.

There may be another solution though. Instead of keeping track of links in the index itself, we would maintain a separate table of inter-node links. When the HTML indexer encounters a node link, it would store the link in the database (with its caption) and force a re-indexing of the target node by touching its node.changed date. On index, hook_nodeapi('update index') would be used to add the link captions to the target node, thus achieving the same thing. Obviously, it would only re-index when the links in a node change, since otherwise this could cause endless re-indexing of circular links.

The benefit of this would be immense, since fromsid/type would disappear. The search index tables would shrink by a very significant percentage, and each sid/type pair would have only a single score per word.

Looking closer at your patch, there seem to be several problems though:

  • You changed the HAVING clause to = rather than >=, which I suspect is why you say fromsid=0 is needed. But I still don't see how that can work: the query "foo OR bar" by definition can be satisfied by items with either 1 or 2 matching rows in search_index, and when fromsid gets involved, you can have double the number of matching rows even for AND queries.
  • There are also a bunch of arbitrary changes, like changing the OR operator to be case insensitive (and using a bizarre method to do so). This was in fact an intentional design choice, and follows what search engines like Google do. Such changes should not be part of this patch.
  • Finally, the code is messy, comments are misformatted or plain missing. Things like // return [0]=1 for legacy reasons so array indexes don't need to change have no place in a core patch. This makes it incredibly hard to review.

There is a better search.module hiding in this patch somewhere, though.

Steven’s picture

As an illustration of why the first pass (without search_dataset) isn't enough to satisfy 'or' queries, imagine the query "foo bar OR baz" (which is interpreted as "foo AND (bar OR baz)"). Items that satisfy this query must have at least 2 matching rows/words per item in the first pass (HAVING COUNT(*) >= 2).

However, there is an additional restriction: one of these two rows must be "foo". Items that do not satisfy this are results of the queries "(foo OR bar) AND baz" and "(foo OR baz) AND bar".

I don't see where you verify this in your patch.

douggreen’s picture

Status: Needs work » Active

Given the number of concerns, I've changed the status from "patch" to "active". I'm doubtful that I can write a new patch that addresses all of these concerns before the 6.x code-freeze, but I'm also hopeful that a better search is lurking here.

  • If TF/IDF is only for search word relevancy, then I think we're back to each node rank being responsible for returning normalized data.
  • I use HAVING COUNT(*) = for AND matches. If you're searching for 5 terms, and the search index is unique, you get back exactly 5 matches when all of the terms are in the document. For OR matches, I require HAVING COUNT(*) >= 1, and since this is implied by the INNER JOIN, it is not used.
  • I thought OR was broken and I fixed it (bizarre or not, I think that for a two-character match it might be the fastest solution). I'll agree that this is another issue and doesn't belong in this patch. Now that I know this was intentional, my suggestion is to display a warning message to the user, like Google does.
  • The comment should have said something like // TODO: cleanup returned array, but temporarily returning array compatible with previous function returns, or something like that. I did mark the patch as "patch (code needs work)".
douggreen’s picture

You're right on the combination of ANDs and ORs. I also realized this today when I looked closer at the search_parse_query() returns. One solution is to have multiple JOINs for this case.
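Something like this, ignoring the type column and scoring (purely illustrative):

  -- "foo AND (bar OR baz)" with one join per AND-group (sketch):
  SELECT i1.sid
  FROM search_index i1
  INNER JOIN search_index i2 ON i2.sid = i1.sid
  WHERE i1.word = 'foo'
    AND i2.word IN ('bar', 'baz')
  GROUP BY i1.sid;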

I've been wondering if we're trying to do too much in one place. You've made a search that handles complicated terms, but is quite slow. I'd like to throw out the idea of having two search solutions: one that works fast for 90% of the searches (ANDs or ORs, but not both; maybe excludes, but not phrases), and a slower fallback that works for the degenerate cases (a b OR c, and phrases). I put excludes with the fast search because I think excludes can be handled using one additional JOIN, as sketched below.
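The exclude JOIN I have in mind is the usual LEFT JOIN / IS NULL anti-join, roughly (illustrative only):

  -- "foo -bar" (sketch): keep items that match foo and have no row for bar.
  SELECT i.sid
  FROM search_index i
  LEFT JOIN search_index x ON x.sid = i.sid AND x.word = 'bar'
  WHERE i.word = 'foo'
    AND x.sid IS NULL
  GROUP BY i.sid;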

nedjo’s picture

Steven, thanks for the detailed explanations, this is invaluable.

Doug, please keep up the great work. I believe there is great interest in seeing performance improvements in core search.

Already it looks like we're getting to the point where we have two people who understand some of the innards of core search instead of (seemingly) only one, and this can only be a good thing :)

I like the suggestion of alternative search algorithms, at least as a fallback if it's not possible in the short term to retain full support for all search methods in this patch. In my experience, advanced searches account for a relatively small percentage of overall searches, so if they are slower we'll still get significant performance gains for most searches.

And multiple algorithms might be a first step toward the goal of enabling contrib search handling, i.e., contrib implementations of search handlers (MySQL full text, Lucene, etc.).

Owen Barton’s picture

Great work guys - some brain stretching stuff indeed!

@Steven - do you think breaking out the node-node links into a separate table is attainable for the D6 code freeze?
Would this simplify the relevancy normalization process, or just the generation of the node scoring?

It strikes me that a node-node links table would actually be invaluable for several contrib modules, such as related_links, and recommendation engine - definitely a free bonus!

If it would allow faster queries (for the 99% that don't use AND, OR etc) I think having a simplified (but still accurate!) search process could be a good approach.

Steven’s picture

The point of the maths above was to prove that normalizing TF/IDF relevance should indeed be possible ahead of time, just like the other node ranking factors. But it will require a query or two beforehand. So yes, that's the only issue with normalization: the rest is already there, I believe.

The way I see it, there are three things we have now to speed things up:

  • We can remove the fromsid feature, save a lot of db space and code, replace it with an on-index mechanism, and even extend it with a node rank reusing a node-node link table. Note however that such a table, while useful, would not contain the true links between all pages on the site: it would only count links in user-generated text. As such, a Google PageRank-like algorithm would not do much good here. But there are other ways of having this influence the score. This can be done independently of this patch (and should probably be done first).
  • Instead of a temporary table, we use a subselect. This is possible if we implement the normalization for tf-idf relevance as described above.
  • With fromsid removed, we can use HAVING = for pure AND queries and pure OR queries (both including negative terms, thanks to the new IS NULL JOIN mechanism), with no join on the dataset table needed.

    Also, the result of HAVING >= for AND/OR queries would be better than before (more nodes eliminated on the first pass), meaning the join on the dataset table to verify these cases wouldn't be as bad.

You make it sound like it has to be one or the other. If you look at Drupal.org's searches, the vast majority of them are single words, followed by pure ANDs, followed by phrases+AND. So if we only join on the dataset table when necessary (we now know when), we should get a nice boost-up for common cases.

I also like the idea that modules that use this could artificially restrict the input to pure AND or pure OR queries and always get the fast system.

By the way, if we do all this... it would be best to force a re-indexing rather than try to convert the tables. The node link stuff could in theory be distilled out of an existing search_index table, but it would be a chore to write the update. A re-indexing is not the end of the world, especially now that we have the batch feature: we could even implement a UI for it.

douggreen’s picture

This discussion has been educational. Thank you Steven!

What are the tasks moving forward?

  1. Please consider #145242 before we try to write a new node-rank. I think that it will make the implementation cleaner.
  2. Make search_index smaller and pave the way for the subselects, by removing fromsid and creating a new node-node link table. Steven wrote "We can remove the fromsid feature, save a lot of db space and code, replace it with an on-index mechanism, and even extend it with a node rank reusing a node-node link table." Agreed, but I don't understand what you mean by the "on-index mechanism". I was only thinking of a new node-node link table and node-rank implementation. Also, what do you think the new table should be called, and what is its structure? search_node_links (nid, linknid)? A possible schema is sketched after this list.
  3. Remove the need for normalization by implementing TF/IDF. I don't have a good handle on this, but believe that with some study and direction, I could.
  4. Use subselects where possible (simple AND/OR HAVING = solution). We may still be stuck with temporary tables for more complicated searches, but this seems like a good trade-off.
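For item 2, one possible shape for the table (the name and columns are just my guess, nothing is settled):

  -- Hypothetical node-node link table (sketch):
  CREATE TABLE search_node_links (
    sid INT NOT NULL,      -- node containing the link
    nid INT NOT NULL,      -- node being linked to
    caption TEXT,          -- link text, indexed as part of the target node
    PRIMARY KEY (sid, nid)
  );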

What's doable for the 6.x code freeze? And who should do what? The use of subselects relies on fromsid=0 and on not having to do normalization. So I don't think we can get the benefit of #4 unless we do both #2 and #3. Is this correct?

douggreen’s picture

"OR" discussion moved to #146278 - Improve User Experience of search syntax.

Steven’s picture

Fromsid refactoring patch at: http://drupal.org/node/146466
(step 1)

Steven’s picture

By the way, the refactoring is not about 'implementing TF/IDF'. Search.module already implements it (see wikipedia). The issue is how to normalize the resulting scores to a fixed, linear scale, so it can be mixed with other scoring factors.

hunmonk’s picture

subscribing. i intend to take a crack at this code soon.

douggreen’s picture

Status: Active » Closed (duplicate)

This was a very interesting thread, and the predecessor to the link patch / fromsid refactoring / remove temporary tables patch #146466. Given that the work was done in that thread, and that it's been committed, this issue is no longer valid, so I'm closing it as a dup.