We no longer need to use temporary tables. Without them, we can generate more optimal SQL that can also be cached; the advantage of caching the SQL is that subsequent page requests are faster. The current use of temporary tables makes this impossible.
See http://groups.drupal.org/node/4106 for some history.
The attached patch makes several changes:
- adds a unique key to {search_index} - this is required for the search logic to work
- converts the mysql {search_index} table to MyISAM - my test results were much, much worse with InnoDB
- doesn't use search_dataset, I'm still not sure why this was needed and if I'm missing something by not using it
- changes search_parse_query to return different $query[0], $query[1], and $query[3] values, as needed by the new search logic
- changes the negative logic
- adds i.fromsid=0 to the query - this is needed for proper results, but I'm not sure what the impact is
- uses SQL IN clause instead of repetitive OR, and builds this slightly differently
- removes the score normalization - this was harder to do without the intermediate temporary tables, and I'm not sure that it is really necessary
Questions:
- is this functionally equivalent?
- are subselects available in mysql 4.1, as I think they are? and are they available in pgsql? is this a reasonable requirement for Drupal 6.x?
- does removing search_dataset lose any functionality?
- is forcing MyISAM versus InnoDB for this one table acceptable?
- is there any negative side effect to using i.fromsid=0?
- is the SQL IN clause available in pgsql? MySQL EXPLAIN implied that this was identical and the code looks cleaner to me. Are there any issues with using the IN clause?
- do we need the score normalization?
- does update.php need to run incrementally to help on large sites?
Some benchmarking is needed. I'd like to benchmark and publish the results against the d.o database. I'd also like some independent benchmarking and verification. My results on two large databases show significant speed improvement. What "significant" is depends on the number of query terms, and the size of the result set. Many tests show a 3x - 5x speed improvement.
Still to do:
- benchmarks and peer review
- add pgsql code
- update comment above do_search
File | Size | Author
---|---|---
coresearch.patch | 12.15 KB | douggreen
Comments
Comment #1
chx commented:
All your questions re. subselect and IN -- no problems. Forcing MyISAM -- I can't really imagine this being a problem, but killes can answer this better. Reading system update 6018, I do not see anything that needs to be incremental.
Comment #2
moshe weitzman commented:
subscribe
Comment #3
nedjo commented:
Hey, thanks for giving some attention to this critical area. The use of subqueries makes good sense here and appears to be another area where MySQL 4.1 enables us to leave dated approaches behind.
Comment #4
Owen Barton commented:
Subscribing
Comment #5
inforeto commented:
subscribing
Comment #6
Steven commented:
How are you doing phrase searching if you don't use search_dataset?
Comment #7
douggreen commented:
Thanks Steven - I couldn't figure out why {search_dataset} was needed. I didn't really have a problem with it; it just seemed unnecessary. I think it was always used, even when not searching for phrases. If it's only needed for phrase searching, can we join against it only when searching for phrases? I'll re-roll with {search_dataset} restored and conditional logic that includes it in the join only when phrase searching.
Comment #8
Steven commented:
Also:
The goal of fromsid is to make inter-node links affect scoring. Words that are part of a link are counted as part of the target node's body. If the link text is just a URL, the target node's title is used instead. This means that often-linked pages tend to have a higher score.
However, this mechanism is ultimately flawed since it still relies on keyword matching. I've long wanted to replace this with a pagerank-like system, but so far, there is no code yet. This would simplify a lot of code, and probably speed up things quite a bit too. It would simply be a LEFT JOIN on a {node_rank} table or somesuch.
Now, while this old mechanism isn't perfect, it did have a noticeable effect on the quality of results when I developed this (using a snapshot of drupal.org as a test base). So I'd prefer to keep it, if possible.
However, if you can't get rid of the fromsid=0, then a lot more code and columns need to be snipped.
Absolutely we need the score normalization, for two reasons.
First, the literal TF/IDF score (relevancy) is measured on an arbitrary positive scale. The exact range depends on the language used, on the site's specific content and on the specific search query entered, and is hard to predict since there are logarithms involved. In order for the scoring factors to be weighted meaningfully, each must be normalized to 0..1. Imagine what would happen if one of the factors was in the range 0..2 instead: it would be as if its weight was doubled. Obviously, without normalization, the weighting feature will not work well.
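The weight-doubling effect described above can be shown with a toy calculation (hypothetical factor values and weights, not actual search.module code):

```python
# Two ranking factors, each meant to be weighted equally (weight 0.5).
# Factor A is properly normalized to 0..1; factor B mistakenly spans 0..2.
def combined_score(a, b, w_a=0.5, w_b=0.5):
    """Weighted sum of two scoring factors."""
    return w_a * a + w_b * b

# With both factors at their maximum, B contributes twice as much as A:
print(combined_score(1.0, 2.0))  # 1.5: B accounts for 1.0 of it, A only 0.5
# After normalizing B back into 0..1, the intended 50/50 split is restored:
print(combined_score(1.0, 2.0 / 2.0))  # 1.0: each factor contributes 0.5
```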
Secondly, the actual final score, which is the weighted sum of all ranking factors, is passed along with the results. These scores can be shown in the theme, or exported, e.g. through OpenSearch. The OpenSearch Aggregator in contrib uses this to multiplex an arbitrary number of search feeds together into one ordered result set. For this to work, it must be a (relatively) uniform score from 0 to 100%. If the individual scoring factors are not uniformly distributed or don't cover the entire range, the combined score will be useless for such purposes.
If you remove relevancy normalization, this would mean that depending on the query, the relevancy score would have more or less weight in the ordering of the results.
The old system solved the "arbitrary scale" problem by simply assuming that the most relevant result was 100% relevant.
While the possibility exists that you can predict the range of relevancy scores, I haven't looked into it myself, and I think it would be very hard for practical situations, since the corpora are usually small. And, you would need to do it before the query, since the relevancy weight is only one coefficient in the total score and cannot be corrected afterwards.
I don't see how you could split up those queries, since they affect the entire table. However, I've always read that MySQL execution time is not counted towards the PHP time limit.
Comment #9
Steven commented:
Actually, here's an idea for normalization. Some background on the TF/IDF ranking method is needed; note that we use a variation on the classic implementation:
Classic IDF:
N = (SELECT COUNT(sid) FROM search_index)
IDF(T) = log(N / (SELECT COUNT(sid) FROM search_index WHERE word = T))

Zipf's law:
IDF(T) = log10(1 + 1 / (SELECT SUM(score) FROM search_index WHERE word = T))

See: http://en.wikipedia.org/wiki/Zipf's_law
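As a sketch of the two IDF variants in plain Python (the toy index contents are made-up; the real values come from the SQL queries shown):

```python
import math

# Toy index: word -> list of (sid, score) rows, standing in for {search_index}.
index = {
    'drupal': [(1, 0.9), (2, 0.5), (3, 0.4)],
    'zipf':   [(1, 0.2)],
}
N = sum(len(rows) for rows in index.values())  # COUNT(sid) over the whole table

def idf_classic(word):
    # Classic: IDF(T) = log(N / COUNT(sid) WHERE word = T)
    return math.log(N / len(index[word]))

def idf_zipf(word):
    # Variant: IDF(T) = log10(1 + 1 / SUM(score) WHERE word = T)
    return math.log10(1 + 1 / sum(score for _, score in index[word]))

# Rarer words get a higher IDF under both variants:
assert idf_classic('zipf') > idf_classic('drupal')
assert idf_zipf('zipf') > idf_zipf('drupal')
```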
Neither affects the normalization, though, since all it cares about is the resulting TF/IDF factors. The scoring works as follows. I use the same notation as Bayesian probabilities: P(X|Y) means the chance of X, given (a certain) Y.
The relevancy of a matching item X for a query Q is the sum of the relevancy of all matching terms Ti for that query:
R(X|Q) = ∑Ti ∈ Q R(Ti|X)
Note that the relevancy for an individual matching term T in an item X does not depend on the specific query Q. It is equal to:
R(T|X) = TF(T|X) * IDF(T)
Where:
(we ignore search_index.type for readability, and identify items purely by search_index.sid)
IDF(T) = (see above)
TF(T|X) = SELECT score FROM search_index WHERE word = T AND sid = X
(without taking inter-node word exchange into account)
TF(T|X) = SELECT SUM(score) FROM search_index WHERE word = T AND (sid = X OR fromsid = X)
(with inter-node word exchange)
Based on the above queries, we can see that each possible term TF(Ti|X) has a maximum value, and by extension, so does R(Ti|X):
maxX[ TF(Ti|X) ] = SELECT MAX(score) FROM search_index WHERE word = T
maxX[ R(Ti|X) ] = maxX[ TF(Ti|X) ] * IDF(T)
This means that for a given query:
maxX[ R(X|Q) ] = maxX[ ∑Ti ∈ Q R(Ti|X) ]
Now, because all numbers involved are strictly positive:
maxX[ R(X|Q) ] <= ∑Ti ∈ Q maxX[ R(Ti|X) ]
We now have an upper limit for the total relevancy for a particular query Q which can be efficiently calculated by caching MAX(score|T) * IDF(T) in search_total.
We can calculate a non-uniform, normalized relevancy simply by dividing by this upper limit. The resulting relevancy numbers are guaranteed to lie in the range 0..1.
At this point, we need to look at the actual statistical distribution of the relevancy in these results. Given that the above holds true regardless of the actual TF/IDF scores (which define the language model), it is not unreasonable to expect some consistent, relatively constant distribution here (similar to Zipf). If found, we can easily approximate an inverse transform, and get (somewhat) uniform, normalized relevancy scores with minimal extra effort.
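The upper-bound normalization above can be sketched in plain Python (toy data; a real implementation would cache MAX(score) * IDF per word in {search_total}):

```python
# Per-word cached values, standing in for search_total: the maximum TF seen
# for the word across all items, already multiplied by the word's IDF.
max_relevance = {'foo': 2.3, 'bar': 1.1, 'baz': 0.7}

def normalize(results, query_words):
    """Divide raw TF*IDF relevancy by its per-query upper limit.

    Because all numbers involved are strictly positive,
      maxX[ R(X|Q) ] <= sum over Ti in Q of maxX[ R(Ti|X) ],
    so dividing by that sum guarantees scores in the range 0..1."""
    upper = sum(max_relevance[w] for w in query_words)
    return {sid: raw / upper for sid, raw in results.items()}

raw = {10: 2.9, 11: 1.5, 12: 0.4}      # raw sum-of-TF*IDF per matching item
norm = normalize(raw, ['foo', 'bar'])  # upper limit = 2.3 + 1.1 = 3.4
assert all(0.0 <= s <= 1.0 for s in norm.values())
```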
Comment #10
Steven commented:
Actually, I made a notation error above: there is an assumption that the maximum score for a term T in a given result set will be roughly the same as the maximum score of that term across all items, whether they match or not.
The correction is:
maxX[ TF(Ti|X) ] < max[TF(Ti)] = SELECT MAX(score) FROM search_index WHERE word = T
And thus assuming:
maxX[ TF(Ti|X) ] ≈ SELECT MAX(score) FROM search_index WHERE word = T
This is why the non-uniform numbers need to be processed further to be useful.
Comment #11
douggreen commented:
Conclusions from above:
Does fromsid affect results or ranking? For discussion purposes, assume node/1 contains <a href="node/2">a b</a> and node/2 has "a b c d". When searching for "c", does node/1 come up in the results? Or is it that when searching for "a", node/1 has a higher score because "a" occurs in both node/1 and node/2?
I think that the {node_rank} table might work somewhat with my proposed #145242 "refactor search node_rank with hook_node_rank scoring factors". I think that we need both a score for the term within the node and a general score for the node. I think that, other than the SUM(i.score), all of the node rank values are node related.
I have no solution to the relevancy. My solution was to push this responsibility into the hook_node_rank implementations. I'm very interested in understanding the TF/IDF algorithm. Forgive me, but doesn't the entire TF/IDF algorithm forget to normalize the other node rankings? Is this just a solution for normalizing the search scores?
Comment #12
Steven commented:
I used the term 'relevance' to denote the result of the TF/IDF calculation (i.e. pure text-based relevance). This is usually only part of the final combined score (which you could also call 'relevance').
You do seem to be confusing a couple of things.
TF/IDF is just a scoring method for ranking items according to a keyword query. The only 'matching' property is that results which would not show up in a pure OR query will have score 0.
In search.module 5.0, we use this implicitly as part of the first pass, to eliminate a bunch of possible nodes. We also count the number of matching unique words per item to discard items that can't possibly satisfy the query (using the HAVING clause).
The second pass then joins onto the dataset table, checks for phrases and negatives, calculates the total score and sorts the results.
The effect of the fromsid mechanism is that text is seemingly transplanted from the source node (containing the link) to the target node (being linked to). It tends to improve the recall of the results, since links often add synonyms of the words used on the target page. This property would be lost if we replaced it with a simple "per node" score.
There may be another solution though. Instead of keeping track of links in the index itself, we would maintain a separate table of inter-node links. When the HTML indexer encounters a node link, it would store the link in the database (with its caption) and force a re-indexing of the target node by touching its node.changed date. On index, hook_nodeapi('update index') would be used to add the link captions to the target node, thus achieving the same thing. Obviously, it would only re-index when the links in a node change, since otherwise this could cause endless re-indexing of circular links.
The benefit of this would be immense, since fromsid/type would disappear. The search index tables would shrink by a very significant percentage, and each sid/type pair would have only a single score per word.
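Roughly, the proposed flow could be simulated like this (plain Python, purely illustrative of the data flow; the link extraction, captions, and node contents are made-up, not Drupal APIs):

```python
# Simulated index with a separate inter-node link table, per the proposal:
# the link caption is stored against the *target* node and folded into its
# indexed text on re-index, instead of fromsid rows in search_index.
node_body = {1: 'see <a href="node/2">great module</a>', 2: 'a b c d'}
node_links = {}   # target sid -> captions of links pointing at it
indexed = {}      # sid -> list of indexed words

def index_node(sid):
    """Index a node's own text plus the captions of links pointing at it."""
    text = node_body[sid]
    # Crude stand-in for the HTML indexer's link extraction:
    if 'node/2' in text:
        node_links.setdefault(2, []).append('great module')
        # A real implementation would touch node 2's changed date here
        # to force its re-indexing.
    extra = ' '.join(node_links.get(sid, []))
    indexed[sid] = (text + ' ' + extra).split()

index_node(1)  # records the link node/1 -> node/2
index_node(2)  # node 2 is re-indexed and picks up the caption "great module"
assert 'module' in indexed[2]  # caption words now score against node 2
```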
Looking closer at your patch, there seem to be several problems though:
Hacks like
// return [0]=1 for legacy reasons so array indexes don't need to change
have no place in a core patch. This makes it incredibly hard to review. There is a better search.module hiding in this patch somewhere, though.
Comment #13
Steven commented:
As an illustration of why the first pass (without search_dataset) isn't enough to satisfy 'or' queries, imagine the query "foo bar OR baz" (which is interpreted as "foo AND (bar OR baz)"). Items that satisfy this query must have at least 2 matching rows/words per item in the first pass (HAVING COUNT(*) >= 2).
However, there is an additional restriction: one of those two rows must be "foo". Items that do not satisfy this are results of the queries "(foo OR bar) AND baz" and "(foo OR baz) AND bar" instead.
I don't see where you verify this in your patch.
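Steven's point can be checked with a small simulation (plain Python standing in for the first-pass SQL; the item contents are made-up):

```python
# Toy first pass for the query "foo AND (bar OR baz)": count matching rows
# per item, as HAVING COUNT(*) >= 2 would, then apply the real predicate.
items = {
    1: {'foo', 'bar'},   # genuine match
    2: {'bar', 'baz'},   # 2 matching words, but no "foo"
    3: {'foo'},          # only 1 matching word
}
query_words = {'foo', 'bar', 'baz'}

first_pass = {sid for sid, words in items.items()
              if len(words & query_words) >= 2}   # HAVING COUNT(*) >= 2

def satisfies(words):
    # The actual boolean query: foo AND (bar OR baz).
    return 'foo' in words and ('bar' in words or 'baz' in words)

true_matches = {sid for sid, words in items.items() if satisfies(words)}

assert first_pass == {1, 2}   # item 2 slips through the first pass...
assert true_matches == {1}    # ...but does not satisfy the query
```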
Comment #14
douggreen commented:
Given the number of concerns, I've changed the status from "patch" to "active". I'm doubtful that I can write a new patch that addresses all of these concerns before the 6.x code freeze, but I'm also hopeful that a better search is lurking here.
The comment should probably have read:
// TODO: cleanup returned array, but temporarily returning array compatible with previous function returns
, or something like that. I did mark the patch as "patch (code needs work)".
Comment #15
douggreen commented:
You're right about the combination of ANDs and ORs. I also realized this today when I looked closer at the search_parse_query() returns. One solution is to have multiple JOINs for this case.
I've been wondering if we're trying to do too much in one place. You've made a search that handles complicated terms, but is quite slow. I'd like to throw out the idea of having two search solutions: one that works fast for 90% of searches (ANDs or ORs, but not both; maybe excludes, but not phrases), and a slower fallback that handles the degenerate cases (a b OR c, and phrases). I put excludes with the fast search because I think excludes can be handled with one additional JOIN.
Comment #16
nedjo commented:
Steven, thanks for the detailed explanations, this is invaluable.
Doug, please keep up the great work. I believe there is great interest in seeing performance improvements in core search.
Already it looks like we're getting to the point where we have two people who understand some of the innards of core search instead of (seemingly) only one, and this can only be a good thing :)
I like the suggestion of alternative search algorithms, at least as a fallback if it's not possible in the short term to retain full support for all search methods in this patch. In my experience, advanced searches account for a relatively small percentage of overall searches, so if they are slower we'll still get significant performance gains for most searches.
And multiple algorithms might be a first step toward the goal of enabling contrib search handling, i.e., contrib implementations of search handlers (MySQL full text, Lucene, etc.).
Comment #17
Owen Barton commented:
Great work guys - some brain-stretching stuff indeed!
@Steven - do you think breaking out the node-node links into a separate table is attainable for the D6 code freeze?
Would this simplify the relevancy normalization process, or just the generation of the node scoring?
It strikes me that a node-node links table would actually be invaluable for several contrib modules, such as related_links, and recommendation engine - definitely a free bonus!
If it would allow faster queries (for the 99% that don't use AND, OR etc) I think having a simplified (but still accurate!) search process could be a good approach.
Comment #18
Steven commented:
The point of the maths above was to prove that normalizing TF/IDF relevance should indeed be possible ahead of time, just like the other node ranking factors. But it will require a query or two beforehand. So yes, that's the only issue with normalization; the rest is already there, I believe.
The way I see it, there are three things we have now to speed things up:
Also, the result of HAVING >= for AND/OR queries would be better than before (more nodes eliminated on the first pass), meaning the join on the dataset table to verify these cases wouldn't be as bad.
You make it sound like it has to be one or the other. If you look at Drupal.org's searches, the vast majority of them are single words, followed by pure ANDs, followed by phrases+AND. So if we only join on the dataset table when necessary (we now know when), we should get a nice boost-up for common cases.
I also like the idea that modules that use this could artificially restrict the input to pure AND or pure OR queries and always get the fast system.
By the way, if we do all this... it would be best to force a re-indexing rather than try to convert the tables. The node link stuff could in theory be distilled out of an existing search_index table, but it would be a chore to write the update. A re-indexing is not the end of the world, especially now that we have the batch feature: we could even implement a UI for it.
Comment #19
douggreen commented:
This discussion has been educational. Thank you, Steven!
What are the tasks moving forward?
What's doable for the 6.x code freeze? And who should do what? The use of subselects relies on fromsid=0 and on not having to do normalization, so I don't think we can get the benefit of #4 unless we do both #2 and #3. Is this correct?
Comment #20
douggreen commented:
"OR" discussion moved to #146278 - Improve User Experience of search syntax.
Comment #21
Steven commented:
Fromsid refactoring patch at: http://drupal.org/node/146466
(step 1)
Comment #22
Steven commented:
By the way, the refactoring is not about 'implementing TF/IDF'. Search.module already implements it (see Wikipedia). The issue is how to normalize the resulting scores to a fixed, linear scale so they can be mixed with other scoring factors.
Comment #23
hunmonk commented:
Subscribing. I intend to take a crack at this code soon.
Comment #24
douggreen commented:
This was a very interesting thread, and the predecessor to the link patch / fromsid refactoring / remove temporary tables patch #146466. Given that the work was done in that thread and has been committed, this issue is no longer valid, so I'm closing it as a duplicate.