I have a CVS site which includes a page with a long-ish membership list. After about 715 words, search results fail. Everything before that works fine. I created a new page for testing and pasted in the text of "The Fall of the House of Usher". I ran cron.php and the search page reported 100% indexed. Searching only returned results for roughly the first 212 words. Nothing past that was found.
Comments
Comment #1
gtcaz commented: This is version // $Id: search.module,v 1.138 2005/10/21 11:14:55 unconed Exp $
The only thing I've done is fix the cast error as reported here: http://drupal.org/node/34515
Comment #2
gtcaz commented: All the terms do appear in the search_dataset table.
Comment #3
robertgarrigos commented: I'm looking at this and notice that many entries in the search_index table show a score of 0 (zero), when every word is supposed to have a minimum score of 1. Words with a score of zero are therefore never found by a search.
Curiously, there are no zero scores in roughly the first 120 entries of that table. After that, zeros appear more and more often toward the end of the table. Around row 300, about half of the rows have a zero score. From about row 930 to the end, all of them have a zero score.
(...)
The problem is at line 514 of search.module:
Commenting out this line fixes the problem. However, I don't know how this would affect the search itself. What is the exact purpose of that decaying value? Why does a unique word have to score less when it's found on a page with many unique words? And why does it have to score even less if it's found nearer the end of the page?
Unless a developer can give us a reason why it was done this way, I would just take it out. I'm not uploading a patch yet, in case there is a reason I cannot see.
Comment #4
gtcaz commented: I can confirm this resolves the issue. The focus algorithm is broken and I will be commenting it out on my site. Perhaps this should be a configurable setting if it's fixed. I agree with Robert that on many pages, my members list being one of them, being near the top does not, by itself, make the result more relevant.
Comment #5
Steven commented: Fixed in CVS. The problem was not $focus, but the INSERT query, which used %d even though the scores are now floating point. The integer cast set some scores to zero, which messes up things.
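To make the integer-cast effect concrete, here is a minimal sketch in Python rather than Drupal's PHP. The words and score values are invented for illustration, and the two helper functions are hypothetical stand-ins for an integer versus a floating-point placeholder in the INSERT query, not the actual search.module code.

```python
def store_with_integer_placeholder(score: float) -> int:
    """Format the score with an integer placeholder, as the buggy INSERT did.

    The fractional part is dropped, so any score below 1 is stored as 0.
    """
    return int("%d" % score)


def store_with_float_placeholder(score: float) -> float:
    """Format the score with a floating-point placeholder instead."""
    return float("%f" % score)


# Hypothetical scores: words seen early keep a full score of 1.0,
# words seen later on a long page have been scaled down below 1.
scores = {"usher": 1.0, "house": 0.93, "tarn": 0.41}

stored_as_int = {w: store_with_integer_placeholder(s) for w, s in scores.items()}
stored_as_float = {w: store_with_float_placeholder(s) for w, s in scores.items()}

# Words with a stored score of zero can no longer be found.
unsearchable = [w for w, s in stored_as_int.items() if s == 0]
```

With these made-up values, only the word that kept its full score survives the integer cast, which would match the pattern of zero scores building up toward the end of the search_index table.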
FYI, the $focus variable is used to offset the effect that very long pages tend to match more, even though they may not be more relevant (they just contain more different words in more quantity).
In traditional full-text searches, a normalization is applied across the entire text (wordscore = score per word / # of words in the text), but this is really bad for a web-cms like Drupal, because we have comments: it would mean that each new comment would decrease the score of the existing content. It also means that very short content scores very high (due to the 1/x relationship), which is also undesirable.
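A small Python sketch of that whole-text normalization may help show both drawbacks. The sample texts and the function name are made up for illustration, and the tokenizer is a plain whitespace split, which is an assumption and not how Drupal splits words.

```python
def normalized_word_score(text: str, word: str) -> float:
    """Occurrences of `word` divided by the total number of words in `text`."""
    words = text.lower().split()
    return words.count(word) / len(words)


node_body = "drupal search indexes nodes"
node_with_comment = node_body + " thanks for the great post"

# The same node scores lower for "search" as soon as a comment is appended.
score_before_comment = normalized_word_score(node_body, "search")
score_after_comment = normalized_word_score(node_with_comment, "search")

# Very short content scores very high because of the 1/x relationship.
score_one_word_page = normalized_word_score("search", "search")
```

Here the score for "search" drops from 1/4 to 1/9 once the five-word comment is added, while a page consisting of a single word gets the maximum score of 1.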
But, we do want some penalties for really long content.
So I arrived at the decaying focus value, which applies a penalty to later words, without affecting the score of existing content when e.g. a comment is added. Also, because it only counts unique words rather than total words, it means that a very long, but very on-topic discussion will not get that much penalty. Off-topic discussions will get much lower focus much faster.
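The behaviour described here can be sketched as follows, again in Python. The decay function and its threshold of 100 unique words are assumptions chosen for illustration; they are not the formula used in search.module.

```python
def focus_after(unique_words_seen: int, full_weight_until: int = 100) -> float:
    """Hypothetical decay: full weight up to a threshold, then falling off.

    Depends only on the number of unique words seen so far, not on total words.
    """
    return min(1.0, full_weight_until / max(unique_words_seen, 1))


def index_scores(words: list[str]) -> dict[str, float]:
    """Accumulate a score per word, weighting each occurrence by the current focus."""
    scores: dict[str, float] = {}
    focus = 1.0
    for word in words:
        scores[word] = scores.get(word, 0.0) + focus
        focus = focus_after(len(scores))
    return scores


# Long but on-topic: few unique words, so focus never decays.
on_topic = index_scores(["drupal", "search"] * 500)

# Long and off-topic: every word is new, so later words score much less.
off_topic_words = [f"w{i}" for i in range(1000)]
off_topic = index_scores(off_topic_words)

# Appending a comment leaves the scores of the existing words untouched.
body = off_topic_words[:300]
body_scores = index_scores(body)
with_comment = index_scores(body + ["thanks", "great"])
body_unchanged = all(with_comment[w] == body_scores[w] for w in body)
```

Because the focus depends only on what has been seen so far, words earlier in the text keep their scores when more text is appended, unlike the whole-text normalization. Note that the later words still receive fractional scores, which is why they must be stored as floating-point values.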
I admit this relies on some assumptions about the type of content, but (at least after this bug fix) it only has an effect on the ordering of results.
Comment #6
Steven commented
Comment #7
shouchen commented: Steven,
I noticed that node.module changed with your commit. Please see http://drupal.org/node/41973 (I'm not suggesting that your commit caused the bug I'm reporting... but since you recently changed the same code I changed when patching the bug, I thought you might be interested.)
Thanks,
Steve