This is the patch with search improvements I've been working on, and which has been discussed on the devel list. I think in its current state it's pretty good and ready to go into core.
More work can be done in the future, but I've done everything I've set out to do ;).
There is a demo site on http://unconed.drupaldevs.org/search
Changes since last patch on drupal-devel:
- Added comment count to node results.
- Used format_name() for author.
- Updated doxygen. I also have updated docs for the API reference ready.
- Tweaked search output.
- Moved search config to admin/search, fixed broken contextual help.
- Fixed outdated search forms in bluemarine/pushbutton.
- Added update path to updates.inc.
Overview of changes:
1) Clean up the text analyser: make it handle UTF-8 and all sorts of characters. The word splitter now does intelligent splitting into words and supports all Unicode characters. It has smart handling of acronyms, URLs, dates, ...
2) It now indexes the filtered output, which means it can take advantage of HTML tags. Meaningful tags (headers, strong, em, ...) are analysed and used to boost certain words' scores. This has the side-effect of allowing the indexing of PHP nodes.
3) Link analyser for node links. The HTML analyser also checks for links. If they point to a node on the current site (handles path aliases) then the link's words are counted as part of the target node. This helps bring out commonly linked FAQs and answers to the top of the results.
4) Index comments along with the node. This means that the search can distinguish between a single node/comment about 'X' and a whole thread about 'X'. It also makes the search results much shorter and more relevant (before this patch, comments were even shown first).
5) We now keep track of total counts as well as a per item count for a word. This allows us to divide the word score by the total before adding up the scores for different words, and automatically makes noisewords have less influence than rare words. This dramatically improves the relevancy of multiword searches. This also makes the disadvantage of now using OR searching instead of AND searching less problematic.
6) Includes support for text preprocessors through a hook. This is required to index Chinese and Japanese, because these languages do not use spaces between words. An external utility can be used to split these into words through a simple wrapper module. Other uses could be spell checking (although it would have no UI).
7) Indexing is now regulated: only a certain number of items will be indexed per cron run. This prevents PHP from running out of memory or timing out. This also makes the reindexing required for this patch automatic. I also added an index coverage estimate to the search admin screen.
8) Code cleanup! Moved all the search stuff from common.inc into search.module, rewired some hooks and simplified the functions used. The search form and results now also use valid XHTML and form_ functions. The search admin was moved from search/configure to admin/search for consistency.
9) Improved search output: we also show much more info per item: date, author, node type, number of comments and a cool dynamic excerpt à la Google. The search form is now much simpler and the help is only displayed as tips when no search results are found.
10) By moving all search logic to SQL, I was able to add a pager to the search results. This improves usability and performance dramatically.
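The regulated indexing described in item 7 can be sketched as follows. This is a minimal illustration in Python, not the code in search.module: the function names `index_batch` and `coverage`, the queue, and the limit of 100 items are all assumptions made for the example.

```python
from collections import deque


def index_batch(pending, index_item, limit=100):
    """Index at most `limit` items from the queue; return how many were done."""
    done = 0
    while pending and done < limit:
        index_item(pending.popleft())
        done += 1
    return done


def coverage(indexed, total):
    """Rough percentage of items that are already in the index."""
    return 100.0 if total == 0 else 100.0 * indexed / total


# Hypothetical site with 250 items, indexed 100 at a time per simulated cron run.
queue = deque(range(250))
seen = []
runs = 0
while queue:
    index_batch(queue, seen.append, limit=100)
    runs += 1
```

Because each run handles a bounded batch, a full reindex simply spreads itself over several cron runs, and the coverage figure shows how far along it is.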
Comment | File | Size | Author |
---|---|---|---|
#33 | search_4.5.2_0.patch | 70.31 KB | pyromanfo |
#32 | search_4.5.2.patch | 71.78 KB | pyromanfo |
#26 | search_4.5.1_0.patch | 71.4 KB | pyromanfo |
#23 | search_4.5.1.patch | 69.78 KB | pyromanfo |
#9 | search_1.diff | 7.92 KB | Steven |
Comments
Comment #1
Steven commented:
Here's a screenshot:
http://acko.net/dumpx/searchpatch.png
Comment #2
nedjo commented:
+1 This contribution makes substantial improvements to all aspects of search, transforming the core search module from a very limited tool to a refined one.
Comments:
Comment #3
Steven commented:
This is possible, yes; there are various parameters in use. Central are the HTML tag scores, hardcoded in search_index().
It does, implicitly. Node and comment titles are wrapped in headers (h1/h2) so they receive a high score boost for that.
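As a rough illustration of tag-based score boosts, here is a minimal Python sketch. It is not the search_index() implementation: the weights in `TAG_WEIGHTS`, the class `TagScorer`, and the rule of using the innermost open tag are assumptions chosen for the example.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights; the real values are hardcoded in search_index().
TAG_WEIGHTS = {"h1": 10, "h2": 8, "strong": 3, "em": 3}


class TagScorer(HTMLParser):
    """Adds up a score per word, boosted by the innermost weighted tag."""

    def __init__(self):
        super().__init__()
        self.stack = []  # currently open tags that carry a weight
        self.scores = defaultdict(int)

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()

    def handle_data(self, data):
        # Text outside any weighted tag counts with weight 1.
        weight = TAG_WEIGHTS[self.stack[-1]] if self.stack else 1
        for word in re.findall(r"\w+", data.lower()):
            self.scores[word] += weight


def score_words(html):
    scorer = TagScorer()
    scorer.feed(html)
    scorer.close()
    return dict(scorer.scores)


scores = score_words("<h1>Search patch</h1><p>The search is <strong>fast</strong></p>")
```

With these assumed weights, a word in a title wrapped in h1 outscores the same word in body text, which is the implicit title boost described above.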
Comment #4
Dries commented:
A couple of issues/comments:
Comment #5
Steven commented:
I fixed those two search forms in admin and also removed the duplicate usernames in the results. My patch focuses mostly on node/comment searching; making user and profile search a quality feature is IMO a job for a different patch. Combining profile.module's browsing feature with an integrated user search should be cool for social sites (especially with fancy stuff like FOAF profile exchange coming up).
Here's an updated patch.
Comment #6
Dries commented:
Committed to HEAD. This is _big_ IMO. Thanks a bunch Steven.
Comment #7
andremolnar commented:
First off, the changes to search are amazing and work quite nicely.
This is a feature request. I did notice that noise word support has been taken out. I would like to make a request for noise words to come back.
As we all know, a search for "the" on an English-language site will return every single node, and a search for "the monkey" will likely return more results than just "monkey". Perhaps not the best use of processor time and bandwidth.
I think a good approach would be to:
1) Still index every word of every node.
2) Strip out noise words from queries before the sql query is built during search.
3) Return the results.
By indexing every word it would still allow for future improvements like allowing users to insist that a noise word be included in the search (e.g. +the monkey)
andre
Comment #8
Steven commented:
The noisewords feature was removed because on 99% of all Drupal sites, there were no noise words configured. It does not make sense to have a feature that no-one uses. Noisewords are language- and topic-dependent, so we couldn't just define a set of words of our own.
Automatically removing noisewords from the query is not easy because you have to consider wildcards. Just because "th*" matches "the" doesn't mean that "th*" should be removed completely.
The relative ranking of multiword searches does make it so that noisewords are automatically irrelevant for the query. The search takes the Google approach: instead of making sure /all/ results are relevant, we try to ensure the top 10-20 results are relevant.
Still, we do have total and count information. I'll see if I can implement a good condition to sort noisewords from real words directly in the SQL query.
Comment #9
Steven commented:
Dries and I updated Drupal.org to use this patch, which has revealed some issues. Here's a patch to fix them:
- Display 'friendly' name rather than module name in search watchdog messages.
- Remove left-over from search_total table.
- Add index wipe button to the admin
- Moved the admin to admin/settings/search
- Prevented menu bug when node modules update the breadcrumb in view (thanks JonBob).
- Changed search_total table's word key to PRIMARY.
Comment #10
andremolnar commented:
Regarding noise words:
Perhaps one of the reasons noise words were not used by users was that they didn't actually work. See http://drupal.org/node/11636
As for what words to use etc. This should be a user defined option as it was in the previous version of search. The addition of a page to the Handbook might help users choose their words wisely. Perhaps something along the lines of http://drupal.org/node/1202
As for how to go about removing the noise words: My thinking was that if the words are removed from the keys prior to the search query being built it would reduce an overly complicated SQL statement filled with exceptions.
Overall the pseudo code may be something like:
How the noise words are stored and the admin interface for the words are a design choice.
I also figure that this approach would allow users to use an operator like '+' to indicate that they know it's a common word but would still like it included in the results (e.g. "+the monkey" à la Google). Since "+the" would not match anything in $noisewordpatterns, it would NOT be removed from $keys in the suggested method above.
Then it would only be a matter of a quick str_replace("+", "", $keys) just prior to building the search query.
I wish I was ready to contribute a patch of my own, but I'm still learning the Drupal API and hooks.
Any thoughts?
andre
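The approach described in this comment can be sketched as follows. This is a minimal Python illustration, not Drupal code and not a reconstruction of the pseudo code mentioned above: the word list `NOISE_WORDS` and the function `strip_noise_words` are assumptions made for the example.

```python
NOISE_WORDS = {"the", "a", "of"}  # hypothetical, site-defined list


def strip_noise_words(keys, noise_words=NOISE_WORDS):
    """Drop noise words from the search keys before the query is built."""
    kept = []
    for term in keys.split():
        if term.startswith("+"):
            # Explicitly requested word: keep it, but drop the operator.
            word = term[1:]
            if word:
                kept.append(word)
        elif term.lower() not in noise_words:
            kept.append(term)
    return " ".join(kept)


plain = strip_noise_words("the monkey")
forced = strip_noise_words("+the monkey")
wildcard = strip_noise_words("th* monkey")
```

A wildcard term such as "th*" is not itself in the list, so this sketch leaves it untouched, which sidesteps the wildcard concern raised in comment #8 rather than solving it.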
Comment #11
Dries commented:
Before reintroducing noise words I want to evaluate the current search module improvements. I didn't like the 'noise words' feature to begin with, and chances are the new search rating/ranking makes it (partly) redundant.
Comment #12
andremolnar commented:
I would certainly rank 'noise words' as a lower priority than the search features that are currently being introduced, but regardless of that, noise words really do skew results.
And after scanning the ranking code in the search module, I can see cases where noise words can further skew the results. For example
Will all bump words that need no help (in the last example 6 out of the 8 words that don't need help get improved scores).
Then again how noise words are indexed is a different feature request. i.e. Still index noise words, but DO NOT assign greater weight/score to them.
In my particular installation I'm likely to hard code a solution to remove noise words from searches for my own site (mainly because my company, Be Circle, has a noise word right in the title and it shows up a great deal on my site, and any search that includes the word "be" is going to skew the results in ways that won't be useful for my visitors).
Still, IMO noise words, as a feature, have many benefits for users that care to implement them. And while 99% of Drupal sites didn't implement them, that 1% might be annoyed that a feature they use has been removed.
I'll leave it at that. After all, this is just a feature request. I will try to provide patches of my own when I am capable.
andre
Comment #13
Steven commented:
To illustrate how noise words are implicitly taken into account now:
Search_total value for 'Drupal': 44384
Search_total value for 'the': 137428
Search_total value for 'release': 1914
Search_total value for 'buytaert': 4
When you search for a combination of words, the individual word scores are divided by their total before adding everything up. This means that relatively, "buytaert" will add about 10000x more weight to the ranking than "drupal" and about 30000x more than "the". Such a 0.01% difference will have a negligible effect on ranking.
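The arithmetic can be checked with a short Python sketch. The totals are the ones quoted above; the function names and the assumption that the ranking is a plain sum of score divided by total are illustrative, not taken from search.module.

```python
# search_total values quoted above for drupal.org.
totals = {"drupal": 44384, "the": 137428, "release": 1914, "buytaert": 4}


def relative_weight(word):
    """Weight of one unit of score for `word` after dividing by its total."""
    return 1.0 / totals[word]


def rank(item_scores, query):
    """Assumed ranking: per-item word scores, each divided by the word's total."""
    return sum(item_scores.get(word, 0) / totals[word] for word in query)


ratio_drupal = relative_weight("buytaert") / relative_weight("drupal")
ratio_the = relative_weight("buytaert") / relative_weight("the")

# Hypothetical items: one mentions "buytaert" once, the other only repeats "the".
item_a = {"the": 50, "buytaert": 1}
item_b = {"the": 200}
query = ["the", "buytaert"]
```

The ratios come out at roughly 11,000 and 34,000, in line with the "about 10000x" and "about 30000x" figures, and under this assumed formula the item with a single "buytaert" ranks far above the item with 200 occurrences of "the".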
Here are the 20 words with the highest count on drupal.org at the moment:
As you can see, these include words that one would normally not consider to be regular noise words, but which can be considered noise words within the context of Drupal.org. These words will still be considered when searching, and still work when searched on exclusively, but when coupled together with other words they will have a very small effect on the results.
Comment #14
Steven commented:
Some more updates:
- When a comment is posted, a node needs to be re-indexed. Luckily, we can use node_comment_statistics for this easily.
- When a node is deleted, it should be deleted from the search index as well.
- The search wipe didn't properly remove links to nodes from the index.
- Section url was faulty in _help.
- Minor code rearrangement.
Comment #15
andremolnar commented:
Regarding these patches:
Wildcards do not appear to be working (in my installation or on drupal.org)
e.g. search for drup* does not return results
andre
Comment #17
Steven commented:
The UTF-8 in the file got corrupted, so the U+FFFD replacement character was borked. I fixed it in CVS.
Comment #18
Dries commented:
Committed to HEAD! Thanks.
Comment #19
(not verified)
Comment #20
adamrice commented:
This really caught my eye, as I'm working on a bilingual en/ja website, and search is really the Achilles' heel. I've discovered a simple patch to the standard search.module that allows it to search Japanese content, but this would be much better. Any specific tips on hooking in, say, Namazu to handle searching?
It also occurs to me that noise-word handling could be treated as a different kind of lexical analyzer, although I have no idea exactly how that might be coded up. But I can imagine an array of search-augmenting plugins that handle noise words in different languages using the same interface.
Comment #21
TDobes commented:
The fact that the version of this issue was changed has caused confusion.
adamrice: The 4.5 branch is in a bugfix-only state. All new features and enhancements go on the HEAD branch. Many search improvements have already been committed to the CVS HEAD branch... perhaps you could try it out and see if it fits your needs? It sounds like your questions would be more appropriate for the forum or drupal-support list.
Comment #22
pyromanfo commented:
Is anybody maintaining a version of this patch for the 4.5.x series? I know it can't go in the main distro but it'd be nice if somebody more familiar with how it works could update it for the latest 4.5.x release. Most of it applies except for a couple of lines in node.module.
Comment #23
pyromanfo commented:
You know what, I had to get it working anyway.
Here's a patch for this against 4.5.1.
Be warned: if any modules implement the _search hook, their implementation is probably wrong for this patch (such as the event module's).
I just commented it out. Does this mean I can't search events? Or do I need to copy the event_search from CVS?
Comment #24
Steven commented:
Beware, several fixes were applied to node.module and search.module in HEAD to fix bugs and improve this patch further.
Comment #25
pyromanfo commented:
Could you be a little more specific? I'd gladly add them to the patch.
Comment #26
pyromanfo commented:
I searched the CVS logs for anything committed to the Drupal repository from when Dries initially checked in your changes until now and manually added them to the code. Some of those were good changes, thanks for warning me.
Here's the updated patch. Seems to work well, let me know if anything important is missing.
Comment #27
km commented:
Hi pyromanfo, thanks for the patch!
Has to be applied with 'patch -R -p0 < search_4.5.1_0.patch'. After recreating the DB tables all data was lost but search works fine;)
What about a 4.5.2 version?
Comment #28
pyromanfo commented:
I can't seem to find any confirmation of this anywhere, but from what I can see 4.5.2 already includes the contents of this patch. It's not in the release notes but it seems to have everything this patch does.
Comment #29
Steven commented:
You seem to be confusing 4.5.2 with the CVS/HEAD version. 4.5.x does not have this patch.
Comment #30
pyromanfo commented:
I downloaded the 4.5.2 tarball from the download section here and everything from this patch seems to be already in there. The new database structure and the changes to node.module, comment.module, search.module and common.inc are all actually in the code. If it wasn't meant to be in there, okay, but the contents of the 4.5.2 search.module are pretty much exactly what the 4.5.1 search.module with this patch looks like.
Comment #31
Steven commented:
It most definitely does not. The version that I get in the 4.5.2 tarball is (check the top of search.module):
// $Id: search.module,v 1.88.2.2 2005/01/11 04:18:12 unconed Exp $
The one in CVS/HEAD is :
// $Id: search.module,v 1.112 2005/01/15 09:03:39 dries Exp $
The new version starts with:
Note that the 4.5.1 patch posted above seems to be in reverse. So you need to swap - + to go from old to new. The new one has tons of character class definitions near the top like "\x{0}-\x{23}\x{25}-\x{2a}\x{2c}-\x{2f}\x{3a}".
Comment #32
pyromanfo commented:
Yeah, that would be it. The patch was backwards and I had no idea which version I was talking about.
So here's the 4.5.2 patch for the search improvements.
Comment #33
pyromanfo commented:
Whoops, uploaded the wrong file. This one should work.
Comment #34
Mad Maks commented:
Today I upgraded a site from 4.5.2 to the CVS version, but now I get an error with the search:
Compilation failed: characters with values > 255 are not yet supported in classes at offset 52 in /var/www/floor/modules/search.module on line 257.
Does anyone know what is causing this?
greetings
MM
Comment #35
Steven commented:
The search requires Unicode support in Perl-compatible regular expressions. What version of PHP are you running? The PHP documentation says it should work in 4.1 on Unices and 4.2.3+ on Windows (the "u" modifier for preg_replace), but I get the impression that true Unicode support wasn't included until later, as more people have had problems with this.
Comment #36
Mad Maks commented:
Where can I find that? My pages are hosted at www.digitalus.nl.
Thanks for the help
MM
Comment #37
Bèr Kessels commented:
Digitalus is not a good host for Drupal. I hosted with them for a year, but moved because of all sorts of issues, which they refused to fix.
Comment #38
Mad Maks commented:
Off-topic: what kind of problems? I have had good experiences with them and was planning to move another site to them as well so I could use Drupal (I can't at that site's present host).
Comment #39
AndriyM commented:
How can I apply this search patch to my Drupal installation? My current search doesn't even work and this sounds like it's much better than the original search module included in (non-CVS) 4.5.2.
Comment #40
JohnG-1 commented:
Was all this wondrous tweaking included in the standard release 4.6 search.module?
Or is (some of) it still a patch?
Comment #41
Steven commented:
The issue is closed; it has been available in 4.6.0 since day one.