Search improvements [#12232]

This is the patch with search improvements I've been working on, and which has been discussed on the devel list. I think in its current state it's pretty good and ready to go into core.

More work can be done in the future, but I've done everything I've set out to do ;).

There is a demo site at http://unconed.drupaldevs.org/search

Changes since last patch on drupal-devel:
- Added comment count to node results.
- Used format_name() for author.
- Updated doxygen. I also have updated docs for the API reference ready.
- Tweaked search output.
- Moved search config to admin/search, fixed broken contextual help.
- Fixed outdated search forms in bluemarine/pushbutton.
- Added update path to updates.inc.

Overview of changes:

1) Clean up the text analyser: make it handle UTF-8 and all sorts of characters. The word splitter now does intelligent splitting into words and supports all Unicode characters. It has smart handling of acronyms, URLs, dates, ...

2) It now indexes the filtered output, which means it can take advantage of HTML tags. Meaningful tags (headers, strong, em, ...) are analysed and used to boost certain words scores. This has the side-effect of allowing the indexing of PHP nodes.

3) Link analyser for node links. The HTML analyser also checks for links. If they point to a node on the current site (handles path aliases) then the link's words are counted as part of the target node. This helps bring out commonly linked FAQs and answers to the top of the results.

4) Index comments along with the node. This means that the search can distinguish between a single node/comment about 'X' and a whole thread about 'X'. It also makes the search results much shorter and more relevant (before this patch, comments were even shown first).

5) We now keep track of total counts as well as a per item count for a word. This allows us to divide the word score by the total before adding up the scores for different words, and automatically makes noisewords have less influence than rare words. This dramatically improves the relevancy of multiword searches. This also makes the disadvantage of now using OR searching instead of AND searching less problematic.

6) Includes support for text preprocessors through a hook. This is required to index Chinese and Japanese, because these languages do not use spaces between words. An external utility can be used to split these into words through a simple wrapper module. Other uses could be spell checking (although it would have no UI).

7) Indexing is now regulated: only a certain number of items will be indexed per cron run. This prevents PHP from running out of memory or timing out. This also makes the reindexing required for this patch automatic. I also added an index coverage estimate to the search admin screen.

8) Code cleanup! Moved all the search stuff from common.inc into search.module, rewired some hooks and simplified the functions used. The search form and results now also use valid XHTML and form_ functions. The search admin was moved from search/configure to admin/search for consistency.

9) Improved search output: we also show much more info per item: date, author, node type, number of comments and a cool dynamic excerpt à la Google. The search form is now much simpler and the help is only displayed as tips when no search results are found.

10) By moving all search logic to SQL, I was able to add a pager to the search results. This improves usability and performance dramatically.
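
To illustrate point 7, here is a minimal sketch of throttled indexing. It is not the code from the patch: the function names, the in-memory queue and the limit of 2 are all made up for the example.

```php
<?php
// Hypothetical sketch: each cron run handles at most $limit items and
// leaves the rest of the queue for the next run.

/**
 * Index up to $limit items from the pending queue.
 *
 * @param array $pending  IDs still waiting to be indexed (modified in place).
 * @param int $limit      Maximum number of items to index in this run.
 * @param callable $index Callback that indexes a single item by ID.
 * @return int Number of items indexed in this run.
 */
function example_cron_index(array &$pending, int $limit, callable $index): int {
  // Take the first $limit IDs off the front of the queue.
  $batch = array_splice($pending, 0, $limit);
  foreach ($batch as $id) {
    $index($id);
  }
  return count($batch);
}

/**
 * Rough coverage estimate: percentage of items already indexed.
 */
function example_index_coverage(int $total, int $remaining): float {
  return $total > 0 ? 100.0 * ($total - $remaining) / $total : 100.0;
}

$pending = [1, 2, 3, 4, 5];
$indexed = [];
$done = example_cron_index($pending, 2, function ($id) use (&$indexed) {
  $indexed[] = $id;
});
```

After one run, two items are indexed and three remain queued, so memory use and run time per cron call stay bounded no matter how large the site is.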

Comment	File	Size	Author
#33	search_4.5.2_0.patch	70.31 KB	pyromanfo
#32	search_4.5.2.patch	71.78 KB	pyromanfo
#26	search_4.5.1_0.patch	71.4 KB	pyromanfo
#23	search_4.5.1.patch	69.78 KB	pyromanfo
#9	search_1.diff	7.92 KB	Steven
#5	search_0.diff	60.59 KB	Steven
	search.diff	59.82 KB	Steven

Comments

Comment #1

Steven commented 29 October 2004 at 00:57

Here's a screenshot:
http://acko.net/dumpx/searchpatch.png

Comment #2

nedjo commented 29 October 2004 at 15:48

+1 This contribution makes substantial improvements to all aspects of search, transforming the core search module from a very limited tool to a refined one.

Comments:

With the ranking of results, it might be useful to provide settings in the admin settings page to tweak the score applied to specific parameters.
Could the scoring distinguish between fields that the term occurs in (e.g., title score is higher than body)?

Comment #3

Steven commented 29 October 2004 at 16:11

* With the ranking of results, it might be useful to provide settings in the admin settings page to tweak the score applied to specific parameters.

This is possible, yes; there are various parameters in use. Central are the HTML tag scores, hardcoded in search_index().

* Could the scoring distinguish between fields that the term occurs in (e.g., title score is higher than body)?

It does, implicitly. Node and comment titles are wrapped in headers (h1/h2) so they receive a high score boost for that.
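
As a rough illustration of the principle (this is not the actual search_index() code; the function name and the tag weights below are invented for the example):

```php
<?php
/**
 * Score each word in an HTML fragment, boosting words inside known tags.
 * Nested tags are not handled; this only shows the principle.
 */
function example_score_words(string $html): array {
  // Hypothetical weights, not the values hardcoded in search_index().
  $weights = ['h1' => 20, 'h2' => 15, 'strong' => 3, 'em' => 3];
  $scores = [];
  // Split into tags and text, keeping the tags as separate pieces.
  $pieces = preg_split('/(<[^>]+>)/', $html, -1, PREG_SPLIT_DELIM_CAPTURE | PREG_SPLIT_NO_EMPTY);
  $weight = 1;
  foreach ($pieces as $piece) {
    if ($piece[0] === '<') {
      // A known opening tag raises the weight; any other tag resets it to 1.
      preg_match('/^<\s*(\/?)\s*([a-z0-9]+)/i', $piece, $m);
      $closing = ($m[1] ?? '') === '/';
      $tag = strtolower($m[2] ?? '');
      $weight = $closing ? 1 : ($weights[$tag] ?? 1);
      continue;
    }
    // strtolower() only handles ASCII; good enough for this example.
    $words = preg_split('/[^\p{L}\p{N}]+/u', strtolower($piece), -1, PREG_SPLIT_NO_EMPTY);
    foreach ($words as $word) {
      $scores[$word] = ($scores[$word] ?? 0) + $weight;
    }
  }
  return $scores;
}

$scores = example_score_words('<h2>Search tips</h2><p>Use the search form.</p>');
```

With these made-up weights, "search" gets 15 from the heading plus 1 from the body, while "form" only gets 1.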

Comment #4

dries commented 29 October 2004 at 20:45

A couple of issues/comments:

The results of a 'user search' look somewhat dull: it is repeating each username (using Xtemplate).
Using the admin search forms (both admin/node/search and admin/user/search) results in a 'Call to undefined function'.

Comment #5

Steven commented 29 October 2004 at 21:57

Status	File	Size
new	search_0.diff	60.59 KB

I fixed those two search forms in admin and also removed the duplicate usernames in the results. My patch focuses on node/comment searching mostly; making user and profile search a quality feature is IMO a different patch to do. Combining profile.module's browsing feature with an integrated user search should be cool for social sites (especially with fancy stuff like FOAF profile exchange coming up).

Here's an updated patch.

Comment #6

dries commented 31 October 2004 at 03:05

Committed to HEAD. This is _big_ IMO. Thanks a bunch Steven.

Comment #7

andremolnar commented 3 November 2004 at 05:20

First off, the changes to search are amazing and work quite nicely.

This is a feature request. I did notice that noise word support has been taken out. I would like to make a request for noise words to come back.

As we all know, a search for "the" on an English language site will return every single node - and a search for "the monkey" will likely return more results than just "monkey". Perhaps not the best use of processor time and bandwidth.

I think a good approach would be to:
1) Still index every word of every node.
2) Strip out noise words from queries before the sql query is built during search.
3) Return the results.

By indexing every word it would still allow for future improvements like allowing users to insist that a noise word be included in the search (e.g. +the monkey).

andre

Comment #8

Steven commented 3 November 2004 at 12:32

The reason the noisewords feature was removed was because on 99% of all Drupal sites, there were no noise words configured. It does not make sense to have a feature that no-one uses. Noisewords are language- and topic-dependent so we couldn't just define a set of words of our own.

Automatically removing noisewords from the query is not easy because you have to consider wildcards. Just because "th*" matches "the" doesn't mean that "th*" should be removed completely.
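
A small sketch of the wildcard problem. The helper name and the noise word list are hypothetical, and a trailing * is assumed to mean "any suffix":

```php
<?php
/**
 * Return the noise words that a search key could match.
 * A trailing * is treated as "any suffix"; other keys must match exactly.
 */
function example_matching_noisewords(string $key, array $noisewords): array {
  if (substr($key, -1) === '*') {
    $prefix = substr($key, 0, -1);
    // Keep every noise word that starts with the wildcard's prefix.
    return array_values(array_filter($noisewords, function ($word) use ($prefix) {
      return strncmp($word, $prefix, strlen($prefix)) === 0;
    }));
  }
  return in_array($key, $noisewords, true) ? [$key] : [];
}

$noisewords = ['the', 'and', 'for'];
$exact = example_matching_noisewords('the', $noisewords);
$wild  = example_matching_noisewords('th*', $noisewords);
```

Both calls report "the", but only the exact key is safe to drop: "th*" also stands for words like "theme" or "thread", so removing it would throw away real results.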

The relative ranking of multiword searches does make it so that noisewords are automatically irrelevant for the query. The search takes the Google approach: instead of making sure /all/ results are relevant, we try to ensure the top 10-20 results are relevant.

Still, we do have total and count information. I'll see if I can implement a good condition to sort noisewords from real words directly in the SQL query.

Comment #9

Steven commented 3 November 2004 at 16:13

Status	File	Size
new	search_1.diff	7.92 KB

Dries and I updated Drupal.org to use this patch, which has revealed some issues. Here's a patch to fix them:

- Display 'friendly' name rather than module name in search watchdog messages.
- Remove left-over from search_total table.
- Add index wipe button to the admin
- Moved the admin to admin/settings/search
- Prevented menu bug when node modules update the breadcrumb in view (thanks JonBob).
- Changed search_total table's word key to PRIMARY.

Comment #10

andremolnar commented 3 November 2004 at 18:41

Regarding Noise words:

Perhaps one of the reasons noise words were not used by users was that they didn't actually work. See http://drupal.org/node/11636

As for which words to use: this should be a user-defined option, as it was in the previous version of search. The addition of a page to the Handbook might help users choose their words wisely. Perhaps something along the lines of http://drupal.org/node/1202

As for how to go about removing the noise words: My thinking was that if the words are removed from the keys prior to the search query being built it would reduce an overly complicated SQL statement filled with exceptions.

Overall the pseudo code may be something like:

create an array of noiseword patterns pulled from some kind of storage (e.g. a table or field originally populated by an admin configuration page for noisewords)
// the array might look something like this
// (each pattern only matches a whole, whitespace-delimited word)
// $noisewordpatterns[0] = '/(?<!\S)noiseword0(?!\S)/i';
// $noisewordpatterns[1] = '/(?<!\S)noiseword1(?!\S)/i';
// ...
// $noisewordpatterns[n] = '/(?<!\S)noisewordn(?!\S)/i';

define the replacement value
// in this case it need only be $noisewordreplacement = '';

do a preg_replace on the $keys string
// $keys = preg_replace($noisewordpatterns, $noisewordreplacement, $keys);

optionally add some code to output a message telling the user which noisewords, if any, were not included in the search
// output may be something like implode(', ', $matchednoisewords) . " are common words that were not included in your search";

build the sql query based on the new value of $keys

How the noise words are stored and the admin interface for the words are a design choice.

I also figure that this approach would allow users to use an operator like '+' to indicate that they know it's a common word but would still like it included in the results (e.g. "+the monkey" à la Google). Since "+the" would not match anything in $noisewordpatterns, it would NOT be removed from $keys in the method suggested above.
Then it would only be a matter of a quick str_replace("+", "", $keys) just prior to building the search query.
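
The steps above could be sketched roughly as follows. This is a variation rather than a literal translation: it compares whole words instead of running preg_replace, so a noise word like "the" is never stripped out of the middle of "theme". The function name and the hardcoded word list are placeholders.

```php
<?php
/**
 * Strip noise words from a search string.
 *
 * Words prefixed with + are always kept (with the + removed).
 * Returns [cleaned keys, list of removed noise words].
 */
function example_strip_noisewords(string $keys, array $noisewords): array {
  $kept = [];
  $removed = [];
  foreach (preg_split('/\s+/', trim($keys), -1, PREG_SPLIT_NO_EMPTY) as $word) {
    if ($word[0] === '+' && strlen($word) > 1) {
      // Explicitly requested by the user: keep it, minus the operator.
      $kept[] = substr($word, 1);
    }
    elseif (in_array(strtolower($word), $noisewords, true)) {
      $removed[] = $word;
    }
    else {
      $kept[] = $word;
    }
  }
  return [implode(' ', $kept), $removed];
}

[$keys, $removed] = example_strip_noisewords('the monkey and +the zoo', ['the', 'and']);
$message = '';
if ($removed) {
  $message = implode(', ', $removed) . ' are common words that were not included in your search';
}
```

Here $keys ends up as "monkey the zoo": the plain "the" and "and" are dropped, while "+the" survives without its operator.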

I wish I was ready to contribute a patch of my own, but I'm still learning the drupal api and hooks.

Any thoughts?

andre

Comment #11

dries commented 3 November 2004 at 18:49

Before reintroducing noise words I want to evaluate the current search module improvements. I didn't like the 'noise words' feature to begin with, and chances are the new search rating/ranking makes it (partly) redundant.

Comment #12

andremolnar commented 3 November 2004 at 20:50

I would certainly rank 'noise words' as a lower priority than the search features that are currently being introduced, but regardless of that noise words really do skew results.

And after scanning the ranking code in the search module, I can see cases where noise words can further skew the results. For example

<h2>A list apart</h2> or
<a href="contact">Contact Be Circle</a> or
<strong>What to do in the event of fire</strong>

will all bump words that need no help (in the last example, 6 out of the 8 words don't need help but get improved scores).

Then again how noise words are indexed is a different feature request. i.e. Still index noise words, but DO NOT assign greater weight/score to them.

In my particular installation I'm likely to hard code a solution to remove noise words from searches for my own site (mainly because my company, Be Circle, has a noise word right in the title and it shows up a great deal in my site, and any search that includes the word "be" is going to skew the results in ways that won't be useful for my visitors).

Still, IMO noise words, as a feature, has many benefits to users that care to implement it. And while 99% of drupal sites didn't implement them, that 1% might be annoyed that a feature they use has been removed.

I'll leave it at that. After all, this is just a feature request. I will try to provide patches of my own when I am capable.

andre

Comment #13

Steven commented 3 November 2004 at 22:55

To illustrate how noise words are implicitly taken into account now:

Search_total value for 'Drupal': 44384
Search_total value for 'the': 137428
Search_total value for 'release': 1914
Search_total value for 'buytaert': 4

When you search for a combination of words, the individual word scores are divided by their total before adding everything up. This means that relatively, "buytaert" will add about 10000x more weight to the ranking than "drupal" and about 30000x more than "the". Such a 0.01% difference will have a negligible effect on ranking.
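
The ratios can be checked with a few lines. This is a simplification that assumes each occurrence of a word contributes 1 / total to the score; the function name is made up.

```php
<?php
// Totals quoted above from the search_total table.
$totals = ['drupal' => 44384, 'the' => 137428, 'release' => 1914, 'buytaert' => 4];

/**
 * Relative weight of one occurrence of $a compared to one occurrence of $b.
 * If each occurrence counts for 1 / total, then
 * (1 / total_a) / (1 / total_b) = total_b / total_a.
 */
function example_relative_weight(array $totals, string $a, string $b): float {
  return $totals[$b] / $totals[$a];
}

$vs_drupal = example_relative_weight($totals, 'buytaert', 'drupal'); // 44384 / 4
$vs_the    = example_relative_weight($totals, 'buytaert', 'the');    // 137428 / 4
```

That gives 11096 and 34357, which is where the rough "10000x" and "30000x" figures come from.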

Here are the 20 words with the highest count on drupal.org at the moment:

+-----------+--------+
| word      | count  |
+-----------+--------+
| the       | 137428 |
| and       |  53359 |
| http      |  49317 |
| updates   |  45045 |
| for       |  45018 |
| drupal    |  44384 |
| this      |  39785 |
| you       |  34332 |
| not       |  34076 |
| module    |  31123 |
| that      |  30232 |
| with      |  29599 |
| node      |  22873 |
| page      |  20749 |
| have      |  19988 |
| can       |  19520 |
| but       |  18496 |
| problem   |  17747 |
| drupalorg |  16137 |
| user      |  15849 |
+-----------+--------+

As you can see, these include words that one would normally not consider to be regular noise words, but which can be considered noise words within the context of Drupal.org. These words will still be considered when searching, and still work when searched on exclusively, but when coupled together with other words they will have a very small effect on the results.

Comment #14

Steven commented 4 November 2004 at 01:32

Some more updates:

- When a comment is posted, a node needs to be re-indexed. Luckily, we can use node_comment_statistics for this easily.
- When a node is deleted, it should be deleted from the search index as well.
- The search wipe didn't properly remove links to nodes from the index.
- Section url was faulty in _help.
- Minor code rearrangement.

Comment #15

andremolnar commented 4 November 2004 at 03:16

Regarding these patches:

Wildcards do not appear to be working (in my installation or on drupal.org)

e.g. search for drup* does not return results

andre

Comment #17

Steven commented 4 November 2004 at 03:31

The UTF-8 in the file got corrupted, so the U+FFFD replacement character was borked. I fixed it in CVS.

Comment #18

dries commented 4 November 2004 at 06:44

Committed to HEAD! Thanks.

Comment #19

(not verified) commented 18 November 2004 at 07:15

Comment #20

adamrice commented 23 November 2004 at 15:22

Title: Search improvements » Lexical analyzer hooks

This really caught my eye, as I'm working on a bilingual en/ja website, and search is really the Achilles' heel. I've discovered a simple patch to the standard search.module that allows it to search Japanese content, but this would be much better. Any specific tips on hooking in, say, Namazu to handle searching?

It also occurs to me that noise-word handling could be treated as a different kind of lexical analyzer, although I have no idea exactly how that might be coded up. But I can imagine an array of search-augmenting plugins that handle noise words in different languages using the same interface.

Comment #21

TDobes commented 27 November 2004 at 21:21

Title: Lexical analyzer hooks » Search improvements

The fact that the version of this issue was changed has caused confusion.

adamrice: The 4.5 branch is in a bugfix-only state. All new features and enhancements go on the HEAD branch. Many search improvements have already been committed to the CVS HEAD branch... perhaps you could try it out and see if it fits your needs? It sounds like your questions would be more appropriate for the forum or drupal-support list.

Comment #22

pyromanfo commented 12 December 2004 at 17:56

Is anybody maintaining a version of this patch for the 4.5.x series? I know it can't go in the main distro but it'd be nice if somebody more familiar with how it works could update it for the latest 4.5.x release. Most of it applies except for a couple of lines in node.module.

Comment #23

pyromanfo commented 12 December 2004 at 18:31

Status	File	Size
new	search_4.5.1.patch	69.78 KB

You know what, I had to get it working anyway.

Here's a patch for this against 4.5.1

Be warned, if any modules implement the _search hook, it's probably wrong (such as the event module).

I just commented it out. Does this mean I can't search events? Or do I need to copy the event_search from CVS?

Comment #24

Steven commented 12 December 2004 at 22:18

Beware, several fixes were applied to node.module and search.module in head to fix bugs and improve this patch further.

Comment #25

pyromanfo commented 13 December 2004 at 00:39

Could you be a little more specific? I'd gladly add them to the patch.

Comment #26

pyromanfo commented 14 December 2004 at 02:34

Status	File	Size
new	search_4.5.1_0.patch	71.4 KB

I searched the CVS logs for anything committed to the Drupal repository from when Dries initially checked in your changes until now and manually added them to the code. Some of those were good changes, thanks for warning me.

Here's the updated patch. Seems to work well, let me know if anything important is missing.

Comment #27

km commented 21 January 2005 at 15:43

Hi pyromanfo, thanks for the patch!

Has to be applied with 'patch -R -p0 < search_4.5.1_0.patch'. After recreating the DB tables all data was lost but search works fine;)

What about a 4.5.2 version?

Comment #28

pyromanfo commented 29 January 2005 at 18:26

I can't seem to find any confirmation of this anywhere, but from what I can see 4.5.2 already includes the contents of this patch. It's not in the release notes but it seems to have everything this patch does.

Comment #29

Steven commented 29 January 2005 at 21:35

You seem to be confusing 4.5.2 with the CVS/HEAD version. 4.5.x does not have this patch.

Comment #30

pyromanfo commented 29 January 2005 at 22:05

I downloaded the 4.5.2 tarball from the download section here and everything from this patch seems to be already in there. The new database structure, changes to node.module, comment.module, search.module and common.inc are all actually in the code. If it wasn't meant to be in there, okay, but the contents of the 4.5.2 search.module is pretty much exactly what the 4.5.1 search.module with this patch looks like.

Comment #31

Steven commented 29 January 2005 at 22:15

It most definitely does not. The version that I get in the 4.5.2 tarball is (check the top of search.module):
// $Id: search.module,v 1.88.2.2 2005/01/11 04:18:12 unconed Exp $

The one in CVS/HEAD is :
// $Id: search.module,v 1.112 2005/01/15 09:03:39 dries Exp $

The new version starts with:

/**
 * @file
 * Enables site-wide keyword searching.
 */

/**
 * Matches Unicode character classes to exclude from the search index.
 *
 * See: http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
 *

Note that the 4.5.1 patch posted above seems to be in reverse. So you need to swap - + to go from old to new. The new one has tons of character class definitions near the top like "\x{0}-\x{23}\x{25}-\x{2a}\x{2c}-\x{2f}\x{3a}".

Comment #32

pyromanfo commented 29 January 2005 at 22:49

Status	File	Size
new	search_4.5.2.patch	71.78 KB

Yeah, that would be it. The patch was backwards and I had no idea which version I was talking about.

So here's the 4.5.2 patch for the search improvements.

Comment #33

pyromanfo commented 29 January 2005 at 23:01

Status	File	Size
new	search_4.5.2_0.patch	70.31 KB

Whoops, uploaded wrong file. This one should work.

Comment #34

Mad Maks commented 27 February 2005 at 18:42

Today I upgraded a site from 4.5.2 to the CVS version, but now I get an error with the search:

Compilation failed: characters with values > 255 are not yet supported in classes at offset 52 in /var/www/floor/modules/search.module on line 257.

Does anyone know what is causing this?

greetings

Comment #35

Steven commented 27 February 2005 at 19:51

The search requires unicode support in perl-compatible regular expressions. What version of PHP are you running? The PHP documentation says it should work in 4.1 on Unices and 4.2.3+ on Windows (the "u" modifier for preg_replace), but I get the impression that the true unicode support wasn't included until later as more people have had problems with this.
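
One way to check a given host is a small test script along these lines. This is only a sketch: the function name is made up and the pattern is a stand-in with a code point above 255, not the actual expression from search.module.

```php
<?php
/**
 * Check whether PCRE was built with UTF-8 support.
 */
function example_pcre_has_unicode(): bool {
  // The @ hides the compilation warning on builds without UTF-8 support;
  // preg_match() returns false in that case instead of 1.
  // "\xC4\x80" is U+0100 encoded as UTF-8, which falls inside the class.
  return @preg_match('/[\x{100}-\x{17f}]/u', "\xC4\x80") === 1;
}

$php_version = PHP_VERSION;
$pcre_version = defined('PCRE_VERSION') ? PCRE_VERSION : 'unknown';
$unicode_ok = example_pcre_has_unicode();
```

Printing $php_version, $pcre_version and $unicode_ok from a page on the affected host shows which PHP and PCRE versions are in use and whether the "u" modifier really works there. The output of phpinfo() lists the same version information.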

Comment #36

Mad Maks commented 27 February 2005 at 23:24

where can i find that? my pages are hosted at www.digitalus.nl.

thanks for the help