This is the patch with search improvements I've been working on, and which has been discussed on the devel list. I think in its current state it's pretty good and ready to go into core.

More work can be done in the future, but I've done everything I've set out to do ;).

There is a demo site on http://unconed.drupaldevs.org/search

Changes since last patch on drupal-devel:
- Added comment count to node results.
- Used format_name() for author.
- Updated doxygen. I also have updated docs for the API reference ready.
- Tweaked search output.
- Moved search config to admin/search, fixed broken contextual help.
- Fixed outdated search forms in bluemarine/pushbutton.
- Added update path to updates.inc.

Overview of changes:

1) Clean up the text analyser: make it handle UTF-8 and all sorts of characters. The word splitter now does intelligent splitting into words and supports all Unicode characters. It has smart handling of acronyms, URLs, dates, ...

2) It now indexes the filtered output, which means it can take advantage of HTML tags. Meaningful tags (headers, strong, em, ...) are analysed and used to boost certain words scores. This has the side-effect of allowing the indexing of PHP nodes.

3) Link analyser for node links. The HTML analyser also checks for links. If they point to a node on the current site (handles path aliases) then the link's words are counted as part of the target node. This helps bring out commonly linked FAQs and answers to the top of the results.

4) Index comments along with the node. This means that the search can distinguish between a single node/comment about 'X' and a whole thread about 'X'. It also makes the search results much shorter and more relevant (before this patch, comments were even shown first).

5) We now keep track of total counts as well as a per item count for a word. This allows us to divide the word score by the total before adding up the scores for different words, and automatically makes noisewords have less influence than rare words. This dramatically improves the relevancy of multiword searches. This also makes the disadvantage of now using OR searching instead of AND searching less problematic.

6) Includes support for text preprocessors through a hook. This is required to index Chinese and Japanese, because these languages do not use spaces between words. An external utility can be used to split these into words through a simple wrapper module. Other uses could be spell checking (although it would have no UI).

7) Indexing is now regulated: only a certain amount of items will be indexed per cron run. This prevents PHP from running out of memory or timing out. This also makes the reindexing required for this patch automatic. I also added an index coverage estimate to the search admin screen.

8) Code cleanup! Moved all the search stuff from common.inc into search.module, rewired some hooks and simplified the functions used. The search form and results now also use valid XHTML and form_ functions. The search admin was moved from search/configure to admin/search for consistency.

9) Improved search output: we also show much more info per item: date, author, node type, number of comments and a cool dynamic excerpt à la Google. The search form is now much simpler and the help is only displayed as tips when no search results are found.

10) By moving all search logic to SQL, I was able to add a pager to the search results. This improves usability and performance dramatically.
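
Points 5 and 10 can be sketched roughly as follows. This is a minimal illustration in Python with SQLite, not the module's actual PHP or SQL. The `word_index` table, the column names, and the `build_demo_index()` and `search()` helpers are all assumptions made up for the example; only the idea of a `search_total` table, dividing each word's score by its total, OR matching, and paging in SQL comes from the description above.

```python
import sqlite3

def build_demo_index(items):
    """Build a tiny in-memory index; `items` maps item id -> list of words.

    The schema is hypothetical and only mirrors the idea of a per-item
    score table plus a per-word total table.
    """
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE word_index (word TEXT, sid INTEGER, score REAL)")
    db.execute("CREATE TABLE search_total (word TEXT PRIMARY KEY, count REAL)")
    for sid, words in items.items():
        for word in set(words):
            # Per-item score: here simply the number of occurrences.
            db.execute("INSERT INTO word_index VALUES (?, ?, ?)",
                       (word, sid, words.count(word)))
    # Total score of each word across all items.
    db.execute("INSERT INTO search_total "
               "SELECT word, SUM(score) FROM word_index GROUP BY word")
    return db

def search(db, keys, page=0, per_page=10):
    """OR-search for `keys`, ranked by SUM(score / total), one page at a time."""
    marks = ",".join("?" * len(keys))
    sql = ("SELECT i.sid, SUM(i.score / t.count) AS relevance "
           "FROM word_index i JOIN search_total t ON i.word = t.word "
           f"WHERE i.word IN ({marks}) "
           "GROUP BY i.sid ORDER BY relevance DESC, i.sid "
           "LIMIT ? OFFSET ?")
    return db.execute(sql, (*keys, per_page, page * per_page)).fetchall()

items = {
    1: "the drupal release notes for the drupal core".split(),
    2: "the weather is nice".split(),
    3: "notes about the release".split(),
}
db = build_demo_index(items)
first_page = search(db, ["the", "release"], page=0, per_page=2)
second_page = search(db, ["the", "release"], page=1, per_page=2)
```

In this toy data, item 2 matches only the common word "the" and so ends up on the second page, behind the two items that also contain the rarer word "release". Because ranking and paging both happen in the query, only one page of rows is ever fetched.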


Comments

nedjo’s picture

+1 This contribution makes substantial improvements to all aspects of search, transforming the core search module from a very limited tool to a refined one.

Comments:

  • With the ranking of results, it might be useful to provide settings in the admin settings page to tweak the score applied to specific parameters.
  • Could the scoring distinguish between fields that the term occurs in (e.g., title score is higher than body)?
Steven’s picture

* With the ranking of results, it might be useful to provide settings in the admin settings page to tweak the score applied to specific parameters.

This is possible, yes; there are various parameters in use. Central are the HTML tag scores, hardcoded in search_index().

* Could the scoring distinguish between fields that the term occurs in (e.g., title score is higher than body)?

It does, implicitly. Node and comment titles are wrapped in headers (h1/h2) so they receive a high score boost for that.
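
As a rough illustration of that kind of tag-based boosting, here is a minimal sketch in Python rather than the module's PHP. The multipliers in `TAG_SCORES` and the `score_html()` helper are invented for the example; the real values are whatever is hardcoded in search_index().

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical multipliers, chosen only for illustration.
TAG_SCORES = {"h1": 10, "h2": 8, "strong": 3, "em": 3}

class ScoringParser(HTMLParser):
    """Accumulates a score per word, boosted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []             # stack of currently open scored tags
        self.scores = defaultdict(int)  # word -> accumulated score

    def handle_starttag(self, tag, attrs):
        if tag in TAG_SCORES:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if self.open_tags and self.open_tags[-1] == tag:
            self.open_tags.pop()

    def handle_data(self, data):
        # Use the strongest boost among the open tags; plain text scores 1.
        weight = max((TAG_SCORES[t] for t in self.open_tags), default=1)
        for word in re.findall(r"\w+", data.lower()):
            self.scores[word] += weight

def score_html(html):
    parser = ScoringParser()
    parser.feed(html)
    parser.close()
    return dict(parser.scores)

scores = score_html("<h2>Search tips</h2><p>Use <strong>search</strong> often.</p>")
```

With these made-up weights, "search" scores higher than "often" because it appears once in the heading and once in strong text, which is the same implicit title boost described above.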

Dries’s picture

A couple of issues/comments:

  • The results of a 'user search' look somewhat dull: it is repeating each username (using Xtemplate).
  • Using the admin search forms (both admin/node/search and admin/user/search) results in a 'Call to undefined function'.
Steven’s picture

FileSize
60.59 KB

I fixed those two search forms in admin and also removed the duplicate usernames in the results. My patch focuses on node/comment searching mostly; making user and profile search a quality feature is IMO a different patch to do. Combining profile.module's browsing feature with an integrated user search should be cool for social sites (especially with fancy stuff like FOAF profile exchange coming up).

Here's an updated patch.

Dries’s picture

Committed to HEAD. This is _big_ IMO. Thanks a bunch Steven.

andremolnar’s picture

First off, the changes to search are amazing and work quite nicely.

This is a feature request. I did notice that noise word support has been taken out. I would like to make a request for noise words to come back.

As we all know, a search for "the" on an English language site will return every single node, and a search for "the monkey" will likely return more results than just "monkey". Perhaps not the best use of processor time and bandwidth.

I think a good approach would be to:
1) Still index every word of every node.
2) Strip out noise words from queries before the sql query is built during search.
3) Return the results.

By indexing every word it would still allow for future improvements like allowing users to insist that a noise word be included in the search (e.g. +the monkey).

andre

Steven’s picture

The reason the noisewords feature was removed was that on 99% of all Drupal sites, there were no noise words configured. It does not make sense to have a feature that no-one uses. Noisewords are language- and topic-dependent so we couldn't just define a set of words of our own.

Automatically removing noisewords from the query is not easy because you have to consider wildcards. Just because "th*" matches "the" doesn't mean that "th*" should be removed completely.
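
A small sketch of the wildcard problem, in Python for illustration only. The noise word set, the sample vocabulary, and the `expand()` and `is_pure_noise()` helpers are hypothetical, and shell-style matching via `fnmatch` merely stands in for whatever wildcard syntax the search actually uses.

```python
from fnmatch import fnmatchcase

NOISE_WORDS = {"the", "and", "for"}                   # hypothetical list
INDEXED = ["the", "theme", "thread", "and", "node"]   # hypothetical vocabulary

def expand(key, vocabulary):
    """Return the indexed words that a (possibly wildcard) key matches."""
    return [word for word in vocabulary if fnmatchcase(word, key)]

def is_pure_noise(key, vocabulary):
    """A key is safe to drop only if everything it matches is a noise word."""
    matches = expand(key, vocabulary)
    return bool(matches) and all(word in NOISE_WORDS for word in matches)

matches = expand("th*", INDEXED)
drop_wildcard = is_pure_noise("th*", INDEXED)
drop_plain = is_pure_noise("the", INDEXED)
```

Here "th*" matches "the" but also "theme" and "thread", so it cannot be dropped, whereas the plain key "the" could be. Deciding this requires expanding the wildcard against the index first, which is what makes the filtering awkward to do up front.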

The relative ranking of multiword searches does make it so that noisewords are automatically irrelevant for the query. The search takes the Google approach: instead of making sure /all/ results are relevant, we try to ensure the top 10-20 results are relevant.

Still, we do have total and count information. I'll see if I can implement a good condition to sort noisewords from real words directly in the SQL query.

Steven’s picture

FileSize
7.92 KB

Dries and I updated Drupal.org to use this patch, which has revealed some issues. Here's a patch to fix them:

- Display 'friendly' name rather than module name in search watchdog messages.
- Remove left-over from search_total table.
- Add index wipe button to the admin.
- Moved the admin to admin/settings/search
- Prevented menu bug when node modules update the breadcrumb in view (thanks JonBob).
- Changed search_total table's word key to PRIMARY.

andremolnar’s picture

Regarding Noise words:

Perhaps one of the reasons noise words were not used by users was that they didn't actually work. See http://drupal.org/node/11636

As for what words to use etc. This should be a user defined option as it was in the previous version of search. The addition of a page to the Handbook might help users choose their words wisely. Perhaps something along the lines of http://drupal.org/node/1202

As for how to go about removing the noise words: My thinking was that if the words are removed from the keys prior to the search query being built it would reduce an overly complicated SQL statement filled with exceptions.

Overall the pseudo code may be something like:

create an array of noiseword patterns pulled from some kind of storage (e.g. a table or field originally populated by an admin configuration page for noisewords)
// the array might look something like this:
// $noisewordpatterns[0] = '/noiseword0/';
// $noisewordpatterns[1] = '/noiseword1/';
// ...
// $noisewordpatterns[n] = '/noisewordn/';

define the replacement value
// in this case it need only be $noisewordreplacement = '';

do a preg_replace on the $keys string
// $keys = preg_replace($noisewordpatterns, $noisewordreplacement, $keys);

optionally add some code to output a message telling the user which noisewords, if any, were not included in the search
// output may be something like implode(', ', $matchednoisewords) . " are common words that were not included in your search";

build the sql query based on the new value of $keys

How the noise words are stored and the admin interface for the words are a design choice.

I also figure that this approach would allow users to use an operator like '+' to indicate that they know it's a common word but would still like it included in the results (e.g. "+the monkey" à la Google). Since "+the" would not match anything in $noisewordpatterns, it would NOT be removed from $keys in the suggested method above.
Then it would only be a matter of a quick str_replace("+", "", $keys) just prior to building the search query.
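
The idea can be sketched as below, in Python for illustration. This is a token-based variation and not the preg_replace approach itself: it compares whole search terms against a set, so a word like "there" is not affected by the noise word "the", and a leading "+" reliably forces inclusion. The function name and the sample noise word set are hypothetical.

```python
def strip_noise_words(keys, noise_words):
    """Split `keys` on whitespace and drop noise words unless prefixed with '+'.

    Returns the cleaned key string and the list of dropped terms, so the
    caller can tell the user which words were ignored.
    """
    kept, dropped = [], []
    for term in keys.split():
        if term.startswith("+") and len(term) > 1:
            kept.append(term[1:])      # forced inclusion: strip the operator
        elif term.lower() in noise_words:
            dropped.append(term)       # common word, left out of the query
        else:
            kept.append(term)
    return " ".join(kept), dropped

noise = {"the", "and", "be"}           # hypothetical, site-configured list
plain = strip_noise_words("the monkey", noise)
forced = strip_noise_words("+the monkey", noise)
```

The first call yields the keys "monkey" with "the" reported as dropped; the second keeps both words and reports nothing dropped.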

I wish I was ready to contribute a patch of my own, but I'm still learning the drupal api and hooks.

Any thoughts?

andre

Dries’s picture

Before reintroducing noise words I want to evaluate the current search module improvements. I didn't like the 'noise words' feature to begin with, and chances are the new search rating/ranking makes it (partly) redundant.

andremolnar’s picture

I would certainly rank 'noise words' as a lower priority than the search features that are currently being introduced, but regardless of that noise words really do skew results.

And after scanning the ranking code in the search module, I can see cases where noise words can further skew the results. For example:

<h2>A list apart</h2> or
<a href="contact">Contact Be Circle</a> or
<strong>What to do in the event of fire</strong>

Will all bump words that need no help (in the last example 6 out of the 8 words that don't need help get improved scores).

Then again how noise words are indexed is a different feature request. i.e. Still index noise words, but DO NOT assign greater weight/score to them.

In my particular installation I'm likely to hard-code a solution to remove noise words from searches for my own site (mainly because my company, Be Circle, has a noise word right in its name, which shows up a great deal on my site, and any search that includes the word "be" is going to skew the results in ways that won't be useful for my visitors).

Still, IMO noise words, as a feature, has many benefits to users that care to implement it. And while 99% of drupal sites didn't implement them, that 1% might be annoyed that a feature they use has been removed.

I'll leave it at that. After all, this is just a feature request. I will try to provide patches of my own when I am capable.

andre

Steven’s picture

To illustrate how noise words are implicitly taken into account now:

Search_total value for 'Drupal': 44384
Search_total value for 'the': 137428
Search_total value for 'release': 1914
Search_total value for 'buytaert': 4

When you search for a combination of words, the individual word scores are divided by their total before adding everything up. This means that, relatively, "buytaert" will add about 10000x more weight to the ranking than "drupal" and about 30000x more than "the". Such a 0.01% difference will have a negligible effect on ranking.
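
The arithmetic behind those factors, as a short Python check. It assumes, for simplicity, that each word has the same raw score in the item being ranked, so the only difference is the division by the total; the `relative_weight()` helper is made up for the example.

```python
# Totals quoted above.
totals = {"drupal": 44384, "the": 137428, "release": 1914, "buytaert": 4}

def relative_weight(rare, common, totals):
    """How many times more weight `rare` adds than `common`.

    With equal raw scores, weight is proportional to 1 / total, so the
    ratio of weights is total(common) / total(rare).
    """
    return totals[common] / totals[rare]

vs_drupal = relative_weight("buytaert", "drupal", totals)   # 44384 / 4
vs_the = relative_weight("buytaert", "the", totals)         # 137428 / 4
```

These come out at 11096 and 34357, matching the "about 10000x" and "about 30000x" figures.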

Here are the 20 words with the highest count on drupal.org at the moment:

+-----------+--------+
| word      | count  |
+-----------+--------+
| the       | 137428 |
| and       |  53359 |
| http      |  49317 |
| updates   |  45045 |
| for       |  45018 |
| drupal    |  44384 |
| this      |  39785 |
| you       |  34332 |
| not       |  34076 |
| module    |  31123 |
| that      |  30232 |
| with      |  29599 |
| node      |  22873 |
| page      |  20749 |
| have      |  19988 |
| can       |  19520 |
| but       |  18496 |
| problem   |  17747 |
| drupalorg |  16137 |
| user      |  15849 |
+-----------+--------+

As you can see, these include words that one would normally not consider to be regular noise words, but which can be considered noise words within the context of Drupal.org. These words will still be considered when searching, and still work when searched on exclusively, but when coupled together with other words they will have a very small effect on the results.

Steven’s picture

Some more updates:

- When a comment is posted, a node needs to be re-indexed. Luckily, we can use node_comment_statistics for this easily.
- When a node is deleted, it should be deleted from the search index as well.
- The search wipe didn't properly remove links to nodes from the index.
- Section url was faulty in _help.
- Minor code rearrangement.

andremolnar’s picture

Regarding these patches:

Wildcards do not appear to be working (in my installation or on drupal.org)

e.g. search for drup* does not return results

andre

Steven’s picture

The UTF-8 in the file got corrupted, so the U+FFFD replacement character was borked. I fixed it in CVS.

Dries’s picture

Committed to HEAD! Thanks.

adamrice’s picture

Title: Search improvements » Lexical analyzer hooks

This really caught my eye, as I'm working on a bilingual en/ja website, and search is really the Achilles' heel. I've discovered a simple patch to the standard search.module that allows it to search Japanese content, but this would be much better. Any specific tips on hooking in, say, Namazu to handle searching?

It also occurs to me that noise-word handling could be treated as a different kind of lexical analyzer, although I have no idea exactly how that might be coded up. But I can imagine an array of search-augmenting plugins that handle noise words in different languages using the same interface.

TDobes’s picture

Title: Lexical analyzer hooks » Search improvements

The fact that the version of this issue was changed has caused confusion.

adamrice: The 4.5 branch is in a bugfix-only state. All new features and enhancements go on the HEAD branch. Many search improvements have already been committed to the CVS HEAD branch... perhaps you could try it out and see if it fits your needs? It sounds like your questions would be more appropriate for the forum or drupal-support list.

pyromanfo’s picture

Is anybody maintaining a version of this patch for the 4.5.x series? I know it can't go in the main distro but it'd be nice if somebody more familiar with how it works could update it for the latest 4.5.x release. Most of it applies except for a couple of lines in node.module.

pyromanfo’s picture

FileSize
69.78 KB

You know what, I had to get it working anyway.

Here's a patch for this against 4.5.1

Be warned, if any modules implement the _search hook, it's probably wrong (such as the event module).

I just commented it out, does this mean I can't search events? Or do I need to copy the event_search from CVS?

Steven’s picture

Beware, several fixes were applied to node.module and search.module in head to fix bugs and improve this patch further.

pyromanfo’s picture

Could you be a little more specific? I'd gladly add them to the patch.

pyromanfo’s picture

FileSize
71.4 KB

I searched the CVS logs for anything committed to the Drupal repository from when Dries initially checked in your changes until now and manually added them to the code. Some of those were good changes, thanks for warning me.

Here's the updated patch. Seems to work well, let me know if anything important is missing.

km’s picture

Hi pyromanfo, thanks for the patch!

Has to be applied with 'patch -R -p0 < search_4.5.1_0.patch'. After recreating the DB tables all data was lost but search works fine;)

What about a 4.5.2 version?

pyromanfo’s picture

I can't seem to find any confirmation of this anywhere, but from what I can see 4.5.2 already includes the contents of this patch. It's not in the release notes but it seems to have everything this patch does.

Steven’s picture

You seem to be confusing 4.5.2 with the CVS/HEAD version. 4.5.x does not have this patch.

pyromanfo’s picture

I downloaded the 4.5.2 tarball from the download section here and everything from this patch seems to be already in there. The new database structure, changes to node.module, comment.module, search.module and common.inc are all actually in the code. If it wasn't meant to be in there, okay, but the contents of the 4.5.2 search.module is pretty much exactly what the 4.5.1 search.module with this patch looks like.

Steven’s picture

It most definitely does not. The version that I get in the 4.5.2 tarball is (check the top of search.module):
// $Id: search.module,v 1.88.2.2 2005/01/11 04:18:12 unconed Exp $

The one in CVS/HEAD is :
// $Id: search.module,v 1.112 2005/01/15 09:03:39 dries Exp $

The new version starts with:

/**
 * @file
 * Enables site-wide keyword searching.
 */

/**
 * Matches Unicode character classes to exclude from the search index.
 *
 * See: http://www.unicode.org/Public/UNIDATA/UCD.html#General_Category_Values
 *

Note that the 4.5.1 patch posted above seems to be in reverse. So you need to swap - + to go from old to new. The new one has tons of character class definitions near the top like "\x{0}-\x{23}\x{25}-\x{2a}\x{2c}-\x{2f}\x{3a}".

pyromanfo’s picture

FileSize
71.78 KB

Yeah, that would be it. The patch was backwards and I had no idea which version I was talking about.

So here's the 4.5.2 patch for the search improvements.

pyromanfo’s picture

FileSize
70.31 KB

Whoops, uploaded wrong file. This one should work.

Mad Maks’s picture

Today I upgraded a site from 4.5.2 to the CVS version, but now I get an error with the search:

Compilation failed: characters with values > 255 are not yet supported in classes at offset 52 in /var/www/floor/modules/search.module on line 257.

Does anyone know what is causing this?

greetings

MM

Steven’s picture

The search requires unicode support in perl-compatible regular expressions. What version of PHP are you running? The PHP documentation says it should work in 4.1 on Unices and 4.2.3+ on Windows (the "u" modifier for preg_replace), but I get the impression that the true unicode support wasn't included until later as more people have had problems with this.

Mad Maks’s picture

where can i find that? my pages are hosted at www.digitalus.nl.

thanks for the help

MM

Bèr Kessels’s picture

Digitalus is not a good host for Drupal. I hosted with them for a year, but moved because of all sorts of issues, which they refused to fix.

Mad Maks’s picture

Offtopic: what kind of problems? I have had good experiences with them and I was planning to move another site to them as well so I could use Drupal (I can't at that site's present host).

AndriyM’s picture

How can I apply this search patch to my Drupal installation? My current search doesn't even work and this sounds like it's much better than the original search module included in (non-CVS) 4.5.2.

JohnG-1’s picture

Was all this wondrous tweaking included in the standard release 4.6 search.module?
or is (some of) it still a patch?

Steven’s picture

The issue is closed, it has been available in 4.6.0 since day one.