This patch adds spelling suggestions to the page that is returned if no search results are found.
Spelling suggestions

This is done by utilizing the Levenshtein algorithm for calculating the nearness of words. The process for accomplishing spelling suggestions is:

  1. Create a new database table, search_spellings. It has two columns, word and characters. word is a unique word found in the search dataset (ie not subject to processing like stemming). characters is the character count of the word. characters is used to restrict the set of words against which a spelling suggestion is sought, following the logic that a 3 letter word can't be a misspelling of a 10 letter word.
  2. On indexing, the words in the search_dataset are added to search_spellings making a globally unique list of words in the search index. A shutdown function is used for this purpose to avoid duplicate database queries.
  3. A new $op for hook_search has been added, 'no results'. This gives modules a chance to set their own message when no results are found. If the $op isn't implemented the famous blue smurfs message is seen. node_search now uses the 'no results' op to look for spelling suggestions. It utilizes the new search API function search_spelling_suggestion($word), a function which will return the top spelling suggestion and its score.
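The lookup described in steps 1–3 can be sketched as follows (a Python sketch of the idea, not the patch's actual PHP; `search_spelling_suggestion` and the length restriction come from the patch, the other names are illustrative):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def spelling_suggestion(word, index_words, max_length_delta=3):
    """Mimic search_spelling_suggestion(): return the closest indexed
    word and its distance. Only candidates whose length is within
    max_length_delta of the query length are considered, which is what
    the characters column in search_spellings is for."""
    candidates = [w for w in index_words
                  if abs(len(w) - len(word)) < max_length_delta]
    if not candidates:
        return None, None
    best = min(candidates, key=lambda w: levenshtein(word, w))
    return best, levenshtein(word, best)
```

For example, `spelling_suggestion("drupel", ["drupal", "search", "module"])` would suggest `"drupal"` with distance 1.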

Please note that the spelling suggestions don't in any way guarantee that the suggested words are spelled correctly, but rather that if you type in a near miss, the system will suggest an alternate spelling based on what is in the index.

I didn't try to implement spelling suggestions on pages that have search results (a la Google), but that could be done later.


Comments

robertDouglass’s picture

FileSize
658 bytes

CHANGELOG text.

robertDouglass’s picture

Note: the patch's text differs from what is shown in the screenshot:

'Your search - @search - did not match any documents.'

I stole this from Google and think that it is better wording than "results". Oh, yeah, and it is a usability feature to show the @search bit in the message.

Susurrus’s picture

Should this only be displayed on the no search results page? I know Google shows it even when results are found. I think it should only be displayed on the no search results page, but maybe other people have other ideas.

Also, big +1 to this issue. Searching would be much improved by this patch. I'll try to test it this weekend.

BlakeLucchesi’s picture

Haven't had the opportunity to install and check the results, but it appears that there is no check to make sure the search_spellings table doesn't contain the same word twice. It looks like the script deduplicates during indexing, for the nodes indexed in a particular cron run, by using a keyed array, but not across all nodes in the index. Is this a deliberate trade-off for faster indexing or an oversight?

Great work as always, I'll give more feedback when I get a chance to run some actual tests.

Blake

robertDouglass’s picture

@Blake: word is the primary key of the table guaranteeing each word is unique.

Susurrus’s picture

@robertDouglass: But if you try to insert a new row with the same primary key, you will trigger a MySQL FAIL...

robertDouglass’s picture

@Susurrus: this is a common pattern in Drupal and I don't think it's a cause for worry. Eventually, most of the inserts will fail because the word already exists. This is (I think) more efficient than first selecting from the db to check for the existence of a word and then inserting. Here's an example from current Drupal code that is similar:

  db_query("UPDATE {variable} SET value = '%s' WHERE name = '%s'", $serialized_value, $name);
  if (!db_affected_rows()) {
    @db_query("INSERT INTO {variable} (name, value) VALUES ('%s', '%s')", $name, $serialized_value);
  }
killes@www.drop.org’s picture

Interesting idea; however, I wonder what the results will be like for languages other than English, or languages which don't use an alphabet at all.

Freso’s picture

I think we should use "content" instead of documents. Taking drupal.org as a use case, we can search on "Content", "IRC nicks", "Users", and "Issues". So perhaps "content" isn't the best replacement either. :)

How about "Your search - @search - did not find anything."? (I'm not completely satisfied, but it's better than insisting they can only find documents with the search...)

robertDouglass’s picture

@killes: I tested with German and it was pretty successful. Köln, coeln, and koeln will find each other, for example. No idea how well Asian languages will fare, and I don't know how to test that, either.

catch’s picture

Googling suggests that Levenshtein works with katakana/hiragana - i.e. this PDF. And this page suggests it might work for Chinese and Japanese in general. I might be able to get this tested in Japanese at some point if that'd help - have a feeling the character count aspect might not work going by the above two articles.

edit: what about phrase matching?

robertDouglass’s picture

I agree that the character count will probably not bring a performance boost for languages where characters can be whole words. I don't think it will hurt anything because the range of lengths is pretty big.

One thing that I should change is the hardcoded 3 in the character range query; that should come from a configurable variable. The patch needs a reroll for this aspect.
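The range query being discussed could be parameterized like this (Python sketch; `minimum_word_size` matches the Drupal variable used in the patch, `max_delta` stands in for the hardcoded 3):

```python
def candidate_length_range(word_length: int,
                           minimum_word_size: int = 3,
                           max_delta: int = 3):
    """Exclusive bounds for the candidate-word length window, mirroring
    the SQL clause: WHERE characters > low AND characters < high.
    The lower bound never dips below the indexer's minimum word size."""
    low = max(minimum_word_size, word_length - max_delta)
    high = word_length + max_delta
    return low, high
```

So a 10-letter query word would only be compared against indexed words of 8 to 12 characters (exclusive bounds 7 and 13), and very short queries are clamped at the minimum word size.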

BlakeLucchesi’s picture

Regarding the number being set for the edit distance, I tried varying the distance based on word length in the fuzzy search module. Words under 5 characters had an allowable distance of 2, and longer words had 3, or something similar to that. My reasoning was that someone spelling a long word, especially phonetically, may be off by more than 3 letters. This really gets into linguistic statistics, but it's at least something to consider. Perhaps just letting the admin specify an edit distance would be the best choice.
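A length-aware threshold like the one described could look like this (Python sketch; the cutoff values mirror the fuzzy search module description above and are illustrative, not from the patch):

```python
def allowed_edit_distance(word_length: int) -> int:
    """Allow a larger Levenshtein distance for longer words, as the
    fuzzy search module reportedly did. Cutoffs are illustrative:
    short words tolerate 2 edits, longer words tolerate 3."""
    return 2 if word_length < 5 else 3
```

An admin-configurable distance, as suggested, would simply replace the constants here with variable lookups.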

robertDouglass’s picture

It's hard to really know how well things are working in other languages, but I copy/pasted some texts from Arabic, Japanese, and German into the same test site, and then picked other words from the same sites until I found some that didn't return results. The sample set was very small... only a couple hundred words in each, so a bigger site will have better results. The encouraging thing is that the spelling suggestions never suggested the wrong language. I'm sure that if I had made an English+Dutch+German site it would have been a different story, and we may eventually add language features to this, but for now I see the potential for this patch doing good outweighing any of its possible shortcomings.

(screenshots attached: German, Arabic, and Japanese test results)

It's hard to know, but the Japanese example doesn't look anywhere remotely related to the search term.

robertDouglass’s picture

FileSize
8.24 KB

This patch adopts the "did not find anything" wording suggested, uses variable_get to fetch the minimum word length, and cleans up the code comments in the search_spelling_suggestion function.

moshe weitzman’s picture

nice feature - subscribe

Rowanw’s picture

After applying the patch, when I attempt to run the DB upgrade script I get a blank page and the upgrade seems to fail. I don't see any errors in my Apache error log. I'm not sure if it's due to this patch or not.

robertDouglass’s picture

Status: Needs review » Needs work

It has been pointed out to me:

The patch completely ignores unicode and relies on the byte-based levenshtein() function. The number of edits between words depends on the number of bytes used to encode each character.

This patch would also seem to break every single stemming module out there (or at least make it spew broken-looking suggestions).

Then the method of generating the suggested search keys involves repeated str_replace() calls over the entire search string, instead of only replacing the keyword in question, making the chance of collisions high. This code also ignores negative and phrase operators.
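The byte vs. character point is easy to demonstrate (a Python sketch, not from the patch): PHP's built-in levenshtein() walks bytes, so a single edit involving a multibyte character is counted as two or more edits.

```python
def levenshtein(a, b):
    """Standard DP edit distance; works on any sequence, so it can
    compare either characters (str) or raw UTF-8 bytes."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]

# Character-based: Köln -> Koln is one substitution.
print(levenshtein("Köln", "Koln"))                                   # 1
# Byte-based (what PHP's levenshtein() sees): "ö" is two UTF-8 bytes,
# so the same edit costs 2.
print(levenshtein("Köln".encode("utf-8"), "Koln".encode("utf-8")))   # 2
```

Scores therefore come out inflated for any non-ASCII text, which skews which suggestion wins.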

Dries’s picture

Robert, does the patch break any of the search module tests? If not, step one would be to extend the search module tests. :)

robertDouglass’s picture

I've ported the levenshtein function to be unicode aware by using drupal_substr and drupal_strlen. This is not a major concern, and my function passes all of the tests included with the PHP source code for the levenshtein function. Those tests don't cover Asian characters at all, and in fact the algorithm doesn't work well with South Asian languages to begin with. Thus, even though the new function correctly handles unicode, spelling suggestion for South Asian languages is largely an unsolved problem in computer science, and I'm not going to try to solve it here.

As for the stemming problem, this is a serious drawback that cannot be overcome easily. If stemming modules are enabled, the spelling suggestions would be in the form of "simpli" for "simply" and so forth. In other words, if I rewrite the patch to process search keywords using the search_simplify() function (where the preprocess hook is invoked), then the words will match what is found in the spelling suggestions table. What we will eventually need to do is refine the indexing workflow so that there are more points where you can hook into the process. I imagine a tokenizer->indexer system like lucene's where tokenizers and indexers are pluggable so the glue in between can be modified. It is in this glue that the dictionary for spelling can best be built. We'll need to store the original words... not the processed stemmed words.

The comment about str_replace() and collisions I don't understand. The code in the 'no results' part of node_search() provides for suggesting alternates for one or more misspelled words in a search query, and as far as I can see, this works fine. I'd need an example of said collision to know what is meant.
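To illustrate the kind of collision presumably meant: replacing a keyword as a raw substring of the whole query string can rewrite other words that happen to contain it (Python sketch; the words are illustrative):

```python
import re

query = "cat catalog"

# Naive whole-string replacement (the str_replace() approach):
# correcting the keyword "cat" also corrupts "catalog".
naive = query.replace("cat", "cart")
print(naive)   # cart cartalog

# Replacing only on word boundaries avoids the collision.
safe = re.sub(r"\bcat\b", "cart", query)
print(safe)    # cart catalog
```

Replacing each parsed keyword token individually, rather than doing string surgery on the full query, would sidestep this entirely.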

The point about negative operators and phrases is true, and hard to solve. The code for recognizing these aspects in search.module is not reusable. At this point we either need to refactor the query building code into something more modular and reusable, or copy and paste code from the search_parse_query function (not advisable). So to do spelling suggestions right at this time is somewhat blocked by some inflexibility in the search framework itself.

robertDouglass’s picture

FileSize
11.32 KB

This patch adds drupal_levenshtein() to the search.module, though it might seek a different home. Also fixed is a bug that didn't clear the spellings table upon reindexing. This patch attempts to keep the original words (as they are before the preprocess hook or search_simplify is called) so that we can work around the stemming limitation, but the implementation isn't finished. I'm uploading the patch mostly so that the levenshtein function meets the bus test (i.e. if I walk in front of a bus, you've still got the code ;-)

dami’s picture

subscribe

dami’s picture

My comments after some quick testing:

1. If more than one spelling has the same lowest score, do you just pick the last one compared?
How about giving two or three suggestions instead of one?

2. I did some testing with Chinese searching. It works to some extent, but overall not very well. I am not an expert on search/indexing algorithms, but this is probably not directly related to the Levenshtein algorithm: Drupal's built-in search index doesn't work efficiently for Chinese, and it ends up with lots of nonsense words, which makes the search_index table quite huge. We have an excellent Chinese word splitter module developed by zealy. I think it would be prudent to pull him into the discussion.

One other item I can think of would be to exclude UTF-8 punctuation from the Levenshtein comparison (or is that already taken care of?). Again, the csplitter module mentioned above should help in this regard.

3. Still on Chinese searching, search.module line 1384

  $result = db_query("SELECT word FROM {search_spellings} WHERE characters > %d AND characters < %d", max(variable_get('minimum_word_size', 3), $length - 3), $length + 3);

The range has to be increased from 3 to 4 for it to work for Chinese words, because strlen($onechineseword) = 3.

4. One question on the search_spellings table: is $spelling->word supposed to be the same as the words in the search_index table?
The way $spellings are concatenated doesn't seem to work for languages that don't use spaces to separate words.

        list($words, $original_words) = search_index_split($value);
        $spellings += $original_words;

5. Minor: system_update_7006 needs to be 7007 now :)
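Re point 3 above: the factor of 3 comes from UTF-8 encoding width (a quick Python illustration, not from the patch). Each CJK character occupies 3 bytes, so PHP's byte-based strlen() reports three times the character count, and a byte-based length window has to widen accordingly.

```python
word = "中文"  # a two-character Chinese word

# Character count vs. the byte count PHP's strlen() would report:
print(len(word))                   # 2
print(len(word.encode("utf-8")))   # 6 -- each CJK character is 3 bytes
```

This is the same mismatch that makes the hardcoded ±3 window effectively cover only a one-character difference for Chinese words.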

douggreen’s picture

LIVE FROM THE MINNESOTA SEARCH SPRINT, the help changes in this patch are now part of #256678

appel’s picture

Subscribing.

Dinis’s picture

Are there any plans to port this to D6?

Bartezz’s picture

Or D5 :)

Fixdit’s picture

5 indeed! Any plans to?

Fixdit’s picture

How complex would the port to 5.x be? Incredibly keen to see this implemented.

Dinis’s picture

I've started using the SOLR search engine which already has this (excellent) feature :)

jhodgdon’s picture

Version: 7.x-dev » 8.x-dev

Excellent idea, hope it gets into D8, too late for D7 now...
Regarding the comment above: new features go only into the in-development version. At this point, Drupal 7 is feature-frozen, so unless you can convince someone to build this as a contributed module, you are out of luck for D5, D6, D7, etc.

jhodgdon’s picture

Title: Add spelling suggestions to the "no search results found" page. » Add spelling suggestions to the search results page.

Oh, and I think it should be on every search results page, not just when there are no results.

mikejonesok’s picture

Sub...hoping for 6.x port.

jhodgdon’s picture

That is unfortunately unlikely.

Feature requests are only considered during the early stages of developing a new Drupal major version. Drupal 7 is past that deadline, and new features are almost never back-ported to earlier versions.

If someone wants to build this as a contributed D6/D7 module, though, that would make getting it into Drupal core even easier for Drupal 8, and then even if it isn't included in Drupal 8 core, the contributed module would be available.

Daniel Wentsch’s picture

Subscribing and hoping for d6 module with this functionality :)

SchwebDesign’s picture

subscribing as well hoping there will be a Drupal 6 module. thanks for your great work everyone.

Bartezz’s picture

Hi SchwebDesign,

I've solved this by using Lucene API module and Lucene Did you mean, works pretty good!

Cheers

SchwebDesign’s picture

Thanks for letting me know. I'll check it out! I also tried out the fuzzysearch module, which returns results even when misspellings occurred.

dandaman’s picture

I was trying to see if I could get this to work on my Drupal 6 site. (If I did get it working, I was hoping to make it into a stand-alone module.)

The Drupal 6 site took the patch pretty well, although the line numbers didn't match up exactly. I added the search_spellings table, but the code that inserts data into the table (the search_spelling_shutdown() function) never actually seems to run. I was testing it on a site with a very large amount of content; maybe it doesn't work that well with big sites yet?

dandaman’s picture

Dangit! After a day of looking at it and right after posting the above comment, I realized I was looking at the wrong database. Looks like it really is working after all.

jhodgdon’s picture

Status: Needs work » Closed (won't fix)

This sounds like a good thing to do in a contributed module. I don't think it is a good thing to add to Drupal Core's search module, which just has a very basic implementation of search but allows contributed modules and themes to influence how it works. Sorry...