I have finally finished my implementation of Solr's spellchecker component in the Apache Solr module.

More information about the spellchecker component can be found here:
http://wiki.apache.org/solr/SpellCheckComponent

Benefits
Compared to the existing task containing a spellchecker for Drupal 5, this one has some important advantages:
- It does inline spellchecking. That means spellchecking does not require an additional HTTP request; it is added to every search request.
- It can correct phrases! If you search for "ultrasharb knife", every word with a suggestion is replaced, returning "ultrasharp knife".
- Spellchecking is always done. If your nodes contain misspelled words, a spellchecker that only runs on empty results would never fire, because a result is found. This patch also spellchecks when results are found (as Google does).
- Using the json_decode function to unserialize the response data is faster.

Unfortunately, a lot of things have to be changed before you can use the spellchecker.

Because it took me several days to work out all the details, I will write a detailed guide, hoping that it is useful for someone out there. Here we go:

Step 1: Download Solr 1.3 nightly build.

The spellchecker component is part of the upcoming Solr 1.3. This means that you have to install a new Solr instance.

http://people.apache.org/builds/lucene/solr/nightly/

Step 2: Modify the schema.xml

You will have to add all your Drupal fields to the schema. A new field called "spell" is added to the schema.

<!-- This field is used to build the spellchecker index -->
   <field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>

You can fill the field with content from other fields in your index by copying them to the "spell" field:

<copyField source="title" dest="spell"/>

I copy only the title field into the word index for spellchecking. If you want to add all words to the index, you may add the following line:

<copyField source="text" dest="spell"/>

I attached my schema.xml

Step 3: Replace the SolrPHPClient with the new version
The author of SolrPHPClient has released a new version that uses json_decode instead of XML results. This is a performance gain.
https://issues.apache.org/jira/browse/SOLR-341

Unfortunately, the Solr client does not allow search servlets other than "select". Because of this, I had to add the ability to select a different search servlet. The name of the spellchecker component's search servlet is "spellCheckCompRH".

I modified the search function of Service.php to accept different search servlets.

The easiest way is to delete the existing SolrPHPClient and use my attached one. Replacing only Service.php will not work.

	public function search($query, $offset = 0, $limit = 10, $params = array(), $search_servlet = null)
	{ 
		if (!is_array($params))
		{
			$params = array();
		}


		/* removed some lines */

		$queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);
		
		//Enable different search request handlers
		$searchUrl = $search_servlet==null ? $this->_searchUrl : $this->_constructUrl($search_servlet);
		return $this->_sendRawGet($searchUrl . $this->_queryDelimiter . $queryString);
	}

Step 4: Patch apachesolr.module
I have added a new menu for the spellchecker.

Attached you will find a patch for the newest dev version and a standalone module.

Step 5: Patch the apachesolr_search.module
I have written a function that alters the search form and adds the suggestion just below it if there are any suggestions.
The index of search words is rebuilt every 24 hours.

Step 6: Delete and recreate the whole index
You may use the function under Site configuration -> Search settings -> Re-index site. Then rerun the cron job several times until all documents are indexed.

Step 7: Go to Site configuration -> Apache Solr -> Spellchecker and rebuild the spellchecker index
Activate the spellchecker and rebuild the spellchecker index under Site configuration -> Apache Solr -> Spellchecker.

If the schema is correct, the index will be created in the folder /data/spellchecker1/. It should be bigger than 1 KB.

Troubleshooting

1. Check if you have started the new solr instance with solr 1.3

2. Check if there are any errors while starting the solr instance

3. Make sure you updated the whole index of the solr instance

4. Go to /solr/admin and enter a search keyword. Is the field "spell" filled with content?

5. Check that you have rebuilt the spellchecker index. This is a must before first use. Go to the Solr data folder and check the folder /spellchecker1/. Does it contain updated index files?

6. Do a manual check if spellchecks are working:

//Build index
http://:/solr/spellCheckCompRH?q=sharb&spellcheck.q=sharb&spellcheck=true&spellcheck.build=true

//Query
http://:/solr/spellCheckCompRH?q=sharb&spellcheck.q=sharb&spellcheck=true

If you have the word "sharp" in your word index, a suggestion "sharp" for "sharb" should be added to the search results.
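The two manual check URLs above can also be generated from a small helper. This is only a sketch: the function name spellcheck_test_url() is made up for illustration, and http://localhost:8983/solr is an assumed base URL (substitute your own host and port). The handler name and the query parameters are the ones shown in the URLs above.

```php
<?php
// Hypothetical helper (not part of the patch): builds a test URL for the
// spellCheckCompRH handler using the parameters from the manual check above.
function spellcheck_test_url($base_url, $word, $build = FALSE) {
  $params = array(
    'q' => $word,
    'spellcheck.q' => $word,
    'spellcheck' => 'true',
  );
  if ($build) {
    // Ask Solr to (re)build the spellchecker index as part of this request.
    $params['spellcheck.build'] = 'true';
  }
  return rtrim($base_url, '/') . '/spellCheckCompRH?' . http_build_query($params, '', '&');
}

// Assumed base URL; adjust to your own Solr instance.
echo spellcheck_test_url('http://localhost:8983/solr', 'sharb', TRUE), "\n";
echo spellcheck_test_url('http://localhost:8983/solr', 'sharb'), "\n";
```

Run the build URL once first, then the plain query URL.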

If any of these points isn't working, restart from the beginning :-)

It would be very useful if someone could set up a test environment and test this feature. I'm waiting for your feedback.

Comments

janusman’s picture

Sounds great! But I noticed you are only using the title field for building the dictionary. Is there a reason not to use the "text" field (and/or others)?

BTW, am I right that PHP 5.2 is required for the new SolrPhpClient?

brainski’s picture

Hi

I did not read about this requirement. Where was it written?

I have run many tests with the spellchecker and finally came to the conclusion that including only the title field is more accurate than including the whole text.

In my case I have 750'000 nodes, all containing a lot of plain text. Because users typically type a title into the search field, the spellchecker results were much better with titles only. My nodes also contain some misspelled words and different languages. Because of this, I preferred to index only the title field. The title is free of typos in 99.9% of cases. But I described above how to add the whole text field.

At the moment, I only include one spellchecker result. Later it would be possible to show more spellchecker suggestions.

Did you test the whole thing?

BTW: I think solr 1.3 will be released soon.

robertdouglass’s picture

Category: support » feature
brainski’s picture

While testing, it turned out that json_decode generates a slightly different response.

Because of this, facets were not working in my first version. Attached you will find the corrected module. Facets are now working.

brainski’s picture

While testing I found out that the spellchecker is also spellchecking terms like type:page. That's a bug.

I think a regex trick is needed to remove all xx:yy elements from the query string.

//"test type:page" -> "test"
$queryclean = somecleaning($querystring);

Maybe someone who is good at regex can provide a simple solution?

drunken monkey’s picture

Well, a simple regex would be
/\w+:(\([^)]+\)|\w+)/i
But since colons inside quotes should remain (and parentheses can be nested, and...), a completely correct regex isn't possible; this would require a grammar and therefore a complete function dedicated to parsing the query for this syntax. But if Solr_Base_Query::parse_query() functions correctly, this could easily be reduced to just checking each field for this case, which might even be possible with just a regex.
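For illustration, here is a minimal sketch of the simple regex approach applied to the "test type:page" example. The function name strip_field_terms() is hypothetical, and the limitations mentioned above still apply: quoted phrases containing colons and nested parentheses are not handled.

```php
<?php
// Hypothetical helper: removes field:value and field:(...) terms from a
// query string before it is handed to the spellchecker.
// Known limitation: colons inside quotes and nested parentheses are not
// handled; that would need a real query parser.
function strip_field_terms($query) {
  // The simple regex from this comment (case flag dropped, \w already covers both cases).
  $clean = preg_replace('/\w+:(\([^)]+\)|\w+)/', '', $query);
  // Collapse the whitespace left behind by removed terms.
  return trim(preg_replace('/\s+/', ' ', $clean));
}

echo strip_field_terms('test type:page'), "\n";
echo strip_field_terms('ultrasharb type:(page story) knife'), "\n";
```

The cleaned string would then be what gets passed as spellcheck.q, while the original query is still used for the search itself.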

janusman’s picture

Status: Needs review » Closed (duplicate)

Although this seems very nice, I *feel* that hacking away at the SolrPhpClient code apart from using Solr1.3 (which is not really out yet?) adds a touch too much of complexity; I would move for us to work on the simpler solution on #230380: spelling suggestions instead, which relies on Solr1.2 and only slight modification; perhaps that other code can be worked on to provide some of the neat stuff this patch proposes?

What do you think? For now I am marking as duplicate while we come to an agreement (we can revert this).

brainski’s picture

That's sad, but it's your decision. Testing would be great.

Solr was frozen on 15 August. Only the documentation needs to be updated. Once that task is finished, Solr 1.3 will come out.

Please be aware that the spellchecker component also works on Solr 1.2. That means you can also use Solr 1.2 as a base. The configuration effort is just bigger: you have to configure the spellchecker component in the solr.conf.

Changing the SolrPhpClient is needed to use search handlers other than "/solr/select". The author has hardcoded this in the SolrPHPClient. Unfortunately, the only way around this is to modify the SolrPHPClient.

Anyway, it's necessary that the spellchecker becomes a future feature of the Drupal Apache Solr module.

I would like to have the benefits of:

- Inline spellchecking (in the search result): only one request, not one request for the search plus one for every search word.
- Checking phrases, not only a single search word

If you can provide these features with your solution, I agree that we have a duplicate. If not, I do not agree.

janusman’s picture

Thank you for your comments.

BTW this is not my *decision* but rather my *opinion*. My suggestion is to continue this thread over in the (older) issue I mentioned to focus the discussion, and not at all to stop this (on the contrary!) =) I am just tidying up =)

As for the code, I realize these are exciting benefits, however code should be modularized (I know, slowing things a bit) to keep it sustainable; for example, when a new SolrPhpClient comes out, how would we manage this great code you put in? Right now the maintainers of this module have contacted the SolrPhpClient developer(s?) and in the future are going to try to influence that other project to put stuff like this in there.

For now, is it possible to extract the code you put into your modified SolrPhpClient to another place? To one of the existing modules, or a new module inside the contrib/ directory?

brainski’s picture

Category: feature » task

I would like to go further in this issue. I think the spellchecker is a widely needed feature.

Because this feature depends on Solr 1.3, I would like to know if Solr 1.3 will be used in the upcoming version 1.0.

Because Solr 1.3 is already frozen and only the documentation part has to be done, I think we should build the upcoming version 1.0 on Solr 1.3. Otherwise the new version will have to be updated shortly after the release of Solr 1.3.

@janusman
I have no idea how to handle the modified Solr client. I'm not able to commit to Drupal CVS and I have no experience with handling such a case.

The modification of the SolrPhpClient is only 2 lines. I added the ability to select the request handler in the search URL with a parameter.

hardcoded at the moment:
solr/select/?

for spellchecker needed:
solr/spellCheckCompRH

I would like to help you because I have the feeling that the spellchecker feature will get lost if we do not stick together and find a solution that works for now and the future, and also for future module users.

janusman’s picture

Status: Closed (duplicate) » Postponed (maintainer needs more info)

@brainski: Solr1.3.0 support is now go. Since you proposed this for 1.3 and I was working on the same thing for Solr 1.2, do you think you want to still collaborate on bringing this to the current dev version?

If not, and if it's ok, I will continue the great work you started... let us know, ok? Thanks.

brainski’s picture

@janusman

Of course I would like to help! I need this feature too.

An open point for me is still the regex issue: removing field terms from the query string.
//problematic search string: "test type:page" should become "test", because the spellchecker will propose something for "type" and "page"
$queryclean = somecleaning($querystring);

I stopped working on this issue because I had to modify the Solr PHP client. As I described, a default request handler is hardcoded in the Solr PHP client.

hardcoded at the moment:
solr/select/?

for spellchecker needed:
solr/spellCheckCompRH

I would like to use inline spellchecking.

I studied your code and saw that you do an additional server call for every word. This is no longer necessary. You can use the inline results.

Please contact me to discuss these points.

robertdouglass’s picture

Please feel free to propose changes to the core query (for example, a parameter that sets the handler) if it makes it easier to elegantly implement these features.

brainski’s picture

ok here is my proposed change to the SolrPhpClient:

Service.php

The function search() has the ability to set a different search_servlet.

This is only the patch for the SolrPhpClient.

Later I will post an implementation of the search_servlet "solr/spellCheckCompRH"

*edit: replaced file, had a typo*

brainski’s picture

Title: Implementation of solr spellchecker / "Did you mean" functionality! (looking for testers) » Implementation of solr spellchecker: New Version for Dev Version
Attachment (new): 13.65 KB

Hi there!

I have finished the new version of solr spellchecker. I have rewritten some parts and fixed all known bugs.

To use this feature, you have to:

- install Solr 1.3
- apply the patch
- copy schema.xml to the Solr /conf dir
- restart Solr
- rebuild the index
- enable the spellchecker and spellchecker index under admin -> Apache Solr -> Spellchecker
- rebuild the spellchecker index once

I hope someone can test this feature! I think it's no longer alpha; it's in beta status.

brainski’s picture

Status: Postponed (maintainer needs more info) » Needs review
pwolanin’s picture

This looks like a very nice feature, but will need to be reworked a bit because I'm about to put a patch in to use the dismax handler which changes a lot of code.

I also see some style/architecture nits - such as using a global var to pass around the suggestions.

JacobSingh’s picture

I don't think you need a new request handler, you just need to add spell checking capabilities to the standard handler.

pwolanin’s picture

Status: Needs review » Needs work

We should not alter Service.php.

@Jacob- the standard solrconfig.xml already defines the spellcheck handler:

  <!-- a request handler utilizing the spellcheck component -->
  <requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
JacobSingh’s picture

Yes, but you're not meant to use it.

It is an example. You can copy the properties inside of it, and add it to your standard search handler if you want spell checking.

From http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solr...

<!-- A request handler utilizing the spellcheck component.  
  ################################################################################################
  NOTE: This is purely as an example.  The whole purpose of the SpellCheckComponent is to hook it into
  the request handler that handles (i.e. the standard or dismax SearchHandler)
  queries such that a separate request is not needed to get suggestions.

  IN OTHER WORDS, THERE IS REALLY GOOD CHANCE THE SETUP BELOW IS NOT WHAT YOU WANT FOR YOUR PRODUCTION SYSTEM!
  ################################################################################################
  -->
pwolanin’s picture

Also, there are whitespace issues (tabs vs. spaces, I think) in the patch.

Looking here: http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solr...

the HEAD version of this has more useful comments in solrconfig.xml:

  <!-- A request handler utilizing the spellcheck component.  
  ################################################################################################
  NOTE: This is purely as an example.  The whole purpose of the SpellCheckComponent is to hook it into
  the request handler that handles (i.e. the standard or dismax SearchHandler)
  queries such that a separate request is not needed to get suggestions.

  IN OTHER WORDS, THERE IS REALLY GOOD CHANCE THE SETUP BELOW IS NOT WHAT YOU WANT FOR YOUR PRODUCTION SYSTEM!
  ################################################################################################
  -->
  <requestHandler name="/spellCheckCompRH" class="solr.SearchHandler">
    <lst name="defaults">
      <!-- omp = Only More Popular -->
      <str name="spellcheck.onlyMorePopular">false</str>
      <!-- exr = Extended Results -->
      <str name="spellcheck.extendedResults">false</str>
      <!--  The number of suggestions to return -->
      <str name="spellcheck.count">1</str>
    </lst>
    <arr name="last-components">
      <str>spellcheck</str>
    </arr>
  </requestHandler>

I think this is what Jacob is referring to - we should add this component to our usual request handler, not query it separately.

pwolanin’s picture

Attachment (new): 1.17 KB

i.e. like this.

pwolanin’s picture

whoops - cross-posted above.

pwolanin’s picture

Attachment (new): 4.54 KB
pwolanin’s picture

Status: Needs work » Needs review
Attachment (new): 7.57 KB

here's a patch w/ no UI - enable spelling via $conf['apachesolr_search_spellcheck'] = TRUE; in settings.php. You may also have to issue a query directly to solr to build the spelling index.

pwolanin’s picture

Attachment (new): 7.51 KB

a little cleaner code and comments.

pwolanin’s picture

Attachment (new): 9.21 KB

now w/ a little UI to enable/disable spellcheck.

pwolanin’s picture

Status: Needs review » Needs work

actually - I was trying to be too clever here. The substring replacement will fail (wrong offset) if there is more than 1 replacement and any but the last have a length different from the misspelled word
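One common way around this offset problem is to apply the replacements from the end of the string toward the start, so earlier offsets are never shifted. The sketch below is not the committed code. It assumes each suggestion has already been flattened into an array with 'startOffset', 'endOffset' and 'suggestion' keys (named after the fields Solr returns per misspelled word), and the function name apply_suggestions() is made up. It also assumes single-byte (ASCII) text, since substr_replace() works on bytes.

```php
<?php
// Hypothetical helper: applies spelling replacements back to front so that
// replacements of a different length do not invalidate the other offsets.
function apply_suggestions($query, array $suggestions) {
  // Sort by start offset, highest first.
  usort($suggestions, function ($a, $b) {
    return $b['startOffset'] - $a['startOffset'];
  });
  foreach ($suggestions as $s) {
    $length = $s['endOffset'] - $s['startOffset'];
    $query = substr_replace($query, $s['suggestion'], $s['startOffset'], $length);
  }
  return $query;
}

// Assumed, already-flattened suggestion data for the query "ultrsharb knive".
$suggestions = array(
  array('startOffset' => 0, 'endOffset' => 9, 'suggestion' => 'ultrasharp'),
  array('startOffset' => 10, 'endOffset' => 15, 'suggestion' => 'knife'),
);
echo apply_suggestions('ultrsharb knive', $suggestions), "\n";
```

Here the first replacement is one character longer than the misspelled word, which is exactly the case that breaks a front-to-back replacement.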

pwolanin’s picture

Status: Needs work » Needs review
Attachment (new): 12.12 KB
pwolanin’s picture

Attachment (new): 12.08 KB

don't need 'all' params in the response except for debugging.

pwolanin’s picture

Attachment (new): 12.34 KB

need to move the build directive to make it work.

pwolanin’s picture

Status: Needs review » Patch (to be ported)

committed #31 to 6.x

robertdouglass’s picture

Status: Patch (to be ported) » Needs work

Nice. Two things.

1: Can we change

Did you mean:  food

to

Did you mean <em>food</em>?

2: Can we put the spelling suggestion in the current search block as well?

janusman’s picture

Indexing documents from an empty index is now über-slow.

The solr log shows that the spell index is being built continuously (for every document, it seems!)

I would suggest that spell index rebuilding be called from cron; also perhaps let admins rebuild the spell index manually (and advise them to do so in the Drupal status report?)

Will work on a patch for this.

pwolanin’s picture

@Janusman - yeah, it's set to rebuild on each commit now. Unfortunately, the option to build on optimize is not going to be available until Solr 1.4

An earlier patch above had a rebuild-on-cron option. Alternatively, you can omit the rebuild directive from your solrconfig.xml for now.

janusman’s picture

Attachment (new): 6.37 KB

Included is a patch against a fresh checkout of DRUPAL-6--1 to address some of the previous concerns:

  • reverts rebuilding the spellchecker index to cron runs (once daily) only, and also adds a local menu task (tab) to manually rebuild by the admin.
  • adds the "Did you mean XXXX?" line to the Current Search block, like @robertDouglass requested. However I did not actually choose to do hard markup () to what @pwolanin already added, because it can be styled via CSS.
janusman’s picture

Status: Needs work » Needs review

Marking as Code needs review.

pwolanin’s picture

Status: Needs review » Needs work

It would be better to not have the rebuild be a GET request - it should happen on form (or confirm form) submission.

brainski’s picture

Thanks to all for working on this feature! It looks like this feature is almost complete!

robertdouglass’s picture

I would still include a question mark and get rid of the : in the did you mean sentence. You should also put the spelling suggestion inside of the t() function. Here's what that line should look like, in my opinion:


$output .= '<div class="spelling-suggestions">' . t('Did you mean %suggestion_link?', array('%suggestion_link' => $suggestions_search_link)) . '</div>';

Also note that I've generally adopted the D7 coding convention of surrounding the . operator with one space on each side.

janusman’s picture

Title: Implementation of solr spellchecker: New Version for Dev Version » Implementation of solr spellchecker

@robertDouglass: Re: #40; totally right, will fix.

I've been thinking a bit about the UX side of things...

I think the correct thing is not to put this in the "Your Current Search" block; my reasoning is that a suggestion is not "Your Current Search" but actually a "future" search that (I think) belongs with the facets (they also represent "future" searches, stemming from the current one).

I propose this:

  • When a search returns 0 results, the suggestion should be prominent, along/instead of the "Your search returned no results" message returned by search.module. Similar to this: http://biblioteca.mty.itesm.mx/pasteur/en/search/apachesolr_search/ererwe (an earlier version of this patch for D5).
  • When a search returns >0 results, also provide a block with "just" the suggestion, so that site admins can choose to place it wherever (with the facets, for example?). Perhaps this could then (later on) mutate into not just spellchecking but also recommending additional (popular? complementary?) searches.

Any thoughts? Will wait a few and if no comments then I'll roll a patch and see if y'all like it =)

robertdouglass’s picture

@janusman: I still like the idea of putting the spelling suggestion in current search for zero results. I dislike the idea of another block. We're drowning in blocks. It makes one more step that you have to administer and configure, whereas putting it in current search "just works".

anarchivist’s picture

Subscribing.

pwolanin’s picture

Since we are moving to 1.4, we should probably take advantage of the on-optimize building of the search index instead of doing it on cron, since I added a cron hook to optimize once per day.

andreiashu’s picture

subscribing

pwolanin’s picture

opened a separate issue for building on optimize: #375991: Use 1.4 feature - generate spelling index on optimize

janusman’s picture

Status: Needs work » Needs review
Attachment (new): 3.43 KB

New patch rolled for current 6.x-dev, following up on comment #42.

pwolanin’s picture

Status: Needs review » Needs work

"in current search for zero results"

I don't see this check.

janusman’s picture

Attachment (new): 42.02 KB

Could you elaborate? @robertDouglass 's comment was "I still like the idea of putting the spelling suggestion in current search for zero results", and that's the way it is.

See attached screenshot, perhaps we can discuss over it.

pwolanin’s picture

@Janusman - it looks in your patch like the suggestions will always be in the current search block. My understanding of Robert's suggestion was to only display them in the block if there are zero results.

JacobSingh’s picture

It seems to me we want to use spellcheck.onlyMorePopular

And I think we should use a threshold for showing. Only showing with zero results doesn't always work great. For instance, search for druppal on this site :)

spellcheck.extendedResults will also give us a frequency and the origFreq so

search for druppal

origFreq = 20 (if 20 nodes use the word druppal)
and
freq (for Drupal) = 100k or something

which would give us a good indicator of an error. However, this may incur a performance hit (I dunno), and I'm not certain how you know that 100k is a big # and 20 is a small one. What if the index is 50,000,000. In other words, if I have a lot of documents, then the difference is not as great.

drunken monkey’s picture

and I'm not certain how you know that 100k is a big # and 20 is a small one. What if the index is 50,000,000. In other words, if I have a lot of documents, then the difference is not as great.

I think the size of the index hasn't got much to do with it - a search with 100k results is in any case much more popular / likely to be correct than one with 20. I think just using the origFreq/freq ratio and maybe a threshold (e.g., no suggestions for origFreq > 1k) should work.
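As a rough sketch of that ratio-plus-threshold idea: the function name should_show_suggestion() and the default minimum ratio of 10 are assumptions for illustration only. The 1k cap on origFreq is the example value from this comment, and the two frequencies are assumed to come from spellcheck.extendedResults.

```php
<?php
// Hypothetical decision helper: show a suggestion only when the suggested
// word is much more frequent than the word the user typed.
function should_show_suggestion($orig_freq, $suggestion_freq, $max_orig_freq = 1000, $min_ratio = 10.0) {
  if ($orig_freq == 0) {
    // The typed word is not in the index at all: any real suggestion helps.
    return $suggestion_freq > 0;
  }
  if ($orig_freq > $max_orig_freq) {
    // The typed word is common enough that it is probably intended.
    return FALSE;
  }
  // Assumed cut-off: the suggestion must be at least $min_ratio times as frequent.
  return ($suggestion_freq / $orig_freq) >= $min_ratio;
}

var_dump(should_show_suggestion(20, 100000));
var_dump(should_show_suggestion(2000, 100000));
```

With the numbers from the druppal example (origFreq 20, freq 100k) the suggestion would be shown; both cut-offs would need tuning against a real index.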

But the performance penalty sure is something to keep in mind and test before implementing.

pwolanin’s picture

@Jacob - I think the idea was to show it in the current search result when there are zero hits, i.e. as an extra hint that there might be an error.

Probably onlyMorePopular is a good idea too - otherwise we are likely to be returning as many spelling errors as corrections.

janusman’s picture

Status: Needs work » Closed (fixed)

This feature is already implemented. Closing out.