I have finally finished my implementation of the Solr spellchecker component in the Apache Solr module.
More information about the spellchecker component can be found here:
http://wiki.apache.org/solr/SpellCheckComponent
Benefits
Compared to the existing task containing a spellchecker for Drupal 5, this one has some important advantages:
- It does inline spellchecking. That means the spellchecking does not require an additional HTTP request; it is added to every search request.
- It can correct phrases! If you search for "ultrasharb knife", every word with suggestions will be replaced, returning "ultrasharp knife".
- Spellchecking is always done. If you have misspelled words in your nodes, a spellcheck that only runs on empty result sets would never occur, because the misspelled query still finds a result. This patch also runs the spellcheck when results are found (as Google does too).
- Using the json_decode function for unserializing response data is faster.
Unfortunately, a lot of things have to be changed before you can use the spellchecker.
Because it took me several days to figure out all the details, I will write a detailed guide, hoping that it is useful for someone out there. Here we go:
Step 1: Download Solr 1.3 nightly build.
The spellchecker component is part of the upcoming Solr 1.3 release. This means that you have to install a new Solr instance.
http://people.apache.org/builds/lucene/solr/nightly/
Step 2: Modify the schema.xml
You will have to add all your Drupal fields to the schema. In addition, a new field called "spell" is added to the schema.
<!-- This field is used to build the spellchecker index -->
<field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>
You can fill the field with content from other fields in your index by copying them to the "spell" field:
<copyField source="title" dest="spell"/>
I copy only the title field into the word index for spellchecking. If you want to add all words to the index, you may add the following line:
<copyField source="text" dest="spell"/>
I attached my schema.xml
Step 3: Replace the SolrPHPClient with the new version
The author of SolrPHPClient has released a new SolrPHPClient that decodes JSON results with json_decode instead of parsing XML results. This is a performance gain.
https://issues.apache.org/jira/browse/SOLR-341
Unfortunately, the Solr client does not allow search servlets other than "select". Because of this, I had to add the possibility of selecting a different search servlet. The name of the spellchecker component search servlet is "spellCheckCompRH".
I modified the search function of Service.php to accept different search servlets.
The easiest way is to delete the existing SolrPHPClient and use my attached one. Replacing only Service.php will not work.
public function search($query, $offset = 0, $limit = 10, $params = array(), $search_servlet = null)
{
    if (!is_array($params))
    {
        $params = array();
    }
    /* removed some lines
     *
     */
    $queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);
    // Enable different search request handlers
    $searchUrl = $search_servlet == null ? $this->_searchUrl : $this->_constructUrl($search_servlet);
    return $this->_sendRawGet($searchUrl . $this->_queryDelimiter . $queryString);
}
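To illustrate the idea outside of the SolrPHPClient class, here is a minimal standalone sketch of building a search URL for an arbitrary request handler. The function name build_search_url() and the host/port in the usage note are assumptions made for illustration; they are not part of SolrPHPClient or of the attached patch.

```php
<?php
/**
 * Hypothetical helper (not part of SolrPHPClient): builds a Solr search URL
 * for a given request handler such as "select" or "spellCheckCompRH".
 */
function build_search_url($baseUrl, $handler, array $params)
{
    // Encode the parameters; array values become fq%5B0%5D=..., fq%5B1%5D=...
    $queryString = http_build_query($params, '', '&');

    // Solr expects repeated parameters (fq=a&fq=b), so strip the encoded
    // numeric array indices, as the search() function above does.
    $queryString = preg_replace('/%5B(?:[0-9]|[1-9][0-9]+)%5D=/', '=', $queryString);

    return rtrim($baseUrl, '/') . '/' . trim($handler, '/') . '?' . $queryString;
}
```

For example, build_search_url('http://localhost:8983/solr', 'spellCheckCompRH', array('q' => 'sharb', 'spellcheck' => 'true')) would produce a URL that targets the spellchecker handler instead of the hardcoded "select" (localhost:8983 is only a placeholder).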
Step 4: Patch apachesolr.module
I have added a new menu for the spellchecker.
Attached you will find a patch for the newest dev version and a standalone module.
Step 5: Patch the apachesolr_search.module
I have written a function that alters the search form and adds the suggestion just below it if there are any suggestions.
The index of search words is rebuilt every 24 hours.
Step 6: Delete and recreate the whole index
You may use the function under Site configuration -> Search settings ->Re-index site. Then rerun the cron job several times until all documents are indexed.
Step 7: Go to Site configuration -> Apache Solr -> Spellchecker and rebuild the spellchecker index
Once all documents have been re-indexed (Step 6), activate the spellchecker and rebuild the spellchecker index under Site configuration -> Apache Solr -> Spellchecker.
If the schema is correct, the index will be created in the folder /data/spellchecker1/. It should be bigger than 1 KB.
Troubleshooting
1. Check if you have started the new solr instance with solr 1.3
2. Check if there are any errors while starting the solr instance
3. Make sure you updated the whole index of the solr instance
4. Go to /solr/admin and enter a search keyword. Is the field "spell" filled with content?
5. Check if you have rebuilt the spellchecker index. This is a must before first use. Go to the Solr data folder and check the folder /spellchecker1/. Does it contain updated index files?
6. Do a manual check if spellchecks are working:
//Build index
http://:/solr/spellCheckCompRH?q=sharb&spellcheck.q=sharb&spellcheck=true&spellcheck.build=true
//Query
http://:/solr/spellCheckCompRH?q=sharb&spellcheck.q=sharb&spellcheck=true
If you have the word "sharp" in your word index, a suggestion "sharp" for "sharb" should be added to the search results.
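If you want to check the response in code rather than by eye, the following is a minimal sketch of reading the suggestions out of a JSON response. It assumes the response was requested with wt=json and that the "suggestions" entry is a flat list alternating between the misspelled word and its details; that shape, and the function name extract_spelling_suggestions(), are assumptions for illustration and should be verified against your own Solr output.

```php
<?php
/**
 * Hypothetical helper: maps each misspelled word to its first suggestion.
 * Assumes a flat list such as
 *   ["sharb", {"numFound":1, "startOffset":0, "endOffset":5, "suggestion":["sharp"]}]
 * under spellcheck -> suggestions.
 */
function extract_spelling_suggestions($json)
{
    $data = json_decode($json, true);
    $result = array();

    if (!isset($data['spellcheck']['suggestions']) || !is_array($data['spellcheck']['suggestions'])) {
        return $result;
    }

    $flat = $data['spellcheck']['suggestions'];
    // Walk the list in (word, details) pairs.
    for ($i = 0; $i + 1 < count($flat); $i += 2) {
        $word = $flat[$i];
        $info = $flat[$i + 1];
        // Skip entries whose value is not a details array with suggestions.
        if (is_string($word) && is_array($info) && !empty($info['suggestion'])) {
            $result[$word] = $info['suggestion'][0];
        }
    }
    return $result;
}
```

An empty result means either that the word was spelled correctly or that the spellchecker index has not been built yet, so it is worth re-running the "Build index" request above before concluding that something is broken.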
If any of these points isn't working, restart from the beginning :-)
It would be very useful if someone could set up a test environment and test this feature. I'm waiting for your feedback.
| Comment | File | Size | Author |
|---|---|---|---|
| #49 | 303937_49.jpg | 42.02 KB | janusman |
| #47 | 303937_47.patch | 3.43 KB | janusman |
| #36 | apachesolr_D6_303937-35.patch | 6.37 KB | janusman |
| #31 | minimal-spelling-303937-31.patch | 12.34 KB | pwolanin |
| #30 | minimal-spelling-303937-30.patch | 12.08 KB | pwolanin |
Comments
Comment #1
janusman commented
Sounds great! But I noticed you are only using the title field for building the dictionary. Is there a reason not to use the "text" field (and/or others)?
BTW, am I right that PHP 5.2 is required for the new SolrPhpClient?
Comment #2
brainski commented
Hi
I have not read about this requirement. Where was it written?
I have made many tests with the spellchecker and finally came to the conclusion that including only the title field is more accurate than including the whole text.
In my case I have 750'000 nodes, all containing a lot of plain text. Because users are looking for the title in the search field, the results of the spellchecker were much better. My nodes also contain some misspelled words and different languages. Because of this, I preferred to index only the title field. The title is free of typos in 99.9% of cases. But I described above how to add the whole text field.
At the moment, I only include one spellchecker result. Later it would be possible to show more spellchecker suggestions.
Did you test the whole thing?
BTW: I think Solr 1.3 will be released soon.
Comment #3
robertdouglass commented
Comment #4
brainski commented
While testing, it turned out that json_decode generates a slightly different response.
Because of this, facets were not working in my first version. Attached you will find the corrected module. Facets are now working.
Comment #5
brainski commented
While testing I found out that the spellchecker is also spellchecking terms like type:page. That's a bug.
I think a regex trick is necessary to remove all xx:yy elements from the query string.
//"test type:page" -> "test"
$queryclean = somecleaning($querystring);
Maybe someone who is good at regex can provide a simple solution?
Comment #6
drunken monkey
Well, a simple regex would be
/\w+:(\([^)]+\)|\w+)/i
But since colons inside quotes should remain (and parentheses can be nested, and...), a completely correct regex isn't possible. This would require a grammar and therefore a complete function dedicated to parsing the query for this syntax. But if Solr_Base_Query::parse_query() functions correctly, this could easily be reduced to just checking each field for this case, which might even be possible with just a regex.
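As a rough sketch of how the simple regex above could stand in for the somecleaning() placeholder from comment #5: the function name strip_field_filters() is made up for illustration, and the limitations just described (colons inside quotes, nested parentheses) apply unchanged.

```php
<?php
/**
 * Hypothetical helper: removes field:value and field:(...) elements from a
 * query string before it is handed to the spellchecker.
 * Not a full query parser; quoted colons and nested parentheses are not handled.
 */
function strip_field_filters($query)
{
    // Remove elements such as type:page or type:(page OR story).
    $stripped = preg_replace('/\w+:(\([^)]+\)|\w+)/i', '', $query);

    // Collapse the whitespace left behind and trim both ends.
    return trim(preg_replace('/\s+/', ' ', $stripped));
}
```

With this, strip_field_filters('test type:page') gives 'test', which could then be passed as spellcheck.q while the original string stays in q.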
Comment #7
janusman commented
Although this seems very nice, I *feel* that hacking away at the SolrPhpClient code, on top of requiring Solr 1.3 (which is not really out yet?), adds a touch too much complexity. I would move for us to work on the simpler solution in #230380: spelling suggestions instead, which relies on Solr 1.2 and needs only slight modification; perhaps that other code can be worked on to provide some of the neat stuff this patch proposes?
What do you think? For now I am marking this as a duplicate while we come to an agreement (we can revert this).
Comment #8
brainski commented
That's sad, but it's your decision. Testing would be great.
Solr was frozen on 15 August. Only the documentation needs to be updated. Once that task is finished, Solr 1.3 will come out.
Please be aware that the spellchecker component also works on Solr 1.2. That means you can also use Solr 1.2 as a base. Only the configuration effort is bigger: you have to configure the spellchecker component in the solr.conf.
Changing the SolrPhpClient is needed in order to use search handlers other than "/solr/select". The author has hardcoded this in the SolrPHPClient. Unfortunately, the only way around this is to modify the SolrPHPClient.
Anyway, it's necessary that the spellchecker becomes a feature of the Drupal Apache Solr module.
I would like to have the benefits of:
- inline spellchecking (in the search result): only one request, not one request for the search plus one for every search word.
- Checking phrases, not only a single search word
If you can provide these features with your solution, I agree that we have a duplicate. If not, I do not agree.
Comment #9
janusman commented
Thank you for your comments.
BTW this is not my *decision* but rather my *opinion*. My suggestion is to continue this thread over in the (older) issue I mentioned to focus the discussion, and not at all to stop this (on the contrary!) =) I am just tidying up =)
As for the code, I realize these are exciting benefits; however, code should be modularized (I know, slowing things a bit) to keep it sustainable. For example, when a new SolrPhpClient comes out, how would we manage this great code you put in? Right now the maintainers of this module have contacted the SolrPhpClient developer(s?) and in the future are going to try to influence that other project to put stuff like this in there.
For now, is it possible to extract the code you put into your modified SolrPhpClient to another place? To one of the existing modules, or a new module inside the contrib/ directory?
Comment #10
brainski commented
I would like to go further with this issue. I think the spellchecker is a widely needed feature.
Because the spellchecker depends on Solr 1.3, I would like to know if Solr 1.3 will be used in the upcoming version 1.0.
Because Solr 1.3 is already frozen and only the documentation part has to be done, I think we should build the upcoming version 1.0 on Solr 1.3. Otherwise the new version has to be updated shortly after the release of Solr 1.3.
@janusman
I have no idea how to handle the modified Solr client. I'm not able to commit to Drupal CVS and I have no experience in handling such a case.
The modification of the SolrPhpClient is only 2 lines. I added the possibility to select the request handler in the search URL with a parameter.
hardcoded at the moment:
solr/select/?
for spellchecker needed:
solr/spellCheckCompRH
I would like to help you because I have the feeling that the spellchecker feature will get lost if we do not stick together and find a solution that works for now and the future, and also for future module users.
Comment #11
janusman commented
@brainski: Solr 1.3.0 support is now a go. Since you proposed this for 1.3 and I was working on the same thing for Solr 1.2, do you still want to collaborate on bringing this to the current dev version?
If not, and if it's ok, I will continue the great work you started... let us know, ok? Thanks.
Comment #12
brainski commented
@janusman
Of course I would like to help! I need this feature too.
An open point for me is still the regex issue of removing xx:yy elements from the query string:
//problematic search string: "test type:page" should become "test", because spellchecker will propose something for "type" and "page"
$queryclean = somecleaning($querystring);
I stopped working on this issue because I had to modify the Solr PHP client. As I described, a default request handler is hardcoded in the Solr PHP client.
hardcoded at the moment:
solr/select/?
for spellchecker needed:
solr/spellCheckCompRH
I would like to use inline spellchecking.
I studied your code and saw that you do an additional server call for every word. This is no longer necessary. You can use the inline results.
Please contact me to discuss these points.
Comment #13
robertdouglass commented
Please feel free to propose changes to the core query (for example, a parameter that sets the handler) if it makes it easier to elegantly implement these features.
Comment #14
brainski commented
OK, here is my proposed change to the SolrPhpClient:
Service.php
The function search() has the ability to set a different search_servlet.
This is only the patch for the SolrPhpClient.
Later I will post an implementation of the search_servlet "solr/spellCheckCompRH"
*edit: replaced file, had a typo*
Comment #15
brainski commented
Hi there!
I have finished the new version of the Solr spellchecker. I have rewritten some parts and fixed all known bugs.
To use this feature, you have to:
- install Solr 1.3
- apply the patch
- copy schema.xml to the Solr /conf dir
- restart Solr
- rebuild the index
- enable the spellchecker and the spellchecker index under admin -> Apache Solr -> spellchecker
- rebuild the spellchecker index once
I hope someone can test this feature! I think it's no longer alpha; it's in beta status.
Comment #16
brainski commented
Comment #17
pwolanin commented
This looks like a very nice feature, but it will need to be reworked a bit because I'm about to put a patch in to use the dismax handler, which changes a lot of code.
I also see some style/architecture nits - such as using a global var to pass around the suggestions.
Comment #18
JacobSingh commented
I don't think you need a new request handler, you just need to add spell checking capabilities to the standard handler.
Comment #19
pwolanin commented
We should not alter Service.php.
@Jacob- the standard solrconfig.xml already defines the spellcheck handler:
Comment #20
JacobSingh commented
Yes, but you're not meant to use it.
It is an example. You can copy the properties inside of it, and add it to your standard search handler if you want spell checking.
From http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solr...
Comment #21
pwolanin commented
Also, there are whitespace issues (tabs vs. spaces, I think) in the patch.
Looking here: http://svn.apache.org/repos/asf/lucene/solr/trunk/example/solr/conf/solr...
the HEAD version of this has more useful comments in solrconfig.xml:
I think this is what Jacob is referring to - we should add this component to our usual request handler, not query it separately.
Comment #22
pwolanin commented
i.e. like this.
Comment #23
pwolanin commented
whoops - cross-posted above.
Comment #24
pwolanin commented
how about something like this for the changes to the config files?
Useful reference pages:
http://wiki.apache.org/solr/SpellCheckComponent
http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
http://wiki.apache.org/solr/SpellCheckerRequestHandler
Comment #25
pwolanin commented
here's a patch w/ no UI - enable spelling via
$conf['apachesolr_search_spellcheck'] = TRUE;
in settings.php. You may also have to issue a query directly to solr to build the spelling index.
Comment #26
pwolanin commented
a little cleaner code and comments.
Comment #27
pwolanin commented
now w/ a little UI to enable/disable spellcheck.
Comment #28
pwolanin commented
actually - I was trying to be too clever here. The substring replacement will fail (wrong offset) if there is more than 1 replacement and any but the last have a length different from the misspelled word.
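One common way around this kind of offset drift is to apply the replacements from the end of the string backwards, so that earlier offsets are never shifted. The following is only a sketch of that technique, not the code from any of the patches in this issue; the function name and the 'start' / 'end' / 'word' array keys are made up for illustration, and substr_replace() works on bytes, so the sketch assumes single-byte (ASCII) queries.

```php
<?php
/**
 * Hypothetical helper: applies spelling replacements given as offsets into
 * the original query. Each suggestion is an array with the keys
 * 'start' and 'end' (offsets in the original string) and 'word' (replacement).
 */
function apply_suggestions($query, array $suggestions)
{
    // Sort by start offset, highest first, so that replacing a word never
    // moves the offsets of the words that are still to be replaced.
    usort($suggestions, function ($a, $b) {
        return $b['start'] - $a['start'];
    });

    foreach ($suggestions as $s) {
        $length = $s['end'] - $s['start'];
        $query = substr_replace($query, $s['word'], $s['start'], $length);
    }
    return $query;
}
```

For the query "ultrashar knif" with replacements "ultrasharp" at offsets 0-9 and "knife" at offsets 10-14, the later word is replaced first, so the longer first word no longer pushes "knif" away from its recorded offset.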
Comment #29
pwolanin commented
Comment #30
pwolanin commented
don't need 'all' params in the response except for debugging.
Comment #31
pwolanin commented
need to move the build directive to make it work.
Comment #32
pwolanin commented
committed #31 to 6.x
Comment #33
robertdouglass commented
Nice. Two things.
1: Can we change
to
2: Can we put the spelling suggestion in the current search block as well?
Comment #34
janusman commented
Indexing documents from an empty index is now über-slow.
The Solr log shows that the spell index is being built continuously (for every document, it seems!)
I would suggest that spell index rebuilding be called from cron; also perhaps let admins rebuild the spell index manually (and advise them to do so in the Drupal status report?)
Will work on a patch for this.
Comment #35
pwolanin commented
@Janusman - yeah, it's set to rebuild on each commit now. Unfortunately, the option to build on optimize is not going to be available until Solr 1.4.
An earlier patch above had a rebuild-on-cron option. Alternatively, you can omit the rebuild directive from your solrconfig.xml for now.
Comment #36
janusman commented
Included is a patch against a fresh checkout of DRUPAL-6--1 to address some of the previous concerns:
Comment #37
janusman commented
Marking as Code needs review.
Comment #38
pwolanin commented
It would be better to not have the rebuild be a GET request - it should happen on form (or confirm form) submission.
Comment #39
brainski commented
Thanks to all for working on this feature! It looks like this feature is almost complete!
Comment #40
robertdouglass commented
I would still include a question mark and get rid of the : in the "did you mean" sentence. You should also put the spelling suggestion inside of the t() function. Here's what that line should look like, in my opinion:
Also note that I've generally adopted the D7 coding convention of surrounding the . operator with one space on each side.
Comment #41
janusman commented
@robertDouglass: Re: #40; totally right, will fix.
I've been thinking a bit about the UX side of things...
I think that the correct thing is not to put this in the "Your Current Search" block; my reasoning is that a suggestion is not "Your Current Search" but actually a "future" search that (I think) belongs with the facets (they also represent "future" searches, stemming from the current one).
I propose this:
Any thoughts? Will wait a few and if no comments then I'll roll a patch and see if y'all like it =)
Comment #42
robertdouglass commented
@janusman: I still like the idea of putting the spelling suggestion in current search for zero results. I dislike the idea of another block. We're drowning in blocks. It makes one more step that you have to administer and configure, whereas putting it in current search "just works".
Comment #43
anarchivist commented
Subscribing.
Comment #44
pwolanin commented
Since we are moving to 1.4, we should probably take advantage of the on-optimize building of the spelling index instead of doing it on cron, since I added a cron hook to optimize once per day.
Comment #45
andreiashu commented
subscribing
Comment #46
pwolanin commented
opened a separate issue for building on optimize: #375991: Use 1.4 feature - generate spelling index on optimize
Comment #47
janusman commented
New patch rolled for current 6.x-dev, following up on comment #42.
Comment #48
pwolanin commented
"in current search for zero results"
I don't see this check.
Comment #49
janusman commented
Could you elaborate? @robertDouglass's comment was "I still like the idea of putting the spelling suggestion in current search for zero results", and that's the way it is.
See attached screenshot, perhaps we can discuss over it.
Comment #50
pwolanin commented
@Janusman - it looks in your patch like the suggestions will always be in the current search block. My understanding of Robert's suggestion was to only display them in the block if there are zero results.
Comment #51
JacobSingh commented
It seems to me we want to use spellcheck.onlyMorePopular
And I think we should use a threshold for showing. Only showing with zero results doesn't always work great. For instance, search for druppal on this site :)
spellcheck.extendedResults will also give us a freq and the origFreq, so
search for druppal
origFreq = 20 (if 20 nodes use the word druppal)
and
freq (for Drupal) = 100k or something
which would give us a good indicator of an error. However, this may incur a performance hit (I dunno), and I'm not certain how you know that 100k is a big # and 20 is a small one. What if the index is 50,000,000? In other words, if I have a lot of documents, then the difference is not as great.
Comment #52
drunken monkey
I think the size of the index hasn't got much to do with it - a search with 100k results is in any case much more popular / likely to be correct than one with 20. I think just using the origFreq/freq ratio and maybe a threshold (e.g., no suggestions for origFreq > 1k) should work.
But the performance penalty sure is something to keep in mind and test before implementing.
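The ratio-plus-threshold idea could look roughly like the sketch below. The function name and both default values are assumptions picked for illustration (the 1000 cut-off mirrors the "origFreq > 1k" example above; the minimum ratio of 10 is arbitrary), and the sketch says nothing about the performance cost of requesting extended results.

```php
<?php
/**
 * Hypothetical helper: decides whether a spelling suggestion is worth showing,
 * based on how often the original word (origFreq) and the suggested word
 * (freq) occur in the index.
 */
function should_show_suggestion($origFreq, $freq, $maxOrigFreq = 1000, $minRatio = 10)
{
    // A word that is already very common is probably not a typo.
    if ($origFreq > $maxOrigFreq) {
        return false;
    }
    // An unknown word: suggest whenever the suggestion exists in the index.
    if ($origFreq == 0) {
        return $freq > 0;
    }
    // Otherwise require the suggestion to be clearly more frequent.
    return ($freq / $origFreq) >= $minRatio;
}
```

With the numbers from comment #51 (origFreq = 20, freq = 100k) the suggestion would be shown, while a word that itself occurs more than 1000 times would never trigger one.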
Comment #53
pwolanin commented
@Jacob - I think the idea was to show it in the current search result when there are zero hits - i.e. as an extra hint that there might be an error.
Probably onlyMorePopular is a good idea too - otherwise we are likely to be returning as many spelling errors as corrections.
Comment #54
janusman commented
This feature is already implemented. Closing out.