In Australia (and I suppose many English-speaking parts of the world) we sometimes spell words with the "ise" suffix instead of the "ize" suffix.
I have attached a unified diff, which I hope can be tested by someone who knows more about stemming than I do. (My limited time allowed me to test only "caramelise", "caramelised", "caramelize" and "caramelized", all of which resolved back to the word "caramel" in a node.) All I can say is "it works for me". :)
| Comment | File | Size | Author |
|---|---|---|---|
| #5 | 335030.patch | 536 bytes | jhodgdon |
| #4 | 335030.patch | 564 bytes | jhodgdon |
| | porterstemmer.module.diff | 1.57 KB | carneeki |
Comments
Comment #1
greggles
That's an interesting idea. This module is largely based on an external porter-stemmer codebase. Maybe you could check that code to see if it has this capability? Perhaps it's time to update this code with a refresh of their latest version.
Comment #2
jhodgdon
The published Porter Stemmer algorithm is apparently only for American English (this is also true of the Porter 2 algorithm). I think we should just update the documentation to state this clearly, rather than trying to modify the algorithm so that it might also work for non-American English. My reasoning is that the algorithm's decision process is quite complex, and I'm concerned that any modifications we made would likely screw up the stemming of other words.
Places to fix documentation:
- Project page - http://drupal.org/project/porterstemmer
- README.txt file
Thoughts? Any other places to fix?
Comment #3
greggles
That seems like a good solution to me.
Thanks!
Comment #4
jhodgdon
I fixed the project page. Here's a patch for the README. Which branch(es) should we commit it to, if any?
Comment #5
jhodgdon
Missing newline. Try this patch.
Comment #6
jhodgdon
Comment #7
greggles
Looks great to me. I guess commit to the 5.x and 6.x branches, which are DRUPAL-5 and DRUPAL-6--1.
Comment #8
greggles
I should add, if you want to commit things to HEAD as well, please do. Otherwise we can just merge everything from DRUPAL-6--1 into HEAD whenever we start working on 7.x compatibility.
Comment #9
jhodgdon
Done.
Comment #11
gpk commented
Is this true? Martin Porter's definitive Porter 2 page http://snowball.tartarus.org/algorithms/english/stemmer.html doesn't mention being specific to one form of English, nor does the main snowball page http://snowball.tartarus.org/ mention American vs. English among all the languages listed. (And he seems to hail from somewhere this (British) side of the pond!!)
-ise and -ize are both valid British spellings:
http://en.wikipedia.org/wiki/American_and_British_English_spelling_diffe...
Interestingly though the Porter 2 (and the original Porter) algorithm doesn't look for -ise endings, only -ize. Bizarrely, in his actual prose Porter only uses -ise forms on that page, where he discusses -ize...!!?!
Looking at the sample vocabulary and its stemmed equivalent, I see the following:
apologise -> apologis
but
apologize -> apolog
I'm contacting the mailing list to see what Martin or others have to say about this!
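The difference in the sample vocabulary above can be reproduced with a rough sketch. This is not the module's code or a full Porter 2 implementation: `toy_stem` and `region_after_first_vc` are hypothetical names, and the sketch covers only the two rules that matter here, as I read the algorithm page (delete "ize" when it lies in region R2, and delete a final "e" in R2). It ignores the special handling of "y", the exception lists, and every other step.

```python
VOWELS = set("aeiouy")  # simplification: "y" is always treated as a vowel here


def region_after_first_vc(word, start=0):
    """Index just past the first non-vowel that follows a vowel, searching from start."""
    for i in range(start + 1, len(word)):
        if word[i] not in VOWELS and word[i - 1] in VOWELS:
            return i + 1
    return len(word)  # the region is empty


def toy_stem(word):
    """Apply only an "ize" deletion and a final-"e" deletion, both restricted to R2."""
    r1 = region_after_first_vc(word)
    r2 = region_after_first_vc(word, r1)
    # "ize" is in the suffix list, so it is removed when it sits inside R2.
    if word.endswith("ize") and len(word) - 3 >= r2:
        return word[:-3]
    # "ise" is not in the list, so only the trailing "e" can be removed.
    if word.endswith("e") and len(word) - 1 >= r2:
        return word[:-1]
    return word


print(toy_stem("apologize"))  # apolog
print(toy_stem("apologise"))  # apologis
```

For "apologize", R2 is "ogize", so the whole "ize" suffix falls inside it and is dropped. For "apologise" nothing matches the suffix rule, which leaves only the final "e" to be removed.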
Comment #12
jhodgdon
Sure, contact Porter... I'm going by his algorithm, not any statements he may have made. The algorithm doesn't stem -ise, as you've noted.
The existing Drupal Porter Stemmer module stems every word in Porter's word list correctly (it is fully tested), and can also use the official Snowball project's PECL implementation (if you have it on your system). Neither one stems -ise. Such is life.
But until the algorithm or official implementations are changed, adding this feature to the Drupal module is a non-starter.
Comment #13
gpk commented
Well, having almost composed my message, I found this in the mailing list archives for November 2008 (hoping that no one objects to my posting it here!):
Another correspondent agreed that removing the -ise ending, in the same way as -ize, actually made things worse.
So I think that the README and project page info need modifying again, maybe to say that the stemmer works for both British and American spellings, but that it is not an exact science and in fact it works better for the latter.
Comment #14
gpk commented
> adding this feature to the Drupal module is a non-starter
Agreed!
Comment #15
jhodgdon
Comment #16
jhodgdon
I've modified the project page and the README (at least in HEAD/branch; not enough of a change to merit releasing a new 6.x version in my opinion).
Comment #17
gpk commented
Wow that was quick :-D
Thanks!