Closed (fixed)
Project: Porter Algorithm Search Stemmer
Version: 6.x-1.x-dev
Component: Code
Priority: Normal
Category: Task
Assigned:
Reporter:
Created: 6 Jul 2009 at 21:47 UTC
Updated: 18 Aug 2009 at 15:40 UTC
The author of the Porter stemming algorithm has released a "Porter 2" version of the algorithm, which he recommends using for "practical purposes", as it fixes some limitations of the original Porter algorithm.
New algorithm site: http://snowball.tartarus.org/algorithms/english/stemmer.html
Original algorithm site: http://tartarus.org/~martin/PorterStemmer/index.html
This issue is to track progress and comments on upgrading to the "Porter 2" algorithm.
Comments
Comment #1
jhodgdon
There is a test set of words that could be used to generate a test (using SimpleTest) for the Porter Stemmer as well.
Comment #2
goodeit commented
subscribing
Comment #3
jhodgdon
I have just checked in a totally new version of this module into the HEAD (development) branch -- see commit http://drupal.org/cvs?commit=237490
There are several changes:
a) Updated to the new Porter Stemmer 2 algorithm (see above).
b) Set the minimum word length after stemming to 3 characters (see #219335: If term gets stemmed to fewer than 3 characters, form validation fails)
c) Updated all the files to comply with coding standards, especially function naming (see #437094: Function m() creates issues with ubercart).
d) Added unit tests; check the README file for information on how to run the tests with the SimpleTest module. In my testing, all 29,000 words from the test word list available from the "new algorithm" web site above are stemmed correctly with the checked-in module. Yeah! The word list is included in the distribution... it is a fairly large file (about 1 MB), but it's necessary to have in order to run all the tests.
e) Updated the README and INSTALL files.
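For readers curious what a word-list check like the one in item d) might look like, here is a minimal sketch in Python. It is not the module's SimpleTest code. It assumes the word list has been arranged as one "word expected_stem" pair per line, and `toy_stem`, `parse_pairs`, and `find_mismatches` are hypothetical names used only for illustration. The toy stemmer is a stand-in, not the Porter 2 algorithm.

```python
from typing import Callable, Iterable, List, Tuple


def parse_pairs(lines: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse 'word expected_stem' lines, skipping blank or malformed ones.

    The two-column layout is an assumption made for this sketch.
    """
    pairs = []
    for line in lines:
        parts = line.split()
        if len(parts) == 2:
            pairs.append((parts[0], parts[1]))
    return pairs


def find_mismatches(
    stem: Callable[[str], str], pairs: List[Tuple[str, str]]
) -> List[Tuple[str, str, str]]:
    """Return (word, expected, actual) for every word the stemmer gets wrong."""
    mismatches = []
    for word, expected in pairs:
        actual = stem(word)
        if actual != expected:
            mismatches.append((word, expected, actual))
    return mismatches


def toy_stem(word: str) -> str:
    """Hypothetical placeholder stemmer: strips one trailing 's' from longer words.

    A real run would plug in the stemmer under test instead.
    """
    if word.endswith("s") and len(word) > 3:
        return word[:-1]
    return word


# A tiny in-memory sample standing in for the full word list file.
SAMPLE_LINES = ["cats cat", "dogs dog", "", "running run"]

if __name__ == "__main__":
    for word, expected, actual in find_mismatches(toy_stem, parse_pairs(SAMPLE_LINES)):
        print(f"{word}: expected {expected!r}, got {actual!r}")
```

An empty mismatch list over the whole word list would correspond to the "all words stemmed correctly" result reported in item d).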
If you would like to test this new algorithm, you can check it out now from CVS (HEAD branch). If you are not comfortable using CVS, wait a few hours, and it should be available from the Porter Stemmer project page under "View All Releases" as a gzip file. Look for "porterstemmer 6.x-2.x-dev" and make sure the last updated date is sometime after now (i.e. July 14, 2009).
You will need to rebuild your search index after installing this module. See instructions in the INSTALL.txt file.
Comment #4
jhodgdon
Just a note that this is now on the main Porter Stemmer project page, in the "Development releases", 6.x-2.x-dev. Thanks Greg!
Comment #5
jhodgdon
The 6.x-2.0 version of Porter Stemmer has now been released (should be on the Project page in the next 5 minutes). So this issue is now fixed.
I've also removed the 6.x-2.x-dev version from view on the Project page, since there is nothing currently in there that isn't in the official release. If we start developing again, I'll put it back in view.