Support PECL stem package
jmiccolis - October 21, 2009 - 16:25
| Project: | Porter-Stemmer |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | jhodgdon |
| Status: | fixed |
Description
PECL has a stem package (http://pecl.php.net/package/stem) that is also based on the Snowball code. Mainly for performance reasons, it would be great if this module could use that package when available and fall back on the current PHP implementation otherwise.
Does this sound like a reasonable addition?

#1
Seems reasonable to me.
#2
Thanks for pointing this out! I will try it out and see about the feasibility of using it in the module...
jmiccolis: if by some chance you have already tried this and have a patch, that would be welcome, of course!
#3
Hmmm...
Actually, it looks like that project contains stemmers for a dozen or so languages, and it also looks like the latest CVS messages on the project say "Convert everything to UTF-8 internally. Tests fail for now, will be fixed shortly". This was three months ago, but apparently after the last stable release... Do you know anything about the maturity of this project?
Anyway, I'm wondering if it might make sense to make one module that would use any/all of the included PECL stemmers, rather than just porting this Porter Stemmer module so it would use that one stemming algorithm in the PECL package. It seems like other stemming packages would also need/want to convert, though I don't know anything about the other languages' algorithms. Any thoughts?
#4
I actually have a version of this running now. Testing. Will report back...
#5
Testing results:
- The PECL implementation of the algorithm matches the output of my PHP implementation -- all the tests pass, which is to say that the PECL library implementation stems the words to match Porter's word list, as does my PHP implementation.
- The PECL library is significantly faster. I added some timing to the test runs, and they finish in about 25% of the time when using the PECL library's implementation. That's not too surprising, since theirs is written in C and mine is written in PHP. Possibly the stemming itself is even faster -- I took out the assert statements during the timing tests, but the test still had to read information from the word list file, so there was some overhead beyond just stemming words. So the factor of 4 speedup is probably a lower bound.
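A minimal sketch of how such a timing comparison could be set up is shown here. It is not the module's actual test code: `example_php_stem()` and `example_time_stemmer()` are hypothetical names, the stand-in stemmer is a trivial placeholder rather than the Porter algorithm, and the word list is made up. The PECL branch only runs when `stem_english()` is available.

```php
<?php
// Hypothetical stand-in for a pure-PHP stemmer. This is NOT the Porter
// algorithm; it only gives the timing loop something to call.
function example_php_stem($word) {
  return preg_replace('/(ing|ed|s)$/', '', $word);
}

// Times how long a stemmer callback takes to process a list of words.
// Returns the elapsed wall-clock time in seconds.
function example_time_stemmer($stemmer, array $words) {
  $start = microtime(TRUE);
  foreach ($words as $word) {
    call_user_func($stemmer, $word);
  }
  return microtime(TRUE) - $start;
}

// Made-up sample words; a real run would read a full word list file.
$words = array('running', 'jumped', 'cats', 'stemming');

$php_time = example_time_stemmer('example_php_stem', $words);
printf("PHP stand-in: %.6f s\n", $php_time);

// Only time the PECL implementation if the function is actually available.
if (function_exists('stem_english')) {
  $pecl_time = example_time_stemmer('stem_english', $words);
  printf("PECL stem_english(): %.6f s\n", $pecl_time);
}
```

With only a handful of words the numbers are mostly noise; a meaningful comparison needs a large word list and several repeated runs.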
So I have just committed these changes to the 6.x-2.x-dev development branch of Porter Stemmer. They should be available from CVS now, or on the Porter Stemmer module page in about 24 hours (check the last update time on the development download). Here's the commit log:
http://drupal.org/cvs?commit=278052
#6
jhodgdon: that was quick!
This really is a significant performance improvement. In my one-off test, I measured about a quarter less processing time for the same search indexing task:
Setup:
* php 5.2.6, Apache 2, MySQL 5.0.77
* Latest stem extension installed with "pecl install"
* Drupal 6.14, devel, search, porterstemmer
* 200 nodes, 100 comments
* all search settings default
* xdebug
**Test 1:**
* stem extension disabled
* search index empty
* run cron once with drush cron
Results:
* porterstemmer_search_preprocess (this is the function that wraps the call to either the stem extension or the porterstemmer PHP implementation) has an inclusive cost of 25 % (25 % of script execution time)
* porterstemmer_suffix is 3rd in self cost with 7 %, called 1,754,060 times
**Test 2:**
* stem extension enabled
* As in Test 1, search index empty
* As in Test 1, run cron once with drush cron
Results:
* inclusive cost of porterstemmer_search_preprocess practically disappeared (0.3 %)
* porterstemmer_suffix is not being called anymore, as stem library is being used.
#7
This patch wraps loading the stem extension into its own function that only attempts to load once. Suppresses notices from dl() with @ as I found that the current code issues a warning if stem.so is not available.
I wonder whether porterstemmer should rely on dl() at all.
#8
Good idea, alex_b, on that static function and using @ if it's not there. And thanks for testing!
Regarding dl(), I am not sure about that either. Perhaps not, as php.net says it is going away. We can just use that library if it is already attached/running, and not otherwise.
I'll get out a new version shortly with something very much like your attached patch, but without using dl().
[Edit: Looks like "shortly" will be tomorrow probably, and "version" will mean another development release.]
#9
#8 @jhodgdon: We should simplify it further then and not support dynamic loading.
http://php.net/manual/en/function.dl.php does not state why dl() is being deprecated without a successor. Is dynamic loading a deprecated feature? Either way, "Relying on this feature is highly discouraged." kinda rings the bell for me.
Here is a simplification that only checks for the presence of stem with extension_loaded(). I pushed this version through cachegrind, too: extension_loaded() is cheap, we can call it without a check-once wrapper.
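For readers without the attached patch, the idea could be sketched roughly as follows. This is an illustration only, not the patch itself: `example_stem_word()` and `example_fallback_stem()` are hypothetical names, and the fallback is a placeholder standing in for the module's PHP stemmer.

```php
<?php
// Hypothetical placeholder for the pure-PHP stemmer; it only lowercases
// the word and does NOT implement the Porter algorithm.
function example_fallback_stem($word) {
  return strtolower($word);
}

// Uses the PECL stem extension when it is loaded, otherwise falls back
// to the PHP implementation. No dl() call and no check-once wrapper.
function example_stem_word($word) {
  if (extension_loaded('stem')) {
    return stem_english($word);
  }
  return example_fallback_stem($word);
}
```

Note that this version assumes `stem_english()` exists whenever the extension is loaded.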
#10
Yeah, agreed, looks like dynamic loading is going away, but hard to say for sure what the plan is unless we search around for a roadmap.
So do you think it's definitely safe to say that if extension_loaded('stem') returns TRUE, the stem_english() function exists? I feel more comfortable, given that it's a contrib extension, leaving the function_exists() check in as well (and if we leave that in there, maybe putting it in a separate function is again more efficient).
Thoughts?
Also, if you are up for it, the porter_stemmer.test file needs a similar patch. :)
#11
I'm patching...
#12
I just checked in a new revision to the 6.x-2.x-dev branch of Porter Stemmer, which:
- Does not attempt to use dynamic loading
- Moves the check for the PECL library into a function with static caching of results
http://drupal.org/cvs?commit=278870
I tested that the module still functions correctly with and without the PECL library loaded with
extension=stem.so
in the php.ini file, and it also passes code review.
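A check along the lines described above (no dynamic loading, result cached in a static variable) might look like the sketch below. It is not the committed code: the function name `example_has_pecl_stem()` is hypothetical, and combining `extension_loaded()` with `function_exists()` follows the discussion in #10 rather than the actual commit.

```php
<?php
// Reports whether the PECL stem library can be used. The two checks run
// only on the first call; later calls return the cached result.
function example_has_pecl_stem() {
  static $has_pecl = NULL;
  if ($has_pecl === NULL) {
    // Never calls dl(): the extension must already be enabled,
    // for example via php.ini.
    $has_pecl = extension_loaded('stem') && function_exists('stem_english');
  }
  return $has_pecl;
}

// Usage: decide once per word which implementation to call.
if (example_has_pecl_stem()) {
  print "Using PECL stem_english()\n";
}
else {
  print "Using the PHP implementation\n";
}
```

Caching the result keeps repeated calls cheap when the check sits on the hot path of search preprocessing.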
Comments welcome... If anyone has comments on the current versions of README and INSTALL files (which mention PECL), that would be good too.
#13
Could you roll a release with these changes?
#14
Yes, it's on my agenda for today... Thanks for the reminder!
#15
Release 6.x-2.4 has been created, with the PECL support included.
I suggest waiting for at least 10 minutes before attempting to download it from the Porter Stemmer project page, as sometimes the download link appears before the zip file is actually there.
http://drupal.org/project/porterstemmer