Attached is a patch that enables multilingual support and therefore English (previously it defaulted to German). It permits regional dialects (e.g. en-US, en-UK) and falls back to the first two characters (en, de, etc.). This allows stop words to be defined either for a specific dialect or generically for the language.
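To illustrate the fallback described above, here is a minimal sketch in Python rather than the module's own PHP. It is not the patch's code: the `STOP_WORDS` table, its entries and the `stop_words_for` helper are hypothetical, and returning an empty set when no list matches is an assumption.

```python
# Hypothetical stop word lists keyed by language code (not the patch's data).
STOP_WORDS = {
    "en": {"the", "and", "of"},
    "en-us": {"the", "and", "of", "gotten"},
    "de": {"der", "die", "und"},
}


def stop_words_for(langcode):
    """Return the stop words for langcode.

    Tries the full code first (e.g. "en-US"), then falls back to its
    first two characters (e.g. "en"). Returns an empty set if neither
    is defined (an assumption made for this sketch).
    """
    code = langcode.strip().lower().replace("_", "-")
    if code in STOP_WORDS:
        return STOP_WORDS[code]
    generic = code[:2]
    if generic in STOP_WORDS:
        return STOP_WORDS[generic]
    return set()
```

With this table, `stop_words_for("en-US")` uses the dialect list, while `stop_words_for("en-UK")` has no dialect list and falls back to the generic `en` one.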

I changed the English stop words; the new list is a merger of two lists I found on the web. I am trying to get the sites to release them to me under GPL 2 or higher so they can be included in the repository. Maybe a better solution is to make them configurable and save them in the variable table, or to create a stop word table?

Comments

ñull’s picture

The stop words are merged from http://www.webconfs.com/stop-words.php and http://www.ranks.nl/resources/stopwords.html

Both sites replied by email to my request and have given me permission to include the lists in this module under the GPL 2 license.

eugenmayer’s picture

Status: Needs review » Patch (to be ported)

Looks good to me, thanks for the contribution. No reason not to include this one.

ñull’s picture

Status: Patch (to be ported) » Needs review
Attached file: status new, size 37.4 KB

I added better multilingual support for stop words. In my case I have only one language active, which is also the default. This led to all kinds of errors. The changes work for me, but they need to be tested in multilingual setups too (I had no time to set up a test environment for this). In this patch I also added stop words for Dutch, German, English and Spanish.

Thomas_Zahreddin’s picture

Hi,

I think a stop word list is a good idea.

de_stemmer contains a module, stemmer_api, with stop word support. Maybe you want to give it a chance:
carve out the stemmer_api to a full module

eugenmayer’s picture

Be sure to create patches against https://github.com/EugenMayer/tagging, not the d.o dev version.

The API Thomas suggests sounds like a pretty good approach, so we can include more of those. In that case, I want to make it a submodule. So we create a tagging_stopwords module that depends on those two above (so we don't need to make tagging itself depend on them).