Implement partial word or wildcard searches

augustd - December 15, 2006 - 23:01
Project: Drupal
Version: 7.x-dev
Component: search.module
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active
Description

Search is working, but I noticed that it doesn't pick up on partial words. For example, if you search on 'quake' you would expect to get back results containing the term 'earthquakes' but there are no results.

The same happens with plurals: searching on 'engineer' when a node only includes 'engineers' will not return that node in the results. It is pretty standard searching behaviour for people to omit plurals and expect to see them in results. For example, searching on 'engineer' should return:

engineers
engineering
engineer's

I have attached a patch to enable partial word searching.

Attachment: search.module_9.patch (355 bytes)

#1

pwolanin - December 15, 2006 - 23:57

see the porter-stemmer module

#2

pwolanin - December 16, 2006 - 00:22
Version: 4.7.4 » 6.x-dev
Category: bug report » feature request
Status: patch (reviewed & tested by the community) » patch (code needs work)

Also, a "feature" of the Drupal development cycle is that new features are only considered for the latest version in development.

#3

pwolanin - December 16, 2006 - 00:26

Also, please supply patches in unified diff format. It's considered bad form to RTBC your own patch; it needs to be reviewed by others. See: http://drupal.org/patch

#4

augustd - December 19, 2006 - 02:09

Porter-Stemmer works for engineer->engineers, but not for quake->earthquake.

Attached is a unified diff of the same code.

Attachment: search.module_10.patch (726 bytes)

#5

pwolanin - December 19, 2006 - 02:16

I agree with you that this would be a nice feature, but the key question that will be asked by the people who might actually accept this change is the relative speed/efficiency of this query compared to the existing query.

#6

RobRoy - December 19, 2006 - 02:22

This is something I brought up a while back (in some issue I can't find) and was told that partial matching was too db-intensive. But for 6.x I think we need to incorporate this. Most users expect partial matching to work. Sure, that's a blanket statement, but IMO it's valid. Ironically enough, I was searching for the porter-stemmer module, and searches for "porter stemmer", "porter", and "stemmer" all failed to find the project page. Only "porter-stemmer" brought it up. It seems we are expecting searchers to know almost exactly what they are searching for, when not knowing is the main reason one searches in the first place. :)

What is a practical and feasible way to improve partial matching without sacrificing performance? Should we have an option in search settings to turn on partial-word matching?

#7

spjsche - December 19, 2006 - 02:50

Another point to note is that the search module does not perform searches within revisions.

#8

jmanico - December 19, 2006 - 17:40

It's heartbreaking that we need to wait until 6 for this basic feature that most users expect to see. Is Drupal going to be a blogging engine or an enterprise-ready system?

Might I suggest we roll this key feature into 4.7 but add a switch so the admin can turn it on or off if performance is a problem? Good job, augustd, for bringing this up.

#9

pwolanin - December 19, 2006 - 18:05

Such a thing can be added to any version as a contributed module - take the current search module and tweak it and offer it as an alternative.

#10

augustd - December 19, 2006 - 23:22

I'd like to avoid fragmenting the code base as much as possible. Besides, if I were going to offer a separate search module, I would rather create one that takes advantage of MySQL FULLTEXT searching using MATCH ... AGAINST:

http://dev.mysql.com/doc/refman/5.0/en/fulltext-search.html

What I'm offering here is just a quick fix. How hard is it to roll this into the source tree? I can make the partial word feature an option if needed.

#11

RobRoy - December 19, 2006 - 23:25

I'm pretty sure the only way it will get considered for core is if it's an option, defaulting to whole word matching. The setting could go under the Performance section. I'd say go ahead and roll a patch with a setting so it can get some reviews.

#12

Steven - December 21, 2006 - 20:01

Wildcard matching destroys the efficiency of the search index.

"Is drupal going to be a blogging engine or a enterprise ready system?"

Have you asked Google when they are going to implement wildcard searching?

#13

RobRoy - December 21, 2006 - 20:48

So if we are to not enable wildcard matching (even as an option), we would need to make improvements to the indexer, correct? So we could match quake to earthquake, develop to development, etc. That's probably a pretty gnarly task, no?

#14

Steven - December 22, 2006 - 02:27
Status: patch (code needs work) » active

Not really. All you need is a synonym list. You can solve this problem two ways:
- When searching replace a word by 'word OR synonym OR synonym'
- When indexing, replace a word by 'word synonym synonym'

The first is more exact, the second is probably faster (but does not behave well for e.g. phrase searches).
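
To make the two options concrete, here is a minimal Python sketch (not Drupal code). The synonym groups and the function names `expand_query` and `expand_for_index` are made up for illustration; a real list would be per-language and far larger.

```python
# Hypothetical synonym groups, for illustration only.
SYNONYM_GROUPS = [
    ["quake", "earthquake", "earthquakes"],
    ["develop", "developer", "development"],
]

# Map every word to the other members of its group.
SYNONYMS = {
    word: [other for other in group if other != word]
    for group in SYNONYM_GROUPS
    for word in group
}


def expand_query(keywords):
    """Search-time option: each word becomes 'word OR synonym OR synonym'."""
    groups = []
    for word in keywords.lower().split():
        alternatives = [word] + SYNONYMS.get(word, [])
        groups.append(" OR ".join(alternatives))
    return groups


def expand_for_index(text):
    """Index-time option: each word is stored as 'word synonym synonym'."""
    indexed = []
    for word in text.lower().split():
        indexed.append(word)
        indexed.extend(SYNONYMS.get(word, []))
    return indexed


print(expand_query("quake"))
print(expand_for_index("Earthquakes happen"))
```

The index-time version inserts extra words between the original ones, so word positions no longer line up with the source text, which is one way to see why phrase searches suffer under that option.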

By the way, I just checked out MySQL FULLTEXT in more detail, and as far as I can tell it has no generic substring matching either.

The only thing we can do (and which MySQL FULLTEXT does) is to support wildcards of the form foo* where the beginning is fixed. These queries still use indexes to some degree. However, the shorter the fixed string, the slower it will be.
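
A rough way to picture why foo* can still use an index while a general substring match cannot: a sorted Python list stands in for the database index below. This is only an analogy under that assumption, not how MySQL is implemented, and the word list is invented.

```python
from bisect import bisect_left


def prefix_matches(sorted_words, prefix):
    """Jump to the first candidate, then read until the prefix stops matching."""
    matches = []
    for i in range(bisect_left(sorted_words, prefix), len(sorted_words)):
        if not sorted_words[i].startswith(prefix):
            break
        matches.append(sorted_words[i])
    return matches


def substring_matches(words, fragment):
    """No ordering helps here: every word has to be inspected."""
    return [word for word in words if fragment in word]


WORDS = sorted(["earthquake", "engine", "engineer", "engineering", "quake", "quaker"])

print(prefix_matches(WORDS, "engine"))
print(prefix_matches(WORDS, "e"))
print(substring_matches(WORDS, "quake"))
```

The shorter the prefix, the longer the run of neighbouring entries that has to be read, which matches the remark that a short fixed string makes the query slower.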

However, when I did the 4.7 search module update, I thought that such wildcards would not be very useful and better replaced by stemming. Stemming also has loads of other benefits, hence the decision was made to simplify the code and remove wildcards altogether. So far, nothing has really happened to change that decision.

Note that all of the existing search code is aimed at making word searches efficient. If you simply replace the matches by wildcard matches, the result will be incredibly slow because table indices can no longer be used. This is what the proposed patch does.

What you should do is skip the entire first pass of the search (which was changed into a full table scan by this patch) and simply do a full table scan in the second pass in do_search(). However, then you might have trouble getting a good ranking going (as it is based on the search_index table).

Again, I doubt the usefulness of this feature. It is not easy to implement, it is slow and causes additional complications (ranking).

#15

RobRoy - December 22, 2006 - 04:17

I was imagining some type of synonym list, just thought that might be intense to implement. So in terms of synonym list functionality the first option you mentioned is the way you'd want to go with this? This means a user searches for 'develop', and then we expand that to 'developer OR development OR ...'? In terms of translation, I guess we'd check any searches against the current locale's synonym table. I'm not super savvy on the translator end so maybe we could flesh that out a bit. I think this is an important improvement to the search module and would like to get some ideas rolling.

And for MySQL, if we used FULLTEXT we could get wildcard matching, and we could even require a root of at least 3 characters. Can anyone comment on a comparable PGSQL solution?

@Steven, do you see both these as being core-worthy?

#16

augustd - January 5, 2007 - 01:25

Postgres has a contributed module called TSearch2 that does full text searching.

#17

nedjo - January 5, 2007 - 05:21

The contrib sql search/trip_search module uses MySQL full text indices, http://drupal.org/project/trip_search.

#18

ray007 - July 18, 2007 - 07:13

subscribing.
Partial word search is really something that should be supported nowadays.

#19

JoepH - April 10, 2008 - 12:41

+1 from me.

Please take the following into consideration as well.

There are languages (Dutch and German, for example) in which what English writes as a phrase of separate words is written as one long single word.

For example:
The phrase "department secretary" would be in Dutch and German "departmentsecretary" (one long word).

If one would search on department in Dutch or German, the word departmentsecretary would not be found.
If one would search on secretary in Dutch or German, the word departmentsecretary would not be found.

Partial search is very much needed in these languages!!!

#20

JirkaRybka - September 11, 2007 - 22:12

Feedback from eastern Europe: our languages (my native Czech, but also Slovak, Polish, Russian and many more) build words on a prefix+base+suffix principle, so if we go the "synonym list" way, we would need about 10 synonyms for almost every single word in the language! For example "Engineer", depending on gender, number and context, may in Czech become Inženýr Inženýra Inženýrovi Inženýre Inženýrem Inženýři Inženýrů Inženýrům Inženýry Inženýrech Inženýrka Inženýrku Inženýrce Inženýrko Inženýrkou Inženýrky Inženýrek Inženýrkám Inženýrkách Inženýrkami - and that's just one word, far from being especially complicated. Currently, the search in Drupal simply *doesn't work* in my language - the search box in fact requires you to know the full context in which the word is used in order to write a correctly formed query. I'm even unable to find my own posts. No searching user thinks of all the possible variations. The common way across the web here is to specify the basic form of the term (without prefixes or suffixes) and expect all the results to show. As for a synonym list, I think it's no good: I can't imagine a translator being able to correctly give all the possible variations, and even then, special terms, colloquialisms and such won't work.

So I strongly recommend including partial string matching. It might be configurable whether to use it or not, but without this feature, Drupal's search is unusable for half of Europe, I believe.

#21

puchal - February 4, 2008 - 18:15

I second that. In Polish, for example, the verb "to drive" (jechać), depending on the direction and form of movement, can take forms like: dojechać, przyjechać, odjechać, wyjechać, zajechać, najechać, podjechać, nadjechać etc. The end of the verb is also modified depending on gender, tense and number. Each of the above forms can take shapes such as: przyjechał, przyjechała, przyjechało, przyjechaliśmy, przyjechaliście, przyjechali, etc.

There are several methods of code optimizing. One of them is "Add more RAM, faster disk and more processors" :-). I think many would cope with performance issues as long as we'd have a partial word search capability.

#22

jsmithx70 - April 4, 2008 - 20:43
Version: 6.x-dev » 5.x-dev

Is there any way to set up search in Drupal 5.x
so that I can type things like this in the search box:
"george w"

and I'd like to have results like these:

"george w. bush"
"george washington"
"george wanders"
everything with the W, because right now it just does a word-by-word search and doesn't include the W (as it would with a wildcard).
It would rock

that's what I need.

I'm urgently needing it and I don't know how to do it. . . with Drupal 5.x
The porter-stemmer is not working for me as I'd like...

#23

jsmithx70 - April 4, 2008 - 20:47

I'm basically using

Drupal Biblio
Drupal Biblio Facets
(Faceted Search)

and whenever I search biblio references with the Biblio module, I get the right amount of results for a name.
And when I search with Faceted Search it brings fewer results even though the words are there...

I'm sure I'm doing everything right.
What can I do to get good results? As I posted in the previous comment, it would have to search as if it had wildcards.
I'm kinda frustrated.

#24

robertDouglass - April 10, 2008 - 11:08
Title: Search on partial words does not work » Implement partial word or wildcard searches
Version: 5.x-dev » 7.x-dev

I've updated the title to reflect that this is a feature request. I've updated the version to reflect that no features will be added to Drupal 5 or 6.

This thread has very valuable analysis from Steven as well as important counter arguments from some non-English speakers who maintain that partial-word or wildcard searches are more valuable in Dutch/German/Polish etc.

#25

JoepH - April 10, 2008 - 12:42

Thank you.

#26

Jesterw00t - April 14, 2008 - 19:25

Let me ask this question: what good is a search function if you have to know exactly what you're searching for?

If I look at the search terms used to find my website, 90% of them are partial word searches that resulted in something useful for the person searching.

If I have a site selling football equipment and someone types in ball, Drupal's search will come up with nothing. That doesn't make much sense from a user standpoint. It doesn't matter how "proper" or "slow" it could make the code; what matters is functionality. If a search function doesn't actually search anything but keywords, it's useless.

Talk about Web .0...sheesh.

#27

dww - May 10, 2008 - 16:42

How about a setting to toggle if the query against the {search_index} table should be exact match or use an RLIKE? Sure, RLIKE might be too DB intensive for some sites, but for many, having better search results is more important than having performance/scalability. So, give people the choice. That'd be a healthy start, IMHO.
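
A minimal sketch of such a toggle, using Python's bundled SQLite instead of MySQL and LIKE as a stand-in for RLIKE. The two-column table is a simplified, hypothetical version of {search_index}, and `find_sids` is an invented helper name.

```python
import sqlite3


def build_index(rows):
    """Create a toy in-memory index of (word, sid) pairs."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE search_index (word TEXT, sid INTEGER)")
    conn.execute("CREATE INDEX word_idx ON search_index (word)")
    conn.executemany("INSERT INTO search_index VALUES (?, ?)", rows)
    return conn


def find_sids(conn, keyword, partial=False):
    """Exact match by default; substring match when the setting is on."""
    if partial:
        # Assumption: '%' and '_' inside the keyword are not escaped here.
        sql = "SELECT DISTINCT sid FROM search_index WHERE word LIKE ? ORDER BY sid"
        param = "%" + keyword + "%"
    else:
        sql = "SELECT DISTINCT sid FROM search_index WHERE word = ? ORDER BY sid"
        param = keyword
    return [row[0] for row in conn.execute(sql, (param,))]


conn = build_index([("earthquakes", 1), ("quake", 2), ("engineer", 3)])
print(find_sids(conn, "quake"))
print(find_sids(conn, "quake", partial=True))
```

With the setting off, the query can use the index on `word`; with it on, the leading wildcard forces every row to be checked, which is the cost discussed earlier in the thread.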

#28

JoepH - May 14, 2008 - 07:55

+1 for dww's comment #27

#29

pwolanin - May 14, 2008 - 21:51

Hmm, from the MySQL docs:

Warning

The REGEXP and RLIKE operators work in byte-wise fashion, so they are not multi-byte safe and may produce unexpected results with multi-byte character sets. In addition, these operators compare characters by their byte values and accented characters may not compare as equal even if a given collation treats them as equal.

http://dev.mysql.com/doc/refman/5.0/en/string-comparison-functions.html
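
The effect the warning describes can be reproduced outside MySQL. The Python sketch below is only an analogy: it runs the same pattern once over characters and once over the UTF-8 bytes of one of the Czech words from this thread.

```python
import re

# "Inženýři" written with escapes so the code points are unambiguous.
WORD = "In\u017een\u00fd\u0159i"
ENCODED = WORD.encode("utf-8")

# Character-wise: "." consumes the whole letter ž, so "n.e" matches "nže".
char_hit = re.search(r"n.e", WORD) is not None

# Byte-wise: ž is two bytes in UTF-8, so "." consumes only half of it.
byte_hit = re.search(rb"n.e", ENCODED) is not None

print(len(WORD), len(ENCODED))
print(char_hit, byte_hit)
```

The word is 8 characters but 11 bytes, so any pattern that counts or skips characters behaves differently once it is applied to bytes.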

#30

robertDouglass - May 16, 2008 - 10:34

I wonder if the example of the Czech words could be fixed by Czech stemming?

Inženýr
Inženýra
Inženýrovi
Inženýre
Inženýrem
Inženýři
Inženýrů
Inženýrům
Inženýry
Inženýrech
Inženýrka
Inženýrku
Inženýrce
Inženýrko
Inženýrkou
Inženýrky
Inženýrek
Inženýrkám
Inženýrkách
Inženýrkami

To my eye, these all have the same stem: Inženýr with one variant, Inženýř. I don't see the eastern european language argument in http://drupal.org/comment/reply/103548/479591#comment-479591 as being a compelling reason to implement wildcard searches. The earthquake argument is more compelling, but would require some sort of *foo + foo* type search which would never be fast. I think the fuzzysearch module with trigrams would be the best bet, but that module needs some algorithm love.
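
For readers unfamiliar with trigrams, here is a generic Python sketch of the idea. It is not the fuzzysearch module's algorithm (which is not shown in this thread); the scoring rule, the share of the query's trigrams found in a candidate word, is an assumption chosen to suit partial-word matching.

```python
def trigrams(word):
    """Return the set of 3-character windows in a word."""
    word = word.lower()
    if len(word) < 3:
        return {word} if word else set()
    return {word[i:i + 3] for i in range(len(word) - 2)}


def trigram_score(query, candidate):
    """Fraction of the query's trigrams that also occur in the candidate."""
    wanted = trigrams(query)
    if not wanted:
        return 0.0
    return len(wanted & trigrams(candidate)) / len(wanted)


print(trigram_score("quake", "earthquakes"))
print(trigram_score("quakr", "earthquake"))
print(trigram_score("quake", "engineer"))
```

Because each trigram can be stored as an ordinary index entry, lookups stay exact matches on short strings, and a misspelled query such as "quakr" still scores above zero.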

#31

JirkaRybka - May 17, 2008 - 17:03

Actually, that "Inženýr" example was just a quick one. In our languages the beginning of some words is variable too - #21 applies perfectly to Czech as well. While "Dělat" may become "Přidělat", or "Dělali", but also "Přidělali" (and many more), we definitely need to search for something like *děla* to get usable results. Even that won't catch everything (the flexibility is quite large; we can also have "Přiděláno", and "Dělník" is related), but it would at least be usable. I don't think we can implement Czech language rules or a full dictionary, as Google seems to have, but with just partial word matching the results will become more or less usable.

So, I support #27, unless someone comes up with a better choice.

#32

robertDouglass - May 17, 2008 - 17:24

@JirkaRybka: stemming could affect the prefix of a word, too. Please read the code here and tell me if it would handle your described cases?
http://members.unine.ch/jacques.savoy/clef/CzechStemmerLight.txt

#33

JirkaRybka - May 17, 2008 - 18:26

That code looks fine, quite nice stuff going down the route of knowing the logic of the language, but I can't verify its completeness - I'm no linguist, and I don't know all the rules in the abstract myself; there are lots of them. My impression is that it's going to normalize (remove) the endings (suffixes) - at least the most common ones for sure. But the code doesn't seem to do anything about prefixes, and I suspect it might fail on some uncommon or atypical cases.

But most of all I'm worried about the code being a Czech-specific solution. That won't work for Polish, Russian, German, whatever. This doesn't look like a solution for Drupal core then, and I'm not sure whether a bunch of language-specific contribs is the way to go to make a core module (search) usable. There's no need for other language-specific modules yet, as far as I can see. That would be a lot of maintenance overhead (if someone steps up to do that), and people would need to hunt for contribs...

A simple, configurable *pattern* match (however slow it might be, although I'm unsure how the speed compares to the mentioned stemming code) looks to me much better, universal solution. I'm in favor of simplicity here.

#34

robertDouglass - May 17, 2008 - 18:32

Stemming will always be language specific. The challenge for D7 is to put the language awareness of our content to good use and make sure that stemming can be done in a language specific way. The first step is to build language specific stemmers. There are already Chinese, English and German stemmers. Please lead the effort to get Czech and Polish. The second challenge is to make the way search terms and indexed text are handled more flexible. I've already started this in the patch that makes both rely on the input format and filter system. With such a system we can switch processing based on language. http://drupal.org/node/257007

#35

JirkaRybka - May 17, 2008 - 18:51

Although I still have a vague impression of seeing overkill here, I must admit that this (if finished) will be a great thing to have. But my time and knowledge are way too limited to "lead the effort" on anything this size - Czech is a rather complex language. (Honestly, I'm now more or less leaving Drupal development, for a variety of reasons.) If I can't have simple partial or wildcard matching, I'll most likely disable the search module and link to Google instead - don't take that as a complaint, that's just how my time budget goes.

Good luck, folks, anyway.

Edit: I've re-read that linked issue, and I want to add that it looks really very promising. But still I'm afraid that a sufficient Czech stemmer is not a trivial thing.

#36

robertDouglass - May 17, 2008 - 20:26

@JirkaRybka: have you tried trip_search? Doing things the right way is not overkill. We're talking about building correct tools, not about the quickest approximate way to a somewhat acceptable result.

#37

dww - May 21, 2008 - 18:55

@robertDouglass: "We're talking about building correct tools, not about the quickest approximate way to a somewhat acceptable result."

That's all fine and dandy. ;) However, I don't think an optional partial word search on the index is only a "somewhat acceptable" result. For some (many?, most?) sites it would be acceptable performance, and _better_ functionality than even Google can provide. While I don't intend to jump into the "we need more flexible indexing" issues and shoot those down saying "partial searches on the index is the solution to all our problems", I also don't think it's fair to kill this issue using the reverse logic, either. I really don't see the harm in this. It can certainly default to 'off'. It'd add ~15 lines of code, and would be a great solution for many, many sites.

#38

samc - May 23, 2008 - 18:47

I'd love to see partial word search available as a module. One of the ways we use Drupal is as a hub for a software development community, and forums are a major feature. The lack of partial word search poses serious usability problems for us when, for example, a user searches for the term "libxerces" and cannot find a reference to "libxerces-c.so.2.7.0".

I don't know how the porter-stemmer module works, but if there is a way to take a similar approach to enabling partial word search, that would be great. In the meantime, I will be investigating the patch to core, which I hate to do.

If someone wants to organize a bounty for a clean solution to this problem, I'd be willing to contribute as long as it works on 5.x.

#39

robertDouglass - May 23, 2008 - 19:10

Ok, so the libxerces example is a case where the handling of the - is at fault. In Drupal's current indexing routine, the - is stripped and the resulting word in the index will be libxercescso270. One of the patches being worked on is repurposing the current input format and filter system to define how the text is handled during indexing. This would allow you to analyze the problem and decide that - shouldn't be replaced with '' but rather with ' ', at which point your search for libxerces would work. So again, this isn't necessarily a problem that should be solved with partial word search.
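
The difference between the two replacements can be sketched in a few lines of Python. This is a simplified stand-in for the indexer, not Drupal's actual routine: the function name `index_words` is invented, and dots are always stripped here so that the first call reproduces the libxercescso270 result described above.

```python
def index_words(text, hyphen_replacement):
    """Split text into index words, replacing '-' as configured."""
    text = text.lower().replace("-", hyphen_replacement)
    text = text.replace(".", "")  # assumption: dots are simply dropped
    return text.split()


print(index_words("libxerces-c.so.2.7.0", ""))
print(index_words("libxerces-c.so.2.7.0", " "))
```

With '' the index holds a single long word, so a whole-word search for libxerces finds nothing; with ' ' the word libxerces is indexed on its own and an exact match succeeds.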

#40

samc - May 24, 2008 - 03:24

Point taken. If you've got a link to that issue handy, please post.

OTOH, we'd also like a search for "xerces" to pull up the "libxerces-..." article, which does seem like a legitimate use case for partial word search.

#41

robertDouglass - May 27, 2008 - 17:44

Point taken. My proposed solution would be to extend the query builder in a way that adds an OR word LIKE '%%s%' segment to the query. To make this effective, stemmers would have to stop replacing the words they stem and start duplicating them. %s in the query segment would have to be processed the same way, though. If the keyword is "matches", instead of stemming it to "match", it would have to be stemmed and added to the original with an OR, like this: "match OR matches"
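
A small Python sketch of the "duplicate instead of replace" idea. The `toy_stem` function is a deliberately naive suffix stripper invented for this example, not the Porter algorithm, and `expand_keyword` is a hypothetical helper name.

```python
def toy_stem(word):
    """Naive stand-in for a stemmer: strip a trailing 'es' or 's'."""
    for suffix in ("es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def expand_keyword(word):
    """Keep the original keyword and add its stem, joined with OR."""
    word = word.lower()
    stem = toy_stem(word)
    if stem == word:
        return word
    return f"{stem} OR {word}"


print(expand_keyword("matches"))
print(expand_keyword("engineers"))
print(expand_keyword("quake"))
```

The same duplication would have to happen at index time, so that both the stem and the original form are stored and either side of the OR can match.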

#42

Frank Steiner - June 13, 2008 - 21:15

Just in case someone wants this for Drupal 6, here's the patch adjusted for 6.2. Additionally, it highlights partial words in the snippets.

Attachment: partial_word_search_6_2.patch (1.26 KB)
 
 
