When a word is broken with a hyphen, the search module does not pick up the individual word parts. For example, create the following story:

Title: Workflow-based Systems
Body: This is really new age. Let's make workflow-based systems.

Searching for "systems" or "age" will correctly return the story as a search result.

Searching for "workflow", however, will not return any results. I would expect the search module to treat hyphens the same as spaces and break words on them, so that searching for "workflow" returns the story.

Comment    File    Size    Author
#3    search-dash-boundary.patch    752 bytes    robertdouglass

Comments

robertdouglass’s picture

Version: 5.3 » 7.x-dev

Confirmed.

robertdouglass’s picture

The word that gets indexed is workflowbased. It seems unlikely that this will be useful in many cases. It's possible that workflowbased and work and flow need to be indexed (in the absence of some other partial word matching).
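To illustrate the behavior described above, here is a minimal Python sketch (not the actual search module code). The function name `simplify_current` is hypothetical, and the rule it applies, dropping hyphens before splitting into words, is an assumption inferred from the observation that "workflowbased" is what ends up in the index.

```python
import re

def simplify_current(text):
    """Sketch of the observed behavior: hyphens are removed, not split on."""
    # Assumption: the hyphen is simply dropped, gluing the two halves together.
    text = text.lower().replace("-", "")
    # Everything else that is not a letter or digit acts as a word boundary.
    return re.findall(r"[a-z0-9]+", text)

words = simplify_current("Workflow-based Systems")
print(words)                # ['workflowbased', 'systems']
print("workflow" in words)  # False, so a search for "workflow" finds nothing
```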

robertdouglass’s picture

Status: new
File size: 752 bytes

The easy "fix" to the problem is to consider the dash a word boundary. We have to be mindful of the consequences, however. This will now index workflow and based as two separate words. If there are words that should have dashes in them this will then break.
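As a rough illustration of that trade-off, the following Python sketch treats the dash as a word boundary. It is not the attached patch; `tokenize_dash_boundary` is a hypothetical name used only for this example.

```python
import re

def tokenize_dash_boundary(text):
    """Sketch: any character that is not a letter or digit, including '-', ends a word."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize_dash_boundary("Workflow-based Systems"))  # ['workflow', 'based', 'systems']
print(tokenize_dash_boundary("e-mail"))                  # ['e', 'mail']
```

The first call shows the benefit ("workflow" becomes searchable), the second shows the cost for words that arguably belong together.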

robertdouglass’s picture

Status: Active » Needs work

catch’s picture

So this means that "e-mail" becomes "e" "mail". That's not good ;) Doubling up the indexing so we have "workflow" "based" and "workflowbased" as suggested in #3 sounds like it might work. That'd mean nodes with "e-mail" would show up in searches for "mail", which doesn't seem like such a bad thing.

BlakeLucchesi’s picture

Status: Needs work » Closed (won't fix)

This will be fixed with a new patch that makes use of input filters to process text for search and indexing. This would allow anyone with specific language problems to disable the default input filters provided by search and assign their own.

http://drupal.org/node/257007

Anonymous’s picture

Status: Closed (won't fix) » Active

Replacing "-" with a different search character wouldn't help in this case, if I understand correctly.

e-mail
Workflow-based

replace dash with space
e mail
workflow based

(although I heard e-mail is deprecated in the English language and has now become email ;-)

Possible solutions:

  1. Accept the technical limitations of a search function (either way).
  2. Add the possibility of a native-language dictionary that controls the spelling (and cache this for all content). I don't know if there is an open source dictionary for all native languages, though. Maybe work together with somebody like OpenOffice.org on this (do they do that?), and possibly commercial third parties.
  3. Supply a list manually (but I don't see this happening in a workflow and on a large scale).
  4. Split the words when indexing but then combine them again in the results. So index on cron as workflow - based and display in the search results as workflow-based. Something like that anyway.

P.S.: I set this to active because I'm not sure what happens to a post that is left with the previous setting.

jhodgdon’s picture

Bump. This is still an issue.

kingandy’s picture

FWIW, I agree with #3 and #5. Indexing xxx-yyy as xxx, yyy and xxxyyy might increase the size of the search index but would bring many advantages.

This seems to be a general search issue rather than a filter- or language-based thing, so I wouldn't lump it in with the other issue mentioned in #6. (And even if it was, it would be "Duplicate" rather than "Won't Fix" ;)

kingandy’s picture

Incidentally, I would suggest keeping the restriction on indexed string lengths, so with "e-mail" the "e" would not get indexed - only "email" and "mail".
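A minimal Python sketch of this combined suggestion (index the parts and the joined word, but skip anything below the minimum length) might look like the following. The name `index_terms` and the constant `MIN_WORD_LENGTH` are hypothetical, and the value 3 is assumed here as the minimum length.

```python
MIN_WORD_LENGTH = 3  # assumed minimum word length for indexing

def index_terms(word, min_len=MIN_WORD_LENGTH):
    """Sketch: index xxx-yyy as xxx, yyy and xxxyyy, dropping terms that are too short."""
    parts = [p for p in word.lower().split("-") if p]
    candidates = list(parts)
    if len(parts) > 1:
        # Also index the joined form, e.g. "email" for "e-mail".
        candidates.append("".join(parts))
    terms = []
    for term in candidates:
        if len(term) >= min_len and term not in terms:
            terms.append(term)
    return terms

print(index_terms("workflow-based"))  # ['workflow', 'based', 'workflowbased']
print(index_terms("e-mail"))          # ['mail', 'email']
```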

jhodgdon’s picture

A few comments:
- The idea of doing all preprocessing with filters (see #6 above) never got into D7 (would have been a major API change, and it's long past the deadline).
- The idea of breaking on hyphens as if they were spaces is the only viable plan.
- There is generally a restriction in search to only index words of x characters or more (configurable, defaults to 3).

Given all of that, if we patch so that - is considered a word boundary like spaces/punctuation are now, then
* "highly-valued" would be broken up as "highly valued"
* "e-mail" would be broken up as "e mail", and only "mail" would make it into the index.
* We would need to do the same process of breaking up during keyword searching, so that if someone searches for "e-mail", they would end up searching for "mail". Which is fine, because right now the same processing to break the keywords up into words is used for indexing and searching.
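The three points above can be sketched in Python as follows. This is an illustration under stated assumptions, not the search module's code: `tokenize`, `build_index` and `search` are hypothetical names, the minimum word length is taken to be 3, and a query matches only documents containing all of its words.

```python
import re
from collections import defaultdict

MIN_WORD_LENGTH = 3  # assumed default minimum word length

def tokenize(text, min_len=MIN_WORD_LENGTH):
    """Split on anything that is not a letter or digit (so '-' is a boundary)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return [w for w in words if len(w) >= min_len]

def build_index(docs):
    """Map each indexed word to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for word in tokenize(text):
            index[word].add(doc_id)
    return index

def search(index, query):
    """Run the query through the same tokenizer that was used for indexing."""
    terms = tokenize(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(t, set()) for t in terms))

docs = {1: "Our highly-valued customers", 2: "Send an e-mail today"}
index = build_index(docs)
print(search(index, "e-mail"))         # {2}, because the query is reduced to "mail"
print(search(index, "highly-valued"))  # {1}
```

Because indexing and searching share `tokenize`, a query for "e-mail" and the indexed text "e-mail" are reduced to the same word, which is why they still match.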

We need a patch...

jhodgdon’s picture

One more idea:

If you have "e-mail", we could decide not to split it by the hyphen, because at least one of the parts would be below the threshold of 3 characters minimum for search indexing (or whatever the site has it set to).

If you have "work-flow", we could decide to split it, because both parts are above the word threshold.

Then we'd index "e-mail" and "work" and "flow", and perhaps we should also index email (or index email in place of e-mail?).

The same decision tree would happen when we are doing a search, so that if someone searches for "e-mail" or "work-flow", it should treat it the same as it did in indexing. Otherwise it wouldn't match.
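This decision tree could be sketched in Python roughly as follows. The function name `split_hyphenated` is hypothetical and the threshold of 3 is assumed. The optional extra step of also indexing "email" is left out.

```python
def split_hyphenated(word, min_len=3):
    """Sketch: split on hyphens only if every part meets the minimum length."""
    parts = [p for p in word.lower().split("-") if p]
    if len(parts) > 1 and all(len(p) >= min_len for p in parts):
        return parts
    # At least one part is too short: keep the hyphenated word intact.
    return [word.lower()]

# The same function is applied to indexed text and to search keywords,
# so both sides are treated identically.
print(split_hyphenated("e-mail"))     # ['e-mail']
print(split_hyphenated("work-flow"))  # ['work', 'flow']
```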

jhodgdon’s picture

Just a note that this should only apply to non-numeric data. For numbers, the convention is that (number)(punctuation)(number) is simplified to (number)(number), so that e.g. 10/02/2010 can match 10-02-2010, and 123,456 can match 123456.

We don't want to change that, in my opinion.
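For illustration, here is a small Python sketch of keeping the numeric convention while splitting non-numeric words on hyphens. The names `NUMERIC_JOIN` and `simplify_numeric_aware` are hypothetical, and the set of punctuation characters removed between digits (`-`, `/`, `.`, `,`) is an assumption for this example.

```python
import re

# Assumption: one punctuation character sitting between two digits is removed.
NUMERIC_JOIN = re.compile(r"(?<=\d)[-/.,](?=\d)")

def simplify_numeric_aware(text):
    """Sketch: join number-punctuation-number, split everything else on punctuation."""
    text = NUMERIC_JOIN.sub("", text.lower())
    return re.findall(r"[a-z0-9]+", text)

print(simplify_numeric_aware("10/02/2010"))      # ['10022010']
print(simplify_numeric_aware("10-02-2010"))      # ['10022010']
print(simplify_numeric_aware("123,456"))         # ['123456']
print(simplify_numeric_aware("workflow-based"))  # ['workflow', 'based']
```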

jhodgdon’s picture

I'm going to combine this back with
#108100: Need smarter search splitting on underscores, hyphens, apostrophes and other characters
because the hyphen and underscore code is exactly the same.

jhodgdon’s picture

Status: Active » Closed (duplicate)

See previous comment.

riversidekid’s picture

Is there a "best practice" guide for data handling?

I just uploaded 8300+ pictures named Location-date.jpg for each so there are no spaces in my names. Should I have left spaces in the file name?

Example: usa-georgia-atlanta-20100923.jpg

Now when I search for "atlanta" nothing comes up! When I search for "usa-georgia-atlanta-20100923.jpg" the image comes up in search results.