Closed (duplicate)
Project:
Drupal core
Version:
7.x-dev
Component:
search.module
Priority:
Normal
Category:
Bug report
Assigned:
Unassigned
Reporter:
Created:
2 Jan 2008 at 18:08 UTC
Updated:
23 Sep 2010 at 21:47 UTC
Comments
Comment #1
robertdouglass commented
Confirmed.
Comment #2
robertdouglass commented
The word that gets indexed is workflowbased. It seems unlikely that this will be useful in many cases. It's possible that workflowbased and work and flow need to be indexed (in the absence of some other partial word matching).
Comment #3
robertdouglass commented
The easy "fix" to the problem is to consider the dash a word boundary. We have to be mindful of the consequences, however. This will now index workflow and based as two separate words. If there are words that should have dashes in them this will then break.
Comment #4
robertdouglass commented
Comment #5
catch
So this means that "e-mail" becomes "e" "mail". That's not good ;) Doubling up the indexing so we have "workflow" "based" and "workflowbased" as suggested in #3 sounds like it might work. That'd mean nodes with "e-mail" would show up in searches for "mail", which doesn't seem like such a bad thing.
Comment #6
BlakeLucchesi commented
This will be fixed with a new patch that makes use of input filters to process text for search and indexing. This would allow anyone with specific language problems to disable the default input filters provided by search and assign their own.
http://drupal.org/node/257007
Comment #7
Anonymous (not verified) commented
Replacing "-" with a different search character wouldn't help in this case, if I understand correctly.
e-mail
Workflow-based
replace dash with space
e mail
workflow based
(although I heard e-mail is deprecated in the English language and has now become email ;-)
Possible solutions:
Index as workflow - based and display in the search results as workflow-based. Something like that anyway.
P.S.: I set this to active, because I'm not sure what happens to a post that is left with the previous setting.
Comment #8
jhodgdon
Bump. This is still an issue.
Comment #9
kingandy
FWIW, I agree with #3 and #5. Indexing xxx-yyy as xxx, yyy and xxxyyy might increase the size of the search index but would bring many advantages.
This seems to be a general search issue rather than a filter- or language-based thing, so I wouldn't lump it in with the other issue mentioned in #6. (And even if it was, it would be "Duplicate" rather than "Won't Fix" ;)
Comment #10
kingandy
Incidentally, I would suggest keeping the restriction on indexed string lengths, so with "e-mail" the "e" would not get indexed - only "email" and "mail".
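The combined proposal from #5, #9 and #10 could be sketched roughly as follows. This is an illustrative Python sketch, not Drupal's search code: the function name `index_terms`, the ASCII-only pattern, and the default minimum of 3 are assumptions made for the example.

```python
import re

MIN_WORD_SIZE = 3  # assumed default for the configurable minimum word length


def index_terms(text, min_size=MIN_WORD_SIZE):
    """Return index terms; hyphenated words yield each part plus the joined form."""
    terms = []
    # Hypothetical tokenizer: runs of ASCII letters/digits, optionally joined by hyphens.
    for token in re.findall(r"[a-z0-9]+(?:-[a-z0-9]+)*", text.lower()):
        parts = token.split("-")
        candidates = list(parts)
        if len(parts) > 1:
            candidates.append("".join(parts))  # e.g. "workflowbased"
        for term in candidates:
            # Keep the length restriction so that "e" from "e-mail" is skipped.
            if len(term) >= min_size and term not in terms:
                terms.append(term)
    return terms
```

With these assumptions, `index_terms("e-mail")` yields "mail" and "email" but not "e", and "workflow-based" yields "workflow", "based" and "workflowbased".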
Comment #11
jhodgdon
A few comments:
- The idea of doing all preprocessing with filters (see #6 above) never got into D7 (would have been a major API change, and it's long past the deadline).
- The idea of breaking on hyphens as if they were spaces is the only viable plan.
- There is generally a restriction in search to only index words of x characters or more (configurable, defaults to 3).
Given all of that, if we patch so that - is considered a word boundary like spaces/punctuation are now, then
* "highly-valued" would be broken up as "highly valued"
* "e-mail" would be broken up as "e mail", and only "mail" would make it into the index.
* We would need to do the same process of breaking up during keyword searching, so that if someone searches for "e-mail", they would end up searching for "mail". Which is fine, because right now the same processing to break the keywords up into words is used for indexing and searching.
We need a patch...
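To make the behaviour listed above concrete, here is a rough Python sketch in which the same tokenizer is used for indexing and for keyword searches. The names `tokenize` and `matches` are hypothetical, the pattern only handles ASCII letters and digits, and the special handling of numbers is ignored; it is not the search.module implementation.

```python
import re


def tokenize(text, min_size=3):
    """Split on anything that is not a letter or digit, so "-" acts like a space."""
    words = re.split(r"[^a-z0-9]+", text.lower())
    # Drop empty strings and words shorter than the (assumed) minimum length.
    return [w for w in words if len(w) >= min_size]


def matches(query, document, min_size=3):
    """True if every surviving query word is also a word of the document."""
    doc_terms = set(tokenize(document, min_size))
    query_terms = tokenize(query, min_size)
    return bool(query_terms) and all(t in doc_terms for t in query_terms)
```

Because both sides go through `tokenize`, a search for "e-mail" is reduced to "mail" and still finds a document containing "e-mail".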
Comment #12
jhodgdon
One more idea:
If you have "e-mail", we could decide not to split it by the hyphen, because at least one of the parts would be below the threshold of 3 characters minimum for search indexing (or whatever the site has it set to).
If you have "work-flow", we could decide to split it, because both parts are above the word threshold.
Then we'd index "e-mail" and "work" and "flow", and perhaps we should also index email (or index email in place of e-mail?).
The same decision tree would happen when we are doing a search, so that if someone searches for "e-mail" or "work-flow", it should treat it the same as it did in indexing. Otherwise it wouldn't match.
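The decision tree could look something like this in Python. It is only a sketch: `split_hyphenated` and the `also_joined` flag are made-up names, and whether to index the joined form as well is left open, as in the comment above.

```python
def split_hyphenated(word, min_size=3, also_joined=False):
    """Split on hyphens only when every part meets the minimum word length."""
    word = word.lower()
    parts = word.split("-")
    if len(parts) > 1 and all(len(p) >= min_size for p in parts):
        return parts  # "work-flow" -> "work", "flow"
    terms = [word]  # "e-mail" stays whole because "e" is too short
    if also_joined and len(parts) > 1:
        terms.append("".join(parts))  # optionally add "email" as well
    return terms
```

The same function would have to be applied to search keywords, otherwise a query for "e-mail" or "work-flow" would not match what was indexed.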
Comment #13
jhodgdon
Also, there are some other punctuation issues related to this:
#427504: Why does indexing split on apostrophe?
#365661: Search module doesn't allow index/search of "c++" or "c#" or similar
Comment #14
jhodgdon
This should probably also apply to underscores.
#108100: Need smarter search splitting on underscores, hyphens, apostrophes and other characters
Comment #15
jhodgdon
Just a note that this should only apply to non-numeric data. For numbers, the convention is that (number)(punctuation)(number) is simplified to (number)(number), so that e.g. 10/02/2010 can match 10-02-2010, and 123,456 can match 123456.
We don't want to change that, in my opinion.
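As an illustration of the numeric convention, a rough Python approximation follows. The function name `simplify_numbers` and the regular expression are assumptions for this example and are not taken from search.module.

```python
import re


def simplify_numbers(text):
    """Drop any single punctuation character that sits directly between two digits."""
    # Lookarounds keep the digits themselves; underscores and whitespace are not removed.
    return re.sub(r"(?<=\d)[^\w\s](?=\d)", "", text)
```

With this rule 10/02/2010 and 10-02-2010 both reduce to 10022010, while a hyphen between letters, as in "e-mail", is left untouched for the word-splitting logic to handle.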
Comment #16
jhodgdon
I'm going to combine this back with
#108100: Need smarter search splitting on underscores, hyphens, apostrophes and other characters
because the hyphen and underscore code is exactly the same.
Comment #17
jhodgdon
See previous comment.
Comment #18
riversidekid commented
Is there a "best practice" guide for data handling?
I just uploaded 8300+ pictures, each named Location-date.jpg, so there are no spaces in the file names. Should I have left spaces in the file names?
Example: usa-georgia-atlanta-20100923.jpg
Now when I search for "atlanta" nothing comes up! When I search for "usa-georgia-atlanta-20100923.jpg" the image comes up in search results.