When building a search engine, text processing happens at two stages. At indexing time, text is tokenized (broken into small units) and analyzed (removing punctuation, lowercasing, and Porter stemming are three examples). Then, at search time, the keywords being searched for undergo the same processing so that they match the form of the text in the index.
In Drupal (up to 6), these steps are hardcoded in a function called search_simplify. As a result, the processing is a "best guess" at what people actually need, and it causes problems when its assumptions don't match a site's needs. Some Asian languages need different processing, for example.
The proposal is to replace the processing in search_simplify with a call to check_markup that uses a search-specific input format. The filters would handle all of the analysis. The input format could be controlled by the administrator, just as input formats are now, so a site with special needs could use different filters, in a different order, with different configurations.
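To illustrate the proposal, here is a minimal sketch in Python (not Drupal code, and not the attached patch) of a configurable chain of analysis filters that is applied identically at index time and at search time. The names `lowercase`, `strip_punctuation`, `naive_stem`, `analyze`, and `SEARCH_CHAIN` are hypothetical, and the stemmer is a crude suffix stripper standing in for a real Porter stemmer.

```python
import re
from typing import Callable, List

# A filter takes text and returns transformed text.
Filter = Callable[[str], str]


def lowercase(text: str) -> str:
    return text.lower()


def strip_punctuation(text: str) -> str:
    # Replace anything that is not a word character or whitespace with a space.
    return re.sub(r"[^\w\s]", " ", text)


def naive_stem(text: str) -> str:
    # Crude stand-in for a real stemmer: drop a trailing "ing" or "s"
    # from words that are long enough to survive the cut.
    def stem(word: str) -> str:
        for suffix in ("ing", "s"):
            if word.endswith(suffix) and len(word) > len(suffix) + 2:
                return word[: -len(suffix)]
        return word

    return " ".join(stem(w) for w in text.split())


def analyze(text: str, chain: List[Filter]) -> List[str]:
    # Run every filter in order, then split the result into tokens.
    for f in chain:
        text = f(text)
    return text.split()


# The "input format": an ordered list an administrator could rearrange.
SEARCH_CHAIN: List[Filter] = [lowercase, strip_punctuation, naive_stem]

# The same chain is used for indexed content and for the search keywords.
indexed = analyze("Searching, indexed Texts!", SEARCH_CHAIN)
query = analyze("search text", SEARCH_CHAIN)
```

Because both sides pass through the same chain, every token in `query` also appears in `indexed`. A site with different needs would swap or reorder the entries in `SEARCH_CHAIN` rather than edit a hardcoded function.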
| Comment | File | Size | Author |
|---|---|---|---|
| #4 | search_input_filter.patch | 8.54 KB | BlakeLucchesi |
| #3 | search_input_filter.patch | 10.39 KB | BlakeLucchesi |
| #1 | search-input-formats.patch | 5.07 KB | robertdouglass |
Comments
Comment #1
robertdouglass commented
The patch is a proof of concept for the idea. It turns the various bits of search_simplify into filters. I then put them together into an input format by hand (thus the hardcoded "3" in the call to check_markup).
TODO:
- Need to come up with an admin interface for filters and input formats that separates the search related filters from the normal node related filters. Some filters may have utility in both spaces, so this should be possible too.
- Need to figure out how to specify which input format is needed for search.
Comment #2
BlakeLucchesi commented
Context-aware filters, for reference: http://drupal.org/node/226963. It currently doesn't apply cleanly to core, so it will have to be rewritten if this is still something to pursue.
Comment #3
BlakeLucchesi commented
Re-rolled with updates to search.test that provide test coverage for search_simplify, and added an admin interface in search.admin.inc that lets administrators choose a different input format for processing the text being indexed.
Comment #4
BlakeLucchesi commented
OK, this has been re-rolled without the search.test file, because we are trying to get that added separately and then prove that this new patch does not break the previous search_simplify implementation.
Test coverage is now located here: http://drupal.org/node/257033
Comment #5
sun.core commented
Proper component.
And. This is a won't fix for me.
You would need to duplicate each existing input format instead. So the approach is wrong.
Comment #6
robertdouglass commented
sun.core - while the original idea was to use the existing filter system, it wouldn't necessarily have to be within the same UI as core. The most important part is that there are chainable filters available for both indexing and search query parsing, and that they can be extended.
Comment #7
sun commented
The idea of using filters is interesting. However, my point is that you cannot use a single input format for all content, because each piece of content has to use the input format it was input/stored in. So, to do this, you would have to "attach" those search filter settings to each input format (instead of introducing a new, dedicated one).
Comment #8
robertdouglass commented
Yeah. Actually, the content gets rendered with its normal input formats before even going into the indexer. My idea was that the filter chain would get applied to the tokenized text during indexing. This would allow stemming, punctuation handling, synonym additions, and any other lexical changes to be made. Right now it's all hard-coded. Then, the same filter chain is used on the tokenized search string to pre-process it for matching. It would allow people to set up totally new chains for stuff like stemming, or handling numbers, or for building a source code search engine.
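A token-level chain of this kind could look roughly like the following Python sketch (not Drupal code). Each filter receives a list of tokens and returns a new list, so a filter can rewrite, add, or drop tokens. All names here, and the one-entry synonym table, are hypothetical.

```python
from typing import Callable, Dict, List

# A token filter maps a list of tokens to a new list of tokens.
TokenFilter = Callable[[List[str]], List[str]]


def lowercase_tokens(tokens: List[str]) -> List[str]:
    return [t.lower() for t in tokens]


def make_synonym_filter(synonyms: Dict[str, List[str]]) -> TokenFilter:
    # Build a filter that keeps each token and appends its synonyms, if any.
    def add_synonyms(tokens: List[str]) -> List[str]:
        out: List[str] = []
        for t in tokens:
            out.append(t)
            out.extend(synonyms.get(t, []))
        return out

    return add_synonyms


def drop_numbers(tokens: List[str]) -> List[str]:
    # One possible policy for "handling numbers": discard purely numeric tokens.
    return [t for t in tokens if not t.isdigit()]


def run_chain(tokens: List[str], chain: List[TokenFilter]) -> List[str]:
    for f in chain:
        tokens = f(tokens)
    return tokens


chain: List[TokenFilter] = [
    lowercase_tokens,
    make_synonym_filter({"car": ["automobile"]}),  # hypothetical synonym table
    drop_numbers,
]

# The same chain runs on tokenized content and on the tokenized search string.
index_tokens = run_chain("The Car costs 9000".split(), chain)
query_tokens = run_chain("CAR".split(), chain)
```

Order matters in this sketch: the synonym lookup only matches because `lowercase_tokens` runs first. That is the kind of choice a configurable chain would leave to the site.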
Comment #9
robertdouglass commented
Comment #10
sun commented
Comment #11
jhodgdon commented
I'm revisiting old Search module issues... This still seems like kind of a reasonable idea, but I'm not convinced that making it use regular text filters would be worth the trouble we'd have in trying to keep the search filters out of the regular filters/formats pages. Also, text filters have all kinds of special behavior, like figuring out what HTML tags they support for purposes of WYSIWYG editors, etc. It just seems like we probably would want our own plugin type instead.
If we do this, then the issue becomes actually "Convert hook_search_preprocess() into a plugin", right?
Comment #12
jhodgdon commented
Comment #13
jhodgdon commented
It is too late and too disruptive to do this for 8.0.x. It would be pretty difficult to maintain backwards compatibility and have both the hook and the plugin available... so this seems like 9.0 material.
Comment #14
catch commented
While BC would be difficult, it might be worth it to have forwards compatibility with 9.x, so moving back to a minor version for now.
Comment #22
andypost commented
Comment #25
andypost commented
Now that #3075703: Move search text processing to a service has been committed, I think this could be closed.
Comment #26
andypost commented
The remaining work has its own issue: #2552497: [PP-1] Convert search_excerpt() to a filter