When building a search engine, text processing happens at two stages. At indexing time, text is tokenized (broken into small units) and analyzed (for example, punctuation is removed, text is lowercased, and words are Porter-stemmed). Then, at search time, the keywords being searched for undergo the same processing so that they match the form of the text in the index.
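
To make the two stages concrete, here is a minimal sketch in Python (not Drupal code). The names `analyze`, `crude_stem`, `build_index`, and `search` are made up for illustration, and the stemmer is a naive suffix stripper standing in for a real Porter stemmer. The point is only that the same `analyze` function runs at indexing time and at search time.

```python
import re


def crude_stem(token):
    # Naive suffix stripping; a stand-in for a real Porter stemmer.
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    # Tokenize: lowercase, then keep runs of letters and digits,
    # which also discards punctuation.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(token) for token in tokens]


def build_index(docs):
    # Indexing time: map each analyzed term to the documents containing it.
    index = {}
    for doc_id, text in docs.items():
        for term in analyze(text):
            index.setdefault(term, set()).add(doc_id)
    return index


def search(index, query):
    # Search time: run the query through the same analysis, then
    # return the documents that contain every query term.
    terms = analyze(query)
    if not terms:
        return set()
    return set.intersection(*(index.get(term, set()) for term in terms))


docs = {1: "Indexing pages, quickly!", 2: "The cat jumped."}
index = build_index(docs)
print(search(index, "PAGE indexed"))
```

Because both sides go through `analyze`, a query for "PAGE indexed" matches a document containing "Indexing pages" even though the surface forms differ.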

In Drupal (up to 6), these steps are hardcoded in a function called search_simplify. As a result, the processing is a "best guess" at what people actually need, and it causes problems when its assumptions don't match a site's needs. Some Asian languages need different processing, for example.

The proposal is to replace the processing in search_simplify with a call to check_markup that uses a search-specific input format. The filters would handle all of the analysis. The input format could be controlled by the administrator just like input formats are now, so a site with special needs could use different filters in a different order, with different configurations.
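
As a rough illustration of the idea (in Python, not Drupal's filter API), a format can be modelled as an ordered list of filter names with per-filter settings. Everything here is hypothetical: the `FILTERS` registry, the filter names, the `replacement` setting, and `apply_format` are assumptions made for the sketch, not existing functions.

```python
import re


def lowercase(text, settings):
    return text.lower()


def strip_punctuation(text, settings):
    # Replace anything that is not a word character or whitespace.
    return re.sub(r"[^\w\s]", settings.get("replacement", " "), text)


def collapse_whitespace(text, settings):
    return " ".join(text.split())


# Hypothetical registry of available filters, keyed by name.
FILTERS = {
    "lowercase": lowercase,
    "strip_punctuation": strip_punctuation,
    "collapse_whitespace": collapse_whitespace,
}


def apply_format(text, fmt):
    # A format is an ordered list of (filter name, settings) pairs.
    for name, settings in fmt:
        text = FILTERS[name](text, settings)
    return text


# A general-purpose format, roughly what a hardcoded function might do.
DEFAULT_SEARCH_FORMAT = [
    ("lowercase", {}),
    ("strip_punctuation", {"replacement": " "}),
    ("collapse_whitespace", {}),
]

# A site with special needs (say, source code search) keeps case and punctuation.
CODE_SEARCH_FORMAT = [("collapse_whitespace", {})]

print(apply_format("Hello,  World!", DEFAULT_SEARCH_FORMAT))
print(apply_format("foo_bar()  Baz", CODE_SEARCH_FORMAT))
```

Swapping the format changes the analysis without touching the code that calls `apply_format`, which is the property the proposal is after.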

Comments

robertdouglass’s picture

Attached file: new, 5.07 KB

The patch is a proof of concept for the idea. It makes the various bits of search_simplify into filters. I then went and put them together into an input format by hand (thus the hardcoded "3" in the call to check_markup).

TODO:
- Need to come up with an admin interface for filters and input formats that separates the search related filters from the normal node related filters. Some filters may have utility in both spaces, so this should be possible too.
- Need to figure out how to specify which input format is needed for search.

BlakeLucchesi’s picture

Context-aware filters, for reference: http://drupal.org/node/226963. That patch currently doesn't apply cleanly to core, so it will have to be rewritten if this is still something to pursue.

BlakeLucchesi’s picture

Attached file: new, 10.39 KB

Re-rolled with updates to search.test that provide test coverage for search_simplify, and added an admin interface in search.admin.inc that allows administrators to choose a different input format for processing the text being indexed.

BlakeLucchesi’s picture

Attached file: new, 8.54 KB

OK, this has been re-rolled without the search.test file because we are trying to get that added separately, and then prove that this new patch does not break the previous search_simplify implementation.

Test coverage is now located here: http://drupal.org/node/257033

sun.core’s picture

Component: search.module » filter.module
Status: Active » Needs work

Proper component.

And this is a won't fix for me.

You would need to duplicate each existing input format instead. So the approach is wrong.

robertdouglass’s picture

sun.core - while the original idea was to use the existing filter system, it wouldn't necessarily have to be within the same UI as core. The most important part is that there are chainable filters available for both indexing and search query parsing, and that they can be extended.

sun’s picture

Component: filter.module » search.module

The idea of using filters is interesting. However, my point is that you cannot use a single input format for all content, because each piece of content has to use the input format it was entered and stored in. So, to do this, you would have to "attach" those search filter settings to each input format (instead of introducing a new, dedicated one).

robertdouglass’s picture

Yeah. Actually, the content gets rendered with its normal input formats before even going into the indexer. My idea was that the filter chain would get applied to the tokenized text during indexing. This would allow stemming, punctuation handling, synonym additions, and any other lexical changes to be made. Right now it's all hardcoded. Then, the same filter chain is used on the tokenized search string to pre-process it for matching. It would allow people to set up totally new chains for stuff like stemming, or handling numbers, or for building a source code search engine.
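
A minimal sketch of such a token-level chain, written in Python rather than as Drupal code. The names `run_chain`, `lowercase_tokens`, `drop_punctuation`, and `make_synonym_filter` are hypothetical, and the whitespace split stands in for a real tokenizer. Each filter takes a list of tokens and returns a list of tokens, so a filter can change, drop, or add tokens.

```python
def lowercase_tokens(tokens):
    return [token.lower() for token in tokens]


def drop_punctuation(tokens):
    # Keep only letters and digits; discard tokens that end up empty.
    cleaned = ("".join(ch for ch in token if ch.isalnum()) for token in tokens)
    return [token for token in cleaned if token]


def make_synonym_filter(synonyms):
    # Returns a filter that adds synonyms after each matching token.
    def add_synonyms(tokens):
        out = []
        for token in tokens:
            out.append(token)
            out.extend(synonyms.get(token, []))
        return out

    return add_synonyms


def run_chain(tokens, chain):
    # Apply each filter in order; the output of one feeds the next.
    for token_filter in chain:
        tokens = token_filter(tokens)
    return tokens


chain = [
    lowercase_tokens,
    drop_punctuation,
    make_synonym_filter({"car": ["automobile"]}),
]

# The same chain runs on the tokenized content and on the tokenized query.
indexed = run_chain("My Car, repaired.".split(), chain)
query = run_chain("CAR".split(), chain)
print(indexed)
print(query)
```

A different site could build a different `chain` (for example, one that leaves punctuation alone for source code search) without changing `run_chain` or the indexer.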

robertdouglass’s picture

Title: Use input formats for search query/index processing » Use a filter chain for processing tokens during indexing and searching

sun’s picture

Version: 7.x-dev » 8.x-dev
Issue tags: +FilterSystemRevamp

jhodgdon’s picture

Title: Use a filter chain for processing tokens during indexing and searching » Convert hook_search_preprocess() into a plugin
Category: feature » task

I'm revisiting old Search module issues... This still seems like kind of a reasonable idea, but I'm not convinced that making it use regular text filters would be worth the trouble we'd have in trying to keep the search filters out of the regular filters/formats pages. Also, text filters have all kinds of special behavior, like figuring out what HTML tags they support for purposes of WYSIWYG editors, etc. It just seems like we probably would want our own plugin type instead.

If we do this, then the issue becomes actually "Convert hook_search_preprocess() into a plugin", right?

jhodgdon’s picture

Issue summary: View changes
Issue tags: +beta target

jhodgdon’s picture

Version: 8.0.x-dev » 9.x-dev
Issue tags: -beta target

It is too late and too disruptive to do this for 8.0.x. It would be pretty difficult to maintain backwards compatibility and have both the hook and the plugin available... so this seems like 9.0 material.

catch’s picture

Version: 9.x-dev » 8.1.x-dev
Status: Needs work » Postponed

While bc would be difficult, it might be worth it to have forwards compatibility with 9.x, so moving back to a minor version for now.

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

andypost’s picture

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

Version: 8.9.x-dev » 9.1.x-dev

Drupal 8.9.0-beta1 was released on March 20, 2020. 8.9.x is the final, long-term support (LTS) minor release of Drupal 8, which means new developments and disruptive changes should now be targeted against the 9.1.x-dev branch. For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

andypost’s picture

Since #3075703: Move search text processing to a service has been committed, I think this could be closed.

andypost’s picture

Status: Needs work » Closed (outdated)