Use a filter chain for processing tokens during indexing and searching

robertDouglass - May 10, 2008 - 19:39
Project: Drupal
Version: 8.x-dev
Component: search.module
Category: feature request
Priority: normal
Assigned: Unassigned
Status: needs work
Issue tags: FilterSystemRevamp
Description

When building a search engine, text processing happens at two stages. At indexing time, text is tokenized (broken into small units) and analyzed (removing punctuation, lowercasing, and Porter stemming are three examples). Then, at search time, the keywords being searched for undergo the same processing so that they match the form of the text in the index.
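
To illustrate why both stages need the same processing, here is a minimal Python sketch (not Drupal code, and not what search_simplify actually does). The names analyze, build_index, and search are hypothetical, and the "stemmer" is a toy that only drops a trailing "s", standing in for a real Porter stemmer.

```python
import re


def analyze(text):
    """Lowercase, replace punctuation with spaces, tokenize, crudely stem."""
    text = text.lower()
    text = re.sub(r"[^\w\s]", " ", text)
    tokens = text.split()
    # Toy stand-in for a real stemmer: drop a trailing "s" from longer words.
    return [t[:-1] if t.endswith("s") and len(t) > 3 else t for t in tokens]


def build_index(docs):
    """Map each analyzed token to the set of document ids containing it."""
    index = {}
    for doc_id, text in docs.items():
        for token in analyze(text):
            index.setdefault(token, set()).add(doc_id)
    return index


def search(index, query):
    """Analyze the query the same way, then intersect the posting sets."""
    results = None
    for token in analyze(query):
        ids = index.get(token, set())
        results = ids if results is None else results & ids
    return results or set()


docs = {1: "Filters, chained!", 2: "Search modules"}
index = build_index(docs)
print(search(index, "FILTER"))
```

Because the query "FILTER" passes through the same analyze step as the indexed text, it is reduced to the same token as "Filters," and the lookup succeeds. If the query skipped that step, the raw keyword would not match anything in the index.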

In Drupal (up to 6), this processing is hardcoded in a function called search_simplify. As a result, the processing is a "best guess" at what people actually need, and it causes problems when its assumptions don't match those needs. Some Asian languages, for example, need different processing.

The proposal is to replace the processing in search_simplify with a call to check_markup that uses a search-specific input format. The filters would handle all of the analysis. The administrator could control this input format just as input formats are controlled now, so a site with special needs could use different filters in a different order, with different configurations.

#1

robertDouglass - May 10, 2008 - 19:43

The patch is a proof of concept for the idea. It makes the various bits of search_simplify into filters. I then went and put them together into an input format by hand (thus the hardcoded "3" in the call to check_markup).

TODO:
- Need to come up with an admin interface for filters and input formats that separates the search-related filters from the normal node-related filters. Some filters may have utility in both spaces, so this should be possible too.
- Need to figure out how to specify which input format is needed for search.

Attachment: search-input-formats.patch (5.07 KB)

#2

BlakeLucchesi - May 10, 2008 - 19:43

Context-aware filters, for reference: http://drupal.org/node/226963. It currently doesn't apply cleanly to core, so it will have to be rewritten if this is still something to pursue.

#3

BlakeLucchesi - May 10, 2008 - 21:11

Re-rolled with updates to search.test that provide test coverage for search_simplify, plus an admin interface in search.admin.inc that allows administrators to choose a different input format for processing the text being indexed.

Attachment: search_input_filter.patch (10.39 KB)

#4

BlakeLucchesi - May 10, 2008 - 21:38

OK, this has been re-rolled without the search.test file, because we are trying to get that added separately and then prove that this new patch does not break the previous search_simplify implementation.

Test coverage is now located here: http://drupal.org/node/257033

Attachment: search_input_filter.patch (8.54 KB)

#5

sun.core - July 8, 2009 - 23:22
Component: search.module » filter.module
Status: active » needs work

Proper component.

And. This is a won't fix for me.

You would need to duplicate each existing input format instead. So the approach is wrong.

#6

robertDouglass - July 9, 2009 - 10:30

sun.core - while the original idea was to use the existing filter system, it wouldn't necessarily have to be within the same UI as core. The most important part is that there are chainable filters available for both indexing and search query parsing, and that they can be extended.

#7

sun - July 10, 2009 - 20:06
Component: filter.module » search.module

The idea of using filters is interesting. However, my point is that you cannot use a single input format for all content, because each piece of content has to use the input format it was entered and stored with. So, to do this, you would have to "attach" those search filter settings to each input format (instead of introducing a new, dedicated one).

#8

robertDouglass - July 10, 2009 - 21:21

Yeah. Actually, the content gets rendered with its normal input formats before even going into the indexer. My idea was that the filter chain would get applied to the tokenized text during indexing. This would allow stemming, punctuation handling, synonym additions, and any other lexical changes to be made. Right now it's all hardcoded. Then, the same filter chain is used on the tokenized search string to pre-process it for matching. It would allow people to set up totally new chains for stuff like stemming, or handling numbers, or for building a source code search engine.
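
A rough sketch of what such a token-level chain could look like, written in Python purely for illustration (this is not the patch and not Drupal's filter API). Every name here (strip_punctuation, lowercase, make_synonym_filter, run_chain) is hypothetical. Each filter takes a list of tokens and returns a new list, and the chain is an ordered list that a site could rearrange or extend.

```python
def strip_punctuation(tokens):
    """Trim leading/trailing punctuation and drop tokens that become empty."""
    cleaned = (t.strip(".,;:!?\"'()") for t in tokens)
    return [t for t in cleaned if t]


def lowercase(tokens):
    """Lowercase every token."""
    return [t.lower() for t in tokens]


def make_synonym_filter(synonyms):
    """Build a filter that emits each token followed by its synonyms."""
    def add_synonyms(tokens):
        out = []
        for token in tokens:
            out.append(token)
            out.extend(synonyms.get(token, []))
        return out
    return add_synonyms


def run_chain(tokens, chain):
    """Apply each filter in order, feeding its output to the next one."""
    for token_filter in chain:
        tokens = token_filter(tokens)
    return tokens


# Assumed example configuration: the order matters and is site-specific.
chain = [
    strip_punctuation,
    lowercase,
    make_synonym_filter({"car": ["automobile"]}),
]

indexed_tokens = run_chain("The Car, parked.".split(), chain)
query_tokens = run_chain("CAR".split(), chain)
print(indexed_tokens)
print(query_tokens)
```

The same chain object is used for both the indexed text and the search string, so swapping in a different stemmer, a number handler, or a tokenizer suited to source code would change both sides consistently.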

#9

robertDouglass - July 10, 2009 - 21:21
Title: Use input formats for search query/index processing » Use a filter chain for processing tokens during indexing and searching

#10

sun - September 10, 2009 - 17:03
Version: 7.x-dev » 8.x-dev
 
 
