Problem/Motivation

While investigating #1964428: Search module should keep language information on snippet content., I noticed that search_simplify is being called with a $langcode argument during node indexing, but then during searching it is called without $langcode when it is used to preprocess the submitted keywords from the search form. Comment #1 has the details of the function call structure.

This is a problem for sites using language-aware search preprocessing, such as stemming modules. You need to have the submitted search keywords and the text in the search index preprocessed the same way in order to have successful matching. So for example, if an English stemmer module is saying "Only process English text", it will operate on the indexing, but not on the submitted keywords, and they will not match.

Proposed resolution

The proposed fix is just to update the documentation. The only thing a preprocessing module could do is get the language from the current environment, and we should let them make that decision rather than having the Search module decide for them. But we need to update some API docs to make this clear.

Remaining tasks

Documentation patch.

User interface changes

No.

API changes

No.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

jhodgdon’s picture

Issue summary: View changes

First we need to establish that this is actually a problem.

So, we have this function search_simplify() in search.module, which takes a $langcode argument, and that is what eventually calls hook_search_preprocess with the $langcode argument ($langcode is not currently otherwise used in search_simplify()). The idea is that both during indexing and during search keyword submit, you should do the same preprocessing (called "simplify" for historical reasons) so that the search will find matches.

So, during indexing: NodeSearch plugin finds node translations and calls search_index(), passing in the language, for each one. This calls search_index_split(), passing $langcode, and that calls search_simplify(), passing in $langcode. So far, so good. #1964428: Search module should keep language information on snippet content. may modify this slightly.

During searching: SearchController gets the submitted keywords, and calls the setSearch() method on the active search plugin. The base method (not overridden by NodeSearch plugin) saves the keywords and request attributes on the plugin, which might or might not include a language? Not sure.

Then during execute() on the plugin, NodeSearch actually does the work of searching in its findResults() method. It passes the keywords to the SearchQuery extender class by calling its searchExpression() method. Only the submitted keyword string is passed in there.

SearchQuery::parseSearchExpression() is what parses this. But it just breaks things into words and calls search_simplify() without a $langcode parameter, since it has no idea about the language.

So. This does seem to be a problem. Updating summary. Next will think about how to fix it properly.

jhodgdon’s picture

It seems to me there are two possible ways to solve this:

a) Documentation: Document in hook_search_preprocess() that if $langcode is NULL, this is user input in an unknown language, and let modules decide what to do. They can presumably (??how??) figure out some default langcode to use and act accordingly.

b) Change SearchQuery::setSearch() to take a $langcode argument, and have it pass this into search_simplify(). Then we'd have to figure out what language NodeSearch should give it, or maybe have SearchController make the decision. Based on what? I'm not sure, but the same question as "act accordingly" in (a) except we would have the Search system make the decision instead of leaving it up to the individual preprocessing module.

As a note, it is only hook_search_preprocess that is language-dependent. The other parts of search_simplify(), at least so far, are not.

Gábor Hojtsy’s picture

I don't see anything in search_invoke_preprocess() that would use a langcode. It passes on that it did not get an explicit langcode to search_invoke_preprocess(). The services invoked there can then assume LanguageManager::getCurrentlanguage() for the language if they do something language specific. I think its better if we don't obscure the fact that there is no explicit language but the services need to assume, no?

jhodgdon’s picture

Yes, the only use is in the hook implementations.

I think I agree with this. (a) is the thing to do -- let the hook implementations decide.

In which case, I think all we need to do here is improve the documentation. I can make a quick patch. I noticed there is not much stated in the current docs on any of these that says that $langcode could be null or what to do about it.

Gábor Hojtsy’s picture

So looks like the docs need changing:

 * @param $langcode
 *   The language code of the entity that has been found.
 *
 * @return
 *   The text after preprocessing. Note that if your module decides not to
 *   alter the text, it should return the original text. Also, after
 *   preprocessing, words in the text should be separated by a space.
 *
 * @ingroup search
 */
function hook_search_preprocess($text, $langcode = NULL) {

The langcode docs are not good, it does not say this is used for query. The $text docs do say that. The NULL value is not dealt with in the docs or the example.

jhodgdon’s picture

Assigned: Unassigned » jhodgdon

Agreed, docs! I will make a patch.

jhodgdon’s picture

Title: Language is not set on search keyword preprocessing » Document that language is not set on search keyword preprocessing
Issue summary: View changes
jhodgdon’s picture

Status: Active » Needs review
FileSize
6.42 KB

OK, here's a patch that fixes up the docs headers for all the functions/hooks in search.module and search.api.php that have $langcode parameters. I think it makes it clearer when/why this would be unknown in the hook. Review welcome!

Gábor Hojtsy’s picture

Status: Needs review » Needs work
  1. +++ b/core/modules/search/search.api.php
    @@ -14,7 +14,8 @@
    + * needs to be applied to both so that searches will return results.
    

    "so that searches will return results"? they would not otherwise?

  2. +++ b/core/modules/search/search.api.php
    @@ -22,14 +23,23 @@
    + *   implementation can call the getCurrentLanguage() method on the
    + *   'language_manager' service to determine the current language and act
    + *   accordingly.
    

    What about doing this in the sample code too so 'en' is retrieved dynamically too.

jhodgdon’s picture

Status: Needs work » Needs review
FileSize
7.03 KB
1.2 KB

Good ideas. So yeah, the search will not find results if the same preprocessing is not applied to both keywords and indexed text. Reworded slightly. And added sample code.

Gábor Hojtsy’s picture

Status: Needs review » Reviewed & tested by the community
Issue tags: +sprint

Looks fine with me.

Gábor Hojtsy’s picture

Assigned: jhodgdon » Unassigned

Deassign from jhodgdon because she does not qualify to commit it.

alexpott’s picture

Status: Reviewed & tested by the community » Fixed

This issue is a normal bug fix, and doesn't include any disruptive changes, so it is allowed per https://www.drupal.org/core/beta-changes. Committed f4a9d47 and pushed to 8.0.x. Thanks!

  • alexpott committed f4a9d47 on 8.0.x
    Issue #2459325 by jhodgdon: Document that language is not set on search...
Gábor Hojtsy’s picture

Issue tags: -sprint

Superb, thanks!

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.