Problem/Motivation
While investigating #1964428: Search module should keep language information on snippet content., I noticed that search_simplify is being called with a $langcode argument during node indexing, but then during searching it is called without $langcode when it is used to preprocess the submitted keywords from the search form. Comment #1 has the details of the function call structure.
This is a problem for sites using language-aware search preprocessing, such as stemming modules. You need to have the submitted search keywords and the text in the search index preprocessed the same way in order to have successful matching. So for example, if an English stemmer module is saying "Only process English text", it will operate on the indexing, but not on the submitted keywords, and they will not match.
Proposed resolution
The proposed fix is just to update the documentation. The only thing a preprocessing module could do is get the language from the current environment, and we should let them make that decision rather than having the Search module decide for them. But we need to update some API docs to make this clear.
Remaining tasks
Documentation patch.
User interface changes
No.
API changes
No.
Comment | File | Size | Author |
---|---|---|---|
#10 | interdiff.txt | 1.2 KB | jhodgdon |
#10 | 2459325-fix-docs-10.patch | 7.03 KB | jhodgdon |
Comments
Comment #1
jhodgdonFirst we need to establish that this is actually a problem.
So, we have this function search_simplify() in search.module, which takes a $langcode argument, and that is what eventually calls hook_search_preprocess with the $langcode argument ($langcode is not currently otherwise used in search_simplify()). The idea is that both during indexing and during search keyword submit, you should do the same preprocessing (called "simplify" for historical reasons) so that the search will find matches.
So, during indexing: NodeSearch plugin finds node translations and calls search_index(), passing in the language, for each one. This calls search_index_split(), passing $langcode, and that calls search_simplify(), passing in $langcode. So far, so good. #1964428: Search module should keep language information on snippet content. may modify this slightly.
During searching: SearchController gets the submitted keywords, and calls the setSearch() method on the active search plugin. The base method (not overridden by NodeSearch plugin) saves the keywords and request attributes on the plugin, which might or might not include a language? Not sure.
Then during execute() on the plugin, NodeSearch actually does the work of searching in its findResults() method. It passes the keywords to the SearchQuery extender class by calling its searchExpression() method. Only the submitted keyword string is passed in there.
SearchQuery::parseSearchExpression() is what parses this. But it just breaks things into words and calls search_simplify() without a $langcode parameter, since it has no idea about the language.
So. This does seem to be a problem. Updating summary. Next will think about how to fix it properly.
Comment #2
jhodgdonIt seems to me there are two possible ways to solve this:
a) Documentation: Document in hook_search_preprocess() that if $langcode is NULL, this is user input in an unknown language, and let modules decide what to do. They can presumably (??how??) figure out some default langcode to use and act accordingly.
b) Change SearchQuery::setSearch() to take a $langcode argument, and have it pass this into search_simplify(). Then we'd have to figure out what language NodeSearch should give it, or maybe have SearchController make the decision. Based on what? I'm not sure, but the same question as "act accordingly" in (a) except we would have the Search system make the decision instead of leaving it up to the individual preprocessing module.
As a note, it is only hook_search_preprocess that is language-dependent. The other parts of search_simplify(), at least so far, are not.
Comment #3
Gábor HojtsyI don't see anything in search_invoke_preprocess() that would use a langcode. It passes on that it did not get an explicit langcode to search_invoke_preprocess(). The services invoked there can then assume LanguageManager::getCurrentlanguage() for the language if they do something language specific. I think its better if we don't obscure the fact that there is no explicit language but the services need to assume, no?
Comment #4
jhodgdonYes, the only use is in the hook implementations.
I think I agree with this. (a) is the thing to do -- let the hook implementations decide.
In which case, I think all we need to do here is improve the documentation. I can make a quick patch. I noticed there is not much stated in the current docs on any of these that says that $langcode could be null or what to do about it.
Comment #5
Gábor HojtsySo looks like the docs need changing:
The langcode docs are not good, it does not say this is used for query. The $text docs do say that. The NULL value is not dealt with in the docs or the example.
Comment #6
jhodgdonAgreed, docs! I will make a patch.
Comment #7
jhodgdonComment #8
jhodgdonOK, here's a patch that fixes up the docs headers for all the functions/hooks in search.module and search.api.php that have $langcode parameters. I think it makes it clearer when/why this would be unknown in the hook. Review welcome!
Comment #9
Gábor Hojtsy"so that searches will return results"? they would not otherwise?
What about doing this in the sample code too so 'en' is retrieved dynamically too.
Comment #10
jhodgdonGood ideas. So yeah, the search will not find results if the same preprocessing is not applied to both keywords and indexed text. Reworded slightly. And added sample code.
Comment #11
Gábor HojtsyLooks fine with me.
Comment #12
Gábor HojtsyDeassign from jhodgdon because she does not qualify to commit it.
Comment #13
alexpottThis issue is a normal bug fix, and doesn't include any disruptive changes, so it is allowed per https://www.drupal.org/core/beta-changes. Committed f4a9d47 and pushed to 8.0.x. Thanks!
Comment #15
Gábor HojtsySuperb, thanks!