Only default language is indexed for search

Frank Steiner - January 19, 2009 - 15:55
Project:Language Sections
Version:6.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:Unassigned
Status:needs work
Description

Hi,

If the default language of a Drupal installation is e.g. English, and you add something like

=en=
Hello
=de=
Hallo
=qq=
in a body, the string "Hallo" will not be found when searching through the site.

As far as I can see, that's because Drupal builds the node body for the search index with

  $node->build_mode = NODE_BUILD_SEARCH_INDEX;
  $node = node_build_content($node, FALSE, FALSE);
...

and in node_build_content the filters are applied via check_markup, so all content for the non-default languages is lost.
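
To make the effect concrete, here is a minimal stand-alone sketch of what a language-sections-style filter does to the example body. The function name ls_demo_filter is made up for illustration and this is not the module's actual code. It also assumes that "qq" marks text shown for every language, which the thread does not spell out.

```php
<?php
// Hypothetical stand-in for the Language Sections filter, for illustration only.
function ls_demo_filter($text, $current_lang) {
  // Split on marker lines such as "=en=" and keep the captured language codes.
  $parts = preg_split('/^=([a-z]{2})=[ \t]*$/m', $text, -1, PREG_SPLIT_DELIM_CAPTURE);
  // $parts[0] is the text before the first marker, then: code, text, code, text, ...
  $out = $parts[0];
  for ($i = 1; $i + 1 < count($parts); $i += 2) {
    $code = $parts[$i];
    // Assumption: "qq" means "all languages".
    if ($code == $current_lang || $code == 'qq') {
      $out .= $parts[$i + 1];
    }
  }
  return trim($out);
}

$body = "=en=\nHello\n=de=\nHallo\n=qq=\n";
// Indexing runs with the default language, so only the English section survives.
$indexed = ls_demo_filter($body, 'en');
```

Since the search index only ever sees the filtered result, "Hallo" never reaches it when the default language is English.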

I'm thinking about a solution. Maybe it is possible to use hook_nodeapi in language_sections to reload the node body if $node->build_mode is NODE_BUILD_SEARCH_INDEX, and apply all filters except the language sections filter. But I'm not sure yet.

Has anyone ever thought about this problem? Any ideas or proposals for how we could get around this?

#1

netgenius - January 19, 2009 - 16:06
Category:bug report» feature request

I've changed this to "feature request" as I don't really think this counts as a bug in LS as such.

We would need to understand better how the built-in search facility works in a multi-language environment. I for one don't claim to understand it. At a guess, LS might be able to detect that the cron indexing process is running, and so behave differently, presumably presenting *all* text (*all* languages) for indexing. Need input here from folks who understand how that stuff (the indexing) works. I don't have time to go investigating, sorry.

#2

Frank Steiner - January 25, 2009 - 22:38
  • With multi-language support, every page in every language is indexed, and you find any page no matter which language you are currently using when viewing the search page. I.e. you find English pages when searching for English words even if your current language is German.
    When you click on the search result you will be switched to the right language.
  • With i18n things are different, as you can choose which content you want to see depending on the current language. If you select e.g. "Current language and no language" and your current language is German, you will not find any content of English pages, as i18n rewrites all the queries to match the current language (or language-neutral) when selecting nodes.
  • As far as I understand the code, Drupal doesn't tag anything with a language in the search index.

    For language sections this would mean that we would find pages with English content when searching for it, but when clicking on the result, we would see the German content if German was our current language.
    I'm not sure if we can manipulate search results to switch to the language in which the string was found (we could figure that out). We have hooks, but I'm not sure how to fetch the correct URL information, as there can be different situations (language prefix in the path or in the domain etc.). I feel this should be doable.

  • To get all languages indexed we need just one little change:
    function language_sections_filter($op, $delta = 0, $format = -1, $text = '') {
      ...
      switch ($op) {
        case 'process':
          if (request_uri() == '/cron.php') return $text;

    Using request_uri was the only way I found to figure out that we are being called from the cron script. Everything else (cron_semaphore etc.) would conflict with normal page loading, especially if a cron run gets stuck.
    We could also use nodeapi('update index'). But that would match only nodes, and it would be more complicated to figure out which parts to add due to other filters. We could e.g. remove the separators and call check_markup again so that PHP code etc. is removed.
    Just checking the request URI works for all types of text, not only nodes, and is considerably simpler.

  • We would still need to hook into manually called indexing (e.g. the biblio module indexes pages when inserting/updating them). Here I'm not sure how to do this efficiently. We can't hook into search_index, but any module could call it. We could capture nodeapi by using some global flag or something, but I think we won't find a clean solution. So we might have to accept that multilanguage content is only indexed during cron runs.

The cron.php patch is working fine for my site. I could try to alter the search result links for changing to the language of the result, but first I'd like to hear your opinion on all this.

#3

stratosgear - August 14, 2009 - 21:15

subscribing

#4

netgenius - September 22, 2009 - 16:12
Status:active» needs work

I've finally got round to going through the issue queue :) Thanks to Frank for that info and thought. It occurs to me that simply testing that cron.php is running might have unwanted side effects, i.e. some code could be running (other than search indexing) which needs to see the language-specific text. So, if this feature were implemented, it probably needs to be a configurable option.
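
As a rough illustration of the "configurable option" idea, the cron.php test from #2 could be gated behind a setting that is off by default. This is only a sketch: the helper name ls_demo_bypass_filter is hypothetical, and it is written as a plain function so it can be tried outside Drupal.

```php
<?php
// Hypothetical helper, not part of Language Sections: decides whether the
// filter should return the text untouched so that every language is indexed.
function ls_demo_bypass_filter($request_uri, $index_all_enabled) {
  // Off by default, so other cron-time code still sees language-specific text.
  if (!$index_all_enabled) {
    return FALSE;
  }
  // Same comparison as the cron.php patch in #2.
  return $request_uri == '/cron.php';
}
```

Inside the module, the call might look like ls_demo_bypass_filter(request_uri(), variable_get('language_sections_index_all', FALSE)), where the variable name 'language_sections_index_all' is an assumption made up for this example.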

It seems to me that using hook_nodeapi and checking for $op == "update index" may provide a solution: "The node is being indexed. If you want additional information to be indexed which is not already visible through nodeapi "view", then you should return it here." As Frank points out, there are limitations and potential problems, but the advantage is that it should work with any type of search module, not just the standard Drupal search. Further input welcome.

#5

Frank Steiner - October 8, 2009 - 10:51

I see the problems with the cron check. Surely the nodeapi way is better; I'm just not sure how to build the node body. I guess we would need to refetch the whole body and apply all the filters that are usually applied, except for the language sections filter. Would that be right?

#6

netgenius - October 8, 2009 - 18:21

Applying filters is normally done by node_prepare - I suppose that node_prepare is already called during the indexing process, before the actual index updating is done - that must be the case otherwise the indexing would never see filtered text. So I think all we need to do is:

1. Use hook_nodeapi $op == "update index" to set a flag for LS.
2. In LS standard processing (will be called by node_prepare), check for the flag and if it is set do no filtering.
3. Clear the flag.

Easy!?

#7

Frank Steiner - October 8, 2009 - 20:41

I'm not sure it's that easy because of the following code in _node_index_node:

  // Build the node body.
  $node->build_mode = NODE_BUILD_SEARCH_INDEX;
  $node = node_build_content($node, FALSE, FALSE);
  $node->body = drupal_render($node->content);

  $text = '<h1>'. check_plain($node->title) .'</h1>'. $node->body;

  // Fetch extra data normally not visible
  $extra = node_invoke_nodeapi($node, 'update index');

Thus, node_prepare will be called (from node_build_content) before nodeapi('update index'), so we cannot set the flag.
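
The ordering problem can be reproduced with a small stand-alone simulation. None of the names below are Drupal or Language Sections code; they are hypothetical stand-ins that only mirror the order of calls quoted from _node_index_node.

```php
<?php
// Stand-alone simulation of the call order; all names are hypothetical.
class LsDemo {
  public static $indexing = FALSE;   // the proposed "we are indexing" flag
  public static $log = array();      // records what each step observed
}

// Stands in for the filter's 'process' step, reached via node_build_content().
function ls_demo_process($text) {
  LsDemo::$log[] = LsDemo::$indexing ? 'filter: flag set' : 'filter: flag not set';
  return $text;
}

// Stands in for hook_nodeapi() with $op == 'update index'.
function ls_demo_update_index() {
  LsDemo::$indexing = TRUE;
  LsDemo::$log[] = 'update index: flag set now';
}

// Mirrors the order in _node_index_node(): build the body, then fetch extras.
function ls_demo_index_node($body) {
  $rendered = ls_demo_process($body);  // body is built and filtered first
  ls_demo_update_index();              // 'update index' is invoked afterwards
  return $rendered;
}

ls_demo_index_node("=en=\nHello\n=de=\nHallo\n");
```

After the run, the log shows that the filter step executed while the flag was still unset, which is the situation described above.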

#8

netgenius - October 15, 2009 - 17:34

Anybody want to try and make a patch?

#9

Frank Steiner - October 20, 2009 - 13:26

I thought about this for a long time but I don't see any clean way to handle this with the flag. The other approach, i.e. re-fetching the body and applying all filters but language sections, seems too complicated to me because of roles. I wouldn't know which filters to apply as there is no "standard" role.
That's why I ended up with the cron solution: I just couldn't do any better :-) So I'm stepping back :-)

#10

netgenius - October 21, 2009 - 18:39

Well, thanks for trying! From your #7 it seems that testing for $node->build_mode == NODE_BUILD_SEARCH_INDEX might work and would not have the unwanted side-effects that testing for a cron run could have.

#11

Jurgen8en - October 27, 2009 - 16:07

Hi,

For Drupal 5.x, it looks like the function language_sections_filter is never called through cron.
There is also no build mode or ...

In my situation I only use language sections for one node-type: uc_product

I added

function uc_product_nodeapi(&$node, $op, $a3 = NULL, $a4 = NULL) {
  if ($op == 'update index') {
    return db_result(db_query('SELECT body FROM {node_revisions} WHERE nid = %d', $node->nid));
    //return '<strong>('. implode(', ', $output) .')</strong>';
  }
}

I ran cron, updated a node and searched, but it doesn't work.
Any help is welcome.

Jurgen
www.cardKeyfinder.com

#12

Frank Steiner - October 30, 2009 - 14:16

> From your #7 it seems that testing for $node->build_mode == NODE_BUILD_SEARCH_INDEX
> might work and would not have the unwanted side-effects that testing for a cron run could have.

I'm afraid it won't, because node_build_content calls node_prepare, which calls check_markup before we can intervene with any nodeapi function or something similar.

 
 
