Problem/Motivation

The search module is putting out snippet spans with xml:lang attributes on them, which is generally good. But if the original content has xml:lang spans in it, they are not reflected in the snippets.

As noted in #6, the solution should handle the use of lang attributes on any tag, and probably doesn't need to handle xml:lang attributes.

Steps to reproduce

  1. Install Drupal 10.1.x with Umami profile (Demo: Umami Food Magazine (Experimental))
  2. From the Manage menu, navigate to Content, search for "pasta" in the Title field filter.
  3. Edit the node with the title "Pasta vegetariana al horno súper fácil". (es/node/3/edit)
  4. For the Resumen field, change the text format (Formato de texto) to HTML completo. Click the blue Continuar button when the Change text format pop-up warning displays.
  5. On the Resumen field, in the CKEditor toolbar, click Origen to edit the HTML.
  6. Add the following inside the <p></p> tags: <span lang="en">Pasta is delicious!</span> <span xml:lang="en">XML Pasta is tasty!</span> so that the contents of the field are now:

    <p>
    	Una pasta al horno es la comida más fácil y saludable. Este delicioso plato es súper rápido de preparar y una comida ideal entre semana para toda la familia. <span lang="en">Pasta is delicious!</span> <span xml:lang="en">XML Pasta is tasty!</span>
    </p>
    
  7. Save the changes (at the bottom of the Edit (Editar) form, click the blue button Guardar (esta traducción).)
  8. Return to the site and ensure that you're on the Español version (which should be the case since you just edited a Spanish language node).
  9. Use the search field (Buscar) to search for the text "pasta".
  10. In the search result snippet for "Pasta vegetariana al horno súper fácil" (the 2nd result), the search result snippet displays: "Pasta vegetariana al horno súper fácil Una pasta al horno es la comida más fácil y saludable. Este … y una comida ideal entre semana para toda la familia. Pasta is delicious! XML Pasta is tasty! …"
  11. Inspect the HTML for this snippet. Notice that both <span> tags have been stripped.

Proposed resolution

I think we won't try to do the snippet processing really carefully -- it's just too complicated to break the text up and preprocess it. But we can fix the snippet creator to put out spans with xml:lang attributes.

Remaining tasks

1. Manually apply the changes from the patch in #3 to core/modules/search/search.module, since that file has been massively refactored since the patch was made.
2. Add test(s)
3. Review, etc.
4. Commit

User interface changes

Snippets will have the right language attributes on them.

API changes

Original report by Heine

#867114 ("Search results should add lang tag if language of search result differs") introduced an xml:lang attribute on certain search results whose declared language differed from the currently displayed language. This prevents snippet title and snippet text from inheriting the ancestral language.

I don't think search/node module can claim to know the language of the snippet though as it will just strip tags, removing any language context from the content. Consider the following node in language "nl":

<p>De titel van ons nieuwste boek: <span xml:lang="en">The Late Roman Cemeteries; Stray Finds and Excavations</span></p>

Searching for "boek" gives the following, incorrect fragment:

<p class="search-snippet" xml:lang="nl">        De titel van ons nieuwste <strong>boek</strong>:   The Late Roman Cemeteries; Stray Finds and Excavations            ...</p>

Should it not specify xml:lang=""? (Or maybe xml:lang="und"? I need to dig for the correct XHTML spec.)
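The loss of language context described above can be reproduced outside Drupal. The following is a minimal Python sketch, not the Search module's code: `strip_tags` is a hypothetical stand-in for the tag stripping done before excerpting, built on the standard library's `html.parser`. It only shows that once the tags are gone, nothing marks the English title as English.

```python
from html.parser import HTMLParser


class TagStripper(HTMLParser):
    """Collects text content only, discarding every tag and attribute."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_tags(html):
    """Hypothetical stand-in for the tag stripping applied to node content."""
    stripper = TagStripper()
    stripper.feed(html)
    stripper.close()
    return "".join(stripper.parts)


node_body = (
    '<p>De titel van ons nieuwste boek: <span xml:lang="en">The Late Roman '
    'Cemeteries; Stray Finds and Excavations</span></p>'
)
plain = strip_tags(node_body)
# The language declaration on the span does not survive in the plain text.
print(plain)
```

Wrapping `plain` in an element declared as "nl" would therefore label the English title as Dutch, which is the incorrect fragment shown above.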

Comments

BarisW’s picture

Would be good if the snippet could contain specific tags, like <span> and <a>.

Then the snippet would become:

<p class="search-snippet" xml:lang="nl"> De titel van ons nieuwste <strong>boek</strong>: <span lang="en">The Late Roman Cemeteries; Stray Finds and Excavations</span> ...</p>

Which is correct HTML again.

Heine’s picture

Title: Search module declares potentially incorrect snippet language » Search module should keep language information on snippet content.
Assigned: Unassigned » Heine

I'm pretty close to enabling search_excerpt() to insert appropriate language spans in the fragments (even if the language declaration is outside of the fragment). I just need to work through some edge cases and look into directionality, so I'm assigning this to myself.

Heine’s picture

Assigned: Heine » Unassigned
Status: Active » Needs work
FileSize: 5.87 KB

Here's the first rough attempt. Some awkwardness due to the choice of array format for language stretches. Also, excerpt fragments are marked-up individually even if they are in the same language.

It does not properly support languages in table cells of tables that use @lang on col or colgroup elements.
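To make the idea of "language stretches" concrete, here is a rough Python sketch. It is not the attached patch, and the names `LanguageStretchParser` and `language_stretches` and the (language, text) tuple format are assumptions made for illustration. It tracks the language in effect with a stack, so a declaration on any ancestor tag is inherited by nested tags. Like the patch, it does nothing special for col or colgroup, and it treats an empty lang="" as "inherit", which is a simplification.

```python
from html.parser import HTMLParser

# Elements that never have an end tag, so they must not push onto the stack.
VOID_ELEMENTS = {"area", "base", "br", "col", "embed", "hr", "img",
                 "input", "link", "meta", "source", "track", "wbr"}


class LanguageStretchParser(HTMLParser):
    """Records which language is in effect for each run of text."""

    def __init__(self, base_lang):
        super().__init__()
        self.lang_stack = [base_lang]  # innermost language is last
        self.stretches = []            # list of [lang, text]

    def handle_starttag(self, tag, attrs):
        if tag in VOID_ELEMENTS:
            return
        attributes = dict(attrs)
        declared = attributes.get("lang") or attributes.get("xml:lang")
        # Tags without a declaration inherit the enclosing language.
        self.lang_stack.append(declared or self.lang_stack[-1])

    def handle_endtag(self, tag):
        if tag not in VOID_ELEMENTS and len(self.lang_stack) > 1:
            self.lang_stack.pop()

    def handle_data(self, data):
        lang = self.lang_stack[-1]
        if self.stretches and self.stretches[-1][0] == lang:
            # Merge adjacent text in the same language into one stretch.
            self.stretches[-1][1] += data
        else:
            self.stretches.append([lang, data])


def language_stretches(html, base_lang):
    """Returns a list of (lang, text) tuples in document order."""
    parser = LanguageStretchParser(base_lang)
    parser.feed(html)
    parser.close()
    return [(lang, text) for lang, text in parser.stretches]


example = language_stretches(
    '<p lang="nl">Boek: <span lang="en">The <em>Late</em> Roman</span> einde</p>',
    "nl",
)
```

Because adjacent text in the same language is merged, the nested `<em>` does not split the English stretch in this example.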

Heine’s picture

Issue summary: View changes

Corrected attribute to xml:lang.

jhodgdon’s picture

Version: 7.x-dev » 8.x-dev
Issue summary: View changes
Status: Needs work » Postponed
Issue tags: +Needs tests, +Needs backport to D7
Related issues: +#916086: search_excerpt() doesn't highlight words that are matched via search_simplify()

This is an interesting issue. The patch would need to be ported to Drupal 8 and go in there first, before we could consider it for Drupal 7. And it's pretty hard to review now, due to lack of comments and explanations. Plus it needs a test.

In any case, there are some major reworkings of the search_excerpt() function going on right now in #916086 ("search_excerpt() doesn't highlight words that are matched via search_simplify()"), so we should probably postpone this issue until that one goes in.

ianthomas_uk’s picture

jhodgdon’s picture

I took a look at the patch that was posted a while back. I think that, at least in Drupal 8, we actually need to approach this a little differently.

First of all, the function search_simplify(), which is used in search indexing, searching, and search highlighting, is now language-aware. Functions search_index_split() and search_excerpt(), which call this function, are also language-aware. And search_index(), which calls search_index_split(), is also language-aware. All of these functions take the language of the text as a parameter, and assume that the entire text is in a single language.

Second, it appears that in HTML5, the correct way to specify language is with a "lang" attribute, as "xml:lang" is not supported any more. This attribute may go on any tag, and they can be nested. It looks like the patch is doing this part mostly right -- detecting it on any tag, and then in the output making spans, but the difference is we should output "lang" not "xml:lang".

Third, we cannot rely on the Filter module being turned on. The patch uses the function filter_dom_load() to load the DOM object. This doesn't exist in Drupal 8... maybe we can use this class instead though:
https://api.drupal.org/api/drupal/core!vendor!masterminds!html5!src!HTML...

So... It looks like if we want to do this right in Drupal 8, before calling search_simplify(), both search_index_split() and search_excerpt() would need to look for lang attributes on HTML tags that specify areas in the text that have different languages, and they should then call search_simplify() on each language part separately. Then in the case of search_excerpt(), you'd want to reassemble the tags after excerpting, similar to what is done in the patch.
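As a rough illustration of why each language part would be simplified separately, consider the Python sketch below. Nothing in it is Drupal's API: `simplify` is a hypothetical stand-in for a language-aware search_simplify(), `LANGUAGE_RULES` is an invented lookup table, and the input is assumed to be a list of (language, text) parts already split out of the HTML. The Turkish lower-casing rule is used only because it is a well-known case where the result depends on the language.

```python
def _lower_turkish(text):
    # Turkish pairs "I" with dotless "ı" and "İ" with "i".
    return text.replace("I", "ı").replace("İ", "i").lower()


# Hypothetical per-language rules; every other language uses str.lower.
LANGUAGE_RULES = {"tr": _lower_turkish}


def simplify(text, langcode):
    """Stand-in for a language-aware simplify step (assumption, not core code)."""
    rule = LANGUAGE_RULES.get(langcode, str.lower)
    # Lower-case with the language's rule, then collapse runs of whitespace.
    return " ".join(rule(text).split())


def simplify_parts(parts):
    """Simplifies each (lang, text) part with the rule for its own language."""
    return [(lang, simplify(text, lang)) for lang, text in parts]


result = simplify_parts([("tr", "ISPARTA"), ("en", "ISPARTA")])
```

The same input yields different simplified text for "tr" and "en", which is what would be lost if the whole document were processed in a single language.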

jhodgdon’s picture

Issue tags: +D8MI

Gábor Hojtsy’s picture

Issue tags: +language-content

Yeah, search has not yet considered that parts of the content may be in a different language. If we can split them and process them per language, that would be amazing. Adding more D8MI tags.

jhodgdon’s picture

Issue summary: View changes

Talked to pwolanin about this at DDD today...

The point of this issue is that since the NodeSearch plugin and/or the Search module is putting an xml:lang attribute on the result snippet, it needs to respect the xml:lang attributes in the text.

So I think we won't try to do the snippet processing really carefully -- it's just too complicated to break the text up and preprocess it. But we can fix the snippet creator to put out spans with xml:lang attributes.
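A minimal Python sketch of that narrower fix follows. It is an assumption about shape, not the Search module's implementation: `render_snippet` is a hypothetical helper that takes an excerpt already divided into (language, text) parts and wraps only the parts whose language differs from the snippet's own. The attribute name is a parameter because this comment says xml:lang while an earlier comment suggests lang for HTML5.

```python
from html import escape


def render_snippet(parts, snippet_lang, attribute="lang"):
    """Builds snippet HTML, adding a span around foreign-language parts."""
    output = []
    for lang, text in parts:
        safe_text = escape(text)
        if lang == snippet_lang:
            # Same language as the enclosing snippet: no wrapper needed.
            output.append(safe_text)
        else:
            output.append(
                '<span %s="%s">%s</span>' % (attribute, escape(lang), safe_text)
            )
    return "".join(output)


snippet = render_snippet(
    [("nl", "De titel van ons nieuwste boek: "),
     ("en", "The Late Roman Cemeteries")],
    "nl",
)
```

Passing `attribute="xml:lang"` produces the older attribute instead. Keyword highlighting and excerpt selection are left out of this sketch.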

Version: 8.0.x-dev » 8.1.x-dev

Drupal 8.0.6 was released on April 6 and is the final bugfix release for the Drupal 8.0.x series. Drupal 8.0.x will not receive any further development aside from security fixes. Drupal 8.1.0-rc1 is now available and sites should prepare to update to 8.1.0.

Bug reports should be targeted against the 8.1.x-dev branch from now on, and new development or disruptive changes should be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.1.x-dev » 8.2.x-dev

Version: 8.2.x-dev » 8.3.x-dev

Version: 8.3.x-dev » 8.4.x-dev

Version: 8.4.x-dev » 8.5.x-dev

Version: 8.5.x-dev » 8.6.x-dev

Version: 8.6.x-dev » 8.8.x-dev

Version: 8.8.x-dev » 8.9.x-dev

Version: 8.9.x-dev » 9.2.x-dev

Drupal 8 is end-of-life as of November 17, 2021. There will not be further changes made to Drupal 8. Bugfixes are now made to the 9.3.x and higher branches only. For more information see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.

Version: 9.2.x-dev » 9.3.x-dev

Version: 9.3.x-dev » 9.4.x-dev

Version: 9.4.x-dev » 9.5.x-dev

Amber Himes Matz’s picture

Issue summary: View changes
Issue tags: +Bug Smash Initiative, +Needs reroll

We triaged this issue as part of the Bug Smash Initiative in the #bugsmash Drupal Slack channel.

I think I reproduced the issue with the following steps:

  1. Install Drupal 10.1.x with Umami profile (Demo: Umami Food Magazine (Experimental))
  2. From the Manage menu, navigate to Content, search for "pasta" in the Title field filter.
  3. Edit the node with the title "Pasta vegetariana al horno súper fácil". (es/node/3/edit)
  4. For the Resumen field, change the text format (Formato de texto) to HTML completo. Click the blue Continuar button when the Change text format pop-up warning displays.
  5. On the Resumen field, in the CKEditor toolbar, click Origen to edit the HTML.
  6. Add the following inside the <p></p> tags: <span lang="en">Pasta is delicious!</span> <span xml:lang="en">XML Pasta is tasty!</span> so that the contents of the field are now:

    <p>
    	Una pasta al horno es la comida más fácil y saludable. Este delicioso plato es súper rápido de preparar y una comida ideal entre semana para toda la familia. <span lang="en">Pasta is delicious!</span> <span xml:lang="en">XML Pasta is tasty!</span>
    </p>
    
  7. Save the changes (at the bottom of the Edit (Editar) form, click the blue button Guardar (esta traducción).)
  8. Return to the site and ensure that you're on the Español version (which should be the case since you just edited a Spanish language node).
  9. Use the search field (Buscar) to search for the text "pasta".
  10. In the search result snippet for "Pasta vegetariana al horno súper fácil" (the 2nd result), the search result snippet displays: "Pasta vegetariana al horno súper fácil Una pasta al horno es la comida más fácil y saludable. Este … y una comida ideal entre semana para toda la familia. Pasta is delicious! XML Pasta is tasty! …"
  11. Inspect the HTML for this snippet. Notice that both <span> tags have been stripped.

Note to re-rollers! The code that the last patch modifies still exists in core/modules/search/search.module, but that file has been massively refactored, so I would be very surprised if a regular re-roll would work. I think what probably needs to happen is someone needs to take the latest patch's changes and manually apply them to core/modules/search/search.module and create a new patch.

Tests will also be needed.

I updated the issue summary with my steps to reproduce and also added remaining tasks.

Amber Himes Matz’s picture

Issue summary: View changes

Updated issue summary with:

As noted in #6, the solution should handle the use of lang attributes on any tag, and probably doesn't need to handle xml:lang attributes.

Version: 9.5.x-dev » 11.x-dev

Drupal core is moving towards using a “main” branch. As an interim step, a new 11.x branch has been opened, as Drupal.org infrastructure cannot currently fully support a branch named main. New developments and disruptive changes should now be targeted for the 11.x branch. For more information, see the Drupal core minor version schedule and the Allowed changes during the Drupal core release cycle.