Problem/Motivation

Æ has a few accented variants, these do not get their accents removed because the transliteration component returns AE and the removeDiacritics method only consider single letter replacements exactly in order to avoid replacing Æ itself with AE while removing diacritics.

Proposed resolution

Map these back. The consequences are minor: searching for Æ will now find Ǣ where previously it wouldn't.

I attached an image showing these were the only ones having accents on the expected side and what the change is really.

Remaining tasks

Does this need a change notice ?

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Charlie ChX Negyesi created an issue. See original summary.

Ghost of Drupal Past’s picture

Status: Active » Needs review
FileSize
2.26 KB
Ghost of Drupal Past’s picture

Issue summary: View changes
FileSize
145.09 KB
Ghost of Drupal Past’s picture

FileSize
2.27 KB

Moved the map into a constant.

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

This change makes a ton of sense. It even has a test.

As a note, at least on my system, when I looked at the patch file in the browser, the characters were weird. But when I applied the patch and looked at the files in my terminal or text editor, the characters were readable and the patch looks correct to me.

alexpott’s picture

Status: Reviewed & tested by the community » Needs review
+++ b/core/lib/Drupal/Component/Transliteration/PhpTransliteration.php
@@ -23,6 +23,21 @@
+  const DIACRITICS_MAP = [

This map is now added to the global namespace but it is (necessarily) a very incomplete map. It'd be nice to add protected in front of it but then it's going to be Drupal 9 only because scoped constants were only introduced in PHP 7.1. Maybe we can consider a name that doesn't imply a complete map or somehow describes what it maps.

Another question I have is will this list grow? Because if if does then we might want to consider placing it somewhere else because then we don't have to instantiate it when we load this class.

Ghost of Drupal Past’s picture

This list can't grow because removeDiacritics is operating on a very small set of characters (323 of them) and of those, these six have this problem.

A better name eludes me. fix_transliterate_for_remove_diacritics?? The crux of the matter is removeDiacritics using the transliteration datasets for its work which is basically a crafty trick which happens to be correct except for these six. We totally could supply a different dataset for removeDiacritics but it'd be a waste to double maintain that. Look at Component/Transliteration/data/x01.php on how the transliterate dataset is supplied.

Let me describe the bug again by describing what happens when removeDiacritics is ran over a few specific characters:

  1. U+00C6 Æ gets transliterated to AE
  2. removeDiacritics only uses the output of transliteration if it's a single character and so it leaves Æ alone. This is correct.
  1. U+01E2 Ǣ gets transliterated to AE
  2. removeDiacritics only uses the output of transliteration if it's a single character and so it leaves Ǣ alone. This, however, is incorrect: we want the output to be U+00C6 Æ.

I achieved this by using a simple map: AE gets mapped back to Æ. The new code flow is:

  1. U+00C6 Æ gets transliterated to AE
  2. removeDiacritics finds AE in the map. The output is Æ. This is correct.
  1. U+01E2 Ǣ gets transliterated to AE
  2. removeDiacritics finds AE in the map. The output is Æ. This is correct..

Note the test is total. Every character affected by removeDiacritics is tested.

We could eliminate the constant and make it a local variable again, it's just an array of six, not a biggie, I felt a constant is more expressive.

alexpott’s picture

How about making the list a protected property - that's supported and then it does not bleed into the public API.

Ghost of Drupal Past’s picture

FileSize
2.35 KB

Moved to a property. Thanks for the guidance. The interdiff is not meaningful so I haven't attached it.

jhodgdon’s picture

Status: Needs review » Reviewed & tested by the community

It looks to me as though the concerns of alexpott have been addressed. I still like the patch. The tests still pass. No coder errors. And this is a sensible fix to removeDiacritics(), with a very small impact (removes diacritics from a total of 4 characters that were incorrectly being passed up in the previous code). Back to RTBC.

alexpott’s picture

Status: Reviewed & tested by the community » Needs work

One more thing to consider. Re-indexing. I think this change makes a re-index necessary because it changes search input and the text that is indexed... yes there probably aren't too many ǢǣǼǽǮǯ characters in people's text (this statement might very well be a statement of my ignorance) but this change will breaking searching for them on an already indexed site.

Ghost of Drupal Past’s picture

Status: Needs work » Needs review
FileSize
2.96 KB
527 bytes
alexpott’s picture

Status: Needs review » Needs work
+++ b/core/modules/search/search.install
@@ -154,3 +154,13 @@ function search_requirements($phase) {
+/**
+ * Mark everything for reindexing after diacritics removal rule change.
+ */
+function search_update_9000() {

Let's do this in a post update - this is not a schema change and it also doesn't force our hand wrt to D8 and D9 update numbers for the search module.

Ghost of Drupal Past’s picture

Status: Needs work » Needs review
FileSize
3 KB
Ghost of Drupal Past’s picture

FileSize
3.21 KB
Ghost of Drupal Past’s picture

FileSize
3.19 KB
jpatel657’s picture

Assigned: Unassigned » jpatel657
jpatel657’s picture

Version: 8.9.x-dev » 9.1.x-dev
jpatel657’s picture

Assigned: jpatel657 » Unassigned
FileSize
3.19 KB

Patch is working for 9.1.x

Ghost of Drupal Past’s picture

I do not quite understand why #16-19 was posted. The patches are identical in #16 and #19.

md5sum 3151364_15.patch 3151364-18.patch
48685a2a596a4f489f51dcdc9d64011d  3151364_15.patch
48685a2a596a4f489f51dcdc9d64011d  3151364-18.patch
Ghost of Drupal Past’s picture

alexpott’s picture

@jpatel657 the patch in #16 applied to 9.1.x already. You could have used the add test button to kick off an 9.1.x test against that branch. Since #19 is exactly the same as #16 I'm removing issue credit.

alexpott’s picture

Version: 9.1.x-dev » 8.9.x-dev

Also since this is a bug that occurs in 8.9.x that's the correct branch for this patch. It's great that the same patch applies to 9.1.x so we can applied the same patch to all branches.

jhodgdon’s picture

Apparently, post update functions go into the module_name.post_update.php file. Other than that, looks OK to me.

alexpott’s picture

Status: Needs review » Needs work

@jhodgdon is correct - that's exactly where they go. Nice catch.

munish.kumar’s picture

Assigned: Unassigned » munish.kumar
Ghost of Drupal Past’s picture

Assigned: munish.kumar » Ghost of Drupal Past
Status: Needs work » Needs review
FileSize
3.28 KB
840 bytes

The interdiff is against #9 when the patch was last RTBC.

jibran’s picture

Status: Needs review » Reviewed & tested by the community

Back to RTBC as all the outstanding feedback is addressed.

catch’s picture

Status: Reviewed & tested by the community » Needs work
Issue tags: +Needs reroll

This looks good to me, but there's a conflict in search.post_update.php now.

Ghost of Drupal Past’s picture

Assigned: Ghost of Drupal Past » Unassigned

Alas, someone else will need to fix this now.

amateescu’s picture

Status: Needs work » Reviewed & tested by the community
Issue tags: -Needs reroll
FileSize
3.25 KB

Here's a patch for 9.1.x, and #27 still applies to 8.9.x.

Krzysztof Domański’s picture

As a note, at least on my system, when I looked at the patch file in the browser, the characters were weird. But when I applied the patch and looked at the files in my terminal or text editor, the characters were readable and the patch looks correct to me.

@jhodgdon #2922638: No charset on response for patch & text files

Krzysztof Domański’s picture

Component: base system » transliteration system

  • catch committed 920d170 on 9.1.x
    Issue #3151364 by Charlie ChX Negyesi, amateescu, alexpott, jhodgdon:...
catch’s picture

Status: Reviewed & tested by the community » Fixed

Committed/pushed to 9.1.x and backported to 8.9.x, thanks!

  • catch committed 13da1a7 on 9.0.x
    Issue #3151364 by Charlie ChX Negyesi, amateescu, alexpott, jhodgdon:...

  • catch committed 33b050c on 8.9.x
    Issue #3151364 by Charlie ChX Negyesi, amateescu, alexpott, jhodgdon:...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.