I have relatively recently updated my Drupal 5.7 site to use Pathauto 5.x-2.1 and Token 5.x-1.10. Following this update I am now seeing an issue where strings in the "Strings to Remove" field of "General settings" in the Pathauto admin page for my site are not being removed.

The relevant elements of my configuration are:

General settings
Set to default values - specifically the word "the" is in the list of "Strings to Remove"

Node path settings
"Default path pattern" set to "[termpath-raw]/[title-raw]"

Category path settings
"Default path pattern" set to "[vocab-raw]/[catpath-raw]"
"Pattern for all Category paths" set to "[catpath-raw]"

I have a top-level term in the default Category vocabulary called "About The School" which previously had an alias of "about-school" as expected i.e. the "the" is being stripped from the alias. If I now save this term the alias is being updated to "about-the-school" i.e. the "the" is NOT being stripped from the alias any more.

Likewise, aliases for content assigned to this term are no longer having listed words removed from their aliases. For example, a page with title "About The Web Site" is now being given an alias of "about-the-school/about-the-web-site" instead of "about-school/about-web-site" as previously generated, and as expected.

Please help! Is this an issue or something relating to my possible misunderstanding of how the "-raw" tokens work? Many thanks.

Comments

greggles’s picture

Status: Active » Postponed (maintainer needs more info)

I followed these same steps and was unable to repeat the behavior. Can you try on a brand new site and see if the problem persists?

Note that this textfield should have all the words together separated by commas (a,an,the,whatever) without any spaces or anything.

darrenscott’s picture

Thanks for the quick reply.

If you have tried these steps with no issues on a different site I'm sure it will work as expected on a new clean site. What I'd like to understand though is what it is about my particular site that was previously working absolutely fine with an earlier version of Pathauto which is causing it to experience problems after upgrading to this version of Pathauto.

As a couple of extra tests I've tried changing other settings, such as the separator character and these changes take effect when updating terms as expected. I've also tried replacing the "Strings to Remove" value with just "the,The" (I know case sensitivity shouldn't matter but just wanted to cover all bases) and this still doesn't result in the "The" of "About The School" being removed.

Can you think of any other steps I can try to get to the bottom of what is causing this on my site?

Also, is it worth trying the latest 5.x-2.x-dev version? (this is a live site though!)

Finally, as you've proved this version works on a different site would you like me to change the issue category to "support request" as it's most likely not a bug in the module?

greggles’s picture

Category: bug » support

I'm not sure of many more debugging steps to do aside from adding print statements into the actual code to see what is going on.

davegan’s picture

I can verify that I'm also experiencing this bug... is it possible that aggressive caching could have fouled this up?

davegan’s picture

ok, I've locked down the behavior...

the issue occurs when you have capital words in the node title.

For instance:
"Here is a node title"

Will create the url "here-node-title".

However

"Here Is A Node Title"

Will create the url "here-is-a-node-title".

It's an easy bug to miss, as the second option is an improper title, according to the English language. The reason I caught it on my site is that the removal word "The" is at the start of many of my titles.

I've fixed the problem on my end for now with a couple of strtolower() calls in pathauto_cleanstring, but it would be good to see a module developer fix this up in the next version.

darrenscott’s picture

Thanks for confirming I wasn't alone in experiencing this issue davegan - and good work tracking down the exact cause.

Would you mind posting exactly the changes you've made to pathauto_cleanstring to get this to work for you? Is it worth proposing these fixes as a patch?

Do any of the module developers know what changes in this version have broken this and are there plans to fix the module any time soon? Would be great to get an official fix in for this and a new version released...

greggles’s picture

Title: "Strings to Remove" not being removed » "Strings to Remove" not being removed if they are capitals
Category: support » bug
Status: Postponed (maintainer needs more info) » Active

@davegan - can you please provide a patch based on your changes?

Now that there is a simple repeatable test case this should be easy to fix. Thanks.

darrenscott’s picture

Excellent news - many thanks.

greggles’s picture

I did some more testing with this and found this feature to be enormously inconsistent. Words in my "words to remove" list were sometimes left in even if they were lowercase. I'm starting to think this needs to be reworked a bit...

davegan’s picture

I could provide a patch, but I suspected there could be more problems that wouldn't be addressed with my quick fix. I'm a little tight for time at the moment to test and debug completely.

Previous versions of Pathauto seemed to be fine - perhaps the key would be going back and figuring out which change introduced the bug?

darrenscott’s picture

Thanks davegan. From what greggles said earlier, it sounds like there could be more problems lurking with this...

Do you know if it's safe to go back to the previous version of the pathauto module? I don't think I've ever had to downgrade a module before! Would there be any special steps required to downgrade?

Greggles - do you have any update on any reworking or official patch for this?

greggles’s picture

@darrenscott which previous version worked for you?

darrenscott’s picture

I can't be 100% sure but I think it was the previous version, 5.x-2.0 as I remember all modules I use being up-to-date until the point this latest version 5.x-2.1 was released.

greggles’s picture

Well, if you can pin down a version where it worked correctly and the new 5.x-2.1 where it doesn't it will be easy to figure out what changes were made that broke this which will make it enormously simple to fix :-)

darrenscott’s picture

As I said, I think I was using 5.x-2.0 previously. Unfortunately I don't have time right now to confirm one way or the other but I will try to do so tomorrow.

I wonder if anyone else experiencing these problems can be more specific on versions - @davegan - how about you?

Assuming it was 5.x-2.0, can you see any obvious changes in 5.x-2.1 which would cause the problems?

Oh, BTW - thanks again for taking the time to look into this issue :-)

moonray’s picture

Same issues (subscribing)

moonray’s picture

I believe I might have found the cause of the problem: the mb_ereg_replace function.

This is from a comment on PHP.net:

'i' option does not work correctly with multibyte characters. The function does not locate/replace the multibyte string if it's different case then specified on multibyte needle which is in different case.

Luckily there is a mb_eregi_replace function.

moonray’s picture

One more thing about this, now that it works...

A path like "blogs/The Team" gets cleaned up to "blogs/-team". It would be nice to get rid of the separator at the beginning.

moonray’s picture

Looks like the separator remaining at the beginning only happens with a pattern like "blogs/[catpath-raw]".

darrenscott’s picture

@moonray - when you say "now that it works...", is that by replacing references to mb_ereg_replace with mb_eregi_replace in your local pathauto.module?

Further to the original problems I reported, I've also been experiencing other strange problems with pathauto relating to the updating of existing aliases. I don't want to go too off-topic but am interested if anyone else has experienced the following:

I create a term for a vocabulary e.g. ABC
An alias of abc gets created for this term.
I then update the term from ABC to ABCD.
The term is updated correctly, but the alias appears to remain as abc.
I then update the term again from ABCD to ABCDE.
The term is updated correctly, but the alias gets updated to abcd! i.e. one update behind

I've tried this with both 5.x-2.1 and 5.x-1.0.

Has anyone else experienced this?

greggles’s picture

@darrenscott - please, please open a new issue for a new problem.

@everyone else - let's stay focused. thanks.

darrenscott’s picture

@greggles - sure, no problem, I'll raise another issue

Back to the original issue... do you have any update? Are we any closer to having a fix for it? Does @moonray's observations about mb_ereg_replace help?

greggles’s picture

I don't have time to do much development now, so I haven't tested it. I'm relying on Pathauto's users to develop patches and then tell me which patches make sense or not.

darrenscott’s picture

As per moonray's post #17 above I've just applied this same change to my installation of pathauto 5.x-2.1 and can confirm the problem I initially reported appears to be resolved by it.

To summarise, I replaced the two references to mb_ereg_replace in pathauto.inc with mb_eregi_replace instead. (For the actual call to this function I used the same parameters as before but I guess the "/i" suffix in the first param is now redundant)

@greggles - would you have time to create a patch based on this change and test yourself? Thanks.

greggles’s picture

As I said in #23 of this issue, no, I don't have time nor priority on these issues to roll nor test patches now. I will in a month. If you can roll and test a patch that helps me.

Anonymous’s picture

subscribing

I edited the function to use mb_eregi. Yet ...

With subsequent words to remove, the function will only remove the last listed in the settings.

Example:
Words to remove are: "dr,lic,phil,prof"
Page title is: "Prof. Dr. Albert Einstein"
Results in URL: "dr-albert-einstein"

Changing the order of words in the preferences to "prof,dr,lic,phil" will result in the URL being "prof-albert-einstein"

goldstone’s picture

We also experienced the same issue/resolution identified in #17. Wherein certain words are not removed as per the PHP note. In addition - there is a separate issue described in #26 in which some of the words on the list are not being removed (besides those that are not removed due to the first issue). With a few tests I found that the first and last words in the list were not being removed. I didn't continue testing to see what other patterns would result in words not being removed nor under what conditions.

Example:
Words to remove: "a,the,of,test"
Page title: a test of the emergency broadcast system
Resulting URL Alias: a-test-emergency-broadcast-system

Example:
Words to remove: "a,a,the,of,test,test"
Page title: a test of the emergency broadcast system
Resulting URL Alias: emergency-broadcast-system

In the second example I doubled up on the first and last words as a work-around. I purposefully left the title lower case to separate the issues.

Leeteq’s picture

Subscribing.

greggles’s picture

Status: Active » Needs review
StatusFileSize
new1.03 KB

Ok, could folks please test this out? I have a feeling that most of the problems are caused by 1) subtle differences in php version/configuration/compilation 2) the fact that pathauto will attempt to use preg_replace sometimes and mb_ereg_replace in others which may be further affected by those subtle differences.

@george.goldstone and @xeophin - note that your experiences are similar, but completely the opposite. So...more fun!

WeRockYourWeb.com’s picture

Just tried the patch on 5.x-2.2 and it gave the message "Hunk #1 failed at 161 / 1 out of 1 hunk failed"

Duplicating the first term in the list of strings to remove solved the issue - ie. using "a,a,..." helped remove lower case a's. Just for kicks, I tried "a,A,a,..." and now it removes both lower and uppercase a's. It appears that 1) it ignores the first item in the list of strings to remove, and 2) it's case sensitive?

jenlampton’s picture

@asterix - the version of this issue is 5.x-2.1, so that's where the patch should be applied, not 5.x-2.2

I am suffering from the same problem (in versions 2.1 and 2.2), words with capitals not being removed, as well as lower-case a (the first word in the list) not being replaced.

I've downgraded my site to 5.x-2.0 and wanted to confirm that all the above problems seem not to be problems in this version.

@greggles, I'm not sure how much help that is, but maybe it's something?

Jen

greggles’s picture

Version: 5.x-2.1 » 5.x-2.x-dev
StatusFileSize
new877 bytes

@jenlampton - yes, that gives a great clue. It seems that this bug was introduced in #169250: transliteration failed with non-latin characters which should have used mb_eregi_replace instead.

There seems to be a related but separate issue about the first and last words in the list either not working or being the only ones that do work.

In general, patches should be against the tip of the branch, so for 5.x that is the tip of DRUPAL-5--2 but in this case it fell out of date with all the recent wonderful cleanup work of Freso. Updated patch attached.

greggles’s picture

Title: "Strings to Remove" not being removed if they are capitals » "Strings to Remove" not being removed if they are capitals or the first/last word in the box.
Status: Needs review » Active

I fixed the first part of this - that it was broken for capital words.

Now, to figure out what is wrong with the regex for the first/last words in the group.

greggles’s picture

Status: Active » Needs review

So, it turns out that mb_eregi_replace seems to take a different pattern style than preg_replace. I didn't see anything on the docs page to suggest this (http://ar2.php.net/preg_replace vs. http://ar2.php.net/preg_replace) but...at least in my tests this seems to work.

Since regex tends to be platform specific I'd love it if a couple of people could try this out. The change is very simple even if you are not comfortable applying patches: just change

Old:

   $output = mb_eregi_replace("/$ignore_re/i", '', $output);

New:

   $output = mb_eregi_replace("$ignore_re", '', $output);
greggles’s picture

Somehow I forgot to attach the patch...

@george.goldstone, @xeophin, @jenlampton, can any of you test this new fix for the first/last words in the box?

Freso’s picture

Version: 5.x-2.x-dev » 6.x-1.x-dev
Status: Needs review » Reviewed & tested by the community
StatusFileSize
new729 bytes

Tested and works on 6.x. A few comments though. :)

  • Is there a reason to keep the quotes around the variable? The patch seems to work just as well without it (see attached one). :)
  • Is preg_replace("/$ignore_re/i", '', $output); still proper, or should this be fixed as well? (For cases where PHP isn't built with multibyte character support.)
Freso’s picture

Oh, and for platform info: I tested on Gentoo Linux (kernel: 2.6.24-gentoo-r8), using "PHP 5.2.6RC4-pl0-gentoo with Suhosin-Patch 0.9.6.2".

greggles’s picture

Your patch seems reasonable to me. AFAIK the preg_replace side works relatively well (it was the more tested way of doing it - the mb_ereg_replace was added more recently, I think between 5.x-2.1 and 5.x-2.2).

Freso’s picture

Status: Reviewed & tested by the community » Needs review

Alright, since we need to test this on multiple platforms (this being operating systems and PHP versions), I decided that I might as well go ahead and write how to test it.

  1. Look at admin/build/path/pathauto under "General settings" -> "Strings to Remove" and check the first argument (by default, this is "a").
  2. Make some new content, making excessive use of this argument (on a fresh install, something like "A page to test without a patch"). This should generate the path "content/a-page-to-test-without-a-path".
  3. Apply the patch.
  4. Make a new piece of content, again making excessive use of the the argument (on a fresh install, something like a "A new page to test with a patch could be used). This should generate something like "content/new-page-to-test-with-patch". If it does, the patch is working!
  5. Make a reply here stating whether the patch works, as well as your operating system (Windows, Linux, Mac OS, *BSD, ...) and what version of PHP you tested with.
moonray’s picture

As I'd mentioned previously, this patch fixes the main issue for me (though, I wasn't having the first/last term not being replaced issue).
drupal 5.7, php 5.2.1, mac os x

Freso’s picture

Ooh. I didn't notice it didn't catch the last word either. Well, I guess I'll have to test again (and update my test description). Oh well. :)

@moonray: The issue with capital letters' fix is already checked in, the patch is now only dealing with the first/last term replacement. If you haven't tested this, please do.

Freso’s picture

Version: 6.x-1.x-dev » 5.x-2.x-dev
Status: Needs review » Patch (to be ported)

Alrighty! Committed to HEAD/6.x-1.x. Now we just need a back-port for 5.x-2.x. :)

Freso’s picture

Status: Patch (to be ported) » Fixed

As the patch from #36 applies cleanly to 5.x-2.x as well, I've just committed it.

One more issue closer to 6.x-1.1 release! Now's an excellent time to go review #208860: Add per-language patterns for multilingual node types! :D

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.