New pathauto user. My punctuation settings are set for all characters to be replaced. I did a bulk update for all nodes and found that a title like 'The "Thhwwwpt" Factor' (don't ask :) would be aliased as 'quotthhwwwptquot-factor'.

Don

Comments

greggles’s picture

Status: Active » Postponed (maintainer needs more info)

Literally the string "quot" is in there?

Are you using token 5.x-1.x-dev with the [title-raw] token?

mrtoner’s picture

Yes, the literal string "quot" was in there. Also, a blog post I created today replaced an apostrophe with 039, "writer039s-block". I am using the [title] token.

As I noted in another issue a few moments ago, I have now updated to 5.x-1.x-dev of token (from 5.x-1.8); I will let you know if I have any problems with that.

Don

greggles’s picture

This is why you need to use the -raw tokens ;)

Can you give those a try?

Thanks.

mrtoner’s picture

Changing to the [title-raw] token solved that problem. Sorry -- given the warning on the settings page I thought it wasn't a good idea to use the -raw tokens.

Don

m3avrck’s picture

I ran into this same problem and immediately thought this was a bug.

Looks like this is a major usability issue and should be on by default IMO.

suzanne.aldrich’s picture

I concur about the [x-raw] tokens needing to be explained more; the punctuation issue wasn't completely fixed until after I replaced my pre-existing paths with the raw versions after the upgrade to the latest dev.

tormu’s picture

I guess this was working allright in the past but yeah I second this opinion that the [x] should produce the desired action instead of [x-raw]. Doesn't using the raw indicate more possibilities for ugly paths or some kind of security risks - not sure about that - but it does lack usability at this point.. Can you, greggles tell the downsides of sorting this 'flaw' by using the [raw]'s? I mean why is the warning there exactly?

I tested the characters and all is set to remove:
' x " x % x ¤ x # x / x ( x ) x = x + , ` x . x - x _ x : x ; x * x ^ x ~ x ? x $ x ! x < x > x \ produces
-x-quot-x-x-%C2%A4-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-lt-x-gt-x with [title] and
-x-x-x-%C2%A4-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x-x with [title-raw], which also shows that maybe the character shift-4 should be handled too.

greggles’s picture

Title: Double Quotes Not Replaced » make the need for the use of -raw more obvious
Status: Postponed (maintainer needs more info) » Active

Here's a little more information than you probably wanted, but I think it's necessary to get good progress on this issue.

Why we need to use the -raw tokens:
Pathauto now relies on token module for the pattern replacement. Token module used to only provide one set of tokens that was already "checked" i.e. sent through check_plain. Sending it through check_plain means that, for example the title tokens for a page with the title {string" with & and'stuff}

[title] => string" with & and'stuff
[title-raw] => string" with & and'stuff

Pathauto 1) already had code to deal with punctuation characters in their normal format (i.e. to deal with ' but not ') and 2) I didn't really fancy trying to figure out the html encoded versions of all these characters to have pathauto deal with them and 3) there was already this issue where admins had said that they wanted control over exactly what happens to punctuation so I felt that looking for "'" was a potential data loss issue where if the admin says replace ' then it's clear what they mean.

Why it's OK to use the -raw tokens:
These aliases are used by the alias system. You can type whatever you want into the alias input form - there is no validation aside from the uniqueness of the alias. That's OK because url which handles every url in Drupal uses drupal_urlencode so the data is safe to be used in the context of links.

Defaults for new installs:
The current 5.x-2.x-dev will 1) default to use -raw tokens where possible and 2) defaults to replacing the hyphen with a separator. It also checks to make sure that your separator punctuation character hasn't also been set for "remove from string" which can cause some problems. I believe that the defaults are now sane on a fresh install of Pathauto.

Help for Upgrades:
This is the part that needs the most help. I don't really want to add code that will validate the tokens input by the user, but I think I might need to do that to make it easier for users. I thought that I had added text to the CHANGELOG.txt and INSTALL.txt about this but apparently not. I don't like the idea of automatically replacing tokens because that is basically a fancy data loss bug.

I'm *very* open to ideas for how to improve this. I'll be fixing some other pathauto bugs today and will look forward to your feedback on where/how to best let upgrading users know about this change.

tormu’s picture

at least for me the basic question without answer is:
Why does the -raw exist? If it's safe to use the unfiltered users inputs on the url paths, then why doesn't [title] produce the exact same as does the [title-raw] and therefore why not get rid of the "-raw"s and make the plain ones do the same the -raws would do?
I know I might be asking things that are obvious to you, but I'm trying to give the users opinion :)

greggles’s picture

It seems my example was filtered or I pasted it wrong something. I should have said:

[title] => string" with & and'stuff
[title-raw] => string" with & and'stuff

I'lll try posting again before I explain...

greggles’s picture

So, in fact the issue queue replaces the html character codes with the actual character. Sweet.

[title] => string" with & and'stuff

But the " shows as & quot;
The & shows as & amp;
The ' shows as & #039;

greggles’s picture

Ok, so by inserting a space between the & and the code I was able to show you what is really coming through inside of the strings. If you can take that knowledge from number 11 of what the punctuation characters look like and my explanation from #8 together, perhaps that explains it.

If not, perhaps the shorter explanation over here will help: http://groups.drupal.org/node/6706

greggles’s picture

Status: Active » Needs review
StatusFileSize
new7.17 KB

The attached patch does a couple things.

First, it documents the use of the -raw tokens in the Admin > Help > Pathauto

Second, it checks each of your patterns and if there is a "-raw" companion to the token you are using it will warn you that you probably really want to use the -raw token.

Third, it makes sure that the token you are using makes sense for the name space where you are using it.

There is a slight bug which is that if you have an error and you save the configuration then it will display the error message twice. This has something to do with the static variable I'm using which is just a little beyond me at the moment.

greggles’s picture

Title: make the need for the use of -raw more obvious » make the need for the use of -raw more obvious and warn if users use the wrong tokens
Assigned: Unassigned » greggles

updating title to reflect everything this does.

greggles’s picture

An improved version
1) I fixed the whole "double message" problem so it only displays the messages once - this fixed a rather confusing bug where it would nag you about problems you had just fixed.
2) I added a bit of help text underneath the textboxes that are in error
3) spacing and some other stuff I think

I believe this is ready to go but it would be nice to get a second review.

greggles’s picture

StatusFileSize
new8.65 KB

and the patch...

greggles’s picture

Status: Needs review » Fixed

I've been running this patch on my personal sites since I made it and it seems to be quite helpful without causing any errors or odd behavior (at least none that I found).

So, I've applied it in the hopes that more people can test out the -dev release than can test patches.

I'd still love to hear feedback on the subject in case this can be made even more obvious (m3avrck? aigenta? jari?)

Thanks.

m3avrck’s picture

Greg this seems better for sure.

Now my question is, it seems *most* people will want that -raw option, so should this be made the default instead of just regular? Or rather, added a "recommend" or "this is mostly likely what you want" next to it or in the description to help users?

greggles’s picture

It is the default for new installs. If there are patterns in the default that don't match this then please file an issue for that.

Only upgrades are in the un-enviable position of getting to see errors from this bit of code (which is why I was hesitant to add it in the first place - it's only applicable for a short time).

m3avrck’s picture

Gotcha, sounds good to me!

Perhaps even remove this code *after* the subsequent / beta / release?

greggles’s picture

So, the original patch did a great job of checking the default patterns, but I forgot to make it check more specific patterns - e.g. if your Default path pattern for nodes was wrong it would catch that, but if your "Pattern for all blog entry paths" was bad it would let that slip by.

I just committed a fix for that.

greggles’s picture

I had two more issues with this that I wanted to document more thoroughly

The first was that I forgot to "unset" the error message after my fix in comment #21 which meant that once any pattern in the node section (for example) was found bad then the highlighting message would be included on all the following text boxes in that section.

The second was that my regex was greedy and has to allow -, so if you had [token1]-[token2] then it would see those as a single token of [token1-token2]. Making it "ungreedy" seems to have fixed this.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.