Our clients' newspaper site posted a story earlier today that used ‘curly single quotes’ in the title. PathAuto dutifully created a URL from it, curly quotes intact.
For whatever reason, the link worked on the taxonomy listing page, but $node->path was definitely returning the path with the unescaped quote characters, causing a 404 when someone tried to access the story through a custom-written Flash rotator on our site. The fix was easy enough -- manually removing the quotes from the path -- but I think PathAuto, if it's removing "straight" apostrophe characters, should be smart enough to remove their curly brethren as well.
Resolution
See comment #29 below for a code sample on how to add your own punctuation characters to pathauto.
Comment | File | Size | Author |
---|---|---|---|
#37 | pathauto-remove_non_ASCII-96_chars-207840-37.patch | 450 bytes | jenlampton |
#25 | pathauto-207840-25-7.x-1.x-dev.patch | 1011 bytes | balsama |
#23 | pathauto-207840-23-6.x-2.x-dev.patch | 1.02 KB | balsama |
#23 | pathauto-207840-23-7.x-1.x-dev.patch | 1.03 KB | balsama |
#18 | pathauto-n207840.patch | 379 bytes | jwilson3 |
Comments
Comment #1
greggles
Can you look at the function at the bottom of pathauto.inc and please provide a patch?
Comment #2
Garrett Albright
Could you be more specific, please? Which function?
If it's pathauto_punctuation_chars(), is it a matter of just adding the characters to the $punctuation array?
Comment #3
greggles
You got it - pathauto_punctuation_chars is the place to add it. Just look at the examples, copy, paste, add your code and descriptive text and we should be all set.
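As a rough illustration of what comments #2 and #3 describe, the sketch below adds curly-quote entries next to a straight-quote one. It is a standalone stand-in, not Pathauto's code: the function name example_punctuation_chars(), the array keys, and the 'value'/'name' layout are assumptions, and the real function would presumably wrap the labels in t(). The characters are written as UTF-8 byte escapes so the result does not depend on the file's encoding.

```php
<?php
/**
 * Hypothetical stand-in for the list built in pathauto_punctuation_chars().
 *
 * The keys and the 'value'/'name' layout are assumptions for illustration.
 */
function example_punctuation_chars() {
  $punctuation = array();

  // A straight-quote entry of the kind the thread says already exists.
  $punctuation['quotes'] = array('value' => "'", 'name' => "Single quotes (apostrophe) '");

  // New entries for the curly variants, as UTF-8 byte sequences.
  $punctuation['left_single_curly'] = array('value' => "\xE2\x80\x98", 'name' => 'Left single curly quote');   // U+2018
  $punctuation['right_single_curly'] = array('value' => "\xE2\x80\x99", 'name' => 'Right single curly quote'); // U+2019
  $punctuation['left_double_curly'] = array('value' => "\xE2\x80\x9C", 'name' => 'Left double curly quote');   // U+201C
  $punctuation['right_double_curly'] = array('value' => "\xE2\x80\x9D", 'name' => 'Right double curly quote'); // U+201D

  return $punctuation;
}
```

Each new entry would then show up as its own row in the punctuation settings, which is also why the list can keep growing, as comment #4 points out.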
Comment #4
Garrett Albright
Okay, here's a patch. Brief testing shows it works well, though I must admit I'm unfamiliar with any considerations I must keep in mind when using non-ASCII characters in PHP strings -- it has always just worked for me, including in this case. I had to adjust the spacing of the lines a bit in order to keep things lining up in nice pretty columns…
I must say, though, that I think the way this is done -- specifying each individual character that should/could be filtered out -- is somewhat untenable. My patch handles English quotes (and apostrophes), but what about quote characters in languages other than English -- and that's just quote characters… In other words, there seems to be no limit to how large this $punctuation array can grow!
Comment #5
mlsamuelson
Tried to apply the patch, but it fails:
mlsamuelson
Comment #6
Garrett Albright
I hate to say it worked for me, but… It worked for me. Are you patching against the most recent version of Pathauto? What command are you using?
Comment #7
mlsamuelson
I had the 2.x version instead of 2.0. So... when I used the correct version of the module, the patch applied correctly. Funny, that.
Not only did it apply cleanly, but it also worked as advertised.
As a test, I left the pathauto settings at default, and then created a story node with the title quotation marks ( ‘ ’, “ ” ). Pathauto aliased it as content/quotation-marks.
Then I tested setting the action as "do nothing" for left quotes in the punctuation settings of Pathauto, and left quotes appeared in the URL as expected.
Looks good.
mlsamuelson
Comment #8
greggles
Well, I would like to commit this, but it fails for me.
So...what am I doing wrong now?
Comment #9
Garrett Albright
Hmm. Well, I must admit that I'm not entirely experienced in the art of creating patches for others' consumption. I'm wondering if the problem is a (filesystem) path problem, since the opening lines of the patch do reference absolute paths to the files as they are on my drive. Greg, could you maybe try explicitly specifying the file to patch using the syntax I used (`patch path/to/pathauto.inc path/to/patch.patch`)?
EDIT: I re-RTFM'd the "Creating patches" page. Here's a new patch created from the Drupal root. Hopefully this'll work better for you folks. Sorry for my n00bishness.
Comment #10
mlsamuelson
I have no idea how I got that first patch to work. Weird.
The newest patch worked great, however. I ran it through the same tests as before, and it checked out.
mlsamuelson
Comment #11
Cameron Tod
Add this line to pathauto_punctuation_chars() to get rid of Word's irritating long hyphens:
Comment #12
Freso
Comment #13
greggles
If the Transliteration module provides support for stuff like this then that sounds excellent to me and I agree that it should be marked duplicate in favor of that.
Yet another alternate (more scalable) proposal is to provide 3 textboxes (one per action) where people can put their own punctuation and have it provide some default values.
Comment #14
Freso
Well, the goal and purpose of Transliteration is to change Unicode stuff into ASCII/ANSI stuff. Which includes punctuation. I'm not sure whether curly quotes are currently in Transliteration's tables, but they could easily be added. (See my #257041: More transliterations (x21??) for adding/updating one of the tables to include transliterating of more characters.)
I just went and looked at x20.php (Unicode hyphens are x2010 and x2012-15), and it looks like they're already taken care of. I don't know where the curly quotes are, but chances are they're already in there as well.
So, in short: Take a look at #247758: Use Transliteration module for transliteration and test it if you can! (Note that Transliteration kicks in before the character rules played with so far in this issue, so when Transliteration has transliterated “” to "" and – to --, Pathauto will then use its settings to determine whether to remove or replace or ignore these in the alias.)
Comment #15
therzog
Hi: I tried this patch and it didn't work on my test string. Instead, I had to use the pack function to specify the matching strings like this. It might help others, and it seems more portable:
Comment #16
jromine
subscribe
Comment #17
maijs
subscribe
Comment #18
jwilson3
This is a fairly major oversight, handled neither in the transliteration file (i18n-ascii.example.txt) nor in the confounded UI for specifying which quotes to strip.
I tested this and fixed it easily by adding both of the curly quotes (opening and closing) to the i18n-ascii.txt file and enabling the Transliteration option in pathauto (which is recommendable for ANY website).
I'd prefer that these two make their way into both D7 and D6. I don't have time to download and test this for D7, but I'm providing a patch for D6 for anyone inclined to run with the ball and test / reroll for D7.
Comment #20
Dave Reid
If curly quotes aren't already handled by the Transliteration module then you need to file an issue in its issue queue.
Comment #21
jakew
I'm having this issue with 6.x-1.5.
Comment #22
balsama
@Dave Reid, am I missing something? Isn't the original request to add support for curly quotes to pathauto?
Patches attached for 6.x-2.x-dev and 7.x-1.x-dev versions of the module.
Both patches add support for the following characters:
Comment #23
balsama
Last patch had a syntax error. Reattaching.
Comment #25
balsama
One more try.
Comment #26
balsama
Comment #27
balsama
Ok. Now that I read this a little closer, it looks like the official stance of pathauto is that "if your site is likely contain characters beyond ASCII 128" you should just use the transliteration module.
I still think it would be a good idea to include the above patch in pathauto OR make transliteration a requirement since, sooner or later, a large percentage of sites are likely to have an editor paste a curly quote into a title used for pathauto. But since it looks like the module maintainers are going a different route, I don't want to clutter the issue queue.
Comment #28
fletchgqc
Use Case Overlooked
I apologise for re-opening this issue, but I believe a reasonable use-case has been overlooked. If I'm wrong just close it again. Here is the use-case: I want to use native characters for Japanese, Russian, whatever (i.e. I don't want to transliterate), but I don't want punctuation in my URLs. How do I do this?
The Transliteration module is not easily customisable. You can't decide to only transliterate punctuation and nothing else. So transliteration can't do the job, but neither can pathauto, currently.
You might say: "if you are happy with native characters, then be happy with curly apostrophes". But this is not fair. The point is that I don't want any punctuation at all. Now, admittedly we are sort of opening a can of worms to start supporting the removal of every kind of punctuation, because sooner or later someone will ask you to remove the ¿ character, etc.
So far I put my punctuation into the box marked "Strings to remove. Don't use this for punctuation" :-). I don't know why I'm not allowed to use this for punctuation, because it seems to work. But it doesn't work for curly speech marks pointing right. I know because I just tried it and ended up finding this issue.
A More Scalable Solution?
So I'm fully in support of the proposed patches. But I propose a more scalable solution. Scrap the long list of punctuation and just have three boxes:
There is a third option on the punctuation select lists: No action. But obviously this can be implemented by just not including that in either of the boxes.
With this solution, the subject will never need to be discussed again. Everyone will just put the punctuation they need into these boxes. And the screen will get a lot shorter :-). Default content of the boxes could be based on the current default settings of the select boxes. Does this sound reasonable?
Comment #29
codesidekick
Side note to this issue:
Don't feel like patching Pathauto or installing another module? You can implement the alter hook pathauto_punctuation_chars_alter() like so:
Add any more punctuation you don't want in paths, clear the cache and enjoy.
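A minimal sketch of such an alter hook, for readers who want a starting point. The module name mymodule is hypothetical, and the array keys and the 'value'/'name' layout are assumptions that should be checked against the Pathauto version in use. A real module would normally wrap the labels in t(); that is left out here so the snippet runs on its own.

```php
<?php
/**
 * Implements hook_pathauto_punctuation_chars_alter().
 *
 * "mymodule" is a hypothetical module name. The characters are written as
 * UTF-8 byte escapes so the file's encoding cannot corrupt them.
 */
function mymodule_pathauto_punctuation_chars_alter(array &$punctuation) {
  $punctuation['left_single_curly'] = array(
    'value' => "\xE2\x80\x98", // U+2018
    'name' => 'Left single curly quote',
  );
  $punctuation['right_single_curly'] = array(
    'value' => "\xE2\x80\x99", // U+2019
    'name' => 'Right single curly quote',
  );
  $punctuation['left_double_curly'] = array(
    'value' => "\xE2\x80\x9C", // U+201C
    'name' => 'Left double curly quote',
  );
  $punctuation['right_double_curly'] = array(
    'value' => "\xE2\x80\x9D", // U+201D
    'name' => 'Right double curly quote',
  );
}
```

Because the hook receives the list by reference, entries added here should appear alongside the built-in ones in the punctuation settings once the cache is cleared.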
Comment #30
jwilson3
^ That's a great suggestion. Thank you. I think with this we can probably re-close this?
Comment #30.0
jwilson3
Added a reference to comment #29.
Comment #31
jberg1
This doesn't seem to work for me. I patched pathauto.inc. Now on URL Aliases->Setting, I see the new Double curly left, Double curly right, Single curly right, Single curly left, but the ( ) next to each is empty. And it is not removing those punctuation marks from the auto-generated path.
Am I doing something wrong?
Thanks for any help.
Comment #32
millenniumtree
Be sure to clear your cache, and also make sure your editor isn't messing up the 'fancy' characters you have to enter into the hook definition function.
I added the unicode 'dash' to mine.
Someone entered one of these into a node title and pathauto actually killed the whole page with a PHP error. :P
Comment #33
jberg1
I made sure to clear the cache, and I'm just using a plain text editor to add the characters (not sure how else to place them). I'm also running into the same issue with "é". The parentheses are just empty and it doesn't remove the character. How else could I enter those "fancy" characters into the function so it recognizes them?
Thanks for any help.
Comment #34
jwilson3
@jberg1, you will be better off using the Transliteration module, which could help you do clever things like convert the "é" to a regular "e", to create a legible URL.
Comment #34.0
jwilson3
Link to comment #29.
Comment #35
jenlampton
I understand that transliteration is the recommended solution here, but I don't have any reason to use transliteration on my site other than to remove the curly quotes from URLs, and it *appears* that there's already an option for removing them right here in pathauto. I see an option for Reduce strings to letters and numbers which says "Filters the new alias to only letters and numbers found in the ASCII-96 set."
Are curly quotes in the ASCII-96 set? I checked online and it didn't look like it.
I'm going to re-open this issue because it appears as though the solution to removing curly quotes from URLs should be as simple as checking this option. But that's not working. (I'm also changing the status of this issue to a bug.)
Comment #36
jenlampton
Okay, I see what's going on. Instead of removing the characters that aren't in the ASCII-96 set, it looks like pathauto is replacing them with separators.
It's certainly not clear from the checkbox description that this is what will happen, and I think the expected behavior (removal) is more likely the intended feature. I'm going to write a patch that strips them out instead of adding the separator, but if the current behavior is in fact what's intended, the description should be updated instead.
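To make the difference concrete, here is a minimal sketch of the two behaviours. This is not Pathauto's actual code: both function names are hypothetical, and it assumes the reduce step is a plain regular-expression filter over the alias with "-" as the separator.

```php
<?php
/**
 * Replace variant (hypothetical): every run of characters outside
 * a-z, A-Z, 0-9 and "/" collapses into a single separator.
 */
function example_reduce_replace($alias, $separator = '-') {
  return preg_replace('/[^a-zA-Z0-9\/]+/', $separator, $alias);
}

/**
 * Strip variant (hypothetical): the same characters are removed outright,
 * while separators already present in the alias are kept.
 */
function example_reduce_strip($alias, $separator = '-') {
  $pattern = '/[^a-zA-Z0-9\/' . preg_quote($separator, '/') . ']+/';
  return preg_replace($pattern, '', $alias);
}

// An alias with curly single quotes (U+2018, U+2019) around "quotation".
$alias = "content/\xE2\x80\x98quotation\xE2\x80\x99-marks";

echo example_reduce_replace($alias), "\n"; // content/-quotation-marks
echo example_reduce_strip($alias), "\n";   // content/quotation-marks
```

The replace variant leaves a stray separator where the opening quote was, which matches the surprise described above; the strip variant gives the alias most people would expect.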
Comment #37
jenlampton
Comment #38
greggles
What would you change the description to?
Comment #39
Dave Reid
I still think the punctuation solution is the best one, compared to changing the reduce ASCII method, although I agree that the description of that feature could be improved (but that should be filed as a separate issue).
Comment #40
temkin
Agreed that changing the 'reduce_ascii' logic may come as a surprise to site owners who already rely on the current implementation. The suggested solution is to use hook_pathauto_punctuation_chars_alter. I also created a follow-up ticket to improve the description of the 'reduce_ascii' option and avoid confusion in the future - #2905169.
Changing this ticket to "Won't fix", but please re-open if there are any objections.
Comment #41
jwilson3
I have a library client with a lot of extra punctuation marks in their titles. Maybe this would be useful for someone else (or even myself) in the future:
Comment #42
codewatson
For myself and others, @jwilson3 #41 provided a good start (thank you!), but I found that just putting the characters in the value did not work. I had to convert them to UTF-8 code units using PHP's pack() function:
This was helpful for finding the correct codes: https://r12a.github.io/app-conversion/
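A minimal sketch of the pack() approach, assuming the characters are given as the hex of their UTF-8 bytes. The helper example_utf8_from_hex(), the module name mysite, and the array keys are hypothetical, and the 'value'/'name' layout is an assumption about the hook's data.

```php
<?php
/**
 * Hypothetical helper: builds a character from the hex of its UTF-8 bytes,
 * so the source file never has to contain the raw character.
 */
function example_utf8_from_hex($hex) {
  return pack('H*', $hex);
}

/**
 * Implements hook_pathauto_punctuation_chars_alter().
 *
 * "mysite" is a hypothetical module name.
 */
function mysite_pathauto_punctuation_chars_alter(array &$punctuation) {
  $punctuation['left_single_curly'] = array(
    'value' => example_utf8_from_hex('e28098'), // U+2018
    'name' => 'Left single curly quote',
  );
  $punctuation['en_dash'] = array(
    'value' => example_utf8_from_hex('e28093'), // U+2013
    'name' => 'En dash',
  );
  $punctuation['em_dash'] = array(
    'value' => example_utf8_from_hex('e28094'), // U+2014
    'name' => 'Em dash',
  );
}
```

Building the bytes in code sidesteps the editor-encoding problem entirely, at the cost of entries that are harder to read at a glance.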
Comment #43
jwilson3
Interesting, @codewatson. Maybe your copy/paste into your code editor didn't work because the file was opened and then saved with the wrong file encoding? I use Sublime Text, which has an option under File > Save with Encoding > UTF8.