Hi Guys,
I've successfully integrated Pathauto Hebrew URLs on my own sites in all versions since Drupal 4.6.
With each version, I had to patch it again to remove the needlessly complex logic for deleting nonprintable chars.
I am not really a master of CVS. I am more of a systems analyst and integrator, and I code slowly, so I can't do it on my own, but here are the requirements.
To be truly i18n compatible, there is almost no need for string cleaning.
All we need is to clean the chars that have special URL meaning, and make sure the URL is UTF-8 encoded.
$special_chars = array ("?",":","&","@","~","+","_","\"","'",";",".");
$output = str_replace($special_chars, "", $string);
Attached is the modified pathauto.module. I've only modified the pathauto_cleanstring function. Somehow, my change was not picked up by diff, and I don't have the time to dive into it further, but my modified code is attached. Hebrew URLs work perfectly; take a look at my blog.
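As a rough illustration of the two lines above, here is how they might sit inside a function. This is only a sketch: the function name example_strip_url_chars is hypothetical and this is not the actual pathauto_cleanstring code from the attachment.

```php
<?php
// Hypothetical helper name; NOT the real pathauto_cleanstring() implementation.
function example_strip_url_chars($string) {
  // The characters listed in the post above, removed because they carry
  // special meaning in URLs (or are unwanted in an alias).
  $special_chars = array("?", ":", "&", "@", "~", "+", "_", "\"", "'", ";", ".");
  // str_replace() operates on bytes, but every needle here is a single ASCII
  // byte, and ASCII bytes never occur inside a multi-byte UTF-8 sequence,
  // so Hebrew (or any other UTF-8) text passes through untouched.
  return str_replace($special_chars, "", $string);
}

echo example_strip_url_chars("שלום?"), "\n";
```

The design choice is a blacklist: everything is kept unless it is explicitly listed, which is why non-Latin titles survive.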
Amnon
-
Professional: Drupal Search | Drupal Israel | Web Hosting Strategies
Personal: Hitech Dolphin: Regain Simple Joy :)
| Comment | File | Size | Author |
|---|---|---|---|
| #39 | pathauto_punc_39.patch | 869 bytes | HorsePunchKid |
| #32 | pathauto_punc_32.patch | 858 bytes | HorsePunchKid |
| #31 | pathauto_punc_31.patch | 868 bytes | HorsePunchKid |
| #27 | pathauto_punc_char_143831_20070926.patch | 931 bytes | suzanne.aldrich |
| #19 | pathauto.inc_.patch | 576 bytes | agilpwc |
Comments
Comment #1
greggles
I provide a more fully featured version of this feature in 5.x-2 (which is still in "development mode" but this part works anyway) - see http://drupal.org/node/98964
If your change provides a different feature please let me know how it is different.
Thanks!
Comment #2
druvision commented
I've tested the new pathauto 5.x-2.x-dev with UTF-8 (Hebrew).
To test it, I created a new node, added some punctuation chars (?,&,=) to the node title, and then tried to save the node.
Test results
1. Great improvement! UTF8 URLs are no longer cut.
2. The only complaint I have is that the punctuation chars above (e.g. ?&=) are not removed but seemingly transliterated (I hope I used the correct word, since I haven't had the time to dive into the transliteration algorithm).
Wanting to remove those punctuation chars anyway, I tried to add some of them (?,&,=) to the field 'String to Remove'. I got tons of preg_replace errors - seemingly you are expecting a regular expression.
I got 48 identical error lines, and no alias was created.
Let's make a feature request.
Why just quotation marks? I feel the 'Quotation Marks' option should be expanded to treat other punctuation chars as well. I suggest that those punctuation chars (to be defined by the user, but defaulting to the punctuation chars above) should optionally be replaced by a dash (not removed) - otherwise some words might be concatenated.
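A minimal sketch of that request, under stated assumptions: the function name, the default character list "?&=" and the "-" separator are all hypothetical, and a single-character separator is assumed. It also shows preg_quote(), which escapes user-supplied characters so that they are matched literally instead of being read as regular-expression syntax.

```php
<?php
// Hypothetical helper; name, defaults and single-character separator are
// assumptions made for this sketch.
function example_punctuation_to_separator($string, $chars = "?&=", $separator = "-") {
  if ($chars === "") {
    // Nothing to replace; an empty character class would be an invalid regex.
    return $string;
  }
  // Escape the user-supplied characters so "?" and friends are literal.
  $class = preg_quote($chars, "/");
  // Replace every run of listed punctuation with one separator.
  $output = preg_replace("/[" . $class . "]+/u", $separator, $string);
  // Collapse repeated separators that may result from the replacement.
  $sep = preg_quote($separator, "/");
  $output = preg_replace("/(?:" . $sep . "){2,}/", $separator, $output);
  // Drop separators left dangling at either end.
  return trim($output, $separator);
}

echo example_punctuation_to_separator("one?two"), "\n";
```

Replacing rather than deleting keeps "one?two" from collapsing into "onetwo".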
All the best,
Amnon
Comment #3
druvision commented
The punctuation chars treatment suggested above would benefit other users, not only i18n, hence I am changing the title of this issue again.
Comment #4
greggles
Can you please take a look at the i18n-ascii.txt file? Its use is detailed in the INSTALL.txt and/or README.txt files.
That should allow for generalized "replace this character with something else including nothing" without cluttering up the UI too much (which is what I don't like about having a table of "replace this character with this" for every character on the settings page).
Comment #5
herb commented
Big thanks for maintaining this module. I'm using it for the first time on a "real" site and it's great. I'm having what I think is the same issue with punctuation characters like ' " ? ! ( ) etc. I've updated the i18n-ascii.txt file with the previous characters followed by = "", hoping that this would give me "nothing" instead of -039. Is there some other way to really get "nothing"? Also, I think it gave me an error when I tried to use " = "" to get "nothing" for a ". (If any of this makes sense.) Is it possible to have some characters completely ignored?
Comment #6
druvision commented
I tried to modify i18n-ascii.txt, but with no effect on the results. Then I looked at the code. i18n-ascii.txt is only used when transliteration is active. In my case, I can't activate transliteration since it will cause all non-English chars to be deleted. I need them to be preserved. Additionally, a text-file approach is very difficult to use.
Deleting punctuation characters is the common case, and it's needed for ALL langs, even for English, where no transliteration is used. It has no connection with i18n.
The easy solution: Expand the 'Quotation Marks' option to treat other punctuation chars as well. The attached patch gives an example of this option.
The attached example can easily be expanded to allow the list of special chars to be user-defined. This will save a lot of work. I suggest that the list of punctuation chars be user-defined - there is no reason why quotes are different than any other punctuation chars. This can easily *replace* the current 'Quotes' option, so no other option is needed - simply a text field below the 'Quotation marks' title and above the radio button.
Comment #7
druvision commented
Comment #8
herb commented
Thanks for this patch. While I couldn't get the patch to apply cleanly, it seemed to be just one line of code that I manually added to the 2.x-dev version. Everything translates well now, and I don't need to play with the transliteration file. I just added ")", "(", to get rid of parentheses and all looks well.
Comment #9
greggles
FWIW, I dislike this approach.
I see now that the i18n-ascii.txt file won't work for these characters since many of them aren't allowed as keys. That said, I still think we should have admin control and adding characters into the module file isn't control.
I'd like to see
1) a tab under admin/settings/pathauto for special characters
2) which would have a list of the special characters and three radios: "leave alone, replace with separator, remove"
3) code which does as the admin specifies in #2
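Step 3 could look roughly like the following. This is a sketch only: the constants, the function name and the shape of the settings array are hypothetical, not Pathauto's actual API.

```php
<?php
// Hypothetical constants and function; the names and the settings layout are
// assumptions for this sketch.
define('EXAMPLE_PUNCT_LEAVE', 0);
define('EXAMPLE_PUNCT_SEPARATOR', 1);
define('EXAMPLE_PUNCT_REMOVE', 2);

function example_apply_punctuation_settings($string, array $settings, $separator = '-') {
  // Build one replacement map from the per-character admin choices.
  $replacements = array();
  foreach ($settings as $char => $action) {
    if ($action == EXAMPLE_PUNCT_SEPARATOR) {
      $replacements[$char] = $separator;
    }
    elseif ($action == EXAMPLE_PUNCT_REMOVE) {
      $replacements[$char] = '';
    }
    // EXAMPLE_PUNCT_LEAVE adds no entry, so the character is kept as-is.
  }
  if (empty($replacements)) {
    return $string;
  }
  // strtr() with an array applies all replacements in a single pass.
  return strtr($string, $replacements);
}

$settings = array(
  "'" => EXAMPLE_PUNCT_REMOVE,
  "&" => EXAMPLE_PUNCT_SEPARATOR,
  "/" => EXAMPLE_PUNCT_LEAVE,
);
echo example_apply_punctuation_settings("rock&roll's a/b", $settings), "\n";
```

Packing all choices into one map keeps the work to a single strtr() call per string, however many characters the admin configures.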
Comment #10
herb commented
A tab with radio buttons sounds like a much better long-term solution. Since I'm not much of a coder, I'll try to test whatever patch is developed.
Comment #11
druvision commented
This is a good idea, but I am not much of a coder either. I am a systems analyst, better at the requirements definition phase. I would also love to test this code once it's available.
The tab approach allows for a broader range of characters, but we must take care to pack it in the same way, so that performance won't suffer.
Comment #12
korayal commented
I can't remove the quotation characters; instead, the word "quot" appears in the alias:
"Weeds"
http://www.hecatomber.org/film_incelemeleri/quot_weeds_quot
Or, if I try ' as a quotation mark:
'Weeds'
http://www.hecatomber.org/film_incelemeleri/%2526%2523039%3Bweeds%2526%2...
Also, I've chosen in the Pathauto config to remove quotation marks, but it doesn't work.
Comment #13
greggles
So, basically token module does a "check_plain" on lots of elements of the site, including the title. This takes the title "asdf'fsa" and turns it into "asdf&#039;fsa". By the time that pathauto gets it, the quotation mark is already gone.
Pathauto just sees the &, # and ; characters, which it then freaks out about and does crazy things to.
So, we're going to need a really good solution between pathauto and token to fix this problem...
In the meantime, if you really really need to have quotations work a certain way the solution is to stick with 5.x-1.8.
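The effect described in this comment can be reproduced in plain PHP. The sketch below assumes that check_plain() behaves like htmlspecialchars() with ENT_QUOTES; the variable names and the naive cleaning step are illustrative, not Pathauto's code.

```php
<?php
$title = "asdf'fsa";

// Assumption: check_plain() is essentially htmlspecialchars() with ENT_QUOTES,
// so the apostrophe becomes a numeric HTML entity.
$escaped = htmlspecialchars($title, ENT_QUOTES, 'UTF-8');

// A cleaner that only strips "&", "#" and ";" leaves the digits of the
// entity behind, which is where the stray 039 in aliases comes from.
$naive = str_replace(array('&', '#', ';'), '', $escaped);

// Decoding the entities first restores the real apostrophe, which can then
// be removed like any other punctuation character.
$decoded = html_entity_decode($escaped, ENT_QUOTES, 'UTF-8');
$clean = str_replace("'", '', $decoded);

echo $escaped, "\n", $naive, "\n", $clean, "\n";
```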
Comment #14
meatbites commented
Perhaps an idea I posted here will help?
Comment #15
taqwa commented
Does the patch submitted work for beta 2? If not, can you just tell me the lines you added to the code?
Comment #16
greggles
In comment #13 I said 5.x-1.8 which was vague and misleading. What I meant to say was the "Pathauto 5.x-1.x branch" though I understand now that what I actually said was horribly misleading.
Comment #17
wim leers
If you could tell me what direction this patch should go, I'll write it. This behaviour is a real pain right now.
Comment #18
greggles
See comment #9, which explains how I'd like to see it done.
See comment #13, which explains why it can't be done until some modifications are made to the token module. I'm working on the token module piece first, but that doesn't hold back any work on the pathauto portion from #9.
Comment #19
agilpwc commented
This is for the latest dev release.
In the pathauto_cleanstring function, if you move the punctuation-removal lines out of the if that checks for 'pathauto_transliterate', then punctuation is removed. Otherwise it only gets removed if you are using transliteration.
I have attached a patch against the latest dev release.
Comment #20
mscdex commented
Thanks for the fix agilpwc. I was having a problem with the "/" character not being replaced in term names (which caused problems) and this now alleviates this issue for me.
Comment #21
David Latapie commented
Just registering.
Comment #22
dgraver commented
On my site, I have team names like Texas A&M. I just want to replace the & with nothing. Unfortunately, even with the supplied code above, it converts it to texas-amp-m. The A in A&M is removed b/c it is a word to ignore (which I want to keep ignored), so that is why it is not a-amp-m. Any tips on how to get just 'am' for 'A&M'? I've been trying all sorts of preg_replace in the clean string function, and can't get it for some reason. Thanks.
Comment #23
alliax commented
It really is a problem: even a single quote (') is being translated as 039. Please fix that token and pathauto thing very quickly; it was working fine before. I understand the need for token, which should already be included in Drupal core.
Comment #24
jiangxijay commented
Subscribe to issue.
Comment #25
HorsePunchKid commented
Subscribing. Note that the patch at #19 may break your [menupath]s, among other things, since it will translate the path separators into hyphens or whatever your character is. See my note here.
Comment #26
jcruz commented
Subscribing.
Comment #27
suzanne.aldrich commented
I am attaching a kludge patch that synthesizes http://drupal.org/files/issues/pathauto.inc_.patch from comment #19 by agilpwc and also http://drupal.org/node/177272 from comment #2 by therainmakor to first strip out the HTML entities, and then the remaining punctuation.
I tested this on Drupal 5.2 with pathauto 5.x-2.x-dev released 2007-09-22 and found that it worked as a temporary fix until this can be looked at more holistically with regard to token and having a proper interface. However, enough frustrated people have clamored for a quick fix that I thought I'd pony up.
This patch changes the function pathauto_cleanstring in the file pathauto.inc: it strips out HTML entities before dealing with the remaining punctuation, and moves this stripping out of the if (variable_get('pathauto_transliterate', FALSE)) clause.
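The attached patch is not reproduced in this thread. Purely as an illustration of the order of operations it describes (entities first, then punctuation, with no dependency on the transliteration setting), a sketch might look like this. The function name is hypothetical, and the Unicode punctuation and symbol classes are an assumption rather than the patch's actual character list.

```php
<?php
// Hypothetical stand-in for the relevant part of pathauto_cleanstring();
// NOT the attached patch. The character classes used are an assumption.
function example_cleanstring($string, $separator = '-') {
  // 1. Undo entity encoding first, so "&#039;" is an apostrophe again
  //    and "&quot;" is a double quote.
  $output = html_entity_decode($string, ENT_QUOTES, 'UTF-8');
  // 2. Remove punctuation and symbols unconditionally, i.e. not inside
  //    any "is transliteration enabled" check. Note that this also
  //    removes "/", which matters for path tokens.
  $output = preg_replace('/[\p{P}\p{S}]+/u', '', $output);
  // 3. Turn runs of whitespace into the separator.
  return preg_replace('/\s+/u', $separator, trim($output));
}

echo example_cleanstring("&quot;Weeds&quot;"), "\n";
```

Because the punctuation class includes the slash, a sketch like this shares the weakness with path-style tokens that later comments describe.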
Comment #28
Wolfey commented
I have just tried out the patch that aigeanta provided in #27, and it does work. Unfortunately, it also removes slashes provided by the [catpath] token, which can break paths that make use of it.
Other than that, though, the patch works great =)
Comment #29
suzanne.aldrich commented
Hi Wolfey, thanks for testing out the patch, and I'm glad it worked for the most part. I know about the [catpath] token problem; it's mentioned in comment #25 above by HorsePunchKid (see http://drupal.org/node/178344).
Comment #30
Wolfey commented
You're welcome.
I did see that issue earlier and read his comment in this issue, but overlooked the "among other things" part. I thought only the [menupath] token was affected - my fault.
Comment #31
HorsePunchKid commented
This is a refinement of aigeanta's patch that seems to be working reasonably well for me. I tried it with a story (which has some path separators) having a relatively crazy title (Test: Foo & Bar! (#四)). It gave me the alias story/2007/09/29/test-foo-bar, which I am happy with.
Comment #32
HorsePunchKid commented
Here is a slightly different version that removes the extra regex alternation I had added to try to get rid of %20-type strings resulting from URL encoding. This version seems to handle my above test case just fine, and it also seems to handle single quotes and transliteration, which I've seen people complaining about.
None of this is to suggest that this is a long-term solution, but it appears to be working well for the time being.
Comment #33
physiotek commented
Subscribing.
Comment #34
Wolfey commented
I have just tried out HorsePunchKid's patch in #32 which, in addition to removing unnecessary punctuation (including quotation marks and apostrophes), also preserves slashes provided by the [catpath] token. Unfortunately, it preserves slashes present in the node title as well.
Aside from this, though, the patch is working great =)
Comment #35
jiangxijay commented
Patch 32 is great! Thanks for helping us remove the 039 for apostrophes!
Comment #36
ivansb@drupal.org
When is patch 32 going to make its way into beta4?
thx
Comment #37
mlncn commented
Subscribing. I can confirm just about everything on this thread and strongly endorse #32 as good enough for now. If I see one more alias with -039- for an apostrophe, I may cry.
Tested #32 (manually applying changes to pathauto.inc); it works.
http://agaricdesign.com/note/thisll-test-32s-special-character-removal-i...
Pathauto is fantastic; this would make life bliss again. I'll live with slashes sticking around uninvited.
Comment #38
HorsePunchKid commented
I'm not sure what to do about the slashes. Patch #32 may be an improvement, but I think that preserving slashes in the "Preserve alphanumerics" block is probably incorrect.
That said, I don't know what the correct solution should be. I understand that pathauto_cleanstring should only be called on the components of the string, not on the full string itself. So why the slashes were getting removed (slashes that weren't inside components) is not clear to me.
Thoughts?
Comment #39
HorsePunchKid commented
Hm. Whatever I did earlier, I apparently don't need the / to also be in the regex for preserving alphanumerics. Here's an updated patch that doesn't have that bug I apparently introduced. For those of you already on #32, just remove the \/ from the regex on line 108. The line should read:
$pattern = '/[^a-zA-Z0-9]+/';
Comment #40
Wolfey commented
HorsePunchKid, I just tried out your newest patch - it fixes the "slashes preserved in the node title" issue, but reintroduces the "slashes replaced with separators in the [catpath] token" issue.
Comment #41
HorsePunchKid commented
You're right; I just tested this on the wrong node (my node for testing event tokens rather than my catpath / menupath node). So the problem is indeed the fact that some strings getting passed to pathauto_cleanstring contain multiple components from which the slashes shouldn't get stripped (e.g. catpath), while some contain only a single component from which the slashes do need to get stripped (e.g. node title).
I don't see a good solution to this short of allowing hook_token_values to return an array or something like that. What if I've got a menu item with a slash in it? Somehow you've got to cleanstring the menupath components but not the whole thing.
For me, it's back to patch #32, since I'm not that worried about slashes in node titles.
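The "clean the components, not the whole thing" idea can be pictured with a small sketch. It assumes the path components are available as an array before they are joined, which is exactly what was not possible at this point in the thread. Both function names are hypothetical, and this is not the fix that was later committed.

```php
<?php
// Hypothetical helpers for illustration only.
function example_clean_component($component, $separator = '-') {
  // Same pattern as in comment #39: every run of non-alphanumerics,
  // including "/", becomes a single separator.
  $output = preg_replace('/[^a-zA-Z0-9]+/', $separator, $component);
  return strtolower(trim($output, $separator));
}

function example_clean_path(array $components, $separator = '-') {
  $cleaned = array();
  foreach ($components as $component) {
    $part = example_clean_component($component, $separator);
    // Skip components that end up empty after cleaning.
    if ($part !== '') {
      $cleaned[] = $part;
    }
  }
  // Only the slashes that join components survive.
  return implode('/', $cleaned);
}

echo example_clean_path(array("Film", "Drama/Comedy", "Either/Or")), "\n";
```

A slash inside a term or menu item name is cleaned away, while the slashes between components are added only after cleaning.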
Comment #42
greggles
I just committed a fix for this to CVS. It won't show up in tarballs for another 12 hours or so, so if you want to test it you'll need to get the file from CVS. It also requires using the latest version of token 5.x from CVS. If you haven't used CVS before, you can read about it at http://drupal.org/handbook/cvs/ or just wait until the tarballs are rebuilt.
Note that you will need to update your tokens because some of them have changed. Specifically you'll need to use [title-raw] instead of just [title] and similar -raw tokens.
If someone wants to build a token tester to make sure that patterns make sense that would be nice, but I don't want to spend time on it.
Comment #43
greggles
Nobody screamed, so either you haven't tested or it worked perfectly and you've gone on to more important things ;)
Marking fixed.
Comment #44
HorsePunchKid commented
Thanks for the new release, greggles! Once I got my node path settings updated (mainly, replacing title with title-raw and catpath with termpath-raw), everything worked perfectly!
Comment #45
Wolfey commented
I just updated to the newest dev release - all I had to do was modify my patterns to use [termpath] instead of [catpath] (due to the name change) and [title-raw] instead of [title] (otherwise apostrophes and quotation marks got converted to "039" and "quot", respectively).
After trying out editing and saving a few nodes, I've noticed that slashes are removed from node titles and preserved in the [termpath] token - that's exactly what I wanted Pathauto to do here!
Thank you very much for fixing this =D
Comment #46
alliax commented
Yes, thanks! It's working fine. It took a long time, but you fixed it. Congratulations!
Comment #47
(not verified) commented
Automatically closed -- issue fixed for two weeks with no activity.
Comment #48
realityloop commented
Thanks for the fix!
Comment #49
daniel wentsch commented
Subscribing.
Edit: Oops, wrong tab. Sorry for digging this up.