file_create_url should return valid URLs
bjaspan - March 24, 2008 - 23:04
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | file system |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | duplicate |
Description
I noticed that when webform.module receives a file upload containing a space in the filename, it emails a link with that space included in the URL instead of escaped as + or %20. For example:
Submitted values are:
File: http://mysite.com/files/webform/barrytest/Barry Test.jpg
I think the problem lies in file_create_url(). It says:
<?php
switch (variable_get('file_downloads', FILE_DOWNLOADS_PUBLIC)) {
case FILE_DOWNLOADS_PUBLIC:
return $GLOBALS['base_url'] .'/'. file_directory_path() .'/'. str_replace('\\', '/', $path);
?>Shouldn't $path be urlencode()d there? The patch is trivial but I do not have time at the moment to check to see whether it will break anything.

#1
Here's an untested patch. It's aganst D5.7 but applies against HEAD too.
#2
The previous patch is wrong. Here is a lightly tested version. I still have not verified that modifying file_create_url()'s output like this won't break anything, but at least the url is right now.
#3
seems to work correctly for me, is there some reason we don't just pass the URL through url()? it would handle the escaping for us...
#4
drewish: that was also my first thought.
#5
At least four reasons:
1. url() doesn't urlencode() absolute URLs at all...which is what we plan to hand it.
...and if url() did urlencode() absolute urls:
2. "http://..." would turn, in the least horrific case, into "http%3A//", which would ruin it.
3. url() ultimately uses PHP's urlencode(), not rawurlencode(). The former translates spaces into '+'. When webservers serve files, they don't understand '+'. They must see '%20'. '+' is only valid in a query string.
4. urlencode() tranlates '/' into '%2F'. When serving files, Apache would return 404 for this. That's probably the reason Barry revised his patch.
As for the proposed patch:
It doesn't encode spaces in folder names, does it? So I'm marking this 'needs work'.
#6
I just saw the following issue:
#154245: problem with filenames with '%' character
I'd say these two issues can be merged. I'm for marking this one a 'duplicate'.
#7
@#5: My patch does not address spaces in folder names because folder names are created on the server, not by an uploading user, so presumably the server knows what it is doing. I would not object to a patch that handled them, though.
@#6: I'm not familiar with the referenced issue but if it addresses the same problem and is older I have no obection to marking this issue as a duplicate.
#8
Would be great if we could create a test case for this.
#9
file-create-url-238299-2.patch adds a
/.to the path for files in the files folder (i.e. not in subfolders) like this:http://example.com/sites/example/files/./foo.txt
This is probably mostly harmless, but I assume it is not intended.
For testing purposes I have been using a file called
foo . # %20 % & + ! ' $ %2F %26 %23 é.gif.#10
I am interested in this solution. Far too often I'm having to tell folks how to create filenames that can be uploaded to the net neatly. I'd like to see the upload module to work like path_auto where it strips out all bad characters when it is creating the file. Just had a admin earlier today try to upload a file he'd named 'example #2.pdf' but really drupal should be smart enough to either provide an alert that this won't work, or rename it to 'example_2.pdf' which would be my preference.
Wouldn't it be nice if:
foo . # %20 % & + ! ' $ %2F %26 %23 é.gif.
became
foo_._!_e.gif
or something like that? Shouldn't human readable be a goal even for uploaded files?
#11
Here is an updated patch along with a unit test. A number of files with various evil filenames are created and then fetched through the webserver.
All tests passes on Linux. On Windows, some characters are not allowed in filenames AFAIK. Perhaps file_create_path() or similar could be extended to always return a valid filename no matter what garbage you throw at it. But this issue about filename clean-up is a subject worth its own bug.
So far the test only covers FILE_DOWNLOADS_PUBLIC (but this is not a problem wrt to the subject of this issue). I am not sure how to make a unit test implement hook_file_download() in order to make the files accessible via /system/files/...
#12
I have now added tests for private files as well, with and without clean URLs enabled. This revealed a bug in drupal_urlencode() that I also fixed.
I had to change url() to make it not cache the output from
variable_get('clean_url', 0)in a static variable. Without this change, url() wouldn't respond to changes to theclean_urlvariable after the first call to url(). The comment in url() claimed that the value was cached for performance reasons, but I doubt the change has any noticeable impact - variable_get() is extremely fast.The patch doesn't break any existing tests. I do get 2 exceptions in each of the DBLog and User modules, but that also happens with a clean CVS check-out.
#13
Looks like I forgot to upload the actual patch mentiond in #12.
#14
Minor update (now uses $this->assertResponse(200)).
#15
I installed Drupal on Windows to see how it deals with filenames containing funny characters. Appearently – at least on my WinXP machine with PHP 5.2.6 – PHP assumes that all files are encoded using the Windows-1252 character set. If Drupal writes a file using
file_put_contents('flødeskum.txt', 'stuff'), the filename readsflødeskum.txtin Explorer. If I create a file using Explorer with a filename containing a character not in Windows-1252 (e.g. a chinese character), readdir() returns?instead of that character, and glob() doesn't find the file at all!So it looks like Drupal cannot save files using their proper names on Windows. But luckily it isn't a big issue. The audience (i.e. everybody else except the site owner) wont notice any (big) difference, though. One exception: When using public files, the URLs contain more %-encoded characters than they do on Linux, and if you try to decode the %-encoding, you will not get the original filename but a mangled version.
Well, this updated patch tries to make the best of it on Windows. It also strips characters that are not allowed in filenames on Windows (those are
* : ? " < > |). All the supplied tests now pass on Windows as well as on Linux.#16
Slightly simplified by incorporating the patch from #284899: Drupal url problem with clean urls.
#17
It looks as if file_create_url-6.patch double-encodes more characters than necessary, so please review file_create_url-5.patch instead.
#18
PHP on Windows' lack of support for UTF-8 appears to be confirmed by the last comment in this bug report.
#19
Chasing HEAD.
Everybody, please try out the patch and see if you can come up with a filename that it doesn't support. For instance, you can test it using the following steps:
Passes all tests (except two that also don't pass from a clean CVS checkout - this seemingly unrelated problem is reported in #299290: User tests exceptions).
#20
Chasing HEAD (following the big update of file.inc in #308434: Clean up file.inc ahead of hook_file and add unit tests.).
#21
Chasing HEAD.
#22
Chasing HEAD (following the big hook_file commit).
#23
i didn't look at the file.inc code skipping straight to the unit tests to see what the desired behavior was and had a very hard time following your unit tests. it seems like you've pushed too much code down into the _testFileCreateUrl() function. i'd suggest being more verbose and less "efficient" in terms of the code. helper functions are alright but the test should be a readable example of put in an input or series of inputs and check for an expected output.
feel free to hit me up on IRC if you'd like to discuss this.
#24
Reworked the unit tests based on IRC discussion with drewish.
#25
The tests are much better now. They all pass on OSX with Apache2.
In file_create_url() I think I understand what's happening but I don't follow the comment:
+ if (substr(PHP_OS, 0, 3) == 'WIN') {+ // PHP on Windows assumes that filenames are encoded in Windows-1252, but
+ // Drupal passes UTF-8-encoded filenames to filesystem functions, so
+ // non-US-ASCII characters end up mangled in the filesystem. We thus need
+ // to do the same mangling to the URLs in order for them to work.
+ $path = utf8_encode($path);
+ }
Looking at file_create_filename():
function file_create_filename($basename, $directory) {- $destination = $directory . '/' . $basename;
+ if (substr(PHP_OS, 0, 3) == 'WIN') {
+ // These characters are not allowed in Windows filenames
+ $basename = str_replace(array(':', '*', '?', '"', '<', '>', '|'), '_', $basename);
+ }
+
+ $destination = strtr($directory . '/' . $basename, '\\', '/');
It seem to me that the strtr is really windows specific code that could be moved up into the block where we filter the windows blacklist.
#26
also wanted to cross link to #135359: URLs broken for absolute file system paths
#27
The comment in file_create_url() is now more elaborate. Also, utf8_encode() is replaced with drupal_convert_to_utf8($path, 'Windows-1252'), because the former converts from ISO-8859-1. It actually turned out to mean a difference if one of the octets in the UTF-8-encoded string was in the range 0x80-0x9F where ISO-8859-1 and Windows-1252 differ.
Backslash is now only treated specially on Windows. On other platforms it is treated as any other character.
#28
c960657, will this affect existing files?
#29
The patch only affects URL generation and not how incoming requests are handled or how files are saved on the disk, so any existing URLs should continue to work.
The changes to the URL generation fixes all problems I could find using Apache 2.2 on WinXP and on Linux (on Linux I tested with PHP running as CGI as well as an Apache module). I cannot say how well it works on other platforms, though, but the unit tests will allow people to report any problems that may occur on their specific platform.
#30
i'd like to have this wait on #329226: Store file_test.module's values in variables rather than globals so we can pass the file_test_file_download() header via the file_test_set_return() function rather than adding a separate variable.
#31
Another cross-link: #278770: file_create_url only returns absolute URL (containing the domain name)
#32
Updated Doxygen comment for drupal_urlencode() (mod_rewrite actually unescapes all %-encoded characters).
#33
#329226: Store file_test.module's values in variables rather than globals got committed so here's a re-roll around it. all the tests look good.
#34
marked #312354: Attaching files with [#?]-characters results in broken link as a dupe.
#35
Subscribing
#36
The last submitted patch failed testing.
#37
bumping after the test bot meltdown.
#38
Updated to new comment convention (see #338403: Remove implementation of... comments from setUp(), getInfo() or tearDown()).
#39
Reroll.
#40
The last submitted patch failed testing.
#41
Reroll.
#42
Reroll.
#43
The last submitted patch failed testing.
#44
Reroll.
#45
The last submitted patch failed testing.
#46
Reroll.
Part of the previous patch was committed in #356721: Regression: Test failures with clean URLs, so there is now less to review :-)
#47
#48
The last submitted patch failed testing.
#49
Work on this has proceeded in #284899: Drupal url problem with clean urls.