Drupal core patch's file_create_url(): skip absolute URLs and support the pathologic module
dkruglyak - May 30, 2009 - 13:53
| Project: | CDN integration |
| Version: | 6.x-1.x-dev |
| Component: | Module |
| Category: | bug report |
| Priority: | critical |
| Assigned: | Unassigned |
| Status: | active |
Description
Many modules that make calls to file_create_url occasionally pass an absolute URL. This could happen if one (or more) modules make serial calls to file_create_url or store absolute URLs. I ran into the issue while handling existing Image nodes.
The solution is simple: file_create_url needs to check if the URL already has host and if so return it back. A few lines of code solve the issue:
// If path already includes hostname, do not change it
$fragments = parse_url($path);
if (isset($fragments['host'])) return $path;I updated the CDN patch with this fix (please note the patch refers to 6.x version).
| Attachment | Size |
|---|---|
| cdn_drupal_absolute_fix.patch | 10.06 KB |

#1
#2
It's faster to just check if the URL begins with "http://" or "https://". Is there a reason you're not just doing that?
Thanks for your feedback!
#3
Not sure if it is that much faster. Besides if you DO parse the URL, it might be possible to add further domain remapping rules (e.g. identify which are the old source domains to be turned into new CDN'd domains). There are quite a few modules that may generate and store absolute URLs that may need remapping.
Also note that you cannot use
valid_urlcall. I found that in case of filenames that have spaces (e.g.http://absolute.com/files/multi word.png) they could get generated by some image handling modules, pass through browsers, but failvalid_urltest. Howeverparse_urlhandles them right.Speaking of absolute URL remapping, looks like some image modules generate absolute URL HTML that gets hardcoded into node bodies and teasers and may not even go through your
file_create_url. While these issues could be dealt with statically by modules such as "Search and Replace", perhaps CDN module should add a new filter to handle these remaps automagically? The filter would share code with absolute URL checker/mapper infile_create_url.URL hardcoding becomes a real problem when you have to deal with development vs. production domains
#4
Changing the title to broaden the use case...
#5
Looks like there is another requirement for URL output, in addition to taking care of absolute URLs.
In multi-site environment, files normally go into sub-directories with paths such as
sites/MYDOMAIN/files/MYFILE. This makes for very long URLs and for introducing dependence on the domain names into filepath (BAD!). There is a solution by creating/filesdirectory containing a small .htaccess file that would redirect requests made tofiles/MYFILE(tested and works):RewriteEngine OnRewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{HTTP_HOST} ^(www.)?([a-z0-9-.]+)$ [NC]
RewriteRule ^(.*)$ /sites/%2/files/$1 [L]
However, just adding this directory with .htaccess is not enough. All paths generated by file_create_url or stored in nodes need to lose the leading
sites/MYDOMAINfragment.So, yet another URL post-processing feature is needed. Perhaps it could be added as an optional setting?
P.S. The
/filesdirectory solution is described here http://drupal.org/node/53705. In my code above I modified it slightly to remove leading "www.". This was fully tested and works for both WWW and non-WWW cases. There is an additional thread on the issue of removing the leading /sites fragment from the URL: http://drupal.org/node/452308P.P.S. Looks like such functionality already exists in URL Replace Filter (http://drupal.org/project/url_replace_filter) and Pathologic (http://drupal.org/project/pathologic)? What is the best way to integrate them?
#6
OK, so I had a chance to investigate URL Replace and Pathologic. They fully solve the problem of stripping leading URL segments for both links (
href) and image (src) attributes. Works great for any HTML that passes through filters and nothing to add. Pathologic is seen as superior even by other module's author.These modules however cannot do anything about URLs emitted by CDN module. A great example would be a node with file attachement, that is linked both from file attachement section (through file.inc) and directly linked to absolute path from content. While Pathologic fixes the hardcoded link within content, the attachment section link still comes across as
http://www.MYDOMAIN.com/sites/MYDOMAIN/files/MYFILE. That's awful for SEO.So I modified the CDN core patch to add a section that is run only if Pathologic is enabled. It uses similar (significantly simplified) stripping logic and for its configuration relies on Pathologic setting for default Input Format. I presume the dependency on Pathologic could be completely eliminated if similar setting is added directly to CDN module, but I had no reason to rush this yet. In most cases, the setting is highly likely to be identical.
Would appreciate any comments re: this solution. I am using the code in production now with no problem.
#7
Slightly off-topic, but: are you using the CDN module in production already? The daemon too? If you don't mind sharing, on which site?
I've been studying for an exam. But I'll reply to all your follow-ups tonight/tomorrow. I appreciate your feedback and contributions! :)
#8
I would describe the current status as "soft production" :)
This is a new multi-site install designed to support a large number of owned-and-operated sites as well as a large number of sites for clients. A few sites are almost released (LOL) but are still kind of under the radar for now... That is why I care so much about fixing site-specific URLs - nothing should look clunky to end users. Not using the daemon just yet but planning to implement soon. First want to stabilize core patches, site functionalities and build scripts.
Greatly appreciate all your work on this module. Very nicely done and hope we could soon make it even better.
As an aside - I just updated the patch to add support for image buttons (form.inc).
#9
I'm sorry that I didn't get back to you as promised — the exam I had to make today required more effort than I had anticipated.
I'll most definitely get back to you tomorrow!
#10
Unfortunately I was delayed again … and again. I really had to get my test case for my bachelor thesis up and running, and now it is: http://driverpacks.net. It should be blazingly fast (but it of course depends on your location in the world and the network speed). Anyway, that's what has prevented me from getting back to you sooner.
I committed the section of your patch that affected form.inc, to support image buttons. Thanks, I missed that one!
While I understand your concerns, they're all optimizations outside of my scope:
- support for not handling absolute URLs is in fact not necessary, instead the bugs in the modules doing this wrong should be fixed
- explicitly using the pathologic module from within core is simply unacceptable
The goal in the end is to get this patch into Drupal 7, and with the suggestions you're making (however useful they may be in certain use cases), that becomes impossible. If there is sufficient response from others to include this into this module, I will commit it. I'll leave this issue as active for some time, to allow for feedback from others on your suggestions.
#11
Wim, I do not have time to argue whether supporting absolute URLs is "necessary" or "not necessary" and whether using pathologic in core is "acceptable" or "unacceptable".
I am trying to use this module for an actual multi-site setup that has to support real-world requirements. It has to interoperate with many buggy modules and work around users who constantly screw up. From the perspective of actually trying to use your module, it has many shortcomings, which I documented and addressed above. It would be great to incorporate real fixes into the official release. If not, I will of course keep a separate fork for my own use, though this would be quite an unnecessary headache.
Bottom line, I suggest you re-read my description of the problems and if you do not like my solutions suggest your own.
Thank you.
#12
Just curious, would anything like Url alter module be useful at all for this module Wim?
#13
Thank you for this patch. Saved me a ton of time.
I can understand why putting module dependencies straight into core is a bad idea. I cannot yet understand refusing to support absolute URLs -- particularly when Drupal itself makes it a point to do so (with the absolute parameter in the url() call). Please explain: How is using an absolute URL a "bug" in a module that should be "fixed?" There are very good reasons for using absolute URLs in particular cases, and not handling them at all doesn't seem like the right answer for that.