Many modules that make calls to file_create_url occasionally pass an absolute URL. This could happen if one (or more) modules make serial calls to file_create_url or store absolute URLs. I ran into the issue while handling existing Image nodes.
The solution is simple: file_create_url needs to check if the URL already has host and if so return it back. A few lines of code solve the issue:
// If path already includes hostname, do not change it
$fragments = parse_url($path);
if (isset($fragments['host'])) return $path;
I updated the CDN patch with this fix (please note the patch refers to 6.x version).
Comments
Comment #1
dkruglyak commentedComment #2
wim leersIt's faster to just check if the URL begins with "http://" or "https://". Is there a reason you're not just doing that?
Thanks for your feedback!
Comment #3
dkruglyak commentedNot sure if it is that much faster. Besides if you DO parse the URL, it might be possible to add further domain remapping rules (e.g. identify which are the old source domains to be turned into new CDN'd domains). There are quite a few modules that may generate and store absolute URLs that may need remapping.
Also note that you cannot use
valid_urlcall. I found that in case of filenames that have spaces (e.g.http://absolute.com/files/multi word.png) they could get generated by some image handling modules, pass through browsers, but failvalid_urltest. Howeverparse_urlhandles them right.Speaking of absolute URL remapping, looks like some image modules generate absolute URL HTML that gets hardcoded into node bodies and teasers and may not even go through your
file_create_url. While these issues could be dealt with statically by modules such as "Search and Replace", perhaps CDN module should add a new filter to handle these remaps automagically? The filter would share code with absolute URL checker/mapper infile_create_url.URL hardcoding becomes a real problem when you have to deal with development vs. production domains
Comment #4
dkruglyak commentedChanging the title to broaden the use case...
Comment #5
dkruglyak commentedLooks like there is another requirement for URL output, in addition to taking care of absolute URLs.
In multi-site environment, files normally go into sub-directories with paths such as
sites/MYDOMAIN/files/MYFILE. This makes for very long URLs and for introducing dependence on the domain names into filepath (BAD!). There is a solution by creating/filesdirectory containing a small .htaccess file that would redirect requests made tofiles/MYFILE(tested and works):However, just adding this directory with .htaccess is not enough. All paths generated by file_create_url or stored in nodes need to lose the leading
sites/MYDOMAINfragment.So, yet another URL post-processing feature is needed. Perhaps it could be added as an optional setting?
P.S. The
/filesdirectory solution is described here http://drupal.org/node/53705. In my code above I modified it slightly to remove leading "www.". This was fully tested and works for both WWW and non-WWW cases. There is an additional thread on the issue of removing the leading /sites fragment from the URL: http://drupal.org/node/452308P.P.S. Looks like such functionality already exists in URL Replace Filter (http://drupal.org/project/url_replace_filter) and Pathologic (http://drupal.org/project/pathologic)? What is the best way to integrate them?
Comment #6
dkruglyak commentedOK, so I had a chance to investigate URL Replace and Pathologic. They fully solve the problem of stripping leading URL segments for both links (
href) and image (src) attributes. Works great for any HTML that passes through filters and nothing to add. Pathologic is seen as superior even by other module's author.These modules however cannot do anything about URLs emitted by CDN module. A great example would be a node with file attachement, that is linked both from file attachement section (through file.inc) and directly linked to absolute path from content. While Pathologic fixes the hardcoded link within content, the attachment section link still comes across as
http://www.MYDOMAIN.com/sites/MYDOMAIN/files/MYFILE. That's awful for SEO.So I modified the CDN core patch to add a section that is run only if Pathologic is enabled. It uses similar (significantly simplified) stripping logic and for its configuration relies on Pathologic setting for default Input Format. I presume the dependency on Pathologic could be completely eliminated if similar setting is added directly to CDN module, but I had no reason to rush this yet. In most cases, the setting is highly likely to be identical.
Would appreciate any comments re: this solution. I am using the code in production now with no problem.
Comment #7
wim leersSlightly off-topic, but: are you using the CDN module in production already? The daemon too? If you don't mind sharing, on which site?
I've been studying for an exam. But I'll reply to all your follow-ups tonight/tomorrow. I appreciate your feedback and contributions! :)
Comment #8
dkruglyak commentedI would describe the current status as "soft production" :)
This is a new multi-site install designed to support a large number of owned-and-operated sites as well as a large number of sites for clients. A few sites are almost released (LOL) but are still kind of under the radar for now... That is why I care so much about fixing site-specific URLs - nothing should look clunky to end users. Not using the daemon just yet but planning to implement soon. First want to stabilize core patches, site functionalities and build scripts.
Greatly appreciate all your work on this module. Very nicely done and hope we could soon make it even better.
As an aside - I just updated the patch to add support for image buttons (form.inc).
Comment #9
wim leersI'm sorry that I didn't get back to you as promised — the exam I had to make today required more effort than I had anticipated.
I'll most definitely get back to you tomorrow!
Comment #10
wim leersUnfortunately I was delayed again … and again. I really had to get my test case for my bachelor thesis up and running, and now it is: http://driverpacks.net. It should be blazingly fast (but it of course depends on your location in the world and the network speed). Anyway, that's what has prevented me from getting back to you sooner.
I committed the section of your patch that affected form.inc, to support image buttons. Thanks, I missed that one!
While I understand your concerns, they're all optimizations outside of my scope:
- support for not handling absolute URLs is in fact not necessary, instead the bugs in the modules doing this wrong should be fixed
- explicitly using the pathologic module from within core is simply unacceptable
The goal in the end is to get this patch into Drupal 7, and with the suggestions you're making (however useful they may be in certain use cases), that becomes impossible. If there is sufficient response from others to include this into this module, I will commit it. I'll leave this issue as active for some time, to allow for feedback from others on your suggestions.
Comment #11
dkruglyak commentedWim, I do not have time to argue whether supporting absolute URLs is "necessary" or "not necessary" and whether using pathologic in core is "acceptable" or "unacceptable".
I am trying to use this module for an actual multi-site setup that has to support real-world requirements. It has to interoperate with many buggy modules and work around users who constantly screw up. From the perspective of actually trying to use your module, it has many shortcomings, which I documented and addressed above. It would be great to incorporate real fixes into the official release. If not, I will of course keep a separate fork for my own use, though this would be quite an unnecessary headache.
Bottom line, I suggest you re-read my description of the problems and if you do not like my solutions suggest your own.
Thank you.
Comment #12
dave reidJust curious, would anything like Url alter module be useful at all for this module Wim?
Comment #13
gnassar commentedThank you for this patch. Saved me a ton of time.
I can understand why putting module dependencies straight into core is a bad idea. I cannot yet understand refusing to support absolute URLs -- particularly when Drupal itself makes it a point to do so (with the absolute parameter in the url() call). Please explain: How is using an absolute URL a "bug" in a module that should be "fixed?" There are very good reasons for using absolute URLs in particular cases, and not handling them at all doesn't seem like the right answer for that.
Comment #14
Steven Merrill commentedAnother example: using the flickr module and input format, calls to theme_image() will have absolute URLs to flickr images (already on a CDN, no less) in them.
Comment #15
sadist commentedeverything was working fine, until I upgraded my Drupal core to the latest 6.15 (and run the CDN patch). I checked the patches and realised that this code isn't in the patch file.
so I added it in my file.inc file_create_url($path) and it's working now.
I wonder how I managed to get it working before this?
Comment #16
wim leers@Dave Reid: not AFAIK.
@dkruglyak and likeminded people: I was wrong, you were right. Conceptually, that is: the changes you proposed did *not* belong in the core patch (i.e. in file_create_url(), where you put it), but in the CDN integration module's code. Since #610322: Upgrade the Drupal 6 core patch to match the committed Drupal 7 core patch, that is inside cdn_file_url_alter().
I've added the test that checks for absolute URLs (see attached patch): http://drupal.org/cvs?commit=323376.
Could you please reroll the code to support Pathologic to use the current version of the Pathologic module? The code also makes little sense to me, so if possible, please use more of Pathologic's functions, and otherwise include more comments. The code should go in cdn_file_url_alter(). Thanks! :)
Comment #17
raintonr commentedRe: #14. That exact same thing happens with the image.module.
Within
image_displaya call is made tofile_create_urlwhich places the CDN URL in front of the passed in URL. However, this function then callstheme('image_display'...)which has another call tofile_create_urlwhich adds the CDN URL a second time, therefore breaking all image.module images.Comment #18
dkruglyak commented@Wim: Here is an updated patch that runs CDN urls through Pathologic settings. It works great for me, but you may definitely want to review it. All it does is uses Pathologic setting for default filter to process URLs. Note, right now Pathologic module does not provide a nice way to directly call itself from outside, which is why I just did copy/paste of its regexp code. Feel free to update this patch if Pathologic provides better API in the future #758118: wild card for Additional paths
@Dave: I looked at URL Alter module and seems like all it is doing is overriding custom_url_rewrite_inbound() and custom_url_rewrite_outbound()? This is useful, but not sufficient for what CDN module needs. In fact, even before checking out your module I tried to add that same pathologic fix to custom_url_rewrite_outbound(), finding that it does not get called every time it is needed... Is your module supposed to do anything other than override these two functions?
Comment #19
dkruglyak commentedA minor fix to the patch - enforce leading '/' symbol for generated path...
Comment #20
dkruglyak commentedOne more little fix to the core patches. When pathologic is used with the latest patch enforcing leading '/', if file_create_url is called on the path several times (for example in Image module) it does not work correctly and adds redundant second '/' character in front.
The solution is to add another condition for file_create_url returning original path - whenever it starts with '/' already.
Attached is the modified core path. It is trivial - only changing the IF line above (and updating the comment).
Comment #21
dkruglyak commentedSorry, wrong patch. This one should do it...
Comment #22
wim leers#18/#19: Thanks! I've rerolled the patch to clean up comments and to get a full pass from Code Review (cdn_pathologic_support.patch). However, I realized that this code actually belongs in Pathologic. And now that a *hook* is being used to alter file URLs instead of a single callback, that has become possible. We just need to make sure that the CDN integration module runs *after* the Pathologic module. And that's quite easy to achieve.
Hence I've rolled a patch for Pathologic 3.x (pathologic_hook_file_url_alter.patch), which I would like you to review and then post to the Pathologic issue queue. In the mean time, I've already made HS run after Pathologic. Pathologic sets no weight and thus has the standard weight: zero. I've assigned the CDN integration module the weight 10.
#20/#21: You're absolutely right. In fact, this should be considered a bug in D7. Which is why I worked on a core patch that fixes this and that comes with unit tests right away: #777830: file_create_url() does not support protocol-relative nor root-relative file URLs. I've gone a bit further than you though: instead of just root-relative URLs, I made sure also protocol-relative URLs are supported now. Updated the core patch. The cool thing is that users will automatically be warned in the status report (and on all CDN integration administration pages) that there is a problem :)
Thanks for sticking along. This issue can now finally be marked fixed! :)
Commit: http://drupal.org/cvs?commit=357690
Comment #23
wim leersDarn, I forgot to upload the patches. And this issue shouldn't be marked fixed. Multitasking fail. Issue reopened.
I accidentally committed cdn_pathologic_support.patch. Reverted: http://drupal.org/cvs?commit=357704.
Comment #24
wim leersComment #25
wim leersYay, #777830: file_create_url() does not support protocol-relative nor root-relative file URLs got committed! :)
dkruglyak, any news from you? :)
Comment #26
dkruglyak commentedI have not had chance to review your new patches.
So far I just upgraded to the latest dev version of this module with my auto-build script and it is working for my use cases so far.
What do you want me to actually review? I will surely take a deeper look if my current setup breaks :)
Comment #27
dkruglyak commentedOK, with pathologic module being forcefully upgraded to 3.x, I had to reapply/retest all patches.
Pathologic module patch had to be rewritten since that module's functions changed. I attached the rewritten and streamlined patch. It might use further work, as I only implemented a bare minimum regex that might not cover all use cases. Reporting the patch to pathologic issue queue.
Basic CDN patches from the module distribution applied just fine and everything seems to be working as far as I tested.
One minor annoyance is the CDN module's stats page shows original URLs, not rewritten / post-processed one. I have not tackled that code but suggest this as enhancement.
Comment #28
dkruglyak commentedOK, a little addition here. Looks like there is need to update your CDN patch for drupal core (attached).
It has only one minor difference from your original one. In file.inc, if the path starts with leading '/' this means the base path was already prepended. If this fix is not present many modules that call url function repeatedly will append double '//' in front of the paths. Here is the revised snippet:
This fixed numerous URL-generation bugs I had in several sites.
Comment #29
gilgabar commentedThere appear to still be issues with absolute URLs. I'm using Pressflow 6.17 with CDN module 2.x-dev. It looks like cdn_file_url_alter handles an absolute path correctly and returns it to file_create_url without alteration. But since it is not altered the path is not returned immediately. Instead it is returned by the 'shipped files' condition which adds the base_path in front of it.
Perhaps the code added in cdn_file_url_alter should be in file_create_url instead as suggested by the very first post in this thread?
Comment #30
rwohlebsubscribe
Comment #31
wim leers#27:
As far as I can see, you first posted this here and now it has become an issue of its own: #820910: Non-filter URL rewriting. Am I right when I interpret this as being ready with this as far as the CDN integration module is concerned?
When you hover over a hyperlink, you see the actual URL in your status bar. If you think that's not ok, then please create a new issue for that.
#28: I think you were looking at the old code (or the 1.x code), the current code already checks for root-relative and protocol-relative URLs:
This already includes the check that you tried to add. Or am I overlooking something?
#29: There's an issue for that: #839282: Problems with shipped files and Drupal 6 file_create_url() behavior.
Tentatively marked as fixed! :)