Many modules that make calls to file_create_url occasionally pass an absolute URL. This could happen if one (or more) modules make serial calls to file_create_url or store absolute URLs. I ran into the issue while handling existing Image nodes.

The solution is simple: file_create_url needs to check if the URL already has host and if so return it back. A few lines of code solve the issue:

  // If path already includes hostname, do not change it
  $fragments = parse_url($path);
  if (isset($fragments['host'])) return $path;

I updated the CDN patch with this fix (please note the patch refers to 6.x version).

Comments

dkruglyak’s picture

Status: Active » Needs review
wim leers’s picture

Version: 5.x-1.0-rc2 » 6.x-1.x-dev
Status: Needs review » Active

It's faster to just check if the URL begins with "http://" or "https://". Is there a reason you're not just doing that?

Thanks for your feedback!

dkruglyak’s picture

Status: Active » Needs work

Not sure if it is that much faster. Besides if you DO parse the URL, it might be possible to add further domain remapping rules (e.g. identify which are the old source domains to be turned into new CDN'd domains). There are quite a few modules that may generate and store absolute URLs that may need remapping.

Also note that you cannot use valid_url call. I found that in case of filenames that have spaces (e.g. http://absolute.com/files/multi word.png) they could get generated by some image handling modules, pass through browsers, but fail valid_url test. However parse_url handles them right.

Speaking of absolute URL remapping, looks like some image modules generate absolute URL HTML that gets hardcoded into node bodies and teasers and may not even go through your file_create_url. While these issues could be dealt with statically by modules such as "Search and Replace", perhaps CDN module should add a new filter to handle these remaps automagically? The filter would share code with absolute URL checker/mapper in file_create_url.

URL hardcoding becomes a real problem when you have to deal with development vs. production domains

dkruglyak’s picture

Title: Handle absolute URLs passed into file_create_url » Handle absolute URLs in file_create_url and stored content

Changing the title to broaden the use case...

dkruglyak’s picture

Title: Handle absolute URLs in file_create_url and stored content » URL post-processing in file_create_url and stored content

Looks like there is another requirement for URL output, in addition to taking care of absolute URLs.

In multi-site environment, files normally go into sub-directories with paths such as sites/MYDOMAIN/files/MYFILE. This makes for very long URLs and for introducing dependence on the domain names into filepath (BAD!). There is a solution by creating /files directory containing a small .htaccess file that would redirect requests made to files/MYFILE (tested and works):

RewriteEngine On
RewriteCond %{REQUEST_FILENAME} !-f
RewriteCond %{REQUEST_FILENAME} !-d
RewriteCond %{HTTP_HOST} ^(www.)?([a-z0-9-.]+)$ [NC]
RewriteRule ^(.*)$ /sites/%2/files/$1 [L]

However, just adding this directory with .htaccess is not enough. All paths generated by file_create_url or stored in nodes need to lose the leading sites/MYDOMAIN fragment.

So, yet another URL post-processing feature is needed. Perhaps it could be added as an optional setting?

P.S. The /files directory solution is described here http://drupal.org/node/53705. In my code above I modified it slightly to remove leading "www.". This was fully tested and works for both WWW and non-WWW cases. There is an additional thread on the issue of removing the leading /sites fragment from the URL: http://drupal.org/node/452308

P.P.S. Looks like such functionality already exists in URL Replace Filter (http://drupal.org/project/url_replace_filter) and Pathologic (http://drupal.org/project/pathologic)? What is the best way to integrate them?

dkruglyak’s picture

Title: URL post-processing in file_create_url and stored content » URL stripping in file_create_url and stored content
Status: Needs work » Needs review
StatusFileSize
new11.56 KB

OK, so I had a chance to investigate URL Replace and Pathologic. They fully solve the problem of stripping leading URL segments for both links (href) and image (src) attributes. Works great for any HTML that passes through filters and nothing to add. Pathologic is seen as superior even by other module's author.

These modules however cannot do anything about URLs emitted by CDN module. A great example would be a node with file attachement, that is linked both from file attachement section (through file.inc) and directly linked to absolute path from content. While Pathologic fixes the hardcoded link within content, the attachment section link still comes across as http://www.MYDOMAIN.com/sites/MYDOMAIN/files/MYFILE. That's awful for SEO.

So I modified the CDN core patch to add a section that is run only if Pathologic is enabled. It uses similar (significantly simplified) stripping logic and for its configuration relies on Pathologic setting for default Input Format. I presume the dependency on Pathologic could be completely eliminated if similar setting is added directly to CDN module, but I had no reason to rush this yet. In most cases, the setting is highly likely to be identical.

Would appreciate any comments re: this solution. I am using the code in production now with no problem.

wim leers’s picture

Slightly off-topic, but: are you using the CDN module in production already? The daemon too? If you don't mind sharing, on which site?

I've been studying for an exam. But I'll reply to all your follow-ups tonight/tomorrow. I appreciate your feedback and contributions! :)

dkruglyak’s picture

StatusFileSize
new12.32 KB

I would describe the current status as "soft production" :)

This is a new multi-site install designed to support a large number of owned-and-operated sites as well as a large number of sites for clients. A few sites are almost released (LOL) but are still kind of under the radar for now... That is why I care so much about fixing site-specific URLs - nothing should look clunky to end users. Not using the daemon just yet but planning to implement soon. First want to stabilize core patches, site functionalities and build scripts.

Greatly appreciate all your work on this module. Very nicely done and hope we could soon make it even better.

As an aside - I just updated the patch to add support for image buttons (form.inc).

wim leers’s picture

I'm sorry that I didn't get back to you as promised — the exam I had to make today required more effort than I had anticipated.

I'll most definitely get back to you tomorrow!

wim leers’s picture

Title: URL stripping in file_create_url and stored content » Drupal core patch's file_create_url(): skip absolute URLs and support the pathologic module
Priority: Critical » Minor
Status: Needs review » Active

Unfortunately I was delayed again … and again. I really had to get my test case for my bachelor thesis up and running, and now it is: http://driverpacks.net. It should be blazingly fast (but it of course depends on your location in the world and the network speed). Anyway, that's what has prevented me from getting back to you sooner.

I committed the section of your patch that affected form.inc, to support image buttons. Thanks, I missed that one!

While I understand your concerns, they're all optimizations outside of my scope:
- support for not handling absolute URLs is in fact not necessary, instead the bugs in the modules doing this wrong should be fixed
- explicitly using the pathologic module from within core is simply unacceptable

The goal in the end is to get this patch into Drupal 7, and with the suggestions you're making (however useful they may be in certain use cases), that becomes impossible. If there is sufficient response from others to include this into this module, I will commit it. I'll leave this issue as active for some time, to allow for feedback from others on your suggestions.

dkruglyak’s picture

Priority: Minor » Critical

Wim, I do not have time to argue whether supporting absolute URLs is "necessary" or "not necessary" and whether using pathologic in core is "acceptable" or "unacceptable".

I am trying to use this module for an actual multi-site setup that has to support real-world requirements. It has to interoperate with many buggy modules and work around users who constantly screw up. From the perspective of actually trying to use your module, it has many shortcomings, which I documented and addressed above. It would be great to incorporate real fixes into the official release. If not, I will of course keep a separate fork for my own use, though this would be quite an unnecessary headache.

Bottom line, I suggest you re-read my description of the problems and if you do not like my solutions suggest your own.

Thank you.

dave reid’s picture

Just curious, would anything like Url alter module be useful at all for this module Wim?

gnassar’s picture

Thank you for this patch. Saved me a ton of time.

I can understand why putting module dependencies straight into core is a bad idea. I cannot yet understand refusing to support absolute URLs -- particularly when Drupal itself makes it a point to do so (with the absolute parameter in the url() call). Please explain: How is using an absolute URL a "bug" in a module that should be "fixed?" There are very good reasons for using absolute URLs in particular cases, and not handling them at all doesn't seem like the right answer for that.

Steven Merrill’s picture

Another example: using the flickr module and input format, calls to theme_image() will have absolute URLs to flickr images (already on a CDN, no less) in them.

sadist’s picture

everything was working fine, until I upgraded my Drupal core to the latest 6.15 (and run the CDN patch). I checked the patches and realised that this code isn't in the patch file.

// If path already includes hostname, do not change it
  $fragments = parse_url($path);
  if (isset($fragments['host'])) return $path;

so I added it in my file.inc file_create_url($path) and it's working now.

I wonder how I managed to get it working before this?

wim leers’s picture

Title: Drupal core patch's file_create_url(): skip absolute URLs and support the pathologic module » Skip absolute URLs and support the Pathologic module
Status: Active » Postponed (maintainer needs more info)
StatusFileSize
new824 bytes

@Dave Reid: not AFAIK.

@dkruglyak and likeminded people: I was wrong, you were right. Conceptually, that is: the changes you proposed did *not* belong in the core patch (i.e. in file_create_url(), where you put it), but in the CDN integration module's code. Since #610322: Upgrade the Drupal 6 core patch to match the committed Drupal 7 core patch, that is inside cdn_file_url_alter().
I've added the test that checks for absolute URLs (see attached patch): http://drupal.org/cvs?commit=323376.

Could you please reroll the code to support Pathologic to use the current version of the Pathologic module? The code also makes little sense to me, so if possible, please use more of Pathologic's functions, and otherwise include more comments. The code should go in cdn_file_url_alter(). Thanks! :)

raintonr’s picture

Re: #14. That exact same thing happens with the image.module.

Within image_display a call is made to file_create_url which places the CDN URL in front of the passed in URL. However, this function then calls theme('image_display'...) which has another call to file_create_url which adds the CDN URL a second time, therefore breaking all image.module images.

dkruglyak’s picture

Status: Postponed (maintainer needs more info) » Needs review
StatusFileSize
new1.57 KB

@Wim: Here is an updated patch that runs CDN urls through Pathologic settings. It works great for me, but you may definitely want to review it. All it does is uses Pathologic setting for default filter to process URLs. Note, right now Pathologic module does not provide a nice way to directly call itself from outside, which is why I just did copy/paste of its regexp code. Feel free to update this patch if Pathologic provides better API in the future #758118: wild card for Additional paths

@Dave: I looked at URL Alter module and seems like all it is doing is overriding custom_url_rewrite_inbound() and custom_url_rewrite_outbound()? This is useful, but not sufficient for what CDN module needs. In fact, even before checking out your module I tried to add that same pathologic fix to custom_url_rewrite_outbound(), finding that it does not get called every time it is needed... Is your module supposed to do anything other than override these two functions?

dkruglyak’s picture

StatusFileSize
new1.58 KB

A minor fix to the patch - enforce leading '/' symbol for generated path...

dkruglyak’s picture

StatusFileSize
new10.51 KB

One more little fix to the core patches. When pathologic is used with the latest patch enforcing leading '/', if file_create_url is called on the path several times (for example in Image module) it does not work correctly and adds redundant second '/' character in front.

The solution is to add another condition for file_create_url returning original path - whenever it starts with '/' already.

  // Return path if it was altered by any module OR it already includes base
  if ( ($path != $old_path) || (strpos($path, '/') === 0) ) {
    return $path;
  }

Attached is the modified core path. It is trivial - only changing the IF line above (and updating the comment).

dkruglyak’s picture

StatusFileSize
new11.01 KB

Sorry, wrong patch. This one should do it...

wim leers’s picture

Title: Skip absolute URLs and support the Pathologic module » Skip root-relative and protocol-relative URLs and support the Pathologic module
Status: Needs review » Fixed

#18/#19: Thanks! I've rerolled the patch to clean up comments and to get a full pass from Code Review (cdn_pathologic_support.patch). However, I realized that this code actually belongs in Pathologic. And now that a *hook* is being used to alter file URLs instead of a single callback, that has become possible. We just need to make sure that the CDN integration module runs *after* the Pathologic module. And that's quite easy to achieve.
Hence I've rolled a patch for Pathologic 3.x (pathologic_hook_file_url_alter.patch), which I would like you to review and then post to the Pathologic issue queue. In the mean time, I've already made HS run after Pathologic. Pathologic sets no weight and thus has the standard weight: zero. I've assigned the CDN integration module the weight 10.

#20/#21: You're absolutely right. In fact, this should be considered a bug in D7. Which is why I worked on a core patch that fixes this and that comes with unit tests right away: #777830: file_create_url() does not support protocol-relative nor root-relative file URLs. I've gone a bit further than you though: instead of just root-relative URLs, I made sure also protocol-relative URLs are supported now. Updated the core patch. The cool thing is that users will automatically be warned in the status report (and on all CDN integration administration pages) that there is a problem :)

Thanks for sticking along. This issue can now finally be marked fixed! :)

Commit: http://drupal.org/cvs?commit=357690

wim leers’s picture

Status: Fixed » Needs review
StatusFileSize
new1.38 KB
new1.62 KB

Darn, I forgot to upload the patches. And this issue shouldn't be marked fixed. Multitasking fail. Issue reopened.

I accidentally committed cdn_pathologic_support.patch. Reverted: http://drupal.org/cvs?commit=357704.

wim leers’s picture

Version: 6.x-1.x-dev » 6.x-2.x-dev
wim leers’s picture

dkruglyak’s picture

I have not had chance to review your new patches.

So far I just upgraded to the latest dev version of this module with my auto-build script and it is working for my use cases so far.

What do you want me to actually review? I will surely take a deeper look if my current setup breaks :)

dkruglyak’s picture

StatusFileSize
new1.17 KB

OK, with pathologic module being forcefully upgraded to 3.x, I had to reapply/retest all patches.

Pathologic module patch had to be rewritten since that module's functions changed. I attached the rewritten and streamlined patch. It might use further work, as I only implemented a bare minimum regex that might not cover all use cases. Reporting the patch to pathologic issue queue.

Basic CDN patches from the module distribution applied just fine and everything seems to be working as far as I tested.

One minor annoyance is the CDN module's stats page shows original URLs, not rewritten / post-processed one. I have not tackled that code but suggest this as enhancement.

dkruglyak’s picture

StatusFileSize
new11.01 KB

OK, a little addition here. Looks like there is need to update your CDN patch for drupal core (attached).

It has only one minor difference from your original one. In file.inc, if the path starts with leading '/' this means the base path was already prepended. If this fix is not present many modules that call url function repeatedly will append double '//' in front of the paths. Here is the revised snippet:

  // Return path if it was altered by any module OR it already includes base
  if ( ($path != $old_path) || (strpos($path, '/') === 0) ) {
    return $path;
  }

This fixed numerous URL-generation bugs I had in several sites.

gilgabar’s picture

Status: Needs review » Needs work

There appear to still be issues with absolute URLs. I'm using Pressflow 6.17 with CDN module 2.x-dev. It looks like cdn_file_url_alter handles an absolute path correctly and returns it to file_create_url without alteration. But since it is not altered the path is not returned immediately. Instead it is returned by the 'shipped files' condition which adds the base_path in front of it.

Perhaps the code added in cdn_file_url_alter should be in file_create_url instead as suggested by the very first post in this thread?

rwohleb’s picture

subscribe

wim leers’s picture

Status: Needs work » Fixed

#27:

OK, with pathologic module being forcefully upgraded to 3.x, I had to reapply/retest all patches.

Pathologic module patch had to be rewritten since that module's functions changed. I attached the rewritten and streamlined patch. It might use further work, as I only implemented a bare minimum regex that might not cover all use cases. Reporting the patch to pathologic issue queue.

Basic CDN patches from the module distribution applied just fine and everything seems to be working as far as I tested.

As far as I can see, you first posted this here and now it has become an issue of its own: #820910: Non-filter URL rewriting. Am I right when I interpret this as being ready with this as far as the CDN integration module is concerned?

One minor annoyance is the CDN module's stats page shows original URLs, not rewritten / post-processed one. I have not tackled that code but suggest this as enhancement.

When you hover over a hyperlink, you see the actual URL in your status bar. If you think that's not ok, then please create a new issue for that.

#28: I think you were looking at the old code (or the 1.x code), the current code already checks for root-relative and protocol-relative URLs:

+  // Return path if it was altered by any module, or if it already is a
+  // root-relative or a protocol-relative URL.
+  if ($path != $old_path || drupal_substr($path, 0, 1) == '/' || drupal_substr($path, 0, 2) == '//') {
+    return $path;
+  }

This already includes the check that you tried to add. Or am I overlooking something?

#29: There's an issue for that: #839282: Problems with shipped files and Drupal 6 file_create_url() behavior.

Tentatively marked as fixed! :)

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.