Spin-off issue from #320331-133: Turn custom_url_rewrite_inbound and custom_url_rewrite_outbound into hooks.

hook_url_inbound_alter() is great, but surely there must be some way to not have it destroy the performance gains from page caching when using a module that just needs to know $_GET['q'] within hook_boot() or hook_exit().

Comments

effulgentsia’s picture

Issue tags: +Performance

tagging

catch’s picture

subscribing.

gpk’s picture

What does hook_url_inbound_alter() need to make it run successfully, apart from drupal_alter() being available?

dave reid’s picture

Mostly it was drupal_alter() and the assumption that since this is the opposite of hook_url_outbound_alter(), it should run the same bootstrap. We could make a copy of it and put it into bootstrap.inc (bootstrap_alter). Also in the test function we had used user_load_by_name() but I guess that could be converted to a simpler query: SELECT uid FROM {users} WHERE name = :name instead of relying on user.module being loaded. I guess if a module implementation of hook_url_inbound_alter() needed a full bootstrap it could do so itself and we can add BOOTSTRAP_PATH back in.

Other changes required to revert removing the bootstrap:
- Add 'url_inbound_alter' to http://api.drupal.org/api/function/bootstrap_hooks/7
- Add to the hook documentation that a full bootstrap is not available.

dave reid’s picture

I'd also like to see more benchmarks with the current code and statistics module enabled. Ones I did indicated little difference. Recent ones indicate huge difference.

gpk’s picture

@5: my hunch in #320331-121: Turn custom_url_rewrite_inbound and custom_url_rewrite_outbound into hooks was that the page cache probably wasn't turned on for your tests.

Re 4:
> - Add 'url_inbound_alter' to http://api.drupal.org/api/function/bootstrap_hooks/7
Given that the 'url_inbound_alter' hook is only required during a cached page response if a module (e.g. statistics) has upped the bootstrap phase to DRUPAL_BOOTSTRAP_PATH, can we distinguish between boot/exit/watchdog and url_inbound_alter? Otherwise we could be loading modules unnecessarily. The bootstrap column in {system} is already a full-blown int so there's certainly room to store the info..

dave reid’s picture

It would needed to be added to bootstrap_hooks since it's invoked prior to all modules loaded. Otherwise modules that do implement url_inbound_alter() but not the other bootstrap hooks would never be run.

gpk’s picture

I was thinking that in many cases this could cause such modules to be loaded unnecessarily, however possibly it would require too big an API change at this stage to finesse the loading of these modules. Same could also be said for modules implementing hook_watchdog - in 99% of cached page requests they will not be needed.

sun.core’s picture

Priority: Critical » Normal

Not critical. Please read and understand Priority levels of Issues, thanks.

effulgentsia’s picture

Priority: Normal » Critical

To be honest, I don't entirely understand the description of "critical" on that page, but if "critical" means we should not release Drupal 7 until this is resolved, then this issue is critical as per #320331-126: Turn custom_url_rewrite_inbound and custom_url_rewrite_outbound into hooks and remaining discussion on that issue. Turning url inbound rewriting from a handler to an alter hook resulted in a site with statistics module enabled having 500% or worse performance degradation for serving cached pages, and according to webchick, this is unacceptable. This issue can be marked fixed if we roll that other one back. But I don't accept that that's the only path towards fixing this.

catch’s picture

Yes, we either fix this, or we roll back hook_url_*_alter(). Note that the hooks also cause potentially infinite recursion if you call url() inside taxonomy_term_load() per #712076: Infinite loop possibility and performance problems in forum_url_outbound_alter().

effulgentsia’s picture

Title: Full bootstrap is required for $_GET['q'] to be initialized: makes page caching incompatible with statistics module » A site with statistics module enabled is much slower in serving cached pages in D7 than in D6

Better issue title I think, and one that got me thinking about an alternate way to solve the problem.

I don't think it's necessary for hook_boot() to have a fully normalized $_GET['q']. I think it's reasonable for us to say that the process of normalizing $_GET['q'] requires module involvement, and therefore doesn't happen until full bootstrap. I'm not too keen on making hook_url_inbound_alter() a bootstrap level hook (#6 - #8), unless some other compelling use-case forces us into that.

The issue with statistics module is more focused to hook_exit() during cached pages. This should have a normalized $_GET['q']. So, how about we add a "path" field to the "cache_page" table, and cache the normalized path along with the page content? Then either within drupal_serve_page_from_cache() or right after calling it from _drupal_bootstrap_page_cache(), we set $_GET['q'] to that cached value.

Thoughts? If it sounds good, anyone up for rolling such a patch?

effulgentsia’s picture

Status: Active » Needs review
StatusFileSize
new8.83 KB

So, how about we add a "path" field to the "cache_page" table...

Um, I guess we can't add fields to cache tables. So...

How about this?

catch’s picture

We already have something for ESI tie in in drupal_render_cache_set(), couldn't the same pattern be used here? http://drupal.org/node/651902

So far we have two critical bugs open due to hook_url_*_alter() - I think it should just be rolled back to be honest, consequences were not thought through, all the fixes create pain.

effulgentsia’s picture

So far we have two critical bugs open due to hook_url_*_alter()

What's the other one? #712076: Infinite loop possibility and performance problems in forum_url_outbound_alter() is not marked critical.

moshe weitzman’s picture

I would rather drop statistics module from core than take away url_alter hook for all developers. logging all page views in mysql is a pretty bad idea, and we should just acknowledge that. it can live on in contrib for folks who can't seem to find google analytics or apache logs or any other alternative.

christophweber’s picture

Quoting Moshe in #16: logging all page views in mysql is a pretty bad idea, and we should just acknowledge that. it can live on in contrib for folks who can't seem to find google analytics or apache logs or any other alternative.

I completely agree with this statement, having wondered repeatedly how Drupal's Statistics module fits into the general web access statistics world.
While I am only superficially familiar with Drupal's API and other internals, I am keenly aware of performance issues caused by disk access patterns. The less the better, and dropping Statistics from core would seem to fit the bill.

dries’s picture

To be honest, I think this is getting way to complex. What about rolling back hook_url_*_alter() ?

effulgentsia’s picture

What about rolling back hook_url_*_alter() ?

If we have to, we have to. I'm -1 for it, but I can appreciate the arguments for doing so. Let's realize we're actually talking about two hooks here: hook_url_inbound_alter() and hook_url_outbound_alter(). This issue points to a weakness of hook_url_inbound_alter(), though I think #13 is an okay solution, but I haven't had a chance yet to investigate what is suggested in #14.

I suspect if we have to roll back hook_url_inbound_alter(), then we need to roll back hook_url_outbound_alter() too. To do that, we would need to remove existing core implementations of that hook. For forum_url_outbound_alter(), we need to do that anyway, and have a patch for it: #699440: Add bundle support to entity_uri callback to remove performance overhead of forum_url_outbound_alter() (regression from D6). That would leave locale_url_outbound_alter().

sun’s picture

I agree with Moshe and others that we should keep hook_url_inbound/outbound_alter(), mainly from a strategic perspective.

Hopping through the API and bootstrap process, I'd see two potential solutions:

1) Add a {cache_page}.data column to store additional information (here, $_GET['q'], drupal_get_title()) with each entry. If done properly, that would most likely be interesting and helpful for other use-cases. Potential data handling (getter/setter) functions: hook_boot() and hook_exit(). Alternatively, or additionally, cache_get() and cache_set() could and probably should invoke a hook_cache_get|set_BIN().

2) Revert the removal of DRUPAL_BOOTSTRAP_PATH and turn hook_url_inbound_alter() into a bootstrap hook. Not sure why that was removed in the first place. There may be other use-cases other than statistics that require knowledge of the current system path prior to DRUPAL_BOOSTRAP_FULL.

moshe weitzman’s picture

@Dries - in #16 I suggested that the hook is more important to the framework than statistics module. Are you in disagreement based on your suggestion in #18?

gpk’s picture

At the risk of sticking my nose in ...

> the hook is more important to the framework than statistics module
part of the tension here is I think that large/busy sites have different requirements and performance issues from small/quieter sites. At the moment as far as I can make out Drupal is trying to work for both, which creates certain challenges.

> live on in contrib for folks who can't seem to find google analytics or apache logs or any other alternative.
I use both of those, but they have different uses than statistics module. And I imagine that alternatives (i.e. contrib modules yet to be written..??) would have the same problem with the bootstrap. The point about statistics module is that you just turn it on if you need it. However I agree it might ultimately have a better time in contrib - it's not had much love of late.

Another workaround might be to store the requested URL in {accesslog} which could then be used to get the normalised path for a cached response. However this wouldn't provide the additional benefits for other potential use cases (#20 option 1).

catch’s picture

I like sun's option #2. Add back DRUPAL_BOOTSTRAP_PATH and make hook_url_inbound_alter() a bootstrap hook. Then we don't need to make any schema changes just to accommodate statistics.module

We absolutely need to fix #699440: Add bundle support to entity_uri callback to remove performance overhead of forum_url_outbound_alter() (regression from D6) whether we were to rollback the alter hooks or not, so that in itself isn't an argument for a rollback.

http://api.drupal.org/api/function/locale_url_outbound_alter/7 is a proper use of the hook IMO - or at least afaik the code is extremely similar to the Drupal 6 code (which I can't locate at the moment).

So if these two are fixed, I'm in favour of leaving the hook in - the issue less the hook as such, more that we converted too many special cases to it for which it's not really suited.

effulgentsia’s picture

I'd be on-board with either of the two options in #20. While I think #13/#20.1 is cool, there's something to be said for invoking the principle of "solve the issue in the simplest way possible". Therefore, #20.2 is probably the better way to go. It was suggested in #4, and I don't even remember why I didn't like the idea back then. I think that like gpk (#6), I was concerned about the consequences of extra bootstrap modules. But, on thinking about it more, there really shouldn't be very many modules that implement hook_url_inbound_alter(). So +1 to that approach. #13/#20.1 can wait for another use-case, perhaps not until D8 (though maybe in D7 contrib).

effulgentsia’s picture

Status: Needs review » Needs work

CNW since a patch for that is needed.

chx’s picture

just roll back the !@#$%^&* hook. there was a good reason it never was a hook and it should still not be a hook. it's pointless to be a hook and it causes a major performance hit. what's so hard?

damien tournoud’s picture

Status: Needs work » Needs review
StatusFileSize
new9.97 KB

Why not simply removing this old crap that is {cache}.headers (that is used *only* for page cache), and simply implement the page cache in a more reasonable manner.

This allows us to pass some additional meta-data in the cached page, solving the issue at hand.

damien tournoud’s picture

StatusFileSize
new13.77 KB

This thing is like weed, it goes everywhere. Removed more 'headers'.

moshe weitzman’s picture

Very nice work. Thats how you slay a critical!

Waiting for bot and perhaps more feedback before rtbc.

effulgentsia’s picture

Agreed. This is awesome. The patch leads to some bugs in statistics_exit() for cached pages. I'm working on adding tests for that. I have confidence that it will be easy to make those tests pass.

effulgentsia’s picture

StatusFileSize
new19.2 KB

With the test and minor code additions to make the test pass.

Status: Needs review » Needs work
Issue tags: -Performance

The last submitted patch, 692044-cached-page-metadata-31.patch, failed testing.

effulgentsia’s picture

Status: Needs work » Needs review
Issue tags: +Performance
damien tournoud’s picture

Looks good to me, thanks for the additional tests and the fixes.

yesct’s picture

Status: Needs review » Reviewed & tested by the community
catch’s picture

Looks great here too. Will need to document the API/schema change but it's a small one, and the schema change won't cause actual breakage anyway.

sun’s picture

Status: Reviewed & tested by the community » Needs work
+++ includes/bootstrap.inc	18 May 2010 01:10:40 -0000
@@ -1134,7 +1138,7 @@ function drupal_serve_page_from_cache(st
-      // $cache->data is already gzip'ed, so make sure zlib.output_compression
+      // $cache->data['body'] is already gzip'ed, so make sure zlib.output_compression
       // does not compress it once more.

Exceeds 80 chars.

+++ includes/bootstrap.inc	18 May 2010 01:10:40 -0000
@@ -1142,11 +1146,12 @@ function drupal_serve_page_from_cache(st
+  // Print the page.
+  print $cache->data['body'];

+++ includes/common.inc	18 May 2010 01:10:41 -0000
@@ -4518,28 +4518,32 @@ function drupal_page_set_cache() {
+        'body' => ob_get_clean(),

Wouldn't 'html', 'markup', or 'content' be a more appropriate array key instead of 'body'?

+++ modules/statistics/statistics.module	18 May 2010 01:10:45 -0000
@@ -51,9 +51,23 @@ function statistics_help($path, $arg) {
+  // If this is running for a cached page using the page_cache_without_database
+  // configuration, then we may not have bootstrapped the system variables yet.
+  // Websites running non-database page caches will almost certainly not have
+  // the statistics module enabled, but just in case, bootstrap the system
+  // variables. This wipes out most if not all of the performance gain from a
+  // non-database page cache, but if the statistics module is enabled, then it
+  // is also likely to have the 'statistics_enable_access_log' variable set, in
+  // which case, we'll be bootstrapping to the session phase anyway. In summary,
+  // we do not invoke an expensive full bootstrap, but for any site that does
+  // not want to bootstrap beyond DRUPAL_BOOTSTRAP_PAGE_CACHE for cached pages,

Awesome explanation + information. However, can we shorten all sentences, while keeping the information?

"When serving cached pages with the 'page_cache_without_database' configuration, system variables need to be loaded. This is a major performance decrease for non-database page caches, but with Statistics module, it is likely to also have 'statistics_enable_access_log' enabled, in which case we need to bootstrap to the session phase anyway."

+++ modules/statistics/statistics.module	18 May 2010 01:10:45 -0000
@@ -51,9 +51,23 @@ function statistics_help($path, $arg) {
+  // do not enable the statistics module. ¶

1) The "Statistics" module (proper name).

2) Trailing white-space.

+++ modules/statistics/statistics.module	18 May 2010 01:10:45 -0000
@@ -51,9 +51,23 @@ function statistics_help($path, $arg) {
+    // We need this to use the arg() function.
+    require_once DRUPAL_ROOT . '/' . variable_get('path_inc', 'includes/path.inc');

I wonder whether we shouldn't move arg() into bootstrap.inc or something, too? It's a few lines only, and we already have a group of path-related functions in bootstrap.inc it could be moved into. (IIRC, not having arg() available early was a WTF for me a couple of times already in the past)

Alternatively, we could simply inline $arg = explode('/', $_GET['q']); here.

I'd prefer to do the former though (move arg()).

+++ modules/update/update.install	18 May 2010 01:10:45 -0000
@@ -167,3 +167,10 @@ function update_update_7000() {
\ No newline at end of file

Missing newline.

Powered by Dreditor.

damien tournoud’s picture

Status: Needs work » Needs review
StatusFileSize
new21.15 KB

New version, integrating all of @sun's comments, except one: I stick with 'body', which is the HTTP terminology (headers / body).

sun’s picture

+++ includes/bootstrap.inc
@@ -1134,19 +1134,24 @@ function drupal_serve_page_from_cache(stdClass $cache) {
+  // Restore the metadata that was cached with the page.
+  $_GET['q'] = $cache->data['path'];
+  drupal_set_title($cache->data['title'], PASS_THROUGH);

I wonder whether we shouldn't move this into drupal_page_get_cache() or _drupal_bootstrap_page_cache(), so as to make both available to hook_boot() and hook_exit()?

Powered by Dreditor.

dries’s picture

I need to give this a closer review to verify the smaller details but from a conceptual level this looks pretty clever.

It would be interesting to see some performance results.

I'm OK with 'body' -- that seems to be official term according to the HTTP specification. See http://www.w3.org/Protocols/rfc2616/rfc2616-sec4.html#sec4.

effulgentsia’s picture

StatusFileSize
new22.32 KB

With #39. So tempting to put it into drupal_page_get_cache() for symmetry with drupal_page_set_cache(), but global side-effects within a "get" function just seems wrong. Therefore, put it in _drupal_bootstrap_page_cache().

sun’s picture

Status: Needs review » Reviewed & tested by the community

Thanks, looks good.

AFAICS, there shouldn't be any measurable positive or negative performance impact from this patch.

catch’s picture

If there wasn't, then this issue wouldn't be critical - ideally someone should benchmark cached pages with statistics module on and off. Agreed that without statistics on this ought to be no change though.

effulgentsia’s picture

Benchmarks of anonymous user visiting the front page after a clean "standard profile" install (no content):

Statistics Off
without page cache: HEAD = 41.4ms, Patch = 41.3ms
with page cache: HEAD = 6.6ms, Patch = 6.6ms

Statistics On (access log and count content views enabled)
without page cache: HEAD = 43.0ms, Patch = 42.8ms
with page cache: HEAD = 32.8ms, Patch = 8.5ms

Each number is the "ab" (Apache Benchmark) average of 500 requests.

dries’s picture

Status: Reviewed & tested by the community » Fixed

Reviewed and looks very solid. Committed to CVS HEAD. Thanks.

jbrown’s picture

Status: Fixed » Needs review
Issue tags: +D7 upgrade path
StatusFileSize
new2.58 KB

Some upgrade issues:

In update_prepare_d7_bootstrap() we are creating table cache_bootstrap with headers field. We then drop it in an update hook. Better to just create it correctly in the first place.

Also, cache_block needs to have its header field dropped.

damien tournoud’s picture

Status: Needs review » Reviewed & tested by the community

Good catch!

dries’s picture

Project: Drupal core » HEAD to HEAD
Version: 7.x-dev »
Component: base system » Code
Priority: Critical » Normal
Status: Reviewed & tested by the community » Needs work

Committed to CVS HEAD. Moving to HEAD to HEAD project as this will break HEAD to HEAD upgrades.

pwolanin’s picture

Project: HEAD to HEAD » Drupal core
Version: » 7.x-dev
Component: Code » base system
Priority: Normal » Critical
Status: Needs work » Fixed

Status: Fixed » Closed (fixed)
Issue tags: -Performance, -D7 upgrade path

Automatically closed -- issue fixed for 2 weeks with no activity.

mfb’s picture

Issue summary: View changes

Does anyone happen to remember why statistics module needs to bootstrap to the session phase in statistics_exit()? If the request didn't already bootstrap to session phase - i.e. it's being served from cache - then this would generate a "bogus" session ID that can be logged by statistics module. But this doesn't seem particularly useful.

In recent versions of PHP, initializing session after serving the page throws a PHP warning, which I was researching in #3009351: "session_id(): Cannot change session id". I don't use statistics module so not clear what it's trying to achieve.