If you look at the typical queries issued during a page load using the Devel module, you will see many calls to drupal_lookup_path(). This function makes a database call for each source/destination pair, which on a sizable project leaves you with many unnecessary database queries.

Granted, these queries are fast and simple, but they still create a significant scalability problem since, light or not, they tie up database connections.

The function could easily be rewritten so that it retrieves the entire source/destination mapping into two associative arrays on the first request, "caches" it in a static variable, and uses the in-memory lookup for all subsequent requests. This approach significantly improves scalability, and even performance if a module like Pathauto is used (e.g. for SEO).
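The proposal is for PHP's drupal_lookup_path(), but the idea can be sketched standalone. Here is a minimal Python illustration, assuming Drupal's url_alias table with src/dst columns (the function name and module-level cache variable are hypothetical stand-ins for the PHP static variable, not Drupal code):

```python
import sqlite3

_cache = None  # stands in for the PHP static variable


def lookup_path_cached(conn, action, path):
    """Resolve system path <-> alias using two maps loaded in one query."""
    global _cache
    if _cache is None:
        # First call: pull the whole mapping once instead of one query per pair.
        rows = conn.execute("SELECT src, dst FROM url_alias").fetchall()
        _cache = {
            "src_to_dst": {src: dst for src, dst in rows},
            "dst_to_src": {dst: src for src, dst in rows},
        }
    if action == "alias":  # system path -> alias
        return _cache["src_to_dst"].get(path)
    return _cache["dst_to_src"].get(path)  # alias -> system path
```

Every lookup after the first is a dictionary access; the trade-off, raised later in this thread, is the cost of building the arrays when the alias table is large.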

This problem exists in all Drupal versions.

You can find a more detailed description of the problem and a sample rewrite of the function for the 4.7 branch at this address:
http://www.freshblurbs.com/major-drupal-performance-boost

If core developers agree with the proposed change, I will be happy to work on a similar change for the 5.x branch as well.

thank you

Comments

kbahey’s picture

Status: Needs work » Closed (won't fix)

This is not correct.

Up to 4.6 we did exactly what you propose, and then the bottleneck was building the associative array for sites that have several hundreds or thousands of path aliases.

So, starting with 4.7, we rewrote the code so that it does db queries instead of building the associative arrays.

The old way was so bad that it took several seconds for some sites. I even have some 4.6 sites with a backport of the 4.7 patch to resolve this bottleneck.

irakli’s picture

Thank you for the very interesting information. I certainly did not know that.

However, I must disagree. It's not that cut-and-dried. Any performance tuning is a compromise, and this one is too; the benefits largely depend on the specific setup. Maybe, considering the importance of this piece, we need both approaches?

I just completed one round of tests with 20 concurrent users and both approaches. For 2,000 aliases, the throughput with the current Drupal approach was 416 requests/sec, while with the suggested one it was 6,646. That's almost 16 times faster!

Yes, I also did a test with 100,000 aliases and the current approach was clearly faster.

Which one is "better"?

2,000 aliases is not too few. Drupal is used on blogs a lot. If you take the website of a very popular blogger, she probably writes 1-2 posts a day, say 1.5 on average.

So, 2,000 aliases (if aliases are only used in titles) is about 4 years' worth of blog posts (1.5 posts/day × 365 days × 4 years ≈ 2,190). You don't think that's a typical use pattern? I think it is.

I will do some more testing and publish all my findings soon. I hope you will be willing to take one more look at them.

I completely understand your point, but since there are many use cases, there may not necessarily be one best solution.

thank you

kbahey’s picture

Status: Closed (won't fix) » Postponed (maintainer needs more info)

You either have a good solution to a problem, or your benchmarking methodology is not accurate.

The best way to convince the rest of us is to share the exact benchmarks that you did, in detail.

Things like:

- Version(s) of Drupal you used.
- Before and after scenarios.
- Data set size.
- Tools used (e.g. ab/ab2, Devel module, custom tools?)

irakli’s picture

Absolutely, I'm working on that. I am using JMeter and will publish all sources, test plans, etc. as soon as I am done. I just want to get a fuller picture and test more scenarios before jumping to any conclusions :).

thank you.

P.S. I apologize for the repeated comments I posted. The server was giving me an "Internal Server Error" and I thought they were not going through.

kevinweber’s picture

This sounds to me like an ideal solution for small websites, since small Drupal installations are typically on shared hosting where MySQL queries are at a premium.

I'd like to make everyone aware of another discussion ongoing now: http://drupal.org/node/106559

Additionally, my question is: why can't we combine the current method AND this new method? If we initially query the number of aliases, then we could compare it to some magic number. If the number of aliases is larger than the magic number, the current method is used. Otherwise, the new method is used.
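That self-throttling idea can be sketched roughly as follows, in Python for brevity (the threshold constant, sentinel value, and schema are illustrative assumptions; the actual patch attached below may differ):

```python
import sqlite3

ALIAS_CACHE_THRESHOLD = 1000  # the "magic number"; should really be a setting
_alias_map = None


def lookup_alias(conn, src):
    """Count aliases once, then pick a lookup strategy based on the total."""
    global _alias_map
    if _alias_map is None:
        (count,) = conn.execute("SELECT COUNT(*) FROM url_alias").fetchone()
        if count <= ALIAS_CACHE_THRESHOLD:
            # Small site: load the whole mapping into memory in one query.
            _alias_map = dict(conn.execute("SELECT src, dst FROM url_alias"))
        else:
            _alias_map = False  # sentinel: too big, keep using per-path queries
    if _alias_map is not False:
        return _alias_map.get(src)
    row = conn.execute(
        "SELECT dst FROM url_alias WHERE src = ?", (src,)
    ).fetchone()
    return row[0] if row else None
```

The cost of the extra COUNT(*) is paid once per request, which is cheap compared to the dozens of per-path lookups it can replace on a small site.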

kevinweber’s picture

Priority: Critical » Minor
Status: Postponed (maintainer needs more info) » Needs work
Attached file: new patch, 3.29 KB

Here's a patch for Drupal 5.1 that implements my suggestion above. I don't recommend that this be put into core; I'm just posting it here to help out other small-site Drupallers.

After I installed this patch, the number of MySQL queries went from 159 queries executed in approx. 250 ms to 89 queries executed in approx. 150 ms. A nice little improvement for a site with only 100 nodes.

FYI, I set it up so that a site with fewer than 1,000 aliases will cache them all, while a site with more than 1,000 aliases will use the standard method. I don't know if this is the optimal cache size, but I didn't want to set it too high and waste resources.

kbahey’s picture

First of all, make that 1,000 a setting under Performance somewhere. No hard-coded numbers, please.

Second, fewer queries is always good, but it has to be accompanied by a total reduction in page load time. I have seen cases where just building the associative array in 4.6 ate up orders of magnitude more time than the database queries did.

Install the Devel module and see how long each case takes. Restart MySQL between the tests to rule out the query cache.

Also, look at how Drupal 4.6 used to do it, which is almost what you want to do (at the time the code was in bootstrap.inc, before path.inc existed). Specifically, the functions drupal_get_path_map() and drupal_get_path_alias().

Another thing to look at is this comment in a related issue, which is meant to cache some aliases if there are only a few of them.

Crell’s picture

For what it's worth, if the performance benefit is real then I would definitely support this for Drupal 6. The problem is that the Drupal 4.6 method is optimized for a site with just a few aliases, in which case I suspect that's fastest. The current method is optimized for sites that either do not have path module enabled (so no entries at all) or have zillions of them. Sites that have 10000 nodes but only 20 key aliases (like, say, Drupal.org) kinda get screwed. If we can have the code self-throttle, even if it's an admin-defined threshold, that lets us have our cake and eat it too. +1

kevinweber’s picture

I checked the page execution times, which are unchanged. My test site only has 100 aliases, so this may change with higher numbers of aliases. I don't have the time right now to test this with other numbers of aliases, but if someone else would like to try, please be my guest.

I have a feeling that page execution times will never decrease with this patch. The point at which the page execution times start increasing is where we should switch to the current system.

I'm content with this hack patch as it stands. However, I will continue working on it if others would like to see it mature.

kbahey’s picture

@Crell
The other patch, with the whitelist/blacklist, should also help with the case of 20 key aliases. These will be cached, and a single database query will fetch them.

@kevinweber
Here is an idea.

Install the devel module and enable the node/taxonomy generation that comes with it. Also install and enable pathauto and make sure it will create aliases for nodes.

Then create 1,000 nodes.

You now have 1,000 aliases (at least).

However, you don't have a page with lots of aliases. Installing sitemenu will create one page that has many aliases on it.

Go ahead and benchmark that before and after your patch and see what the results are.

kevinweber’s picture

The final benchmark:

* Drupal 5.1
* 1000 nodes

---Current path.inc---
* Front page has 329 database queries
* Average execution time is 2200 ms.

* Sitemap module has 218 database queries. (Most are path alias lookups.)
* Average execution time is 1500 ms.

---With the above patch---
* Front page has 213 database queries. (35% fewer)
* Average execution time is 3500 ms. (60% longer)

* Sitemap module had 24 database queries. (90% fewer)
* Average execution time is 2200 ms. (45% longer)

Conclusions: 1,000 nodes is too many for this "optimization". While the number of database queries decreased, the execution time increased dramatically. However, with 100 nodes (see comment #8) the number of database queries was reduced by 45% with no noticeable change in execution time.

kbahey’s picture

That is exactly what we found on 4.6, and why the 4.7 approach was backported to some sites to get past this.

To give you an idea, here is the number of aliases on a large site (hundreds of thousands of page views per day).

mysql> select count(*) from url_alias;
+----------+
| count(*) |
+----------+
|    47752 |
+----------+

There was way too much time spent building the static associative array on every request, and way too much memory wasted.

What you can do is look at the other approaches listed in the links above: caching N aliases instead of doing individual queries, so it is one query for N, where N is a magic number exposed as a setting.
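The "one query for N aliases" idea could look roughly like this; a Python sketch under the same assumed url_alias schema (the cache limit and selection order are placeholders; the real whitelist/blacklist patch in the linked issue may choose aliases by popularity instead):

```python
import sqlite3

CACHE_LIMIT = 100  # illustrative N; per the comment above, make it a setting
_hot_aliases = None


def lookup_alias_batched(conn, src):
    """Warm a bounded cache with one query; fall back to per-path queries."""
    global _hot_aliases
    if _hot_aliases is None:
        # One query pulls up to N aliases. A real whitelist approach would
        # order by how often each alias is actually requested.
        rows = conn.execute(
            "SELECT src, dst FROM url_alias LIMIT ?", (CACHE_LIMIT,)
        )
        _hot_aliases = dict(rows)
    if src in _hot_aliases:
        return _hot_aliases[src]
    row = conn.execute(
        "SELECT dst FROM url_alias WHERE src = ?", (src,)
    ).fetchone()
    return row[0] if row else None
```

Unlike the load-everything approach, memory use stays bounded regardless of how many rows url_alias holds, which addresses the 47,752-alias case above.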

kevinweber’s picture

Status: Needs work » Closed (fixed)

Closed.