Download & Extend

Optimize the generate chunk SQL query

Project:XML sitemap
Version:6.x-2.x-dev
Component:xmlsitemap.module
Category:task
Priority:normal
Assigned:Unassigned
Status:closed (won't fix)
Issue tags:Needs architectural review, optimization, Performance

Issue Summary

I could use some help from some awesome DB people to make sure the following query gets optimized:

SELECT COALESCE(ua.dst, x.loc) AS alias, x.loc, x.lastmod, x.changefreq, x.changecount, x.priority FROM {xmlsitemap} x
          LEFT JOIN (SELECT src, dst FROM {url_alias} WHERE language IN ('en', '') GROUP BY src ORDER BY language) ua ON x.loc = ua.src
          WHERE x.status = 1

This is what I get when I use EXPLAIN on the query with MySQL:

+----+-------------+------------+------+---------------+------+---------+------+-------+----------------------------------------------+
| id | select_type | table      | type | possible_keys | key  | key_len | ref  | rows  | Extra                                        |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------------------------------------------+
|  1 | PRIMARY     | x          | ALL  | NULL          | NULL | NULL    | NULL |   285 | Using where                                  |
|  1 | PRIMARY     | <derived2> | ALL  | NULL          | NULL | NULL    | NULL | 20051 |                                              |
|  2 | DERIVED     | url_alias  | ALL  | NULL          | NULL | NULL    | NULL | 20053 | Using where; Using temporary; Using filesort |
+----+-------------+------------+------+---------------+------+---------+------+-------+----------------------------------------------+

...which doesn't look the best. The problem seems to be the GROUP BY. Any advice or suggestions?

Comments

#1

The whole LEFT JOIN WHERE seems to be required only by multilanguage sites? Why filter for EN language for single language site.

Also GROUP BY src assumes there could be multiple occurrences of src, but I thought that is not allowed, and src dst relationship is 1 to 1 meaning 1 src is assigned to 1 dst.

So not sure if GROUP BY is needed at all.

#2

GROUP BY src assumes there could be multiple occurrences of src

There could be an alias for each language; in that case more than one alias would have the same value for the field src.

#3

Thanks Kiam, I though different language node occurrences get a different node id, not sure.

But the LEFT JOIN part could easily be dismissed in that case for single language sites.

#4

There are no restrictions on assigning two different aliases to the same node, where one would be used for one language, and the other for a different language. The code tries to be as more generic as possible.

I actually don't see any reason to make the query deal with the path alias too.
The purpose of doing that should be to avoid to execute two different queries, which would be replaced by a single query. The problem is that the query is getting a value that is not required until the node URL is generated.
Rather than getting the path alias, and then instruct url() that it must not find the path alias, it would maybe be better to not find the path alias, and let url() do that task.

#5

It is wise to do a left join on {url_alias} because we pass all the links through url(). If we don't already have the alias for each path, *each* and *every* call to url() results in a call to drupal_get_path_alias() and drupal_lookup_path(), which results in the following query being executed *each* and *every* time: SELECT dst FROM {url_alias} WHERE src = '%s' AND language IN('%s', '') ORDER BY language DESC. Would you prefer one join, or still one big SQL and 50,000 (max) other SQL calls for each sitemap chunk?

We need the GROUP BY because there is no restriction on the number of aliases a path can have, even if you have multiple languages disabled. If we didn't have that we would end up with listing every single alias in the sitemap, which is a no-no.

#6

@giorgio79: The LEFT JOIN on url_alias is required because we want to look up all the url aliases instead of looking each one up invididually in url(). It doesn't necessarily have anything to do with multilingual sites, just url aliases. It just so happens that it's pretty much the same query used to look up path aliases in drupal_lookup_path(), but used in a join.

#7

We need the GROUP BY because there is no restriction on the number of aliases a path can have, even if you have multiple languages disabled. If we didn't have that we would end up with listing every single alias in the sitemap, which is a no-no.

Then you need a distinct on the selection or you end up with more than one of the same src anyway. And which alias for the same src is the correct one to use? I thought it was the last entry, don't we need to order by the pid in descending order?

The order by language looks needless since we're only selecting a language of 'en' or ''.

#8

@earnie: Core's drupal_lookup_path() does not sort by pid DESC, so I didn't want to change the expected behavior. Using GROUP BY works perfectly to eliminate the duplicate src values. I couldn't get it to work with DISTINCT, so if you can figure it out, please let me know.

Also, replace 'en' with the current language of the sitemap being generated since we have multilingual support. Technically it's a placeholder value '%s', so it changes. The order by language is currently used by drupal_lookup_path so in case there are two aliases, one for all languages and one for english language, the alias that is used is the one for all languages since it is an empty string.

#9

For reference, we want to closely duplicate the SQL that drupal_lookup_path() runs (see #5), but in a join.

#10

If the LEFT JOIN is kept, then there isn't a way to replace the GROUP BY, which is needed to avoid duplicates for src values. Being this the case, nothing in the query can be removed, or replaced.

#11

Looks like removing the subquery select and just running a left join on the url_alias table directly helps a bit. My average query went from 1.5 seconds to 0.01 seconds.

SELECT x.loc, COALESCE(ua.dst, x.loc) AS alias, x.lastmod, x.changefreq, x.changecount, x.priority FROM xmlsitemap x
          LEFT JOIN url_alias ua ON x.loc = ua.src AND ua.language IN ('en', '')
          WHERE x.status = 1
          GROUP BY x.loc

Still says that "Using where; Using temporary; Using filesort" for the {xmlsitemap} table. :/

#12

I ♥ PostgreSQL. I can't just do GROUP BY x.loc. PostgreSQL requires all fields to be listed (GROUP BY x.loc, ua.dst, x.lastmod, x.changefreq, x.changecount, x.priority. Bleh.

#13

IIRC, Oracle needs all the selected fields listed as well but today my memory is fuzzy.

#14

I gave up for the moment trying to do a left join on {url_alias}. The query is now:
SELECT x.loc, x.lastmod, x.changefreq, x.changecount, x.priority FROM {xmlsitemap} x WHERE x.status = 1 and each $link['loc'] is run through url(). I'll probably figure this out soon, but it was getting too damn irritating.

#15

Status:active» closed (won't fix)

I have given up trying to do a database join attempt to fetch all the necessary url aliases. Instead I implemented #489006: Optimize fetching url aliases, so I'm considering this won't fix since I don't believe it can be optimized any more since we're just selecting a few fields from one table without any orders or groups.

nobody click here