Performance: denormalize releases, files and lines with project/release info [#375817]

Damien Tournoud says that it should be straightforward to denormalize the data structure by pushing down the project id to release, file and line tables and the release id to file and line tables, since the file and line rows are unique to a release of a project and are not shared among the different releases of project or even different projects. That would allow us to join from the line table in regular queries and skip the release and file table (we already skip the project table).

Comment	File	Size	Author
#21	include-missed.patch	829 bytes	gábor hojtsy
#20	pid-rid-denormalization.patch	43.54 KB	gábor hojtsy
#19	pid-rid-denormalization.patch	43.24 KB	gábor hojtsy
#18	pid-rid-denormalize.patch	17.23 KB	gábor hojtsy
#16	l10n_denormalize_6.patch	16.39 KB	meba
#15	l10n_denormalize_5.patch	16.4 KB	meba
#8	l10n_denormalize_4.patch	16.62 KB	meba
#3	l10n_denormalize_3.patch	7.79 KB	meba
#2	l10n_denormalize_2.patch	5.32 KB	meba
#1	l10n_denormalize.patch	3.11 KB	meba

Comments

Comment #1

meba commented 7 March 2009 at 15:29

Status	File	Size
new	l10n_denormalize.patch	3.11 KB

Just saving progress. Attached patch adds new columns and copies data from existing installations.

Working on inserting data during data creation and selects.

Comment #2

meba commented 7 March 2009 at 17:59

Status:

Active

» Needs review

Status	File	Size
new	l10n_denormalize_2.patch	5.32 KB

OK, this is actually getting ready for review, Gabor :-)

Status: New columns in database, update is moving data to them, extractor is able to save strings and files with the data.

Moving to simplifying the join query

Comment #3

meba commented 7 March 2009 at 18:46

Status	File	Size
new	l10n_denormalize_3.patch	7.79 KB

OK, this seems to be done. Implemented stuff:

- Adding new fields
- Populating them with data either on update from old installation and both fresh installation and scanning packages
- Editing screen using new columns

Moving to next issue.

Comment #4

gábor hojtsy

he/him

Hungarian

Hungary

commented 12 March 2009 at 14:12

Status:

Needs review

» Needs work

Applied the patch and did a quick review. I am determined to commit this as soon as possible to help you work on the other fixes. So here are the issues I found:

1. in the update code, you use comma for JOIN, while rest of Drupal always uses the proper spelled out JOIN to show what is happening better.

2. I did a grep on both changed tables to find queries related; came up with this list of queries which should take advantage of the new table structure as well; packaging or Ajax calls are just as much important to make perform better:

l10n_community/ajax.inc:  $result = db_query('SELECT p.pid, p.title project_title, r.rid, r.title release_title, COUNT(l.lineno) as occurance_count FROM {l10n_community_project} p INNER JOIN {l10n_community_release} r ON p.pid = r.pid INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid WHERE s.sid = %d GROUP BY r.rid ORDER by p.pid, r.rid', $sid);
l10n_community/export.inc:    $sql = 'SELECT s.sid, s.value, f.location, f.revision, l.lineno, l.type FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid WHERE r.pid = %d';
l10n_community/export.inc:    $sql = "SELECT s.sid, s.value, f.location, f.revision, l.lineno, l.type, t.translation, t.uid_approved, t.time_approved FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid  LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid AND t.language = '%s' AND is_active = 1 AND is_suggestion = 0 WHERE r.pid = %d";
l10n_community/pages.inc:      return db_result(db_query('SELECT COUNT(DISTINCT s.sid) FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid WHERE r.pid = %d', $id));
l10n_community/pages.inc:        $count_sql = "SELECT COUNT(DISTINCT t.sid) AS translation_count, t.language, t.is_suggestion FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid WHERE r.pid = %d AND t.is_active = 1 AND t.translation != '' GROUP BY t.language, t.is_suggestion";
l10n_community/pages.inc:      $result = db_query("SELECT COUNT(DISTINCT s.sid) AS string_count, p.pid, p.title, p.uri FROM {l10n_community_project} p INNER JOIN {l10n_community_release} r ON p.pid = r.pid INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid GROUP BY p.pid;");
l10n_community/pages.inc:      $count_sql = "SELECT COUNT(DISTINCT t.sid) AS translation_count, r.pid, t.is_suggestion FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid WHERE t.is_active = 1 AND t.translation != '' ";

3. The cascade delete in l10n_community_delete_project() can now be made postresql compatible, since we have the pid on all rows, so we can just remove the items based on pid from all three tables in three DELETE's not just one cascade delete. That would be an easy and great way to jump forward. A good side-effect of this performance optimization.

Thanks for all your hard work.

Comment #5

meba commented 12 March 2009 at 20:26

See:
SELECT COUNT(DISTINCT t.sid) AS translation_count, r.pid, t.is_suggestion FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid WHERE t.is_active = 1 AND t.translation != ''

Changed into:

SELECT COUNT(DISTINCT t.sid) AS translation_count, l.pid, t.is_suggestion FROM {l10n_community_translation} t LEFT JOIN {l10n_community_string} ON t.sid = s.sid INNER JOIN {l10n_community_line} l ON t.sid = l.sid WHERE t.is_active = 1 AND t.translation != ''

Is there a reason to join from translation to string and then to line if we already have sid in translation so joining translation+line would be enough?

Comment #6

gábor hojtsy

he/him

Hungarian

Hungary

commented 12 March 2009 at 20:28

Huh, I don't see a reason to join with lines there.

Comment #7

meba commented 12 March 2009 at 20:35

Sorry for misunderstanding...

See:

SELECT COUNT(DISTINCT t.sid) AS translation_count, l.pid, t.is_suggestion FROM {l10n_community_translation} t LEFT JOIN {l10n_community_string} s ON t.sid = s.sid INNER JOIN {l10n_community_line} l ON t.sid = l.sid WHERE t.is_active = 1 AND t.translation != ''

Compared to:

SELECT COUNT(DISTINCT t.sid) AS translation_count, l.pid, t.is_suggestion FROM {l10n_community_translation} t INNER JOIN {l10n_community_line} l ON t.sid = l.sid WHERE t.is_active = 1 AND t.translation != ''

This should give same results in my opinion but I found the previous version (with older joins too) in the code so I am trying to find a reason for that :)

Comment #8

meba commented 12 March 2009 at 20:51

Status:

Needs work

» Needs review

Status	File	Size
new	l10n_denormalize_4.patch	16.62 KB

Attaching new patch.

How to test revised SQL queries:

- Go to Explore projects page, Explore languages page and Translate welcome screen
- Note all numbers displayed on each page
- Apply patch
- DELETE FROM cache;
- Reload all pages

All numbers should stay the same

Comment #9

meba commented 12 March 2009 at 23:47

I would also like to do some benchmarks.

First impressions:

mysql> explain SELECT COUNT(DISTINCT(s.sid)) FROM l10n_community_release r INNER JOIN l10n_community_file f ON r.rid = f.rid INNER JOIN l10n_community_line l ON f.fid = l.fid INNER JOIN l10n_community_string s ON l.sid = s.sid LEFT JOIN l10n_community_translation t ON s.sid = t.sid AND t.language = 'ab' AND t.is_active = 1 AND t.is_suggestion = 0 WHERE r.pid = 2670;
+----+-------------+-------+--------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+------+-------------+
| id | select_type | table | type   | possible_keys                                                                  | key                     | key_len | ref                                    | rows | Extra       |
+----+-------------+-------+--------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+------+-------------+
|  1 | SIMPLE      | r     | ref    | PRIMARY,pid                                                                    | pid                     | 5       | const                                  |  106 | Using where | 
|  1 | SIMPLE      | f     | ref    | PRIMARY,rid                                                                    | rid                     | 5       | l10nserverdenormfull.r.rid             |    7 | Using where | 
|  1 | SIMPLE      | l     | ref    | fid,sid                                                                        | fid                     | 5       | l10nserverdenormfull.f.fid             |   16 | Using where | 
|  1 | SIMPLE      | s     | eq_ref | PRIMARY                                                                        | PRIMARY                 | 4       | l10nserverdenormfull.l.sid             |    1 | Using index | 
|  1 | SIMPLE      | t     | ref    | is_active,is_suggestion,language,suggestion_active,sid_language_suggestion,sid | sid_language_suggestion | 46      | l10nserverdenormfull.s.sid,const,const |    1 |             | 
+----+-------------+-------+--------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+------+-------------+
5 rows in set (0.00 sec)

New query:

mysql> explain SELECT COUNT(DISTINCT(s.sid)) FROM l10n_community_line l INNER JOIN l10n_community_string s ON l.sid = s.sid LEFT JOIN l10n_community_translation t ON s.sid = t.sid AND t.language = 'ab' AND t.is_active = 1 AND t.is_suggestion = 0 WHERE l.pid = 2670;+----+-------------+-------+-------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+--------+-------------+
| id | select_type | table | type  | possible_keys                                                                  | key                     | key_len | ref                                    | rows   | Extra       |
+----+-------------+-------+-------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+--------+-------------+
|  1 | SIMPLE      | s     | index | PRIMARY                                                                        | PRIMARY                 | 4       | NULL                                   | 117996 | Using index | 
|  1 | SIMPLE      | t     | ref   | is_active,is_suggestion,language,suggestion_active,sid_language_suggestion,sid | sid_language_suggestion | 46      | l10nserverdenormfull.s.sid,const,const |      1 |             | 
|  1 | SIMPLE      | l     | ref   | sid                                                                            | sid                     | 5       | l10nserverdenormfull.s.sid             |      2 | Using where | 
+----+-------------+-------+-------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+--------+-------------+
3 rows in set (0.00 sec)

BUt second (new) seems to be much slower!

Comment #10

gábor hojtsy

he/him

Hungarian

Hungary

commented 13 March 2009 at 09:51

Well, we did not change any indexes keeping the new queries in mind!

Comment #11

killes@www.drop.org commented 20 July 2009 at 06:45

The reason why the second query is slower is likely that it looks at much more rows in the first table.

Maybe a better index helps.

Comment #12

killes@www.drop.org commented 20 July 2009 at 06:50

Actually, the best way to deal with this might be the materialized views approach that David has taken in tracker2, ie we'd keep a table with precalculated values.

Comment #13

meba commented 16 August 2009 at 22:20

Well, should I try to implement something like materialized views or should we continue on this patch?

Comment #14

gábor hojtsy

he/him

Hungarian

Hungary

commented 17 August 2009 at 07:30

Well, pid and rid are "precalculated" just like in materialized views, and help us skip a few tables in most cases. I think we should trade somewhere inbetween the best overall performance improvement and manageable code.

Comment #15

meba commented 18 August 2009 at 08:16

Status	File	Size
new	l10n_denormalize_5.patch	16.4 KB

Changing the query to LEFT JOIN saved over 2 seconds...

Posting an updated patch.

Comment #16

meba commented 18 August 2009 at 11:07

Status	File	Size
new	l10n_denormalize_6.patch	16.39 KB

Forgot to delete from releases, fixed one phpdoc

Comment #17

gábor hojtsy

he/him

Hungarian

Hungary

commented 17 September 2009 at 12:40

Assigned:

Unassigned

» gábor hojtsy

I'm taking this on. Implemented some devel generate hooks which can be used with the 6MB database dump I have from localize.drupal.org. I can share that with you @meba, if interested. The two coupled can make a nice testing environment, where some of the queries are extremely slow :) Time to look into skipping a few tables in the queries.

Comment #18

gábor hojtsy

he/him

Hungarian

Hungary

commented 18 September 2009 at 10:41

Status	File	Size
new	pid-rid-denormalize.patch	17.23 KB

Took the schema changes and the update code at first. Fixed the update code to work with less SQL queries. We can populate the pid based on the release list and populate the pid, rid on lines based on releases after that. This is still about 10000 queries on the update on ldo, but it runs well on my machine.

Then went to apply this to the overview/explore pages. Found that some of the DISTINCTs can be eliminated, but it was still a lot of time to render that with the generated data set. So added caching to the project and language overviews. Also took the changes for the translation and moderation listing queries. Did not take the other changes yet, but that is in progress.

Note that you did not include code to actually fill in the pid and rid when new code is actually parsed. So while the existing data is updated, it is not maintained with new data added.

To work on this more. Maybe only next week.

Comment #19

gábor hojtsy

he/him

Hungarian

Hungary

commented 18 September 2009 at 17:47

Status	File	Size
new	pid-rid-denormalization.patch	43.24 KB

Oh, of course you did include code to fill in the pid and rid in the new columns. Sorry for overlooking. I looked over all the changes in your patch and implemented them in this attached patch. Actually ended up implementing the rid/pid filling via stored statics, so that less queries are required. and extractions are quicker.

Also, took the opportunity to rework the query builder for the browse/translate pages to use the more flexible arrays as on the moderate page. Since the simpletests have pretty good coverage of these pages, I could easily verify that my refactorings worked (and initially they did not :).

Also, added more cache clearing and some advance rebuilding, so cron clears off all caches but rebuild the three most important ones. Also, removing or adding a language team clears the caches.

The test suite runs fine with this patch, so I'll tinker with it a little and hopefully commit and deploy tonight.

Comment #20

gábor hojtsy

he/him

Hungarian

Hungary

commented 18 September 2009 at 18:04

Status:

Needs review

» Fixed

Status	File	Size
new	pid-rid-denormalization.patch	43.54 KB

Slightly tweaked comments, committed this patch. Thanks for all your contributions.

Comment #21

gábor hojtsy

he/him

Hungarian

Hungary

commented 18 September 2009 at 18:40

Status	File	Size
new	include-missed.patch	829 bytes

Forgot to include files at one place. Committed this too.

Comment #22

2 October 2009 at 18:50

Status:

Fixed

» Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

Performance: denormalize releases, files and lines with project/release info

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

Comment #20

Comment #21

Comment #22

News items

Our community

Documentation

Drupal code base

Governance of community