Damien Tournoud says that it should be straightforward to denormalize the data structure by pushing down the project id to release, file and line tables and the release id to file and line tables, since the file and line rows are unique to a release of a project and are not shared among the different releases of project or even different projects. That would allow us to join from the line table in regular queries and skip the release and file table (we already skip the project table).

Comments

meba’s picture

StatusFileSize
new3.11 KB

Just saving progress. Attached patch adds new columns and copies data from existing installations.

Working on inserting data during data creation and selects.

meba’s picture

Status: Active » Needs review
StatusFileSize
new5.32 KB

OK, this is actually getting ready for review, Gabor :-)

Status: New columns in database, update is moving data to them, extractor is able to save strings and files with the data.

Moving to simplifying the join query

meba’s picture

StatusFileSize
new7.79 KB

OK, this seems to be done. Implemented stuff:

- Adding new fields
- Populating them with data either on update from old installation and both fresh installation and scanning packages
- Editing screen using new columns

Moving to next issue.

gábor hojtsy’s picture

Status: Needs review » Needs work

Applied the patch and did a quick review. I am determined to commit this as soon as possible to help you work on the other fixes. So here are the issues I found:

1. in the update code, you use comma for JOIN, while rest of Drupal always uses the proper spelled out JOIN to show what is happening better.

2. I did a grep on both changed tables to find queries related; came up with this list of queries which should take advantage of the new table structure as well; packaging or Ajax calls are just as much important to make perform better:

l10n_community/ajax.inc:  $result = db_query('SELECT p.pid, p.title project_title, r.rid, r.title release_title, COUNT(l.lineno) as occurance_count FROM {l10n_community_project} p INNER JOIN {l10n_community_release} r ON p.pid = r.pid INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid WHERE s.sid = %d GROUP BY r.rid ORDER by p.pid, r.rid', $sid);
l10n_community/export.inc:    $sql = 'SELECT s.sid, s.value, f.location, f.revision, l.lineno, l.type FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid WHERE r.pid = %d';
l10n_community/export.inc:    $sql = "SELECT s.sid, s.value, f.location, f.revision, l.lineno, l.type, t.translation, t.uid_approved, t.time_approved FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid  LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid AND t.language = '%s' AND is_active = 1 AND is_suggestion = 0 WHERE r.pid = %d";
l10n_community/pages.inc:      return db_result(db_query('SELECT COUNT(DISTINCT s.sid) FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid WHERE r.pid = %d', $id));
l10n_community/pages.inc:        $count_sql = "SELECT COUNT(DISTINCT t.sid) AS translation_count, t.language, t.is_suggestion FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid WHERE r.pid = %d AND t.is_active = 1 AND t.translation != '' GROUP BY t.language, t.is_suggestion";
l10n_community/pages.inc:      $result = db_query("SELECT COUNT(DISTINCT s.sid) AS string_count, p.pid, p.title, p.uri FROM {l10n_community_project} p INNER JOIN {l10n_community_release} r ON p.pid = r.pid INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid GROUP BY p.pid;");
l10n_community/pages.inc:      $count_sql = "SELECT COUNT(DISTINCT t.sid) AS translation_count, r.pid, t.is_suggestion FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid WHERE t.is_active = 1 AND t.translation != '' ";

3. The cascade delete in l10n_community_delete_project() can now be made postresql compatible, since we have the pid on all rows, so we can just remove the items based on pid from all three tables in three DELETE's not just one cascade delete. That would be an easy and great way to jump forward. A good side-effect of this performance optimization.

Thanks for all your hard work.

meba’s picture

See:
SELECT COUNT(DISTINCT t.sid) AS translation_count, r.pid, t.is_suggestion FROM {l10n_community_release} r INNER JOIN {l10n_community_file} f ON r.rid = f.rid INNER JOIN {l10n_community_line} l ON f.fid = l.fid INNER JOIN {l10n_community_string} s ON l.sid = s.sid LEFT JOIN {l10n_community_translation} t ON s.sid = t.sid WHERE t.is_active = 1 AND t.translation != ''

Changed into:

SELECT COUNT(DISTINCT t.sid) AS translation_count, l.pid, t.is_suggestion FROM {l10n_community_translation} t LEFT JOIN {l10n_community_string} ON t.sid = s.sid INNER JOIN {l10n_community_line} l ON t.sid = l.sid WHERE t.is_active = 1 AND t.translation != '' 

Is there a reason to join from translation to string and then to line if we already have sid in translation so joining translation+line would be enough?

gábor hojtsy’s picture

Huh, I don't see a reason to join with lines there.

meba’s picture

Sorry for misunderstanding...

See:

SELECT COUNT(DISTINCT t.sid) AS translation_count, l.pid, t.is_suggestion FROM {l10n_community_translation} t LEFT JOIN {l10n_community_string} s ON t.sid = s.sid INNER JOIN {l10n_community_line} l ON t.sid = l.sid WHERE t.is_active = 1 AND t.translation != '' 

Compared to:

SELECT COUNT(DISTINCT t.sid) AS translation_count, l.pid, t.is_suggestion FROM {l10n_community_translation} t INNER JOIN {l10n_community_line} l ON t.sid = l.sid WHERE t.is_active = 1 AND t.translation != '' 

This should give same results in my opinion but I found the previous version (with older joins too) in the code so I am trying to find a reason for that :)

meba’s picture

Status: Needs work » Needs review
StatusFileSize
new16.62 KB

Attaching new patch.

How to test revised SQL queries:

- Go to Explore projects page, Explore languages page and Translate welcome screen
- Note all numbers displayed on each page
- Apply patch
- DELETE FROM cache;
- Reload all pages

All numbers should stay the same

meba’s picture

I would also like to do some benchmarks.

First impressions:

mysql> explain SELECT COUNT(DISTINCT(s.sid)) FROM l10n_community_release r INNER JOIN l10n_community_file f ON r.rid = f.rid INNER JOIN l10n_community_line l ON f.fid = l.fid INNER JOIN l10n_community_string s ON l.sid = s.sid LEFT JOIN l10n_community_translation t ON s.sid = t.sid AND t.language = 'ab' AND t.is_active = 1 AND t.is_suggestion = 0 WHERE r.pid = 2670;
+----+-------------+-------+--------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+------+-------------+
| id | select_type | table | type   | possible_keys                                                                  | key                     | key_len | ref                                    | rows | Extra       |
+----+-------------+-------+--------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+------+-------------+
|  1 | SIMPLE      | r     | ref    | PRIMARY,pid                                                                    | pid                     | 5       | const                                  |  106 | Using where | 
|  1 | SIMPLE      | f     | ref    | PRIMARY,rid                                                                    | rid                     | 5       | l10nserverdenormfull.r.rid             |    7 | Using where | 
|  1 | SIMPLE      | l     | ref    | fid,sid                                                                        | fid                     | 5       | l10nserverdenormfull.f.fid             |   16 | Using where | 
|  1 | SIMPLE      | s     | eq_ref | PRIMARY                                                                        | PRIMARY                 | 4       | l10nserverdenormfull.l.sid             |    1 | Using index | 
|  1 | SIMPLE      | t     | ref    | is_active,is_suggestion,language,suggestion_active,sid_language_suggestion,sid | sid_language_suggestion | 46      | l10nserverdenormfull.s.sid,const,const |    1 |             | 
+----+-------------+-------+--------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+------+-------------+
5 rows in set (0.00 sec)

New query:

mysql> explain SELECT COUNT(DISTINCT(s.sid)) FROM l10n_community_line l INNER JOIN l10n_community_string s ON l.sid = s.sid LEFT JOIN l10n_community_translation t ON s.sid = t.sid AND t.language = 'ab' AND t.is_active = 1 AND t.is_suggestion = 0 WHERE l.pid = 2670;+----+-------------+-------+-------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+--------+-------------+
| id | select_type | table | type  | possible_keys                                                                  | key                     | key_len | ref                                    | rows   | Extra       |
+----+-------------+-------+-------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+--------+-------------+
|  1 | SIMPLE      | s     | index | PRIMARY                                                                        | PRIMARY                 | 4       | NULL                                   | 117996 | Using index | 
|  1 | SIMPLE      | t     | ref   | is_active,is_suggestion,language,suggestion_active,sid_language_suggestion,sid | sid_language_suggestion | 46      | l10nserverdenormfull.s.sid,const,const |      1 |             | 
|  1 | SIMPLE      | l     | ref   | sid                                                                            | sid                     | 5       | l10nserverdenormfull.s.sid             |      2 | Using where | 
+----+-------------+-------+-------+--------------------------------------------------------------------------------+-------------------------+---------+----------------------------------------+--------+-------------+
3 rows in set (0.00 sec)

BUt second (new) seems to be much slower!

gábor hojtsy’s picture

Well, we did not change any indexes keeping the new queries in mind!

killes@www.drop.org’s picture

The reason why the second query is slower is likely that it looks at much more rows in the first table.

Maybe a better index helps.

killes@www.drop.org’s picture

Actually, the best way to deal with this might be the materialized views approach that David has taken in tracker2, ie we'd keep a table with precalculated values.

meba’s picture

Well, should I try to implement something like materialized views or should we continue on this patch?

gábor hojtsy’s picture

Well, pid and rid are "precalculated" just like in materialized views, and help us skip a few tables in most cases. I think we should trade somewhere inbetween the best overall performance improvement and manageable code.

meba’s picture

StatusFileSize
new16.4 KB

Changing the query to LEFT JOIN saved over 2 seconds...

Posting an updated patch.

meba’s picture

StatusFileSize
new16.39 KB

Forgot to delete from releases, fixed one phpdoc

gábor hojtsy’s picture

Assigned: Unassigned » gábor hojtsy

I'm taking this on. Implemented some devel generate hooks which can be used with the 6MB database dump I have from localize.drupal.org. I can share that with you @meba, if interested. The two coupled can make a nice testing environment, where some of the queries are extremely slow :) Time to look into skipping a few tables in the queries.

gábor hojtsy’s picture

StatusFileSize
new17.23 KB

Took the schema changes and the update code at first. Fixed the update code to work with less SQL queries. We can populate the pid based on the release list and populate the pid, rid on lines based on releases after that. This is still about 10000 queries on the update on ldo, but it runs well on my machine.

Then went to apply this to the overview/explore pages. Found that some of the DISTINCTs can be eliminated, but it was still a lot of time to render that with the generated data set. So added caching to the project and language overviews. Also took the changes for the translation and moderation listing queries. Did not take the other changes yet, but that is in progress.

Note that you did not include code to actually fill in the pid and rid when new code is actually parsed. So while the existing data is updated, it is not maintained with new data added.

To work on this more. Maybe only next week.

gábor hojtsy’s picture

StatusFileSize
new43.24 KB

Oh, of course you did include code to fill in the pid and rid in the new columns. Sorry for overlooking. I looked over all the changes in your patch and implemented them in this attached patch. Actually ended up implementing the rid/pid filling via stored statics, so that less queries are required. and extractions are quicker.

Also, took the opportunity to rework the query builder for the browse/translate pages to use the more flexible arrays as on the moderate page. Since the simpletests have pretty good coverage of these pages, I could easily verify that my refactorings worked (and initially they did not :).

Also, added more cache clearing and some advance rebuilding, so cron clears off all caches but rebuild the three most important ones. Also, removing or adding a language team clears the caches.

The test suite runs fine with this patch, so I'll tinker with it a little and hopefully commit and deploy tonight.

gábor hojtsy’s picture

Status: Needs review » Fixed
StatusFileSize
new43.54 KB

Slightly tweaked comments, committed this patch. Thanks for all your contributions.

gábor hojtsy’s picture

StatusFileSize
new829 bytes

Forgot to include files at one place. Committed this too.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.