I just got an idea and started creating patches, wondering how much will be able to get into 6.x (the rest should wait for 7 - really annoying to get in with fresh ideas and bump into a code-freeze hanging everywhere), but found several things broken in locale.module (http://drupal.org/node/171562), so I have to wait until that is solved, to have a working code for enhancements. But anyway I'm starting this Issue, to write my ideas down at the very least.

The locale tables are never pruned, which means there's a lot of garbage, and that's a bad thing for performance. This applies to long-standing localized sites: After a chain of several upgrades, each introducing some string- and feature-changes, the tables are full of dead entries from old versions, never used, but still processed by SQL queries, and still cached (if short).

On my site, the garbage found in the locale tables included:
- 2087 (of 4102 total) untranslated strings (mostly associated with admin pages), many of these probably never appearing in current Drupal version.
- Big amount of strings from different contrib modules we no more use.
- Many tab- and menu-titles which were capitalized in 5.x, as well as many strings with old t() replacement keys.

Currently, the only way to clean up is to delete whole locale, then rebuild from .po files. But I never go this way because:
- Not every contribs have a Czech locale, so these translations will be lost
- Several translations are strange and misleading (.po author didn't see page-wide context), so there are many customizations done on my site.
- Even more customizations are needed if I want the texts to be descriptive from user's perspective, i.e. written in context to my site's content and purpose, and not only just explaining the function in Drupal itself. This all will be lost, if rebuilding the locale.

To reduce the dead load, which is larger with each new upgrade, and to improve performance, I propose three measures (the more will be able to get into 6.x, or into Drupal at all, the more I'll be happy about this)

1. - Prune untranslated strings while upgrading to another major version. We don't actually lose anything - the strings still in use will re-register soon, the others are good for nothing. This patch I already created, so attaching (but I think that solving of the other mentioned Issue will break it).

2. - Monitoring of string-usage, based on Drupal versions. This is the crucial part, which should be done as soon as possible, to start data-gathering. It includes a small patch to locale.module and a new column in {locales_source}, and gives also a slight performance gain (I believe) by reducing amount of cached strings without a single new line in the time-critical part handling already-cached translations. Basically this new column will hold information about Drupal version, under which the string was last touched (only updated if different, so very infrequently). I already implemented this on my 5.1 test install, and hope to throw it onto my production site soon, to gather statistical data. The two purposes of this version-registering are:

2.1 - Long-term, this will gather information about outdated strings. For example, if I just upgraded from 6.x to 7.x, I'll look at summary of this data, and see strings divided by last use into groups '7.0', '6.2', '6.1', '6.0', '5.2', '5.1', 'never used' (given that I get the thing running on my 5.x install for testing). I might consider batch-deleting all strings from 5.x and older, which were last used two major versions back, and are unlikely to reoccur. This statistics might also help to get rid of never used strings from unused features - for example on my website, the user's comment-browsing-settings-panel never appears. I also abandoned a few contrib modules.

2.2 - Immediate performance effect: The locale cache (both database and static array) will hold only strings already touched by the current Drupal version, so old garbage will never get into the cache (making cache rebuilds expensive otherwise, unserializing on every page and so on). Only active strings of current version will be cached.

3. - Later on, extending the Locales admin interface to use the data gathered by #2, to show overview, allow batch deleting by Drupal-version history, show version in string-listings (helping to distinguish between old and new variations, such as capitalized/uncapitalized labels, upon editing), perhaps add search rule "only active in current version". This may easily wait to next version, will not be useful immediately (i.e. without gathered data in the table).

By the way - I also considered recording names of modules, which call the single strings. That would allow a per-module overview, auto-pruning on module uninstall, and per-module .po exports... But after implementing on test-install of 5.1 I found it too expensive, involving several string- and array-search operations on every single t() call :-( So I quit this part of the idea.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

JirkaRybka’s picture

Assigned: Unassigned » JirkaRybka

Hmm, thinking of it, I should probably assign to myself.

JirkaRybka’s picture

Status: Active » Needs review

Okay, I just submitted a patch to the locale-system bugs blocking this issue, and it turned out that the fix doesn't break the already-attached patch here.

Therefore marking "needs review" - it's all this patch needs now. :) It's addressing the point #1 from above, and does nothing more than harmless pruning at the time of upgrade.

JirkaRybka’s picture

FileSize
5.49 KB

The patch attached now is NOT ready to be commited, at least until http://drupal.org/node/171562 is fixed - since this depends on the bugfixes from the other issue, patch is not against HEAD. It applies AFTER patch from the linked issue (#14 in there), in case someone wants to give it a test.

Now, I've both #1 and #2 parts of my initial list implemented, so the features currently are:

-1- Locale pruning on D5->D6 upgrade. Only just removes untranslated strings, aiming to reduce dead load in the table. Strings still in use will self-register soon again (or will be re-imported from .po), and there's no point in hand-translating outdated strings, so we don't lose anything. This is the patch from above included.

-2- Core-version based string-history. A small change to locale() translating function introduces Drupal-version field on each string, aiming to exclude dead strings from cache-processing, and gather information for future pruning.
- It doesn't break any existing code. There's a new column on {locales_source} table, but it doesn't need any attention from outside this patch - there's a good default value, and no other should be inserted by other code. Adding this column in both install and upgrade is also covered in my patch.
- It doesn't affect the quick array-cached locale processing; code is unchanged here. But however, depending on site traffic and garbage-level in the tables, it may reduce amount of strings cached, improving performance. The excluded strings are dead ones, never used since last Drupal core update. This of course also applies to cache-rebuilding, where the code does a foreach loop on all short strings - now dead ones are excluded also from this.
- Processing of uncached strings only got tiny change to the query and one if() statement, so also almost unchanged. The registration of string-usage (involving one simple query and cache flushing) is only done on the first string-occurence after Drupal-update, which should collect most frequent strings on the first few pages, and then appear only rarely, or never.
- Besides of reduced cache load, the new column in fact collects string-history information, ready to be used for pruning in future. (See my vision below.)

It would be a bit difficult to do a benchmark on this, as the effect depends on long-term collected information and site-dependent garbage-level, but by the nature of my changes I believe that it will be somewhere between zero and slight improvement, shouldn't be worse (except the first few pages after upgrade).

I hope this is still able to get into D6, as it doesn't break anything, doesn't require any changes to other code, improves locale caching a bit, and collects data for future improvement (which is long-term and so should start soon). Now some kind of quick review is needed, particularly from core-commiters if possible, not perhaps final code, but the usefullness and chance to get into D6. I'll be happy to update my patch then.

-3- My vision of a follow-up - probably going into D7. (I'll be happy to provide this part for D6 in about two week's time, but I see that it doesn't exactly fit into code freeze.) I have an idea of a new administration pruning-page under "Translate interface", something like shown below. Pruning should also make string searches easier for hand-translation, so it's not only just "some hidden" performance to be never clicked.

Administer > Translate interface > String usage overview
--------------------------------------------------------------
| Last used | Textgroup |   Total   |Translated |Untranslated|
| in version|           |           |           |            |
--------------------------------------------------------------
| 7.0       |  Built-in |xxx strings|xxx strings|xxx strings |
|           | interface |  (delete) |  (delete) |  (delete)  |
--------------------------------------------------------------
| 6.0       |  Built-in |xxx strings|xxx strings|xxx strings |
|           | interface |  (delete) |  (delete) |  (delete)  |
--------------------------------------------------------------
| Not used  |  Built-in |xxx strings|xxx strings|xxx strings |
|   yet     | interface |  (delete) |  (delete) |  (delete)  |
--------------------------------------------------------------
|Total count|  Built-in |xxx strings|xxx strings|xxx strings |
|           | interface |  (delete) |  (delete) |  (delete)  |
--------------------------------------------------------------

What do you think about it?

JirkaRybka’s picture

Priority: Normal » Critical

Setting critical in hope of getting some attention before missing first D6-beta without this being even considered.

Gábor Hojtsy’s picture

I agree that pruning could go in as soon as possible. That is a great improvement now.

I think the version/usage marking should be explored more. We can definitely do it in the D6 version of autolocale module for example. The caching and usage change the update experience a lot. It makes you wait a lot to regenerate the cache all times in lots of page views as you browse around after the upgrade. There are lot more consistency and performance issues we can and should solve, and we need to explore them in a contrib module as well (the automatic import in D6 will not catch new translations for modules you have already installed, etc). Your ideas are definitely fresh and good about collecting the VERSION we are using strings at, and filling caches based on them. I'll think about this more, whether we can have it in D6 anyway.

Gábor Hojtsy’s picture

Priority: Critical » Normal

So, this is not critical. Concentrate first on the initial pruning, get that it soon, and then go to the version based dynamic stuff.

JirkaRybka’s picture

Thanks for your comment!

As for the initial pruning (#1 goal), the patch is ready at the very top of this issue (locale-pruning-1.patch) - still applies, no changes in this part. Ready to use if considered good.

The dynamic stuff (the other patch) - you're right that it slows down the very first pages after update, but that's no more than string-registering if empty locale added. Consider also, that the cache is almost empty meanwhile (very few strings registered for current version yet), so rebuild is short. I tested this briefly with real site data, and it didn't make an impression of any big difference to me.

BTW - there were much worse things in Drupal already - for example cache-filter rebuild after update is of similar nature, and throws on my site even multiple fatal errors of 'Memory limit exhausted', as there's pretty complicated filter installed (Texy) and often multiple nodes and blocks needs to be filtered on the same page, normally added incrementally, but on update it goes all in one batch. Compared to this, the proposed string-usage thing didn't make even any difference to notice on my test install. And then again, rebuilding of various caches after update is somehow normal on Drupal, and I believe that webmasters are already used to it. What counts is the real run-time performance, isn't it?

Needs more testing, that's certainly true point.

My current plan is (after upcoming one-week holidays) to backport this to 5.x and install on the production site I've access to, so that I'll be able to see the real results. But still, no exact (benchmark-like) numbers are likely to come out of this. Any help with testing and judging this proposal is welcome.

JirkaRybka’s picture

Forgot to add: After update, there are often new strings in the user interface, so the locale-cache rebuild on almost every page occurs without this patch too, but is more costy - processing the whole locale, as oposed to still-almost-empty patched cache. So actually we're only talking about the simple UPDATE queries setting version info, which should be pretty fast - being given the primary key 'lid' value as only search condition, and limited to only one row.

Just wanted to say this, now make your own judgement! :)

Gábor Hojtsy’s picture

OK, you are quite convincing on the version tracking issue. But let's get the update pruning quickly in. I looked into your SQL, and it does not seem like cross-database compatible (because of the "DELETE s FROM" part and the JOIN also). See http://dev.mysql.com/doc/refman/5.0/en/delete.html for example. What database are you testing with, where this DELETE worked?

JirkaRybka’s picture

I'm on MySQL 4.1.19, and I just used a snippet from the very page You linked, comment titled 'Posted by Alexandre Nunes on August 7 2005 4:55pm', didn't have a clue there might be a problem :(

Now going to read through that discussion more carefully, hope come back soon.

Gábor Hojtsy’s picture

I might be wrong. I remember (might be correct or not) that some SQL servers (eg. previus MySQL versions we might need to support) might not handle JOINs in DELETEs well. Again, I might be wrong. It needs to be checked.

JirkaRybka’s picture

There are two more approaches in that thread, and I really don't know which one should I choose and further investigate, without being incompatible again. Please comment:
DELETE FROM t1 USING t1 LEFT JOIN t2 ON t1.id=t2.id WHERE t2.id IS NULL
- or -

1) create a table, containing the distinct
reference numbers
CREATE TEMPORARY TABLE distinct_images
SELECT DISTINCT imagename.image FROM imagename;

2) create an extra field in table image, and tag
the entries in image that are referenced
ALTER TABLE image ADD linked tinyint default 0;
UPDATE image i, distinct_images di SET i.linked=1
WHERE i.aindex=di.image;

3) delete the entries from image that are not tagged
DELETE FROM image where linked=0;

4) delete the tagging field
ALTER TABLE image DROP COLUMN linked;

First seems clever, but dangerously similar to the old one (which I previously also used successfully on a production site on MySQL 4.0.x by the way). Second more difficult to implement, so I would like to know in advance, if that approach is desirable. Thanks in advance.

JirkaRybka’s picture

Hmm, I just read carefully through the specification for MySQL DELETE, as said on the linked-above page (the specification itself, not user comments), and from that, my original query seems perfectly compliant, every single bit of the syntax being explicitly allowed - as far as that MySQL manual can tell. As for other database types, I don't know.

JirkaRybka’s picture

From this discussion: http://www.thescripts.com/forum/thread74235.html I gather that this deleting is supported since MySQL 4.0. Observing Drupal's requirements at http://drupal.org/requirements, I see: Recommended: MySQL 4.1 or MySQL 5.0. Drupal will work on v3.23.17 and 4.0 but it is strongly suggested you use 4.1 or 5.0 for future compatibility with Drupal 6 which will drop support for older versions of MySQL. And indeed, modules/system/system.module says: define('DRUPAL_MINIMUM_MYSQL', '4.1.0'); // If using MySQL (Which worries me a lot by the way, because the production site I'm maintaining is on someone else's webhosting account, giving me low chances of forcing upgrade, and both php and MySQL are pretty outdated there).

So I think it's safe to use the original query, for MySQL at least. I guess the situation on other database types won't be too different, but any info is welcome here of course.

JirkaRybka’s picture

Okay, I've the first testing results on VERSION-recording part:

Backported to 5.x, now running on real site (at http://www.naturista.cz). This site uses heavy customized Czech locale, a lot of contrib modules. The history is 4.7.3 test-install, which wildly evolved into a production site, abandoning about 10 evaluated contribs (which were never translated though), then upgraded to 5.x (new strings hand-translated on-the-fly as discovered around the site). At the time of 4.7->5 update, pruning part #1 was also done, manually via PhpMyAdmin, so it's pretty much like my proposals now.

After installing the patch, I browsed all pages I could think of (not all nodes ofcourse, but all content types, forms, settings etc. - I'm pretty sure that I hit at least 90% of all page-types, probably more).

Observing speed, I see no difference. The server was never exactly snappy with Drupal, and I can't measure miliseconds on shared hosting via poor connection, but at least it's no worse in experience. (Honestly, it was rather faster today, but that's because the server had one of its better days, happens sometimes.) I was particularly afraid of unique-strings heavy admin pages, which I browsed last (i.e. with cache already full), but I didn't see any eye-striking difference there too.

First results collected from database: 3142 strings registered for current 5.x version, out of total 4485. That's 70% alive, 30% dead load. Looking only for short strings (<75) which will be cached, 2689 got 5.x flag, out of total 3711 - so again 72% alive, 28% dead. Saved 1022 cache-rebuilding-foreach-loops = per-page unserialized array items. As there are no visible problems, I'll leave it running - after my one-week holidays is over, I'll re-check the ratio.

meba’s picture

subscribing

JirkaRybka’s picture

FileSize
3.06 KB

If there's somebody out there wanting to give it a test on more live sites or full-featured test installs (which probably shouldn't be 6.x, as there are not contribs available yet)...

How to test this on 5.x:
- Pruning part #1 for 4.7 / 5.x only tested with one locale installed - in PhpMyAdmin execute: DELETE locales_source, locales_target FROM locales_source, locales_target WHERE locales_source.lid=locales_target.lid AND locales_target.translation=''; (Might be also interesting to see final results both without/with this done, after live strings registered again, showing real pruning ratio in the end.)
- In PhpMyAdmin, go to table 'locales_source', add a new column named version, varchar (20), Not null, default = none. (Feel free to drop this column after finishing test and restoring code to original form.)
- Back-up the original file, and apply the patch attached here, which is against the Locale module of Drupal 5.1.
- Browse as much pages as you could - the results will only get interesting long-term.
- Check ratio by executing from PhpMyAdmin:

SELECT source, version, COUNT( lid )
FROM locales_source WHERE LENGTH(source) < 75 GROUP BY version

for short (cached) strings, and

SELECT version, COUNT( lid )
FROM locales_source GROUP BY version

for all strings. This gives counts of alive/dead strings.

My today's ratio, after one night running: Cached - 74% alive, total - 72% alive.

I just wanted to provide maximum information now, because within a hour I'm leaving to my holidays (offline for a week). So forgive me silence here ;-)

Gábor Hojtsy’s picture

Status: Needs review » Fixed

Well, I took the time to think this through, and came to the decision that this helps us so much, that we should try to get into Drupal 6. Although it is a small-ish API change, it could have a big impact on multilanguage sites. All-in-all I agree that the cache rebuilding needs to be done anyway on the update, although this patch makes it run more times. Committed.

JirkaRybka’s picture

The goal #3 (for Drupal 7) follows in a separate issue:
http://drupal.org/node/175798

Anonymous’s picture

Status: Fixed » Closed (fixed)
hass’s picture

Status: Closed (fixed) » Needs work

May i jump in here please... only looked over this patch and saw LIMIT... this is MySQL specific code that will/may cause some problems...

+        if ($obj->version != VERSION) {  // NEW CODE!!!
+          db_query("UPDATE {locales_source} SET version = '%s' WHERE lid = %d LIMIT 1", VERSION, $obj->lid);
+        }
hass’s picture

Status: Needs work » Closed (fixed)

the above patch looks not like the code inside locale module... so closing.

JirkaRybka’s picture

The LIMIT already got it's own issue, and was fixed. http://drupal.org/node/186977
My apologies for that, I didn't know that it's MySQL specific back then. Also forgot to cross-link the issues.

Another problem was the DELETE query in update, which was fixed here: http://drupal.org/node/128866 (from #63 on).

I've learned my lesson on this...

mbria’s picture

subscribing