To use the packages from #568994: Implement translation package export in l10n_packager (see #568996: Knowing the file paths, grab latest translations for projects when installing), we need to figure out whether PHP supports gzip packaging widely. Otherwise we just cannot depend on that. José says it might not; others say it would seem natural that it does.

Comments

martin_q’s picture

Assigned: Unassigned » martin_q
martin_q’s picture

Here's the short answer, from http://www.phpro.org/manual/ref.zlib.html and many other places:

Zlib support in PHP is not enabled by default. You will need to configure PHP --with-zlib[=DIR]

The windows version of PHP has built in support for this extension. You do not need to load any additional extension in order to use these functions.

Note: Built-in support for zlib on Windows is available with PHP 4.3.0.

zlib is the library that offers functions such as gzopen to open a gzipped file stream - and it does happen that it is not always configured:
http://www.sitepoint.com/forums/showthread.php?t=155679

martin_q’s picture

Should we get the localise server to ask the client whether it supports gzip (or get the client just to tell the server when it makes the request), and then send gzipped - on the fly - or not, depending on support?

Would it suffice to check if (function_exists('gzopen')) and send a different request to the server depending on the outcome?
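A minimal sketch of that client-side check, written in Python purely for illustration (in PHP the test itself would be function_exists('gzopen')). The base URL, the file naming and the helper names are hypothetical; nothing in this thread fixes them.

```python
import importlib.util

# Hypothetical base URL, used only for illustration.
BASE_URL = "http://example.org/translations"


def has_gzip_support() -> bool:
    """Return True if this runtime ships the zlib module needed for gzip."""
    return importlib.util.find_spec("zlib") is not None


def translation_url(project: str, version: str, langcode: str, gzip_ok: bool) -> str:
    """Build the download URL, asking for the .gz variant only when supported."""
    name = f"{project}-{version}.{langcode}.po"
    if gzip_ok:
        name += ".gz"
    return f"{BASE_URL}/{name}"


if __name__ == "__main__":
    print(translation_url("drupal", "6.3", "cs", has_gzip_support()))
```

The client decides once, up front, which variant to request, so the server never has to guess.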

martin_q’s picture

For comparison, Joomla! requires zlib support in its PHP on install: http://help.joomla.org/content/view/1938/310/

The fact that this is not available universally upsets some people as they have problems installing (plenty of anecdotal evidence on forums etc - http://www.google.com/search?hl=en&q=joomla+install+zlib&aq=f&oq=&aqi=g4)... this suggests that zlib is not supported widely enough to depend on it.

martin_q’s picture

Status: Active » Needs review
gábor hojtsy’s picture

Yeah, our issue is that we do need to generate these files as static packages on ftp.drupal.org. If we need to have both gzip and non-gzip support, that can be painful in terms of the number of files. We have 5600 releases parsed now, and assuming (only) 100 languages, we'd need to generate 5600*100 = 560.000 files already. If we generate both gzip and non-gzip .po files, that is over a million files :)

What we can do is to:

- require zlib (but Joomla did not have that much luck for it) -- we can cross-check with the plugin manager people, they might require zlib as well
- optionally use zlib -- not as suggested above, but as in we automatically download stuff if there is zlib, otherwise we just have manual instructions for users to download
- look at whether PHP has better stream support for gzip compressed HTTP streams and let ftp.drupal.org (which we access over HTTP anyway) gzip on the fly in the web server -- kkaefer believed that PHP might have better support on the stream level even if there is no standalone zlib support
- generate both versions of the packages as discussed above -- a million files :)
- just generate .po files and optimize the crap out of it (as in leave out location comments, newlines, untranslated strings)

I'll check how much we can save by this ".po optimization".

gábor hojtsy’s picture

I was testing this "compression" by leaving out extra stuff on the latest Finnish translation:

$ msgcat --use-first *.po -o fi-xl.pox
$ msgcat --use-first *.po | msgattrib --no-fuzzy --translated -o fi-m.pox
$ msgcat --use-first *.po | msgattrib --no-fuzzy --translated --no-location -o fi-s.pox
$ msgcat --use-first *.po | msgattrib --no-fuzzy --translated --no-location --no-wrap -o fi-xs.pox

The results are:

716015 - fi-xl.pox
714169 - fi-m.pox
567729 - fi-s.pox
552405 - fi-xs.pox

Not including untranslated stuff and location comments gets us around 20% savings. As seen, we could still save about 3% more if we also leave out all the wrapping at column 80. Location comments are only ever used to orient people, and file names and line numbers don't help that much anyway.
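The percentages can be checked from the byte counts listed above. This is only a quick calculation sketch; the helper name is made up for illustration.

```python
# File sizes in bytes, copied from the results listed above.
sizes = {
    "fi-xl.pox": 716015,
    "fi-m.pox": 714169,
    "fi-s.pox": 567729,
    "fi-xs.pox": 552405,
}


def saving_percent(before: int, after: int) -> float:
    """Percentage by which `after` is smaller than `before`."""
    return 100.0 * (before - after) / before


if __name__ == "__main__":
    baseline = sizes["fi-xl.pox"]
    for name, size in sizes.items():
        print(f"{name}: {saving_percent(baseline, size):.1f}% smaller than fi-xl.pox")
    # Extra gain from dropping the wrapping, relative to fi-s.pox.
    extra = saving_percent(sizes["fi-s.pox"], sizes["fi-xs.pox"])
    print(f"no-wrap on top of fi-s.pox: {extra:.1f}%")
```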

The actual savings of course are highly dependent on the amount of strings translated. Not including untranslated could be a huge win for teams starting up and projects just partially translated.

gábor hojtsy’s picture

To estimate the load on the server, I've counted up the source strings:

mysql> select sum(char_length(value)) from l10n_community_string;
+-------------------------+
| sum(char_length(value)) |
+-------------------------+
|                 8191897 | 
+-------------------------+
1 row in set (0.27 sec)

This is the total length of all source strings. So when translated, if we assume the translations are the same length (they are not, but we can use this assumption for now), the data (source plus translation) is about 16MB per language. That also assumes single-byte UTF-8 chars, which fit some language teams, but far from all.

Then we need to add up the "Gettext overhead", things like msgid "..." and msgstr "...", which I've estimated to be around 30 bytes per string. This is a rough estimate; we can probably get it down by not using as much whitespace, etc. But 150.000 (number of strings) * 30 would be about 4.5MB.

So one language (assuming the above) would total up to about 20MB of uncompressed .po file data (with the above "minimalization" techniques, so we transport as little info as possible). Summed up for 100 languages, that would be 2000MB =~ 2GB.
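The estimate can be reproduced as a short calculation. It keeps the same simplifying assumptions as the comment (translations as long as the source, one byte per character, about 30 bytes of Gettext overhead per string, 100 languages); the constant and function names are made up for illustration.

```python
SOURCE_CHARS = 8_191_897      # sum(char_length(value)) from the query above
STRING_COUNT = 150_000        # approximate number of source strings
OVERHEAD_PER_STRING = 30      # estimated bytes for msgid/msgstr syntax
LANGUAGES = 100               # assumed number of languages


def per_language_bytes(source_chars: int = SOURCE_CHARS,
                       strings: int = STRING_COUNT,
                       overhead: int = OVERHEAD_PER_STRING) -> int:
    """Source text + equally long translation + Gettext overhead, in bytes."""
    return source_chars * 2 + strings * overhead


if __name__ == "__main__":
    one = per_language_bytes()
    print(f"per language: {one / 1e6:.1f} MB")
    print(f"{LANGUAGES} languages: {one * LANGUAGES / 1e9:.2f} GB")
```

With these inputs the result comes out slightly above the rounded 20MB and 2GB figures, which is within the precision of the assumptions.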

gábor hojtsy’s picture

Ok, so the current thinking is that 20MB in total per language might not look scary. We'd do best (in terms of low barrier to entry on the PHP setup side in Drupal) with on the fly HTTP gzip. The webserver can still cache the gzipped files, and we could serve them to people who have zlib, but not send them to people without zlib. This would then be set up on ftp.drupal.org via HTTP.

I'll try to point Narayan and Gerhard to this issue, so we get feedback soon. If on the fly gzip does not fly, we can still generate the packages twice in two formats or just do it uncompressed. It would again be 20MB uncompressed per language in total, with Drupal core or big modules like Ubercart at around 500k per language.
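On the client side, on the fly HTTP gzip could look roughly like the sketch below: advertise gzip in the Accept-Encoding request header only when decompression is available, and decompress only when the response says Content-Encoding: gzip. It is in Python for illustration only; the function names are hypothetical and no actual network request is made.

```python
from typing import Dict, Optional

try:
    import gzip  # depends on zlib being compiled in
except ImportError:
    gzip = None


def request_headers() -> Dict[str, str]:
    """Advertise gzip only if this client can actually decompress it."""
    headers: Dict[str, str] = {}
    if gzip is not None:
        headers["Accept-Encoding"] = "gzip"
    return headers


def decode_body(body: bytes, content_encoding: Optional[str]) -> bytes:
    """Return the plain .po bytes, decompressing if the server gzipped them."""
    if content_encoding and content_encoding.strip().lower() == "gzip":
        if gzip is None:
            raise RuntimeError("server sent gzip but no gzip support is available")
        return gzip.decompress(body)
    return body
```

Only one uncompressed file per package has to exist on the server; whether compression happens is negotiated per request.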

gábor hojtsy’s picture

Title: Figure out whether PHP supports gzip widely » Drupal (PHP) client requirements vs. localization packaging
seutje’s picture

subscribing

gábor hojtsy’s picture

[14:25] GaborHojtsy: JacobSingh: hey can you answer a quick question? how does Plugin Manager get away with possibly no support on the client for unpackaging a plugin?
[14:25] JacobSingh: It doesn't really, but Archive_Tar is (AFAIK) the best solution out there

Looks like we don't have any ultimate wisdom after all. So what about hosting those .po files uncompressed then?

gábor hojtsy’s picture

Talked to Narayan. He is saying we should be fine using simple .po files and compressing "on the fly" if the client supports it. But this latter thing he needs to check with OSU OSL, our hosting service. For now we will assume uncompressed .po files then.

meba’s picture

100 languages * 5600 packages * 2 = 1m files. That's a lot in one directory.

But if you generate them into a structure of directories:

cs/gz/drupal-6.3.cs.po.gz
cs/plain/drupal-6.3.cs.po
hu/gz/og-1.0.hu.po.gz
hu/plain/og-1.0.hu.po

It's still 1m files but never more than 5600 files in one directory, which is much better, and there shouldn't be a problem with administering these files on the file system. The directory structure will probably take something like 30MB of disk storage on ext3 but it's not a problem either.

meba’s picture

You can even go further (this is widely used!):

cs/gz/a/amarok-1.0.cs.po.gz
cs/gz/b/barron-3.54.cs.po.gz
cs/plain/a/amarok-1.0.cs.po
cs/plain/b/barron-3.54.cs.po

I would say it's never going to be more than 300 files in one directory then... And all of these directory paths are easy to construct, even on the client side.
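To illustrate how easy these paths are to construct, here is a small sketch that builds them from the project name, version and language code. The function name is hypothetical, and using the lowercased first character of the project name as the bucket is an assumption read off the examples above.

```python
def package_path(project: str, version: str, langcode: str, compressed: bool) -> str:
    """Build language/variant/first-letter/filename as in the examples above."""
    variant = "gz" if compressed else "plain"
    filename = f"{project}-{version}.{langcode}.po"
    if compressed:
        filename += ".gz"
    bucket = project[0].lower()  # assumed: bucket by first letter of the project
    return f"{langcode}/{variant}/{bucket}/{filename}"


if __name__ == "__main__":
    print(package_path("amarok", "1.0", "cs", True))
    print(package_path("barron", "3.54", "cs", False))
```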

gábor hojtsy’s picture

Well, we assume no gz now, and not sure we should preplan the initial structure for a possible gz option later :) Otherwise I totally agree.

jose reyero’s picture

If it's any help, we've gone through the same kind of assessment for the Atrium installer and found out that client side support for compression varies widely and was especially bad for Windows clients... (and you don't really want to handle that nor ask people to set up anything beyond a standard PHP install) so we ended up with plain po files too.

About the folder structure, just in case it helps I had this write up here, http://www.reyero.net/en/node/178
Though I agree a plainer folder structure may be better and easier to handle, having a more complex one could have advantages, like being able to check the languages available for a given project so I think at the very least we should have a folder per project.

One important idea I want to point out is prefixing the path with the main core version (6.x, 7.x). While you can make any of them a symlink, you can also build a better structure for future Drupal versions without breaking backwards compatibility.

6.x/amarok/1.0/amarok.xx.po

Then say you want all the languages for the amarok project 6.x-1.0 release, either for some packaging for offline installs, or for mirroring it, maybe for some internal deployment tool, you just need to grab all files in the folder

6.x/amarok/1.0

There's also the issue of what happens when a new module release is out. At that point there may be a time lapse from the moment the package is out till we have the translation packaged. We could build symlinks like 'stable' => '1.0' or maybe just go with the latest dev version, which should be something like

6.x/amarok/1.x

Another point, I think with some ftp tools you can grab a folder.tar.gz file compressed on the fly, so we may want to have a folder structure that makes sense for this.
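A sketch of the per-project layout described above, with the core version as path prefix. The function names are hypothetical; the path shape follows the 6.x/amarok/1.0/amarok.xx.po example.

```python
from pathlib import PurePosixPath


def release_dir(core: str, project: str, release: str) -> PurePosixPath:
    """Folder holding every language file for one project release."""
    return PurePosixPath(core) / project / release


def translation_file(core: str, project: str, release: str, langcode: str) -> PurePosixPath:
    """Path of a single language file inside the release folder."""
    return release_dir(core, project, release) / f"{project}.{langcode}.po"


if __name__ == "__main__":
    print(release_dir("6.x", "amarok", "1.0"))
    print(translation_file("6.x", "amarok", "1.0", "xx"))
```

Because all languages of a release share one folder, mirroring or offline packaging only needs the result of release_dir.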

chx’s picture

Yeah, hashing totally helps. We (nowpublic) use the first three hex digits of the md5 hashes of our image filenames to split them up into 4096 directories, which gives a relatively even distribution and not a terrible lot of files in any given directory. If you really want to go overboard, use AA/AA, i.e. 256 * 256 directories (four hex digits); that will serve you quite well up to hundreds of millions of files.
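A minimal sketch of both hashing schemes, assuming the hash is taken over the UTF-8 bytes of the file name. The function names are hypothetical and this is not nowpublic's actual code.

```python
import hashlib


def shard_dir(filename: str, digits: int = 3) -> str:
    """First `digits` hex digits of the md5 of the name (3 digits = 4096 dirs)."""
    return hashlib.md5(filename.encode("utf-8")).hexdigest()[:digits]


def two_level_path(filename: str) -> str:
    """AA/AA layout: two levels of 256 directories each (four hex digits)."""
    digest = hashlib.md5(filename.encode("utf-8")).hexdigest()
    return f"{digest[:2]}/{digest[2:4]}/{filename}"


if __name__ == "__main__":
    name = "drupal-6.3.cs.po"
    print(f"{shard_dir(name)}/{name}")
    print(two_level_path(name))
```

The client can compute the same path from the file name alone, so no directory listing is needed to locate a package.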

martin_q’s picture

Assigned: martin_q » Unassigned
gábor hojtsy’s picture

Project: Drupal core » Localization server
Version: 7.x-dev » 6.x-1.x-dev
Component: base system » Code
Status: Needs review » Fixed

Thanks for all your input, file/packaging structure implemented in #594570: Packaging module for l.d.o (l10n_packager). Marking this fixed. Also given that this is not going to be a Drupal 7 core feature, moving to the server queue, where it was implemented.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

gábor hojtsy’s picture

Issue tags: +localized install