Problem/motivation
Currently when importing .po files, we read a file from start to finish in one sitting. This can take longer then allowed on a shared hosting environments, and might time out the PHP process. To fix this, we should make all .po file reading support reading only a section of the file and then continue reading from the end of that in another HTTP request and so on and so forth. This is solved in Drupal 7 with the gettextapi module.
Proposed solution
Introduce the possibility to save the position of the file when reading so we can save that state and return to that seek position in the next HTTP request. Therefore we add getSeekPos() and setSeekPos() to the stream reader.
Then using this, we can apply progressive batches to .po file reading, so we store the seek state of the reader in the batch and restore it until we reach the end of file. Then we move on to the next batch operation to read the next file.
To unify this, we also apply this progressive batch to importing single uploaded files coming from the locale import .po UI.
Comment | File | Size | Author |
---|---|---|---|
#27 | gettext-batch-changelog.patch | 685 bytes | Gábor Hojtsy |
#24 | batch_import_gettext_1637348_24.patch | 20.17 KB | vasi1186 |
#24 | interdiff_21_24.txt | 4.4 KB | vasi1186 |
#21 | batch_import_gettext_1637348_21.patch | 20.22 KB | vasi1186 |
#21 | interdiff_19_21.txt | 557 bytes | vasi1186 |
Comments
Comment #1
attiks CreditAttribution: attiks commentedtags
Comment #1.0
attiks CreditAttribution: attiks commentedsumm updated
Comment #2
steinmb CreditAttribution: steinmb commentedPatch committed :), changing status.
Comment #3
Gábor HojtsyAnybody wanna work on this one now? :) @steinmb?
Comment #4
Gábor HojtsyThe D8MI sandbox at http://drupal.org/sandbox/goba/1624820 has a branch with lots of code to use for this. It was set aside to help focus on the core gettext functionality that got committed.
Comment #5
vasi1186 CreditAttribution: vasi1186 commentedComment #6
vasi1186 CreditAttribution: vasi1186 commentedA fist version of the patch that needs review (and most probably some work still).
Comment #7
Gábor HojtsyA quick review of the current patch. Mostly code style / docs stuff:
Add @file level comments.
"managed continue" does not sound like proper English :) "manage continuation of a reader" or something like that maybe?
implementing
- Key/value (start with uppercase)
- of which one => out of which one?
// Make sure to re-read the header first to have metadata for plural forms.
Or something along those lines IMHO.
Document $seekpos and $items.
Amazing for standardizing this code. Yay, yay!
Document new common arguments.
I've seen this was moved from the file import code, but I don't think we need to rebuild menus anymore, they should not contain locale specific data.
Comment #8
clemens.tolboomI think an array of options would be better like
with default. These options can then easily be transferred to other functions?
Maybe naming the last $langcode onto $matches would help better understand this code?
Then we can change this in to
Comment #8.0
clemens.tolboomsummary updated
Comment #9
Gábor HojtsyUpdated the issue summary with a much cleaner version and giving a cleaner title (hopefully) as well.
Comment #10
Gábor HojtsyChecked for standard methods to do hibernation of PHP objects, since we are doing a very thin layer to do it ourselves. Found these two:
- __sleep() and __wakeup() magic methods, where __sleep() takes keys of properties it will serialize; http://de2.php.net/manual/en/language.oop5.magic.php#object.sleep (this makes some requirements on the constructor and is not an interface that can be ensured/checked for)
- The Serializable interface where we return the serialized data ourselves and get the serialized data, this is very similar to the batchstateinterface and is a PHP standard, so why not use this? http://de2.php.net/serializable
Comment #11
vasi1186 CreditAttribution: vasi1186 commentedAttached a patch that should apply now on the latest 8.x branch, and should solve the issues pointed out in #7 and #8
#10 not yet done.
Comment #12
clemens.tolboom@Gábor Hojtsy you're probably right in that the interface now is similar with the serializable one. I don't get the constructor part remark though. Our constructor is empty by design afaik?
Reading http://de2.php.net/manual/en/language.oop5.magic.php#object.sleep they refer to http://de2.php.net/manual/en/class.serializable.php which is imho better and more OOP oriented.
So if @vasi1186 like to try use http://de2.php.net/manual/en/class.serializable.php instead.
Comment #13
Gábor Hojtsy@clemens.tolboom: cool, we are proposing the same thing then :)
Comment #14
sxnc CreditAttribution: sxnc commentedJust tested the patch from #11 and it worked fine locally, saw the progress-bar and everything (used file sizes were around 600-900kb).
Comment #15
vasi1186 CreditAttribution: vasi1186 commentedAfter a closer look with Gábor, we actually got to the conclusion that we do not really need the BatchStateInterface at the moment. We used this only to set the pointer position in the file, and then get it back. So, in this case we decided to reduce the code and use just two functions to set and get the pointer in the stream. This position is stored in the context of the batch operation (as it was also before), so we actually never used the "hibernation" of an object in a true way, the seek position was set from outside all the time.
Attached a patch and an interdiff with the changes.
Comment #16
clemens.tolboomThe intent of an interface is to abstract stuff.
The purpose of getState | setState was to hide away filename, langcode, seekpos etc for the batch. That makes continuing a Batch step easier 'just grab the state' and is more OOP.
Is this because most of the state data is needed for the reporting too?
I'm indifferent to any good solution as Batch API is not OOP yet which then would probably require Serializable.
:-)
Comment #17
Gábor HojtsyYes, it would need to required serializable then, which is a whole bunch of data, that we do not keep in the batch otherwise. Or in other words, we have the file name and langcode in the batch operation setup already, way before any reader is instantiated, to even instantiate the reader at first, so we'd be duplicating that. When looking at the code, we realized, all we really need on top of what we have anyway already is to keep track of the seek position. We cannot really encapsulate the file name and langcode because the batch API needs it to even instantiate readers and designate batch operations + we cannot really encapsulate the writing options, because those are not even in the reader, they are on the writer. So after a serious rearchitecture of the batch API to work with more OOP-y code, we might be able to return to this, but we are not taking that on.
Comment #18
lazysoundsystem CreditAttribution: lazysoundsystem commentedJust to say that I've tested the patch from #15 and it works well.
Just one too many 'l's in 'useful' :)
Comment #19
vasi1186 CreditAttribution: vasi1186 commented:) solved that.
Comment #20
penyaskitos/possition/position/g
Otherwise, this patch is awesome, vasi1186++.
Comment #21
vasi1186 CreditAttribution: vasi1186 commentedthx!, updated the patch.
Comment #21.0
vasi1186 CreditAttribution: vasi1186 commentedUpdate summary with more meaningful and up to date info.
Comment #22
penyaskitoThis is RTBC for me.
Comment #23
attiks CreditAttribution: attiks commentednice work
Comment #24
vasi1186 CreditAttribution: vasi1186 commentedSmall change: seekpos -> seek, getSeekPost() -> getSeek(), setSeekPos() -> setSeek().
Comment #25
Gábor HojtsyThat change was in fact suggested by @webchick to avoid abbreviating names. Looks like should be still back to RTBC then :)
Comment #26
webchickOk, spent a bunch of time reviewing this yesterday; sorry I didn't get as far as follow-up/commit.
As Gábor said, one nit-picky thing that stood out was the naming of the getSeekPos() function, but that's been remedied now, so thanks!
The other less nit-picky thing is that anytime I see stuff like this:
...it's usually an indicator that we're overloading a function into doing way too much. It's akin to when we used to have hook_node($op, $node, $a3, $a4) in D6 and below; these were split into hook_node_load(), hook_node_save(), etc. in D7+ and as a result the functions are much more focused and are much more grokkable. It feels like we probably need something similar here.
I spoke to Gábor about this, and he acknowledged this concern, but also said that fixing it would require far-reaching changes to the underlying Batch API as well. I actually think that'd make a great follow-up for anyone who worked on this issue, because this capability of processing items in a "series of seeks" (I'm using the wrong words here, but), rather than a big stack of things to do, seems like really useful functionality to have as part of the general Batch API. And if we did that, it's likely we could clean up some of the code here.
However! None of that is worth holding up this patch, which looks like a great solution to the problem, and these implementation concerns aside, looks good. Therefore..!
Committed and pushed to 8.x. Thanks! If we could get a follow-up issue to introduce this functionality into Batch API "proper" though, that would be awesome. (Even more awesome if some folks here who had to understand the existing batch API to write this could help push that issue along with at least some pseudocode/description of how it could work, before you all forget everything you learned. :D)
Comment #27
Gábor HojtsyYay, thanks. Here is a CHANGELOG.txt followup :) Not sure the text is ok, but this should be quick to review.
Comment #28
webchickText looks good to me!
Committed and pushed to 8.x.
Comment #30
Gábor HojtsyOff the sprint then, thanks all!
Comment #30.0
Gábor HojtsyMore up to date solution explanation.