Project:Version Control API -- Git backend
Version:6.x-2.x-dev
Component:Code
Category:task
Priority:major
Assigned:halstead
Status:needs work
Issue tags:git phase 3, vc-next

Issue Summary

The main repo synchronizer (as in, what's run by calling VersioncontrolGitRepository::fetchLogs()) could stand to have its performance improved. A fair bit. I've attached the cachegrind output that was generated by parsing a Panels repo using drush. That's around 1450 commits. I'd post screenies, but skitch seems incapable of capturing stuff in an X11 window, so it can't see my kcachegrind.

There are a few big things that jump out:

We are making way, way WAY too many system calls to the binary. No big surprise there, but it's a problem - 7000 calls to exec(), taking 40% of execution time (inclusive). There is NO reason to have 7000 git shell calls for 1450 commits. And, pointing to some real flaw in our logic, there's also 7000 calls to shell_exec - which we use only when testing the git binary's location. And that's another 15% of execution time (inclusive). Yikes. So there's some easy wins right there.

The next bit of low-hanging fruit is under VersioncontrolEntityController::load(). Seems we spend 18% of execution time in there, 16.94% in the php function itself. Which means that it's probably the really inefficient sorting algorithms that are in there right now that are the culprit - they do NOT scale well, and get much worse as the number of in-memory-cached objects increases. (Which also explains why this process starts to get really _REALLY_ slow when we run multiple repositories through it). In fact, I actually think that fixing this stuff up could be the thing that really nails down massive speed increases.

After that, I think we're into micro-optimization territory, or needing a fairly significant refactor to get any additional benefits. Not advocating that latter option, just saying.

Note that I'm tagging this sprint 9, because the full repo-sync system isn't nearly as important as the incremental update that'll be triggered off of pushes (which still needs to be written...oi).

Comments

#1

Gah, cachegrind data is too big. Grab it from here:

http://doxy.samboyer.org/cachegrind.out.1294352671.bz2

#2

Status:active» needs work

Yeah, we definitely need to improve performance.

IMHO the real(more impact) solution for all the exec problem should be done with #973890: Create a reposync plugin using libgit2 php binding.

About the shell_exec calls, one static cache for that which shell invoke is definitely needed, patch attached for it(probably 1 instead of 7000 calls).

I need to review in detail other points to actually make an informed opinion :-p

AttachmentSizeStatusTest resultOperations
0001-Avoid-unnecessary-calls-to-shell_exec.patch1.24 KBIgnored: Check issue status.NoneNone

#3

#973890: Create a reposync plugin using libgit2 php binding is not an immediate-term solution, nor is it an excuse for inefficient logic. It will help a lot, but seeing as how we're unforeseeably far from getting any benefits from it, I'd rather not even bring it up right now as it's basically irrelevant to what we need to get done.

That patch looks like a good first step, go ahead and commit it.

#4

Status:needs work» active

Patch at #2 committed.

#5

Discovered that the bloat in the controllers is almost entirely due to VersioncontrolGitItem::determineSourceItemRevisionID doing single-loading of the potential source items. Turned off internal controller caching for that, hopefully should help a LOT with the performance-decrease-over-time problem.

#6

Assigned to:Anonymous» Jonathan Webb

Assigning.

#7

Last night I did a bunch of experimenting with replacing exec() calls to the git binary with calls to the Glip library. The only one I was able to get a significant improvement from was replacing all of the git ls-tree commands. Since it's the most used git command during log parsing that's a good thing.

This patch is paired down to the smallest set of changes and the way it's done it makes project/git_deploy a dependency (for Glip). This can be made even more efficient by only loading the repository once but I haven't done that here since that requires changing logic or adding a new global.

You can try it out and see what it does to your times.

What I'd really like to do in the future is go over the database tables and understand the way all of the data fits together and why we collect each part the way we do. Maybe there is a more efficient way to grab it.

I've also included the caching change sdboyer recommended.

AttachmentSizeStatusTest resultOperations
ls-tree_speed-increase.patch2.93 KBIgnored: Check issue status.NoneNone

#8

Status:active» needs work

Glip was gonna have to make it in sooner or later. Problems with the patch as posted, though:

  • It makes a call to git_deploy's glip-loading function. Which means a dependency on git_deploy - no can do.
  • Which leads to the problem of where to locate the glip library so it's not loaded over and over again. Frequent Drupal problem, unfortunately. The emerging practice is to use sites/all/libraries, so let's do that - expect the user to clone glip into sites/all/libraries/glip. Which means we also need a note added to the README. It'd be good if git_deploy was also updated to reflect this best practice.
  • I already added the missing 'may cache' => FALSE in an earlier patch. What I did notice, though, is that there were a bunch of other load calls which I'd made a git commit locally to turn off caching in that way, but had forgotten to export to into CVS. That's now fixed, see http://drupalcode.org/viewvc/drupal/contributions/modules/versioncontrol...

#9

Leaving this open as it may become enough or may need work depending on the out come of performance testing.

#10

#11

Assigned to:Jonathan Webb» Anonymous

Taking myself off this for now. +1 for replacing exec calls with some sort of php lib like Glip. If dependency is an issue, an uglier alternative to Glip integration would be to try to combine the work to be exec'd into fewer OS calls (by building some sort of temporary script and parsing the result). There is a lot of overhead to exec calls (loading & unloading the shell environment).

In performance testing I was using Diaspora which exhausted the available memory on my MacBook.

#12

Assigned to:Anonymous» halstead

I've reduced the number of exec() calls from thousands down to about 3 for initial import. Here are some times for fresh imports on my dev station. Down to ~15 seconds from about 110 minute for core.

I think it's pretty close and I should be able to grab what we are missing as it's identified.

halstead@avarice:/var/www/git-dev-skinny/drupal/sites/all/modules/versioncontrol_git$ time drush vcapi-parse-logs zen
Beginning synchronization of repository zen                                                                                               [ok]
Successfully synchronized repository zen                                                                                                  [success]

real 0m0.740s
user 0m0.460s
sys 0m0.160s
halstead@avarice:/var/www/git-dev-skinny/drupal/sites/all/modules/versioncontrol_git$ time drush vcapi-parse-logs views
Beginning synchronization of repository views                                                                                             [ok]
Successfully synchronized repository views                                                                                                [success]

real 0m3.352s
user 0m1.970s
sys 0m0.170s
halstead@avarice:/var/www/git-dev-skinny/drupal/sites/all/modules/versioncontrol_git$ time drush vcapi-parse-logs drupal
Beginning synchronization of repository drupal                                                                                            [ok]
Successfully synchronized repository drupal                                                                                               [success]

real 0m13.866s
user 0m8.050s
sys 0m0.170s

The highest memory usage while processing Drupal core was 157MB of resident memory and about 10 MB on disk.

Still to do,
Not exactly pretty but at least it's quick! (I wrote it with no sleep.)
Still missing types (FILE/DIR).
Needs to be tested against both tight and loose packs. (Should work.)
Need 40 zeros for empty parents.
Currently decides on tag/branch in a silly wrong way. :S Should be easy to make work correctly.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing-initial_import_speedup.patch15.4 KBIgnored: Check issue status.NoneNone

#13

Seriously, this is so spectacular it makes me want to cry.

I'm gonna take a look at this later tonight.

#14

Didn't get to this tonight, but tomorrow for definite.

#15

I still do not try the patch, but using git-fast-export seems like the obvious way to do imports, that I completely failed to notice :-p, naturally it reduce the number of exec's so, hoping to take a look at this soon.

Thanks halstead!

#16

We now use DBTNG for all of our INSERTS. Yay.
Line numbers are included. Branches/tags are better. File operations are better. Probably took care of some more.

Still to do, process annotated tags. (These come in as tagger: in the export and we can use the same name for both author and committer.) Annotated tag data should be tracked in versioncontrol_operations
using these constants.
define('VERSIONCONTROL_OPERATION_COMMIT', 1);
define('VERSIONCONTROL_OPERATION_BRANCH', 2);
define('VERSIONCONTROL_OPERATION_TAG', 3);

I still need to do auto-increment checking to make sure the assumed values are what actually went in.

At this point processing Drupal core now uses 534MB of resident memory and takes about this long:

real  0m47.837s
user 0m39.170s
sys 0m0.670s

Views repository uses 148MB of memory and takes:
real 0m7.657s
user 0m5.930s
sys 0m0.180s
AttachmentSizeStatusTest resultOperations
1019976-log_parsing-initial_import_speedup.patch15.23 KBIgnored: Check issue status.NoneNone

#17

Don't load the whole export file into a single array. Replaced odd dept concept with a more standard counter. Replaced if/continue with an easier to read if/else structure.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup.patch14.86 KBIgnored: Check issue status.NoneNone

#18

Minor fix. Avoid strict warning. No need to include the auto_inc column.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup.patch14.87 KBIgnored: Check issue status.NoneNone

#19

Added error checking to make sure we are getting errors due to concurrency.
Watchdog $type cannot be longer than 16 characters so I've shortened it.

Not working but we have all of the data we need for {versioncontrol_operation_labels} in
$log->name_block->names[$log->commit] on line 194.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup.patch17.77 KBIgnored: Check issue status.NoneNone

#20

Ugh, I started trying to do this, but I just can't quite grok the datastructure you've put together well enough to figure out how to do it for sure, and my focus needs to shift to actually running the builds. :(

One thing I did notice - there's a baked-in assumption that serials autoinc by 1. Not so in general, and not so far d.o. The live dbs use only even ids as a part of the replication strategy. And even worse, the db used by staging uses only 4s and 9s. So that's just great fun to calculate. :( spose we need two vars for that - base_autoinc_val & serial_increment_by or something.

Sorry :/

#21

Oh, and - let's go with 'vc_git' for watchdog instead of 'VC Git'.

#22

Core now takes 58 seconds.
Changed watchdog to a better name.
Call git rev-list once per branch to build our vc_op_label mapping.
Added variables for autoincrement settings.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup.patch18.29 KBIgnored: Check issue status.NoneNone

#23

The db_last_insert_id() function was returning zeros so giving false errors. Backing out that change.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup.patch18.5 KBIgnored: Check issue status.NoneNone

#24

Now with automatic error recovery.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup.patch20.21 KBIgnored: Check issue status.NoneNone

#25

Committed #23, so let's get an update diff with just the new bits?

#26

Just the error recovery attempt by itself.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup_error_recovery.patch3.64 KBIgnored: Check issue status.NoneNone

#27

Status:needs work» needs review

Committed #26 too, now. Rolling it into the tree, let's see how this sucker fares in the wild under high concurrency...

#28

Status:needs review» needs work

Change to handle empty repositories.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup_empty_repo.patch963 bytesIgnored: Check issue status.NoneNone

#29

I don't know if it will make any difference but maybe start the transactions manually.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup_transactions.patch2.97 KBIgnored: Check issue status.NoneNone

#30

Okay everything since #26 plus changed drupal_set_message to watchdog().

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup_26_forward.patch5.01 KBIgnored: Check issue status.NoneNone

#31

This patch rolled against 6.x-2.x fixes the off by one error we were seeing with the auto increment settings on git-dev.

AttachmentSizeStatusTest resultOperations
1019976-log_parsing_initial_speedup_off_by_one.patch20.79 KBIgnored: Check issue status.NoneNone

#32

OK, rolled this back. Just not quite ready for prod...we'll get it in ASAP after launch.

#33

So a work in progress but...
Removed DBTNG from queries involving auto increment and added the table locks back in. With these changes memory usage is cut in half and there are no issues with multiple workers. On my dual-core desktop machine this code imports all ~8900 repos in 8 minutes 4 seconds with 0 errors.

AttachmentSizeStatusTest resultOperations
1019976-log_remove_dbtng_add_locks.patch22.27 KBIgnored: Check issue status.NoneNone

#34

Quick update - I have a clear spot that halstead's work will slot in with the refactor & genericization of the log parser (which, btw, i'm trying to change to "[history] synchronizer"). Should allow for a pretty clean switch.

#35

Just wanted to get this out there, db_last_insert_id definitely won't work with DBTNG since it's using an entirely separate DB driver. Apparently DBTNG does not support the return value for the last insert with multiple value inserts, only single value inserts.

#36

OK, we are FINALLY ready to proceed with this (sorry) since #1092062: Introduce framework for generic repository history synchronization has progressed far enough to bolt on this system.

halstead's solution should be put in as the sole logic that lives under VersioncontrolGitRepositoryHistorySynchronizerDefault::initialSync(). That is, here: http://drupalcode.org/project/versioncontrol_git.git/blob/2f363866de29d8...

A quick explanation - the new generic sync framework has introduced methods that differentiate between three different types of sync operations: fully syncing from scratch (what halstead's parser does), fully syncing from an unknown state (what we currently do), and syncing in a specific range of data based on an argument (e.g., on a push. this is no yet written).

#37

Issue tags:+Release blocker

Tagging as a release blocker. Also, note that the topic branch on which all this code lives is called generic-parser.

#38

Issue tags:+vc-next

we're still "good enough" for now on this :(

#39

Issue tags:-Release blocker

clearly not a release blocker, therefore.