As more backends are being implemented, it becomes clearer that 1. backends are able to retrieve and manage commit log data in the database, and 2. it's unlikely that backends appear that don't store their item revisions in the database, and that this would probably be a major performance hit.
As pointed out by ezyang in the Mercurial backend's README.txt file, we could make things easier for backends if some of the item handling code and tables is moved into the API module. As an additional bonus, we could simplify and speed up queries for the retrieval of commits and maybe branch/tag operations. Given the similarity between the different backends' item revision tables, I could imagine to move the tables to versioncontrol.install like this (MySQL version):
db_query("CREATE TABLE {versioncontrol_item_revisions} (
item_revision_id int unsigned NOT NULL default 0, /* random iterative number */
vc_op_id int unsigned NOT NULL default 0, /* id of the associated commit */
path varchar(255) NOT NULL default '',
type tinyint NOT NULL default 0, /* file, directory or deleted file */
PRIMARY KEY (item_revision_id),
UNIQUE KEY (vc_op_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
item_revision_id int unsigned NOT NULL default 0,
source_item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
The 'action' property can be left out of {versioncontrol_item_revisions} as it can be determined from the set of source items and the current item, and leaving it out has no impact on performance. Having those tables in the API module would also make it possible to retrieve item history within the API module, taking this burden off the backend modules. It would also encourage backend modules to go the extra mile for providing correct item source revisions, because incomplete or wrong source revisions will show earlier (in form of action descriptions and broken history logs) and can't be hidden by easy workarounds as is the case now.
So, all good stuff, right? The only drawback is that this *again* is a major change on the database tables in both API and backend modules, and as such needs to be done before Version Control API is rolled out on drupal.org. Thus (and also because of the performance improvements) this is pretty much a critical issue, and one that should be done as soon as possible.
Comments
Comment #1
jpetso commentedOh baby! I just had a great idea: now that we have the same kind of identifier for all operations (commits, branch ops, tag ops), we can actually unify the table format for associating any kind of operation with their items - so instead of specifying the vc_op_id of the commit in the item revisions table and ignoring that id when referencing it from branch or tag ops (that was a bit unclean anyways), we can leave the id out of the item revisions table and have all mappings in one single operation/items mapping table.
Let's revise the schema given above, and make it more like this:
That way, we can get rid of tables like {versioncontrol_*_commit_items}, {versioncontrol_*_item_tags} and {versioncontrol_cvs_item_branch_points}, plus everything is managed by the API module and backends don't need to care about any of those. Yay. Now it just needs to be implemented...
Comment #2
jpetso commentedRenaming, it's a two-part issue now (this is the other part).
Comment #3
jpetso commentedI said, *renaming*!
Comment #4
jpetso commentedI think there should also be the third column in the {versioncontrol_item_revisions} table that is specified for items by the API, namely this one:
revision varchar(255) NOT NULL default ''It's only being actively used by CVS, but it shouldn't really matter for performance and makes things a bit easier still, having all "core properties" in the table that is managed by the API module.
Comment #5
jpetso commentedOk, we need one another revision - Subversion manages to mess up a lot of expectations that apply to other version control systems. That includes the following points:
Both 'action' and 'replaced item' are properties of the commit action, so the solution might be a {versioncontrol_commit_actions} array in addition to the other ones that were mentioned before.
However, taking ezyang's input from the other issue into account, we might not want to introduce extra tables for specific types of operations (commit, in this case) and rather have some denormalization for performance and less table join complications. It would be just as possible to merge {versioncontrol_operation_items} and {versioncontrol_commit_items} into one table, leaving 'action' and 'replaced_item_revision_id' unused for branches and tags. Together with the 'revision' column for {versioncontrol_item_revisions} mentioned in the previous comment, we would get something like this (listing all item related tables as current table scheme proposal):
That would probably be the better solution... I'll try this out by implementing versioncontrol_svn with a similar structure (which will also be beneficial when updating to the centralized structure later).
Comment #6
jpetso commentedProgress! Or better even - I'm now pretty sure that I've developed the best schema that is possible for this purpose. Features:
Commit actions are a combination of operation items and item relations - but they'll disappear anyways in the API, a more detailed set of items can (and will) replace any commit action. So we'll have just versioncontrol_get_operation_items($operation, $include_source_items = FALSE) instead of versioncontrol_get_commit_actions(), versioncontrol_get_branched_items() and versioncontrol_get_tagged_items().
Yes, I am stoked about this schema. You may not like the fact that it's very abstract, but we'll just have the API a bit more concrete (not many changes are planned anyways, compared to now) and considering that even backends hardly have to care about item management anymore, the abstraction shouldn't hurt anyone but hunmonk who does the cvslog -> versioncontrol porting script (sorry, dude).
Without further ado, I proudly present the new "operations" portion of versioncontrol.install. (I am inclined to say "my best piece of architecture yet", but maybe I'm just temporarily overenthusiastic, let's wait and see ;)
Comment #7
jpetso commentedCutting off the "Major Issue #1:" completely makes the issue queue look odd, as this issue doesn't get the importance that it deserves and the major issues #2 and #3 lack a first one. Trying to find the golden middle way. ("Mission impossible"? ...well, not quite, but close.)
Comment #8
jpetso commentedCoding is now in progress, the first part (database changes in the API module) has been committed as commit 103150.
Comment #9
jpetso commentedMore progress:
Todo:
Comment #10
jpetso commentedThe first and second point on the above todo list are essentially done - versioncontrol_get_operation_items() and related functions work, and Commit Log shows items again. versioncontrol-backend.inc has been adapted to include versioncontrol_insert_operation() which replaces the former commit/branch/tag insertion functions, so I was able to get rid of lot of code :)
Some code in the CVS and SVN backends were also deleted, but porting those is not yet done, plus Commit Log is currently plastered with TODO items originating from the unification of commits, branches and tags into operations as well as the introduction of labels and the removal of the commit directory item. Those need to be tackled, parts of Commit Log might need a major overhaul. versioncontrol_project is in a bit of an unknown state - it should essentially work, but needs a review once API module, backends and Commit Log are done.
Here's an action plan on how we might be able to get it working again:
Items need a 'selected_label' property (similar to a single $operation['labels'] element) so that the get_current_item_{branch,tag}() functions can keep working. (In fact, with the label attached to the item, those functions will be unneeded and can be deleted.) Background: Due to the fact that items are now not created by the backends anymore, the backends also cannot attach the branch/tag to operation items anymore, but those will still be needed for navigating in repository browsers. So instead of doing get_current_item_{branch/tag}() on caller request, we might rather attach that 'selected_label' property on loading the item and let the backend provide them via callback (given either the operation in get_operation_items() or the successor item in get_source_items() as context).Done.Related to that, I just noticed that versioncontrol-backend.inc still has its original branch management functions (near the end of the file) *and* a newer _versioncontrol_ensure_label(). Reduce that to only one version, and determine if the insertion functions still need to be public. (The retrieval functions will at least be necessary in the new callback mentioned above.Done.Implement the mentioned callback in the CVS backend, so that we can delete versioncontrol_cvs_get_commit_actions() and versioncontrol_cvs_get_{branched,tagged}_items() with a good conscience. For the SVN backend, a stub implementation (returning NULL) should be sufficient as it doesn't (yet?) support branch and tag emulation anyways.Done.Hide the 'selected_label' item property from both callers and backends, and implement lazy loading by referencing either the operation (versioncontrol_get_operation_items()) or source item in item-based navigation (e.g. versioncontrol_add_source_items(), versioncontrol_get_directory_items() or the yet-to-be-written versioncontrol_add_successor_items()) if the 'selected_label' has not yet been determined by the backend. Operation or item should be recursively queried for the selected label up to the start of the navigation when the selected label is known in any case (or not, with NULL).Done.Add versioncontrol_add_successor_items() as the opposite of versioncontrol_add_source_items(). Implement versioncontrol_get_item_history() with the 2.x API (using versioncontrol_add_source_items() or the {versioncontrol_source_items} table directly, probably also needs some functionality that goes into the successor item direction). Also, rework versioncontrol_get_all_item_{branches,tags}() so that they provide requirements (for the backends) and documentation that plays nicely with all that label stuff.Done.Move the lines-changed information in the {versioncontrol_source_items} table so that we can scrap get_commit_statistics(), whose API is not nice anyways. That way, we can provide the information where it makes sense (in the item arrays) and get rid of the {versioncontrol_cvs_item_revisions} table, which only contains the lines-changed columns by now. Port the (two, if I remember correctly) usages get_commit_statistics() in Commit Log and versioncontrol_project to whatever it looks afterwards.Done.Finish the database update for CVS (needs to move commit branch associations to {versioncontrol_operation_labels}).Done.Port the CVS backend's log parser and xcvs-* trigger scripts to the new operation array format. See the versioncontrol_get_operations() function docs and the 2.x API changes page. Test and have a look if data gets transferred into the database correctly. (Commit Log should be able to show most of the stuff correctly, too.) Special areas of interest - having changed in the 2.x branch - are labels (previously branches/tags) and source items, and of course commit actions have been replaced by arrays of operation items (so the layout might be slightly different here, too).Done.Rework Commit Log so that it displays all three kinds of operations - commits, branches and tags - correctly and in a way that they make sense. Might also involve changing the output format a bit, as that is a) inflexible and too crowded anyways, and b) only suited for commits in the current form.Done.Review Commit Restrictions and versioncontrol_project if they need porting to the 2.x API (the existence of commit specific functions is a nice indicator for that), and if so, get them ported and working. I think the documentation (bits in hook_versioncontrol.php and versioncontrol_fakevcs.module) also needs some updating.Done.I'm going to become a big fan of small-chunked checklists. Never leave your house without knowing what exactly to do next!
Comment #11
jpetso commentedOk, I implemented point 1, 2 and 3 on the above list (now updated with shiny number ordering), but it kicks up a few new questions. Essentially, I'm still not really happy with the way that Version Control API tackles labels (= branches/tags), although after my commit today it's a lot better than it was before. Needs more work still, see point 4 and 5. That's the final architectural challenge, when we get that one right, everything else is pretty much a breeze.
Essentially, as long as we need to query the backends for items, it's not yet done... at the moment we're currently stuck between "store lots of item data and operate independently on it" on the one hand and "still leave some important data to the backends" on the other one, which is a bit awkward in terms of consistency and clarity of the used concepts. Handing over all item data to the backends is not an option (that's what 1.x did, and had severe issues in terms of performance and queryability) so I think we should really take the final step and store all label associations for all items too. We might need better item/operation management functions - update functions, that is - but I think it's the only way this thing will be able to work elegantly.
So, next up: a final database table for {versioncontrol_item_labels}. Anyone want to try coming up with a schema? (Hint: should be easy!)
Comment #12
jpetso commentedOk ok ok, now I remember why I decided not to have that {versioncontrol_item_labels} table formerly - it's just too much data to manage. Like, Mercurial stores a line that goes something like
0425331c06ea63efa363df4c26078cf45a460197 2.0.3which specifies the 2.0.3 tag. Starting from that commit (represented by the SHA-1 id), we would need to retrieve all items in the repository at that given commit, and write a tag association for each of those items in the database. Plus for branches it's even worse, because all items *and* all source revisions of those items would need to be associated to a branch when a new one is created. And when a label is being deleted (or a branch being merged), all of those revisions would have to get that label removed again. Might be a bad idea after all, so let's better leave it at that.
Still, the thing that bothers me is that we need to determine and propagate a selected label, and currently that's done at the time of item retrieval. That's bad, because a lot of items can be retrieved at a given point in time, while the (mostly unused) cost for determining the selected label might not be negligable. For example, if Commit Log retrieves an operation via versioncontrol_get_operation() together with its items via versioncontrol_get_operation_items(), and decides to call versioncontrol_add_source_items() for all those operation items in the 10 commits shown by default, that's already a sizable amount of items. Should a backend decide to back up its branch/tag assumptions by directly invoking the VCS executable, that will be a slowdown that we really don't want.
Of course, the solution to that is lazy loading of the 'selected_label' property. Then again, when determining the selected label coming from another (source or successor) item, that item also needs to know either its own selected label or at least how it can get to that. For a file log, we need to propagate that label finding stuff, and it's unrealistic to keep a record of how to get from any given item to the whole selected item history. (That would perform badly, too.) Rather than that, each item should only need to hold the operation or item that it got the selected label from, and query that one, which in turn queries its predecessor, and so on. In other words, by-reference storage of predecessor items, otherwise the same item would determine its selected label multiple times (thanks to PHP's copy-on-write semantics for arrays).
Now this whole thing, lazy loading and reference-copying instead of value-copying of items, that just screams for objects. But it's not yet time for that, needs to wait until the next major rework of Version Control API. So for the time being, as a sufficient workaround for not having objects, let's store the origin of an item (operation, source/successor item, or none of that required if there's already a selected label) by reference in each item, add the label information by altering that item, and hope that the caller doesn't copy around items all too much before the information is being retrieved. For that and lazy loading to work, 'selected_label' (xor its origin information until 'selected_label' is retrieved) needs to be private to the API module, and the selected branch/tag retrieval functions have to be revived as versioncontrol_get_selected_label(&$item).
With that in place, the whole thing should work at last. Sorry for the confusion spread afterwards, and I'll update the bullet list to reflect the current state of things. Now for real. Dude!
Comment #13
jpetso commentedThe SVN backend is now fully ported, and back at its 1.x feature set. Which is not all that much (no hook scripts, no branch/tag emulation support, no functional account/password export), but it's enough to populate 2.x databases with new commits again. Which means that you're not dependant on my database dumps anymore. Way to go! I updated the bullet list and removed the SVN tasks from there.
Comment #14
jpetso commentedI just saved the world with a rather large commit solving the bulk of action point 5 from the list in comment #10. That was the last major hurdle, everything is now going to be fine. Really. The vast majority of the API, except for the three functions remaining on the list (get_all_item_{branches,tags}() and the statistics stuff) should now be more or less stable, and more importantly, works again for Commit Log purposes (and others) if I'm correct.
I also tried out how easy it is to write an (experimental) repository viewer, and apart from the get_item() function that I also added, I'd like to add one more wishlist item for the 2.0 release:
-
Add a function called versioncontrol_fetch_item_operations() which searches the database for operations associated to a given set of item revisions (say, pretty similar to versioncontrol_fetch_source_items()) and adds those as an 'operations' element to the item arrays. Use cases: 1. for a given item history (which only contains the raw item arrays), we need the associated operations so that an item history log viewer can display the messages for each revision. and 2. pretty much every repository viewer displays the time of last modification and the associated author next to the item, some also display the commit message - of course, those are all operation properties and will not go into into an item array natively. So we need to retrieve operations for a given set of items, hence the aforementioned function.Done.Comment #15
jpetso commentedImplemented versioncontrol_fetch_item_commit_operations() (note the slight rename for correctness purposes), so the additional point from the last comment is done already. The repository viewer is now out in the wild - although without a tarball release, for good reasons - and, as of today, makes use of this functionality. Which means it looks great and is nearly on par with other repository viewers, apart from the tons of features that it doesn't have.
Anyways, versioncontrol_get_all_item_{branches,tags}() really needs to be redefined now, see bullet point no. 5 in the above comment #10. Enough feature fun, let's get back to the Real Work (tm)!
Comment #16
sdboyer commentedSo. Finally trying to ACTUALLY help out :)
I did a quick patch that added the versioncontrol_backend_implements() calls and updated some of the docblocks accordingly. It's my impression that, since we're abandoning the branch/tag distinction in favor of labels (yar for that), there may be value in collapsing them into the same function sometime later, but I suspect that a) raises other issues and more importantly b) is pretty easy to do when the time comes, so I just left it.
Yeah, I definitely had some moments of screaming for OO. I also had the funny reaction that, since I'm now used to the stupid-crazy speed of git, that git might actually be slower if we move it into the database than just querying the backend directly :P Although tbh, that's actually something that I think might be easily-ish done when the OO reworking is done. I dunno though, I've not grokked enough of this yet.
Comment #17
sdboyer commentedI'm not clear on how #6 is supposed to work. It seems to me like we should only be moving that data into the generic API table if it's something that all the backends provide. Or, put another way, there's other data that the other backends are still providing in their own tables that's more than just the lineschanged data. Which means to me that we're not eliminating those tables/that method of garnering operation data entirely - so what are we gaining by moving the lineschanged info into the API table?
Comment #18
jpetso commentedThe reasoning behind #6 (moving lines-changed data into the API module) was as follows:
We already provide capabilities that not all backends can handle (action types that don't map to CVS or SVN, log messages for branches and tags, ...anything else? ...) so the reason for backends storing their own "proprietary" information is more like "this data can't be (directly) used by callers or the API module". Lines-changed info, however, is directly usable without further requirements on the backends.
I thought we could get around the "not supported by backend X" issue by having a column that says if lines-changed info has been recorded, and two columns that capture the actual "lines added" and "lines deleted" values. (The "has been recorded" flag could also help to add that information if it's implemented at some later point and exists for some items already but not for all of them.)
That was my reasoning... I could have missed something though, do tell me if you disagree.
Comment #19
jpetso commentedI committed a different approach to versioncontrol_get_all_{branches,tags}() - also different to comment #16, see the associated commit message for more info. The CVS log parser is now also ported, though completely untested at the moment. (That's a bit of ignoring the original task order, but I just felt like it. Seems like we're making progress - slow, but steady :)
Comment #20
jpetso commentedThere were no objections against moving the lines-changed information into the API module, so that's now in there too. Updating the xcvs-* scripts is nearly done, too.
Comment #21
jpetso commentedThe CVS database update is complete! Now all data is accessible by the API module. Next up: porting (and testing) the xcvs-* scripts.
Comment #22
jpetso commentedThe xcvs trigger scripts for commitinfo, loginfo and taginfo are now functional again. Unfortunately I forgot to port the posttag script last time - and today is late already, I'm outta here for now - so this one's pending still.
Comment #23
jpetso commentedxcvs-posttag is working again, tags are being recorded correctly. Which means the xcvs-* scripts are back to their previous state. Before proceeding to the Commit Log "unified commit/branch/tag display" rework, I'll try to implement tag (and branch) deletions properly in the xcvs scripts. We'll need that anyway for the release node integration.
Comment #24
jpetso commentedTag (and branch) deletions are now properly recorded in the xcvs-* scripts, yay. Also, reviewed Commit Restrictions and versioncontrol_project: I fixed a couple of smaller issues, but overall there was little left to port.
So now, one issue left: Commit Log! (and some bits of documentation.) I'd say we're getting there? :D
Comment #25
jpetso commentedCommit Log is now working like it did before. That doesn't mean that it's perfect (some more modularization and refactoring would be cool) but in regard to this issue it's probably far enough to be marked as Done. Leaves a little bit of documentation, and then we can close off this era, er, I mean issue.
Comment #26
jpetso commentedI completed the last part (documentation updates) of the last bullet point just now. Version Control API is fully operational again. This issue was in the works long enough, time to say goodbye and take on new challenges. Hurray for zero gravity!™