MI#1: Move item handling and commit branches into the API module
| Project: | Version Control API |
| Version: | 5.x-1.2 |
| Component: | API module |
| Category: | task |
| Priority: | critical |
| Assigned: | jpetso |
| Status: | active |
Jump to:
As more backends are being implemented, it becomes clearer that 1. backends are able to retrieve and manage commit log data in the database, and 2. it's unlikely that backends appear that don't store their item revisions in the database, and that this would probably be a major performance hit.
As pointed out by ezyang in the Mercurial backend's README.txt file, we could make things easier for backends if some of the item handling code and tables is moved into the API module. As an additional bonus, we could simplify and speed up queries for the retrieval of commits and maybe branch/tag operations. Given the similarity between the different backends' item revision tables, I could imagine to move the tables to versioncontrol.install like this (MySQL version):
<?php
db_query("CREATE TABLE {versioncontrol_item_revisions} (
item_revision_id int unsigned NOT NULL default 0, /* random iterative number */
vc_op_id int unsigned NOT NULL default 0, /* id of the associated commit */
path varchar(255) NOT NULL default '',
type tinyint NOT NULL default 0, /* file, directory or deleted file */
PRIMARY KEY (item_revision_id),
UNIQUE KEY (vc_op_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
item_revision_id int unsigned NOT NULL default 0,
source_item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
?>The 'action' property can be left out of {versioncontrol_item_revisions} as it can be determined from the set of source items and the current item, and leaving it out has no impact on performance. Having those tables in the API module would also make it possible to retrieve item history within the API module, taking this burden off the backend modules. It would also encourage backend modules to go the extra mile for providing correct item source revisions, because incomplete or wrong source revisions will show earlier (in form of action descriptions and broken history logs) and can't be hidden by easy workarounds as is the case now.
So, all good stuff, right? The only drawback is that this *again* is a major change on the database tables in both API and backend modules, and as such needs to be done before Version Control API is rolled out on drupal.org. Thus (and also because of the performance improvements) this is pretty much a critical issue, and one that should be done as soon as possible.

#1
Oh baby! I just had a great idea: now that we have the same kind of identifier for all operations (commits, branch ops, tag ops), we can actually unify the table format for associating any kind of operation with their items - so instead of specifying the vc_op_id of the commit in the item revisions table and ignoring that id when referencing it from branch or tag ops (that was a bit unclean anyways), we can leave the id out of the item revisions table and have all mappings in one single operation/items mapping table.
Let's revise the schema given above, and make it more like this:
<?php
db_query("CREATE TABLE {versioncontrol_operation_items} (
vc_op_id int unsigned NOT NULL default 0,
item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (vc_op_id, item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_revisions} (
item_revision_id int unsigned NOT NULL default 0, /* random iterative number */
repo_id int unsigned NOT NULL default 0,
path varchar(255) NOT NULL default '',
type tinyint NOT NULL default 0, /* file, directory or deleted file */
PRIMARY KEY (item_revision_id),
UNIQUE KEY (repo_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
item_revision_id int unsigned NOT NULL default 0,
source_item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
?>
That way, we can get rid of tables like {versioncontrol_*_commit_items}, {versioncontrol_*_item_tags} and {versioncontrol_cvs_item_branch_points}, plus everything is managed by the API module and backends don't need to care about any of those. Yay. Now it just needs to be implemented...
#2
Renaming, it's a two-part issue now (this is the other part).
#3
I said, *renaming*!
#4
I think there should also be the third column in the {versioncontrol_item_revisions} table that is specified for items by the API, namely this one:
revision varchar(255) NOT NULL default ''It's only being actively used by CVS, but it shouldn't really matter for performance and makes things a bit easier still, having all "core properties" in the table that is managed by the API module.
#5
Ok, we need one another revision - Subversion manages to mess up a lot of expectations that apply to other version control systems. That includes the following points:
Both 'action' and 'replaced item' are properties of the commit action, so the solution might be a {versioncontrol_commit_actions} array in addition to the other ones that were mentioned before.
<?phpdb_query("CREATE TABLE {versioncontrol_commit_actions} (
vc_op_id int unsigned NOT NULL default 0,
action tinyint NOT NULL default 0,
current_item_revision_id int unsigned NOT NULL default 0,
replaced_item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (vc_op_id, current_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
?>
However, taking ezyang's input from the other issue into account, we might not want to introduce extra tables for specific types of operations (commit, in this case) and rather have some denormalization for performance and less table join complications. It would be just as possible to merge {versioncontrol_operation_items} and {versioncontrol_commit_items} into one table, leaving 'action' and 'replaced_item_revision_id' unused for branches and tags. Together with the 'revision' column for {versioncontrol_item_revisions} mentioned in the previous comment, we would get something like this (listing all item related tables as current table scheme proposal):
<?php
db_query("CREATE TABLE {versioncontrol_operation_items} (
vc_op_id int unsigned NOT NULL default 0,
item_revision_id int unsigned NOT NULL default 0,
action tinyint NOT NULL default 0, /* only used for commit operations */
replaced_item_revision_id int unsigned NOT NULL default 0, /* only used for commit operations */
PRIMARY KEY (vc_op_id, item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_revisions} (
item_revision_id int unsigned NOT NULL default 0, /* random iterative number */
repo_id int unsigned NOT NULL default 0,
path varchar(255) NOT NULL default '',
revision varchar(255) NOT NULL default '',
type tinyint NOT NULL default 0, /* file or directory (either normal or deleted) */
PRIMARY KEY (item_revision_id),
UNIQUE KEY (repo_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
item_revision_id int unsigned NOT NULL default 0,
source_item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
?>
That would probably be the better solution... I'll try this out by implementing versioncontrol_svn with a similar structure (which will also be beneficial when updating to the centralized structure later).
#6
Progress! Or better even - I'm now pretty sure that I've developed the best schema that is possible for this purpose. Features:
Commit actions are a combination of operation items and item relations - but they'll disappear anyways in the API, a more detailed set of items can (and will) replace any commit action. So we'll have just versioncontrol_get_operation_items($operation, $include_source_items = FALSE) instead of versioncontrol_get_commit_actions(), versioncontrol_get_branched_items() and versioncontrol_get_tagged_items().
Yes, I am stoked about this schema. You may not like the fact that it's very abstract, but we'll just have the API a bit more concrete (not many changes are planned anyways, compared to now) and considering that even backends hardly have to care about item management anymore, the abstraction shouldn't hurt anyone but hunmonk who does the cvslog -> versioncontrol porting script (sorry, dude).
Without further ado, I proudly present the new "operations" portion of versioncontrol.install. (I am inclined to say "my best piece of architecture yet", but maybe I'm just temporarily overenthusiastic, let's wait and see ;)
<?php
(...other actions...)
define('VERSIONCONTROL_ACTION_REPLACED', 6);
define('VERSIONCONTROL_ACTION_SPECIAL', 7); // e.g. SVN's revprop changes, in case there's only those and no other modifications (which would be preferred over "special")
db_query("CREATE TABLE {versioncontrol_operations} (
vc_op_id int unsigned NOT NULL default 0,
type tinyint unsigned NOT NULL default 0,
repo_id int unsigned NOT NULL default 0,
date bigint NOT NULL default 0,
uid int unsigned NOT NULL default 0,
username varchar(64) NOT NULL default '',
revision varchar(255) NOT NULL default '',
message text,
PRIMARY KEY (vc_op_id),
KEY type (type),
KEY repo_id (repo_id),
KEY date (date),
KEY uid (uid),
KEY username (username),
KEY revision (revision)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_operation_labels} (
vc_op_id int unsigned NOT NULL default 0,
label_id int unsigned NOT NULL default 0,
action tinyint NOT NULL default 0, /* VERSIONCONTROL_ACTION_{ADDED,DELETED,MOVED} for branches and tags, VERSIONCONTROL_ACTION_MODIFIED for commits (doesn't matter anyways) */
PRIMARY KEY (vc_op_id, label_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_labels} (
label_id int unsigned NOT NULL default 0,
repo_id int unsigned NOT NULL default 0,
name varchar(255) NOT NULL default '',
type tinyint NOT NULL default 0, /* VERSIONCONTROL_OPERATION_BRANCH or VERSIONCONTROL_OPERATION_TAG */
PRIMARY KEY (label_id),
UNIQUE KEY (repo_id, name, type)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_operation_items} (
vc_op_id int unsigned NOT NULL default 0,
item_revision_id int unsigned NOT NULL default 0,
PRIMARY KEY (vc_op_id, item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_source_items} (
item_revision_id int unsigned NOT NULL default 0,
source_item_revision_id int unsigned NOT NULL default 0,
action tinyint NOT NULL default 0, /* now stored per item relation, not per commit. that makes it possible to combine "modified" and "replaced" actions in one single structure, and extensible for further actions even */
PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
db_query("CREATE TABLE {versioncontrol_item_revisions} (
item_revision_id int unsigned NOT NULL default 0,
repo_id int unsigned NOT NULL default 0,
path varchar(255) NOT NULL default '',
revision varchar(255) NOT NULL default '',
type tinyint NOT NULL default 0, /* file or directory (either normal or deleted) */
PRIMARY KEY (item_revision_id),
UNIQUE KEY (repo_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */");
?>
#7
Cutting off the "Major Issue #1:" completely makes the issue queue look odd, as this issue doesn't get the importance that it deserves and the major issues #2 and #3 lack a first one. Trying to find the golden middle way. ("Mission impossible"? ...well, not quite, but close.)
#8
Coding is now in progress, the first part (database changes in the API module) has been committed as commit 103150.
#9
More progress:
Todo:
#10
The first and second point on the above todo list are essentially done - versioncontrol_get_operation_items() and related functions work, and Commit Log shows items again. versioncontrol-backend.inc has been adapted to include versioncontrol_insert_operation() which replaces the former commit/branch/tag insertion functions, so I was able to get rid of lot of code :)
Some code in the CVS and SVN backends were also deleted, but porting those is not yet done, plus Commit Log is currently plastered with TODO items originating from the unification of commits, branches and tags into operations as well as the introduction of labels and the removal of the commit directory item. Those need to be tackled, parts of Commit Log might need a major overhaul. versioncontrol_project is in a bit of an unknown state - it should essentially work, but needs a review once API module, backends and Commit Log are done.
Here's an action plan on how we might be able to get it working again:
Items need a 'selected_label' property (similar to a single $operation['labels'] element) so that the get_current_item_{branch,tag}() functions can keep working. (In fact, with the label attached to the item, those functions will be unneeded and can be deleted.) Background: Due to the fact that items are now not created by the backends anymore, the backends also cannot attach the branch/tag to operation items anymore, but those will still be needed for navigating in repository browsers. So instead of doing get_current_item_{branch/tag}() on caller request, we might rather attach that 'selected_label' property on loading the item and let the backend provide them via callback (given either the operation in get_operation_items() or the successor item in get_source_items() as context).Done.Related to that, I just noticed that versioncontrol-backend.inc still has its original branch management functions (near the end of the file) *and* a newer _versioncontrol_ensure_label(). Reduce that to only one version, and determine if the insertion functions still need to be public. (The retrieval functions will at least be necessary in the new callback mentioned above.Done.Implement the mentioned callback in the CVS backend, so that we can delete versioncontrol_cvs_get_commit_actions() and versioncontrol_cvs_get_{branched,tagged}_items() with a good conscience. For the SVN backend, a stub implementation (returning NULL) should be sufficient as it doesn't (yet?) support branch and tag emulation anyways.Done.I'm going to become a big fan of small-chunked checklists. Never leave your house without knowing what exactly to do next!
#11
Ok, I implemented point 1, 2 and 3 on the above list (now updated with shiny number ordering), but it kicks up a few new questions. Essentially, I'm still not really happy with the way that Version Control API tackles labels (= branches/tags), although after my commit today it's a lot better than it was before. Needs more work still, see point 4 and 5. That's the final architectural challenge, when we get that one right, everything else is pretty much a breeze.
Essentially, as long as we need to query the backends for items, it's not yet done... at the moment we're currently stuck between "store lots of item data and operate independently on it" on the one hand and "still leave some important data to the backends" on the other one, which is a bit awkward in terms of consistency and clarity of the used concepts. Handing over all item data to the backends is not an option (that's what 1.x did, and had severe issues in terms of performance and queryability) so I think we should really take the final step and store all label associations for all items too. We might need better item/operation management functions - update functions, that is - but I think it's the only way this thing will be able to work elegantly.
So, next up: a final database table for {versioncontrol_item_labels}. Anyone want to try coming up with a schema? (Hint: should be easy!)