MI#1: Move item handling and commit branches into the API module

jpetso - January 31, 2008 - 17:04
Project:Version Control API
Version:5.x-1.2
Component:API module
Category:task
Priority:critical
Assigned:jpetso
Status:active
Description

As more backends are being implemented, it becomes clearer that 1. backends are able to retrieve and manage commit log data in the database, and 2. it's unlikely that backends appear that don't store their item revisions in the database, and that this would probably be a major performance hit.

As pointed out by ezyang in the Mercurial backend's README.txt file, we could make things easier for backends if some of the item handling code and tables is moved into the API module. As an additional bonus, we could simplify and speed up queries for the retrieval of commits and maybe branch/tag operations. Given the similarity between the different backends' item revision tables, I could imagine to move the tables to versioncontrol.install like this (MySQL version):

<?php
db_query
("CREATE TABLE {versioncontrol_item_revisions} (
  item_revision_id int unsigned NOT NULL default 0,  /* random iterative number */
  vc_op_id int unsigned NOT NULL default 0,  /* id of the associated commit */
  path varchar(255) NOT NULL default '',
  type tinyint NOT NULL default 0,  /* file, directory or deleted file */
  PRIMARY KEY (item_revision_id),
  UNIQUE KEY (vc_op_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
  item_revision_id int unsigned NOT NULL default 0,
  source_item_revision_id int unsigned NOT NULL default 0,
  PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);
?>

The 'action' property can be left out of {versioncontrol_item_revisions} as it can be determined from the set of source items and the current item, and leaving it out has no impact on performance. Having those tables in the API module would also make it possible to retrieve item history within the API module, taking this burden off the backend modules. It would also encourage backend modules to go the extra mile for providing correct item source revisions, because incomplete or wrong source revisions will show earlier (in form of action descriptions and broken history logs) and can't be hidden by easy workarounds as is the case now.

So, all good stuff, right? The only drawback is that this *again* is a major change on the database tables in both API and backend modules, and as such needs to be done before Version Control API is rolled out on drupal.org. Thus (and also because of the performance improvements) this is pretty much a critical issue, and one that should be done as soon as possible.

#1

jpetso - January 31, 2008 - 17:53

Oh baby! I just had a great idea: now that we have the same kind of identifier for all operations (commits, branch ops, tag ops), we can actually unify the table format for associating any kind of operation with their items - so instead of specifying the vc_op_id of the commit in the item revisions table and ignoring that id when referencing it from branch or tag ops (that was a bit unclean anyways), we can leave the id out of the item revisions table and have all mappings in one single operation/items mapping table.

Let's revise the schema given above, and make it more like this:

<?php
db_query
("CREATE TABLE {versioncontrol_operation_items} (
  vc_op_id int unsigned NOT NULL default 0,
  item_revision_id int unsigned NOT NULL default 0,
  PRIMARY KEY (vc_op_id, item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_item_revisions} (
  item_revision_id int unsigned NOT NULL default 0,  /* random iterative number */
  repo_id int unsigned NOT NULL default 0,
  path varchar(255) NOT NULL default '',
  type tinyint NOT NULL default 0,  /* file, directory or deleted file */
  PRIMARY KEY (item_revision_id),
  UNIQUE KEY (repo_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
  item_revision_id int unsigned NOT NULL default 0,
  source_item_revision_id int unsigned NOT NULL default 0,
  PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);
?>

That way, we can get rid of tables like {versioncontrol_*_commit_items}, {versioncontrol_*_item_tags} and {versioncontrol_cvs_item_branch_points}, plus everything is managed by the API module and backends don't need to care about any of those. Yay. Now it just needs to be implemented...

#2

jpetso - January 31, 2008 - 18:19

Renaming, it's a two-part issue now (this is the other part).

#3

jpetso - January 31, 2008 - 18:19
Title:Major issue #1: Move item handling into the API module» Major issue #1a: Move item handling into the API module

I said, *renaming*!

#4

jpetso - February 10, 2008 - 19:52
Component:Code» API module

I think there should also be the third column in the {versioncontrol_item_revisions} table that is specified for items by the API, namely this one: revision varchar(255) NOT NULL default ''

It's only being actively used by CVS, but it shouldn't really matter for performance and makes things a bit easier still, having all "core properties" in the table that is managed by the API module.

#5

jpetso - February 13, 2008 - 17:39

Ok, we need one another revision - Subversion manages to mess up a lot of expectations that apply to other version control systems. That includes the following points:

  • Replaced items: a commit action can not only contain current and source items but also a replaced item, saying that this item was deleted and added (or copied/moved) in the same revision. On the API side, I want this to be just another element in the commit actions array ('replaced item') which will of course be optional, and in the database we need to treat it separately from the source items so that the replaced item can be restored separately on loading the commit action.
  • When moving directories, child items are only moved implicitely in Subversion. This means that modified and deleted items don't necessarily need to have similar source and current item paths, and that in turn makes it impossible for us to calculate the 'action' based on the contents of a commit action array, it needs to be stored explicitely.

Both 'action' and 'replaced item' are properties of the commit action, so the solution might be a {versioncontrol_commit_actions} array in addition to the other ones that were mentioned before.

<?php
db_query
("CREATE TABLE {versioncontrol_commit_actions} (
  vc_op_id int unsigned NOT NULL default 0,
  action tinyint NOT NULL default 0,
  current_item_revision_id int unsigned NOT NULL default 0,
  replaced_item_revision_id int unsigned NOT NULL default 0,
  PRIMARY KEY (vc_op_id, current_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);
?>

However, taking ezyang's input from the other issue into account, we might not want to introduce extra tables for specific types of operations (commit, in this case) and rather have some denormalization for performance and less table join complications. It would be just as possible to merge {versioncontrol_operation_items} and {versioncontrol_commit_items} into one table, leaving 'action' and 'replaced_item_revision_id' unused for branches and tags. Together with the 'revision' column for {versioncontrol_item_revisions} mentioned in the previous comment, we would get something like this (listing all item related tables as current table scheme proposal):

<?php
db_query
("CREATE TABLE {versioncontrol_operation_items} (
  vc_op_id int unsigned NOT NULL default 0,
  item_revision_id int unsigned NOT NULL default 0,
  action tinyint NOT NULL default 0, /* only used for commit operations */
  replaced_item_revision_id int unsigned NOT NULL default 0,  /* only used for commit operations */
  PRIMARY KEY (vc_op_id, item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_item_revisions} (
  item_revision_id int unsigned NOT NULL default 0,  /* random iterative number */
  repo_id int unsigned NOT NULL default 0,
  path varchar(255) NOT NULL default '',
  revision varchar(255) NOT NULL default '',
  type tinyint NOT NULL default 0,  /* file or directory (either normal or deleted) */
  PRIMARY KEY (item_revision_id),
  UNIQUE KEY (repo_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_item_source_revisions} (
  item_revision_id int unsigned NOT NULL default 0,
  source_item_revision_id int unsigned NOT NULL default 0,
  PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);
?>

That would probably be the better solution... I'll try this out by implementing versioncontrol_svn with a similar structure (which will also be beneficial when updating to the centralized structure later).

#6

jpetso - February 21, 2008 - 00:32
Title:Major issue #1a: Move item handling into the API module» Move item handling and commit branches into the API module

Progress! Or better even - I'm now pretty sure that I've developed the best schema that is possible for this purpose. Features:

  • Virtually no redundancies or unused columns.
  • Increases the number of "core" tables only by one - 3 tables removed by centralizing into {versioncontrol_operations} (commits, branch_ops, tag_ops), one table repurposed (branches -> labels), and 4 tables newly introduced (operation_labels handles commit branches as well as branch and tag name associations, the other three are for item storage).
  • Handles both items and commit branches (therefore also obsoleting Major issue #1b).
  • can be joined in the same ways for commits, tags and branches (given a NULL item and label - or 0, technically - so also operations without items or branches/tags can be joined)
  • and generally handles all use cases that I could think of. Please try to find stuff that I didn't think of.

Commit actions are a combination of operation items and item relations - but they'll disappear anyways in the API, a more detailed set of items can (and will) replace any commit action. So we'll have just versioncontrol_get_operation_items($operation, $include_source_items = FALSE) instead of versioncontrol_get_commit_actions(), versioncontrol_get_branched_items() and versioncontrol_get_tagged_items().

Yes, I am stoked about this schema. You may not like the fact that it's very abstract, but we'll just have the API a bit more concrete (not many changes are planned anyways, compared to now) and considering that even backends hardly have to care about item management anymore, the abstraction shouldn't hurt anyone but hunmonk who does the cvslog -> versioncontrol porting script (sorry, dude).

Without further ado, I proudly present the new "operations" portion of versioncontrol.install. (I am inclined to say "my best piece of architecture yet", but maybe I'm just temporarily overenthusiastic, let's wait and see ;)

<?php
(...other actions...)
define('VERSIONCONTROL_ACTION_REPLACED', 6);
define('VERSIONCONTROL_ACTION_SPECIAL'7); // e.g. SVN's revprop changes, in case there's only those and no other modifications (which would be preferred over "special")

db_query("CREATE TABLE {versioncontrol_operations} (
  vc_op_id int unsigned NOT NULL default 0,
  type tinyint unsigned NOT NULL default 0,
  repo_id int unsigned NOT NULL default 0,
  date bigint NOT NULL default 0,
  uid int unsigned NOT NULL default 0,
  username varchar(64) NOT NULL default '',
  revision varchar(255) NOT NULL default '',
  message text,
  PRIMARY KEY (vc_op_id),
  KEY type (type),
  KEY repo_id (repo_id),
  KEY date (date),
  KEY uid (uid),
  KEY username (username),
  KEY revision (revision)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_operation_labels} (
  vc_op_id int unsigned NOT NULL default 0,
  label_id int unsigned NOT NULL default 0,
  action tinyint NOT NULL default 0, /* VERSIONCONTROL_ACTION_{ADDED,DELETED,MOVED} for branches and tags, VERSIONCONTROL_ACTION_MODIFIED for commits (doesn't matter anyways) */
  PRIMARY KEY (vc_op_id, label_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_labels} (
  label_id int unsigned NOT NULL default 0,
  repo_id int unsigned NOT NULL default 0,
  name varchar(255) NOT NULL default '',
  type tinyint NOT NULL default 0, /* VERSIONCONTROL_OPERATION_BRANCH or VERSIONCONTROL_OPERATION_TAG */
  PRIMARY KEY (label_id),
  UNIQUE KEY (repo_id, name, type)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_operation_items} (
  vc_op_id int unsigned NOT NULL default 0,
  item_revision_id int unsigned NOT NULL default 0,
  PRIMARY KEY (vc_op_id, item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_source_items} (
  item_revision_id int unsigned NOT NULL default 0,
  source_item_revision_id int unsigned NOT NULL default 0,
  action tinyint NOT NULL default 0,  /* now stored per item relation, not per commit. that makes it possible to combine "
modified" and "replaced" actions in one single structure, and extensible for further actions even */
  PRIMARY KEY (item_revision_id, source_item_revision_id)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);

db_query("CREATE TABLE {versioncontrol_item_revisions} (
  item_revision_id int unsigned NOT NULL default 0,
  repo_id int unsigned NOT NULL default 0,
  path varchar(255) NOT NULL default '',
  revision varchar(255) NOT NULL default '',
  type tinyint NOT NULL default 0,  /* file or directory (either normal or deleted) */
  PRIMARY KEY (item_revision_id),
  UNIQUE KEY (repo_id, path)
) /*!40100 DEFAULT CHARACTER SET utf8 */"
);
?>

#7

jpetso - February 21, 2008 - 00:40
Title:Move item handling and commit branches into the API module» MI#1: Move item handling and commit branches into the API module

Cutting off the "Major Issue #1:" completely makes the issue queue look odd, as this issue doesn't get the importance that it deserves and the major issues #2 and #3 lack a first one. Trying to find the golden middle way. ("Mission impossible"? ...well, not quite, but close.)

#8

jpetso - February 26, 2008 - 00:39

Coding is now in progress, the first part (database changes in the API module) has been committed as commit 103150.

#9

jpetso - March 3, 2008 - 10:30
Version:5.x-1.0-rc4» 5.x-1.2

More progress:

  • The search query has been unified and ported in commit 103866. I tend to find this very cool :)
  • For more performant queries I decided to have searchable source items "cached" in the {versioncontrol_operations} table, see commit 103646 for the exact changes and explanation.

Todo:

  • Merge versioncontrol_get_commit_actions(), versioncontrol_get_branched_items() and versioncontrol_get_tagged_items() into one single function versioncontrol_get_operation_items(). Commit actions will be an array of items (instead of actions) with source and replaced items being just another property. Those should be able to be retrieved by a new function, versioncontrol_add_source_items(&$operation_items), and probably a TRUE/FALSE parameter in versioncontrol_get_operation_items() which causes it to automatically call this function (or combine the queries and add source/replaced items by itself, whatever is better). The Subversion backend has the closest implementation of what will become get_operation_items() - still named get_commit_actions() there, of course - perhaps we should start from there.
  • versioncontrol-backend.inc needs to be adjusted as well, so that versioncontrol_insert_operation() inserts the operation items by itself. It should also sort the items alphabetically before inserting them.
  • That should be the two major tasks for the API module itself - afterwards, we need to port the VCS backends (starting with the CVS one) and port all get_commit_actions() usages to get_operation_items().

#10

jpetso - September 5, 2008 - 17:24

The first and second point on the above todo list are essentially done - versioncontrol_get_operation_items() and related functions work, and Commit Log shows items again. versioncontrol-backend.inc has been adapted to include versioncontrol_insert_operation() which replaces the former commit/branch/tag insertion functions, so I was able to get rid of lot of code :)

Some code in the CVS and SVN backends were also deleted, but porting those is not yet done, plus Commit Log is currently plastered with TODO items originating from the unification of commits, branches and tags into operations as well as the introduction of labels and the removal of the commit directory item. Those need to be tackled, parts of Commit Log might need a major overhaul. versioncontrol_project is in a bit of an unknown state - it should essentially work, but needs a review once API module, backends and Commit Log are done.

Here's an action plan on how we might be able to get it working again:

  1. Items need a 'selected_label' property (similar to a single $operation['labels'] element) so that the get_current_item_{branch,tag}() functions can keep working. (In fact, with the label attached to the item, those functions will be unneeded and can be deleted.) Background: Due to the fact that items are now not created by the backends anymore, the backends also cannot attach the branch/tag to operation items anymore, but those will still be needed for navigating in repository browsers. So instead of doing get_current_item_{branch/tag}() on caller request, we might rather attach that 'selected_label' property on loading the item and let the backend provide them via callback (given either the operation in get_operation_items() or the successor item in get_source_items() as context). Done.
  2. Related to that, I just noticed that versioncontrol-backend.inc still has its original branch management functions (near the end of the file) *and* a newer _versioncontrol_ensure_label(). Reduce that to only one version, and determine if the insertion functions still need to be public. (The retrieval functions will at least be necessary in the new callback mentioned above. Done.
  3. Implement the mentioned callback in the CVS backend, so that we can delete versioncontrol_cvs_get_commit_actions() and versioncontrol_cvs_get_{branched,tagged}_items() with a good conscience. For the SVN backend, a stub implementation (returning NULL) should be sufficient as it doesn't (yet?) support branch and tag emulation anyways. Done.
  4. Implement versioncontrol_get_item_history() with the 2.x API (using versioncontrol_add_source_items() or the {versioncontrol_source_items} table directly, probably also needs some functionality that goes into the successor item direction). Hm... yeah, I think that was the thing that made me want to add a {versioncontrol_item_labels} table, as we could then automatically traverse the history tree while preserving the branch and not having to ask the backend about the successor or source items on the same branch.
  5. Rework versioncontrol_get_all_item_{branches,tags}() so that they provide requirements (for the backends) and documentation that plays nicely with all that label stuff. If we do a {versioncontrol_item_labels} table as pondered above, we could also move this wholesale into the API module and take the burden of implementing them from the backends. On the other hand, that would require the backends to track labels (branch/tag creations, deletions and renames) more accurately, so that the current state of branch/tag assignments should ideally always match their database representation. Might be hard to get completely right, but I think with all this label data that we already need to manage, there's no turning back now... let the backends be data managers and the API module be the query engine, all the way.
  6. Move the lines-changed information in the {versioncontrol_source_items} table so that we can scrap get_commit_statistics(), whose API is not nice anyways. That way, we can provide the information where it makes sense (in the item arrays) and get rid of the {versioncontrol_cvs_item_revisions} table, which only contains the lines-changed columns by now. Port the (two, if I remember correctly) usages get_commit_statistics() in Commit Log and versioncontrol_project to whatever it looks afterwards.
  7. Finish the database update for CVS (needs to move branch and tag associations to {versioncontrol_operation_items}) and implement it for SVN (which should be easy, as there are only commits and the SVN backend maps well to the new schema).
  8. Port the log parsers of CVS and SVN backend as well as the CVS backend's xcvs-* trigger scripts to the new operation array format. See the versioncontrol_get_operations() function docs and the 2.x API changes page. Test and have a look if data gets transferred into the database correctly. (Commit Log should be able to show most of the stuff correctly, too.) Special areas of interest - having changed in the 2.x branch - are labels (previously branches/tags) and source items, and of course commit actions have been replaced by arrays of operation items (so the layout might be slightly different here, too).
  9. Rework Commit Log so that it displays all three kinds of operations - commits, branches and tags - correctly and in a way that they make sense. Might also involve changing the output format a bit, as that is a) inflexible and too crowded anyways, and b) only suited for commits in the current form.
  10. At that point, the CVS and SVN backends should import stuff correctly, and the Commit Log should show all the commits/tags/branches along with their items as intended. Review Commit Restrictions and versioncontrol_project if they need porting to the 2.x API (the existence of commit specific functions is a nice indicator for that), and if so, get them ported and working.
  11. Done! Version Control API is now restored to the 1.x branch's fully working state. (Right?) Proceed with #216373: Major issue #2: Direct hook access to the query.

I'm going to become a big fan of small-chunked checklists. Never leave your house without knowing what exactly to do next!

#11

jpetso - September 5, 2008 - 17:39

Ok, I implemented point 1, 2 and 3 on the above list (now updated with shiny number ordering), but it kicks up a few new questions. Essentially, I'm still not really happy with the way that Version Control API tackles labels (= branches/tags), although after my commit today it's a lot better than it was before. Needs more work still, see point 4 and 5. That's the final architectural challenge, when we get that one right, everything else is pretty much a breeze.

Essentially, as long as we need to query the backends for items, it's not yet done... at the moment we're currently stuck between "store lots of item data and operate independently on it" on the one hand and "still leave some important data to the backends" on the other one, which is a bit awkward in terms of consistency and clarity of the used concepts. Handing over all item data to the backends is not an option (that's what 1.x did, and had severe issues in terms of performance and queryability) so I think we should really take the final step and store all label associations for all items too. We might need better item/operation management functions - update functions, that is - but I think it's the only way this thing will be able to work elegantly.

So, next up: a final database table for {versioncontrol_item_labels}. Anyone want to try coming up with a schema? (Hint: should be easy!)

 
 

Drupal is a registered trademark of Dries Buytaert.