looking around in the DB tables for my (4.6) drupal site using the dba.module, i noticed that drupal is currently doing something very inefficient for storing the different revisions of each node. the "node" table just has a longtext field called "revisions", which contains serialized copies of the entire node for each revision. looking at the DB schemas for 4.7, i see that things have changed (there's now a "node_revisions" table), but it seems like once again, we're storing the entire node (with separate fields for the teaser and body) for each revision. for nodes with lots of text that are edited frequently, this becomes a huge waste of space in the database. while i haven't done any benchmarking, it's probably slow as hell to unserialize and process all that data once the number of revisions gets big, too (4.7 seems to have at least solved this part of the problem).

my first thought was "why doesn't drupal just record the diff in the DB, like most revision control systems do?"
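to make the diff idea concrete, here's a minimal sketch in python (purely for illustration, this is not drupal code). it keeps the first revision in full and stores each later revision as a list of changed line ranges. the names make_delta and apply_delta are hypothetical, and a real implementation would also have to decide how to serialize the delta into a DB column.

```python
from difflib import SequenceMatcher


def make_delta(old, new):
    """Return the edits that turn the list of lines `old` into `new`.

    Each edit is (start, end, replacement_lines): lines old[start:end]
    are replaced by replacement_lines. Unchanged lines are not stored.
    """
    delta = []
    matcher = SequenceMatcher(a=old, b=new, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            delta.append((i1, i2, new[j1:j2]))
    return delta


def apply_delta(old, delta):
    """Rebuild the newer revision from the older lines plus a delta."""
    result = []
    pos = 0
    for start, end, replacement in delta:
        result.extend(old[pos:start])  # copy the untouched lines
        result.extend(replacement)     # then the changed or added lines
        pos = end                      # skip the lines that were replaced
    result.extend(old[pos:])           # copy whatever is left at the end
    return result


rev1 = ["first paragraph", "second paragraph", "third paragraph"]
rev2 = ["first paragraph", "second paragraph, edited",
        "third paragraph", "a new fourth paragraph"]

delta = make_delta(rev1, rev2)
# only the delta needs to be stored for rev2; rev1 is stored in full
assert apply_delta(rev1, delta) == rev2
```

the tradeoff is that reading an old revision means replaying deltas from a full copy, so you'd probably still keep the current revision stored whole.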

then i had a (potentially) better idea... why doesn't drupal just use a real revision control system for this? it seems like a colossal waste of time for the drupal development community to reinvent a tool that other people have already made (a far more powerful and mature version of that tool than we've got, for that matter). wouldn't subversion itself be a nice fit for a backend to store the body of any node that has been configured to save revisions? the "revisions" field (or a row in the 4.7 node_revisions table) would just hold a new svn tag identifier for each unique revision of a node (tagging is nearly instantaneous in subversion, unlike cvs). since all nodes have a unique nid, we could use a simple convention for the filenames that each node's body was stored under. we'd still cache the most current revision in the node table, both for performance and for compatibility with all the existing modules and APIs. however, the parts of node.module that are trying to store or retrieve a given revision could just run some svn command(s) to either retrieve or commit and tag a given revision of a given node.
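as a rough sketch of what "run some svn command(s)" could look like, here's some python that only builds the command lines and doesn't execute anything. the filename and tag conventions (node-NID.txt, tags/node-NID-rev-N), the repository layout, the example repository URL and the function names are all assumptions made up for illustration; the only real things are the svn subcommands commit, copy and cat.

```python
def node_file(nid):
    """Hypothetical convention: one file per node, named after its nid."""
    return f"node-{nid}.txt"


def tag_path(nid, rev):
    """Hypothetical convention: one tag per saved revision of a node."""
    return f"tags/node-{nid}-rev-{rev}"


def save_revision_cmds(repo_url, workdir, nid, rev):
    """Commands to commit a node's body and tag it as revision `rev`.

    Assumes the body was already written to workdir/node-NID.txt in a
    checked-out working copy of trunk, and that tags/ exists in the repo.
    """
    commit = ["svn", "commit", "-m", f"node {nid} revision {rev}",
              f"{workdir}/{node_file(nid)}"]
    tag = ["svn", "copy", "-m", f"tag node {nid} revision {rev}",
           f"{repo_url}/trunk/{node_file(nid)}",
           f"{repo_url}/{tag_path(nid, rev)}"]
    return [commit, tag]


def load_revision_cmd(repo_url, nid, rev):
    """Command to print the body of a tagged revision to stdout."""
    return ["svn", "cat", f"{repo_url}/{tag_path(nid, rev)}"]


# example with a made-up local repository URL
cmds = save_revision_cmds("file:///var/svn/drupal", "/tmp/wc", 42, 3)
fetch = load_revision_cmd("file:///var/svn/drupal", 42, 3)
```

each list could be handed to something like subprocess.run(), which is exactly the fork()/exec() overhead a native API would avoid.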

eventually (and i know this kind of proposal would actually be years from really existing within drupal, which is why i set the "Drupal version" to "none" for this post) we could ask the nice folks who develop subversion if they'd be willing to:

  • make a version that uses MySQL as a backend instead of BerkeleyDB (they already must have a level of indirection for this, since they're now supporting both BerkeleyDB and the file system as the backend storage... seems like a MySQL backend is a logical next step for them, if it's not already in the works)
  • provide a php API so that applications like drupal could just invoke this API to access the svn repository directly (instead of having to fork()/exec() svn command-line tools)

once drupal was using svn as a backend to store node revisions, it'd be trivial to provide all sorts of powerful node revision functionality within drupal ("blame annotated source", better viewing of diffs, diffs across arbitrary revisions, RCS keyword substitution on nodes... anything svn can do already). we'd save a ton of space in the DB, and we'd never have to worry about manually adding new functionality to our revision management system, since we wouldn't be maintaining one anymore.

is this totally crazy? (yes, i know i'm painting an overly optimistic picture) ;)
has this been discussed before? (i couldn't find anything like this searching the forums)
is any of this worth investigating?

i'm curious what everyone thinks about this.

thanks,
-derek

Comments

sami_k’s picture

Adrian has talked about this, and he actually brought it up at the OSCMS Summit as something that he wanted to do. If you're interested in championing such an effort and have some time to do so, you should definitely talk to him. However, do note that Adrian, and pretty much everyone else here, is extremely busy, so if you can't personally contribute, it's not likely that anything as such will get done on any timeline other than what individual developers have available... That's my POV; Adrian may have a different one, and perhaps he can chime in with his own opinion.

--
life waits for no man.

dww’s picture

so if you can't personally contribute, it's not likely that anything as such will get done

i'm willing to help, i just need to know if this is worth pursuing at all, and if so, hash out some of the top-level design before i start writing code. so far, my drupal "expertise" (if you can call it that) is in a few of the contrib modules. i haven't looked closely at much of the core drupal code yet, so i'd like input and direction from some of the developers who will eventually decide if any of my work is actually rolled into a future version or not. ;)

___________________
3281d Consulting

puregin’s picture

As you point out, the new 4.7 has a completely new revision system with a separate table for revisions. The content (and teaser) are actually stored in the node_revisions table, not the node table.

For most applications, the overhead of storing a few revisions per node rather than diffs is acceptable.

I agree, though, that it would be sweet to have subversion or another revision control system for something like writing a book, or other applications where revisions are either large or frequent.

--
puregin

PixelNurse’s picture

Sadly the 4.7 revs table doesn't include the status of the revision, so 4.7 is gonna put the lid on any hopes I had to fix a huge problem I have with Drupal. Do you know if there's any chance of them adding a status field? The serialised array in the nodes table in 4.6 contains it.

http://drupal.org/node/50506

sepeck’s picture

Your best bet would be to ask on the development mailing list at this point. It's not that it's impossible, but the current changes reflect a fairly long-term effort to improve the performance of revisions. So, if not in 4.7, then perhaps 4.8? Lay the groundwork and begin networking now. :)

-Steven Peck
---------
Test site, always start with a test site.
Drupal Best Practices Guide -|- Black Mountain

mlncn’s picture

I do think revision control should be pursued, or else we should work on a Drupal way (modeled on Mercurial/Git?) of saving revisions as diffs. Even for a little piece of my Community Managed Taxonomy module, which allows people to vote on term descriptions, I would have loved versions saved as diffs. Storing diffs would ensure that each revision is unique while saving space and recording, by definition, what changed.

Of course, the whole subject of being able to vote on changes to a text could be ten research papers across five disciplines.

~ben

People Who Give a Damn :: http://pwgd.org/ :: Building the infrastructure of a network for everyone
Agaric Design Collective :: http://AgaricDesign.com/ :: Open Source Web Development

benjamin, Agaric

colan’s picture

Just in case folks come across this old deprecated thread, I've got an idea for doing this with Git that would work in Drupal 7+. See "Using Git for efficient field storage in Drupal" for details.