Data API: further steps [#145684]

Issue http://drupal.org/node/113435 takes some first steps towards a unified data API for Drupal that might conceivably be possible for Drupal 6.

The aim of this issue is to begin to map out the further steps beyond that. This is almost certainly for post-6.x work, but it's important nonetheless to start mapping it out, as http://drupal.org/node/113435 only makes sense as a step towards something.

Key to any future Data API is the pending Schema API, http://drupal.org/node/144765. When it is in, and enhanced with additional table and field properties (foreign key and other field relationships, Drupal-specific parameters like whether a particular data type takes SQL rewriting), we presumably will have all the schema knowledge required to programmatically create, update, load, and delete objects along with all their linked data.

But what is our best approach? How best to take advantage of this new schema awareness?

To start off, let's try to define some of what a Data API should accomplish. Here's a preliminary list of suggested goals:

1. All data transactions are completely independent of form/user input.
2. Data have a consistent structure in all their forms. E.g., what's returned by a load operation can be sent directly to an update operation without change of format.
3. All objects are of a single type. Here we need to choose between our two current forms: array and object.
4. Create and update operations can accept data without IDs set and respond appropriately. For example, we can save a node with an associated user (author), that user being identified not by ID but by an array of properties. If the user does not exist, she/he will be created as a user. If the user does exist already, the corresponding user id will be saved.
5. Data API implementations, e.g., save a node, include no direct SQL. All needed database operations are handled by the API.
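
To make the "save without IDs" goal concrete, here is a minimal sketch. It assumes an in-memory array standing in for the users table and matches users on 'name' only; the function names example_resolve_user() and example_save_node() are hypothetical and not part of any existing API.

```php
<?php
// Hypothetical in-memory stand-in for the users table.
// A real implementation would go through the database layer instead.
$users = array(1 => array('uid' => 1, 'name' => 'alice'));

/**
 * Resolve a user given as an array of properties to a uid.
 * Creates the user if no match exists. Matching on 'name' is an assumption.
 */
function example_resolve_user(array $properties, array &$users) {
  if (isset($properties['uid'])) {
    return $properties['uid'];
  }
  foreach ($users as $uid => $user) {
    if ($user['name'] === $properties['name']) {
      return $uid;
    }
  }
  // No match: create the user with the next free id.
  $uid = $users ? max(array_keys($users)) + 1 : 1;
  $users[$uid] = array('uid' => $uid) + $properties;
  return $uid;
}

/**
 * Save a node whose author is identified by properties rather than by ID.
 * Returns the node as it would be stored, with 'uid' filled in.
 */
function example_save_node(array $node, array &$users) {
  $node['uid'] = example_resolve_user($node['author'], $users);
  unset($node['author']);
  return $node;
}

$existing = example_save_node(array('title' => 'A', 'author' => array('name' => 'alice')), $users);
$created = example_save_node(array('title' => 'B', 'author' => array('name' => 'bob')), $users);
```

The first call reuses the existing user id; the second creates a user as a side effect of saving the node.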

And here, for discussion and review, are some initial suggestions as to how we could implement this.

1. We bite the bullet and change all our data objects to arrays.
   (Or, hey, objects, if that's the consensus. I'm with what chx argued a while back though IIRC--arrays are more efficient, and since we're not using the object properties arrays make more sense.)
2. We change the way we currently load properties into objects to reflect their data structure.
   Currently we have two ways we load relational data into objects: directly, and as a nested array or object. For example, for a book node, the 'parent' property goes directly in the $node object. Some other properties, though, are nested in an array or object (e.g., organic group properties of a node). This nested approach is what's likely to work best for a data API.
   To enable schema-aware handling of a data item array, we need to reflect the relationships somehow in the data structure. The most obvious way is an approach analogous to the Forms API. That is, when loading data (or directly creating a data array), we mark keys in a way to identify them as foreign references. A node, then, would have direct properties from its primary table--the node table. All of its other properties would be in arrays associated with their respective tables of origin. E.g., a book with a parent would have, instead of a 'parent' key, a '@book' key, which is an array of properties held in the book table, in this case, the parent.
3. We implement a set of API methods (save, load, delete) that can handle all object types.
   A full data operation should be possible by calling these methods alone. We remove all our existing object-type-specific methods, e.g., node_load. However, we still pass the results through dataapi hooks, allowing modules to e.g. add additional properties.
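
As a rough illustration of the '@'-keyed structure, the sketch below splits a data array into one row per table. The example node, the 'weight' column, and the function name example_split_by_table() are assumptions made for illustration only.

```php
<?php
// Hypothetical loaded book node: direct keys come from the node table,
// '@'-prefixed keys hold the properties that live in a related table.
$node = array(
  'nid' => 12,
  'title' => 'Chapter 2',
  '@book' => array('parent' => 10, 'weight' => 0),
);

/**
 * Split a data array into per-table rows, using the '@' prefix as the marker.
 * $base_table names the primary table for the un-prefixed keys.
 */
function example_split_by_table(array $data, $base_table) {
  $rows = array($base_table => array());
  foreach ($data as $key => $value) {
    if (is_string($key) && strlen($key) > 1 && $key[0] === '@' && is_array($value)) {
      // Foreign reference: the key minus its prefix is the table name.
      $rows[substr($key, 1)] = $value;
    }
    else {
      $rows[$base_table][$key] = $value;
    }
  }
  return $rows;
}

$rows = example_split_by_table($node, 'node');
```

A generic save could then loop over $rows and write each table in turn, without knowing anything about books in particular.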

Barry has already shown, in his Schema module, how we can do schema-aware node loads.

A few weeks ago I started sketching in some of what this might look like at http://cvs.drupal.org/viewcvs/drupal/contributions/modules/dataapi/. Please be kind, this is rough code and I wrote it in ignorance of the excellent work Barry and frando were doing. It needs updating to the Schema API and considerable further work before it's anything more than notes. That said, it does suggest some of what we might be able to do with a schema-aware set of data API methods.

Does this general direction make sense? Is it enough to justify implementing drupal_save(), drupal_load(), and drupal_delete() methods now in http://drupal.org/node/113435, even though at present they only wrap and don't yet replace the various node_load(), comment_save(), and term_delete() methods?
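
For discussion, a wrapper of this kind might look roughly like the sketch below. It is not the code from the linked issue: every example_* name is hypothetical, and the type-specific loaders are stand-ins so the snippet runs outside Drupal.

```php
<?php
// Stand-ins for the existing type-specific loaders.
// The real node_load() returns an object; arrays are used here to follow
// the array-based direction suggested above.
function example_node_load($id) {
  return array('nid' => $id, 'title' => 'Node ' . $id);
}

function example_comment_load($id) {
  return array('cid' => $id, 'subject' => 'Comment ' . $id);
}

// Hypothetical hook implementation that adds a property after loading.
function example_dataapi_load_alter($type, array $data) {
  $data['loaded_type'] = $type;
  return $data;
}

/**
 * Generic load that only wraps the type-specific loaders for now.
 * Returns FALSE for types it does not know about.
 */
function example_data_load($type, $id) {
  $loaders = array(
    'node' => 'example_node_load',
    'comment' => 'example_comment_load',
  );
  if (!isset($loaders[$type])) {
    return FALSE;
  }
  $data = call_user_func($loaders[$type], $id);
  // Pass the result through the hook so modules can add properties.
  return example_dataapi_load_alter($type, $data);
}

$node = example_data_load('node', 5);
$missing = example_data_load('term', 5);
```

The point of wrapping first is that callers can switch to the generic entry point now, and the lookup table can later be replaced by schema-driven loading without changing them.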

Comments

Comment #1

Chris Johnson commented 30 May 2007 at 20:18

I wrote a database API for another more general application which pretty much met goals 1 through 5 about a year ago. Let me mention that I copied generously from best-practice examples, and that I used OO, PHP5 code to do it. The OO aspect is particularly noteworthy because almost all of the things which make this kind of API generic and easy to maintain are OO-based patterns and behaviors, e.g. inheritance and polymorphism.

It's a noble goal. When's code freeze? Doing it right is going to take a few weeks at a minimum, I'd bet.

Comment #2

Chris Johnson commented 30 May 2007 at 20:34

I just read through the example code.

Is the idea that to, say, load the node object, dataapi_load() will first load the entire row from the node table, and then loop through related tables, e.g. node_revisions, to get entire rows of associated data there to build the node object?

I have some qualms about performance and amount of code needed, but I'm not sure I fully understand the proposal well enough to make them into reasonable criticisms and suggestions.
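
The load sequence asked about here could be sketched roughly as follows. This is an illustration only, not the code in the dataapi module: the tables are in-memory arrays, and the $relations map and the example_* function names are assumptions.

```php
<?php
// Hypothetical in-memory tables, keyed by table name.
$tables = array(
  'node' => array(
    array('nid' => 1, 'vid' => 3, 'type' => 'book'),
  ),
  'node_revisions' => array(
    array('nid' => 1, 'vid' => 3, 'title' => 'Hello', 'body' => 'Text'),
  ),
);

// Hypothetical schema fragment: related tables per base table, and the
// column they share with it.
$relations = array('node' => array('node_revisions' => 'vid'));

// Return the first row whose $column equals $value, or NULL if none does.
function example_find_row(array $rows, $column, $value) {
  foreach ($rows as $row) {
    if (isset($row[$column]) && $row[$column] == $value) {
      return $row;
    }
  }
  return NULL;
}

/**
 * Load the base row, then attach one full row from each related table
 * under an '@table' key.
 */
function example_load($base, $key, $id, array $tables, array $relations) {
  $data = example_find_row($tables[$base], $key, $id);
  if ($data === NULL) {
    return FALSE;
  }
  $related_tables = isset($relations[$base]) ? $relations[$base] : array();
  foreach ($related_tables as $table => $column) {
    $related = example_find_row($tables[$table], $column, $data[$column]);
    if ($related !== NULL) {
      $data['@' . $table] = $related;
    }
  }
  return $data;
}

$node = example_load('node', 'nid', 1, $tables, $relations);
$none = example_load('node', 'nid', 99, $tables, $relations);
```

Written this way, each related table costs one extra lookup per load, which is where the performance question comes from; a real implementation might build a single joined query from the same relationship data instead.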

Comment #3

Nick Lewis commented 4 June 2007 at 01:03

Dries says we have 4 extra weeks. This sounds fun! A lot of what you wrote wasn't completely clear to me.

In regards to data transactions being completely independent of user input and forms, are you suggesting that the data model should be used to build a form array? For example:

// some property contains a reference to the schema id 
$array['#schema_id'] = 'node'; 


// this gets passed into a function that uses various schema attributes to make a best guess 
$form = drupal_build_form($array);


// keys could be assumed to be hidden 
$form['nid'] = array('#type' => 'hidden');
// non binary integers are assumed as text fields 
$form['created'] = array('#type' => 'textfield'); 
// binary integers are assumed to be radios 
$form['published'] = array('#type' => 'radio');
// some fields can be altered, or removed on a case by case basis.... 
unset($form['updated']);

Am I totally off base?
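
A runnable version of that best-guess mapping might look like the sketch below. The schema fragment and its 'type', 'size', and 'primary' keys are assumptions rather than the actual Schema API format, and example_guess_form() is a hypothetical name.

```php
<?php
// Hypothetical schema fragment describing three fields of one table.
$schema = array(
  'nid' => array('type' => 'int', 'primary' => TRUE),
  'created' => array('type' => 'int'),
  'published' => array('type' => 'int', 'size' => 'tiny'),
);

/**
 * Guess a form element type for each field from its schema attributes:
 * keys become hidden, tiny (binary) integers become radios, and
 * everything else becomes a text field.
 */
function example_guess_form(array $fields) {
  $form = array();
  foreach ($fields as $name => $field) {
    if (!empty($field['primary'])) {
      $type = 'hidden';
    }
    elseif (isset($field['size']) && $field['size'] === 'tiny') {
      $type = 'radio';
    }
    else {
      $type = 'textfield';
    }
    $form[$name] = array('#type' => $type);
  }
  return $form;
}

$form = example_guess_form($schema);
```

Individual elements could still be altered or unset afterwards, case by case.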

As far as choosing a type, I am totally in favor of using arrays, but ironically for a reason that's not supported yet by the schema api. A challenge in using nested array structures is that it's difficult to determine the difference between a property and a child element -- however, the form api, and existing functions like element_children, element_property would make this a breeze. Just to sketch out how I think this might look:

$node = array(
  '#schema' => array('#id' => 'node'),
  '#nid' => 1,
  // The lack of a pound sign is a flag that this is a separate entity
  // with a new set of behaviors and characteristics.
  'comments' => array(
    '#entity_attribute_X' => array(),
    // Comment id.
    '1' => array(
      '#cid' => 1,
      'user' => array(
        '#uid' => 1,
      ),
    ),
  ),
);

I'm not suggesting that the above is a good idea, but rather that separating out data elements from their properties is a good idea? Maybe? I'm just trying to imagine what this would look like from a low level....
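
Here is a small sketch of how the pound-sign convention separates properties from child entities. It is modeled loosely on what element_children() does for forms, but example_children() is a hypothetical stand-in and not the Form API function.

```php
<?php
$node = array(
  '#schema' => array('#id' => 'node'),
  '#nid' => 1,
  'comments' => array(
    '#entity_attribute_X' => array(),
    '1' => array('#cid' => 1),
  ),
);

/**
 * Return the keys of an element that do not start with '#'.
 * These are treated as child entities; '#' keys are properties.
 */
function example_children(array $element) {
  $children = array();
  foreach ($element as $key => $value) {
    // PHP turns numeric string keys like '1' into integers, so cast first.
    if (substr((string) $key, 0, 1) !== '#') {
      $children[] = $key;
    }
  }
  return $children;
}

$node_children = example_children($node);
$comment_ids = example_children($node['comments']);
```

Walking the structure recursively with a helper like this is enough to visit every entity without knowing its type in advance.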