This is a follow-up to http://drupal.org/node/63049 [Extend database abstraction layer, to include table creation].
Back then, I did a patch that provided a couple of functions for table creation and alteration.
Two days ago, I picked up the topic again, and reworked it completely.

The idea of this patch is to have an abstract definition of all data models (iow, tables). For now, these data models are used to create all tables. In the future, they might be useful for other purposes as well, who knows - data reflection can serve many purposes.

The implementation is so far very simple and only a first step: each data model (= table) gets a callback function that returns the definition array (similar to form definitions).
The table creation functions that create a 'CREATE TABLE ...' SQL statement out of these data models are database engine specific (i.e. they live in database.foo.inc). That makes it much easier to support more SQL dialects (as the dialects differ a lot when it comes to table creation): one only has to port three functions, and all table creation works on the new engine.

So here's an example of how the data models look in the current, early state of this patch:

function model_cache() {
  $model['fields'] = array(
    'cid'     => array('type' => 'string', 'length' => 255, 'default' => '', 'not null' => true),
    'data'    => array('type' => 'blob', 'size' => 'long'),
    'expire'  => array('type' => 'number', 'default' => '0', 'not null' => true),
    'created' => array('type' => 'number', 'default' => '0', 'not null' => true),
    'headers' => array('type' => 'text')
  );
  $model['primary key'] = array('cid');
  $model['indexes'] = array(
    'expire' => array('expire')
  );

  return $model;
}
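To make the idea concrete, here is a minimal sketch of how a model like the one above could be turned into a CREATE TABLE statement. The helper names (_field_sql, create_table_sql) and the type mapping are hypothetical, not the patch's actual db_create_table implementation:

```php
<?php
// Sketch only: hypothetical helpers showing how a generic model
// maps onto MySQL DDL. Not the patch's actual db_create_table().
$model = array(
  'fields' => array(
    'cid'    => array('type' => 'string', 'length' => 255, 'default' => '', 'not null' => true),
    'expire' => array('type' => 'number', 'default' => '0', 'not null' => true),
  ),
  'primary key' => array('cid'),
);

function _field_sql($name, $spec) {
  // Translate the generic type into a MySQL column type.
  $map = array(
    'string' => 'VARCHAR(' . (isset($spec['length']) ? $spec['length'] : 255) . ')',
    'number' => 'INT',
    'text'   => 'TEXT',
  );
  $sql = $name . ' ' . $map[$spec['type']];
  if (!empty($spec['not null'])) {
    $sql .= ' NOT NULL';
  }
  if (isset($spec['default'])) {
    $sql .= " DEFAULT '" . $spec['default'] . "'";
  }
  return $sql;
}

function create_table_sql($name, $model) {
  $cols = array();
  foreach ($model['fields'] as $field => $spec) {
    $cols[] = _field_sql($field, $spec);
  }
  if (!empty($model['primary key'])) {
    $cols[] = 'PRIMARY KEY (' . implode(', ', $model['primary key']) . ')';
  }
  return 'CREATE TABLE ' . $name . ' (' . implode(', ', $cols) . ')';
}

print create_table_sql('cache', $model);
// → CREATE TABLE cache (cid VARCHAR(255) NOT NULL DEFAULT '', expire INT NOT NULL DEFAULT '0', PRIMARY KEY (cid))
```

Since only this SQL-generation step is dialect-specific, porting to another engine means swapping the type map and quoting rules, which is the point of keeping these functions in database.foo.inc.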

Have a look at the code for more. I also added some initial documentation.

So far, the attached patch does the following:

  • Adds a data model callback function (model_modelid) for all required tables (everything that's defined in system.install). I started off by creating one or two models manually, and then decided to write a converter that parses the db_query("CREATE TABLE ...") statements in system.module and creates PHP code for the callback functions.
    By now (5 hours later) it finally creates valid, complete, full-featured and nice-looking data model functions for all required tables!
  • Adds a generic drupal_process_model($model_id) function in database.inc that calls the table creation function.
  • Adds the table creation function for mysql. This function (db_create_table) takes a data model as argument and creates an SQL CREATE TABLE ... statement out of it. Two helper functions are added to keep things readable and generic.
  • Removes all db_query("CREATE TABLE ...") statements from system.install and calls drupal_process_model for each $model_id instead.

All of the above works. Drupal installs successfully without any CREATE TABLE statements in system.install.

There are still quite a few well-known TODOs:

  • Add data models for the remaining modules. This should be fairly easy with the converter and maybe some tweaking.
  • So far, I've put the data models into system.install, but I'm not at all satisfied with this. I'm not sure where else to put them, though. system.install is a very bad place for them, IMO, as this way they're not available outside of the installation process. I'm sure there will soon be situations where either core or contrib wants to access the data models at runtime, to enjoy full data reflection. So - they need a better place.
    I've got a few ideas floating around, e.g. a name.model file for each model in its module's directory, loaded dynamically when needed, or a modulename.models file for each module. I'd like to get more opinions on that, though.
  • The syntax of the models and the generic data types: Too many options? Too few? Or are there other possible syntax improvements?
  • Postgres support. With this patch applied, drupal only installs on mysql, so don't try on postgres yet ;). Would be great to have some support here from someone who's really familiar with postgres, but it should be fairly easy.
  • Fix the very few remaining schema differences (between the schema created from the data models in this patch and the regular, current schema as specified in HEAD's system.install), either manually or by improving the converter
  • Probably a lot more that will come up in this thread soon

Later improvements (on which this patch does not depend immediately) include adding functions for table alteration, to get the full benefit of never having to declare data definitions in SQL dialects anymore. I have working implementations for nearly all alterations for mysql, but that can come later; I wanted to keep this issue focused.

Some reviews would be great - especially concerning the overall implementation.

Comments

Frando’s picture

Attachment: 73.62 KB (new)

Hm. The patch ;)

I also just committed my converter to my sandbox for those who're interested or want to play around with it:
http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/frando/conver...

merlinofchaos’s picture

subscribing

sun’s picture

Subscribing. This is very interesting, in particular: providing table definitions and information about primary keys to other core modules is the badly needed basis for internationalization support in Drupal.

moshe weitzman’s picture

Subscribe. Mirrors some great work by Barry at http://jaspan.com/schema-project-database-abstraction-reflection-and-mig.... I asked him to post here.

yched’s picture

Looks nice.
I know that dopry and KarenS started something similar during the hackfest after last OSCMS.
The approach was slightly different (no model_* callbacks, based on SQL standards for column attributes, and they had working table alteration).
I don't really have any personal preference, I just hope efforts will get merged instead of scattered, because this is definitely something that's in the air and that would be cool to have in D6.

One of the things that come to my mind after quickly browsing through the patch is that we may want to abstract the table name from the model_* callback name.
Could be useful for:
- cache tables
cache_menu, cache_page, etc., or contrib cache_views, cache_content, etc., all share the same model; it would remove some code duplication (and ensure all cache tables share the same structure)
- CCK tables, which are generated dynamically (and not at install time)
CCK could use the table alteration functions too, but it's probably OK if these come only later :-)
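The reuse yched describes could be sketched like this - one shared definition callback that every cache-style table points at, so their structures can't drift apart (the cache_model name and the table list are hypothetical):

```php
<?php
// Sketch: one shared, hypothetical definition callback for every
// cache-style table (core or contrib).
function cache_model() {
  return array(
    'fields' => array(
      'cid'  => array('type' => 'string', 'length' => 255, 'default' => '', 'not null' => true),
      'data' => array('type' => 'blob', 'size' => 'long'),
    ),
    'primary key' => array('cid'),
  );
}

// All cache tables resolve to the same structure.
$models = array();
foreach (array('cache', 'cache_page', 'cache_views') as $table) {
  $models[$table] = cache_model();
}
```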

yched’s picture

edit : there is no cache_menu table anymore in D6, but, well, you get the point

bjaspan’s picture

Frando, you and I are on the same wavelength and, sadly, have duplicated each other's code. Please see the Schema module, now available in CVS contrib, as well as the post Moshe referenced: http://jaspan.com/schema-project-database-abstraction-reflection-and-mig...

My approach is to keep this not dependent on core patches if possible. That way, it can develop and progress on its own and, when ready, can easily be dropped into core.

The big difference I can see between what you wrote in your post (I haven't looked at your patch yet) and my code is that I use database reflection via the INFORMATION_SCHEMA table instead of parsing the CREATE TABLE statements syntactically. (Ironically, I wrote a CREATE TABLE parser for exactly this purpose in June 2006 in Perl but feel that it is the wrong approach.) Otherwise, I think we're doing the same thing and clearly should combine forces.

I too will have more to say later.

Frando’s picture

Attachment: 116.93 KB (new)

Barry, I read your blog post some time ago, but unfortunately didn't realize that there is some real code by now.
But let's work together from now on!
I'll still try to work towards core, though, as that's where a data model abstraction engine belongs, IMO.

So, some progress today:

  • I tweaked my converter which creates the php function syntax for the data models (out of the CREATE TABLE statements in the module.install files). It now creates working models for all of core! There are still some minor issues, but nothing serious anymore.
  • Each data model now lives in its own file, modelname.model. Previously, system.install contained most of core's table declarations - no more! Each model file lives in the directory of the module to which it belongs. These files are only loaded on-demand.
  • A hook_models. Similar to the new hook_themes, this is where modules declare which models they provide. Each implementation should return an array with elements of the form 'modelname' => details.
    In the details array one can specify the file that has to be included, the callback function that contains the definition and also 'callback arguments', which are passed to the callback function - this hopefully makes it possible to reuse model definitions as proposed by yched!
    For now, I've let my converter create these hook_models for all of core automatically, without any special details (they are set to some sane default values by drupal_get_model_info).
  • drupal_get_model returns a model definition array.
  • The good news: After a very little install.php tweaking (namely, loading the .module files for all modules that declare models that need to be created at installation time) and updating install.php (iow, removing all CREATE TABLE statements and adding calls to drupal_get_model and db_create_table), drupal installs on mysql without any problems!

I'm not yet totally sure whether it's best to place each model in its own file, but I like the idea more and more. It's certainly a better place than the .install files, and it makes it possible to load them on demand. I feared that problems might arise during the installation process, but everything just worked like a charm ;-)

The code is still quite rough, but there's some very basic documentation there now.

adrian’s picture

I like everything apart from the name "model" for tables.

A data model != a table.

Something like a 'node' has a data model, and that data model might consist of several different fields stored in several different tables.

bjaspan’s picture

Update: The Schema module (in CVS contrib) now generates schema declarations for all CCK content type and field tables; see modules/schema_content.inc.

Schema can then use its database inspection capability to validate the database structure against the defined schema. It discovered that, as of Drupal 5 at least, CCK does not properly maintain the NULL/NOT NULL and default properties of columns. (I've temporarily hacked schema_content.inc to generate a schema declaration based on what CCK does rather than what the field modules say to do so the admin/build/schema/report page does not show errors while I develop the code.)

Next up: Using Schema to load arbitrary nodes, including CCK nodes with many fields, in a single SQL query.

It is obvious that Frando and I need to agree on a common schema declaration structure and merge our code. Our underlying concepts and approaches are similar enough that it should not be difficult.

profix898’s picture

subscribing

bjaspan’s picture

Update: Schema module now supports loading multiple node-related tables in a single query. Code is in CVS. It works like this:

Each module can define the joins for its table:

function mymodule_schema() {
  $schema['mytable']['cols'] = array(
    'nid' => array('type' => 'int', ..., 'join' => array('system', 'node', 'nid'), 'join-type' => 'one'),
  );
  return $schema;
}

This means mymodule declares a single table called mytable with a single column named 'nid' that joins to node.nid defined by system.module. It could declare other columns and indices, of course.

When a node is loaded, Schema's hook_nodeapi is called. It looks through the total schema to find all joins to node (because this is nodeapi, we know we're loading from table node) and builds up a SELECT/JOIN list. Suppose the schema contains mymodule as above and another module with a table m2 that joins to node on vid and has columns 'foo' and 'bar'. schema_nodeapi builds this query:

SELECT mytable.nid as mytable_nid, m2.vid as m2_vid, m2.foo as m2_foo, m2.bar as m2_bar
FROM node n 
LEFT JOIN mytable ON mytable.nid = n.nid
LEFT JOIN m2 ON m2.vid = n.vid

It then saves the results of this query in $node->schema_data. schema_data['mytable'] contains the array of matching rows from the mytable table and schema_data['m2'] contains the array of matching rows from the m2 table.

Now, mymodule's mymodule_nodeapi does not need to make its own SELECT query, and neither does m2_nodeapi. They can both just use $node->schema_data which is waiting for them. So, we've turned three queries into one.
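The join-assembly step described above can be sketched as follows. The data layout is assumed from the post ('cols' with a 'join' triple of module, table, column); this is not schema.module's actual code:

```php
<?php
// Sketch of assembling the one-query node load: every column that
// declares a join to node contributes a LEFT JOIN, and all columns
// land in the SELECT list under a table-prefixed alias.
$schema = array(
  'mytable' => array('cols' => array(
    'nid' => array('join' => array('system', 'node', 'nid')),
  )),
  'm2' => array('cols' => array(
    'vid' => array('join' => array('system', 'node', 'vid')),
    'foo' => array(),
    'bar' => array(),
  )),
);

$select = array();
$joins = array();
foreach ($schema as $table => $def) {
  foreach ($def['cols'] as $col => $spec) {
    $select[] = "$table.$col AS {$table}_$col";
    if (isset($spec['join']) && $spec['join'][1] == 'node') {
      $joins[] = "LEFT JOIN $table ON $table.$col = n." . $spec['join'][2];
    }
  }
}
$sql = 'SELECT ' . implode(', ', $select) . ' FROM node n ' . implode(' ', $joins);
```

Running this yields the same shape of query as the example above: one SELECT over node with a LEFT JOIN per schema table.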

Because Schema supports CCK, this works for all CCK content types and fields; they get loaded in one query (* see below about 1-to-many joins). This means we can eliminate CCK's field cache table, cutting the data storage requirements for CCK nodes in half without a performance penalty.

Currently, this happens in schema_nodeapi() but it obviously could be moved directly into node_load() to eliminate another query.

One obvious question is about 1-to-many joins, for example a CCK field that can have multiple values per node. Schema module can currently handle zero or one of those per node load. The schema_data['tablename'] array gets an element for each row returned from that table. Supporting multiple multiple-value joins per node load in one query will require some more data massaging that I haven't gotten to yet.

One potentially substantial optimization involves node types. For any given node, only a subset of tables that join to node.nid/vid are relevant: content_type_page is only relevant for page nodes, and the book table is only relevant for those node types that book.module is configured to book-enable. Each module could encode in its schema the node types the join is relevant for. For example, if book.module is only enabled for 'page' nodes, it could say:

function book_schema() {
  $schema['book'] = array(
     'cols' => array(..., 'join' => array('system', 'node', 'nid'), 'node-type' => array('page'))
  );
  return $schema;
}

Now, imagine that we keep a cache mapping nid => node type (note that a node, once created, never changes its type). Now, when schema_nodeapi() goes to assemble its one-query-to-load-them-all, it can omit all the LEFT JOINs for tables that do not care about this node type. Ironically, because right now this operation is in schema_nodeapi() and NOT in node_load(), the node type is already available so I don't even need the cache. But for this to be in node_load() a cache is necessary. Since node types are write-once, this could be done as simply as having cron output a file defining an array containing the mapping, so no query is even needed to get the nid => node type cache, just a file load.
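The cron-generated map file could look like the following sketch (file name and helper names are hypothetical). Loading it is a plain include, so no query is needed when assembling the per-type join list:

```php
<?php
// Sketch of the nid => node-type map written by cron and read at
// node-load time. Hypothetical helpers, not schema.module code.
function write_type_map(array $map, $file) {
  // Emit a PHP file that defines the mapping array.
  file_put_contents($file, '<?php $node_types = ' . var_export($map, true) . ';');
}

function read_type_map($file) {
  include $file;   // defines $node_types without touching the database
  return $node_types;
}

$file = sys_get_temp_dir() . '/node_type_map.inc';
write_type_map(array(1 => 'page', 2 => 'story'), $file);
$map = read_type_map($file);
```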

I'm quite sure there are bugs to be worked out, improvements to make, etc.

Whew! Long message. Feedback?

bjaspan’s picture

Oh, by the way, Schema module currently supports Drupal 5, not HEAD.

bjaspan’s picture

FYI, I just created a group at g.d.o for discussion of these topics: http://groups.drupal.org/database-schema-api. It may still be in moderation, though.

bjaspan’s picture

I've begun a thread for deciding on the data structure for table definitions at http://groups.drupal.org/node/3694. I suggest dropping the way Schema does it and using a slightly-modified version of the way Frando is doing it.

yched’s picture

Any news on the core aspect of this ?

While Barry's work with contrib "Schema" looks really promising, the good thing about Frando's effort is that it's a limited, well defined functionality, targeting core inclusion _right now_. Lots of other interesting features (schema inspection...) can be considered, but obviously they'll hit core, well, later at best.

Several discussions (issue queue, dev list) have shown a consensus about a DDL for core (including Dries, who was not really hot about this a few months back). It is so clearly wanted that at least 3 separate attempts have started since D6 branch was opened. It would be a shame if this got lost just because too many people wanted it.

Code freeze is in a month. Frando's patch is the most advanced effort. I'd say shaping it ready for core inclusion by then seems doable. What do you think ?

chx’s picture

+1 on getting this into core because this one is pretty close IMO

Frando’s picture

Thanks for the feedback! I'll submit a much updated and improved patch later today or tomorrow (including parts of Barry Jaspan's work, with whom I discussed some more parts of this whole thing).
Just to name a few changes that are to come: 'model' will be 'schema', there'll be functions consistent with db_create_table for table manipulation (db_add_column, db_drop_column, db_add_index, ...), and reusing schema definitions for several tables will actually work (e.g. for the several cache_* tables).

yched’s picture

great news !

One question: I think I remember your patch stays rather close to MySQL idioms for the description of the columns, meaning the database.*.inc files for other db types have to translate from MySQL.
One interesting thing in dopry's DDL effort at latest OSCMS hackfest was that it relied on standard SQL syntax, and let db specific .inc files do the translation to their specific syntax.
I see some discussion has been going on in the DB Schema API Group (thanks Barry for making that happen :-) ). Has anything been agreed about that ?

bjaspan’s picture

I am all for getting a DDL abstraction into core as soon as possible. I have been lobbying for it for a year. I implemented the code in March. I spent all of OSCMS talking to everyone who would listen about it. If the consensus is that this will go into D6, great!

I am concerned that merging into core right now is premature. We do not yet know exactly how we want the API to look and, by putting it into core, we are effectively preventing any experimentation on it by anyone but core developers. If CCK had started this way, it would have taken much longer to get to its current level of functionality and maturity.

However, if the community wants it to go in now, let's do it. We can always fix it later.

A few comments:

- Re: comment #16, I have to dispute yched's comment that Frando's is "the most advanced effort." The Schema module is ready for use. It supports mysql and pgsql, _install and _uninstall hooks, all CCK types and fields, and I was planning to announce it for Drupal 5 module developers to use starting this week (unfortunately, by moving Schema's functionality into database .inc files now, that will be impossible; everyone will just have to wait for D6).

Frando and I have talked and work together on several aspects of the design. His patch later today will include substantial functionality copied from Schema, which of course I welcome. I have made a concerted effort to share credit when appropriate. I hope others will do the same.

- Re: comment #16, yched describes schema inspection as an "interesting feature" but one that should not yet go into core. Frando and I agree it really is not optional if we want a decent system and Frando says his patch later today will include it, copied from Schema, which is good.

- Re: comment #19, the Schema module already supports MySQL and PostgresQL equally. It defines a fixed set of types and the .inc files translate into the appropriate SQL, just as yched suggests. The set of types are the combined work of myself, Frando, and a few other contributors to the Database Schema group. I don't know if Frando's patch will copy the PostgresQL SQL-generation and database inspection support from Schema as well; if not, I can submit it separately.

- Because I wrote the database inspection and comparison code, and then implemented PostgresQL support, I discovered that the MySQL and PostgresQL schemas that Drupal is currently using, implemented by the existing CREATE TABLE/INDEX statements, are not the same (this is one of many reasons the inspection functionality is not optional). We are going to have to resolve those differences via a core schema update. pgsql support currently uses auto-incrementing columns; mysql does not. Also, the table indexes have completely different names in the two databases, so we'll have to rename all the indexes in one or the other (probably in mysql).

On with the work! :-)

bjaspan’s picture

Now that database back-end abstraction is coming, I suggest we convert each database .inc file into a module. Please see http://drupal.org/node/140263 for details about why and how.

bjaspan’s picture

Here's another thought. The Schema module currently does several things:

  1. defines a schema data structure abstraction
  2. defines a hook modules can use to export their schema
  3. converts the structures into CREATE TABLE statements
  4. inspects the database to generate a schema data structure representing it
  5. compares the schema returned by all modules against the one generated by database inspection, reporting any missing or extra tables or indexes as well as all differences between column and index specifications
  6. pretty-prints the data structure so module developers can just cut-and-paste the structure for their existing tables into their hook_schema implementations (making adoption much more likely)
  7. using join relationships, knows how to generate and execute queries to load entire entities at once (this is not done yet)
  8. exports schema definitions for CCK (this is just temporary and clearly should be part of the CCK module)

Frando's code and mine overlap for #1-4. I'd say that database.inc should know how to do #1, #2, and #3, and the database.*.inc files should know how to do #4. However, there is no real need to have #5-7 as part of the database subsystem itself; they don't actually involve accessing the database, they just operate at the schema level. So they should logically remain in a different module. Schema can remain in contrib for now so I can continue developing it and, if/when it becomes necessary, can move into core as a module.

Thoughts?

yched’s picture

Barry: truly sorry if my post in #16 sounded like I was disregarding your work on the Schema module - I am impressed by it, really.

My primary goal was to try to revive (if needed - I felt it was, but I might have been wrong) the D6 inclusion of this, and I do feel a (solid) core DDL patch for D6 is better than a richer DDL contrib module for D5. It's only my opinion, though, and I'm no core committer of course, not even one of the 'high-profile' core guys.

I think the #1-#8 feature set you describe makes perfect sense and provides a clear view of the situation. #5-#7 would make a great D6 contrib Schema for experimenting with killer D7 features. I fear Schema would remain experimental if #1-#4 do not live in core.

Anyway, many thanks to both of you for working on this together. This should definitely rock.

Frando’s picture

Attachment: 106.2 KB (new)

So,
new patch attached.
I'll try to briefly describe the provided API and the workflow - there are also quite a few comments in the code now, so having a look at the patch directly would be a great idea.

All modules that have tables implement hook_tables, which basically just returns an array of the form 'tablename' => 'schemaname'. This already allows simple schema reuse, which is implemented for the cache_* tables. More options can be specified, e.g. to allow more complex schema inheritance/reuse (*1).

Each schema is defined in a callback function that lives in a modulename.schemas file by default (the .schemas files are included on demand; that's another reason why there's a hook_tables).

The schema definition syntax is basically a merger of Barry's version and mine, with ideas taken from dopry's DDL abstraction work. There's a comment on top of drupal_get_schema() that explains the syntax.

A major addition towards the last revision of my patch is that the 'type' key in the field definition now represents a 'drupal field type' instead of directly specifying a datatype. These field types are nothing more but a collection of sensible default values that all can be overridden in the schema definition. This reduces the complexity of the schema definitions a lot, and also allows us to declare more complex field types in the future, e.g. a 'serial' type. For now, there's a field type for each supported data type (int, varchar, float, text, blob, datetime)
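The 'field type as a bundle of defaults' idea could be sketched like this (function names and the concrete default values are hypothetical): the explicit keys in a field definition win over the type's defaults.

```php
<?php
// Sketch only: a 'drupal field type' is just a set of sensible
// defaults that the per-field definition can override.
function field_type_defaults() {
  return array(
    'int'     => array('size' => 'normal', 'not null' => true, 'default' => 0),
    'varchar' => array('length' => 255, 'not null' => true, 'default' => ''),
    'blob'    => array('size' => 'normal'),
  );
}

function expand_field(array $spec) {
  $defaults = field_type_defaults();
  // PHP's array union ('+') keeps the left-hand (explicit) keys.
  return $spec + $defaults[$spec['type']];
}

$field = expand_field(array('type' => 'varchar', 'length' => 64));
// length 64 is kept (overridden), 'not null' and 'default' are inherited
```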

Furthermore, there are now table manipulation functions: db_add_field, db_drop_field, db_create_index, db_drop_index, db_set_default_value

(*1) You can have an array instead of 'schemaname', in which you can manually specify the callback function, callback arguments to pass on and the file in which the callback function lives.
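A hook_tables implementation following the description above might look like this sketch. The key names ('callback', 'callback arguments', 'file') are taken from the post, but the exact names and the schema_cache callback are assumptions, not necessarily the final API:

```php
<?php
// Sketch of a hook_tables implementation (hypothetical module).
function mymodule_tables() {
  return array(
    // Simple case: the table uses a schema of the same name.
    'mymodule_data' => 'mymodule_data',
    // Reuse case (*1): share the cache schema, passing the table
    // name along as a callback argument.
    'cache_mymodule' => array(
      'callback' => 'schema_cache',
      'callback arguments' => array('cache_mymodule'),
      'file' => 'system.schemas',
    ),
  );
}
```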

some known TODOs (not exhaustive, of course):

  • re-check the schema definition syntax. More properties, like 'collate', 'charset', 'foreign key', or even relationships, can still be added easily later on, of course.
  • Postgres support. There's already working table creation code in Barry's schema.module.
  • And the hard one: Module installation. The problem is that the to-be-installed module's hook_tables lives in a not-yet-installed .module file and thus doesn't get invoked. (Note: This is only about module installation - the initial Drupal installation works perfectly, the required modules are just added to the module list in install.php.) I'm not sure what's the best way to solve this: either have a special drupal_get_schema_install() function, or load the to-be-installed module and add it to the module list before creating the tables, or ... Opinions?

@#19: In the new patch, things are a little more generic. If you point me to some concrete examples in the definition syntax that are more MySQL-ish than Generic-SQL-ish, I'll gladly change them!

@#22: Barry, that sounds like a perfect plan (actually, that's exactly what I was just planning to propose). The only little thing I'm unsure about is the table inspection code. This is a great piece of code - I created all the .schemas files with it and some PHP-code-writing stuff, and it will be super-useful for module development and updating - but outside of development tasks, when (for which kinds of tasks) do you think it will be needed or useful?
So I'm currently unsure whether it's better to leave it in a schema/coder/devel module, or to put it into the patch.

I just committed the code that I used to generate the .schemas files and the hook_tables implementations to my sandbox, for those who want to play around. This is basically a slightly updated version of Barry's schema inspector. ( http://cvs.drupal.org/viewcvs/drupal/contributions/sandbox/frando/inspec... )

So - I'd say that the code in this last patch is already quite solid (in terms of what I was planning), so reviews would be great here, as would be ideas concerning the TODOs.

moshe weitzman’s picture

@Frando - I think .module files should be loaded during module install. Why not? Their functions could be useful in an .install file ... I admit to not really understanding the issue yet, but I'll throw a quick opinion in anyway :)

Great work, Frando and Barry. Your proposed plan is sounding excellent.

bjaspan’s picture

Frando,

Thanks for your patch. Here are my initial comments after a brief scan:

- I'm not currently aware of a case where we will need *some* of the schema definitions for core but not all of them. We'll either be creating all the tables, or comparing all the tables, or not. Therefore, I see no real advantage to breaking up all the core definitions into separate files. It just causes lots of little files, little functions, and extra changes to the code (e.g. install.php has more modules to load manually). Just put the whole core schema in one place, in one function, in one file. And, incidentally, every module's hook_schema (or hook_tables, whatever) should just return an array containing all of its tables instead of a separate function for each table.

- You have defined "type" and "datatype" in the schema structure. Why both? I do not see a need and think it is unnecessary complexity. See my next item.

- Your drupal_field_types() is an excellent idea in that it simplifies a module author's job of defining their schema and I want the schema stuff to be as simple to use as possible. I liked your idea so much I added it to Schema module---and then my inspection/comparison code showed me its flaw. Not all int, float, varchar, and text columns have a default value. In core, in fact, about half do not (IIRC). If you provide a default 'default' value, then for columns that do not want a default value, you have to provide a way to specify "no default." Sure, a column spec can include 'no default' => true, but now you are just replacing having to put 'default' into the colspec with having to put 'no default' in, so you haven't gained anything. And having to add 'default' is better than 'no default' because everyone who has ever written CREATE TABLE already expects to have to say 'default' but does not expect to have to say 'no default.'

If you remove the 'default' values from drupal_field_types, what you have left is datatype, size, and not null. 'size' => 'normal' is the default for all of them and is unnecessary anyway. 'datatype' is always just the same value as 'type' and so is redundant. So we are left with 'not null' => true which you specify for all types except 'blob' and 'datetime.' Why do you think those two types should default to 'not null' => false but the others should default to 'not null' => true? I think that is confusing. Again, everyone who has ever written CREATE TABLE expects "can be null" as the default. If we change that, we'll just confuse people, especially if we only change it for some types but not others.

So we've eliminated from the drupal_field_types the values for datatype, size, not null, and default, which leaves us with... nothing. Again, I *loved* this idea, but in the end it turns out to provide no benefit.

- You suggest that your type/datatype abstraction will help with a 'serial' type. Schema already implements a 'serial' type for mysql and postgres, and your patch will also have to from day #1, because core support for postgres requires it.

- I'd like to hear more justification for the separation of hook_tables from hook_tables_info. Providing callback functions to hook_tables_info seems to me again like unnecessary complexity. If modules want to reuse schema definitions they can do it on their own without the help of callback arguments; see what I did in modules/schema/modules/schema_content.inc (the CCK support) with $base. So you are requiring every module to implement two functions, hook_tables and hook_tables_info, instead of one and I do not see the advantage. Maybe I'm missing it; enlighten me.

- In _db_create_field_sql, you have:

$query .= " DEFAULT '" . $details['default'] . "' ";

This is incorrect because you always wrap $details['default'] in single quotes. The default value should match the type of the column. $details['default'] needs to distinguish between unset, '', 0, and '0' because they are all very different. Check how Schema does this.
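A type-aware default clause along these lines could look like the following sketch (the _default_sql helper is hypothetical, not Schema's actual code): the absence of a 'default' key emits no DEFAULT clause at all, and numeric defaults stay unquoted.

```php
<?php
// Sketch: build the DEFAULT clause depending on the PHP type of the
// value, distinguishing unset from '', 0 and '0'.
function _default_sql(array $details) {
  if (!array_key_exists('default', $details)) {
    return '';                                   // unset: no DEFAULT clause
  }
  $default = $details['default'];
  if (is_int($default) || is_float($default)) {
    return ' DEFAULT ' . $default;               // 0 → DEFAULT 0
  }
  return " DEFAULT '" . $default . "'";          // '0' → DEFAULT '0', '' → DEFAULT ''
}
```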

- In schema_install, you removed the creation of the _unsigned domains for pgsql. These should probably be owned by the database.pgsql.inc file, but they have to go somewhere.

Regarding your TODOs:

- Yes, PostgresQL support is in Schema. I do not currently want to rewrite it until I agree more with the design of this patch. You are welcome to.

- Regarding module installation, just getting the module loaded is pretty easy---just use drupal_load. For example, from schema.install:

function schema_install() {
  drupal_load('module', 'schema'); // load schema.module
  $my_schema = module_invoke('schema', 'schema'); // call schema.module's hook_schema
  schema_create_schema($my_schema); // create my schema
}

The same works for uninstall.

- You asked what the database inspection code is for outside of development. The answer is the yet-to-be-written db_update_schema() (or db_update_tables() or whatever) function which a module invokes from mymodule_update_n() in its .install file. db_update_schema() compares a module's schema to the actual database and adds or removes tables, columns, and indexes as necessary. So when a module author wants to add a new column, she just adds it to her schema array, calls db_update_schema($my_schema), and that's it. It won't be useful for complex cases, which module authors will continue to handle manually, but for the vast majority of cases, update functions will become as simple and clean as install functions will be.
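To make the comparison step concrete, here is a minimal sketch of the kind of diff such a function would compute (the function name and arrays are illustrative, not the real implementation): walk the declared fields and report the columns the live database is missing, which would then become ALTER TABLE ... ADD statements.

```php
// Illustrative sketch of a db_update_schema()-style comparison:
// diff a module's declared fields against the live ones and report
// the columns that would need to be added.
function schema_missing_fields(array $declared, array $live) {
  $missing = array();
  foreach ($declared as $table => $spec) {
    $live_fields = isset($live[$table]['fields']) ? $live[$table]['fields'] : array();
    foreach ($spec['fields'] as $name => $field) {
      if (!isset($live_fields[$name])) {
        $missing[$table][] = $name;
      }
    }
  }
  return $missing;
}
```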

So, the database inspection code can remain in Schema module, but then practically every module with a .install file will end up depending on Schema.

yched’s picture

Thanks Frando for the patch. A few comments :
- I agree with bjaspan that the layout of core table definitions seems like it could be simplified.
- the current hook_tables seem to be rather heavy "just" to make schema duplication possible.
Couldn't the hook_tables implementations return directly the array of table definitions (something like Views module's hook_views_tables):

'node' => array(
  'fields' => array(
    'nid' => ...
    'vid' => ...
  ),
  'indexes' => array(
    ...
  ),
),
'cache' => cache_schema(),
'cache_page' => cache_schema(),

Then, the tables that could use schema duplication (cache_*, cck data tables) would defer the actual definition to a helper function, available for other schemas to use :

function cache_schema() {
  return array('fields' =>... 'indexes' => ...);
}

- I do not see anything to support two-parameter data types (FLOAT(M,D)). I think the SQL standard calls these 'scale' and 'precision' properties (but maybe I'm missing it?)
- I was surprised to see that your patch does not touch database.inc; all the new db API functions are defined in database.mysql.inc. I assume this is temporary?
- bjaspan: I still think that we have a win if D6 freezes before the db inspection / automated schema update part can be pushed in. It will just make Schema.module a great tool for developers during the D6 lifecycle. Of course, we have to take extra care that the proposed D6 DDL keeps the door open.

I should be able to provide more feedback in the next few days...

bjaspan’s picture

Some comments on yched's comments.

We agree about core table layout, and how hook_tables should just return an array of table definitions. Great.

Regarding data types like float(m,d), yes, SQL and all databases define types that our table structure does not support. However, Drupal does not currently use them. In the interest of simplicity, I think we should only put in what we need for now and add more as it comes up. For example, in the schema group at g.d.o, KarenS mentioned the need for a datetime type, so we have added it.

Here's an idea: We can allow database-engine specific overrides in the data structure. For example, if you want to use a float column but on MySQL know that you want a specific scale/precision, you can write:

'mycolumn' => array('type' => 'float', 'mysql_type' => 'float(M,D)')

MySQL will use float(M,D) exactly and other engines will translate 'float' as per their type map.
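A sketch of how an engine could resolve a column's SQL type under this idea (the function name and map contents are hypothetical, not the actual database.mysql.inc code): check for an engine-specific override key first, then fall back to the generic type map.

```php
// Sketch: resolve a field spec to a native SQL type, letting an
// '<engine>_type' key override the generic type map.
function db_resolve_type(array $spec, $engine, array $map) {
  if (isset($spec[$engine . '_type'])) {
    return $spec[$engine . '_type'];
  }
  $size = isset($spec['size']) ? $spec['size'] : 'normal';
  return $map[$spec['type'] . ':' . $size];
}
```

So the mysql engine would honor 'mysql_type' exactly, while pgsql, seeing no 'pgsql_type' key, would translate plain 'float' through its own map.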

Regarding db inspection/schema update, I'm happy to have it remain in the Schema module. If modules need to depend on Schema to use it, that's fine.

bjaspan’s picture

Two more requirements:

- Add hook_schema_alter(). When core collects all the results of hook_tables together, it should call a hook_schema_alter() to allow other modules to modify it. This can be just one call operating on the whole combined schema; no need to call it separately for each table. This will be crucial for some things the Schema module wants to do. It will also allow some really useful functionality that I discussed on the developer list a while back.

- The functions that generate and return SQL to create or alter tables/indexes need to return an array of strings, not just a string. Creating a table is not always a one-query operation; in pgsql, for example, it requires separate CREATE TABLE and CREATE INDEX commands. gabriele.turchi@l39a.com, who is creating the Oracle engine for Schema (which will become database.oracle.inc now that we are aiming at core), says that Oracle only accepts one SQL statement per db_query. So we need an array of commands to iterate over.
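In other words, the per-engine DDL generator would look something like this sketch (the function name is hypothetical and the actual field/index SQL is elided): one string per statement, which the caller then runs through db_query() one at a time.

```php
// Sketch of a pgsql-style generator returning one SQL string per
// statement: CREATE TABLE first, then one CREATE INDEX per index.
function db_create_table_sql_array($name, array $table) {
  $statements = array();
  $statements[] = 'CREATE TABLE {' . $name . '} (...)';
  $indexes = isset($table['indexes']) ? $table['indexes'] : array();
  foreach (array_keys($indexes) as $index) {
    $statements[] = 'CREATE INDEX ' . $name . '_' . $index . '_idx ON {' . $name . '} (...)';
  }
  return $statements;
}
```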

bjaspan’s picture

Frando and I just talked via IM and agreed on a whole bunch of the topics raised in comments. One we disagree on is whether the schema definitions for core should be in per-module .schema files or one system.schema file. In the big picture, it obviously doesn't matter much. However:

- My point is that putting them together in one system.schema file will result in a substantially smaller, simpler patch to core and fewer new, small files, making it more likely the patch will get applied.

- His point is that it is generally better design to keep things grouped logically. We have separate directories for each module so each module's schema should be defined there.

We are both right, of course. The question is which right idea should prevail.

Please vote.

yched’s picture

Well, the absence of a real float field type is one of CCK's current shortcomings, and we will have to address that sooner or later, so a float(m,d) will be needed at some point. I did not actually check, but I guess some contribs like location or gmap do use some high-precision numeric column. Actually, we cannot really tell from core's tables and columns what datatypes are used or not.

As to whether to group or ungroup core table schemas, I think I remember Dries saying somewhere that having a single place to look to get an idea of the tables set up by Drupal core was important for some people trying to evaluate Drupal before actually installing it. If we move towards using our own array schema definitions, meeting this requirement could mean providing a sample .sql file as documentation anyway...

sun’s picture

+1 for per-module .schema files

Will this new abstract table creation feature install core module tables on demand? It seems like tables for some optional core modules, e.g. node_counter (statistics), files and file_revisions (upload) and others, are always installed, regardless of whether those optional core modules have been enabled.

moshe weitzman’s picture

We shouldn't cram too many improvements into one patch. I would stick with current core and do the whole schema as one file.

bjaspan’s picture

Regarding making new database engines into modules, as I suggest in http://drupal.org/node/140263, the task is almost completely implemented by a previous patch at http://drupal.org/node/19522. I just re-rolled that patch for HEAD and marked it for review. Someone should please review that patch ASAP so it can get committed before the changes we are making in this thread.

Please do not suggest functional changes to http://drupal.org/node/19522, just review it as-is. Once that patch is in, we can add "database engine as a module" later very easily.

bjaspan’s picture

#31: If float scale and precision really is important we can certainly add it. I'll look into how mysql and pgsql do it and make a suggestion.

#31-#33: That's two for one .schema file and one for many .schema files. But if Dries implicitly votes for one .schema file, he wins. Anyone else with an opinion?

#32: If we have one .schema file we will create all core tables at once, as we do now. If we have per-module core .schema files they *could* be created on demand, but then they would also all need their own .install files, so that's *two* new small files for every core module. Empty tables don't cost much. I'd say it isn't worth the effort.

bjaspan’s picture

Regarding float scale and precision, both mysql and pgsql allow them to be specified but ultimately the specifications only end up selecting between float:normal and float:big. I suspect that no one using Drupal is being careful enough with the complexities of floating-point arithmetic that specifying precision and scale is really useful.

On the other hand, we might want to consider adding support for NUMERIC type which offers exact fixed-precision arithmetic, for example for money. We could allow a colspec like this:

  'salary' => array('type' => 'numeric', 'precision' => 8, 'scale' => 2)

for those of us who might earn up to $999999.99 per year (precision is total # of significant digits, scale is max digits after the decimal point).

Both mysql and pgsql offer this, so I suspect the others all do too, and this will be easy to add to the type maps.

nedjo’s picture

It's very exciting to see this work. Schema representation is a key need in Drupal, something that will open new horizons for e.g. uniform object handling methods, and eventual web service support. If we know our schema, we can infer most of what we need to know about how to load, save, delete, and present data, including linked data.

Woefully unaware of this work (but it doesn't matter, I haven't really repeated anything), I've been sketching in over the past couple of days some ideas on how a Drupal core data api might make full use of schema knowledge in CRUD functionality, see http://cvs.drupal.org/viewcvs/drupal/contributions/modules/dataapi/. Inevitably my version of schema representation was different, see e.g. a partial node table represented here http://cvs.drupal.org/viewcvs/drupal/contributions/modules/dataapi/modul....
No matter, the basics are the same, and once we get a starting point in we can look at the merits of additional data parameters. (Some of what I'd envisioned is not strictly schema information, but needed for Drupal-specific data processes, like whether a particular table should invoke db_rewrite_sql(), whether a primary key uses db_next_id(), etc.)

I look forward to reading and reviewing this important patch.

yched’s picture

bjaspan : The db engines as modules patch seems a great idea, I'll try reviewing this asap.
Thanks for the update about float / numeric.

Regarding the core tables, I agree with moshe that we should probably stick to the current line, that seems to be that default enabled modules get their schema in system.install, and others (book, statistics, search...) get their schema in their own .install file.

smk-ka’s picture

Re: Floats / Numerics
From what I know, floats and doubles are seldom used. Nearly every application requires a predictable precision, that is, numerics with fixed precision should really be higher on the list. They are (or should be) used by e-commerce applications and for storage of geo coordinates.

bjaspan’s picture

Re one vs multiple schema files: I actually did not realize until just now that there are separate .install files for optional core modules. I thought system.install created all core tables. The reason I thought that is that system.install handles *updates* for all core tables, even those it does not create! It seems to me that having the tables defined in one module with the updates in a different module is confusing and wrong, but that's the way core is currently set up.

It is no longer clear to me how to we should arrange this. If we put all core tables in one system.schema, we will still have to patch all the optional .installs to remove the CREATE TABLE statements, so the goal of "minimizing the size of the patch" is not achieved by putting everything in one schema array. Also, the optional core .install files have their own hook_uninstall(), too, so by putting everything in one schema we'd be *removing* that functionality. That won't win any points.

I think I am leaning towards the idea that every module with a .install file (which presumably contains CREATE TABLE) should have its own schema array since, after all, the schema array's most basic purpose is to replace the CREATE TABLE statements. That allows the optional core modules to get installed and uninstalled separately, as they can now. Also, I think the updates should also go in each module's .install file, but that is unrelated to this patch.

I still really do not like the idea of creating a separate .schema file for every module with a .install file. It just feels like Yet Another Little File Syndrome. I do appreciate the fact that these files get parsed on every page load, however, and we don't want to add more code text (especially a large one in system.module) to every module.

What about putting the hook_schema function in the .install file? Whenever we would normally load the .schema file, we just load the .install file instead. This means that to construct (and cache) the Drupal-wide schema, every module's .install file will get loaded. This will only happen when modules are installed/enabled/disabled/uninstalled and when CCK types change; any other time the schema is needed, it can come out of the cache.
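The caching idea can be sketched like this (the function name and the collector callback are stand-ins for the real hook-collection and database-cache machinery, which this patch defines elsewhere): build the combined schema once, keep it in a static, and only re-run the expensive collection step on an explicit rebuild.

```php
// Sketch: cache the combined schema so the expensive collection step
// (loading every module's schema file) only runs on a cache miss or
// an explicit rebuild.
function drupal_get_schema_cached($collector, $rebuild = FALSE) {
  static $schema;
  if ($rebuild || !isset($schema)) {
    // $collector stands in for invoking every module's hook_schema().
    $schema = $collector();
  }
  return $schema;
}
```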

nedjo’s picture

I've reviewed the code but not installed and tested.

Excellent. This is paving the way not only for programmatic database creation and updating operations but for a schema-aware set of CRUD operations.

Some notes and suggestions:

  • +1 on hook_schema_alter(), it's going to be needed.
  • +1 to a single array of schema data per module rather than one per table.
  • Since we're building complex nested arrays, we should if at all possible follow the same format as FAPI. It's familiar to our developers. We have relevant methods for iterating through it (element_properties(), element_children()). Particularly as we're altering existing arrays, this parallel structure will be important. As with FAPI, our table and field names should be the keys, with property keys using the # prefix. Currently, the suggested approach of having table names as array keys but handling field names differently is non-intuitive. So:
    
    function system_schema() {
      $schema = array();
      $schema['system'] = array(
        '#primary_key' => array('filename'),
        '#indexes' => array('weight' => array('weight')),
      );
      $schema['system']['filename'] = array(
        '#type' => 'varchar',
      );
      $schema['system']['name'] = array(
        '#type' => 'varchar',
      );
      ...
      $schema['url_alias'] = array(
        '#primary_key' => array('pid'),
        '#indexes' => array('src' => array('src')),
      );
      $schema['url_alias']['pid'] = array(
        '#type' => 'int',
        '#unsigned' => TRUE,
        ...
      );
      ...
      return $schema;
    }
    
    
  • +1 on separate schema files per core module. One of the things we need to be doing is reducing the special treatment we give to core modules (beyond, perhaps, those that are always required). Rather, we need methods and approaches applicable alike to core and contrib modules.
  • For the same reason, the install.php changes specifically loading non-required modules (taxonomy, comment) should be dropped. A solution for contrib modules will cover these as well.
  • As they grow in size, we might look at whether we should put the seldom-used schema operations (CREATE TABLE, etc.) in separate .inc files loaded as needed.
  • Some minor coding styling issues: constants (TRUE, FALSE, NULL) should be in all caps, comments should be capitalized with punctuation.

Thanks for your excellent work. I'd be happy to take on some of this, for example, rewriting the arrays in Forms API style, let me know what I can do to help.

bjaspan’s picture

Nedjo, thanks for your input.

First, I'd like to make a general comment I've been thinking about. The more I think about the schema stuff and this patch, the more complex and unresolved I realize many of the issues are (this was basically why I wanted everything to stay in contrib for a while, though I am now onboard with trying to get something into D6 core). Therefore, I believe that for this to get into D6, it needs to be as SMALL and SIMPLE as possible: only the barest minimum needed to achieve a useful purpose; everything extra can remain in the Schema module, in contrib, until we understand it better.

I'll define the barest minimum as that which is necessary to remove switch ($db_type) from .install files. This requires a database-neutral schema data structure, the functions necessary to convert that into per-database CREATE TABLE statements, and a set of db_{add,remove}_{column,index} functions for hook_update_n() functions that operate on the same database-neutral concepts and constants (e.g. type names) as the schema structure. We have all of this already; we just need to remove what is unnecessary.

This means that I think D6 core should *not* have database inspection, comparison/error reporting, schema pretty-printing, or basically anything else. That can all remain in the Schema module.

I want to see the next patch (from myself, Frando, or whomever) reflect this approach.

Now, moving on to your comments:

+1 on hook_schema_alter(), it's going to be needed.

I agree that hook_schema_alter() is powerful and important. However, it is far from clear what its semantics should be. If a contrib module adds or removes columns or indexes from some other module's schema, what happens? Does the system automatically perform the changes? Does the admin have to approve? etc. etc. We don't know how it should behave, so we should leave it out (even though it was my idea to put it in).

+1 to a single array of schema data per module rather than one per table.

Agreed. One thing I'm still toying with: Should the system-wide $schema array be keyed by ['tablename'] or ['modulename']['tablename']? If the latter, then mymodule_install() can look like this:

function mymodule_install() {
  $schema = drupal_get_schema();
  db_create_schema($schema['mymodule']);
}

Otherwise, drupal_get_schema() can take a $modulename argument, iterate through the schema, and return the relevant tables. Maybe that is better.
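With a flat, table-keyed schema, that filtering variant could look like this sketch (the function name is hypothetical; it assumes each table definition carries a 'module' property, as Frando's patch later adds):

```php
// Sketch: select a single module's tables out of the flat,
// table-keyed system-wide schema array.
function drupal_get_schemas_for(array $schema, $module) {
  $result = array();
  foreach ($schema as $table => $definition) {
    if (isset($definition['module']) && $definition['module'] == $module) {
      $result[$table] = $definition;
    }
  }
  return $result;
}
```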

Since we're building complex nested arrays, we should if at all possible follow the same format as FAPI ... As with FAPI, our table and field names should be the keys, with property keys using the # prefix.

I agree that we should re-use existing concepts whenever possible. But I am not sure your proposal to make column names be the top-level elements of $schema['tablename'] makes sense. I'm not sure it doesn't, either, but grouping columns, primary keys, unique keys, and indexes together was the intuitively obvious way to do it for myself and several others. You are turning that organization on its head. I wouldn't want to see us create a more confusing structure just to mimic a successful structure designed for a different purpose. I'll have to think about it.

+1 on separate schema files per core module.

Do you have an opinion on a new .schema file vs. putting hook_schema into the .install file?

Thanks!

nedjo’s picture

D6 core should *not* have database inspection, comparison/error reporting, schema pretty-printing

Agreed.

While I'm totally behind the database operations (table creation, etc.) as a driving need for schema data, I'm more focused on the additional possibilities of schema-aware data APIs. That's a lot of the reason why I think FAPI-style arrays make sense. I imagine they'll be what we want to use in operations based on schema information.

Also, I'm foreseeing there will be additional, Drupal-based properties we want to add (eventually, not in the first round) to schema information that aren't strictly needed for the task of schema creation and maintenance.

We don't know how [hook_schema_alter()] should behave, so we should leave it out

Agreed, for now. It's not essential to the first steps here.

Should the system-wide $schema array be keyed by ['tablename'] or ['modulename']['tablename']?

How many places do we need to know what module the schema data are associated with? If only in install, then in the given module install, maybe we could do something like $schema = module_invoke('modulename', 'schema');.

Do you have an opinion on a new .schema file vs. putting hook_schema into the .install file?

The .install file I think would be the wrong place because we're going to need frequent access to schema information and the .install file should be loaded only on rare occasions as it contains data rarely needed.

Maybe we should consider introducing this directly into the .module files. Once we have schema information available, I think in the end we're going to want to use it for most everything. Query builders, data API, CCK in core--this all starts with schema data.

nedjo’s picture

Forgot to mention, related to this schema discussion, just posted a patch to introduce table naming conventions linking object types with their primary tables, http://drupal.org/node/140860. This is a key incremental step toward schema-aware data APIs. A quick review or two would be great if you have a moment.

bjaspan’s picture

While I'm totally behind the database operations (table creation, etc.) as a driving need for schema data, I'm more focused on the additional possibilities of schema-aware data APIs.

Oh, so am I. See my earlier comments about single-query loads and my subtle hints in various posts about using schema data as the basis for incremental data migration between Drupal servers. This bears on the CRUD APIs, too, of course.

Also, I'm foreseeing there will be additional, Drupal-based properties we want to add.

Most definitely.

Q: Should the system-wide $schema array be keyed by ['tablename'] or ['modulename']['tablename']?

How many places do we need to know what module the schema data are associated with? If only in install, then in the given module install, maybe we could do something like $schema = module_invoke('modulename', 'schema');.

Indeed, that is exactly how I originally envisioned it. See the Schema module's .install file which uses itself to create its own tables. It uses module_invoke('schema', 'schema'). The trick with this approach is that the .module file isn't loaded when hook_install() is run. We could change that, of course.

Q: Do you have an opinion on a new .schema file vs. putting hook_schema into the .install file?

The .install file I think would be the wrong place ...

Maybe we should consider introducing this directly into the .module files. Once we have schema information available, I think in the end we're going to want to use it for most everything.

This is what we call "going around in circles." :-) Putting hook_schema in the .module file is the obvious choice, but then we are parsing a lot of extra program text (especially for system.module) that is not needed for the vast majority of page loads. I've had some additional thoughts on this:

1. Generating the system-wide schema will be a somewhat expensive operation. We'll collect the results from all hook_schema()s, then pre-compute inverted join maps, entity-loading queries, and who knows what else (a lot of this will get done in hook_schema_alter(), in fact). When we're done, we'll just cache the generated schema in the database. It only changes when modules are installed/uninstalled or CCK types change. So we will almost *never* actually need to call hook_schema() to get the base data, and there is little reason to load it.

2. We can delay-load it from the .module file:

function mymodule_schema() {
  require_once drupal_get_path('module', 'mymodule') . '/mymodule.schema';  // load mymodule.schema
  return _mymodule_schema();  // defined in mymodule.schema
}

So now we have hook_schema() whenever we load the .module file but do not have to parse the large array most of the time.

Any of these approaches will work, we just need to pick one.

nedjo’s picture

If we assume that schema information (a) is frequently needed and (b) seldom changes, presumably we should be caching it. drupal_get_schema() would then load from cache or, if no cache is available, load directly and cache the result. Then, indeed, we should have schema arrays in separate .inc files. For parallelism, we should load them the same way we load .install files. Probably we should rename and genericize module_load_install() to take a $type argument:


/**
 * Load a module include file.
 */
function module_load_include($module, $type) {
  // Make sure the relevant API is available.
  include_once './includes/'. $type .'.inc';

  $include_file = './'. drupal_get_path('module', $module) .'/'. $module .'.'. $type;
  if (is_file($include_file)) {
    include_once $include_file;
    return TRUE;
  }
  return FALSE;
}

Frando’s picture

I was about to post a really lengthy comment here, but it got lost unfortunately. So I'll try to make it short now.

I basically agree with most things that were said. I'll not individually +1 now.
And I also agree that this patch should be simple, to be able to get it into D6, but I also think that it should already be extendable enough in D6 so that contrib can build on it.

Now some individual comments:
- drupal_get_schema(): In my last patch, a 'module' property is already automatically added to each schema array. So I'd leave $schema_id as the only parameter of drupal_get_schema, and add a second function, drupal_get_schemas($module), which returns an array of all schemas that belong to a certain module. IMO, that would keep things simple and easy.

- hook_schema_alter(): Generally, there are two use cases for hook_schema_alter(). The very difficult one is automatic table modification. There be dragons, so that won't get in D6, IMO. However, hook_schema_alter() is the only way to ensure that the schema definition arrays are consistent with the actual tables. There are modules that add columns to existing tables. Without hook_schema_alter, the schemas returned by drupal_get_schema would then not match the actual tables anymore, which is very bad. That's why we should add it in this patch, but with a more limited scope. That means that db_create_table processes the un-altered schema definitions, e.g. by passing a second parameter, $altered = false, to drupal_get_schema. By default, drupal_get_schema should invoke hook_schema_alter, though.
Now, if a module wants to add a field to an existing table, it just adds the field in hook_install via db_add_field, but also implements hook_schema_alter to add the field to the schema definition.
Voilà, problem solved: The actual db tables and the schemas returned by drupal_get_schema remain in sync without dealing with automatic schema modification based on hook_schema_alter for now.
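A minimal example of that limited use of hook_schema_alter() (the module name and the added field are hypothetical): the module adds the column to the live table in hook_install via db_add_field, and mirrors the same change into the schema array so drupal_get_schema() stays in sync.

```php
// Sketch: mirror a column a module added to another module's table
// into the schema definition, so schema and live tables stay in sync.
// 'mymodule' and the 'last_viewed' field are made up for illustration.
function mymodule_schema_alter(&$schema) {
  $schema['node']['fields']['last_viewed'] = array(
    'type' => 'int',
    'not null' => TRUE,
    'default' => 0,
  );
}
```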

- hook_schema in .module vs. .install vs. .schema: As Barry said, all of these work, we just need to pick one. hook_schema in .module is out of the game, IMO, as that'd be too much code parsed on each page load. So it's .install or .schema. I vote for .schema, but .install would also work for me.

Frando’s picture

Attached: new patch (144.67 KB)

And another round ;)

Included in this patch are the results of the recent discussion here.

  • Simplified the API: hook_schema() now directly returns the definitions. To save parsing time, the actual schema arrays are stored in a function _modulename_schema in a modulename.schema file. This function is called by hook_schema. Small contrib modules could just stick their schemas into their .module files if they wanted to.
  • The schemas now get cached in the database. The cache can be rebuilt with one function call (currently done after module activation/deactivation).
  • Recreated all the schemas. Display width is now being set.
  • Added a 'numeric' type with 'precision' and 'scale' properties
  • Added drupal_alter('schema', $schema). This should be used to keep the actual tables and their schemas in sync, e.g. if a module adds a column to an existing table it implements hook_schema to change the schema accordingly. For table creation, we have always to pass the unaltered schema definition (that's as simple as
       $schemas = module_invoke('aggregator', 'schema');
      db_create_schemas($schemas); 

    ), though, as everything else would get *really* bad without giving it lots more thought and code.

  • And finally: updated all .install files that create tables to use their schemas.

I did not convert the definitions to use fapi #properties. Barry and I both thought that this is not really required at the moment: Currently, all keys apart from field names and table names would be #properties. Field and table names have fixed positions in the array tree anyway (that's the difference to FAPI: in FAPI, the hash is needed as a differentiator between properties and children), so the change would be purely cosmetic at the cost of more typing. I'm not insisting on this, though, so if there are more voices in favor of the FAPI style, I might get convinced ;)

Barry started working on the Postgres part, so this is hopefully getting close ;)

Frando’s picture

Oh, and I just did a mysqldump of an installation of current HEAD, all modules enabled, and one of HEAD with my last patch (installed after applying, of course) and all modules enabled. The diff revealed only one remaining actual schema difference:
without patch: `cid` varchar(255) character set utf8 collate utf8_bin NOT NULL default '',
with patch: `cid` varchar(255) NOT NULL default '',
That basically means we have to add 'character set' and 'collate' properties for fields in the next revision, IMO.

bjaspan’s picture

One quick comment: The schema type 'numeric' should map to the MySQL type 'numeric' (a.k.a. 'decimal'), not 'float'. Only size normal makes sense; there is no type for numeric:big.

I wonder if we should define all the sizes (tiny, small, medium, and big) for all types, even if they all map to the same database type (e.g. numeric:tiny through numeric:big would all just be 'numeric') or if we should report an error if an invalid type:size combo is specified.

nedjo’s picture

This is looking great.

I'm not stuck on the FAPI format. If there's a need, we can come back to this at any time. Meantime the current approach works fine.

(Loading a module in the install process is certainly a plus, more than once I've wanted access to the module's methods while installing.)

Frando’s picture

I wonder if we should define all the sizes (tiny, small, medium, and big) for all types, even if they all map to the same database type (e.g. numeric:tiny through numeric:big would all just be 'numeric') or if we should report an error if an invalid type:size combo is specified.

I think, for each type, we should either define one or all sizes. Otherwise, things might get difficult, as the different engines implement a different set of type-size-combinations. Or, if we'd replace the sizes with numbers (and then use constants), we could define only the available types for each engine and fall back to the next higher one.

bjaspan’s picture

I definitely think the type:size combos that are allowed in a schema structure should be engine-independent; that's one major goal of having the structure to begin with.

I like the idea of defining either one size (e.g. normal) or all sizes (tiny through big) for each type simply because it makes the documentation easier. For every type except int on MySQL, some/most of the size options for a type will map to the same thing, but that's fine.

The "fall back to the next higher size" will be implemented via the type map tables. ie: int:medium will be mediumint on mysql but bigint on an engine that only has int and bigint.
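To illustrate the "define one size or all sizes" rule, a per-engine type map could look like this sketch (the entries are illustrative, not the actual db_type_map() from the patch): int defines all five sizes, numeric defines only normal, and several sizes may collapse to the same native type on engines with fewer distinctions.

```php
// Illustrative per-engine type map: 'type:size' keys resolved to
// native column types. Here int:medium gets its own mysql type, while
// an engine without MEDIUMINT would map it to the next larger type.
function example_type_map() {
  return array(
    'int:tiny' => 'TINYINT',
    'int:small' => 'SMALLINT',
    'int:medium' => 'MEDIUMINT',
    'int:normal' => 'INT',
    'int:big' => 'BIGINT',
    'numeric:normal' => 'NUMERIC',
    'float:normal' => 'FLOAT',
    'float:big' => 'DOUBLE',
  );
}
```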

Frando’s picture

Attached: new patch (144.74 KB)

Another reroll.

Changes to last patch:

  • sync with HEAD
  • NUMERIC type implemented properly
  • corrected the db_type_map, as discussed above
  • schema definitions: type = 'serial' used for all auto_increment fields (instead of using 'int' and setting auto_increment = true). db_process_field in database.mysql.inc then adds auto_increment = true for type serial.
nedjo’s picture

Looking good. Some small remaining details, some of them mentioned already.

1. Can you explain the need for special treatment of the core modules with schemas by explicitly loading them in install.php? Won't this be done just like for non-core, non-required modules?

2. All the nearly identical hook_schema() functions seem superfluous. Instead, it looks like we should load the .schema includes as needed and have the hook_schema() functions live there. Here's another draft, this time of a pair of functions for this. These should replace the current module_load_install().


/**
 * Load a module include file.
 */
function module_load_include($type, $module) {
  // Load a relevant API if available.
  if (is_file('./includes/'. $type .'.inc')) {
    include_once './includes/'. $type .'.inc';
  }

  $include_file = './'. drupal_get_path('module', $module) .'/'. $module .'.'. $type;
  if (is_file($include_file)) {
    include_once $include_file;
    return TRUE;
  }
  return FALSE;
}

/**
 * Load an include file for each of the modules that have been
 * enabled in the system table.
 */
function module_load_include_all($type) {
  foreach (module_list(TRUE, FALSE) as $module) {
    module_load_include($type, $module);
  }
}

So before calling module_invoke_all('schema') we would call module_load_include_all('schema'). And before each module_invoke('schema', 'modulename') we would call module_load_include('schema', 'modulename').

3. Similarly, it looks like we could remove most hook_install() implementations altogether. Instead:


/**
 * Install the schema for a module, if present.
 */
function module_install_schema($module) {
  drupal_load('module', $module);
  module_load_include('schema', $module);
  if ($schemas = module_invoke($module, 'schema')) {
    db_create_schemas($schemas);
  }
}

We could invoke this during module install. If the module doesn't require any further setup, it doesn't need an install function.

4. There's an extra line before function drupal_get_schema($name).

Should I do a patch iteration with these small changes?

Frando’s picture

Status: Needs work » Needs review
Attached file: 134.62 KB

And here we go again:

1. Can you explain the need for special treatment of the core modules with schemas by explicitly loading them in install.php? Won't this be done just like for non-core, non-required modules?

The problem is that when system_install() gets invoked, the module system is not yet fully available, as there aren't any database tables yet and thus no system table. That means the module list has to be constructed manually (in install.php). In current HEAD, the tables for the system, taxonomy, comment, block, user and node modules are all created by system.install. To create their tables from the schema definitions living in the corresponding modulename.schema files, we have to invoke hook_schema for each of them, and therefore have to add them to the module list. We might want to reduce the number of automatically created tables, but let's leave that for another patch.

Changes to last patch:

  • hook_schema now just lives in modulename.schema. drupal_load_schemas() simply includes all .schema files. Remember that the schema array is cached; it is only rebuilt upon module installation or an explicit call.
  • For including the schema files, two helper functions are added to module.inc, as proposed by nedjo. I did not yet remove module_load_install(); that should be left for another patch.
  • Updated all .install files to use the new drupal_install_schema()

I'm not sure whether fully automatic schema creation/deletion would be too much "magic" - more opinions would be great. For now, I just left in the two lines that are required to create all tables for a module in hook_install().

I'm setting this to "needs review" for now. There are still some known TODOs (Postgres support and collate & charset properties; Barry is working on the former, and the latter should be fairly simple and is only needed for one field in the whole of core), but apart from that, this seems quite solid by now and up for review, imho.

Frando’s picture

Title: Data models, reflection / abstract table creation » Schema reflection / abstract table creation
bjaspan’s picture

Status: Needs review » Needs work

We're making great progress here (thanks, Frando!), but I think we should wait until we think the patch is actually done before marking it "needs review." I will try to get to the PostgreSQL change by the end of tomorrow. We also need mysqli.

yched’s picture

Attached file: 133.19 KB

Update: rerolled against latest HEAD.

I fixed a couple of bugs in some modules' _install functions (they called the non-existent db_create_schemas).
Actually, it seems that

function book_install() {
  drupal_install_schema('book');
}

is enough (no drupal_load, no module_invoke...), so I modified all _install functions that way.

Seems to work perfectly fine AFAICT: first Drupal setup and individual module installation work as expected.

We might want a drupal_uninstall_schema() function to mirror drupal_install_schema() on module uninstall?

Bjaspan: could you give a short summary of which issues need a push before this can land?

Frando’s picture

Thanks for the reroll!
Yes, drupal_install_schema() is enough; that's what I wrote that function for. Dunno why I forgot to update all the .install files. Thanks!

drupal_uninstall_schema() could be added, but is not really needed (we don't have to deal with including module files during uninstallation; db_drop_table does the job, I think). But I'd say let's add it for the sake of consistency.
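Purely as an illustration of what such a function would have to do, here is a self-contained sketch (the helper name is hypothetical, not from the patch) that computes the DROP TABLE statements implied by a module's schema array:

```php
<?php
// Hypothetical helper: given a module's schema array (table name =>
// table spec), list the DROP TABLE statements uninstall would issue.
// Table names are wrapped in {} for Drupal's table-prefix handling.
function _schema_drop_sql(array $schema) {
  $sql = array();
  foreach (array_keys($schema) as $table) {
    $sql[] = 'DROP TABLE {' . $table . '}';
  }
  return $sql;
}
```

The real drupal_uninstall_schema() would additionally need to invoke hook_schema for the module and pass each statement to the query layer.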

Remaining issues are adding 'collate' and 'charset' properties (really minor one for now, only one core field would set them) and postgres support. Barry, what's the status on the latter?

Frando’s picture

Status: Needs work » Needs review
Attached file: 137.63 KB

I just realized that collate and charset properties are not needed in Drupal core at all. Hm. I thought some column used collate utf8_bin, but I was wrong. So that is not needed for now (it could still be added in a follow-up patch, but let's keep this one simple).

The only change in this patch compared to yched's reroll is that I added drupal_uninstall_schema($module), which drops all tables defined in a module's schema. This function is used in all uninstall hooks now, which makes them simpler and more consistent with the install hooks.

So, from my point of view, this is ready now apart from Postgres support and some review.
I'm setting this to CNR, as everything in the patch is really up for review, so that it can get in quickly once Barry's Postgres work is done.

Frando’s picture

Attached file: 145.24 KB

Missed database.mysqli.inc in the last patch

yched’s picture

Nitpicking:

// Set missing required values
drupal_process_schemas($schemas);

might not be needed in drupal_uninstall_schema?

yched’s picture

Sorry, looking closer, it probably is needed. Forget it.

yched’s picture

PS: you left your personal settings.php in the patch ;-)

Frando’s picture

Attached file: 144.6 KB

Removed settings.php.

webchick’s picture

Status: Needs review » Needs work

system.install's hunk no longer applies.

bjaspan’s picture

webchick is correct, of course. Also, this code is NOT ready for review. As I work on the pgsql patch, it is clear we still have some issues to resolve. We do not want to waste reviewers' time with not-ready code.

bjaspan’s picture

Please see the message I just posted on the development list about mysql and pgsql's schemas not being the same.

yched’s picture

Just wondering: it seems pgsql does not know anything about the '3' in 'INT(3)'.
If it's only a MySQL-ism (it seems it's not part of the SQL standard, per http://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt), do we really need to support the 'disp_width' stuff for INT columns?

(feel free to flame me quiet if you discussed this already, I know there has been a lot of talk that I missed...)

bjaspan’s picture

Correct, int(3) is a MySQL-ism, and 'disp_width' is for nothing but storing the '3'; it is basically useless. IMHO, there was never any point in using int display widths in Drupal; they are only relevant if your primary UI to the database is a mysql-cli-like client.

In fact, I'm pretty sure some developers think int(1) means "one-byte int" instead of "four-byte int, only one digit of which is displayed."

Frando's code supports it only because Schema's code supports it only because I did not want to "take away" any features from core tables in the schema structure that might give a core maintainer reason to refuse the patch.

But yes, it is a dumb feature.

yched’s picture

In fact, I'm pretty sure some developers think int(1) means "one-byte int" instead of "four-byte int, only one digit of which is displayed."

I was one of those until, er, recently.

I think this would be a good opportunity to unsupport this and remove the confusion, but I do see your point about not raising unnecessary eyebrows...

bjaspan’s picture

FYI, I am currently working on the pgsql stuff which is going to require/influence some changes elsewhere in the patch. Since we don't have revision control on patches, I ask that no one else modify the patch until I'm done. Thanks.

dww’s picture

this totally rocks... i'm so happy to see this on its way into D6. maybe i can just forget about the taxonomy vocabulary for database compatibility i was thinking of adding to release nodes on d.o, since this will effectively be a solved problem. ;) i guess there's still the possibility of irresponsible or careless module authors using MySQL-isms in their queries, like "REPLACE INTO", but for the most part, pgsql will Just Work(tm) everywhere, once this hits. jah^H^H^H barry, frando and nedjo be praised! ;) when this is back to CNR, i'll make a concerted effort to carve out time to give it a careful review, since this has been on my radar for ages... thanks, y'all!

bjaspan’s picture

A snafu: Key/index column prefix lengths. I knew this would be an issue eventually but I was hoping it would somehow magically disappear. It didn't.

A unique key or index can specify that only a prefix of a column is included. In mysql, the syntax is KEY colname(length). In all of core, this feature is used four times: node's node_type and node_title_type indexes, term_synonym's name index, and locale_source's source index. However, in mysql, a column of type text can *only* be included in an index if a length is specified, so if we do not support prefix lengths then locale_source.source cannot be indexed at all. That won't win us any points.

My suggestion is that column elements in primary key, unique key, and index specifications can either be names, as they are now, or an array of (name, length). So, we could have

$schema['table']['primary key'] = array('col1', array('longcol', 15));

The primary key would be col1 and the first 15 bytes (or whatever) of longcol.
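On MySQL, that mixed specification has to be flattened into the key's column list. A self-contained sketch of the flattening (the helper name is hypothetical; the actual patch code may differ):

```php
<?php
// Hypothetical helper: turn array('col1', array('longcol', 15))
// into the MySQL key column list "col1, longcol(15)".
function _db_key_columns_sql(array $columns) {
  $parts = array();
  foreach ($columns as $col) {
    if (is_array($col)) {
      // array(name, length): include only a prefix of the column.
      $parts[] = $col[0] . '(' . $col[1] . ')';
    }
    else {
      $parts[] = $col;
    }
  }
  return implode(', ', $parts);
}
```

Engines without prefix support (e.g. pgsql) could simply ignore the length and index the whole column.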

Comments?

Frando’s picture

at #75:
I also spotted the index length issue, and my thoughts ran along the lines of Barry's solution. It's the simplest approach. +1 from me. We're getting closer and closer ;)

bjaspan’s picture

Attached file: 151.31 KB

Here is my latest patch. I'd like people who have been active here to play with it before we mark it CNR or announce it on the dev list. Also, schema.module is now ported to HEAD so you can install it to experiment. See if you get unexpected errors, etc.

Things to try:

1. Clean installs with this patch applied.
2. Clean install without the patch applied, then apply the patch and run update.php. Verify that schema.module reports 0 mismatches (mysql) or 3 (pgsql, all in indexes).
3. Bang around the system after the patch is applied, make sure everything works.

Here are my notes describing what I did.

- Files in this patch:

  install.php
  update.php
  includes/*.inc
  modules/*/*.install
  modules/*/*.schema

  cvs diff -F^f install.php update.php includes/*.inc modules/*/*.install
  diff -u -N --from-file /dev/null modules/*/*.schema

- Supports mysql, mysqli, and pgsql.

- Clarified terminology (variable and function names).  A schema is
  one or more table specifications; we only have one system-wide
  schema, formed by merging every module's schema which consists of
  one or more tables.

- Reworked drupal_process_schema() somewhat.  Removed redundant
  'table' attribute.  'size' => 'normal' is now handled in db engine.

- update_sql().  Most schema API calls will be from update functions
  so they need to use update_sql.  So, db_create_table() might as
  well, too.  Moved update_sql() from update.php to database.inc and
  made all schema API functions use it.

- drupal_get_schema() can now return the entire schema, not just one
  table.  schema.module needs this.

- mysql changes moved into database.mysql-common.inc, included by
  mysql.inc and mysqli.inc.

- system.install: restored creation of domains and functions for pgsql

- locale.install: See comment in locale_install.  For mysql,
  locales_source.source and locales_target.target need to be blob to
  be case-sensitive but should be text on pgsql and elsewhere.  Thus,
  those tables now have type => text with mysql_type => blob.

- Reconciled mysql and pgsql schemas.  Changes that required .schema
  file changes: 

  system.schema: new tables: batch, cache_form, cache_menu.
  aggregator_feed: rename 'link' key to 'url'
  blocks.delta: fix default: was 0, now '0' (type is varchar)
  boxes.bid: type serial
  boxes.format: size small
  comments.format: size small
  comments: rename 'lid' index to 'nid'
  node_revisions.vid: type serial
  cache.serialized: size small
  cache_filter.serialized: size small
  cache_page.serialized: size small
  files.fid: type serial, no default
  users.uid: type serial, no default
  term_synonym: name index on name(3)
  node: node_type index on type(4), node_title_type index on type(4)
  locales_source.source: type text, mysql_type blob
  locales_target.translation: type text, mysql_type blob
  locales_source: add index source on source(30)
  locales_target: rename index lang to language

- Reconciled mysql and pgsql schemas via system_update_6016().  DO NOT
  REGENERATE .schema FILES UNTIL THIS UPDATE IS APPLIED.
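The mysql_type override mentioned in the locale.install note above would look something like this in a .schema file (field details abbreviated; the 'not null' key is an assumption for illustration):

```php
<?php
// 'type' is the abstract, engine-independent type; 'mysql_type'
// overrides it on MySQL only, so the column is a case-sensitive
// blob there but a plain text column on pgsql and elsewhere.
$schema['locales_source']['fields']['source'] = array(
  'type' => 'text',
  'mysql_type' => 'blob',
  'not null' => TRUE,
);
```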
bjaspan’s picture

Attached file: 165.47 KB

Previous patch was missing a file and in the wrong format for all new files. New patch attached; no changes other than diff format.

bjaspan’s picture

Frando currently owns the "patch lock"; he'll be cleaning up some field/column terminology along with whatever else he finds. When he's done, I've found a few more schema reconciliation fixes that are needed for the new tables I added to system.schema in the last patch. Perhaps after that we'll be ready for review. :-)

bjaspan’s picture

TO DO: http://drupal.org/node/127539 just committed a patch changing the core schema for the batch table. We have to incorporate the change.

bjaspan’s picture

TO DO: http://drupal.org/node/140666 committed a patch which changes the core schema regarding primary keys for several tables. We have to incorporate the change and rename our system_update_6016() function.

There may be other schema changes being committed too, these are just the ones I've been tracking. Code changes are easy to notice because cvs update deals with them or reports a conflict. Schema changes are less obvious.

chx’s picture

menu saw sweeping changes. http://drupal.org/node/137767

Frando’s picture

Attached file: 153.83 KB

So here's another patch.
Changes to Barry's last one:

  • Reconciled the schema definitions with current HEAD. Incorporated the recent menu table changes and the index/key patch that got committed.
  • Documentation improvements
  • Code style corrections
  • Made function and variable naming more consistent
  • Another round of testing: module installation/uninstallation etc. No problems spotted.

For mysql, there are zero schema differences spotted by schema.module.
For postgres, there seem to be 5 left, but all are minor ones (indexes and keys only):

Tables for which the schema and database are different:

taxonomy
  • term_synonym
      • indexes name: missing in database

system
  • url_alias
      • unique keys dst_language: missing in database
      • unique keys dst: unexpected (not an error)

node
  • node
      • unique keys nid_vid: missing in database
      • indexes node_title_type:
        declared: array('title', array('type', 4))
        actual: array('title')
      • indexes node_type: missing in database
      • unique keys nid: unexpected (not an error)

locale
  • locales_source
      • indexes source: missing in database

block
  • blocks
      • column delta:
        declared: array('type' => 'varchar', 'length' => 32, 'not null' => 1, 'default' => 0)
        actual: array('type' => 'varchar', 'length' => '32', 'not null' => 1, 'default' => )
bjaspan’s picture

Excellent. I've taught schema.module how to inspect column-prefix keys, so tomorrow I'll verify that pgsql compares with zero errors and mark this as ready for review.

Frando’s picture

Attached file: 168.29 KB

A file (database.mysql-common.inc) was missing from my last patch.

bjaspan’s picture

Latest patch is available at a new issue, http://drupal.org/node/144765, ready for review. Frando and I decided it would be better to create a new issue rather than force reviewers to wade through the history of this one. The new issue links to this one, though.

Changes in this patch:

- Fixed handling of unique key names in pgsql
- database.pgsql.inc and database.mysql-common.inc: call _db_process_field() before _db_create_field_sql() in db_add_field().
- system.schema changes:
  - change batch.bid to type => serial, unsigned => TRUE
  - restore cache_form.serialized to size => small
  - change menu_links.mlid to type => serial
  - rename index menu_expanded_children to expanded_children
- system_update_6017():
  - pgsql: leave default on batch.bid, that's how auto-incr is implemented
  - pgsql: change batch.sid to batch.token
  - pgsql: fix index locales_source.source to be on array('source', 30)
  - pgsql: rename index menu_links.menu_name to menu_links.menu_name_path
  - pgsql: rename unique key node.nid to node.nid_vid

  - mysql: batch.bid: make auto-increment, unsigned
  - mysql: change cache_form.serialized to size => small
  - mysql: change menu_links.mlid to auto-increment
  - mysql: rename index menu_expanded_children to expanded_children

This results in zero mismatches reported by schema.module for mysql, mysqli, and pgsql. Whew!

magico’s picture

Seems I'm gonna use this to upgrade my code at http://drupal.org/node/122241

yched’s picture

Status: Needs work » Closed (duplicate)

The patch now lives in a fresh issue at http://drupal.org/node/144765.
Let's mark this one duplicate.