Database Layer: The Next Generation
| Project: | Drupal |
| Version: | 7.x-dev |
| Component: | database system |
| Category: | task |
| Priority: | critical |
| Assigned: | Crell |
| Status: | closed |
So the installer doesn't work yet, but for MySQL everything else should. For those who want to try it out, you can check out a copy of Drupal with the new database code in it from chx's public svn repository: https://4si.devguard.com/svn/public/pdo . A proper patch will be forthcoming as soon as the installer works, which should be Very Soon Now(tm). For a complete list of the new features it offers, see: http://www.garfieldtech.com/blog/drupal7-database-update .
If you want to help with the installer, let chx know so he can give you commit access to it. :-)
This patch will also deprecate about a half-dozen old feature requests for the database layer, as it works a whole bunch of features into one cohesive system. I'll go flag those as duplicate shortly.

#2
subscribing.
#3
subscribing.
#4
Subscribing.
#6
For those following at home: I just installed Drupal with the new layer (and previous tests as long as a week ago showed installed Drupal working). We still need to work on error handling.
#7
yes, another "subscribing" (with hopefully more informative followups :)
#8
We have the system installing and reinstalling, users, node and upload tests are passing. When all tests pass I will submit the patch, not until. It's close, though.
#9
subscribing
#10
@chx: Where are those tests that are failing, or are you just talking in general that not everything's working? If there's a list of what's wrong, I'd like to pitch in.
#11
@Susurrus: Right now we're going through module by module in SimpleTest and finding problems. Usually it's because the backward compatibility layer we've included can't handle every edge case, so some queries need to be updated to the new format anyway. I think we're over half way through. The biggest roadblock right now, I think, is figuring out what to do with pager_query(), which right now is all kinds of manual weird ugliness that is frequently dynamically built, badly. The new system has its own "real" query builder, but I'm not yet sure how we're going to handle pager_query(). It may end up being a special case of the query builder.
If you do a checkout of the svn repository mentioned in the first post, you can see the current status of the code. The installer works, so go ahead and run it. Start running SimpleTests, find places that fail, and let us know / convert the queries as needed. Either of us is in IRC on and off throughout the day if you need help. Thanks!
#12
I'm new to PDO, but I thought that FETCH_ORI_ABS, FETCH_ORI_REL do the same as the pager_query().
#13
No, that's something else, actually. That's for fetching an arbitrary row from a result set. pager_query() dynamically specifies the LIMIT and ORDER BY portion of the query based on user input.
#14
Hope to mention that we still have an alternative choice of PDO implementation from http://drupal.org/node/172541, which based on existing DB API architecture plus totally 5 databases supporting. I think both issues have no conflict and should able to contribute for each other, e.g. this issue target for ORM/active record/PDO, but the other focus on multiple database supporting plus backward compatible. Hope we may have more join development for better implementation.
#15
subscribing
#16
Smal, trivial patch, hardly needs any explanation.
#17
It may not need an explanation, but I'll offer one anyway. :-)
This patch does the following related tasks:
OK, so, why? Well, here's a list of features that this architecture allows:
hook_query_alter(). 'nuff said.Later follow-ups can also offer the following, but we wanted to get the base system in place first.
db_rewrite_sql()in favor ofhook_query_alter()We've already run extensive SimpleTests on this patch, and it is extremely stable. There's a few edge cases being investigated, but all functionality should work as is.
I've tried to provide copious inline documentation. I think the patch is more documentation than code, actually. Architectural and usage-guide docs will be forthcoming, but I didn't want to spend too much time on those until the patch was out and written.
Committers: Although I've been working on this for a while, it would not be anywhere near as powerful or stable without chx's help this past month. Please credit him as co-author.
Review please. :-) (Attached patch has some cosmetic changes.)
#18
About stability. This patch passed all existing simpletests plus the pager one I have even written for it -- with two exceptions: one, if locale is not enabled the modules test fail -- but manually doing the same does not fail. As simpletest had module issues recently, I am not worried, Two, blogapi fails in HEAD and I have not yet figured it out why so it fails with this one too. Aside from that all simpletests work.
#19
This patch implements my object-based transaction system, so I of course support that. :-) Reviewing this patch will give me something to do on the plane to Drupalcon.
#20
Big patch, I'm reviewing it (slowly).
#21
Just curious but did we do any performance comparisons? This patch has the advantage that we can exploit database specific optimizations, but at the same time, how much slower did the average select query became? Do we have any early insight in this?
#22
ajk did benchmarks a few days before the patch was rolled and he found that there is no slowdown and sometimes there is some speedup. the patch is not really about performance but features. Of course, we took as much care as possible with performance, for eg. not using native prepared statements for mysql because those are not cached by mysql.
#23
As chx's mentions, I've been looking at how it performs for a while now as the question was going to be asked sooner or later. And since HEAD hasn't deviated too far from Drupal 6 now is a good time to publish some results so far. Rather than clog up this issue with lot's of performance/profile related junk I created a more detailed post here:- http://drupal.org/node/229132
If anyone wants more info, drop me a line. I'll continue to profile this patch as it matures but based on the forum post above it's a +1 from a performance angle at the moment at least.
#24
Here's a few things I came up with on the plane.
Functional problems:
* Functions and variables are CamelCase, which isn't Drupal style.
* Transaction object doesn't handle nested transactions properly. It needs a static counter.
* Doesn't implement LOW_PRIORITY for UPDATEs, even though DELAYED is supported.
Questions:
* Does the inheritance of PDO allow freeing result sets?
#25
About CamelCase, we feel that as PHP largely uses CamelCase for its objects while it uses lowercase for functions so we though we will adopt. LOW_PRIORITY would be easy to support but what's the point...? What do we gain by supporting it?
#26
Regarding LOW_PRIORITY, I just think that we should be consistent in support of non-standard MySQL extensions. If we support DELAYED, I think we should support the other options.
#27
subscribe
#28
subscribing
#29
DELAYED is extremely useful for performance so we are supporting it -- do we really want to debate this in this issue? is this the big issue w/ this? can't we have a followup issue ?
#30
hello Strauss,
several days ago, i had sent a letter to your mailbox: shout at thatotherpaper dot com.
i hope that you can reply it.
many thanks.
#31
@DavidStrauss: How useful is LOW_PRIORITY to Drupal in practice? IMO, we should support SQL features that are of value to Drupal. INSERT DELAYED (or an equivalent functionality in another database) has a very definite benefit for logging, caching, etc. Where would we use LOW_PRIORITY?
#32
Subscribing. I agree with only supporting features of benefits. If there's no real use in supporting LOW_PRIORITY lets keep it out. It can always land at a later stage once this gets in.
#33
Subscribe. I've started reviewing this patch. I already have a list of comments; no show-stoppers but some things to fix.
#34
OK, I'm really late to this thread. Bear with me. Can someone explain why replication (and the major changes that come with it, like db_select() vs. db_update()) outweighs a solution like memcached, which many would argue is a much simpler solution to scaling than clustering? It doesn't seem like MySQL Clustering is that impressive -- and for obvious, fundamental reasons relating to the challenges of master-master, master-slave db replication, and it costs you a whole lot of new code complexity.
#35
@Crell I agree that it's much less useful. I was arguing from API consistency, not pragmatism. I won't be offended if we leave out LOW_PRIORITY support; I just want it to be a conscious decision.
#36
should this patch include this fix? http://drupal.org/node/220064
#37
@36: I don't think so. The bug in #220064 is that a query is broken: it is treating strings and ints as directly comparable, and the latest PostgreSQL is picky enough to complain about it (as opposed to mysql that will sleep with anyone it can recognize). Arguably, the query is broken because our schema is broken. But in no case does our API to execute queries have anything to do with this.
Someone might argue that we should use schema information to automatically cast data types when necessary to make queries like this one work. I'd say no; the solution is not to write broken queries.
#38
@bjaspan - if it should not be included, let's move that patch to 6 .x and get it in sooner.
#39
Regarding DELAYED, are we ok that it's not supported on InnoDB ?
http://dev.mysql.com/doc/refman/5.0/en/insert-delayed.html
Something somewhere reminds me d.o is using InnoDB tables for something. Not sure if this is/was the case though.
#40
Could we please stop painting the bikeshed? If not, I am perfectly happy to roll a patch without INSERT DELAYED if that's needed to move on but really, instead of painting the bikeshed, let's discuss the atomic plant:
#41
@AjK: No, I'm not happy with it. If core is depending on MySql extensions then it is just plain wrong. I use InnoDB exclusively because the shear numbers of data requires me to use transactions. Also, note, that if the reason for using INSERT DELAYED is the logging processes then the queued delayed inserts may not make it at all to the DB and therefore make it harder to debug issues.
#42
My review is still underway and I have some feedback that will require changes (nothing show-stopping), but one immediate issue is that the schema.inc and schema.*.inc files are missing from the patch. Could we have a new patch file including them?
You may find cvsutils (http://www.red-bean.com/cvsutils/) helpful in tricking CVS into including newly created files in a cvs diff. I had to do this for the Schema API patch.
#43
Also, if at some point in the future this patch is sitting still for a while, I'll volunteer to go through and make (just a few) changes for further compliance with coding standards. From what I've seen so far, there are really very few "nitpicky" changes that need to be made, and the code comments are excellent. The biggest offender I've seen so far is two spaces between most sentences, rather than core's preferred one.
#44
WRT to ajk's question in #39: Yes, drupal.org indeed uses innodb for everything but the search tables (which are myisam). However, I don't think that using insert delayed would be a problem for e.g. watchdog inserts. We should use the archive type which is IMO more suitable for this.
#45
@chx in #40. Ticks to 1, 2 and 3.
The patch came in at #17 and this is now #45 without a re-roll/etc. I can see DELAYED is supported but it's not actually used by Core. If it's the only stumbling block so far congratulations to Crell and Chx on some fine work.
Remove INSERT DELAYED if it will settle things if you like, call it a natural finish and get it committed. I'm sure Keith would be rather rolling his patch to remove extraneous spaces/lines against HEAD.
We all just need somewhere to park our bikes ;)
#46
p.s. assuming we get the missing files bjaspan refers to.
#47
The patch is not ready to apply in its current state. I set it to CNW for a reason. I plan to finish my initial review and post the results tomorrow at the code sprint.
#48
I personaly don't like classes, at least for what developers will actually use (although I think this whole abstraction is completly doable with functions). And I don't see much value in chainable methods (specifically on
db_insert()calls). Some examples:<?phpdb_insert('my_query', 'node')->fields($fields)->execute();
?>
How this differ, from:
<?phpdb_insert('my_query', 'node', $fields);
?>
is there cases when the following syntax is necessary?
<?php$foo = db_insert('my_query', 'node');
$foo->fields($fields);
$foo->execute();
?>
Or is there cases when you want to call
$foo->execute()before setting fields using$foo->fields()? IMO this:db_insert('my_query', 'node')->execute()->fields($fields);doesn't make any sense and it doesn't works as jQuery's chainable methods does, i.e. the order is not important unless you are filtering/traversing DOM elements.And for the
duplicateUpdate($update_fields)method. Is there cases where$insert_fieldswithout the keys is different from$update_fields? If not, sodb_insert()should be capable of knowing where a record already exists, and do run a UPDATE if so. So thedb_insert()should be renamed to something else likedb_save()ordb_write()as it not only does INSERTs but UPDATEs too.Batch UPDATEs seems to be a case for chainable methods in
db_insert()however I don't see any example of this use in the patch.What's the problem with a
$paramsarray with conditions and other properties for doing a select?My opinion is that we need to provide a simple interface for trivial tasks, not enforce users to write complex code for very simple data manipulation stuff.
#49
Subscribing.
Interesting pattern question there, especially in light of Drupal's stance on OOP.
Chained methods are neat - jQuery does everything like this. But can we mix patterns like this in Drupal too? Calling member methods on the return value of other functions strikes me as fitting more appropriately into prototype-oriented programming than either functional or object-oriented programming... but I'm not well-versed in patterns.
#50
@Gerhard: Did you see my response in #41? You state the opposite as I when I give a reason for not using DELAYED for watchdog (well, I called it "logging processes"). If php or mysql aborts the messages are likely not to be captured to the table that have already been written. I don't think we want this even in a production system. I've found many problems looking at the watchdog log of my production system. Production systems usually contain more activity and prove to contain nuances that development and test systems do not by nature have.
@chx: I do not care whether objects are used so +1 for that. I do not care that CamelCase or even mixedCase is used as long as we all agree to modify the Drupal Coding Standard so +1 for that. And I'm not sure about the chainable query builder; recidive made some interesting points. As long as we have improved performance and improved multiple engine support with ease of use I can live with the changes. I'm using Drupal DB API to process 100K rows daily in batch and then processing the changes as nodes daily; some days the node changes could be near the 100K mark as well. All the nodes have a minimum of four terms and pathauto is setup to add path aliases. I need performance most of all.
#51
Here is what InsertQuery::fields says: "This method may be called multiple times. If it is, multiple sets of values will be inserted. The order of fields must be the same each time.". That's useful, isn't it? PDO already uses objects and methods and we would need to provide a wrapper for each -- we thought that's not the best.
#52
Changing back to 'code needs work' as in #47.
#53
Regarding INSERT DELAYED: Non-InnoDB tables will work fine with DELAYED, they just won't actually delay. There is no performance cost here, just a performance gain. Non-MySQL databases can ignore the DELAYED flag if they don't care or use whatever their native equivalent is, if it has one.
Regarding using OOP: PDO itself is completely OOP. Making the internal API OOP makes the code easier and cleaner, and more flexible. This is a case where using a table saw instead of a screwdriver makes a great deal of sense. Drupal's coding standards will also need to be clarified to ClassName::methodName(), which is the nearly universal standard for OOP, PHP or otherwise.
Regarding the Fluent API: That was used for 2 reasons: Flexibility and terseness. The original version of the API did take the $fields array directly in the constructor. That was factored out in order to allow for multi-insert statements, which does improve performance. (How it does so will vary with the database, but it should always be a win or at least a wash.) Also, jQuery methods do care about order. Think filter(), find(), etc. We're not going quite that complex.
Regarding duplicateUpdate(), we are going to modify it to specify the key fields to make the operation easier (vis, doesn't require hitting schema API) on non-MySQL databases. I may rename the method.
I agree with Barry about it not being the job of the API layer to fix broken queries.
#54
Barry's having some wireless issues at the code sprint (as are most people), so below are his comments, passed via sneakernet:
# General
Big patch, long review.
This is a read-only review (I have not yet run any code) and I've only gotten through the files listed below after one hash (e.g. # database.inc).
I do not (yet) love the query builder interface but I do not (yet) hate it either. Some general comments:
* My big concern is that having to construct $fields (and possibly $update, though see below) separately from the query-builder logic spreads the information about the query out in a way that will make it harder to interpret code quickly.
* Passing $fields as ($key => $value) requires that the caller construct the array from data it probably already has in non-array form. Then, the called function (e.g. InsertQuery::fields()) then splits it apart again! This seems wasteful. If it is easier/more natural for callers to pass fields and values as varargs (I think it is), and it is less work for member functions to use the args in that format, then why not just do it that way? e.g.
db_insert($id, $table)->fields('foo', 'bar)->values($foo, $bar)->execute();
* I think Henrique's suggestion of direct auto-executing functions is useful because these simple operations are very common:
<?phpdb_insert($id, $table, $fields);
db_update($id, $table, $fields, $conditions);
db_delete($id, $table, $conditions);
?>
* One part I really do hate is anything that forces me to use named parameters. Sure, it's incredibly useful in some situations, but it is unnecessarily verbose for others. But I haven't actually used this API enough yet to have a firm opinion.
includes/schema.inc and includes/schema.*.inc are missing from the patch, so I am highly confident that no one else has reviewed them yet. :-)
Wrap all comments to 76 characters. I can do this easily in emacs if you want.
# default.settings.php
Comments on:
<?php$databases['default']['default'] = $info_array;
$databases['default']['slave'][] = $info_array;
$databases['default']['slave'][] = $info_array;
$databases['extra'] = $info_array;
?>
$databases['default']['default'] seems unnecessary given that $databases['extra'] works. Why not $databases['default'] by itself?
# database.inc
## abstract class DatabaseConnection
function transactions() is not a great name; it is ambiguous. We could use transactionsSupport() or hasTransactions() or a variety of more-clear choises.
In fact, abstract function supportsTransactions() is declared later in the file. So function transactions() is probably just obsolete.
Transaction support is not necessarily a property of a database engine; with mysql, it is a property of the table engine type (right?). I do not know how this works with mixed-engine-table queries/databases. This function may need a more expressive interface.
beginTransaction() is specified to "succeed silently" if transaction support is not available. However, if someone writes code that requires transactions for proper behavior (e.g. they assume that rollBack() will work), "silent success" without transactions is the wrong behavior. beginTransaction() should should accept a boolean, $required = FALSE, which when TRUE says it should return an error code indicating that transactions are not available. That way, the module can refuse to work instead of just breaking.
runQuery(): If the $statements cache stored the un-prefixed version of queries (instead of prefixed, as it now does), prepared statements could be retrieved without the overhead of prefixing first.
prefixTables(). This is just copied from database.inc, but the array_key_exists('default', $db_prefix) block has redundant code and could be improved. OTOH, it works.
query(): @return doc is less informative than runQuery(). Why?
## class Database
fetchAllAssoc(): This method overwrites values if keys are duplicated ($data[$key] = $record). We could easily provide a fetchAllAssocMult() that returns all records in an array: $data[$key][] = $record. Could be useful for multi-object loads.
## interface QueryConditionInterface and class DatabaseCondition
condition($field, $operator = NULL, $value = NULL, $num_args = NULL)
"This method can take a variable number of parameters. If called with two parameters, they are taken as $field and $value with $operator having a value of =."
Please, please, please do not create a new interface with crazy semantics like having the second operator be $value if two are provided or $operator if three are provided. Yuck.
Suggestion: condition($field, $value, $operator = NULL). So you'd say condition('created', time(), '<') for "created before right now". For the common case of "=", there is no change. But we avoid a multiply-interpreted set of arguments.
DatabaseCondition::_DatabaseCondition()
If $process_type (e.g. num_args) is 1, it treats $field as something other than a string. This is not documented behavior.
## class Query
parseWhere() turns a data structure into a string. It should be called unparseWhere() or __toString().
UPDATE and DELETE queries can require joins (I've written Drupal modules that use and depend on this) but this functionality does not appear to be supported. It isn't clear why not; the join code is already in class SelectQuery, it could be moved up to Query (it is meaningless only for Insert which could fail if there are any). Or you could derive {Select,Update,Delete}Query :: JoinQuery :: Query.
## class InsertQuery
If each sub-array in $insertValues is key-value, why must it match the order of fields in $insertFields? I guess you want to be able to do INSERT INTO table (iterate over the fields) VALUES (iterate over the values), but you could also just iterate over $insertFields as keys for the $insertValues block, no?
Actually, the doc says "$insertValues itself is an array of arrays. Each sub-array is an array of field names to values to insert." But actually, $this->insertValues[] = array_values($fields). Am I confused or is the doc wrong?
The paradigm of passing ->fields($fields) requires that the caller assemble an array($key => $value) just so that ->fields() can dis-assemble it and re-assemble it again in a slightly different form. The caller's work is unnatural and inconvenient anyway. So why not just pass an array of fields and an array of values (indexed or named)?
Same applies to UpdateQuery.
## class UpdateQuery
InsertQuery and SelectQuery fields() functions add to the list of fields, but UpdateQuery()'s resets the list of fields. This is incongruent.
The duplicateUpdate() interface is not quite right yet. It is cumbersome and inelegant to have to construct two separate $fields and $update arrays since they almost always have nearly the same content (for keys, that is).
In most cases, an "update or insert" query wants to set all the same fields for update that would have been set by insert, using the primary key for the UPDATE WHERE clause. Sometimes (e.g. for timestamps), the update wants to exclude a small set of fields. Less often (I can't think of an example offhand), the update wants to exclude almost all fields, updating only a couple.
I therefore suggest replacing duplicateUpdate() with two functions:
orUpdate($fields = NULL) performs the fallback update, updating only those fields specified in $fields. If $fields is a string, it is converted to a one-element array. If $fields is empty, all fields passed to fields() are updated.
orUpdateExcept($fields = NULL) performs the fallback update, updating all fields passed to fields() except those in $fields. If $fields is a string, is a converted to a one element array; this is useful because often you just want to skip updating a single field (e.g. timestamp).
For example,
<?phpdb_query("UPDATE {". $table ."} SET data = %b, created = %d, expire =
%d, headers = '%s', serialized = %d WHERE cid = '%s'", $data,
$created, $expire, $headers, $serialized, $cid);
if (!db_affected_rows()) {
@db_query("INSERT INTO {". $table ."} (cid, data, created, expire,
headers, serialized) VALUES ('%s', %b, %d, %d, '%s', %d)", $cid,
$data, $created, $expire, $headers, $serialized);
}
?>
would be replaced with
<?phpdb_insert('cache_set', $table)->fields($fields)->orUpdate()->execute();
?>
and could also be replaced with the more intuitive
<?phpdb_insert('cache_set', $table)->orUpdate()->fields($fields)->execute();
?>
If, however, we decided that when updating a cache entry the 'created' field should remain unchanged, we would write:
<?phpdb_insert('cache_set', $table)->fields($fields)->orUpdateExcept('created')->execute();
?>
With duplicateUpdate() or orUpdate*(), non-mysql database drivers will have to retrieve the schema to determine the primary key(s) in order to generate the appropriate queries. This will take some time. It's great to have this info come automatically from the schema, but since these are for run-time queries we may want to benchmark the difference vs. specifying the keys explicitly to these functions.
## class SelectQuery
Should countQuery() know when just changing to COUNT(*) will not work (e.g. when there is a GROUP BY) and return an error/throw an exception? If/when this supports sub-queries, it could also just be SELECT COUNT(*) (SELECT original query).
fields() takes varargs, incongruent with Insert and Update. Maybe this is inevitable, but above I suggest changing the semantics of fields() for Insert and Update anyway.
Other __constructors take the table to operate on. This one does not, and seems to want a special-case join() call (condition and values ignored) to set the base table. Why not just accept the base table in the constructor? Yes, it would need to accept an optional alias as well.
The "first join sets the base table" makes no sense for leftJoin() and rightJoin(). This is a good reason to set the base table in the constructor.
Should we support full outer joins? Granted, no one ever uses them, but it is easy enough to do.
addJoin() doc does not list $type.
having() docs says: "@param $snippet. A portion of a HAVING clause as a prepared statement." What is "a portion ... of a prepared statement"?
__toString(): appends "$params['type'] .' JOIN '" but $params['type'] already contains "JOIN". So, ummmm, I do not see how any join query can work currently. :-) Like I said, though, this is a read-only review so far.
## db_* and other top-level functions
The comments on db_and() and db_or() are reversed, or I'm confused, in which case they should be clarified to explain the apparent reversal.
db_query() and friends wrapper functions duplicate some query-string-processing code that could be shared.
db_rewrite_sql() is clearly deprecated, but how is node_access going to decide what node queries to modify and which not? You can't just use the fact that node is the primary table; some queries want all nodes regardless of access rules. Can I set $options['node_access'] = TRUE or something if I want the node access rules to apply? Then the node_access module would just look for that option tag.
db_distinct_field(): Does this mean anything? I thought DISTINCT was a query-level property, not a field-level property.
## Schema functions
db_field_names(): This is a Schema data structure operation. It should not be db-specific and does not need to be dispatched. database.inc (or schema.inc) should provide the one implementation.
# database.mysql.inc
## DatabaseConnection_mysql
queryTemporary(): runQuery() runs prefixTables() so why does this function need to also?
escapeTable(): No different than the function it overrides.
execute(): Does not call drupal_alter('query').
__toString(): Remove spaces from ' DELAYED '.
#55
I am going to work on a pgsql driver at code sprint today.
#56
#17.11. "A Fluent API query builder for SELECT statements that weighs in at only a few hundred lines (not counting comments). It doesn't replace the Views query builder, yet, but I will be working with merlinofchaos at DrupalCon and afterward to beef it up to the point that it does."
Adding join info to schema might help us here. Or maybe we can leave that to 'core fields API' and just provide the interface to generate such queries?
#57
I've discovered a bit of a snafu.
Here is one example scenario. BOOTSTRAP_LATE_PAGE_CACHE calls variable_get('cache') which, if the variables cache entry has expired, calls cache_set('variables'). cache_set() calls db_insert()->duplicateUpdate(). On pgsql (or any database without its magic "INSERT ... ON DUPLICATE UPDATE" feature), duplicateUpdate() requires calling drupal_get_schema() to identify the primary key of the table to generate the WHERE clause for the generated UPDATE query. Problem: drupal_get_schema() is defined in common.inc, not loaded until BOOTSTRAP_FULL, but we're only at BOOTSTRAP_LATE_PAGE_CACHE.
We can move drupal_get_schema() into bootstrap.inc so it is loaded earlier, but drupal_get_schema() calls drupal_alter('schema'). drupal_alter() is also defined in common.inc, and also requires all modules to be loaded anyway, so moving the function to bootstrap.inc does not help. drupal_alter('schema') isn't used or documented or used, so we could just yank it.
I did yank it. However, then the UpdateQuery generated for duplicateUpdate() calls drupal_alter('query'). This fails for the same reason drupal_alter('schema') does; drupal_alter() just cannot run during early bootstrap.
This explains, interestingly enough, why the drupal_alter() call in database.mysql.inc's InsertQuery_mysql::execute() is commented out. The reason this problem shows up with pgsql is because its duplicateUpdate() handler calls the normal UpdateQuery class which has a non-commented out call to drupal_alter().
It's late and at the moment I do not have a proposed solution.
#58
One other point: Any query that calls a join method that wants to use the schema to determine the join columns will have the same problem.
#59
Why not just move drupal_alter()? I know that some modules are not loaded and won't have a chance to alter but thats tough luck. they can get loaded earlier by implementing hook_boot() or marking selves for bootstrap=1 in system table.
#60
I was thinking that while waking up this morning. Perhaps it is the answer. After all, modules already can't alter queries that run before they are loaded---or after they are loaded---so there won't be a big change.
The bottom line is that by defining a database system that uses the schema and creating drupal_alter('query'), we're increasing the amount of code that must/might be loaded at BOOTSTRAP_DATABASE. We always wanted to use the schema information for more intelligent query handling anyway , I just hadn't realized this consequence until now.
#61
As Barry suggested earlier, I am planning to refactor duplicateUpdate() to require less schema interaction. I don't know if that will eliminate the issue, but it seems a good idea for non-MySQL performance anyway. That said, if we are going to make more use of the schema then we'll need it earlier. I don't think there's a good way around that other than leveraging dynamic/conditional code loading. (vis, http://drupal.org/node/221964 or something like it. Maybe more class autoload magic?)
I started on that at the code sprint, but got distracted by crappy wireless and an incredibly stupid bug on my part so not a great deal of progress was made, although there was plenty of discussion.
I am also not entirely convinced that we even need drupal_alter() on insert/update/delete queries. For SELECT, it's there to kill db_rewrite_sql(). But do we actually need it for the others? (It was included for consistency, but if it makes life easier to remove it and there are no good use cases for it, I'm fine with dropping it.)
#62
A very useful use-case for query_alter is the one we discussed at the sprint: treating it as a query listener interface. It allows (e.g.) devel module's query log to operate without a special-purpose hack. It also allows us to kill update_sql() but continue to produce the update query log, including *all* queries, not just the ones performed with update_sql().
If we decide allowing non-select queries to be altered isn't a great idea we could certainly rename it to query_listener or somesuch.
#63
Another snafu from the pgsql driver:
The old _db_query_callback() hack converted %d to "(int) array_shift($args)". Unfortunately, "(int) ''" (i.e. "(int) empty-string") evalutes to 0. This allowed bugs in which a query uses %d but the $args value corresponding to the %d is an empty string or FALSE. This bug appears in several places in core (node.module and menu.inc so far) and, presumably, lots of places in contrib.
Now that we are replacing %d (and others) with ?, the translation from empty-value to 0 for %d args is not happening. MySQL appears lenient/sloppy enough to accept the empty string as an integer but PostgreSQL rejects it as an invalid string-encoded integer (which it is).
If we want to provide backwards compatibility for db_query(), we are going to have to update the new db_query() to use something like the old callback system, at least to process %d and %f correctly. I'll have to think about %s and %b. I also observe that %% was previously translated into %, we'll have to fix that, too.
I have not yet tried to fix this.
#64
The alternative, of course, is to make people fix their broken usage of db_query(). The db_query() spec never said you could pass FALSE or empty-string as 0.
#65
+1 on fixing the original query. There's lots of old hacks that we're killing here, such as NULL folding. If we're going to fix it, let's downright fix it for reals. :-)
I suspect in many cases it's easier to fix the original query than rely on the BC hack. The BC hack is temporary and will not be in the final D7 release, so there's no sense putting much effort into every edge case.
#66
Fine with me, though we'll be finding and fixing these bugs for a long time. If we had tests that covered every query that caused every %d argument that can be zero to be zero then we could find them all quickly but, frankly, there is no way we'll have that if we get "100% test coverage."
Here is an example of one such fix, a patch to menu.inc:
<?php- $item['_external'] = menu_path_is_external($item['link_path']) || $item['link_path'] == '<front>';
+ $item['_external'] = (menu_path_is_external($item['link_path']) || $item['link_path'] == '<front>') ? 1 : 0;
?>
Later, $item['_external'] is passed as a %d arg to db_query.
I was thinking of an interesting option this morning. Just as we could write a mock-database driver, we could also write a pedantic-database driver filter that sat in front of a real driver but did all kinds of integrity checking on queries such as making sure the data types of the arguments matched the query fields.
#67
Another snafu. As previously discussed, the pgsql driver needs to call drupal_get_schema() early in the boot process. Even if we stopped supporting drupal_alter('query'), drupal_get_schema() is still necessary to implement the duplicateUpdate() method.
drupal_get_schema() caches the combined schema of all installed modules. When invoked during bootstrap, it caches the limited set of modules that are loaded at that point. During the install process, it seems to cache the tables defined by system.module and nothing else.
A limited schema cache breaks drupal_write_record() because that function fails if it cannot find the table schema. It will presumably break other things as well.
The solution I've thought of so far is to have drupal_get_schema() refuse to create a cache entry if the boot process is less than BOOTSTRAP_FULL. It can still use the schema cache if it exists (since any cached entry will now be a full schema). This will require slight changes to the bootstrap API since it does not currently make the boot phase available. I'll give it a try and see what happens.
#68
Hm. I like the idea of a "Dev" driver that does insane-but-slow checks on data types. I hadn't thought of that as an option, but the swappable drivers should make that reasonably easy to do. We can even base it on MySQL, so that developers can write to MySQL like they're used to but with an overbearing ANSI-strict, type-strict configuration. Sorta like E_ALL for the database layer. Mesa like. :-)
Probably a step 2 task, but still good idea.
For queries like the one above, I'd just go ahead and rewrite it to the new format; don't even bother preserving %d if that takes actual effort. You should have svn access now, so go ahead and check such things in.
#69
Barry, Larry stated multiple times that duplicateUpdate won't require schema. That's good. About condition putting the op to the end, omg. Just OMG, are you sure you want to make people understand (and do) RPN logic in Drupal? I love RPN, I am using RPN calculators since 1989 but... Larry also wanted the same order of arguments and I was more than reluctant to do so.
#70
I have installation and logging in as uid 1 working with the pgsql pdo driver though node/add still whitescreens. Obviously there are some bugs left. :-) Calling drupal_get_schema() and/or drupal_alter() inside the database driver during bootstrap is triggering a lot of subtle bugs that have probably existed since the bootstrap code was added.
duplicateUpdate() absolutely requires schema information unless we require that the caller specify the primary key fields (i.e. the fields to move from the INSERT list to the WHERE clause of the UPDATE). It seems kinda odd to require the caller to specify that information since we just went to all the trouble of creating Schema API to provide exactly that kind of information and capability.
Regarding condition(), I think condition($field, $value, $op) makes more sense that condition($field, $op_unless_you_do_not_provide_a_value_in_which_case_value, $value_if_you_provided_an_op). It isn't really RPN logic, it's an optional argument at the end to override a default value, a very common pattern in Drupal. We could hold a survey but I'd have to reserve the right to ignore responses from people who do not know what they are talking about. :-)
I should have some more time to work on this tomorrow.
#71
Well, currently you do specify the keys when writing INSERT, db_affected_rows and a full UPDATE. Also what if you want to work with a non-Drupal database? I am leaning towards mandating a full blown update statement and then MySQL can ignore the keys.
#72
I'm not prepared to express an opinion on "mandating a full blown update statement" but understand that doing so will resolve NONE of the issues I've reported so far unless we also remove drupal_alter('query') and engage in some cleverness with lastInsertId() (which, granted, we may need to do anyway).
I suggest waiting until I'm done before making substantial changes. It should be soon.
#73
Whether or not we dip into the schema for the INSERT statement on non-MySQL will probably come down to a performance question. I'm not sure how expensive that operation will be; we did already discuss breaking the cache up per-table, which in this case should help considerably. Let's try it using schema and see what the performance is like. If it's icky, we can adjust the API.
As for the order of parameters in condition(), I'm also in favor of the postfix style in that it's more predictable and the code is cleaner. I defer to Dries on this one, though. Dries? :-)
#74
By the way, this code is a perfect use case for unit tests. We have a (in theory) clearly defined API that is easily testable and multiple implementations that are supposed to behave the same way. If the database code is buggy or inconsistent in any way, difficult-to-find bugs will result. IMHO, we should not commit this code until we have thorough unit tests for it.
#75
Since we don't have a place to put unit tests in core yet, especially for non-modules, how do you propose we do that? chx? (I'm not against it, just want to do it right.)
#76
@Barry - we are at the very beginning of the code cycle for D7. I think it is a bit early to deny patches that omit unit tests. We have not even added a single unit test to core yet. Your motivation is sound, but the timing is off IMO. If someone wants to write tests that terrific.
#77
@moshe: Yes, we're at the beginning of the cycle, but I draw the opposite conclusion: This is the perfect time to set the tone that patches require tests. Otherwise, when is the right time? Everyone will say "I'm too busy, I'll write tests later/after code freeze" which means "never." Also, if the tests reveal bugs at that point, it is much harder to change the new APIs, etc. than it is before the patch is committed.
More importantly, as the person most familiar with this code besides crell and chx, I have little confidence that it implements the documented APIs correctly, and I have no confidence that the pgsql driver implements it correctly, because I'm not sure the API docs are even self-consistent. Writing tests will flush that out.
@crell: I do not know if there is a plan for where to put core tests yet. For now we can create a dbtest module and move the tests when they have a correct home.
I'm still hoping to commit the pgsql driver to svn today.
#78
Even if not all patches require tests, this one really should. It will screw up everyone if it is wrong.
#79
All core tests are currently patches to simpletest module for now so they can be brought over in bulk. Somehow I don't think this is applicable for tests for patches which aren't into core yet (although two patches, one in the core queue and one in the simpletest queue seems sane to keep things as centralised as currently possible).
afaik, simpletests aren't required for patches to be committed, but they are required for a code freeze extension. If there's going to be an additional requirement for unit tests on big API-rewrite patches then it'd be good to have this made clear asap. But since we're only a week after testing was officially announced in the first place, it also seems a little early to me.
#80
I was hoping that we can write tests for this one once its in. I can certainly accept that forthcoming big patches need a test -- but then that's part of the process... I just would not like this one held up by the unanswered questions about testing.
#81
(subscribing)
#82
With the attached patch, the pgsql PDO driver correctly handles installation, login as uid 1, posting a node, and posting a comment. There are still known bugs, but I wanted to get something committed. This code is NOT ready to commit, so I'm leaving it as CNW, but reviews are welcome.
If you are following chx's SVN repo, this is revision 357.
## To do items
* I have NOT done any pgsql BLOB handling, bindParam(), etc. stuff. I've noticed that sometimes the 'schema' entry in the cache table gets lost, causing a PDOException. I suspect these are related.
## Recommendations (not yet implemented)
* There should be a central anonymous-placeholder name generator, otherwise we might get conflicts. Consider:
<?phpUPDATE {cache} SET serialized=:db_update_placeholder_0, created=:db_update_placeholder_1, expire=:db_update_placeholder_2, headers=:db_update_placeholder_3, data=:db_update_placeholder_4
WHERE (cid = :db_placeholder_0)
?>
There is not a conflict between SET and WHERE, but there could be. If the placeholders are all just :db_placeholder_NNNN, there will never be a conflict.
* The return value of Query::execute() is not clearly specified and appears to vary by query type. What should InsertQuery::execute() if it performs an UPDATE query instead? Since there is no lastInsertId() to return in this case, the caller cannot rely on the return value being a last insert id (we surely do not want the return value to depend on whether duplicateUpdate() was called).
## Changes to the database layer in this patch
* I fixed the "FALSE/empty-string as %d" problem by making db_query() (and friends) take care of it. I just couldn't deal with fixing queries all over core at this time (there are a lot of them). I did fix a few before changing db_query(), and I left those fixes in.
* Remove abstract protected functions from class DatabaseSchema. Not all schema drivers implement the same internal functions.
* When throwing a PDOException, include the full text of the query and args.
* DatabaseConnection::lastInsertId() requires some special care (surprise!). For mysql, it takes no arguments. For pgsql, it requires the name of a sequence object. Our db_last_insert_id() function takes $table and $field, which identifies the sequence object, but according to the specs you have to pass lastInsertId() an argument or not depending on which driver you are using (not very portable!), so we can't just pass $table and $field to lastInsertId() and let the mysql driver ignore them. In the case of an InsertQuery() object, the object already knows the table and can get the field from the primary key. But in the case of db_last_insert_id(), we have to preserve the information without passing args to lastInsertId(). Therefore, I added a setLastInsertInfo() method to class DatabaseConnection. mysql's does nothing, pgsql's remembers what it needs.
* Add db_driver() method as replacement for $GLOBALS['db_type']. Warning! Many 6.x updates and other code will stop working because this variable is no longer available.
## Core changes in this patch (required for pgsql support)
* Add drupal_get_bootstrap_phase(). drupal_get_schema() needs to not cache the schema of we're not at BOOTSTRAP_FULL.
* Load includes/module.inc in BOOTSTRAP_DATABASE. This is so database queries can call drupal_alter(), drupal_get_schema(), etc. during early bootstrap.
* Move drupal_get_schema() and drupal_alter() to bootstrap.inc. Database queries need these.
* Fixed %d/'' bug in session.inc INSERT and UPDATE query.
* In module_list(), initialize static $list = array(). Otherwise, if module_list(FALSE) is called before module_list(TRUE), NULL is returned instead of an array() and foreach throws an E_WARNING. This happens (e.g.) when a query calls drupal_alter() during early bootstrap.
* In _drupal_bootstrap_full(), rebuild the module_implements hook cache. During bootstrap (perhaps only during install.php), the hook cache is limited to system and filter modules before BOOTSTRAP_FULL and needs to be reset once new modules are loaded.
* Fix two %d/FALSE bugs in menu.inc queries.
* Fix three %d/'' bugs in node.module.
* Cache a fully-build schema during install.php so schema-based queries can work on a brand new site before anyone visits the modules page.
#83
Addenda:
1. When I said, "There are still known bugs, but I wanted to get something committed" I meant "committed to chx's SVN repo.
2. I did not attach I patch to the previous comment. I meant to say "With the SVN commit I'm about to make...". If you want to see the changes I made to Larry+chx's work, use svn diff.
3. To this comment I am attaching a complete patch against HEAD for the most recent version of the PDO changes including my pgsql+related stuff. Note that this is Larry's issue so you should not take this as an authoritative patch, I'm just being helpful, and the code needs work anyway so it is not like this will get committed. Unlike the previous patch, this one actually contains all the files you need to use it. :-)
#84
I still have to do a proper review but I decided to hold that off for a while -- Barry's review is excellent and I'm happy to let him drive this for a while.
Here are a couple of thoughts that might help decision making and debates:
* Delayed inserts are useful. Let's keep it.
* Supporting multi-insert statements is useful. On MySQL it can speed up things by a factor 20. Let's keep it -- but let try to document it better. It took me a while to figure this one out.
* For me, writing
db_insert($id, $table)->fields('foo', 'bar')->values($foo, $bar)->execute();is more natural but also more prone to errors. I might likedb_insert($id, $table)->fields('foo' => $foo, 'bar' => $bar)->execute();better as I'd make fewer mistakes. Still undecided about this one.#85
I just checked in a "dbtest" module to the svn repository. It contains no code aside from the start of unit tests for the new system. I've written test cases for basic fetches from db_query(). Lots more unit tests to come. We can use them for testing the patch and additional drivers, and then move them to the SimpleTest module or wherever later.
@Dries: I'm trying to minimize the use of variable arg count, as it's ugly and harder to document and debug and makes adding optional parameters that actually are useful impossible. I also want to minimize the number of different ways a single method can be called, for the same reason. :-)
Any input on the condition() method interface?
I'll see what I can do to better document multi-insert once I'm done tweaking the API on that. :-)
#86
@dries:
fields('foo' => $foo)is syntactically invalid; you can't use => outside an array.fields(array('foo' => $foo))is how the current interface works. Allowing an inline-array with fields() as above while still supporting duplicateUpdate() is the purpose behind my proposed changes to that interface (which I think Larry agrees with).@crell: re: dbtest, excellent! I'll add some tests when I can.
#87
One question, I suppose, is how we want to handle mixing multi-insert and "ON DUPLICATE UPDATE". I'm not sure off hand how MySQL mixes them; what another database would do, I'm not sure either.
@bjaspan: db_last_insert_id() is vestigial, and should be removed as part of phase 2. The last inserted ID will be returned from db_insert(), or should be. (If it isn't now, that's a bug that I will fix.) I wouldn't spend too much time on it.
#88
Multi inserts on other databases are inserts in a transaction, and if there is a unique key collision, I guess the transaction gets rolled back. This is just a guess. MySQL has tricky handling of this see MySQL handbook on
VALUES(). Definitely a followup patch.#89
@crell: What should db_insert()->orUpdate() return if it actually performs an update? The id of the updated row? Should the caller be able to tell whether an insert or update occurred?
db_last_insert_id(), db_query(), etc. may be vestigial but the contrib update to D7 will be a lot slower if we yank them before D7 is released. This will be a decision the community has to make. We shouldn't pre-determine that they have to be yanked by not supporting them now. (And in any case, today's code uses it, so it needs to work for now.)
#90
We've already determined that db_fetch_object() is dying (see the note in the code to that effect); it's there just as part of the temporary BC layer to avoid a 1 MB patch. Same for db_last_insert_id(). Supporting those long-term complicates the API and creates too many ways of doing something, all of which people will need to know. (PDO already offers at least 3 ways to iterate the result set; we don't need more).
As for what db_insert()->orUpdate() returns... excellent question! :-) I'm not sure. I'll need to play with the code a bit more.
#91
@chx:
According to ANSI SQL, that's correct.
#92
Random musings:
The "abstract" classes in database.inc, such as InsertQuery and UpdateQuery, perform some non-trivial computations on some method arguments. For example, InsertQuery::fields() splits its $fields array into keys and values and loses fidelity of the original data by storing "$fieldname=$placeholder" instead of just $fieldname.
This is a problem for drivers (like pgsql) that need to walk through the field values and do something to them based on the scheme (such as blob handling). Of course, the driver can override the method and store the original data (which is what pgsql currently does) but then it has to choose between calling parent:: which it knows to be all/mostly useless or not calling it and risking losing important functionality that may get added there in the future.
I am thinking that most of the work performed in these abstract classes should not be. The classes should either be pure interfaces or the methods should just store the raw input data as received. execute() or similar can perhaps serialize things into SQL as a "generic" implementation but doing so earlier (e.g. creating the placeholder variables, etc. in fields() or values()) I think is counter-productive. It makes it more confusing, not less, to create a driver.
I'm working on proper blob handling for pgsql. I'm having to override most everything in Query and, I recently realized, DatabaseCondition as well. When I'm done, we'll have a clearer idea of what the abstract classes should do.
#93
Some thoughts on code style.
As we are changing the db_*() functions interfaces drastically, wouldn't it make sense to make them static class methods for consistency? So, for example, instead of:
<?phpdb_insert('variable_set', 'variable')->fields($fields)->execute();
?>
it would be:
<?phpDatabase::insert('variable_set', 'variable')->fields($fields)->execute();
?>
The consistencies I'm talking about here are a) it would not mix functional code with class based ones, and b) the line of code would be entirely Camel-cased.
#94
@bjaspan: You may be right. I was coming at it from a MySQL-centric view, since that's what I know. However, I'm concerned about making the specific driver implementations too complicated. Not all drivers will need field handling, for instance.
I did run into that issue when adding another line to InsertQuery::execute(), actually, since MySQL does override that. Part of me is wondering if we shouldn't have some sort of wrapping method. e.g., InsertQuery::execute() which does stuff and then calls InsertQuery::_execute(), which is then driver-specific. That adds a method call, though, so I'm not sure if that's a great idea. Hum.
@recidive: I considerered just exposing the static methods directly, but decided against it. One, we're not getting rid of db_query*(), so it doesn't make sense for us to not have the other wrappers. Two, Drupal folk are used to functions, and there's nothing inherently wrong with function-factories. Three, db_insert() is shorter to type than Database::insert(). :-)
@all but particularly bjaspan: I just checked in some more unit tests on insert queries. Writing more as we speak. I also tweaked the API a bit to allow for insert queries to be re-used. That is, you can call ->fields() to define the fields, then inside a loop call ->values() some number of times, execute(), then call values() some more and execute() again, etc, all on the same query object. That should make importers really fast. :-)
Writing the tests is really helping me think out the API more. Yay for testing!
#95
@crell: I actually find the way you are assembling placeholders, etc. quite confusing as it is. :-) I realize it is a tricky problem since you need to create placeholders separately for fields()/values(), conditions, etc.
Something I may need to do is iterate through the UPDATE SET fields (at least), identify which ones are blobs, and use bindParam on them. I'm not sure but I think this means I need to use bindParam on everything for the query. So that means I also need to be able to iterate through the condition fields and bindParam them, too. It was at this point that I punted.
Yes, yay for testing, for exactly the reason you said. Be sure to test the multi-insert functionality; I'm pretty sure I did not understand your data structure for that at first and got it wrong in the currently committed code.
#96
My latest understanding of the bootstrap snafus I mentioned:
* The earliest queries Drupal performs are reads from the variable and cache tables. This is before any modules are loaded.
* The cache tables use blobs.
* Databases with special blob handling need access to the schema in order to know that cache tables use blobs.
* The schema structure lives in the cache but that requires access to the cache table to learn that the cache table uses blobs.
* The schema structure also lives in hook_schema(), but no modules are loaded yet.
* Catch-22.
In pgsql's case, we can apparently SELECT from a blob column without knowing it, so we get around the catch-22. For other databases, that trick may not work.
What this means is that if drupal_get_schema() cannot retrieve a schema from the database and no modules are loaded yet, it needs an emergency fallback of loading system.install (perhaps all required-modules .install), calling those hook_schema()s, and returning the results.
#97
I also ran into an issue last night with the orUpdate() logic. If we only specify fields, then we can't do things like incremental counters. (ON DUPLICATE UPDATE counter=counter+1) So orUpdate() will have to take a complete field/value array(s). orUpdateExcept() we can probably get away with just the field names, though.
Regarding blobs, two possibilities: 1) Does the cache table really need to use blob, or could it use TEXT? 2) Since Dries wants us to do streaming blobs eventually, which I'm fairly certain would require special handling anyway, perhaps we should just go ahead and do a different syntax for blobs now? (Not a %b, but something in the SELECT query builder.)
Would any of the Oracle/MS SQL/DB2/SQLite folks know how much trouble BLOBs are going to be on those databases?