Problem / Motivation

Field API always creates revision db field tables even if revisions are unused. As a result every field table in database has the exact same content as its revision table. field_data_xxxx is the same as field_revision_xxx, and the data inside is identical.

)

@alexpott has repeatedly told me (yched) he would love to see the revision tables disappear, and I think I'd support that. Just haven't been able to prioritize that on my own todo list...

Previously discusssed for 8x here #1279440: Deeply decouple revisions from Entity API and Node module

Tech Description
After #1497374: Switch from Field-based storage to Entity-based storage, entity types are in charge of storing field data, and DatabaseStorageController provides storage in SQL tables in a similar way to D7 - two tables per field, one for the "current revision" data, one for the "other revisions" data.

- At least it seems we should stop bothering with revision tables on entity types that are not revisionable :-)
Problem: "revisionable" is not guaranteed to always stay the same. It's alterable in hook_entity_info_alter(), so if our base classes provide generic support for entity revisioning (I don't remember if they do right now, but they could and ideally they would), "revisionability" can just come and go. Even without this generic support, an entity type can add revision support in a point releasse of the contrib module that provides it.

Advantages of resolution
Decrease by half SQL writes for nodes. Here is a case study that shows node write performance doubles

Comments

catch’s picture

I'm sure there's an issue for this somewhere already, but can't find it at the moment.

iirc the original reason for having two tables was this:

- queries with conditions/sorts on field tables are going to mean temporary tables/filesorts

- having smaller tables for current values means less rows to go through.

http://posulliv.github.io/2013/01/08/norevisions-field/ is for write performance, which is nice if it's better, but not the same thing at all.

So before doing this, I'd want to see a View or EntityQuery with conditions/sorts on a couple of different fields or more, then a dataset of say 20,000 nodes and 100,000 revisions, and compare the EXPLAIN and/or benchmarks for executing that query.

The way to fix that performance issue properly would be to do automated denormalization of those queries. The last work I'm aware of on that was in 2011 though https://groups.drupal.org/node/145174

jibran’s picture

Issue tags: +Field API, +Entity Field API

Tagging.

giorgio79’s picture

Here is the original issue #1279440: Deeply decouple revisions from Entity API and Node module
Thx yched for summarizing the issue so well.

@catch
Field table and a field revision table are identical if revisioning is turned off. In this case, a field revision table is not smaller in any way than its field counterpart, so views read queries on a field or a field revision table should have no difference. Write query performance could be doubled by not creating revisioning tables, and mysql cache can be halfed if we would not have those "duplicate"tables on entities without revisioning enabled.

Field revisioning tables are ok if you have enabled revisioning, but not when you have revisioning disabled. So, I think it is ok to have revisioning tables when someone enabled them.

yched’s picture

Field table and a field revision table are identical if revisioning is turned off.

Revisioning can't be turned off. An entity type does or does not support revisions, that's all.
If it does, revisions can be created by anyone with the sufficient permissions, there's no switch. And then both tables are really different, and yes, as @catch points, querying performance is very likely to differ, especially for queries that join with conditions on one or more of those tables from the base table.

The 1st proposal in the OP is about "only bother with revision tables for entity types that do support revisions", but as explained here, even that property is not immutable and may vary over time.
So yes, writing the same data in two tables when the entity type is not revisionable is kind of absurd and adds needless write costs - but we cannot live-adapt our storage to react to "the entity type is / isn't revisionable".

giorgio79’s picture

Thx yched much appreciated. For some reason, I thought we have some kind of switch at admin/structure/types/manage/mycontenttype . I guess it was just a dream I mistook for reality over time :) I don't want to derail this issue, but adding a proposal for a followup issue:

It would be a feature request then to enable site admins to manually enable / disable revisioning for entities that support revisioning to avoid unnecessary overhead. (controlling revisionability)
Disabling would offer the deletion of all revisioing collected so far, and enabling would create a copy of the field table into field_revision table (or this can happen when a new revision is actually created one by one).

plach’s picture

This issue is pretty close to an area I will hopefully start to work on soon: as briefly noted in #2068325: [META] Convert entity SQL queries to the Entity Query API, D8MI is planning to implement an enhanced SQL storage controller which will provide full multilingual support for base fields of all core entities at storage level. The plan is have it ready ASAP and then replace the current one with it, as soon as we are done with the entity query migration.

Our current storage controller supports 4 modes for storing base fielads:

  1. base: simple entities are not revisionable nor translatable, hence they just need a base entity.
  2. rev: revisionable entities need an additional entity_revision table.
  3. mul: multilingual entities need an additional entity_field_data table to store field translations.
  4. mulrev: multilingual and revisionable entities need an entity_revision, an entity_field_data and an entity_field_revision table (see also #2057401: Make the node entity database schema sensible).

Currently these modes are hardcoded in the entity definitions, depending on the entity business-logic requirements. One of my goals for the enhanced SQL storage controller is being able to switch among these modes: since it will need to be able to generate all the tables, not just the configurable field ones, switching among the 4 table layouts above when there is no data should be trivial. Implementing a migration code, making it possible to switch from one mode to another when data is already stored in the tables, should be feasible but might be contrib material, it depends on how much time we will have to work on that.

That said, I think that the best way to address this would be to provide field revision tables when the entity has revision support enabled and avoid them in other cases. We could do the same with translation support and avoid the langcode column for monolingual entity types.

Thoughts?

yched’s picture

Going from "not revisionable" to "revisionable" means creating one new table per field, and copying all the data - times all fields in all bundles of the entity type. Not really something you can do in a form submit or in any other non-batched request.
Going from "not multilingual" to "multilingual" is not too clear, but it seems like it involves heavy data copy as well...

So I'm not sure how support for live reconfiguration of the schemas would work - be it backed on the UI or on the code (entity type definition). That smells like CCK's infamous dynamic db reshuffling & data juggling - only worse, since at the entity level instead of at the level of individual fields separately :-/.

Side note: I now finally understand what the 'entity_test_mul / entity_test_mulrev' entity types we have in core mean, but it's totally obscure that 'mul' stands for multilingual...

plach’s picture

So I'm not sure how support for live reconfiguration of the schemas would work - be it backed on the UI or on the code (entity type definition). That smells like CCK's infamous dynamic db reshuffling & data juggling - only worse, since at the entity level instead of at the level of individual fields separately :-/.

As I said, the initial goal is supporting switching between modes only when no data has been created yet, which is how core Field API deals with settings that have an impact on the storage, correct? I didn't fully think through the actual triggering process (UI vs API), but being able to switch among those will certainly bring a perfomance gain when some of the features they allow for are not needed.

Implementing a migration reliably might be way harder and is indeed something like the CCK stuff, but I don't think that's the crucial part of the proposal. Instead we could certainly prevent a switch when data is available, at least in core.

yched’s picture

the initial goal is supporting switching between modes only when no data has been created yet, which is how core Field API deals with settings that have an impact on the storage, correct? I didn't fully think through the actual triggering process (UI vs API).

The major difference is that field settings are entirely on the config side, so you can raise an exception when someone attempts to change them. Whether an entity type is revisionable or translatable currently lives in the entity type definition (@EntityType annotations or hook_entity_info_alter()), that is, in code. How do we prevent those from changing when we don't want them to ?

Also, a contrib entity type not being revisionable in release 1.0 and becoming revisionable in release 1.1 is a totally valid use case.

Currently, every entity type has to explicitly define the associated storage tables for its use case: revisionable ? translatable ? what are the base fields ? (all of those things that are defined in code)
That's a non-minor burden, but it also makes the responsibilities clear - build your own schemas correctly according to your stated features, and write the correct hook_update_N() to update your schemas if you change your features.
If we switch to "DatabaseStorage takes care of building your schemas magically", then who's responsible for the updates ?

Bottomline is: dynamic storage in SQL is hard :-/.
- base tables for entity types are currently not dynamic, and are the responsibility of the entity type to define and maintain.
- the only thing that is - by definition - dynamic is the storage of configurable fields, and it assumes the host entity type is always revisionable & translatable.

The only way I can think right now to make "automating" sustainable would be to make "revisionable / translatable" not inherent to the entity type and move them to config. Meaning: all entity types are revisionable / translatable in theory, and whether they actually are is a matter of configuration.
Only then we can observe the changes and block them in the cases we cannot support (e.g. you can't make an entity type revisionable once there are existing entities) - the way Field API does for field settings.
Seems like a non minor effort though :-/. Means UI work, new config entities to store the settings, in addition to the code to automatically generate the correct schemas.

yched’s picture

Issue summary: View changes

grammar

andypost’s picture

Issue summary: View changes

Beta is out, this one could be closed

giorgio79’s picture

@andypost how having a beta solves this issue?

geerlingguy’s picture

Version: 8.0.x-dev » 8.1.x-dev

More appropriately, this should be set to 8.1.x. I don't think a refactor of the scale being proposed here would be considered release-blocking for 8.0.x.

giorgio79’s picture

Issue tags: +Performance

Adding performance tag

As yched mentioned in #4

So yes, writing the same data in two tables when the entity type is not revisionable is kind of absurd and adds needless write costs

Berdir’s picture

As commented in #1279440: Deeply decouple revisions from Entity API and Node module, entity *types* that don't support revisions (like user and taxonomy terms) already don't get the revision tables anymore.

So that part of the issue is resolved, and that's the part the other issue talked about.

The second part is restructuring the storage on how revisions are actually stored. That's still open but I don't think that will happen in 8.x as it would be a huge change. But contrib can experiment here by using different storage implementations.

David Strauss’s picture

queries with conditions/sorts on field tables are going to mean temporary tables/filesorts

That's generally correct if the assumption is that a join to a "field revisions" table needs to sort to get the latest revision. However, it would be possible to retain performance by either (1) always identifying the latest revision by a number like "0" or (2) adding a column like latest_revision that's set to TRUE for the latest one. That way, the join can be on the nid AND revision=0, for example. As long as there's an index on (nid, revision), it would work.

having smaller tables for current values means less rows to go through.

This is true in terms of data locality. There's a higher chance that contiguous data in a table without old revisions is useful -- and less chance in a table littered with old revisions.

yched’s picture

it would be possible to retain performance [of the typical join to a "field revisions" table if it contained all revisions] by either (1) always identifying the latest revision by a number like "0" or (2) adding a column like latest_revision that's set to TRUE for the latest one

Both approaches mean we'd need to update the rows of the "previous latest" revision when a new revision is created ?

Michael_Lessard_micles.biz’s picture

Greetings,

Just to mention what may be obvious : D6 and prior only stored the current title/body/teaser in the revision table and only created a new revision if requested by the settings.

The more complex D7 logic increases the size of the database, which is probably not an issue really [though a lot of arguments to the contrary here: https://www.drupal.org/node/1279440 ].

Both methods work of course, though the D7 method seems safer (even if some odd database writing error occurred, the previous version remains in database). The D7 has another main advantage: the Admins have a copy of the original or re-written node easily accessible to work with in case a user makes an error (i.e. without needing to use filed backups), even if the Create Revisions is off.

For migration (D6 to D7; which I am doing right now), it is a bit more complex.

So though I prefer the D6 simplicity, rationally it seems like the D7 structure is safer/better.

EDIT : my assumption that the new structure keeps the original title/body/teaser is wrong. I tested and oddly all D7 content related tables (node, node_revision, field_data_body, field_revision_body) contain only the most recent title/body/teaser. Probably because the Revisions are off on my set-up, still this means the database has two copies of the titles and the body also is doubled.

giorgio79’s picture

Issue summary: View changes

Updating IS

giorgio79’s picture

Issue summary: View changes

Version: 8.1.x-dev » 8.2.x-dev

Drupal 8.1.0-beta1 was released on March 2, 2016, which means new developments and disruptive changes should now be targeted against the 8.2.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.2.x-dev » 8.3.x-dev

Drupal 8.2.0-beta1 was released on August 3, 2016, which means new developments and disruptive changes should now be targeted against the 8.3.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.3.x-dev » 8.4.x-dev

Drupal 8.3.0-alpha1 will be released the week of January 30, 2017, which means new developments and disruptive changes should now be targeted against the 8.4.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.4.x-dev » 8.5.x-dev

Drupal 8.4.0-alpha1 will be released the week of July 31, 2017, which means new developments and disruptive changes should now be targeted against the 8.5.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.5.x-dev » 8.6.x-dev

Drupal 8.5.0-alpha1 will be released the week of January 17, 2018, which means new developments and disruptive changes should now be targeted against the 8.6.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.6.x-dev » 8.7.x-dev

Drupal 8.6.0-alpha1 will be released the week of July 16, 2018, which means new developments and disruptive changes should now be targeted against the 8.7.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.7.x-dev » 8.8.x-dev

Drupal 8.7.0-alpha1 will be released the week of March 11, 2019, which means new developments and disruptive changes should now be targeted against the 8.8.x-dev branch. For more information see the Drupal 8 minor version schedule and the Allowed changes during the Drupal 8 release cycle.

Version: 8.8.x-dev » 8.9.x-dev

Drupal 8.8.0-alpha1 will be released the week of October 14th, 2019, which means new developments and disruptive changes should now be targeted against the 8.9.x-dev branch. (Any changes to 8.9.x will also be committed to 9.0.x in preparation for Drupal 9’s release, but some changes like significant feature additions will be deferred to 9.1.x.). For more information see the Drupal 8 and 9 minor version schedule and the Allowed changes during the Drupal 8 and 9 release cycles.

plach’s picture

plach’s picture

Status: Active » Closed (outdated)

Actually, I think we can close this issue as obsolete, given that with the new entity storage revision field tables are not created for non-revisionable entity types.