Reconsider the separate field revision data tables [#2083451]

Problem / Motivation

Field API always creates revision db field tables even if revisions are unused. As a result every field table in database has the exact same content as its revision table. field_data_xxxx is the same as field_revision_xxx, and the data inside is identical.

Some people have advocated that the separate revision tables add more pain than gain.
See comment #136 in the "entity storage" issue,
http://posulliv.github.io/2013/01/07/bench-field-storage/
http://posulliv.github.io/2013/01/08/norevisions-field/, based on @alexpott's own https://drupal.org/project/field_sql_norevisions :-

)

@alexpott has repeatedly told me (yched) he would love to see the revision tables disappear, and I think I'd support that. Just haven't been able to prioritize that on my own todo list...

Previously discusssed for 8x here #1279440: Deeply decouple revisions from Entity API and Node module

Tech Description
After #1497374: Switch from Field-based storage to Entity-based storage, entity types are in charge of storing field data, and DatabaseStorageController provides storage in SQL tables in a similar way to D7 - two tables per field, one for the "current revision" data, one for the "other revisions" data.

- At least it seems we should stop bothering with revision tables on entity types that are not revisionable :-)
Problem: "revisionable" is not guaranteed to always stay the same. It's alterable in hook_entity_info_alter(), so if our base classes provide generic support for entity revisioning (I don't remember if they do right now, but they could and ideally they would), "revisionability" can just come and go. Even without this generic support, an entity type can add revision support in a point releasse of the contrib module that provides it.

Advantages of resolution
Decrease by half SQL writes for nodes. Here is a case study that shows node write performance doubles

Comments

Comment #1

catch

he/him

English

commented 6 September 2013 at 15:09

I'm sure there's an issue for this somewhere already, but can't find it at the moment.

iirc the original reason for having two tables was this:

- queries with conditions/sorts on field tables are going to mean temporary tables/filesorts

- having smaller tables for current values means less rows to go through.

http://posulliv.github.io/2013/01/08/norevisions-field/ is for write performance, which is nice if it's better, but not the same thing at all.

So before doing this, I'd want to see a View or EntityQuery with conditions/sorts on a couple of different fields or more, then a dataset of say 20,000 nodes and 100,000 revisions, and compare the EXPLAIN and/or benchmarks for executing that query.

The way to fix that performance issue properly would be to do automated denormalization of those queries. The last work I'm aware of on that was in 2011 though https://groups.drupal.org/node/145174

Comment #2

jibran

he/him

Urdu

Toronto, Canada

commented 6 September 2013 at 16:16

Issue tags:

+Field API, +Entity Field API

Tagging.

Comment #3

giorgio79 commented 7 September 2013 at 18:42

Here is the original issue #1279440: Deeply decouple revisions from Entity API and Node module
Thx yched for summarizing the issue so well.

@catch
Field table and a field revision table are identical if revisioning is turned off. In this case, a field revision table is not smaller in any way than its field counterpart, so views read queries on a field or a field revision table should have no difference. Write query performance could be doubled by not creating revisioning tables, and mysql cache can be halfed if we would not have those "duplicate"tables on entities without revisioning enabled.

Field revisioning tables are ok if you have enabled revisioning, but not when you have revisioning disabled. So, I think it is ok to have revisioning tables when someone enabled them.

Comment #4

yched commented 7 September 2013 at 20:23

Field table and a field revision table are identical if revisioning is turned off.

Revisioning can't be turned off. An entity type does or does not support revisions, that's all.
If it does, revisions can be created by anyone with the sufficient permissions, there's no switch. And then both tables are really different, and yes, as @catch points, querying performance is very likely to differ, especially for queries that join with conditions on one or more of those tables from the base table.

The 1st proposal in the OP is about "only bother with revision tables for entity types that do support revisions", but as explained here, even that property is not immutable and may vary over time.
So yes, writing the same data in two tables when the entity type is not revisionable is kind of absurd and adds needless write costs - but we cannot live-adapt our storage to react to "the entity type is / isn't revisionable".

Comment #5

giorgio79 commented 8 September 2013 at 14:05

Thx yched much appreciated. For some reason, I thought we have some kind of switch at admin/structure/types/manage/mycontenttype . I guess it was just a dream I mistook for reality over time :) I don't want to derail this issue, but adding a proposal for a followup issue:

It would be a feature request then to enable site admins to manually enable / disable revisioning for entities that support revisioning to avoid unnecessary overhead. (controlling revisionability)
Disabling would offer the deletion of all revisioing collected so far, and enabling would create a copy of the field table into field_revision table (or this can happen when a new revision is actually created one by one).

Comment #6

plach

he/him

Italian

Venezia

commented 8 September 2013 at 22:24

This issue is pretty close to an area I will hopefully start to work on soon: as briefly noted in #2068325: [META] Convert entity SQL queries to the Entity Query API, D8MI is planning to implement an enhanced SQL storage controller which will provide full multilingual support for base fields of all core entities at storage level. The plan is have it ready ASAP and then replace the current one with it, as soon as we are done with the entity query migration.

Our current storage controller supports 4 modes for storing base fielads:

base: simple entities are not revisionable nor translatable, hence they just need a base entity.
rev: revisionable entities need an additional entity_revision table.
mul: multilingual entities need an additional entity_field_data table to store field translations.
mulrev: multilingual and revisionable entities need an entity_revision, an entity_field_data and an entity_field_revision table (see also #2057401: Make the node entity database schema sensible).

Currently these modes are hardcoded in the entity definitions, depending on the entity business-logic requirements. One of my goals for the enhanced SQL storage controller is being able to switch among these modes: since it will need to be able to generate all the tables, not just the configurable field ones, switching among the 4 table layouts above when there is no data should be trivial. Implementing a migration code, making it possible to switch from one mode to another when data is already stored in the tables, should be feasible but might be contrib material, it depends on how much time we will have to work on that.

That said, I think that the best way to address this would be to provide field revision tables when the entity has revision support enabled and avoid them in other cases. We could do the same with translation support and avoid the langcode column for monolingual entity types.

Thoughts?

Comment #7

yched commented 8 September 2013 at 23:21

Going from "not revisionable" to "revisionable" means creating one new table per field, and copying all the data - times all fields in all bundles of the entity type. Not really something you can do in a form submit or in any other non-batched request.
Going from "not multilingual" to "multilingual" is not too clear, but it seems like it involves heavy data copy as well...

So I'm not sure how support for live reconfiguration of the schemas would work - be it backed on the UI or on the code (entity type definition). That smells like CCK's infamous dynamic db reshuffling & data juggling - only worse, since at the entity level instead of at the level of individual fields separately :-/.

Side note: I now finally understand what the 'entity_test_mul / entity_test_mulrev' entity types we have in core mean, but it's totally obscure that 'mul' stands for multilingual...

Comment #8

plach

he/him

Italian

Venezia

commented 8 September 2013 at 23:28

So I'm not sure how support for live reconfiguration of the schemas would work - be it backed on the UI or on the code (entity type definition). That smells like CCK's infamous dynamic db reshuffling & data juggling - only worse, since at the entity level instead of at the level of individual fields separately :-/.

As I said, the initial goal is supporting switching between modes only when no data has been created yet, which is how core Field API deals with settings that have an impact on the storage, correct? I didn't fully think through the actual triggering process (UI vs API), but being able to switch among those will certainly bring a perfomance gain when some of the features they allow for are not needed.

Implementing a migration reliably might be way harder and is indeed something like the CCK stuff, but I don't think that's the crucial part of the proposal. Instead we could certainly prevent a switch when data is available, at least in core.

Comment #9

yched commented 9 September 2013 at 16:10

the initial goal is supporting switching between modes only when no data has been created yet, which is how core Field API deals with settings that have an impact on the storage, correct? I didn't fully think through the actual triggering process (UI vs API).

The major difference is that field settings are entirely on the config side, so you can raise an exception when someone attempts to change them. Whether an entity type is revisionable or translatable currently lives in the entity type definition (@EntityType annotations or hook_entity_info_alter()), that is, in code. How do we prevent those from changing when we don't want them to ?

Also, a contrib entity type not being revisionable in release 1.0 and becoming revisionable in release 1.1 is a totally valid use case.

Currently, every entity type has to explicitly define the associated storage tables for its use case: revisionable ? translatable ? what are the base fields ? (all of those things that are defined in code)
That's a non-minor burden, but it also makes the responsibilities clear - build your own schemas correctly according to your stated features, and write the correct hook_update_N() to update your schemas if you change your features.
If we switch to "DatabaseStorage takes care of building your schemas magically", then who's responsible for the updates ?

Bottomline is: dynamic storage in SQL is hard :-/.
- base tables for entity types are currently not dynamic, and are the responsibility of the entity type to define and maintain.
- the only thing that is - by definition - dynamic is the storage of configurable fields, and it assumes the host entity type is always revisionable & translatable.

The only way I can think right now to make "automating" sustainable would be to make "revisionable / translatable" not inherent to the entity type and move them to config. Meaning: all entity types are revisionable / translatable in theory, and whether they actually are is a matter of configuration.
Only then we can observe the changes and block them in the cases we cannot support (e.g. you can't make an entity type revisionable once there are existing entities) - the way Field API does for field settings.
Seems like a non minor effort though :-/. Means UI work, new config entities to store the settings, in addition to the code to automatically generate the correct schemas.

Comment #9.0

yched commented 9 September 2013 at 16:10

Issue summary:

View changes

grammar

Comment #10

andypost

he/him

Russian

commented 9 October 2014 at 02:43

Issue summary:

View changes

Beta is out, this one could be closed

Comment #11

giorgio79 commented 1 December 2014 at 18:50

@andypost how having a beta solves this issue?

Comment #12

geerlingguy commented 1 December 2014 at 19:54

Version:

8.0.x-dev

» 8.1.x-dev

More appropriately, this should be set to 8.1.x. I don't think a refactor of the scale being proposed here would be considered release-blocking for 8.0.x.

Comment #13

giorgio79 commented 3 December 2014 at 14:37

Issue tags:

+Performance

Adding performance tag

As yched mentioned in #4

So yes, writing the same data in two tables when the entity type is not revisionable is kind of absurd and adds needless write costs

Comment #14

berdir

German

Switzerland

commented 3 December 2014 at 15:04

As commented in #1279440: Deeply decouple revisions from Entity API and Node module, entity *types* that don't support revisions (like user and taxonomy terms) already don't get the revision tables anymore.

So that part of the issue is resolved, and that's the part the other issue talked about.

The second part is restructuring the storage on how revisions are actually stored. That's still open but I don't think that will happen in 8.x as it would be a huge change. But contrib can experiment here by using different storage implementations.

Comment #15

david strauss

he/him

commented 11 March 2015 at 22:06

queries with conditions/sorts on field tables are going to mean temporary tables/filesorts

That's generally correct if the assumption is that a join to a "field revisions" table needs to sort to get the latest revision. However, it would be possible to retain performance by either (1) always identifying the latest revision by a number like "0" or (2) adding a column like latest_revision that's set to TRUE for the latest one. That way, the join can be on the nid AND revision=0, for example. As long as there's an index on (nid, revision), it would work.

having smaller tables for current values means less rows to go through.

This is true in terms of data locality. There's a higher chance that contiguous data in a table without old revisions is useful -- and less chance in a table littered with old revisions.

Comment #16

yched commented 11 March 2015 at 22:24

it would be possible to retain performance [of the typical join to a "field revisions" table if it contained all revisions] by either (1) always identifying the latest revision by a number like "0" or (2) adding a column like latest_revision that's set to TRUE for the latest one

Both approaches mean we'd need to update the rows of the "previous latest" revision when a new revision is created ?

Comment #17

michael_lessard_micles.biz commented 4 August 2015 at 00:08

Greetings,

Just to mention what may be obvious : D6 and prior only stored the current title/body/teaser in the revision table and only created a new revision if requested by the settings.

The more complex D7 logic increases the size of the database, which is probably not an issue really [though a lot of arguments to the contrary here: https://www.drupal.org/node/1279440 ].

Both methods work of course, ~~though the D7 method seems safer (even if some odd database writing error occurred, the previous version remains in database)~~. The D7 has another main advantage: the Admins have a copy of the original or re-written node easily accessible to work with in case a user makes an error (i.e. without needing to use filed backups), even if the Create Revisions is off.

For migration (D6 to D7; which I am doing right now), it is a bit more complex.

So though I prefer the D6 simplicity, rationally it seems like the D7 structure is safer/better.

EDIT : my assumption that the new structure keeps the original title/body/teaser is wrong. I tested and oddly all D7 content related tables (node, node_revision, field_data_body, field_revision_body) contain only the most recent title/body/teaser. Probably because the Revisions are off on my set-up, still this means the database has two copies of the titles and the body also is doubled.

Comment #18

giorgio79 commented 27 August 2015 at 13:37

Issue summary:

View changes

plach

he/him

Italian

Venezia

commented 14 February 2020 at 11:49

Status:

Active

» Closed (outdated)

Actually, I think we can close this issue as obsolete, given that with the new entity storage revision field tables are not created for non-revisionable entity types.

Reconsider the separate field revision data tables

Comments

Comment #1

Comment #2

Comment #3

Comment #4

Comment #5

Comment #6

Comment #7

Comment #8

Comment #9

Comment #9.0

Comment #10

Comment #11

Comment #12

Comment #13

Comment #14

Comment #15

Comment #16

Comment #17

Comment #18

Comment #19

Comment #20

Comment #21

Comment #22

Comment #23

Comment #24

Comment #25

Comment #26

Comment #27

Comment #28

Comment #29

Referenced by

News items

Our community

Documentation

Drupal code base

Governance of community