Support for Drupal 7 is ending on 5 January 2025—it’s time to migrate to Drupal 10! Learn about the many benefits of Drupal 10 and find migration tools in our resource center.
Problem/Motivation
There is a set of situations where composite entities (such as Paragraphs) are not deleted from the database. This could lead to a huge amount of storage taken by revisions that may be no longer used.
Proposed resolution
Offer a Batch delete form and be careful with revisions that are still used.
Remaining tasks
User interface changes
API changes
Data model changes
Release notes snippet
Comment | File | Size | Author |
---|---|---|---|
#63 | manual_cleanup_obsolote_composites-3016388-63-interdiff.txt | 4.07 KB | Berdir |
#63 | manual_cleanup_obsolote_composites-3016388-63.patch | 37.38 KB | Berdir |
| |||
#60 | interdiff-3016388-60-58.txt | 6.6 KB | mbovan |
#60 | manual_cleanup_obsolote_composites-3016388-60.patch | 35.99 KB | mbovan |
| |||
#58 | interdiff-3016388-58-57.txt | 930 bytes | mbovan |
Comments
Comment #2
johnchqueMoving patch from META issue.
Comment #3
johnchqueUpdating patch to insert the entities manually. :)
WIP: Will continue tomorrow.
Comment #5
johnchqueStrange, I guess this prevented the test to be run.
Comment #7
johnchqueOK, almost working. :) Half of the test passing. Will investigate a bit more about the others. :)
Comment #9
johnchqueOK, just checking, not sure how it is possible with the UI to get orphan revisions on non revisionable composite relationships. Thought we fix that in #2834314: Support deleting translations of a composite reference and #2834034: Deleting a translation deletes composite references on translatable fields. Is there something that I am missing?
Do I need to just create the entities without a parent intentionally?
Comment #10
BerdirMostly it's about cleaning up stuff from *before* we did those fixes, but there are still pretty easy ways, for example simply stop using a paragraph that you created previously ("delete" it) without saving as a new revision.
Comment #11
johnchqueUpdating tests. Thanks for the suggestion.
Comment #13
johnchqueI could finally make tests pass locally, let's see what testbot has to say. :)
Comment #14
johnchqueCleaning tests.
BTW I think that for make the tests more reliable, we should maybe set the parent id field of the composite entity with a value of one that no longer exists instead of deleting the ones that we already have. Otherwise, this test can collide with #2953650: Composite entities are not removed on host delete if reference was removed in prior revision.
What do you think?
Comment #15
BerdirYes, definitely, we should create orphaned entities explicitly by-hand and not rely on bugs.
Comment #16
miro_dietikerOne of, or all of them at the same time? ;-)
in the any?
This is not 100% clear - what if the host ever changed?
Comment #17
johnchqueUpdating the tests, need to rework the tests for counting the existing revisions. :) Will continue tomorrow.
WIP: Not addressing yet #16.
Comment #18
johnchqueAdding new scenarios that are needed to be checked. :)
Comment #19
johnchqueSo wondering the results of the following scenarios:
// Scenario 1: A composite with a default revision that is referenced and an
// old revision that is not. Result: Only the old revision is deleted.
// Scenario 2: A composite with an old revision that is used and a default
// revision that is not. Result: Nothing should be deleted.
// Scenario 3: A composite with an old revision and a default revision both
// that are not used. Result: Should first delete the old revision and then
// the default revision. Delete the entity too.
// Scenario 4: A composite used in a EN node and a DE node. If the composite field is translatable.
// Delete the EN node.
Result:
// Scenario 5: A composite used in a EN node and a DE node. If the composite field is not translatable.
// Delete the EN node.
Result:
// Scenario 6: A composite used in a EN node and a DE node with many revisions. If the composite field is translatable.
// Delete all revisions from the EN node but the last one. Does it make a difference if we remove the DE revisions?
Result:
// Scenario 7: A composite used in a EN node and a DE node with many revisions. If the composite field is not translatable.
// Delete all revisions from the EN node but the last one. Does it make a difference if we remove the DE revisions?
Result:
What other scenario should we also work on?
Comment #20
johnchqueOK, updating to have only the needed scenarios. :)
Comment #21
johnchqueOops, sorry. Removing leftover.
Comment #22
Berdirthis is a strange path, things below admin/config should be in one of the defined sections, this one is a bit tricky, maybe just put it in system for now?
The page title is also not very clear yet, plus we are not actually exposing this anywhere I think, no menu link or so.
We should have that, with a title and description that explains what it is about a bit better.
also not sure about the permission. This is a pretty destructive and very technical action, maybe a separate permission would make sense.
I'm not sure if we really need a list like this.
An orphaned entity is simply not used at all :)
And with the switch to revisions, we could maybe instead be a bit more specific about that and explain that it will remove all orphaned composite entity revisions that are not referenced by anything. And if no revisions are left, then the entity itself will be deleted too.
this could get huge. we only need the *count* of the results, not each id.
Instead, I would just store a count of deleted revisions and deleted entities and then report those at the end.
maybe store the entity type manager in $entity_type_manager, should make it a bit easier to read.
Then you could also use hasDefinition() for the condition below instead of that instanceof check.
I think there is no need to check for just that specific parent id, with translations and so on, it is possible that maybe this entity doesn't reference it anymore but the parent still does.
So we don't even need the parent_id, we just need the field name and type.
=> New test case, have a composite entity that claims that Parent1 references it, which does not (might not even exist), but instead, Parent2 does. Must not be deleted.
explain a bit clearer that we are looking for other revisions *of the composite entity.
Also, maybe rename $revision to $composite_revision to make it clearer what's what.
this does not include the accessCheck(FALSE) condition.
maybe put the deleteRevision() call explicitly into an else {} block instead of the continue.
$entity doesn't exist anymore., should be $revision.
drupal_set_message() is deprecated.
I think results should be keyed by entity type as well and this would then loop over that.
Something like
* Paragraph: Deleted 124 revisions (12 entities).
* Something else: Deleted 461 revisions (20 entities).
As separate messages.
(entity type label first to avoid language problems with lowercase, plural and so on)
Comment #23
johnchqueOK, addressing all changes from the previous comment. Not totally sure about descriptions though. :)
Comment #24
BerdirDid some manual testing and found a lot of bugs ;)
* parent entity types that don't support revisions were broken, e.g. taxonomy_term
* batch was broken in about 4 different ways, including core bugs that prevent us from using an entity query, updating the wrong key, having the wrong total, and never finishing because it was lowering the actual total so never reached the end.
* Added better reporting including time and deletions, removed the id as that doesn't say anything about the progress.
Next steps:
* Fix tests, they probably fail now due to changed strings and so on.
* Improve test coverage, switch to Settings::get('entity_update_batch_size', 50) so we can set the batch size to 1 or 2 in the test.
* Cover non-revisionable parent entity types
Comment #26
BerdirAnother small update that uses loadMultipleRevisions(), which makes it a lot faster.
Comment #28
BerdirA test run against 800k paragraphs took 20min, it reported 8700 revisions deleted and 15661 entities. I already added a revision count to deleting the entity if no revisions are left, sounds like this site hat paragraphs that referenced a parent type that didn't exist anymore. Or actually, had no parent_type set at all.
Also, that no-valid-parent case still calls delete() which deletes the whole entity, it should only delete the revision as there might be other revisions that have a parent.
Comment #29
moshe weitzman CreditAttribution: moshe weitzman as a volunteer commentedNice work folks ... Perhaps someone can make a command/script that doe deletion reporting. I'd be happy to run it on the mass.gov DB which has lots of paragraphs.
Comment #30
johnchqueGreat, so fixing tests. What is missing now:
* Improve test coverage, switch to Settings::get('entity_update_batch_size', 50) so we can set the batch size to 1 or 2 in the test.
* Cover non-revisionable parent entity types
* Refactor code to avoid code duplication.
Working on that right now. :)
Comment #31
johnchqueComment #32
johnchqueAdding tests for non revisionable parents.
Refactoring the code.
Will work on the batch size test. :)
Comment #33
BerdirI would invert it. Put this part into a method, return FALSE if there are no parents (=safe to delete), TRUE if there are.
You should need fewer arguments for that, and then the delete only needs to exist once and you don't have to put that into a method for now.
Comment #34
johnchqueAll addressed. Hope this works good.
Comment #35
BerdirI would just read the values again from the field in here, then you only need a single argument. That argument should have a type hint for ParagraphInterface, however.
There are actually two cases here:
There is a parent type but it doesn't exist anymore. Safe to delete.
It's also possible that there is no parent_type at all. Which could for example mean that it has been created before 2016, when we added the parent_type field. And never updated, or only as a new revision.
But we don't know and we have no idea who might still be using them. So they aren't safe to delete.
which means maybe this function should be called isUsed() or so, and must return TRUE for parent_type = NULL. Including a test.
Comment #36
moshe weitzman CreditAttribution: moshe weitzman as a volunteer commentedMay I suggest a minor refactoring to submitForm() whereby the batch creation is separate from form handling. That way a Drush command can easily create the batch and then call drush_backend_batch_process()
Comment #37
johnchqueThankfully I already had the scenario suggested, created. Updating tests for following that.
@moshe weitzman I think we are taking a bit longer than we should on this. Maybe that could be done in a follow-up? :)
Comment #38
johnchqueOK, so I created a service as suggested by @Berdir. Tests passing locally.
Not sure how to split the method deleteOrphansBatchOperation.
Comment #39
Berdirthe main point to document is that we are unable to test if it's still used because we don't know what to query for.
the finish callback also needs to go to the service I guess, not quite sure yet how that would work with drush but we'll see.
Manager is a pretty boring class name, anything can be a "Manager", doesn't say much about what it does.
Maybe a ..Purger?
this is already injected.
this can be injected
this too.
I removed this from the message, so we can remove it here as well.
description is a bit strange, mabe just "Whether the composite entity is still being used, FALSE if it is safe to delete.
can be simplified to just "return !empty($revisions);"
Looks pretty good, I think we want to get this in soon so we can start to reuse the API here in related issues.
One more thing: I would suggest we add a warning message to the UI that says this has the potential to delete composites that are still being used and it is recommended to create backup first.
Comment #40
johnchqueAddressing everything but 1. and adding the warning message.
Comment #41
johnchqueUpdating the interdiff. :)
Comment #42
johnchqueOK, this addresses all comments above. :)
Comment #43
johnchqueSo making the fields revisionable should help with this, had to update the test according to the change.
Comment #44
AMDandy CreditAttribution: AMDandy commentedI ran this on a database with more than 8 million Paragraphs. It took about 80 hours and deleted 10% or so of the Paragraphs with no apparent ill effects (I checked older revisions, translations, and old translation revisions to make sure nothing was obviously missing.)
Comment #45
albertosilvaThanks @yongt9412 for your patch!
Anyway, I'm facing a problem with paragraphs having parent_id set to NULL, and I think those paragraphs should be deleted:
Do you think that modifying your patch to add a check for paragraphs without parent_id into isUsed() function (and then deleting them) could make sense?
Thank you!
Comment #46
miro_dietikerWe need to be very careful about those cases as there could be stale records where the parent_id was set wrong (or not at all). Thus best would be we provide a separate command to rebuild the parent_id data (optionally).
Making the process more aggressive is perfectly fine for a follow-up.
Comment #47
moshe weitzman CreditAttribution: moshe weitzman as a volunteer commented@berdir let me know that this issue must wait for [#322354] and then adapt to it.
Comment #48
moshe weitzman CreditAttribution: moshe weitzman as a volunteer commentedUsing latest patch, I got an error after submitting the form
Following the link for more info told me
An error occurred while processing _entity_reference_revisions_orphan_purger_batch_dispatcher with arguments : _entity_reference_revisions_orphan_purger_batch_dispatcher
Comment #49
andypostComment #50
miro_dietikerDidn't check in deep detail, but it seems chx proposes a completely different approach:
https://twitter.com/chx/status/1098624319624642560
https://gitlab.com/chx_/nope
Maybe worth double checking?
Comment #51
yobottehg CreditAttribution: yobottehg commentedI checked both this patch and the module from chx with a real site:
- 5 languages
- 80.000 paragraph entities
- paragraphs in paragraphs
- symetrical translations
- node revisions get automatically cleaned up (node_revisions_autoclean) to at max 10 remaining.
This patch works but leaves a lot of trash still in the database as only 11% of the paragraph entities were deleted. (I assume that because of the deleted node revisions likely 60 to 70 percent of the paragraphs should be garbage.
The module from chx just threw an error (calling ->delete() on null) without any other explanation.
This issue is on "needs work". May i ask whats the current plan here?
Comment #52
johnchque@yobottehg I think we need to focus on #2904231: Parent fields are not revisionable first and update the parent fields correctly so we can use this patch again.
Comment #53
tbenice CreditAttribution: tbenice commentedHi, not sure where this issue is at this point but I needed to try it. I had trouble and so am modifying the patch a little.
The site I'm working with had a field_paragraph (in the past) that was and entity_reference_field but later was deleted and re-created as a entity_reference field. Of course there were then orphaned paragraphs with parent field of field_paragraph. So, in the isUsed() method the query failed with an 'invalid specifier target_revision_id' error since the field_paragraph was no longer an entity_reference_revisions field.
So...I just added a check if the field is a valid one before the query. I'm attaching the altered patch from #43 in case it is still needed.
Comment #54
BerdirHI, thanks for improving the patch.
It's extremely helpful to provide an interdiff when changing something in an existing patch, because that makes it so much easier to review what you did. see https://www.drupal.org/documentation/git/interdiff
The easiest way to do so is to apply a patch with git apply -3 (or another way that results in all changes being staged), then the full patch is git diff HEAD and the interdiff is git diff.
Comment #55
tbenice CreditAttribution: tbenice commentedMakes sense here is the improved patch with interdiff.
Comment #56
tbenice CreditAttribution: tbenice commentedagain with correct extension for interdiff.
Comment #57
mbovan CreditAttribution: mbovan at MD Systems GmbH commentedThis patch addresses a suggestion from #36.
#56 makes sense. Is it worth to write a test for this case?
Comment #58
mbovan CreditAttribution: mbovan at MD Systems GmbH commentedDiscussed #56 with @Berdir, and this would actually affect entities that do not use
entity_reference_revisions
field type such as the one in Entity Reference with Layout module.Thus, we decided to add an explicit check for
target_revision_id
schema column. This means, the described use case in #53 will not be supported with the latest updates.Comment #59
mbovan CreditAttribution: mbovan at MD Systems GmbH commentedCreated a follow-up issue to implement a Drush command in #3081700: Implement a Drush command to clean obsolete composite entities .
Comment #60
mbovan CreditAttribution: mbovan at MD Systems GmbH commentedAdditional cleanup... Connecting with the related #2766135: EntityQuery with condition on the revision field leads to wrong results core issue.
Comment #61
mbovan CreditAttribution: mbovan at MD Systems GmbH commentedCreated a follow-up #3081765: Allow users to remove composite revisions without parent id value in the manual cleanup process to cover #45, #46 and #53. I assume this will cover #51 until some degree as well.
@moshe weitzman I am not sure under what conditions #48 happens? Does it still happen on your project with the patch #60?
To summarize, I went through all the comments above and didn't notice anything remaining that we could work on here...
Comment #62
miro_dietikerPromoting for current relevancy and dependency for other progress.
Comment #63
BerdirThe schema check introduced a performance regression, refactored it a bit to store the result in a property per entity type/field.
Comment #64
BerdirCommitted!
Comment #67
arsn CreditAttribution: arsn commentedHello. For end user reference:
Thank you
Comment #68
dqd+1 for this fix and hard work on it. Thanks to yongt9412, tbenice, Berdir, mbovan! Greetings from Berlin and please stay safe!
PS; should we start a documentation page on D.O. reflecting this?
Comment #69
awasson CreditAttribution: awasson commentedI've been following this issue and the meta issues for a year or so and I wonder if I can pester you all for some clarification.
Is the addition of the new feature "Orphaned Paragraph cleanup": admin/config/system/delete-orphans supposed to resolve the issue that results from rendering a deleted paragraph entity in a view?
For instance if I have a view that's purpose is to render a paragraph field (one or more) and has a contextual filter on the node ID so it's limited to a per node arrangement. It seems that if I add one or more paragraphs from the node editing form they will show up in the view. As noted in the issue, the problem is that if I delete them or even edit them they remain in the view the way they did when they were originally created.
I thought that the new feature "Orphaned Paragraph cleanup" would resolve this but now I think I must be mistaken.
I'm using Entity Reference Revisions 8.x-1.8 and if I try to use "Orphaned Paragraph cleanup", even when I delete the paragraph instances that I don't want, no orphans are detected or deleted and they still show up in the view.
Thanks,
Andrew
Comment #70
gatorjoe CreditAttribution: gatorjoe as a volunteer commentedI confirm the same behavior in #69. I have removed a paragraph, run the cleanup but there are no paragraphs detected, therefore no removals. The deleted paragraph remains in the view as well.
Comment #71
BerdirIf you are using revisions then the old paragraphs aren't orphaned, they are still used by the old revision. Views support for ERR/paragraphs is incomplete and has bee since forever, see #2799479: Views doesn't recognize relationship to host, I don't understand why so many try to display paragraphs with views and despite that, nobody is spending time on properly fixing that issue or paying someone to do it :)
Comment #72
jigariusCool! Thanks for the solution. I just wanted to mention that for me, the batch stays stuck at 0.
For some reason,
$composite_storage->loadMultipleRevisions($entity_revision_ids)
always returns an empty array. I wonder if anyone else has run into this issue. If not, I guess I'll have to dig deeper.Comment #73
BerdirI'd suggest you create a new issue. Unsure why this would happen, sounds like some kind of strange edge case where it doesn't manage to count up. You'll have to debug through the batch process to find out what it's getting confused about. A field without paragraphs? A certain type of broken references? Not sure.