Prevent cron errors when scheduled nodes no longer exist [#2011692]

I had an issue where empty nodes were created every cron run. I managed to trace it to the publishing function executed in scheduler_cron().

The very root cause was that the nodes listed in the scheduler table were 'corrupted': the user that created these nodes was no longer present in the database, resulting in the node_load() for the nids from the scheduler table to fail.
The root cause was not the scheduler module, but it did add further 'corruption' (note the quotes) to an already problematic situation.

So, in my humble opinion, the scheduler module should check if the node_load() call was successful before modifying the node object and saving it back to the database.
When the load fails, it should log an error to the database so the admins know something's up and we don't just sweep it under the carpet.

A patch for this can be found in attach.

Comment	File	Size	Author
#24	2011692-24-scheduler-missing_node_relations.patch	1.35 KB	jonathan1055
#24
#23	missing nodes 3 with anon user.png	67.89 KB	jonathan1055
#20	2011692-20-scheduler-missing_node_relations.patch	1.5 KB	jonathan1055
#20
#16	2011692-16-scheduler-missing_node_relations.patch	1.5 KB	jonathan1055
#16
#15	2011692-15-scheduler-missing_node_relations.patch	1.74 KB	jonathan1055
#15
#14	2011692-14-scheduler-missing_node_relations.patch	1.57 KB	jonathan1055
#14
#10	2011692-10-scheduler-missing_node_relations.patch	659 bytes	pfrenssen
#10
#9	2011692-9-scheduler-missingnodes.patch	3.51 KB	jonathan1055
#9
#8	scheduler-missingnodes-2011692-8.patch	2.61 KB	chrisfromredfin
#8
#5	select nid from scheduler not node.png	49.74 KB	jonathan1055
#2	2011692_2.scheduler.empty-node-protection.patch	1.96 KB	malc0mn
#2
	scheduler_empty_node_protection.patch	1.89 KB	malc0mn

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

Comment #1

jonathan1055 CreditAttribution: jonathan1055 commented 4 June 2013 at 17:10

Version:	6.x-1.9	» 7.x-1.1
Category:	bug	» task
Status:	Needs review	» Needs work

Thanks malc0mn for explaining the problem and for doing the work to provide a solution. Yes I agree that even though this is not a scheduler bug we should test for a valid node_load() result and not compound the error.

The patch will need to be made to the 7.x scheduler version first. Do you have the means to do this? Also, just a few suggestions:

Should the action 'publish' or 'unpublish' also be passed to the helper function? This could then be included in the watchdog message, otherwise there will be no record of what Scheduler was attempting to do
I think the watchdog text is a bit wordy. How about something like "Failed loading node @nid for @action. Check user id @uid exists"
Could you look at #1981600: Add our patch file naming standard to project page which explains our patch naming rules

Thanks very much for helping, looking forward to the D7 patch.
Jonathan

Comment #2

malc0mn CreditAttribution: malc0mn commented 6 June 2013 at 11:04

File	Size
2011692_2.scheduler.empty-node-protection.patch	1.96 KB

Patch in attach for 7.x-1.x

The exact error message logged to watchdog will probably need adjustment still... The reason for the wordyness is that I wanted to explain in short what to look for to fix this as the non-existant user is not the only root cause but the most likely one...
We cannot add the uid in the watchdog unless we do a separate query in the node table to get the uid. node_load() with a non-existent user will simply fail.

I added the empty node type check because that's a good pointer of immenent chaos (lol) as well. We should not act upon nodes without a node type to prevent who knows what.

The check could be split up as well to give a better error message maybe, but that might be overkill...?

Comment #3

jonathan1055 CreditAttribution: jonathan1055 commented 10 June 2013 at 20:11

Thanks for the D7 patch. Just a few points:

I ran the patch through code review and the only thing it found was no space before . in the concatenation

Line 25: String concatenation should be formatted with a space separating the operators (dot .) and the surrounding terms [style_string_spacing]
watchdog('scheduler', 'Failed loading node with nid @nid! Action executed: @action.', array('@nid' => $nid, '@action' => $op), WATCHDOG_ERROR, l(t('view'), 'node/'. $nid));

For the watchdog text I would change 'executed' to 'attempted' given that the action could not be completed
Also we do not need the ! in the middle
How about "Failed loading node @nid during attempted @action"
Following our discovery in #2015415: Avoid PDOexception with shorter watchdog link could you add array('alias' => TRUE) to the call to l()

Good work, thanks.

Comment #4

jonathan1055 CreditAttribution: jonathan1055 commented 10 June 2013 at 20:28

I forgot to add that I tested the patch on a node where I had manually deleted the user row from the database. Scheduler did not give the new message about failing to load the node, because the node actually loads ok. Atleast, it did with my minimal set of modules in my test site, you may have additional functionality which is hooked in to node_load(). So that in itself did not trigger the new message. There were errors saying "Notice: Undefined index: 2 in user_node_load() (line 3606 of /Library/WebServer/Documents/drupal7/modules/user/user.module)" but this did not prevent the node from getting published. Annoyingly this error did prevent the normal scheduler watchdog line which says what action scheduler performed, but the node definitely got saved with the new publish status.

Then I took a different node which was scheduled for publishing and manually deleted the node row from the database table. This did trigger the new checking function and I got the new message in the dblog.

Jonathan

Comment #5

jonathan1055 CreditAttribution: jonathan1055 commented 11 June 2013 at 13:17

File	Size
select nid from scheduler not node.png	49.74 KB

Can I request another minor change to be added into this patch? On the sql join of the node and scheduler tables currently the nid value is taken from the node table. Obviously when everything is working correctly then this would be the same as the nid from the scheduler table, because the join is made on nid.

But in the cases we are discussing here, when something has gone wrong with the database, often the nid value may be null in this result. Then the watchdog message can give no information on what node scheduler was trying to act on. If the nid is taken from the scheduler table instead then it will be used in the 'view' link and the admin is given some idea as to where to start looking to solve the problem.

The change is highlighted in this image:

Thanks,
Jonathan

Comment #6

jonathan1055 CreditAttribution: jonathan1055 commented 12 June 2013 at 08:40

For reference, I have just closed #1410184: Scheduler breaks on Cron if a node has been deleted outside Drupal. as a duplicate of this issue.

One idea I had in #12 of that thread was the need to tidy up the {scheduler} table for nodes which have failed to load, otherwise they remain there forever. Should we incorporate that into this patch, as it is all related?

Jonathan

Comment #7

jonathan1055 CreditAttribution: jonathan1055 commented 24 July 2013 at 16:37

Title:

Prevent scheduler module from creating empty nodes

» Prevent errors when scheduled nodes no longer exist

Changing to a more generic title for better search results.

Marked #1906684: EntityMalformedException running cron job, when scheduler pointing to deleted node as a duplicate of this issue.

Comment #8

chrisfromredfin

he/him/his

English

Portland, Maine

CreditAttribution: chrisfromredfin commented 13 September 2013 at 21:51

Version:	7.x-1.1	» 7.x-1.x-dev
Status:	Needs work	» Needs review

File	Size
scheduler-missingnodes-2011692-8.patch	2.61 KB

I just ran into this, and I'm not even sure how the database got into this state, but same overarching issues (appears in scheduler table but does not node_load). I took the existing patch from 2, and incorporated all the above feedback.

Comment #9

jonathan1055 CreditAttribution: jonathan1055 commented 14 September 2013 at 12:43

Status:

Needs review

» Reviewed & tested by the community

File	Size
2011692-9-scheduler-missingnodes.patch	3.51 KB

Good work cwells, thanks for making the changes required. I've tested the patch and it all works as expected.

I've made a couple of changes, mainly in comments and coding standards. The only processing change is the addition of a fourth parameter $delete in _scheduler_check_node() which defaults to TRUE, but allows other functions to call this and not delete the row, eg from other modules implementations of hook_scheduler_nid_list_alter()

Here is a re-rolled patch and I'm happy to mark this RTBC. When it is committed, the credit should also go to malc0mn for making the first patch.

Thanks

Jonathan

Comment #10

pfrenssen

Sofia

CreditAttribution: pfrenssen commented 16 September 2013 at 13:29

Status:

Reviewed & tested by the community

» Needs review

File	Size
2011692-10-scheduler-missing_node_relations.patch	659 bytes

I don't think this is the right approach. But let's take a step back first, at why this problem might be occurring in the first place.

This problem can be replicated quite easily I think. The cleanup of the scheduler table is done in scheduler_node_delete(). This hook only fires if Scheduler is enabled, so this "corruption" can be simulated by creating a bunch of scheduled nodes, disabling Scheduler, removing a couple of the nodes and then enabling Scheduler again. Scheduler now has references to nodes that no longer exist, and will cause the trouble as mentioned above.

Similarly, if users are removed from the system but their content remains, their user id will be set to 0. I think it might be wise to not schedule content from removed users.

The suggested solution is quite hacky though. It shouldn't be necessary to add checks in 3 different places, and I think it is a very bad idea to delete data in a function that is called _scheduler_check_node(). This function name gives the impression that it just does some checks, and would not assume that calling this might cause data loss.

IMO the right solution is much simpler, we just need to change the left joins to inner joins in scheduler_list(). This will prevent Scheduler from acting on invalid data.

Cleaning up the "corrupted" rows from the database is not our responsibility I think, after all this has been caused by external problems, but this can be discussed. If a developer causes this problem they can just as well fix it themselves:

DELETE FROM scheduler WHERE nid NOT IN (SELECT nid FROM node)

Attached patch contains my proposed approach. I did not test it.

Comment #11

jonathan1055 CreditAttribution: jonathan1055 commented 16 September 2013 at 14:07

I take your point. Yes it may be better to change the join to 'inner' as a better way to avoid the errors.

It does mean that the list of scheduled nodes in the admin content tab will simply never show the corrupted ones. Maybe we also need to provide some separate way for the scheduler table to be cleaned up by admins who do not have direct access to the database.

I'll test yout patch and have a think about this.

Jonathan

Comment #12

pfrenssen

Sofia

CreditAttribution: pfrenssen commented 16 September 2013 at 19:32

Maybe we can provide a drush command and/or an option in the configuration to "clean invalid data"?

Comment #13

jonathan1055 CreditAttribution: jonathan1055 commented 16 September 2013 at 19:38

Status:

Needs review

» Active

Yep, I am working on that right now. Thank you for not passing my patch, as it was not the best solution. I did not really like the delete which cwells had added. That was not the cleanup I had imagined. But it has prompted us to make a better solution.

Jonathan

Comment #14

jonathan1055 CreditAttribution: jonathan1055 commented 17 September 2013 at 16:36

Status:

Active

» Needs review

File	Size
2011692-14-scheduler-missing_node_relations.patch	1.57 KB

I agree that the best way to avoid the errors during cron is simply to only process the good nodes. Clearing up the corrupt nodes will be dealt with in a separate issue.

I've tested your patch but you changed the sql in scheduler_list() which is used for displaying the scheduled nodes. The cron sql is in _scheduler_publish() and _scheduler_unpublish() so it is here that we need to change the join from LEFT to INNER.

I used your method to create corrupt nodes and tested the changes. If you have a corrupt node scheduled for publishing and another corrupt node scheduled for unpublishing, the cron run fails during _scheduler_publish() and does not get as far as _scheduler_unpublish(). With INNER for each, the bad nodes are ignored as required.

I also added an INNER join to the users table. This is to prevent problems described above where the user table has become incomplete. We should also consider implementing hook_user_delete() to remove rows from scheduled nodes when they are reassigned to the anonymous user, but that will be in another issue.

Here's the patch against 7.x-1.1+10
I do not have git diff, I use just diff. I have tried to find out how to format the patches in the a/ b/ style, but I have not found the option to do it.

Jonathan

Comment #15

jonathan1055 CreditAttribution: jonathan1055 commented 28 September 2013 at 16:31

File	Size
2011692-15-scheduler-missing_node_relations.patch	1.74 KB

Re-rolled patch for 1.1+15 dated 25th Sept, in git diff a/ b/ format.

Comment #16

jonathan1055 CreditAttribution: jonathan1055 commented 8 October 2013 at 17:00

File	Size
2011692-16-scheduler-missing_node_relations.patch	1.5 KB

Here is a re-roll of the patch, now against 1.1+19. It still applies ok even after the .module patch in #1566024: Coding Standards and Coder Review #27 but with a fuzz and offset. So it should be able to be tested and committed at the same time without any further re-roll.

Jonathan

Comment #17

jonathan1055 CreditAttribution: jonathan1055 commented 9 October 2013 at 09:23

#16: 2011692-16-scheduler-missing_node_relations.patch queued for re-testing.

Comment #18

9 October 2013 at 09:26

Status:

Needs review

» Needs work

The last submitted patch, 2011692-16-scheduler-missing_node_relations.patch, failed testing.

Comment #19

jonathan1055 CreditAttribution: jonathan1055 commented 9 October 2013 at 12:22

The patch applies manually with an offset and a fuzz value. So does the failure above mean that for automated testing it has to apply absolutely cleanly with no offset? or is it the fuzz which caused the problem? The reason given is:

Unable to apply patch.  See the log in the details link for more information

There is no further explanation in the details. It would be useful to know if it was the fuzz or offset which causes the failure, as that would help to minimise the number of re-rolls we need to do.

Comment #20

jonathan1055 CreditAttribution: jonathan1055 commented 9 October 2013 at 13:09

Status:

Needs work

» Needs review

File	Size
2011692-20-scheduler-missing_node_relations.patch	1.5 KB

To test this, here is the same code changes, in a patch which applies with two offsets. However I have corrected the line which caused the 'fuzz' value. This will show if the automated testing accepts patches with offsets.

Comment #21

jonathan1055 CreditAttribution: jonathan1055 commented 9 October 2013 at 15:55

The successful test result above shows that the automated testing does accept patches which apply with offset(s) but does not accept patches with a fuzz. That's useful to know, and means we have less re-rolling to do after the major code review changes in #1566024: Coding Standards and Coder Review

However, the patch does alter the code at the same places as in #1182450: i18n support - Synchronize scheduled dates in translated nodes so whichever is committed first, the other will need to be re-rolled.

Comment #22

pfrenssen

Sofia

CreditAttribution: pfrenssen commented 15 October 2013 at 18:03

Looking at this again I have one reservation: the inner join with the users table prevents to schedule nodes that are created by anonymous users. When this was discussed above we were thinking about users that were removed from the system but I realized there is also the possibility that websites allow anonymous users to create nodes, and these would become unschedulable. This might break existing websites. There are 50000 websites using this module so chances are that at least some of them have use cases for anonymously created nodes.

What would be the best option? Play it safe and only commit the join with the node table? That would still fix this issue, and not risk breaking websites.

Comment #23

jonathan1055 CreditAttribution: jonathan1055 commented 15 October 2013 at 19:20

File	Size
missing nodes 3 with anon user.png	67.89 KB

No I already thought of that, and it's fine. The anonymous user is uid 0 and they have a row in the users table too, so the inner join will not exclude them. The join with users will only exclude nodes which are still 'owned' by a user which has been deleted without reassigning their content to a new user, and would thus cause node_load() to throw an error (which is exactly what we want).

Incidentally, I've also just tested anon user scheduled nodes in conjunction with #2110983: Improve UI functionality for scheduled nodes list and they both work fine together.

Jonathan

Comment #24

jonathan1055 CreditAttribution: jonathan1055 commented 24 October 2013 at 10:49

File	Size
2011692-24-scheduler-missing_node_relations.patch	1.35 KB

Re-roll needed at 1.1+34

Comment #25

jonathan1055 CreditAttribution: jonathan1055 commented 10 November 2013 at 11:01

Title:	Prevent errors when scheduled nodes no longer exist	» Prevent cron errors when scheduled nodes no longer exist
Issue summary:	View changes
Related issues:		+#1906684: EntityMalformedException running cron job, when scheduler pointing to deleted node, +#2110983: Improve UI functionality for scheduled nodes list

10 files were hidden/shown/deleted

File	Size
scheduler_empty_node_protection.patch	1.89 KB

2011692_2.scheduler.empty-node-protection.patch	1.96 KB

select nid from scheduler not node.png	49.74 KB
scheduler-missingnodes-2011692-8.patch	2.61 KB

2011692-9-scheduler-missingnodes.patch	3.51 KB

2011692-10-scheduler-missing_node_relations.patch	659 bytes

2011692-14-scheduler-missing_node_relations.patch	1.57 KB

2011692-15-scheduler-missing_node_relations.patch	1.74 KB

2011692-16-scheduler-missing_node_relations.patch	1.5 KB

2011692-20-scheduler-missing_node_relations.patch	1.5 KB

Comment #26

pfrenssen

Sofia

CreditAttribution: pfrenssen commented 26 November 2013 at 19:59

Status:

Needs review

» Fixed

Alright, this is a straightforward fix, let's get it in. I don't think this really needs a test since this problem will not occur during normal usage. Writing a test that deliberately introduces database corruption is not really useful I guess :)

Committed to 7.x-1.x: commit 46c1d4a. Thanks!

Comment #27

jonathan1055 CreditAttribution: jonathan1055 commented 26 November 2013 at 20:35

Excellent. Thank You. Yes I agree it does not need tests. Thanks for the other commits too.

Do you know how we can progress the fact that none of the commits from 23 Nov onwards are showing in the list from https://drupal.org/node/3292/commits and the dev package has not been refreshed since 18th Nov.

I have raised it on #2125405: Not all commits showing up but nothing seems to be happening on that thread.

Comment #28

pfrenssen

Sofia

CreditAttribution: pfrenssen commented 26 November 2013 at 20:49

If it has been reported there is not much more we can do, the d.o team will certainly fix it when they get around to it. If you feel like helping out I'm sure they will be very grateful. Check out Contribute to Drupal.org.