Media Mover re-processes changed nodes every cron run [#645060]

Hello Arthur,

1) I create a node with attached files.
2) I run cron, and MM stores the processed files in a CCK filefield.
3) Later, I change some text in the node body or title and save the node.
4) I upload a new video node and run cron. Here is the problem: Media Mover re-encodes my already processed files. I think this shouldn't happen, because I did not modify the attached files; I only changed some text in the node.

Is there any way to avoid re-encoding already processed files, so that files are re-encoded only when the attachments change?

Thanks in advance.

Comment	File	Size	Author
#18	harvest_timestamp-64506-17.patch	747 bytes	jsit
#16	harvest_timestamp-64506-8.patch	752 bytes	jsit
#7	mm_node.diff	439 bytes	arthurf
#4	mm_cck.diff	1.07 KB	arthurf

Comments

Comment #1

delykj commented 3 February 2010 at 12:53

Category: support » feature

Is there any hint about it?

Another option is to add a checkbox to the node editing form to not reprocess the MM configuration at cron run.
There is a similar solution in the Blue Droplet Video module (Retranscode this video):
http://drupal.org/node/422088

Comment #2

jordanmagnuson commented 9 February 2010 at 06:59

Subscribing, as this seems somewhat similar to my issue: http://drupal.org/node/707038

Comment #3

delykj commented 17 February 2010 at 10:20

I solved this problem with the Flag module, Rules and Actions.
I created a special global flag ("Convert this video") and hacked Media Mover to check the node's flag during cron processing. If the flag is set, MM converts the video; otherwise it skips it. When I create a node I set the flag initially, so the node is converted at the next cron run. MM publishes the node when the conversion completes. I created a rule that executes an unflag action when a node is published. If you want to reconvert a node, set the flag again, save the node and run cron again.

Comment #4

arthurf commented 17 February 2010 at 14:15

Status	File	Size
new	mm_cck.diff	1.07 KB

What is happening is that the node's changed date is updated when the node is saved, making it later than the last time cron ran. Media Mover then thinks it does not have the file, so it processes it again.

The fix belongs in the CCK harvest query: the harvested file's (FID's) timestamp needs to be checked and treated as a uniqueness check, something like

 AND f.timestamp >= $configuration['last_start_time']

I think this should do it. Though what seems weird to me is that:

  AND f.filepath NOT IN (SELECT harvest_file FROM {media_mover_files} WHERE cid = %d)

is not preventing this.
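
For illustration, the two checks discussed above can be sketched in plain PHP without a database. This is only a sketch under stated assumptions: example_is_harvestable() and all sample values are hypothetical and not part of Media Mover; the function just mirrors the timestamp comparison and the NOT IN check from the query.

```php
<?php
// Hypothetical helper, not part of Media Mover: mirrors the two WHERE
// conditions as plain PHP so they can be tried without a database.
function example_is_harvestable(array $file, array $harvested_paths, $last_start_time) {
  // Mirrors: AND f.timestamp >= last_start_time
  $is_recent = $file['timestamp'] >= $last_start_time;
  // Mirrors: AND f.filepath NOT IN (SELECT harvest_file ...)
  $not_harvested = !in_array($file['filepath'], $harvested_paths, true);
  return $is_recent && $not_harvested;
}

// Made-up sample values, for illustration only.
$harvested_paths = array('files/a.mov');
$last_start_time = 200;

$old_file  = array('filepath' => 'files/b.mov', 'timestamp' => 100);
$seen_file = array('filepath' => 'files/a.mov', 'timestamp' => 300);
$new_file  = array('filepath' => 'files/c.mov', 'timestamp' => 300);

$results = array(
  example_is_harvestable($old_file, $harvested_paths, $last_start_time),
  example_is_harvestable($seen_file, $harvested_paths, $last_start_time),
  example_is_harvestable($new_file, $harvested_paths, $last_start_time),
);
```

In this sketch only the third file passes both checks: the first is older than the cutoff and the second is already in the harvested list.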

Comment #5

arthurf commented 17 February 2010 at 14:17

@delykj - would you mind sharing the flag code? This would be awesome to get into a module. Note that you could use media_mover_api_event_trigger() to do this :)

Comment #6

delykj commented 17 February 2010 at 20:00

I didn't write a module for it. I simply hacked mm_node.module, so this is hard-coded ('convertablefiles' is the flag name).

function mm_node_harvest($action, $configuration, $job, $nid) {
  // this builds a set of node types that is mysql friendly for n.type IN (  ...  )
  if ($node_types = $configuration['mm_node_types'] ) {
    foreach ($node_types as $type) {
      if ($type) {
        $node_type_list[] = '"'. $type .'"';
      }
    }
    $node_type_list = 'n.type IN ('. implode(', ', $node_type_list) .')';

    // are we harvesting from a specific NID ?
    if ($nid) {
      $harvest_conditions = ' AND n.nid = '. $nid;
    }
    // otherwise we only look for nodes that are newer than the
    // last time that we ran
    else {
      $harvest_conditions = ' AND n.changed > '. ($job->last_start_time ? $job->last_start_time : 0);
    }

    // select from specified file types
    if ($configuration['file_types']) {
      $types = explode(' ', $configuration['file_types']);
      foreach ($types as $type) {
        $conditions[] = "f.filepath LIKE '%%.$type'";
      }
      // build the SQL to check against the file types
      $file_type_conditions = ' AND ('. implode(' OR ', $conditions) .')';
    }

    // get all potentially harvestable files
    // select all files join with nodes of $node_type_list
    // where node changed date is greater than last run start time
    // query for all files that match these conditions. Use the n.vid
    // to make sure we do not select files deleted from nodes.
    $query = 'SELECT f.*, n.nid FROM {files} f
      LEFT JOIN {upload} u ON f.fid = u.fid
      LEFT JOIN {node} n ON n.nid = u.nid
      WHERE
      '. $node_type_list .'
      '. $file_type_conditions .'
      '. $harvest_conditions .'
      AND u.fid NOT IN (SELECT mmf.fid FROM {media_mover_files} mmf WHERE mmf.cid = %d)
      AND n.vid = u.vid
      AND f.status = 1
      ORDER BY n.changed DESC';

    // now run the query
    $results = db_query($query, $configuration['cid']);
    $files = array();
    //delykj
    $flag = flag_get_flag('convertablefiles');
    //END
    // take each result and add it to the output
    while ($result = db_fetch_array($results)) {
      // check to see if file exists
      if (file_exists($result['filepath'])) {        
        //delykj
        if (!($flag->is_flagged($result['nid']))) {
          continue;
        }
        //END
        // now we harvest file
        $result['harvest_file'] = $result['filepath'];
        $files[] = $result;
      }
    }
    return $files;
  }
}

Comment #7

arthurf commented 17 February 2010 at 20:54

Status	File	Size
new	mm_node.diff	439 bytes

So I think the fix here is the same:

  $query = 'SELECT f.*, n.nid FROM {files} f
      LEFT JOIN {upload} u ON f.fid = u.fid
      LEFT JOIN {node} n ON n.nid = u.nid
      WHERE
      '. $node_type_list .'
      '. $file_type_conditions .'
      '. $harvest_conditions .'
      AND u.fid NOT IN (SELECT mmf.fid FROM {media_mover_files} mmf WHERE mmf.cid = %d)
      AND n.vid = u.vid
      AND f.status = 1
      AND f.timestamp > %d
      ORDER BY n.changed DESC';

And then $configuration->last_start_time is passed in to f.timestamp... Can you try out the diff and see if it works for you?

Comment #8

jordanmagnuson commented 24 February 2010 at 07:16

The diff in #7 seemed to fix my issue, but I had to apply it to mm_cck.module.

Changed mm_cck.module, line 345 from:

$configuration['cid'], $configuration['mm_cck_havest_node_type'], $configuration['cid'], $job->stop_time);

to:

$configuration['cid'], $configuration['mm_cck_havest_node_type'], $configuration['cid'], $job->last_start_time);

A bit confused as to why the initial value was $job->stop_time, as that variable doesn't seem to exist.

Comment #9

jordanmagnuson commented 24 February 2010 at 10:55

Never mind. After further testing, it looks like MM is STILL trying to harvest my files after they have been processed and moved to Amazon S3. At least some of them...

Comment #10

jordanmagnuson commented 24 February 2010 at 12:24

Okay, more testing, and I've determined that the $job->last_start_time control is *sort of* working. What happens is that MM wants to process all of my files exactly twice, where I want it to process each file once.

Comment #11

jordanmagnuson commented 25 February 2010 at 02:49

I doubt this is a good solution, but here's what I've done so that my S3 files are not processed twice by Media Mover. I changed line 347 of mm_cck.module from:

while ($result = db_fetch_array($results)) {
  $files[] = $result;
}

to:

while ($result = db_fetch_array($results)) {
  if (is_readable($result['harvest_file'])) {
    $files[] = $result;
  }
}

This prevents MM from trying to harvest files after they have been moved to S3, since the S3 files are not readable.
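
The same filter can be tried outside Drupal. The following is a minimal sketch, assuming rows shaped like the query results above; example_filter_readable() is a hypothetical helper (not part of Media Mover), and a non-existent local path stands in for a file that has been moved away.

```php
<?php
// Hypothetical helper, not part of Media Mover: keeps only the rows
// whose harvest_file can still be read from this server.
function example_filter_readable(array $rows) {
  $files = array();
  foreach ($rows as $row) {
    if (is_readable($row['harvest_file'])) {
      $files[] = $row;
    }
  }
  return $files;
}

// Create a real temporary file so that one row is readable.
$local = tempnam(sys_get_temp_dir(), 'mm');

$rows = array(
  array('fid' => 1, 'harvest_file' => $local),
  // Stand-in for a file that is no longer on the local disk.
  array('fid' => 2, 'harvest_file' => $local . '.moved-away'),
);

$kept = example_filter_readable($rows);

// Clean up the temporary file.
unlink($local);
```

Here only the row with fid 1 is kept. As with the change above, this hides the symptom rather than explaining why the rows are selected again.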

Comment #12

mrwhizkid commented 5 October 2010 at 09:16

Is #11 a good solution? I am having the same problem. Everything works, but every time cron runs, I get the following in my logs:

Harvested file is not readable, check permissions: http://documents.example.com/somedocument

Thanks.

Comment #13

mrwhizkid commented 15 November 2010 at 08:48

Thank you, Thank you, and Thank You!!

I was having the same problem...and this little change completely solves it. MM is no longer trying to harvest my S3 files which not only was filling up my logs with errors, but was also causing problems for subsequent 'run' operations.

If there is a better way to do this, I would like to know, but for now, this seems to solve my problem!

Comment #14

arthurf commented 18 November 2010 at 02:34

Status: Active » Needs review

I believe that the issue described here is this one: http://drupal.org/node/917656. While #11 may prevent the double harvest, it doesn't address the root cause, which I am now convinced is #917656. The fix for #917656 has been applied to 6.2.x and 6.1.x. It would be good to hear whether this issue is fixed.

Comment #15

carteriii commented 8 April 2012 at 03:17

I know this is a bit old, but it appears #8 is still relevant and needs to be fixed. $job->stop_time simply doesn't exist and I agree it should be replaced with $job->last_start_time.

That seems easy enough, but I'm sorry that I don't know how to create an official patch to be submitted, tested, etc. etc. If someone cares to point me to some instructions for doing that properly, I will do it, but otherwise can someone with the proper knowledge & authority simply get this moving? I see that #917656 has had some new discussion and there is hope it could make it into the next beta (or something) and it would be nice to also get this fix included at the same time.

Comment #16

jsit commented 14 October 2012 at 02:18

Status	File	Size
new	harvest_timestamp-64506-8.patch	752 bytes

I'm getting this same problem; here's a patch based on comment #8 that, after some brief testing, worked in 1 out of 2 instances. Not sure why it isn't consistent -- might have just been too many files and fields bouncing around in there while I tested -- but I'm hoping it'll start behaving.

Update: the problem just reoccurred on one of the two nodes I'm using to test it. Don't know why the other is unaffected.

Comment #17

jsit commented 14 October 2012 at 04:39

My mistake, and sorry for all the edits here, BUT --

I think in comment #8, the line should be changed to $job->start_time, not $job->last_start_time. Using start_time has (so far, and I think reliably) worked for me.

Comment #18

jsit commented 14 October 2012 at 04:44

Status	File	Size
new	harvest_timestamp-64506-17.patch	747 bytes

Here's the new patch, please disregard the one from #16 and use this instead.

This changes the timestamp comparison from $job->stop_time to $job->start_time.

Comment #19

jsit commented 14 October 2012 at 06:40

Correction: that patch doesn't work either. Sorry folks, I think we're out of luck for now.

Comment #20

arthurf commented 14 October 2012 at 12:52

That ORDER BY clause is just there to try to process things in order. I don't think it would be the issue. You could try removing it completely from the query, but I don't think that would change what is going on. You might want to try getting the query that is being run and running it directly in MySQL to see if you're getting any results.

Comment #21

jsit commented 14 October 2012 at 17:00

Again, apologies for all the posts. I was stuck in a debugging hole last night running down blind alleys, thinking I was onto something, when really the misbehavior is just inconsistent enough to trick you briefly into thinking it has been resolved.

Anyway, now, without trying to interpret these results, I'm just going to present some raw data of what Media Mover is doing when it fails.

For this test, I have changed mm_cck.module to use $job->last_start_time on line 342, instead of $job->stop_time.

Before testing, these were the values in the media_mover_config_list table:

233166: last_start_time
233220: start_time

I then uploaded a file to a node and saved the node, and here are the numbers that came out:

233166: the timestamp used in the WHERE part of the DB query, line 335 of mm_cck.module
233220: the new value of last_start_time in the media_mover_config_list table
233272: the new value of start_time in the media_mover_config_list table
233270: the value of timestamp in the files table

You can see already that if I were to manually run the configuration again, it would compare the file's timestamp (233270) to the last_start_time (233220) and find that the timestamp is larger, and would re-harvest the file. And indeed this is what happened, and these are the numbers that came out:

233220: the timestamp used in the WHERE part of the DB query, line 335 of mm_cck.module
233272: the new value of last_start_time in the media_mover_config_list table
233347: the new value of start_time in the media_mover_config_list table
233270: the value of timestamp in the files table

This is why I experimented with changing $job->last_start_time to $job->start_time -- and changing the comparison from >= to simply > -- but still ran into similar problems (duplicate harvesting).

I don't know if Media Mover is writing to the files table's timestamp field, or if it's recording its last_start_time incorrectly, but there is something crucially flawed about the DB query it executes when harvesting.
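
The comparison can be replayed with the numbers above. This is a minimal sketch, assuming the timestamp check (f.timestamp >= cutoff) is the only thing deciding whether the file is harvested; example_would_harvest() is a hypothetical helper, and the third run is an extrapolation that assumes the next cutoff would be the last_start_time recorded after the second run (233272).

```php
<?php
// Hypothetical helper, not part of Media Mover: mirrors the
// "f.timestamp >= cutoff" comparison in the harvest query.
function example_would_harvest($file_timestamp, $cutoff) {
  return $file_timestamp >= $cutoff;
}

// Value of timestamp in the files table, as reported above.
$file_timestamp = 233270;

// Run 1: the query used 233166 as its cutoff.
$run1 = example_would_harvest($file_timestamp, 233166);

// Run 2: the query used 233220, which is still before the upload.
$run2 = example_would_harvest($file_timestamp, 233220);

// Run 3 (assumed, not measured): cutoff 233272 is after the upload.
$run3 = example_would_harvest($file_timestamp, 233272);
```

Under these assumptions the file passes the check on the first two runs and fails it on the third, which would be consistent with the "exactly twice" behavior described in #10. It does not explain why switching to start_time still produced duplicates.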
