Document: Allow dynamic generation of primary keys on sources

Project: Migrate
Version: 7.x-2.x-dev
Component: Code
Category: task
Priority: normal
Assigned: Unassigned
Status: closed (fixed)

Issue Summary

Odd case here, trying to figure out possible causes. We have a complex CSV-based migration. In one class, we create a "Physician" node. In another class, we read a separate CSV and attach data to a specific field on that node.

There are cases where the second migration never finishes. I finally tracked it down, and here's what is happening:

When a process respawns (as in a batch action), it seems to start over at row 1, not the "last processed row".
So it is looping back on itself and never reaching the end. For instance:

  • Imported 109 -- 109 records in db
  • Imported 116 -- 116 records in db
  • Imported 105 -- 116 records in db

Here's a snippet of a drush-based failure. This never proceeds past about 6% complete.

Processed 216 (0 created, 216 updated, 0 failed, 0 ignored) in 28.8 sec (450/min) - continuing with 'PhysicianBoards'                                                                          [ok]
Processed 300 (0 created, 300 updated, 0 failed, 0 ignored) in 42.1 sec (428/min) - continuing with 'PhysicianBoards'                                                                          [ok]
Memory usage is 217.75 MB (85% of limit 256 MB), resetting statics                                                                                                                             [warning]
Memory usage is now 208.7 MB (82% of limit 256 MB), not enough reclaimed, starting new batch                                                                                                   [warning]
Processed 216 (0 created, 216 updated, 0 failed, 0 ignored) in 33.1 sec (391/min) - continuing with 'PhysicianBoards'                                                                          [ok]
Processed 300 (0 created, 300 updated, 0 failed, 0 ignored) in 38.6 sec (467/min) - continuing with 'PhysicianBoards'                                                                          [ok]
Memory usage is 217.71 MB (85% of limit 256 MB), resetting statics  

Any ideas about where to begin debugging? One oddity here is that, due to a many-to-one relationship, we have to create custom primary keys during prepareRow(). Perhaps that is the issue?

Comments

#1

This is interesting:

Processed 100 (0 created, 100 updated, 0 failed, 0 ignored) in 13.6 sec (441/min) - continuing with          [ok]
'PhysicianBoards'
Memory usage is 217.7 MB (85% of limit 256 MB), resetting statics                                            [warning]
Memory usage is now 208.62 MB (81% of limit 256 MB), not enough reclaimed, starting new batch                [warning]
Processed 8 (0 created, 8 updated, 0 failed, 0 ignored) in 1.6 sec (296/min) - continuing with               [ok]
'PhysicianBoards'

#2

The plot thickens -- apparently this segfault is being hidden by normal batch processing.

drush5 mi PhysicianBoards --update --feedback="100 items" --watchdog=print --limit="200 items"
Processed 100 (0 created, 100 updated, 0 failed, 0 ignored) in 12.7 sec (472/min) - continuing with          [ok]
'PhysicianBoards'
Processed 100 (0 created, 100 updated, 0 failed, 0 ignored) in 13.2 sec (454/min) - continuing with          [ok]
'PhysicianBoards'
Processed 0 (0 created, 0 updated, 0 failed, 0 ignored) in 0.2 sec (0/min) - done with 'PhysicianBoards'     [completed]
Segmentation fault

#3

I might be able to get around this by avoiding records based on the "lastmigrated" column in the mapping table, but that is being set to 0 on import....

#4

@eaton notes that object memory is handled differently in PHP 5.3 (my localhost, where the migration works) and 5.2, which our host is using.

#5

First thought - you're running out of memory within 516 items processed? That's a serious memory leak, my first priority would be tracking that down.

Creating custom keys in prepareRow() - this is probably the cause of the original issue here. You're generating the key values that are being saved in the map table here, correct? So, imagining that 'id' is the source column you've referenced in MigrateSQLMap() (thus, $row->id is the value to be saved in the map table), here's what would happen:

1. You process the first row of the CSV, which has "23" in the id column.
2. In prepareRow(), you set $row->id to "23article5".
3. When the row is done importing, "23article5" is the sourceid1 value in the map table.
4. You run out of memory, and spawn a fresh import.
5. Migrate reads the first row of the CSV, which has "23" in the id column.
6. It then looks for sourceid1="23" in the map table, and finds nothing.
7. Thus, rather than skipping the row because it's already been imported, it imports it again.
8. Since it's importing all the same stuff, it'll probably run out of memory in just about the same place.
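
The eight steps above can be reproduced in a few lines of plain PHP. This is only a minimal sketch, not Migrate's actual code: the $map array stands in for the map table, import_pass() is a hypothetical helper that mimics the check-then-import loop, and the second row ("24") is invented for illustration.

```php
<?php
// Stand-in for the map table, keyed by sourceid1. Not Migrate's real storage.
$map = array();

/**
 * Hypothetical helper: one pass over the source rows.
 *
 * Looks each row up by its raw id, then rewrites the id the way prepareRow()
 * does in the scenario above, and records the rewritten value in the map.
 * Returns the number of rows that were imported in this pass.
 */
function import_pass(array $csv_rows, array &$map) {
  $imported = 0;
  foreach ($csv_rows as $csv_row) {
    $row = (object) $csv_row;
    // Steps 5-6: the lookup uses the raw value from the CSV, e.g. "23".
    if (isset($map[$row->id])) {
      // Already imported, skip it.
      continue;
    }
    // Step 2: prepareRow() changes the key after the lookup has happened.
    $row->id = $row->id . 'article5';
    // Step 3: the rewritten value is what lands in the map.
    $map[$row->id] = TRUE;
    $imported++;
  }
  return $imported;
}

$csv_rows = array(array('id' => '23'), array('id' => '24'));
$first_pass = import_pass($csv_rows, $map);
// Step 4: a fresh process starts over at the first CSV row.
$second_pass = import_pass($csv_rows, $map);

echo "first pass: $first_pass, second pass: $second_pass\n";
```

Both passes import every row, because the raw "23" never matches the stored "23article5". Generating the key before the lookup, rather than in prepareRow(), is what breaks the loop.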

What we might be able to do on the Migrate side would be to support a prepareKey($row) method, called at the top of the loop MigrateSource::next(), to let you do custom key generation.

As for other notes here:

I've seen a segfault at the end of a migration as well in one environment - but it seems to happen after the migration is fully complete, everything turns out fine, and I don't see it with the same code/data in other environments, so I haven't delved too deeply.

To set last_imported in the map table, you need to set the map's trackLastImported to TRUE. Unfortunately, looking at that support now, that member is protected and there's no setter, which makes doing this a bit of a challenge;). We should do something about that: #1703050: Enable setting trackLastImported

#6

Actually, in MigrateSQLMap, we set id to the compound key, which does not exist in the CSV file.

<?php
    $this->map = new MigrateSQLMap($this->machineName,
      array(
        'board_id' => array(
          'type' => 'varchar',
          'length' => '255',
          'not null' => TRUE,
          'description' => 'unique board id combined cms id and board description'
        )
      ),
      MigrateDestinationNode::getKeySchema()
    );
?>

In prepareRow(), we create the value. The $row->cmsid is the first item in the data, and it would be the primary key, except that we have a one-to-many mapping here.

<?php
    // board id
    $row->board_id = $row->cmsid . $row->board_desc_clean;
?>
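
To make the one-to-many point concrete, here is a small standalone sketch of why cmsid alone cannot serve as the key while the concatenated value can. The sample rows are invented for illustration; only the cmsid, board_desc_clean and board_id names come from the snippet above.

```php
<?php
// Invented sample data: physician 1001 has two boards, physician 1002 has one.
$rows = array(
  array('cmsid' => '1001', 'board_desc_clean' => 'internal_medicine'),
  array('cmsid' => '1001', 'board_desc_clean' => 'cardiology'),
  array('cmsid' => '1002', 'board_desc_clean' => 'cardiology'),
);

$by_cmsid = array();
$by_board_id = array();
foreach ($rows as $data) {
  $row = (object) $data;
  // Same concatenation as in prepareRow() above.
  $row->board_id = $row->cmsid . $row->board_desc_clean;
  $by_cmsid[$row->cmsid] = TRUE;
  $by_board_id[$row->board_id] = TRUE;
}

// cmsid collapses the two boards of physician 1001 into a single key;
// board_id keeps one key per source row.
echo count($rows) . ' rows, ' . count($by_cmsid) . ' distinct cmsid, '
  . count($by_board_id) . " distinct board_id\n";
```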

However, your description in steps 1-8 sounds about right. But -- and this is a large but -- the memory leak problem only happens on PHP 5.2. Apparently, 5.3 does better memory release / garbage cleanup. I can run the migration just fine using drush on 5.3. Using either drush on 5.2 or the batch UI (in 5.2 or 5.3) causes the fail.

I like the prepareKey() idea. It is something I can test if we have a patch. It is unlikely that we can get "better" data at this point.

Big thanks to you and moshe (via IRC) for the support here. I was about to claw my eyes out.

#7

Title: Field to node migration endless loop » Allow dynamic generation of primary keys on sources
Category: support request » feature request

#8

Status: active » fixed

Committed - implement

<?php
  public function prepareKey($source_key, $row)
?>

Where $source_key is the array you passed to MigrateSQLMap defining your source key, and $row is the raw data from the source. Usually when generating a custom key I would assume you know exactly what key fields to set, so you won't need $source_key, but the default implementation definitely needs this.
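
For readers wondering why a default implementation would need $source_key: presumably it just copies each key field named there out of the row. The function below is a standalone sketch of that idea written for illustration, not the committed Migrate code; default_prepare_key() is a hypothetical name and the row values are invented.

```php
<?php
/**
 * Hypothetical sketch of a default key builder.
 *
 * For every field named in the source key schema, copy the value that is
 * already present on the row into the returned key array.
 */
function default_prepare_key(array $source_key, $row) {
  $key = array();
  foreach ($source_key as $field_name => $field_schema) {
    $key[$field_name] = $row->$field_name;
  }
  return $key;
}

// Same shape as the schema array passed to MigrateSQLMap earlier in the thread.
$source_key = array(
  'board_id' => array('type' => 'varchar', 'length' => 255, 'not null' => TRUE),
);

// Invented row in which the key field already exists.
$row = new stdClass();
$row->cmsid = '1001';
$row->board_id = '1001cardiology';

$key = default_prepare_key($source_key, $row);
// $key is now array('board_id' => '1001cardiology').
```

A custom override can ignore $source_key entirely and build the array by hand, as long as it returns the same field names the map was declared with.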

#9

Sweet!

#10

Title: Allow dynamic generation of primary keys on sources » Document: Allow dynamic generation of primary keys on sources
Category: feature request » task
Status: fixed » active

I need to add a little docs patch when I get a second. It was not immediately obvious -- though it should be, I suppose -- that you need to return $key _and_ add the field to $row. e.g.

<?php
  /**
   * Prepare a proper foreign key.
   */
  public function prepareKey($source_key, $row) {
    // Add the generated key to the row as well as returning it.
    $row->board_id = $row->cmsid . $row->board_desc_clean;
    $key = array();
    $key['board_id'] = $row->board_id;
    return $key;
  }
?>

There is also a small copy/paste error in the docblock for the new method.

#11

Actually, we probably should merge it in automatically...

#12

Status: active » fixed

On second thought, since in the default case the fields are already present in the row, it would be wasteful to always recopy them. It should be up to the implementer of prepareKey() to populate whatever needs to be populated. I've updated the docblock to reflect this (and fixed the method name as well).

Thanks.

#13

For the record, this fix corrected the original problem. Awesome work.

#14

Status: fixed » closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.

#15

You might also want to store this key in the row. The saveIDMapping function needs it, and it is called even if you choose to skip the object (return FALSE) in prepareRow.
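
A rough standalone sketch of the difference, assuming (as this comment describes) that the mapping is written from fields on the row rather than from the array prepareKey() returned. collect_map_keys() is a hypothetical stand-in for that step, not the real saveIDMapping, and the row values are invented.

```php
<?php
/**
 * Hypothetical stand-in for the step that gathers map-table key values.
 *
 * Reads each declared key field from the row; a field that was never set on
 * the row comes back as NULL.
 */
function collect_map_keys(array $source_key, $row) {
  $keys = array();
  foreach (array_keys($source_key) as $field_name) {
    $keys[$field_name] = isset($row->$field_name) ? $row->$field_name : NULL;
  }
  return $keys;
}

$source_key = array('board_id' => array('type' => 'varchar'));
$data = array('cmsid' => '1001', 'board_desc_clean' => 'cardiology');

// Variant A: prepareKey() only returns the key.
$row_a = (object) $data;
$key_a = array('board_id' => $row_a->cmsid . $row_a->board_desc_clean);

// Variant B: prepareKey() also stores the key on the row.
$row_b = (object) $data;
$row_b->board_id = $row_b->cmsid . $row_b->board_desc_clean;
$key_b = array('board_id' => $row_b->board_id);

// Both variants return the same key, but only B leaves it on the row
// for a later step to pick up.
$saved_a = collect_map_keys($source_key, $row_a);
$saved_b = collect_map_keys($source_key, $row_b);
```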
