This is an additional class (just a modification of MigrateItemsXML).
I had to get it working with an array of URLs instead of a single URL.
You can use it for your own migrations, but you should test it first.
Usage
First, include the class in the module folder where your migration lives (and declare it in the *.info file).
// Get your URLs in an array.
$urls = array('http://yourfeedsource1', 'http://yourfeedsource2' ...);
// Create the $items_class object as with MigrateItemsXML, but with the "List" suffix.
$items_class = new MigrateItemsXMLList($urls, $item_xpath, $itemID_xpath);
$this->source = new MigrateSourceMultiItems($items_class, $fields);
Many thanks for the great and flexible module; it was very easy to work with.
dropfen.
Comments
Comment #1
dropfen: After some fixes...
100 URLs with ~220 items each.
Test content with only 3 fields from each item.
22334 test nodes imported in about 20 min.
What do you think about the performance?
Comment #2
dropfen: New file, needs tests...
Comment #3
dropfen: Performance test:
In this test I first downloaded the data and then ran the migrate script to import it.
Test with 500 XML files (170 MB in total).
Processed 85699 (85699 created, 0 updated, 0 failed, 0 ignored) in 1590.8 sec (3232/min) - done with 'TestXML'
So the script used about 5 GB of memory. Does anyone have ideas on how to make it more efficient?
Comment #4
mikeryan: It would be better to extend the existing MigrateItemsXML class to handle an array of XML files, rather than introduce an entire new class. See MigrateSourceXML for an example of how to manage a list of files.
Memory-wise, it looks like you're loading all the files at once; you should take one file at a time and explicitly close each one when you're done with it.
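The one-file-at-a-time advice can be sketched as follows. This is a minimal illustration, not code from the Migrate module: the function name collect_ids(), the temporary files standing in for the feed URLs, and the element names are all assumptions made for the example. SimpleXML has no explicit close call, so the sketch releases each parsed document with unset() before loading the next one.

```php
<?php
// Hypothetical stand-in for the feed URLs: two tiny XML files on disk.
$files = array();
$samples = array(
  '<items><item><id>1</id></item></items>',
  '<items><item><id>2</id></item><item><id>3</id></item></items>',
);
foreach ($samples as $sample) {
  $path = tempnam(sys_get_temp_dir(), 'feed');
  file_put_contents($path, $sample);
  $files[] = $path;
}

/**
 * Collects item IDs while holding at most one parsed file in memory.
 */
function collect_ids(array $files, $item_xpath, $id_xpath) {
  $ids = array();
  foreach ($files as $file) {
    $xml = simplexml_load_file($file);
    if ($xml === FALSE) {
      // Skip unreadable sources rather than abort the whole run.
      continue;
    }
    foreach ($xml->xpath($item_xpath) as $item) {
      $id = $item->xpath($id_xpath);
      if ($id) {
        $ids[] = (string) $id[0];
      }
    }
    // Release the parsed document before moving on to the next file.
    unset($item, $id, $xml);
  }
  return $ids;
}

$ids = collect_ids($files, '/items/item', 'id');
array_map('unlink', $files);
print implode(',', $ids) . "\n";
```

Only the ID strings survive each loop iteration, so peak memory depends on the largest single file rather than on the sum of all files.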
Comment #5
dropfen: Thank you for the tips. I will try it, and get more OOP experience in the process.
If the swap no longer fills up, the import process should run much faster, I hope :)
My last test ran at only 1500/min.
By the way, what do you think about this post?
http://posterous.richardcunningham.co.uk/using-a-hybrid-of-xmlreader-and...
Comment #6
mikeryan: Re: the "hybrid" post - that's exactly the approach MigrateSourceXML takes, using XMLReader to grab each element identified by the element query (which is a restricted subset of xpath syntax, since we have to implement this search ourselves), then SimpleXML over each element retrieved (enabling you to use full xpath syntax within each element).
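A rough sketch of that hybrid pattern, using only PHP's built-in XMLReader, DOMDocument and SimpleXML. It is not the MigrateSourceXML code: the function name stream_items(), the sample feed and the plain element-name match (in place of the module's element query) are assumptions made for illustration.

```php
<?php
// Hypothetical sample feed; a real source would be opened with $reader->open().
$feed = '<feed><title>Demo</title>'
  . '<item><id>1</id><name>first</name></item>'
  . '<item><id>2</id><name>second</name></item>'
  . '</feed>';

/**
 * Streams $xml and returns id => name for every <$element_name> element.
 */
function stream_items($xml, $element_name) {
  $reader = new XMLReader();
  $reader->XML($xml);
  $rows = array();
  while ($reader->read()) {
    if ($reader->nodeType == XMLReader::ELEMENT && $reader->name == $element_name) {
      // Copy just this element into its own small document...
      $doc = new DOMDocument();
      $node = $doc->importNode($reader->expand(), TRUE);
      $doc->appendChild($node);
      // ...so that full xpath can be used, relative to the element.
      $element = simplexml_import_dom($node);
      $id = $element->xpath('id');
      $name = $element->xpath('name');
      $rows[(string) $id[0]] = (string) $name[0];
    }
  }
  $reader->close();
  return $rows;
}

$rows = stream_items($feed, 'item');
print_r($rows);
```

Because each element is copied into its own document, xpath queries inside the loop see only that element and its descendants, not the rest of the feed.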
Comment #7
dropfen: Ah, OK. Then the problem with this is that the item we get is cut out of the whole XML file, so we can't access the nodes above it (in my case, the parent)?
The MigrateItemsXMLs class, which is an extension of MigrateItemsXML, works now. The memory issue is solved :) Thanks for your suggestion, Mike. When I have finished the development I'll post it to the issue.
But I still have some problems with the analyze() function. It doesn't work, but the migrate process itself works fine. Does the analyze function not access the same methods?
Comment #8
dropfen: So, here's the beta version of the class, MigrateItemsXMLs.
Beta, because it needs to be tested.
For me it works very well and is as fast as MigrateItemsXML, at about 4000-5000/min.
Maybe someone will find it useful for their own migration. You can use it the same way you use MigrateItemsXML, just with the (s) at the end, and you can pass either an array of URLs or a single URL as a string; it doesn't matter.
Anyway: download, test, enjoy ;)
Comment #9
dropfen: (no text)
Comment #10
dropfen: @mikeryan, what do you think about the implementation?
Is it good enough to contribute?
Comment #11
mikeryan: Sorry, I think you misunderstood when I said "extend" the MigrateItemsXML class to handle multiple URLs - what I meant was not to define a new class extending it, but to modify that class so it can handle an array of URLs as well as a single URL, similarly to what MigrateSourceXML does. There's no need to introduce another class here; it can be enhanced without breaking existing code.
Comment #12
dropfen: OK, this is what I did first :)
I will merge my latest overrides into MigrateItemsXML. The last version, in #8, seems to be stable.
Can you please explain why the rollback process (1300/min) takes 3-4 times longer than the migration (6100/min)?
Is there a bug in my implementation?
Thanks
Comment #13
mikeryan: No, at the database level deletion is often slower than insertion, so it's not surprising for rollback to be slower.
Comment #14
dropfen: Here's the patch to get MigrateItemsXML to accept an array of URLs.
It works very nicely for the most part.
One bug that I couldn't fix is that the analyze method only returns values of items that have not been imported.
Comment #17
dropfen: (no text)
Comment #18
dropfen: #14: 1998632-use_urls_in_migrate-items-xml.patch queued for re-testing.
Comment #20
dropfen: WTF?
Comment #21
mikeryan: It definitely breaks the wine.inc role migration. I don't have time to look in detail, but I did notice at least one code typo, "chache_ids"...
When rerolling, please make sure to adhere to Drupal coding standards (such as a space after "if").
Comment #22
dropfen: I did some cleanup and fixed the Drupal coding standards issues. Maybe the problem is caused by the xml property.
It's dynamic now and depends on the $id we pass to the getItem method.
Comment #23
dropfen: (no text)
Comment #24
dropfen: New patch; the last one had a horrible performance bug :|
Comment #25
dropfen: (no text)
Comment #26
dropfen: Fixed: array_unique($ids);
Comment #27
dropfen: @mikeryan
The patch works very well now. If you have the time to take a look at it,
I would be happy to see it in the commits ;)
Thanks, dropfen
Comment #28
mikeryan: Don't be scared! The actual code looks good, I just finally got around to installing Dreditor, which makes it easier to get extra-picky about comments and coding conventions...
@var array - the constructor parameter could be a string, but the property will always be an array.
Huh? Better described as essentially a cursor over the urls array, I think.
currently
Variable naming convention is lower-case separated by _, please don't change the parameters.
Add a period at the end.
Looks kind of hacky (and I think you meant <br />), how about a <ul>?
Additionally store them in cache_ids.
Declare the property before the constructor. Also, don't forget to declare idsMap.
// Make sure to load new xml.
Reordering the functions adds to the size of the patch, and makes it harder to follow what's changed.
Misplaced comment - unnecessary anyway I think, the code comments cover it.
Needs an indent.
Comment #29
mikeryan: Oh, and $cache_ids should be $cacheIDs (lowerCamel convention for class properties).
Comment #30
dropfen: Thank you very much for reviewing; it's a good feeling to get guidance from such a dev.
So I made the corrections, and after installing PhpStorm I got some notifications, which I fixed on the fly.
So it's not only Dreditor that is so picky.
Thanks for the idea about the URL list markup. I have to say that it was very late when I wrote the hack with the spaces ;)
I put the xpath selection rules before the URL list, since the list could become very large and this info should be visible on page load, not only after scrolling.
Comment #31
dropfen: (no text)
Comment #32
dropfen: What do you think, should the getAllItems() method maybe be replaced by getItems()? It doesn't really return all items, you know. But for now this method is public, so I'm not sure whether other classes want to call it.
Comment #33
mikeryan: I wouldn't want to change the public API unless there's a really compelling reason.
Committed, thanks!
Comment #34
dropfen: The reason is:
You should never be able to load all the items at the same time, for performance reasons (see your own comment #4).
And now, when you call the getAllItems method, you get only the items from the currentUrl. This is probably not what you want.
So if the API is used in a clean way, there is no reason to call the getAllItems method, since you always need a specific item ($this->getItem($id)) for an ID that you can get with $this->getIdList().
Thank you for reviewing and committing!
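The getIdList()/getItem($id) flow described here can be illustrated with a simplified stand-in. This is not the committed MigrateItemsXML code: the class name UrlListItemsSketch is hypothetical, the XML strings play the role of the URLs so that the sketch runs without a network, and the /items/item/id structure is assumed. Only the idea of an ID map (the idsMap property mentioned in #28) that records which source each ID came from is taken from the thread.

```php
<?php
/**
 * Hypothetical, simplified stand-in; not the Migrate module's class.
 */
class UrlListItemsSketch {
  // Source key (a URL in real use) => XML string.
  protected $sources;
  // Item ID => key of the source that contains it.
  protected $idsMap = array();

  public function __construct(array $sources) {
    $this->sources = $sources;
  }

  /**
   * Scans each source once, remembering which source every ID came from.
   */
  public function getIdList() {
    $this->idsMap = array();
    foreach ($this->sources as $key => $xml_string) {
      $xml = simplexml_load_string($xml_string);
      foreach ($xml->xpath('/items/item/id') as $id) {
        $this->idsMap[(string) $id] = $key;
      }
      // Keep only the ID map, not the parsed document.
      unset($xml);
    }
    return array_keys($this->idsMap);
  }

  /**
   * Loads only the source that holds $id and returns that item.
   */
  public function getItem($id) {
    if (!isset($this->idsMap[$id])) {
      return NULL;
    }
    $xml = simplexml_load_string($this->sources[$this->idsMap[$id]]);
    $found = $xml->xpath("/items/item[id='$id']");
    return $found ? $found[0] : NULL;
  }
}

$items = new UrlListItemsSketch(array(
  'http://yourfeedsource1' => '<items><item><id>a1</id><title>One</title></item></items>',
  'http://yourfeedsource2' => '<items><item><id>b1</id><title>Two</title></item></items>',
));
$ids = $items->getIdList();
$title = (string) $items->getItem('b1')->title;
print implode(',', $ids) . ' ' . $title . "\n";
```

With this shape, a caller never needs every item at once: it walks the ID list and asks for one item at a time, and only one source is parsed per request.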
Comment #36
dropfen: I have made a minimal change to the __toString() method of the MigrateItemsXML class.
I changed the order of the URLs and selection rules, because when you deal with more than 50 URLs you have to scroll down to see your settings.
So I think the selection rules can be moved to the top.
I don't think it's necessary to open a separate issue for that, so I'm putting it here.
Thanks
Comment #37
jcisio: I think this commit breaks "drush mi --idlist=...". I don't have time to check further.
Comment #38
mikeryan: Restoring status; new patches for a committed feature should be separate issues.