For anybody else wondering how this all might work:
* I manually set a directory on the /import/ page. By default this is 'sites/default/files'.
* In the basic settings for the importer, set Refresh: as often as possible. I think the import is triggered by cron, but this setting controls how often that actually results in an import.
* Do the first import manually by clicking the 'import' button. But I am not sure this first manual step is needed.
* After this, one (!) file is imported with every cron run. Even if there is more than one file, only one file is processed.
* The order of processing seems to be oldest first (date, time, filename)
If I got something wrong, or if this can be done better, then please tell me.
Some option / setting / workaround to make this setup process more than one file at a time would be really useful to me.
And some automated way to remove processed files from the dir and database table would also be helpful.
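The behaviour described above (one file per cron run, oldest first) can be sketched roughly like this. It is not the module's code, only a minimal Python illustration of the observed behaviour. The function name `next_file_to_import` and the `already_imported` set are made up for the example, and it assumes that "oldest" means modification time with the filename as a tie-breaker.

```python
from pathlib import Path


def next_file_to_import(directory, already_imported):
    """Return the oldest file that has not been imported yet, or None.

    `already_imported` is a set of paths relative to `directory`
    (hypothetical bookkeeping; the real module stores this differently).
    """
    root = Path(directory)
    candidates = [
        p for p in root.rglob("*")
        if p.is_file() and p.relative_to(root).as_posix() not in already_imported
    ]
    if not candidates:
        return None
    # Oldest modification time first, filename as tie-breaker (assumed order).
    return min(candidates, key=lambda p: (p.stat().st_mtime, p.name))
```

Calling this once per scheduled run and adding the returned file to `already_imported` gives the "one file per cron run" pattern. Processing several files per run would mean looping until it returns `None`.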
| Comment | File | Size | Author |
|---|---|---|---|
| #8 | directory import.jpg | 39.08 KB | hiral.chandora |
Comments
Comment #0.0
l_o_l commented
Spelling error.
Comment #1
rudyard55 commented
Thanks for the post. Not finding much information elsewhere.
I'm still having trouble with the path though. The physical location is D:\wamp\www\websitename\sites\default\files\extracts
Any idea on how to reference it? Every path I've tried results in an invalid URI.
Comment #2
rudyard55 commented
Ok... Posting back for others.
I had apparent (?) success using the following path:
public://
...but I don't know why yet. I'm going to go back and check the data to see what was imported, and I'll report back on how successful it was. 31 nodes were created, but I need to check the extracts to see what should have been created. Still confused.
Comment #3
rudyard55 commented
Ok. The imports were successful and I tested the automation and it appears to be working.
Apparently, if you put your files in the default folder (sites/default/files) or even in a child folder of the default(?), the path of
public://
will allow the fetcher to find them.
In my case, I'm parsing XML files. I dropped additional XML files into the folder and ran cron, and it fetched the new file and parsed it into nodes.
I'm still not clear on the pathing though, and I need to figure it out further because this project involves several xml feeds and they will require separate parsers. So I'll have to figure out how to differentiate between directories and filenames.
Really wish there were some examples on the pathing regarding directories and filenames.
Comment #4
l_o_l commented
From my experience:
By default the directory to be scanned is 'sites/default/files' plus all of its sub-directories. So if the directory setting is empty, that is where this module will look for new files.
If there are a lot of files in there (40 GB of images in my case) then cron will fail (I think because of a time-out or something), or the import will give strange errors because the files found are not xml files (I think the module tries to process any file it finds?).
So for the import I use the directory 'sites/default/files/feeds' and therefore as a setting for this module I use 'feeds/'.
This works well.
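To make the pathing concrete, here is a small sketch of how the directory setting appears to map onto the file system. It is only an illustration of the behaviour described in this thread, not the module's code. The helper name `resolve_scan_dir` is made up, the default public directory is assumed to be sites/default/files, and whether the fetcher accepts a form like `public://extracts` is an assumption that nobody here has tested.

```python
from pathlib import PurePosixPath

# Assumed default public files directory of the site.
PUBLIC_ROOT = PurePosixPath("sites/default/files")


def resolve_scan_dir(setting: str) -> PurePosixPath:
    """Map a directory setting to the directory that would be scanned.

    An empty setting or a bare 'public://' means the public root itself;
    anything else is treated as relative to that root.
    """
    setting = setting.strip()
    if setting.startswith("public://"):
        setting = setting[len("public://"):]
    relative = PurePosixPath(setting.strip("/"))
    if ".." in relative.parts:
        raise ValueError("setting must stay inside the public directory")
    return PUBLIC_ROOT / relative
```

With these assumptions, `feeds/` resolves to sites/default/files/feeds, and both an empty setting and `public://` resolve to sites/default/files. Separate XML feeds could then each get their own sub-directory and their own importer.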
But be warned: the directory setting on the import page can get lost! The fetcher keeps track of imported files in a text blob field (I can't remember where at the moment). In the same field it keeps the settings, such as which (non-default) directory to import from. I think at some point (after some thousands of imported files) the text blob field overflows and the settings are lost. From then on the import goes back to scanning the default directory (sites/default/files) and you have to enter the directory manually again!
Comment #5
rudyard55 commented
Thanks for the warning on the settings being reverted to default. That will save me some confusion later. Huge help.
And thank you for clarifying the path for me. I think I understand it now. It's all relative to the default.
I already have the xml files being automatically put into the default folder on a schedule. I was going to create a batch file on the server to move the files out of the folder daily into an archive, but by the way you are describing it (the module tracking imports in the blob field) maybe I shouldn't do that. It might be better to leave the files in the default directory and just copy them to an archive. I might test what occurs in that scenario while still in development.
Thanks again for your guidance.
Rudy
Comment #6
l_o_l commented
It is still advisable to remove the imported xml files. Just as the settings will get lost at some point because of the limits of the blob field, so will the names of the earliest imported files, and those files will then get processed again. This might not seem to be a problem (depending on the kind of import and its settings), but remember that only one file is imported per run, so at some point only old files may get re-imported while the new files are still waiting.
The Windows app that I wrote to export the xml files therefore also checks for previously successfully imported xml files (with a direct query into the database) and removes those files.
Yes it would have been very helpful if the module (optionally) deleted successfully imported files :-)
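A clean-up step like the one described could look roughly like the sketch below, which moves already imported files into an archive directory instead of deleting them. It is a hedged illustration only: the function name `archive_imported` is made up, and how the set of imported file names is obtained (for example the direct database query mentioned above) is deliberately left out, because the table and field are not identified in this thread.

```python
import shutil
from pathlib import Path


def archive_imported(import_dir, archive_dir, imported_names):
    """Move imported .xml files out of the import directory.

    `imported_names` is a set of file names known to be imported;
    obtaining it is outside this sketch. Returns the names moved.
    """
    archive = Path(archive_dir)
    archive.mkdir(parents=True, exist_ok=True)
    moved = []
    # Only top-level .xml files are considered; other files are left alone.
    for path in sorted(Path(import_dir).glob("*.xml")):
        if path.name in imported_names:
            shutil.move(str(path), str(archive / path.name))
            moved.append(path.name)
    return moved
```

Run on a schedule after cron, this keeps the import directory small, so the files still waiting are not stuck behind old ones.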
Comment #7
rudyard55 commented
Ok. I'll proceed with my original plan to move the files out after import. Thanks for your help. :-)
Comment #7.0
rudyard55 commented
Revised the text from experience.
Comment #8
hiral.chandora commented
I am just wondering, how do you set this up for periodic import?
Looking at the code, I think the "directory" value could be taken from the config form instead of the source form.
But I am not sure how to set the values so that the config form displays the assigned value for each field when it is reloaded.
Does anyone have any idea?
Also, can't we use a private folder instead of only the public directory?
Please update me.
Thanks in advance