About Feed Import
Feed Import allows you to import content from various file types into entities (like node, user, ...) using XPath to fetch exactly what you need. You can create a new feed in PHP code in your module, or you can use the provided UI (recommended). If you have regular imports, you can enable them to run at cron. Feed Import currently provides five methods to process content:
- XML Normal: loads the XML file with simplexml_load_file() and parses its content. This method isn't suitable for huge files because it requires a lot of memory.
- XML Chunked: reads the XML file in chunks and recomposes each item. This is a good method for importing huge XML files. (still needs some testing)
- XML Reader: reads the XML file node by node and imports it. The parent XPath is limited to one attribute (the most complex form is something like //tagname[@attribute="value"]).
- HTML: processes an HTML page.
- CSV: processes a CSV file line by line (requires PHP 5.3 or later).
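The CSV method's line-by-line behavior can be sketched as follows. This is an illustration in Python rather than the module's actual PHP, and the sample data is made up:

```python
import csv
import io

# Sample feed content; in practice this would be the downloaded CSV file.
data = io.StringIO("name,city\nTom,Paris\nJerry,Rome\n")

# Reading row by row keeps memory use roughly constant regardless of file size.
reader = csv.DictReader(data)
rows = [row for row in reader]
print(rows[0]["name"])  # Tom
```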
Examples can be found at http://drupal.org/node/1360374
Example for importing youtube videos with Feed Import can be found at http://drupal.org/node/1365220
Example for importing taxonomy terms can be found at http://drupal.org/node/1542294
If you want, you can use your own function to parse content. Please check the README.txt file for more detailed documentation.
Features
- easy to use interface
- alternative xpaths support and default value
- ignore field & skip item import
- multi value fields support
- pre-filters & filters
- some useful provided filters
- auto-import/delete at cron
- import/export feed configuration
- add taxonomy terms to field (can add new terms)
- process HTML pages
- process CSV files
- add image to field (used as filter)
- custom settings on each feed process function
- do not save info about imported items (usually used for one-time import)
- schedule cron import to run only in a specified time interval
How does Feed Import work?
Step 1: Downloading file and creating items
- If we selected the processXML function for this feed, the whole XML file is loaded. We apply the parent XPath, create entity objects, and end up with all items in an array.
- If we selected the processXMLChunked function, the XML file is read in chunks. When a complete item is found, we create a SimpleXMLElement object from it and then an entity object. Content read so far is deleted from memory, and the process repeats until the whole XML file has been processed.
- If we selected processXmlReader, the XML is read node by node and imported.
- If we selected the processHTMLPage function, the HTML is converted to XML and imported like processXML.
- If we selected the processCSV function, the file is read line by line and imported.
- If we selected another process function, we should look at that function itself (it isn't provided by the Feed Import module).
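The chunked/node-by-node idea above — process each item as it is completed, then free it — can be sketched in Python with xml.etree.ElementTree.iterparse (the module itself uses PHP's SimpleXML/XMLReader; this is only a conceptual illustration with made-up tags):

```python
import xml.etree.ElementTree as ET
import io

# Tiny in-memory feed; element names are invented for this sketch.
xml = io.BytesIO(b"<feed><item><title>A</title></item><item><title>B</title></item></feed>")

titles = []
# iterparse yields elements as they are completed, so each <item> can be
# processed and then discarded instead of keeping the whole tree in memory.
for event, elem in ET.iterparse(xml, events=("end",)):
    if elem.tag == "item":
        titles.append(elem.findtext("title"))
        elem.clear()  # free the processed subtree

print(titles)  # ['A', 'B']
```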
Step 2: Creating entities
This step actually happens within Step 1: entity objects are created from SimpleXMLElement objects using the feed info.
We generate a unique hash for each item using the feed's unique XPath (unless it is a one-time import). Then, for each field in the feed, we apply its XPaths until one passes the pre-filter. If an XPath passes, we take the value and filter it. If the filtered value is empty (or isn't a value) we use the default action/value. This way we can have alternative XPaths. Example:
Here we can use the following XPaths to get the friend's name:
If bestfriend is missing, we fall back to the normal friend.
If the normal friend is missing too, we can specify a default value like "Forever alone".
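The fallback chain described above can be sketched like this. The element names (bestfriend, friend) and sample data are hypothetical, and the sketch is Python rather than the module's PHP:

```python
import xml.etree.ElementTree as ET

# Hypothetical item: it has a normal friend but no best friend.
item = ET.fromstring("<person><friend>Jerry</friend></person>")

# Try the XPaths in order; fall back to a default value if none match.
xpaths = ["bestfriend", "friend"]
value = next((item.findtext(x) for x in xpaths if item.find(x) is not None),
             "Forever alone")
print(value)  # Jerry
```

If neither element were present, the expression would yield the default "Forever alone".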
Step 3: Saving/Updating entities
First we look up the IDs of the generated hashes to see whether we need to create a new entity or just update an existing one.
For each object filled with data earlier, we check its hash:
If the hash is in the ID list, we check whether the entity data changed, to decide whether to save the changes or just update the expire time.
If the hash isn't in the list, we create a new entity and insert the hash into the database. Hashes are saved unless it is a one-time import.
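The create-or-update decision can be sketched as follows. The hash function choice, field values, and in-memory lookup table are all assumptions for illustration (the module stores hashes in a database table):

```python
import hashlib

def item_hash(unique_value):
    # Hash of the item's unique field identifies it across imports.
    return hashlib.md5(unique_value.encode()).hexdigest()

# Hashes stored from previous imports (normally a database table): hash -> entity id.
known = {item_hash("tom@example.com"): 1}

def decide(unique_value):
    h = item_hash(unique_value)
    if h in known:
        return ("update", known[h])   # existing entity: update it
    return ("create", None)           # new item: create an entity, save the hash

print(decide("tom@example.com"))    # ('update', 1)
print(decide("jerry@example.com"))  # ('create', None)
```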
Feed Import can add multiple values to fields which support this. In the example above we need only one XPath,
and both Tom and Jerry will be inserted as field values, which is great.
The expire time is used to automatically delete entities (at cron) when they have been missing from the feed for more than X seconds.
The expire time is updated for each item in the feed. For performance reasons, we update or insert X items in a single query.
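The batched expire-time update can be sketched like this. The table name, columns, and batch size are hypothetical, and SQLite stands in for Drupal's database layer:

```python
import sqlite3
import time

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE feed_hashes (hash TEXT PRIMARY KEY, expire INTEGER)")
conn.executemany("INSERT INTO feed_hashes VALUES (?, ?)",
                 [("h1", 0), ("h2", 0), ("h3", 0)])

# Update the expire time for a whole batch of items in a single query
# instead of issuing one query per item.
new_expire = int(time.time()) + 3600
batch = ["h1", "h2"]
placeholders = ",".join("?" * len(batch))
conn.execute(f"UPDATE feed_hashes SET expire = ? WHERE hash IN ({placeholders})",
             [new_expire] + batch)

updated = conn.execute("SELECT COUNT(*) FROM feed_hashes WHERE expire > 0").fetchone()[0]
print(updated)  # 2
```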