It would be great to have complete support of general XML types in addition to RSS/Atom. There were many requests for this in FeedAPI which was eventually answered (sortof) by the extensible feed parser. It would be a great to have this support from early on using feeds.

http://drupal.org/node/359179

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

chrism2671’s picture

velosol’s picture

chrism2671 : I agree this would be a great feature, however... After writing my own parser for the custom XML I needed to consume, I do not relish the idea of trying to make it extensible in such a way that non-developer types could use. I've posted a generalized version of my code below in case anyone else wants to use it as a starting point. It is far from clean and includes lots of old debugging code.

Some starting ideas:
Make the xpath queries variables that can be set in a configuration form,
Make the various parts (item, title, description) equal to variables of the form parent_object->child,
Make the whole set of code a bit more OOP so attribute elements and other elements can be added without bound.

This code is mostly lifted from other places around the d.o issue queue (or Feeds) and the php SimpleXML examples - original credit goes to them which are more than I can remember at this point.

function generic_parser_parse($raw) {
  //libxml_use_internal_errors(true);
  if (!defined('LIBXML_VERSION') || (version_compare(phpversion(), '5.1.0', '<'))) {
    @ $xml = simplexml_load_string($raw, NULL);
  }
  else {
    @ $xml = simplexml_load_string($raw, NULL, LIBXML_NOCDATA | LIBXML_NOERROR | LIBXML_NOWARNING);
  }

  // $errors = libxml_get_errors();
  // foreach ($errors as $error) {
    // $message = (string)$error->message;
    // trigger_error("Error is: $message");
  // }
  
  // Got a malformed XML.
  if ($xml === FALSE || is_null($xml)) {
    trigger_error("generic_parser_parse received bad XML, beginning:" . substr($string, 0, 20));
    return FALSE;
  }
  $parsed_source = array();
  
  // trigger_error("Parsing with generic_parser_parse helper");
  
  //@todo: Add formatted time string to title.
  $parsed_source['title'] =  'Hardcoded value (bad) or pulled from XML (good)';
  $parsed_source['description'] = 'Hardcoded value (bad) or pulled from XML (good)';
  $parsed_source['items'] = array();
  
  // @todo: (alex_b) Make xpath case insensitive.
  $genericItems = $xml->xpath('//items');
  foreach ($genericItems as $genericItem) {
    $item = array();
    $item['guid'] = (string)$genericItem->id;

    $timeStampIs = strtotime($genericItem['ItemTime']);
    if ($timeStampIs !== FALSE && $timeStampIs != -1) {
      //If able to parse timestamp, use it - checks false for both strtotime vers.
      $item['timestamp'] = (int)$timeStampIs;
    }
    else {
      $item['timestamp'] = time();
    }
    //Short version:
    //$item['timestamp'] = ($timeStampIs !== FALSE && $timeStampIs != -1) ? (int)$timeStampIs : time();
    
    $item['title'] = $genericItem['title'];
    // Set location
    $item['location']['lat'] = (double)$genericItem->geocode->lat;
    $item['location']['lon'] = (double)$genericItem->geocode->lon;
    
    //Pull attributes from an element to create body
    //<ElementWithAttributes attribute1="XX">XX</E...>
    //Write as "{attribute1} is: {elementValue}"
    $elementsWithAttributes = $genericItem->xpath('anElement/ElementsWithAttributes');
    $attribText;
    foreach ($elementsWithAttributes as $attrib) {
      $attribText .= "<br/>\n" . (string)$attrib['attribute1'] . " is: " . (string)$attrib;
    }
    //The 'description' is the 'body' for a node.
    $item['description'] = $attribText;
    //Clean output array for the next trip through in case diff. set of attribs.
    unset($attribText);
    unset($timeStampIs);
    $parsed_source['items'][] = $item;
  }
  return $parsed_source;
}
chrism2671’s picture

It's unfortunate that this is quite fiddly! Will have a go and report back!

alex_b’s picture

Title: Extensible XML parser » Extensible XML parser (mapping more sources)
Version: 6.x-1.0-alpha7 » 6.x-1.x-dev

With mapping on import implemented, we could rely (as in FeedAPI Mapper) on analyzing the feed before offering mapping sources #651478: Mapping on import.

alex_b’s picture

For a demo project I have written a rule based parser called RParser.

In its documentation you can see how RParser allows for a definition of parsing rules with XPaths:

function hook_rparser_xpath($feed) {
 return array(
      'title' => '//default:Document/default:name[1]',
      'items' => array(
        '#xpath' => '//default:Placemark',
        '#children' => array(
          'title' => '//default:name',
          'description' => '//default:description',
          'picture' => '//default:pic',
          'url' => '//default:url',
          'lat' => '//default:LookAt/default:latitude',
          'lon' => '//default:LookAt/default:longitude',
        ),
      ),
    );
}

What's missing are namespace declarations and some simple fallbacks (what if the title does not yield a result?). This approach could get us very far. People who are looking for parsing XML documents with uncommon namespaces could easily add these with a simple hook implementation.

Further, if we were to write a 'magic parser' that allows a site builder to enter xpaths through a site UI, this is the infrastructure we'd want to have running underneath it.

mottolini’s picture

Links to RParser and its documentation are not working.
I'm really interested in RParser and willing to add a useful UI to it.

alex_b’s picture

#6: fixed.

whatdoesitwant’s picture

I forget to say that rparser looks clean and smart, as does the new feeds: amazing work. As a sitebuilder/themer I welcome the idea of opening this functionality up to the drupal ui.

For my 2 cts I think it better to work towards a general add-on that does open up the (feed)source specific (xml)data - as an extra array entry within the parsed php-array, along with the default fields - to the drupal ui.
This would mean that a setup as in your primary feeds tutorial vid would have to serve as a base on which more specific mapping can take place when the (feed)source's (xml)data is available.

If extensible parser does just that, i am sorry, but as a non-developer i never got it (which is my ish with that entire module). I'd much rather use something like exhaustive parser.

This approach is not as bad as it sounds because a three step configuration for specific (feed)sources is already necessary on the processing side (You have to create target fields based on the (feed)source's (xml)data, after creating a preliminary feed ct instance.)

SeanBannister’s picture

Might be worth looking at phpQuery and QueryPath which also has a module. I'm currently evaluating both for easy screen scrapping. They allow you to use jQuery style syntax to select XHTML elements.

netentropy’s picture

how does one actually use Rparser?

JayCally’s picture

I need to import a custom XML file and have been trying to come up with a way to do it using Feeds. I downloaded Rparser but it is dependent on the feedsapi not feeds. Can this be used with feeds?

Thoor’s picture

Sorry - this is maybe not the right Place to ask, but can anybody tell me ... is it possible to handle a XML File like shown under:
http://wiki.zanox.com/en/Product_Data_Download#Example_XML_file and create nodes from it? And if yes - do I have o use the OPML IMPORT, or FEED for it?

Beg your pardon - I read the documentation, watched the three videos and asked already in the regular forum without any answers ... THX

JayCally’s picture

You would need a custom XML parser.

Thoor’s picture

@ JayCally

THX a lot for your answer ... so I dont have to "try and error" anymore :) ... so I will use the CSV Import ... this is working so far.

JayCally’s picture

FileSize
6.91 KB

I'm creating a custom parser to pull in my custom XML file and used common_syndication_parser as a starting point. I have it set up and can select it in the parser section of feed importers. I can map the source to the target but am getting an error when importing. The error is: Invalid argument supplied for foreach() in FeedsNodeProcessor.inc line 22. I'm not sure what's wrong. I've attached a zip file with the custom parser and example XML file. Could someone take a look and see what I may have forgotten or screwed up?

netentropy’s picture

how do you use the custom parser once you write it?

JayCally’s picture

Add the below to feeds.plugin.inc. so you can select it as the parser to use in feed importers. You can also create a plugin for it to set up the mapping options.

$info['YourCustomParser'] = array(
'name' => 'Custom parser',
'description' => 'Parse custom XML files.',
'handler' => array(
'parent' => 'FeedsParser',
'class' => 'FeedsCustomParser',
'file' => 'FeedsCustomParser.inc',
'path' => $path,
),
);

I have it all working except the parser. I used common_syndication_parser as a base but I think I left some things out. So I'm redoing it and hopefully I can get it working soon.

JayCally’s picture

OK. Have trouble getting the custom parser to work. I used common_syndication_parser as a starting point, removed anything that I didn't need. I tested it on a normal feed and it works. But when I go into to add the custom fields I get an error message "feeds/plugins/FeedsNodeProcessor.inc on line 22" and I can't figure it out. :( Just as an example my custom feed doesn't have title for an item but "ADID". When I change title to this common_syndication_parser I get the error message.

Can anyone help me out with this?

JayCally’s picture

Was able to get the custom parser to work. I can now import my custom feeds. Wish this was a standard feature of feeds. News companies and other sites that need to import data from their in-house databases get their content exported as a custom XML file. I work for a small newspaper publisher and all of our vendors have changed their product to export content as XML. This would be a very helpful feature to have standard with Feeds.

netentropy’s picture

ehaustive parser for FeedAPI works nicely for FeedAPI, i will talk to the maintainer about porting it

velosol’s picture

FileSize
5.88 KB

JayCally, I'm sorry I didn't get back to you sooner. I'm glad you were able to get the custom parser working. Since I posted the first code blocks, I've had a chance to refine what I used for my custom parser; unfortunately my work has moved away from Feeds for the time being and I haven't been trawling the issue queue recently.

I'm posting the files I used (which is a custom module that uses Feeds hooks so that I don't have to modify Feeds' core files) in such a way that if you, or another developer needs a custom parser, you'll have a fairly good starting point/skeleton.

I would like to note that this is not in line with #4/#5 and should really only be a stopgap until rparser and mapping on import come to fruition. Also, it doesn't support namespaces as written, but in customizing for your files it should be easy to add. It could also use some additional OOP updates if you're planning on consuming really complex files.

The module says that it requires Feeds - for full functionality as written it also needs location_cck - this support is easily taken out by commenting the block relating to it (as mentioned in the files); especially useful if you don't have location data.

I hope this helps you, or someone else until a better solution is found.

JayCally’s picture

Thanks velosol. I was able to get my custom parser working and added a patch to Feeds to allow mapping to image fields. It took me a while but was a good learning experience. I am now looking for away to map location cck fields, email field and website field. I can probably just use text fields for email/web and make them linkable in content template but location will be the big one to get done.

velosol’s picture

JayCally - I have lat/lon location CCK support in the module I posted in #21. It's definitely not robust, but could serve as a starting point for you.

JayCally’s picture

Thanks velosol! I pulled the section of code I need from your module and used that as a base. I had to make some changes and additions but I have it working now.

rjbrown99’s picture

velosol - for what it's worth, your module was fantastically helpful in allowing me to create a custom XML parser. Great stuff and well commented. Thanks!

gregag’s picture

I get:
* user notice: customxml_parser_parse received bad XML, beginning: in C:\SERVER\htdocs\sites\testnadrupal.com\modules\cleandata\customxml_parser.inc on line 40.
* warning: Invalid argument supplied for foreach() in C:\SERVER\htdocs\sites\testnadrupal.com\modules\feeds\plugins\FeedsNodeProcessor.inc on line 22.

from code I inserted:
// Got a malformed XML.
if ($xml === FALSE || is_null($xml)) {
trigger_error("customxml_parser_parse received bad XML, beginning:" . substr($string, 0, 20));
return FALSE;
}
$parsed_source = array();

//@todo: Add formatted time string to title.
$parsed_source['title'] = '//html/body/form/table[2]/tbody/tr/td[1]/table'; //(string)current($xml->xpath('//Some/appropriate/path')); ////
$parsed_source['description'] = 'Webpart'; ////
$parsed_source['items'] = array();

// @todo: (alex_b) Make xpath case insensitive.
$customItems = $xml->xpath('//http://www.kjekupim.si/si/vroca_ponudba.wlgt'); ////
foreach ($customItems as $cItem) { ////
$item = array();
//Set the guid of the
$item['guid'] = (string)$cItem->Level1->Level2; ////

/**
* Skip those items without a guid - if you want this functionality
if( $item['guid'] == NULL) {
//We don't want it if it doesn't have an id.
continue;
}
*/

What did I do wrong?
help me please...

gregag’s picture

FileSize
231.21 KB

I would like to feed from table. Attached picture for easier understanding what I want...

gregag’s picture

FileSize
238.42 KB

Sorry for previous post, here is example

gregag’s picture

some help how to write appropriate path in line $todo would be seriously big THANKS from me

geerlingguy’s picture

Subscribe... would love an in-the-gui option, as I have a few news sites where we change the mappings from time to time, and maintaining a plugin/module is always a burden.

Should this be marked as a duplicate? Or vice-versa?
#651478: Mapping on import

Here's the feed I'm trying to import (a few custom xml siblings have been removed, for brevity.

<entry>
  <title type="text"><![CDATA[Title goes here.]]></title>
  <id>9713</id>
  <updated>2010-02-04T10:20:39</updated>
  <created>2010-02-04T06:00:00</created>
  <summary type="html"><![CDATA[By Author Name]]></summary>

  <content type="html"><![CDATA[
    <p>Content goes here.</p>
]]></content>

  <author><![CDATA[]]></author>
  <published>2010-02-04T10:20:39</published>
  <category term="sunday scripture" start="2010-02-04T11:13:13" end="2010-05-15"><![CDATA[sunday scripture]]></category>        
  <keywords><![CDATA[]]></keywords>
</entry>

I can get everything to move to the site except for the "summary" - I want to move that into a CCK 'Byline' field...

paganwinter’s picture

Subscribing...

jay-dee-ess’s picture

subscribing

jay-dee-ess’s picture

I'm trying to use velosol's cleandata module and am getting the following error when I try to use the Custom XML Parser:

Fatal error: Declaration of FeedsCustomXMLParser::parse() must be compatible with that of FeedsParser::parse()

I haven't made any changes to the code and am a noob...any help would be much appreciated.

sutch’s picture

subscribing

drewish’s picture

subscribing

TimG1’s picture

subscribing.

And if there is any developers for hire out there who can write a custom parser for my feed please get in touch via my contact form.

Thanks!
-Tim

reglogge’s picture

subscribing

pvhee’s picture

It would be great to have out-of-the box support for XML import in Feeds. What are the blockers to get this done?

pvhee’s picture

Building on the efforts on XML Parser, I created a Feeds plugin that can deal with (simple) XML, much like is being dealt with CSV at the moment.

You can find the module at http://github.com/pvhee/feeds_xmlparser. Note that this is only tested in my sandbox.

I'd be glad to receive any feedback on this, and hope we can tackle the problem of XML import using Feeds.

funkmasterjones’s picture

Install the feeds_xmlparser module
The patch is mostly to give a parsing example in the mapping section, I can't recall if feeds_xmlparser needs it to run, but I would recommend it anyways

Features:
- parse any type of xml file finding items specified by XPath (see parser_settings.png)
- preview the parsed data
- gets mapping source targets dynamically (see mapping.png)
- groups alike single values and array structures together for easier data access (Advanced)

Cons:
- must specify an xml template file (normally just the file you want to import) in the importer settings
(this is the only way you can generate source targets)

Note: in the screenshots I use a relative url, this is a different patch of mine, Feeds only supports absolute urls

pvhee’s picture

Would there be a possibility to merge both projects (your Feeds XML Parser and the one I posted in the previous comment)? Note that I created a temporary Drupal project for this at http://drupal.org/project/feeds_xmlparser. However, I'd be more than happy to have it included into the Feeds module once the code is more or less finalized.

CheezItMan’s picture

Interesting, I'm trying to write a parser for an Apple Podcast Producer ATOM Feed. Seems like this could do the job perfectly.

However with Alpha12, when I try to Feed Imports-->Edit the feed of my choice and then select the parser, I get a whitescreen of death. Pretty much the same error I get on a parser selection.

serbanghita’s picture

Priority: Normal » Critical

OMG i'm subscribing to this and testing #40

minus’s picture

tested #40 with version 6.x-1.0-alpha12 and got a whitescreen of death as well - i'm trying to write a custom parser but have a hard time doing it :-/

pvhee’s picture

@serbanghita @minus: did you try #41?

the module can be downloaded directly from github as eg zip file: http://github.com/pvhee/feeds_xmlparser/zipball/master. The XML parser is fully functional and does not require any feeds patches.

minus’s picture

thank you pvhee! didn't know it was updated, i had a version from feb.13 :-) will try this version right now :-)

minus’s picture

hmm, i'm trying to import the data from a xml test file which contains two fields, a description and an url. When i add these two fields using the XML Parser it gives me a error message saying

    * warning: copy(u) [function.copy]: failed to open stream: No such file or directory in /Sites/acquia-drupal-site/acquia-drupal/sites/all/modules/feeds/plugins/FeedsParser.inc on line 148.
    * Cannot write content to /tmp/FeedsEnclosure-u

I'm really new to this and i guess this has something to do with either the person behind the computer or the xml file he uses :-/ If you have the time i could send you the url to the feed i'm using.

webwriter’s picture

Subscribing

serbanghita’s picture

@pvhee actually #41 inspired me to create a custom module that extends Feeds and parse and maps Twitter search feed (eg. http://search.twitter.com/search.atom?q=starcraft2). Twitter search feed is not like a standard feed so it needs custom mapping.

I haven't tested the class inside #41. Will do in the next days.

Meanwhile i finally understand the approach of #40. Please correct me i'm wrong:

We need a parser that "knows" the <item> fields we are about to parse, so we can select them from the dropdown and map them to our already created CCK fields. The approach of #40, with a sample file, seems to be the only logical solution. Will test #40 and post comments.

alex_b’s picture

FYI - for some of you here maybe interesting #706984: Allow extension of FeedsSimplePieParser parsing

gaele’s picture

subscribing

reglogge’s picture

I am using the module and patch from #40 and get a very curious behavior: Importing works just fine but there is only ever the last item from an xml-file imported. So on each import I get just one new node, even if there are many more in the feed

Any ideas?

Monkey Master’s picture

My compact version of FeedsXMLParser (#39) with no need of external module (XML Parser) and with xpath setting (#40):

class FeedsXMLParser extends FeedsParser {

  public function parse(FeedsImportBatch $batch, FeedsSource $source) {
    $file = realpath($batch->getFilePath());
    if (!is_file($file)) {
      throw new Exception(t('File %name not found.', array('%name' => $batch->getFilePath())));
    }
    $xml = simplexml_load_file($file);
    if (!$xml) {
      throw new Exception(t('Could not parse XML file: %file', array('%file' => $batch->getFilePath())));
    }
    if (!empty($this->config['xpath'])) {
        $xml = $xml->xpath($this->config['xpath']);
    }
    $batch->setItems(unserialize_xml($xml));
  }

  public function configDefaults() {
    return array(
      'xpath' => '',
    );
  }

  public function configForm(&$form_state) {
    $form = array();
    $form['xpath'] = array(
      '#type' => 'textfield',
      '#title' => t('XPath'),
      '#default_value' => $this->config['xpath'],
      '#description' => t('XPath to locate items.'),
    );
    return $form;
  }
}

function unserialize_xml($data) {
  if ($data instanceof SimpleXMLElement)
    $data = (array)$data;
  if (is_array($data))
    foreach ($data as &$item)
      $item = unserialize_xml($item);
  return $data;
}
minus’s picture

how can this be used :$
Best regards
minus aka Noob

Monkey Master’s picture

This is alternative FeedsXMLParser.inc for feeds_xmlparser module (post #39) which uses SimleXML php extension instead of external xml_parser module

alex_b’s picture

#53: You should be able to use xpath on the mapping UI easily.

Implement a getSourceElement($item, $element_key) method:

On execution, $item will be one of the simple XML items yielded by $this->config['xpath']. $element_key will be a xpath entered by the site builder on the mapping UI.

AntiNSA’s picture

subscribe

Monkey Master’s picture

FileSize
1.46 KB

I implemented getSourceElement():

  public function getSourceElement($item, $element_key) {
    $xml = $item->xpath($element_key);
    if (is_array($xml)) {
      foreach ((array)$xml[0] as $value) {
        if (is_string($value)) return $value;
      }
    }
    return '';
  } 

It returns the first string found at specified xpath since arrays are not accepted (lots of warnings, I tried)

Working module is attached.

Monkey Master’s picture

#58 fails if number of items greater than 50.
SimpleXMLObject has big problems with unserialize() and FeedsSource->load() cannot restore batch object after first iteration.

Monkey Master’s picture

FileSize
1.52 KB

I don't see a simple way to fix SimpleXMLObject serialization...
So here's version without getSourceElement() with good old arrays.

Successfully tested on importing 9500 ubercart products with cck and filefield.

AntiNSA’s picture

Is that the best final solution?

AntiNSA’s picture

When I install and select this as the processor, only the bottom right has a drop down feild to select where to map things to. The left side field is a text entry fields,,, not a drop down list with options to select mappiingg sources from......

Is this how it is intended to work?

Monkey Master’s picture

Yes, like csv-parser - just open xml file and copy field names to mapper settings.
When we configure mappings no xml file is loaded yet - nowhere to get field names from.

alex_b’s picture

Priority: Critical » Normal
Status: Active » Needs review

Very interesting - the limitation is that it can only detect elements one hierarchy level deep in what's returned by config['xpath'] - correct?

Setting this to non-critical as we're keeping track of changes to upcoming release alpha 13 with 'critical'.

Monkey Master’s picture

yes, one level deep. Fits both of my projects I have at this moment - import uc_products from external program.

Version with getSourceElement() doesn't have this limitation.
Does anyone have a solution for SimpleXMLObject serialization problem?

srobert72’s picture

Subscribing

blasthaus’s picture

subscribing

Monkey Master’s picture

New version of #58 with a workaround for SimpleXMLElement serialization problem: Items added to batch as XML strings and then parsed again individually in getSourceElement().

It supports xpath on the mapping UI.

xcusso’s picture

Fresh install of #68 fail with this error Drupal: "error HTTP 500 /batch?id=53&op=do" in apache server log: "PHP Fatal error: Call to a member function asXML() on a non-object in /sites/my_site/modules/feeds_xmlparser/FeedsXMLParser.inc on line 24". Any ideas?
Thanks.

Monkey Master’s picture

@xcusso: did you set correct xpath in XML parser settings?
An example of XML that causes this error would be helpful

hampshire’s picture

What else needs to be installed along with the files in #68 as I have tried every combination on this page it seems and I still have been unable to get this parser to show up on the select a parser page.

Thank you and sorry for the dumb question.

milesw’s picture

subscribing

reglogge’s picture

subscribing

Monkey Master’s picture

@hampshire: you just need to enable module - Feeds XML Parser

Monkey Master’s picture

FileSize
1.63 KB

New version
added support of multiple value fields and more checks (e.g. for #69)
xpath in XML parser settings is obligatory, defaults to "item"

gdud’s picture

subscribing

AntiNSA’s picture

I am trying to use this with this feed :
http://feeds.pheedo.com/OcwWeb/rss/new/mit-newarchivedcourses

I have mapped:

title Title Remove
dc:date Published date Remove
dc:subject Taxonomy: User Keyword Remove
URL URL Remove
GUID GUID Remove
Body Body Remove

HTTP Fetcher
Download content from a URL.

XML parser
Parse data in XML format.

Node processor
Create nodes from parsed content.

XPath:
item
XPath to locate items.---------------------- whats this?

I keep getting the message that no update available...... everytime I try to import..... If you could tell me what I am doing wrong I would really appreciate it!

Thanks for all your hard work-
Robert

AntiNSA’s picture

I should say that everything works with the simple pie parser... I really would like to use your custom xml parser though! thanks again

Monkey Master’s picture

I am trying to use this with this feed :
http://feeds.pheedo.com/OcwWeb/rss/new/mit-newarchivedcourses

This XML uses namespaces - not supported at the moment.

XPath to locate items.---------------------- whats this?

The XPath query that is executed on loaded XML file to get an array of XML nodes.
Then XPath queries from mapping UI are executed on that XML nodes to get source values.
(i.e. "Name of source field" in mapping UI is actually XPath for this parser)

So it's better to learn XPath before working with this module...

dvbii’s picture

I have been trying the XPath expressions to map the fields, but can't get them to work.
Using http://eventful.com/atom/richmond/events as my feed, and I set the XML parser Xpath to: entry
Then trying to map start time from the feed as "//gd:when/@startTime". I get no entries..

Any advice?

dvbii

Monkey Master’s picture

@dvbii: This XML also uses namespaces (as #77), not supported. Any advises are welcome.
BTW Feeds module already has RSS/Atom parser.

Here's what I found:

SimpleXMLElement->xpath() function doesn’t support default namespaces and will not be able to perform an XPath search on those types of documents.
dvbii’s picture

thanks for the reply. If I use the atom parser, how do i map other fields?

clevername’s picture

What else needs to be installed along with the files in #68 as I have tried every combination on this page it seems and I still have been unable to get this parser to show up on the select a parser page.

Thank you and sorry for the dumb question.

Not showing up for me either despite module being enabled. Subscribing in hopes for an answer.

gordns’s picture

I had the same problem with the parser not showing. Worked after using the "flush all caches" menu in the admin_menu module.

hampshire’s picture

I deleted the database and started from scratch but mine was a test site and not important, I never tried flush all caches, did that work for you?

Dmxr100’s picture

Hi, this is exactly what I'm looking for, and is working really well apart from one bit that I can't work out.

I have an XML feed from a News service which is importing fine, apart from categories -> Taxonomy.

Here is how a news article appears in the XML:

<Article Created="17:20:54" ID="19393452">
    
    <Heading>Advertising "essential tool"</Heading>
    <Date>05/10/2009</Date>
    <Contents><![CDATA[One charity has discussed how advertising is an &quot;essential tool&quot; that enables it to get its message out...]]</Contents>
    
    <Categories>
          <Category ID="438033218">Customer loyalty</Category>
          <Category ID="438033219">Customer retention</Category>
    </Categories>
	
</Article>

I'm not sure how to map the "Category" values to taxonomy, because they are wrapped within "Categories". Is this possible? If so I would greatly appreciate some instruction.

If I remove the "Categories" wrap in a test XML upload, the mapping works fine, however as it is a feed provided by a third party, they will not change the structure.

Thanks in advance

tinem’s picture

subscribing

clemens.tolboom’s picture

Status: Needs review » Needs work

Could you please add a patch file here :)

Monkey Master’s picture

I'm not sure how to map the "Category" values to taxonomy, because they are wrapped within "Categories". Is this possible?

Yes, just use xpath in mapping UI - try "category"

Could you please add a patch file here :)

It's a standalone module, nothing is changed in feeds.

Dmxr100’s picture

Status: Needs work » Needs review

Hi Monkey Master

Thanks for your help with this. Turns out I wasn't using the correct XPath syntax. So having "Article" defined in the XPath UI imports the node correctly, and using "Categories//Category" in the mapping to taxonomy pulls in the nested "Category" element.

Success :-)

robbertnl’s picture

Subscribing.
I am also using #75 now, now finding out how to map nested xml's

Exploratus’s picture

This module works great. I was able to import using the XML Parser. My only question now is how do we import the images referenced in the XML feed so we do not have to download manually...

Cheers

Edit: Never mind, got it working...

alex_b’s picture

Monkey Master: Very nice work. You are clearly very actively developing feeds_xmlparser. I don't want to stand in the way as gatekeeper of Feeds module - why don't you break out feeds_xmlparser into its own project on d. o. or on github? I see that pvhee's already got a version on github:

http://github.com/pvhee/feeds_xmlparser

chrisirhc’s picture

Hi there, just to add on, I've incorporated the code from Monkey Master into my fork of the github (at http://github.com/chrisirhc/feeds_xmlparser) as well as made an amendment myself.

I have a question though on Monkey Master's code and my amendment though, how come this was used (1):

    if (is_array($xml)) {
      foreach ((array)$xml[0] as $value) {
        if (is_scalar($value)) return $value;
      }
    }

Instead of just (2):

    if (is_array($xml)) return (string) $xml[0];

If there are issues with what I am doing, please let me know, because I used (2) in my code. I made the change because the current code (1) does not support queries for attributes (results will be blank) e.g. /post/@title
Also, I noticed in my testing of XPath to my knowledge, (1) seems to behave unexpectedly in certain cases. If needed, I'll be happy to post some cases here if anyone needs an example. Just ask! :)

Monkey Master’s picture

chrisirhc: you've incorporated old version, latest is in post #75

mokko’s picture

I am just checking back to see what's the status of this very interesting importer. I am interested in namespace support,too. To me http://www.php.net/manual/en/intro.simplexml.php looks like it supports namespace. Where am I wrong?

chrisirhc’s picture

Sorry about that. Incorporated your latest changes. :)

@mokko Regarding namespaces, what kind of support are we looking at? Providing another field so that you can specify a namespace URI? or? I'm interested in getting it to work too.

I'm currently using Feeds with YQL (Yahoo Query Language) to do some imports. Also looking into doing up a fetcher that respects paging and 'staggering' requests (not sure if that's what it's called). I'm not sure if someone else is working on that now.

mokko’s picture

In the meantime I found out that you are using drupal.org/project/xml_parser and I am reading through that code. I haven't found out yet how it all works. I am new to feeds etc., so I wasn't able yet to test if it works. I just have xml files that I would like to digest (import to nodes) and they have namespaces. I don't need this urgently, so I guess I will play with it a little more.

chrisirhc’s picture

The latest code I used from Monkey Master does not use xml_parser . It uses the native SimpleXML support in PHP.
I believe in order to retrieve the namespaced elements, you need to specify the namespace before retrieving it. This can be done by listing the namespaces in the Feeds XML Parser settings page. I might consider adding this. It should be just a few lines of code I think.

I dug up an example of SimpleXML with namespaced elements. You need PHP 5.2 and above to get it to work.
http://www.ibm.com/developerworks/library/x-simplexml.html

http://sg2.php.net/manual/en/simplexmlelement.registerXPathNamespace.php

It looks rather expensive but the following could be a simple fix to allow namespaced xpaths:

//Fetch all namespaces
$namespaces = $sxe->getNamespaces(true);

//Register them with their prefixes
foreach ($namespaces as $prefix => $ns) {
    $sxe->registerXPathNamespace($prefix, $ns);
} 

I'm committing and pushing to the git.

I can already think of a possible problem that might occur. While you might be able to get the namespaced elements out, after that, during mapping, there might still be other challenges.. If the namespace does not persist till the getSourceElement method...
(Night, I'm off to sleep.)

Exploratus’s picture

Hi everyone.

Quick question. Is there a way to get the value inside the category, not just within the brackets.

this is what i mean:

i can get the value

whatever

but how do I get the value from within the brackets, like this:

How do I pull the venue ID "106543".

Thank you so much.

chrisirhc’s picture

Please put everything that is code within the <code></code> tags when giving code examples here.
Anyways, I think you're talking about attributes (<el someattribute="blah"/>) right?

You have to use the @ symbol to refer to attributes.

Take a look at the following links for some information on XPath.
http://www.w3schools.com/xpath/xpath_syntax.asp
http://www.quackit.com/xml/tutorial/xpath_attributes.cfm

Queries on XPath should be somewhere else.. This is a thread on the issues regarding the parser.

Exploratus’s picture

Sorry about that, that is exactly what I wanted. I didn't realize that the Drupal Forum took the codes out...

mokko’s picture

I have problems installing the feeds_xmlparser. A few more details here: http://drupal.org/node/800256. Any help appreciated.

ChaosD’s picture

subscribed

this functionality would add a lot of value to the feeds module. how usable is it so far? do you need help with testing?

mkalisz’s picture

subscribing

chrisirhc’s picture

Help needed to test whether it works with namespaced XML documents (or any other documents).
You can download it from:
http://github.com/chrisirhc/feeds_xmlparser

Exploratus’s picture

I am using namespaces and it works fine...

Sagar Ramgade’s picture

Issue tags: +xml parser
FileSize
23.46 KB

Hi,
I am using feed_xmlparser with xml_parser, I am not able to import data inside the subtags of xml file, however if i save the file and Remove those parent tags like media, media item, it is able to fetch data.
I want to fetch data inside the caption, mediaUrl, datePrefix etc.
The xml file which i am trying to import is attached. rename it .xml

ChaosD’s picture

read #90 - that helped for me

Sagar Ramgade’s picture

Here's my first item :

<articleListing>
−
<article id="6174899">
<articleType id="1">News Story</articleType>
<incomingBasket id="2">soccer</incomingBasket>
<title>Benfica won't hook Huntelaar - Milan</title>
−
<abstract>
Sporting director Ariedo Braida insists Klaas Jan Huntelaar will not be leaving AC Milan this summer despite reported interest from Benfica.
</abstract>
−
<body>
The 26-year-old arrived in Milan from Real Madrid last summer and is under contract for a further three years with the Rossoneri.
Braida said: "Huntelaar to Benfica? I don't think that will be possible.
"He is a player that is not for sale and that represents the future of Milan.
"Huntelaar has a contract for a further three seasons with us."
The former Ajax forward scored seven goals in 21 appearances for Milan during the 2009/10 campaign.
</body>
−
<media>
−
<mediaItem>
<caption>Huntelaar: Unlikely to join Benfica</caption>
<datePrefix>10/01</datePrefix>
<mediaURL>Klaas-Jan-Huntelaar-AC-Milan-celebrate_2404785.jpg</mediaURL>
−
<url>
http://clientimages.teamtalk.com/10/01/800x600/Klaas-Jan-Huntelaar-AC-Milan-celebrate_2404785.jpg
</url>
</mediaItem>
</media>
<last-updated>Thu, 27 May 2010 13:26:00</last-updated>
</article>

I tried //media//mediaItem//caption and //caption
Both of them didn't work.
Could you help please ?

ChaosD’s picture

don´t forget to set your xpath in /settings/FeedsXMLParser to your "node identifier" (thats how i called it). in you case it would be "article". for the mapping you should use "media//mediaItem//caption"

Sagar Ramgade’s picture

Thank you for you reply but i couldn't see that xpath setting anywhere, i am using httpfetcher and using a custom content type for importing.
Where do i find this setting, I might be acting dumb sorry for that however i couldn't find it anywhere.

Sagar Ramgade’s picture

Hi,

Actually i was being dumb i didn't read the thread carefully, i was using this http://github.com/pvhee/feeds_xmlparser
instead of http://github.com/chrisirhc/feeds_xmlparser.
I could see that xpath setting now, however need to test it with the settings.
Will update here if successful.

Sagar Ramgade’s picture

Hi,

It worked like a charm, I am using Feeds Image grabber module to fetch images too...
Thank you all who worked on this.
Cheers

tfranz’s picture

Im getting the example.xml of chrisirhc/feeds_xmlparser to work, but doesn't succeed with my own XML-file:

<?xml version="1.0" encoding="ISO-8859-1"?>
<master>
    <customer>
        <product>
            <object>
                <objecttyp>
                    <value valuetyp="special"/>
                </objecttyp>
            </object>
            <tech>
                <name>franz</name>
                <id>111</id>
            </tech>
        </product>
		
        <product></product>			
    </customer>
</master>

I tried master//customer//product and only product – as XPATH, but none of them worked ("There is no new content") ...

I need the following mapping:

tech//id => GUID
tech//name => Title
object//objecttyp//value@valuetyp => CCK-Textfield

Thank you for any help or hint,

Tobias

Sagar Ramgade’s picture

I think you need to set Xpath as //customer and then
product//object//objecttyp//value@valuetyp => CCK field
product//tech//id => GUID
product//tech//name => Title

Hope this helps.

Thoor’s picture

First of all - THX to all who worked on this Issue! Great Job!

The example.xml is working fine for me and my example from january in http://drupal.org/node/631104#comment-2437960 is almost working now, when I use the http://github.com/chrisirhc/feeds_xmlparser in addition to the FEEDS Module.

But I still have a problem I can´t solve. As you can see, there is a socalled XML SCHEMA in the Feed I want to use.

<products xmlns="http://zanox.com/productdata/exportservice/v1" xmlns:xsi=http://www.w3.org /2001/XMLSchema-instance 
xsi:schemaLocation="http://zanox.com/productdata/exportservice/v1 http://productdata.zanox.com/exportservice/schema/export-1.0.xsd">

With this line in the feed i receive the message ("There is no new content") ... while importing.

When I change the line manually to <products> ... everything works fine and I can parse and map the XM Feed!

Does anyone have a solution, how I can successfully handle the additional XML Schema in my Feed with FEEDS and the FEEDS_XMLPARSER?

srobert72’s picture

@thoor
Not a solution to your problem, but just a workaround.
With Zanox you could also use CSV export instead of XML.

chrisirhc’s picture

@Thoor

Is that line proper XML? I notice there's a space in the URL and there are no quotes.

Should it be:

<products xmlns="http://zanox.com/productdata/exportservice/v1" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
xsi:schemaLocation="http://zanox.com/productdata/exportservice/v1 http://productdata.zanox.com/exportservice/schema/export-1.0.xsd">

Will have to look into this issue if it persists even if it's proper validated XML.

Thoor’s picture

@chrisirhc
The line is how it is in the feed. An example feed can be seen under: http://wiki.zanox.com/en/Product_Data_Download#Example_XML_file

I don´t know, if this is proper XML?

@srobert72
THX for your reply! I knew already, that CSV import works well with FEEDS. Also the "Classic XML Feed" from Zanox is working good! Because there it is possible to turn the XML Schema "off"! Without it, FEEDS with FEEDS_XMLPARSER is working without Problems!

TimG1’s picture

Hi Chris,

Huge thanks for your work on this!

I've been messing around testing this for a bit today. I have a feed full of namespace prefixes. Importing works fine for elements with no prefix, but for the elements with a prefix I get a page full of errors like this...

warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: namespace error : Namespace prefix dataField on caseId is not defined in /sites/all/modules/feeds_xmlparser/FeedsXMLParser.inc on line 51.

warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: ^ in /sites/all/modules/feeds_xmlparser/FeedsXMLParser.inc on line 51.

I have a bunch of elements such as

  <dataField:caseId></dataField:caseId> 
  <dataField:lastUpdateDate></dataField:lastUpdateDate> 
  <dataField:categoryName></dataField:categoryName> 
  <dataField:isFeatured></dataField:isFeatured> 

I can send you the feed I'm working with via your contact form if you would like to mess around with it.

Thanks again!
-Tim

tfranz’s picture

Hi Sagar Ramgade, thanks for your reply!

I tried:
XPath: //customer
product//tech//id => GUID
product//tech//name => Title

At least it created one empty node, and i got the following error (3x):

* warning: mysql_real_escape_string() expects parameter 1 to be string, array given in /mnt/web6/21/84/51321584/htdocs/drupal/includes/database.mysql.inc on line 321.

Line 321 is return mysql_real_escape_string($text, $active_db);

/**
 * Prepare user input for use in a database query, preventing SQL injection attacks.
 */
function db_escape_string($text) {
  global $active_db;
  return mysql_real_escape_string($text, $active_db);
}

Any idea?!

mokko’s picture

Using chrisirhc's feeds_xmlparser I ran into the "mysql_real_escape_string()"-warning too. I looked into it a little bit.

In my case it comes up when my source xml has two title-elements while drupal has only one title. I can create a cck field title which allows multiple values, use xmlparser to map to the CCK field (instead of drupal's title) and then this warning disappears. Not sure this work around is what you want. At least then you should be able to import all your nodes.

mokko’s picture

@Thoor: I also looked into using namespace with chrisirhc's feeds_xmlparser. My understanding is that currently default namespaces are a bit problematic with the module. I ended up naming my namespace. Something like:

<products 
   xmlns:default="http://zanox.com/productdata/exportservice/v1" 
   xmlns="http://zanox.com/productdata/exportservice/v1" 
   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
   xsi:schemaLocation="http://zanox.com/productdata/exportservice/v1 http://productdata.zanox.com/exportservice/schema/export-1.0.xsd">

In the xpath setting I use
//default:products

I compare this functionality to my xml editor (oxygen). If I remember correctly it has an option where it replaces no namespace with default namespace. If feeds_xmlparser would do this, you could use your original xml with your original xpath setting
//products

Hope it helps
mokko

tfranz’s picture

I tried the following:

Xpath: //customer//product
//tech//name => Title
//tech//id => GUID

... and it works! Thanks for your help!

alex_b’s picture

Repeating what I said in #93: Does anyone want to break out this module into its own Drupal project? Seems like there is a large enough user base to justify it - just the issue queue that comes with it would be a big help for maintaining it :-)

While I welcome feeds_xmlparser to the Feeds ecosystem, I am not planning on committing it (or a similar extension) to Feeds at the moment - it is an extension that requires high involvement with its user base and I am personally not dealing with its use cases.

Steven Jones’s picture

Project: Feeds » Feeds XML Parser
Version: 6.x-1.x-dev »
mokko’s picture

I am probably repeating what everyone knows here already :
- it seems that pvhee created http://drupal.org/project/feeds_xmlparser and
- we were last talking about chrisirhc's fork of that code.

Both have a hosted their module on github (more than d.o), see e.g. http://github.com/chrisirhc/feeds_xmlparser.

Both have been quiet for at least a few days.

I am not very familiar with github. It seems for hosting the code it's fine. I miss drupal's issue queue and would prefer to use the http://drupal.org/project/feeds_xmlparser issue queue for chrisirhc's fork, but currently I think this would risk confusion. This would be somehow along the lines suggested by alex_b.

To me it seems the easiest would be if chrisirhc and or pvhee would say something. Any other suggestions?

podox’s picture

Very excited by this, especially the namespaces development. I'm having trouble importing the following feed:

http://rss.oucs.ox.ac.uk/oxitems/generatersstwo2.php?channel_name=classi...

The format is as follows

<?xml version="1.0" encoding="utf-8"?>
<rss version="2.0">
  <channel>
    <title>The Beazley Archive - Classical Art Research Centre</title>

    <item>
       <title>Introduction to Art of the Ancient World</title>
       <itunes:author xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd">John Boardman, Donna Kurtz</itunes:author>
       <link>http://media.podcasts.ox.ac.uk/oucs/classics/boardman-kurtz-oxonian.mp3?CAMEFROM=podcastsRSS</link>
    </item>

    <item>
      etc etc.
    </item>
  </channel>
</rss>

I've tried a combination of channel//item, item, //item //channel and many more as the Xpath setting, with //title as the sole mapping setting, but I get the error "Could not retrieve title from feed" on import.

mokko’s picture

Warning:I don't know much about RSS , so my wording might be strange.

Of course, you have two different titles: the channel title and the item title. I assume you want to map the item information. Correct xpath should be one of two

/rss/channel/item

or

//item

Then you should be able to map the title by entering "title" in the mapping. This should work definitely. With itunes namespace I am less sure. Try separately.

In the namespace tests I made successfully only my root element has a namespace.

You should be able to enter itunes:author in the mapping, but I am not at all sure this does work.

Of course you have to be using http://github.com/chrisirhc/feeds_xmlparser

What do the others think?

chrisirhc’s picture

Hi there everyone, apologies for the disappearance, I have been watching this thread but did not realise that it was awaiting me or pvhee's input.

pvhee has contacted me about co-maintaining the module with him. I've responded that I'm interested but he has not since responded. I believe he is busy at the moment. He has plans to bring the module from git back into the CVS system to be downloadable from the module here.

podox’s picture

Thanks mokko. I do want to map the item information to the feed item node, and /rss/channel/item works (for the item title, description, media file URl etc.) Not sure how to map the itunes info (itunes:author doesn't work but I'll keep trying).

The error message comes about from not entering a title in the *feed node* - you have to enter it manually with the XML parser, whereas with the common syndication parser, the feed node title is correctly generated automatically from the source feed.

Thanks for your help - looking forward to seeing this develop.

EDIT: See http://drupal.org/node/838172#comment-3134636 for working mapping settings, including a way to map itunes:author

arski’s picture

sub

arski’s picture

Hey,

I just installed feeds_xmlparser on my site but when I create a new feed there seems to be no new "XML Parser" option or anything like that in the Parser section. Any ideas what's going on?

Thanks,
Martin

PS. I agree with the post above - please please move the code to d.o. CVS asap - that will make it so much easier for everyone to test/submit issues/comments.. you don't have to make a release until you're ready, but at least the code/issues will be all in one place :)

mokko’s picture

FileSize
120.11 KB

I think I had that problem at first, too, but I don't remember exactly what was the problem. Did you check the rights of feeds_xmlparser directory? I attach a screenshot of how it looks for me. Hope it helps.

arski’s picture

Umm yea that's what I expected to see too.. the rights seem to be the same as for any other directory, don't think that should be a problem.. you really don't remember what you did to fix this? :)

arski’s picture

Gah.. got it.. I clicked on the "Download Source" button at the top of that github page and apparently I downloaded an old version of the module with basically empty code files.. great stuff :/

After I got the latest individual files manually - you still need to flush your cache for the XML Parser option to appear.

Just writing this down so that anyone else having the same trouble can read up :)

Other than that, let's try this out! :)

TamboWeb’s picture

@arski

Have you tried clearing your cache? Go to administer>site configuration > performance at the bottom, clear all cache.

I am currently doing testing with namespaces I will be reporting my findings sometime tonight.

Sandro.

arski’s picture

Hey, well yea, as my 2nd post (#137) says - there were 2 issues, including a cache one.

Having tested the thing out - hats off, works like a charm! :)

TamboWeb’s picture

I just imported 65 nodes from an xml file. I had to do a small change to the code to be able to import xml node elements within a namespace without prefixes. Other than that, it works great.

TamboWeb’s picture

Issue parsing xml file with namespaces.

Setup: Using chrisirhc code from http://github.com/chrisirhc/feeds_xmlparser,

This is just an observation as I am not much of a coder. I am not implying that this is a bug nor it is a fix. Somewhere in the posts above there is mention that namespaces are not supported. On the other hand some posts indicate successful results. For my particular application this is a workaround until I find a robust solution compatible with the module. I am not intending to hack the module.

It seems that the module is unable to import content if the XML file used as a source of content contains XML nodes elements with namespaces but without namespace prefixes. The module responds with no content found or with an error depending on the parser xpath setting.

Sample xml (truncated for simplicity sake):

<?xml version="1.0" encoding="utf-8"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" 
xmlns:xsd="http://www.w3.org/2001/XMLSchema">
<soap:Body>
<RetrieveListingDataResponse xmlns="http://www. example.com/ExampleNetServices">
<RetrieveListingDataResult>
<Listings xmlns="http://www.example.com/Schemas/Standard/StandardXML1_1.xsd">

<Residential><LN>34193</LN><PTYP>RESI</PTYP><LAG>15649</LAG></Residential>

</Listings>
</RetrieveListingDataResult>
</RetrieveListingDataResponse>
</soap:Body>
</soap:Envelope>

When trying to parse the above code with xpath, the module responds with no new content found.

If I remove the namespace from the Listing element, then the module parses the code correctly. This however is not ideal since the xml file will be updated daily with cron. I am not sure I want to manipulate the xml file with a search and replace, nor I want to do it manually prior to parsing.

After reviewing the code in FeedsXMLParser.inc Line 29 through 35,

      //Fetch all namespaces
      $namespaces = $xml->getNamespaces(true);

      //Register them with their prefixes
      foreach ($namespaces as $prefix => $ns) {
        $xml->registerXPathNamespace($prefix, $ns);
      }

I realized that there are no prefixes in the namespace, so there is no reference to the Listing element. Hence no content is found.

I proceeded to experiment a little, by hardcoding the namespace and the prefix as follows:

      //Fetch all namespaces
      $namespaces = $xml->getNamespaces(true);

      //Register them with their prefixes
      //foreach ($namespaces as $prefix => $ns) {
        //$xml->registerXPathNamespace($prefix, $ns);
      $xml->registerXPathNamespace('default', "http://www. example.com/Schemas/Standard/StandardXML1_1.xsd"); //hardcoded prefix + namespace, and commented out original code.
      //}

and changed the “Settings for XML parser” xpath settings to //default:Residential.

Now all nodes are imported without errors. It seems that the Fetch all namespaces needs some revision to accommodate for namespaces without prefixes since a blank prefix cannot be registered.

Sandro.

mokko’s picture

yes, Sandro! This is similar to what I meant in #124. I didn't see that this has consequences for the import (updates). I don't get why is it not necessary to register the usual namespaces (via $xml->getNamespaces(true)). Shouldn't we just add default for default namespace?

Anyways, I guess we now need a generic way to access namespace without prefix. Apparently ->getNamespaces does not work. Then ->getDocNamespaces will also not work, will it?

At http://de3.php.net/manual/en/simplexmlelement.getDocNamespaces.php it says:

If there is no prefix (a default namespace), the empty string will be used as a key in the array referencing that namespace value. 
The earliest ancestor will be used for overwriting any identical  prefixes (or lack thereof).

This doesn't sound good. It probably means we can access only one default namespace. That would be an improvement though. When I have some time, I can look into it tonight or so.

robbertnl’s picture

What about performance? i am using this, when importing a 8 mb XML file. It works but it eats a lot of memory ( about 1,5 gb) and several hours to complete.
Well i am doing some processing (creating users, content profiles, groups and importing small images), but i don't think it should take that long. I think it's an issue of XML Parser, but maybe also for this Extensible XML parser ?

TamboWeb’s picture

Hi mokko,

Thanks for the input and the link. I will take a look at that information as well. Currently I have to pull away for some time to take care of another project but I will get back to this as soon as I can.

TamboWeb’s picture

robbertnl

Regarding #143,

I am also concerned about performance. I have read somewhere that Xpath was slower than using the native php xml funcitons (i think). Eventually, my application will be doing some processing as well. So far this is a good start and proof of concept.

One feature that i need would be to be able to delete nodes referenced by another xml file. For example, import new content from xml file #1. then later, delete content specifed on XML file #2. So far the Feeds functionality allow you to delete content from the same feed only.

Another feature would be to be able to specify a feed import via cron. For instance run import of xml file #1 daily. and Run delete nodes as per xml file #2 every other day.

If anyone know who to do this I would be interested in hearing some pointers.

Thanks.

Sagar Ramgade’s picture

Hi,

The xml feed which i am trying to fetch contains date as shown below

I tried to map with the my cck date field (date or datetime tried both ), it gives me an error message :
warning: DateTime::__construct() [datetime.--construct]: Failed to parse time string (18/06/2010) at position 0 (1): Unexpected character in /../sites/all/modules/feeds/plugins/FeedsParser.inc on line 388.

In my cck date field i have set the custom input format to d/m/Y which should ideally match with the date format in the xml file, however it doesn't import anything stops with the error message mentioned above.

rjbrown99’s picture

It seems like this issue has started to become a catch-all for any problems related to XML parsing. Since the issue was first created, we now have a Feeds XML Parser module and a dedicated issue queue for it.

Would it be too much to ask that we take any new issues, problems, or questions and break them out into separate items?

michellezeedru’s picture

Plowing through this, knowing next to nothing about feeds and RSS, I was able to follow leads from this thread and set up a successful importer for an XML feed. Thought I'd subscribe to keep aprised of progress and also share my scenario, in case it's helpful to anyone else.

Setup: Using chrisirhc code from http://github.com/chrisirhc/feeds_xmlparser

The feed I'm trying to import looks like this:

<?xml version="1.0" encoding="ISO-8859-1"?>
<rss version="2.0" xmlns:ra="http://www.radioactivity.fm/inc/namespace">
   <channel>
      <title>Last twenty four hours of plays on KALX </title>
      <link>http://kalx.radioactivity.fm/feeds/last24hours.xml</link>
      <description>The last twenty hour of plays on KALX</description>
      <language>en-us</language>
      <pubDate>Wed, 23 Jun 2010 15:48:44 EST</pubDate>	
	
      <item>
         <title>Play # 2 </title>
         <link>http://kalx.radioactivity.fm/</link>
         <ra:time>06/23/2010 11:54 am</ra:time>
         <ra:track>Alien Heart</ra:track>
         <ra:artist>Elodie Lauten</ra:artist>
         <ra:album>Piano Works Revisited</ra:album>
         <ra:label>Unseen Worlds</ra:label>
         <ra:genre></ra:genre>
      </item>
</channel>
</rss>

In my feed importer>XML Processor settings, I set X-Path to //item
In mappings, I set up the following sources to map to CCK fields:

  • Title
  • //track
  • //artist
  • //album
  • //time

And it works -- for the most part! The feed apparently contains no type of unique identifier, so updating is not working. I guess my next step is to work with the producer of this feed to include GUID in the feed?

Thanks for the work on this everyone.

ChaosD’s picture

you could use the <link> as guid because its most likely unique and wont give duplicates (assuming that each item has a own link on the original site) ... title might also work but personally i would not rely on that.

meatbag’s picture

When i test chrisirhc's xml parser on the following feed, i always get the "no new content" message.

<rdf:RDF
 xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 xmlns="http://purl.org/rss/1.0/"
 xmlns:content="http://purl.org/rss/1.0/modules/content/"
 xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/"
 xmlns:dc="http://purl.org/dc/elements/1.1/"
 xmlns:syn="http://purl.org/rss/1.0/modules/syndication/"
 xmlns:image="http://purl.org/rss/1.0/modules/image/"
 xmlns:admin="http://webns.net/mvcb/"
 xmlns:atom="http://www.w3.org/2005/Atom"
>
feed contents......
</rdf:RDF>

So i try to remove sth from the feed while keeping other settings untouched.

<rdf:RDF
>
feed contents......
</rdf:RDF>

And it works this time! But i just can't figure out why it failed to parse the first one.
Any helps would be appreciated.

meatbag’s picture

Ok i figure out that maybe namespace without prefix is the reason...

Sagar Ramgade’s picture

Hi All,

I want to fetch the Home team, away team score, MATCHDETAIL attendance.
I am able to fetch the match id , home team name, away team name However not able to fetch home team score, away team score and attendance.
Can anyone help.
I had set the xpath as : //XML//DETAILS

//HOMETEAMNAME[@SNAME] => Title

//HOMETEAMNAME[@SNAME] => Home team term

//AWAYTEAMNAME[@SNAME] => Away team term

//MATCHSUMMARY[@ID] => GUID

//MATCHSUMMARY[@ID] => matchID

//MATCHSUMMARY//HOMETEAMSCORE//SCORE => home team score

//MATCHSUMMARY//AWAYTEAMSCORE//SCORE => away team score

//MATCHDETAIL[@ATTENDANCE] => attendance

<XML>
−
<DETAILS>
−
<MATCHSUMMARY SPORTID="1" ID="301096" PAID="3222305" CPID="1627" SBID="12345058" TVCHANNEL="" FXDATE="21/06/2010" CURRENTDATE="" ENDDATE="" TIME="12:30" LONGCOMPETITIONNAME="FIFA World Cup" SHORTCOMPETITIONNAME="World Cup" LONGROUNDNAME="Group G" SHORTROUNDNAME="Group G" DETSLASTUPDATED="1277145082" COMMSLASTUPDATED="1277130900" STATSLASTUPDATED="1" TABLELASTUPDATED="1277140093" COMMENTARY="/football/commentary/301096_1277130900.xml" GALLERYLASTUPDATED="1277132280" GALLERY="/football/editorial/gallery_match_301096_1277132280.xml" PLAYERSTATS="Y">
<HOMETEAMNAME ID="195" SNAME="Portugal">Portugal</HOMETEAMNAME>
−
<HOMETEAMSCORE>
<SCORE AGG="">7</SCORE>
</HOMETEAMSCORE>
<AWAYTEAMNAME ID="9742" SNAME="Korea DPR">Korea DPR</AWAYTEAMNAME>
−
<AWAYTEAMSCORE>
<SCORE AGG="">0</SCORE>
</AWAYTEAMSCORE>
<MATCHSTATUS STATUS="5"/>
</MATCHSUMMARY>
<MATCHDETAIL VENUENAME="Green Point Stadium" VENUEIMAGE="http://img.skysports.com/10/06/68x93/Green-Point-Stadium-South-Africa_2464128.jpg" ATTENDANCE="63644"/>
−
<LAST6>
−
<HOMETEAM ID="195">
<MATCH RESULT="win" HOMETEAMNAME="Portugal" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="7" AWAYTEAMSCORE="0"/>
<MATCH RESULT="draw" HOMETEAMNAME="Ivory Coast" AWAYTEAMNAME="Portugal" HOMETEAMSCORE="0" AWAYTEAMSCORE="0"/>
<MATCH RESULT="win" HOMETEAMNAME="Portugal" AWAYTEAMNAME="Mozambique" HOMETEAMSCORE="3" AWAYTEAMSCORE="0"/>
<MATCH RESULT="win" HOMETEAMNAME="Portugal" AWAYTEAMNAME="Cameroon" HOMETEAMSCORE="3" AWAYTEAMSCORE="1"/>
<MATCH RESULT="draw" HOMETEAMNAME="Portugal" AWAYTEAMNAME="Cape Verde Islands" HOMETEAMSCORE="0" AWAYTEAMSCORE="0"/>
<MATCH RESULT="win" HOMETEAMNAME="Portugal" AWAYTEAMNAME="China PR" HOMETEAMSCORE="2" AWAYTEAMSCORE="0"/>
</HOMETEAM>
−
<AWAYTEAM ID="9742">
<MATCH RESULT="lose" HOMETEAMNAME="Portugal" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="7" AWAYTEAMSCORE="0"/>
<MATCH RESULT="lose" HOMETEAMNAME="Brazil" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="2" AWAYTEAMSCORE="1"/>
<MATCH RESULT="lose" HOMETEAMNAME="Nigeria" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="3" AWAYTEAMSCORE="1"/>
<MATCH RESULT="draw" HOMETEAMNAME="Greece" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="2" AWAYTEAMSCORE="2"/>
<MATCH RESULT="lose" HOMETEAMNAME="Paraguay" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="1" AWAYTEAMSCORE="0"/>
<MATCH RESULT="draw" HOMETEAMNAME="South Africa" AWAYTEAMNAME="Korea DPR" HOMETEAMSCORE="0" AWAYTEAMSCORE="0"/>
</AWAYTEAM>
</LAST6>
−
<SQUADS>
−
<SQUAD TEAM="195" FORMATION="433" HOMEFLAG="1">
<PLAYER PLFORN="" PLSURN="Eduardo" PLID="131544" SQUADNO="1" RANK="1" CAPTAIN="No" SUB="No">Eduardo </PLAYER>
<PLAYER PLFORN="" PLSURN="Miguel" PLID="94251" SQUADNO="13" RANK="2" CAPTAIN="No" SUB="No">Miguel </PLAYER>
<PLAYER PLFORN="Ricardo" PLSURN="Carvalho" PLID="99309" SQUADNO="6" RANK="3" CAPTAIN="No" SUB="No">Ricardo Carvalho </PLAYER>
<PLAYER PLFORN="Bruno" PLSURN="Alves" PLID="168852" SQUADNO="2" RANK="4" CAPTAIN="No" SUB="No">Bruno Alves </PLAYER>
<PLAYER PLFORN="Alexandre" PLSURN="Fabio Coentrao" PLID="190752" SQUADNO="23" RANK="5" CAPTAIN="No" SUB="No">Alexandre Fabio Coentrao </PLAYER>
<PLAYER PLFORN="" PLSURN="Tiago" PLID="96219" SQUADNO="19" RANK="6" CAPTAIN="No" SUB="No">Tiago </PLAYER>
<PLAYER PLFORN="Pedro" PLSURN="Mendes" PLID="99312" SQUADNO="8" RANK="7" CAPTAIN="No" SUB="No">Pedro Mendes </PLAYER>
<PLAYER PLFORN="Raul" PLSURN="Meireles" PLID="126502" SQUADNO="16" RANK="8" CAPTAIN="No" SUB="No">Raul Meireles </PLAYER>
<PLAYER PLFORN="Cristiano" PLSURN="Ronaldo" PLID="94592" SQUADNO="7" RANK="9" CAPTAIN="No" SUB="No">Cristiano Ronaldo </PLAYER>
<PLAYER PLFORN="Hugo" PLSURN="Almeida" PLID="163879" SQUADNO="18" RANK="10" CAPTAIN="No" SUB="No">Hugo Almeida </PLAYER>
<PLAYER PLFORN="" PLSURN="Simao" PLID="94259" SQUADNO="11" RANK="11" CAPTAIN="No" SUB="No">Simao </PLAYER>
<PLAYER PLFORN="" PLSURN="Beto" PLID="154641" SQUADNO="12" RANK="12" CAPTAIN="No" SUB="Yes">Beto </PLAYER>
<PLAYER PLFORN="Paulo" PLSURN="Ferreira" PLID="99310" SQUADNO="3" RANK="13" CAPTAIN="No" SUB="Yes">Paulo Ferreira </PLAYER>
<PLAYER PLFORN="" PLSURN="Rolando" PLID="125249" SQUADNO="4" RANK="14" CAPTAIN="No" SUB="Yes">Rolando </PLAYER>
<PLAYER PLFORN="" PLSURN="Duda" PLID="97005" SQUADNO="5" RANK="15" CAPTAIN="No" SUB="Yes">Duda </PLAYER>
<PLAYER PLFORN="Liedson" PLSURN="da Silva Muniz" PLID="100957" SQUADNO="9" RANK="16" CAPTAIN="No" SUB="Yes">Liedson da Silva Muniz </PLAYER>
<PLAYER PLFORN="" PLSURN="Danny" PLID="126966" SQUADNO="10" RANK="17" CAPTAIN="No" SUB="Yes">Danny </PLAYER>
<PLAYER PLFORN="Miguel" PLSURN="Veloso" PLID="132102" SQUADNO="14" RANK="18" CAPTAIN="No" SUB="Yes">Miguel Veloso </PLAYER>
<PLAYER PLFORN="" PLSURN="Pepe" PLID="126237" SQUADNO="15" RANK="19" CAPTAIN="No" SUB="Yes">Pepe </PLAYER>
<PLAYER PLFORN="Ricardo" PLSURN="Costa" PLID="101563" SQUADNO="21" RANK="20" CAPTAIN="No" SUB="Yes">Ricardo Costa </PLAYER>
<PLAYER PLFORN="Ruben" PLSURN="Amorim" PLID="127221" SQUADNO="17" RANK="21" CAPTAIN="No" SUB="Yes">Ruben Amorim </PLAYER>
<PLAYER PLFORN="" PLSURN="Deco" PLID="99315" SQUADNO="20" RANK="22" CAPTAIN="No" SUB="Yes">Deco </PLAYER>
<PLAYER PLFORN="Daniel" PLSURN="Fernandes" PLID="138805" SQUADNO="22" RANK="23" CAPTAIN="No" SUB="Yes">Daniel Fernandes </PLAYER>
</SQUAD>
+
<SQUAD TEAM="9742" FORMATION="541" HOMEFLAG="0">
<PLAYER PLFORN="Ri Myong" PLSURN="Guk" PLID="187701" SQUADNO="1" RANK="1" CAPTAIN="No" SUB="No">Ri Myong Guk </PLAYER>
<PLAYER PLFORN="Jong-Hyok" PLSURN="Cha" PLID="167876" SQUADNO="2" RANK="2" CAPTAIN="No" SUB="No">Jong-Hyok Cha </PLAYER>
<PLAYER PLFORN="Chol-Jin" PLSURN="Pak" PLID="187707" SQUADNO="13" RANK="3" CAPTAIN="No" SUB="No">Chol-Jin Pak </PLAYER>
<PLAYER PLFORN="Jun-Il" PLSURN="Ri" PLID="167875" SQUADNO="3" RANK="4" CAPTAIN="No" SUB="No">Jun-Il Ri </PLAYER>
<PLAYER PLFORN="Yun-Nam" PLSURN="Ji" PLID="167866" SQUADNO="8" RANK="5" CAPTAIN="No" SUB="No">Yun-Nam Ji </PLAYER>
<PLAYER PLFORN="Kwang-Chon" PLSURN="Ri" PLID="167874" SQUADNO="5" RANK="6" CAPTAIN="No" SUB="No">Kwang-Chon Ri </PLAYER>
<PLAYER PLFORN="Yong-Hak" PLSURN="Ahn" PLID="167868" SQUADNO="17" RANK="7" CAPTAIN="No" SUB="No">Yong-Hak Ahn </PLAYER>
<PLAYER PLFORN="In-Guk" PLSURN="Mun" PLID="167872" SQUADNO="11" RANK="8" CAPTAIN="No" SUB="No">In-Guk Mun </PLAYER>
<PLAYER PLFORN="Nam-Chol" PLSURN="Pak" PLID="167484" SQUADNO="4" RANK="9" CAPTAIN="No" SUB="No">Nam-Chol Pak </PLAYER>
<PLAYER PLFORN="Yong-Jo" PLSURN="Hong" PLID="168073" SQUADNO="10" RANK="10" CAPTAIN="No" SUB="No">Yong-Jo Hong </PLAYER>
<PLAYER PLFORN="Tae-Se" PLSURN="Jong" PLID="190754" SQUADNO="9" RANK="11" CAPTAIN="No" SUB="No">Tae-Se Jong </PLAYER>
<PLAYER PLFORN="Myong-Gil" PLSURN="Kim" PLID="167865" SQUADNO="18" RANK="12" CAPTAIN="No" SUB="Yes">Myong-Gil Kim </PLAYER>
<PLAYER PLFORN="Kum-il" PLSURN="Kim" PLID="167873" SQUADNO="6" RANK="13" CAPTAIN="No" SUB="Yes">Kum-il Kim </PLAYER>
<PLAYER PLFORN="Chol-Hyok" PLSURN="An" PLID="167482" SQUADNO="7" RANK="14" CAPTAIN="No" SUB="Yes">Chol-Hyok An </PLAYER>
<PLAYER PLFORN="Kum-Chol" PLSURN="Choe" PLID="173853" SQUADNO="12" RANK="15" CAPTAIN="No" SUB="Yes">Kum-Chol Choe </PLAYER>
<PLAYER PLFORN="Nam-Chol" PLSURN="Pak" PLID="189952" SQUADNO="14" RANK="16" CAPTAIN="No" SUB="Yes">Nam-Chol Pak </PLAYER>
<PLAYER PLFORN="Yong-Jun" PLSURN="Kim" PLID="167485" SQUADNO="15" RANK="17" CAPTAIN="No" SUB="Yes">Yong-Jun Kim </PLAYER>
<PLAYER PLFORN="Song-Chol" PLSURN="Nam" PLID="167867" SQUADNO="16" RANK="18" CAPTAIN="No" SUB="Yes">Song-Chol Nam </PLAYER>
<PLAYER PLFORN="Chol-Myong" PLSURN="Ri" PLID="176320" SQUADNO="19" RANK="19" CAPTAIN="No" SUB="Yes">Chol-Myong Ri </PLAYER>
<PLAYER PLFORN="Kwang-Hyok" PLSURN="Ri" PLID="175920" SQUADNO="21" RANK="20" CAPTAIN="No" SUB="Yes">Kwang-Hyok Ri </PLAYER>
<PLAYER PLFORN="Kyong-Il" PLSURN="Kim" PLID="175921" SQUADNO="22" RANK="21" CAPTAIN="No" SUB="Yes">Kyong-Il Kim </PLAYER>
<PLAYER PLFORN="Sung Hyok" PLSURN="Pak" PLID="189953" SQUADNO="23" RANK="22" CAPTAIN="No" SUB="Yes">Sung Hyok Pak </PLAYER>
<PLAYER PLFORN="Myong-Won" PLSURN="Kim" PLID="167864" SQUADNO="20" RANK="23" CAPTAIN="No" SUB="Yes">Myong-Won Kim </PLAYER>
</SQUAD>
</SQUADS>
−
<EVENTS>
<EVENT TIME="90" TYPE="7" PLAYER=""/>
<EVENT TIME="89" TYPE="1" TEAMFLAG="0" PLID="96219" PLAYER="Tiago"/>
<EVENT TIME="87" TYPE="1" TEAMFLAG="0" PLID="94592" PLAYER="Ronaldo"/>
<EVENT TIME="81" TYPE="1" TEAMFLAG="0" PLID="100957" PLAYER="da Silva Muniz"/>
<EVENT TIME="77" TYPE="9" TEAMFLAG="0" PLID="163879" SUBON="100957" PLAYER="Almeida"/>
<EVENT TIME="75" TYPE="9" TEAMFLAG="1" PLID="167876" SUBON="167867" PLAYER="Cha"/>
<EVENT TIME="74" TYPE="9" TEAMFLAG="0" PLID="94259" SUBON="97005" PLAYER="Simao"/>
<EVENT TIME="70" TYPE="9" TEAMFLAG="0" PLID="126502" SUBON="132102" PLAYER="Meireles"/>
<EVENT TIME="70" TYPE="4" TEAMFLAG="0" PLID="163879" INFO="Dissent" PLAYER="Almeida"/>
<EVENT TIME="60" TYPE="1" TEAMFLAG="0" PLID="96219" PLAYER="Tiago"/>
<EVENT TIME="58" TYPE="9" TEAMFLAG="1" PLID="167872" SUBON="167485" PLAYER="Mun"/>
<EVENT TIME="58" TYPE="9" TEAMFLAG="1" PLID="167484" SUBON="167873" PLAYER="Pak"/>
<EVENT TIME="56" TYPE="1" TEAMFLAG="0" PLID="163879" PLAYER="Almeida"/>
<EVENT TIME="53" TYPE="1" TEAMFLAG="0" PLID="94259" PLAYER="Simao"/>
<EVENT TIME="48" TYPE="4" TEAMFLAG="1" PLID="168073" INFO="Dissent" PLAYER="Hong"/>
<EVENT TIME="45" TYPE="14" PLAYER=""/>
<EVENT TIME="45" TYPE="6" PLAYER=""/>
<EVENT TIME="38" TYPE="4" TEAMFLAG="0" PLID="99312" INFO="Unsporting behaviour" PLAYER="Mendes"/>
<EVENT TIME="33" TYPE="4" TEAMFLAG="1" PLID="187707" INFO="Unsporting behaviour" PLAYER="Jin"/>
<EVENT TIME="29" TYPE="1" TEAMFLAG="0" PLID="126502" PLAYER="Meireles"/>
<EVENT TIME="0" TYPE="10" PLAYER=""/>
</EVENTS>
−
<STATS>
−
<TEAM ID="195">
<STAT TYPE="1">7</STAT>
<STAT TYPE="2">11</STAT>
<STAT TYPE="3">15</STAT>
<STAT TYPE="4">5</STAT>
<STAT TYPE="5">3</STAT>
<STAT TYPE="6">58.6</STAT>
<STAT TYPE="7">2</STAT>
<STAT TYPE="8"/>
<STAT TYPE="9">4</STAT>
<STAT TYPE="10">17</STAT>
<STAT TYPE="11">82.00</STAT>
</TEAM>
−
<TEAM ID="9742">
<STAT TYPE="1">0</STAT>
<STAT TYPE="2">3</STAT>
<STAT TYPE="3">11</STAT>
<STAT TYPE="4">1</STAT>
<STAT TYPE="5">3</STAT>
<STAT TYPE="6">41.4</STAT>
<STAT TYPE="7">2</STAT>
<STAT TYPE="8"/>
<STAT TYPE="9">17</STAT>
<STAT TYPE="10">4</STAT>
<STAT TYPE="11">74.49</STAT>
</TEAM>
</STATS>
<VALID>1</VALID>
</DETAILS>
<DETAILS>
.....
</DETAILS>
<DETAILS>
....
</DETAILS>
</XML>
mason@thecodingdesigner.com’s picture

#75 works for me. Thanks Monkey Master!

ChaosD’s picture

twistor’s picture

It is related to the problem at hand. Currently (if you check out from cvs) XPath, Regex, and QueryPath are supported. I think I solved the namespace issues that exist with SimpleXML. I was unaware of this thread when I started this project, but it seems that the amount of duplicated work is minimal.

ChaosD’s picture

maybe you should consider merging your projects as an extensible parser for feeds

twistor’s picture

I'm not opposed to that. I've been thinking about breaking up the functionality of Feeds XPath Parser into different modules, however, I'd like to enable per-field query types. Such that you can use XPath for one field, regex for the second, and QueryPath for the third. The name is not so great for the functionality I do admit. The only thing about merging is that Feeds XML Parser configures the queries in the mapper and Feeds XPath Parser configures them at the endpoint. One allows for greater flexibility, one allows for more control of end users. I'm open to ideas.

ChaosD’s picture

maybe you could call it feeds flexible parser as a collection of all those methods you mentioned. iam currently happy with the configuration in the mapper but i have to admit that iam not too familiar with XPath parser

meatbag’s picture

The "entensible" XML parser supports only xpath
while the XML "Xpath" parser supports xpath, querypath and regex...

Really confusing...These two modules should swap their names...

blasthaus’s picture

are there any immediate plans to finally integrate namespace support? or any tips to get several different namespaces registered or even just hardcoded for now? lookin' fwd.

wil

mokko’s picture

As mentioned frequently in this thread, Chrisirhc's version supports namespaces basically: http://github.com/chrisirhc/feeds_xmlparser

Did you already check out if this version meets your requirements? If I remember correctly it does handle multiple namespaces well, but has problems with default namespaces. Please be more specific.

Also: I haven't had a chance to look at the new http://drupal.org/project/feeds_xpathparser if this might be an alternative.

fereira’s picture

I am having some problems with namespaces as well.

The XML that I'm using defines several names spaces. Here is a small snippet of the code

<ags:resources
xmlns:ags="http://purl.org/agmes/1.1/"
xmlns:dc="http://purl.org/dc/elements/1.1/"
xmlns:agls="http://www.naa.gov.au/recordkeeping/gov_online/agls/1.2"
xmlns:dcterms="http://purl.org/dc/terms/">
<ags:resource ags:ARN="KE2009400905">
....
</ags:resource>

First of all when the the $namespaces = $xml->getNamespaces(true);
line gets executed it silently ignores the "agls" name space. I poked around a bit and found that the uri for that namespace doesn't appear to be valid anymore. If I change it to http://www.naa.gov.au/agls/ then the prefix and namespace showed up in the $namespaces array.

Secondly, I haven't been able to come up with some XPath yet that will parse the attribute of the each of the ags:resource elements. Even when I've hardcoded the registerXPathNamespace function calls with that correct list of namespaces I get the following errors:

warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: namespace error : Namespace prefix ags for ARN on resource is not defined in /www/html/feedsdev/sites/all/modules/feeds_xmlparser/FeedsXMLParser.inc on line 57.

That's failing when the $xml = new SimpleXMLElement($item); line is executed in the getSourceElement function.

The problem for me is that the attribute what I need to use for the unique GUID mapping.

I wonder if doing something like the evoc module is doing might work. It keeps a couple of database tables for rdf name spaces and uses a simple form element for adding the prefix and namespaces that are used by the rdfcck module for assigning rdf classes and properties.

mokko’s picture

From the xml snippet you are posting I would assume that you want to set xpath to ags:resources and

resource/@ARN as GUID (unique target).

(assuming that you do really have recourse tags inside resource tags as in the snippet).

I haven't used xmlparser with namespaces for attributes.

I don't know about your concrete problem with the agls namespace. Can you find something here http://www.php.net/manual/en/book.simplexml.php?

fereira’s picture

Yes, each ags:resource element is a unique resource identified by the ags:ARN attribute in that tag and they are nested within an ags:resources element (that's resources, not resource). Sorry, I didn't provide the closing ags:resources tag. Here's a real example:

http://mayfly.mannlib.cornell.edu/agrisdata/agris.xml

That content is using a schema developed by FAO of the UN for identifying resources held in an Agriculture Information Systems and is in use by dozens of institutions worldwide.

There are other issues that seem to be causing problems with the module as well. I notice in the feeds_xpathparser module that for each field one can specify whether or not to show the raw content. When parsing this xml using that module I had to tell it *not* to show raw xml for any element that had content wrapped with a CDATA tag.

The dc:description element for at least one of the resources contains html entiies (‘, ’) and I'm seeing errors on those elements as well.

In any case, I think a handbook with lots of examples of XML content with more complex structures with examples of XPath for parsing them, will go a long way to demonstrating just how flexible the module can be and may require code changes to handle all the possible *valid* xml constructs.

fereira’s picture

Forgot to comment on your suggestion about the simplexml book. I had looked at the site but it didn't really answer all my questions.

I'm not so concerned about the agls namespace as that can be corrected in the Agris documentation, though it would seem to me that the getNamespace function should produce an error (and return false) if one of the namespace uris doesn't resolve correctly rather than just excluding it from the list.

More concerning, to me, is that when it's instantiating the new SimpleXMLElement class that it generates a warning about the ags namespace. I didn't see anything in the simplexml book but can a new SimpleXMLElement be instantiated with contains namespaces?

blasthaus’s picture

i am using chrisirhc version and its not recognizing my media namespace, alas

<rss version="2.0" xmlns:media="http://search.yahoo.com/mrss/" xmlns:sc="http://www.screencast.com/" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
   <channel>
      <title>Title here</title>
      <link>http://www.screencast.com/users/username/rss</link>
      <atom:link rel="next" href="http://www.screencast.com/username/rss/skip=50" />
      <atom:link href="http://www.screencast.com/users/username/rss" rel="self" type="application/rss+xml" />
      <sc:totalAvailable>62</sc:totalAvailable>

      <description>Video tutorials on ...</description>
      <pubDate>Tue, 13 Jul 2010 14:33:31 GMT</pubDate>
      <item>
         <title>Title here</title>
         <description />
         <pubDate>Tue, 13 Jul 2010 14:33:31 GMT</pubDate>
         <media:thumbnail width="70" height="70" url="http://thumb.screencast.com/thumb.gif" />
         <media:content url="http://content.screencast.com/users/RegressionBands2.swf" width="1008" height="658" duration="267" fileSize="6913835" type="application/x-shockwave-flash" />
         <enclosure url="http://content.screencast.com/users/RegressionBands2.swf" type="application/x-shockwave-flash" length="6913835" />
         <link>http://www.screencast.com/users/media/f6cb7c7f-c98f-4eb9-bd2d-196b5a0a5068</link>
         <guid>http://www.screencast.com/users/media/f6cb7c7f-c98f-4eb9-bd2d-196b5a0a5068/yes-its-unique</guid>
      </item>

maybe i'm not doing something right?!

thx-wil

meatbag’s picture

#166
The urls are set as attribute of the tag, so they can't be parsed out.
You need to write a customized version to do that.

podox’s picture

#166 Try...

*[name()='media:content']/@url
*[name()='media:content']/@duration
*[name()='media:content']/@fileSize

...etc. Use /rss/channel/item as the Xpath setting

Vaene’s picture

Having same issues as #166, tried


[name()='media:thumbnail']/@url    =>  Thumbnail  (cck text field)

but got this error for each attempted import:

warning: SimpleXMLElement::xpath() [simplexmlelement.xpath]: xmlXPathEval: evaluation failed in /var/www/vhosts/mysite.com/httpdocs/sites/all/modules/feeds_xmlparser/FeedsXMLParser.inc on line 52.

it imported other non-namespaced tags in the item tag so my original Xpath setting is correct I am assuming.

Sorry if I missed this before in the chain but could you give us a hint as to how to go about writing a custom parser that would specifically work for capturing attributes of media namespaced tags?

<item>
         <title>Title here</title>
         <description />
         <pubDate>Tue, 13 Jul 2010 14:33:31 GMT</pubDate>
         <media:thumbnail width="70" height="70" url="http://thumb.screencast.com/thumb.gif" />
         <media:content url="http://content.screencast.com/users/RegressionBands2.swf" width="1008" height="658" duration="267" fileSize="6913835" type="application/x-shockwave-flash" />
         <enclosure url="http://content.screencast.com/users/RegressionBands2.swf" type="application/x-shockwave-flash" length="6913835" />
         <link>http://www.screencast.com/users/media/f6cb7c7f-c98f-4eb9-bd2d-196b5a0a5068</link>
         <guid>http://www.screencast.com/users/media/f6cb7c7f-c98f-4eb9-bd2d-196b5a0a5068/yes-its-unique</guid>
      </item>

Fidelix’s picture

OMG i want this badly. Is it working ATM ?

blasthaus’s picture

thx podox, i tried that and get this error

warning: SimpleXMLElement::__construct() [simplexmlelement.--construct]: namespace error : Namespace prefix media on thumbnail is not defined in /Users/me/Sites/this_site/sites/default/modules/feeds_xmlparser/FeedsXMLParser.inc on line 52.
 

i assume this is where we need to be looking around line 28 of FeedsXMLParser.inc


    if (version_compare(PHP_VERSION, '5.2.0', '>=')) {
      //Fetch all namespaces
      $namespaces = $xml->getNamespaces(true);

      //Register them with their prefixes
      foreach ($namespaces as $prefix => $ns) {
/*        $xml->registerXPathNamespace($prefix, $ns); */
        $xml->registerXPathNamespace($prefix, $ns);
      }

question is how to get url attributes within a namespace tag, however it doesn't even look as if the namespace is getting recognized here.

thx-w

meatbag’s picture

To deal with attributes and namespace better, i suggest using QueryPath as the parsing interface.

Here's an article by the QueryPath author which helps a lot.
http://www.ibm.com/developerworks/opensource/library/os-php-querypath/

podox’s picture

#169 There is an asterisk before [name()='media:thumbnail']/@url - could you try again?

SeanBannister’s picture

@meatbag QueryPath would be awesome, it's amazingly powerful and surprisingly much simpler.

twistor’s picture

No to toot my own horn, the feeds_xpathparser module has tentative support for QueryPath in dev. I need people to test it out. All the features aren't currently implemented, but anything you can do with XPath should be accomplishable at this point. I agree that the syntax is much simpler plus it avoids having to learn yet another syntax.

arski’s picture

hmm, any chance of actually having this module on d.o. anytime soon? You don't have to make a stable release straight away, but it would be nice to have the code in a proper place..

also what goes for the issues - it would be really great if you could take a look at the other issues reported for this project in here, and also maybe we should start splitting them a bit more instead of continuing in this 175-reply thread :o

Looking forward to this module!

Cheers

meatbag’s picture

I modified the module to use QueryPath as parsing interface. It works perfectly for me.
Here's the code.
Note: This just serves as an example for those who want to know how QueryPath works.
You may need further modification before using it on your own server.

// $Id$
require_once 'QueryPath.php';
require_once 'QPXML.php';
/**
 * Parses a given file as an XML document.
 */
class FeedsXMLParser extends FeedsParser {
  /**
   * Implementation of FeedsParser::parse().
   */
  public function parse(FeedsImportBatch $batch, FeedsSource $source) {
    if (empty($this->config['xpath'])) {
      throw new Exception(t('Please set xpath for items in XML parser settings.'));
    }
    $temp = realpath($batch->getFilePath());
    if (!is_file($temp)) {
      throw new Exception(t('File %name not found.', array('%name' => $batch->getFilePath())));
    }
    $file = QueryPathEntities::replaceAllEntities(file_get_contents($temp));
    foreach (qp($file, $this->config['xpath']) as $qpitem) {
      $result[] = $qpitem;
    }
    if (!is_array($result)) {
      throw new Exception(t('Xpath %xpath in XML file %file failed.', array('%xpath' => $this->config['xpath'], '%file' => $batch->getFilePath())));
    }
    $batch->setItems($result);
  }
  function getSourceElement($item, $element_key) {
    switch ($element_key) {
      case 'img':
        foreach (qp($item, 'content|encoded') as $qpitem) {
          preg_match('/<img.*?src="(.*?)"/', strtolower($qpitem->cdata()), $itemmatch);
          $ext = substr(trim($itemmatch[1]), -4);
          if ($ext == '.jpg' or $ext == '.png') {
            $result[] = $itemmatch[1];
          }
        }
        break;
      case 'dc|date':
        foreach (qp($item, 'dc|date') as $qpitem) {
          $result[] = strtotime($qpitem->text());
        }
        break;
      default:
        foreach (qp($item, $element_key) as $qpitem) {
          $result[] = $qpitem->text();
        }
    }
    if (is_array($result)) {
      $values = array();
      foreach ($result as $obj) {
        $values[] = (string)$obj;
      }
      if (count($values) > 1) {
        return $values;
      }
      elseif (count($values) == 1) {
        return $values[0];
      }
    }
    return '';
  }
  public function configDefaults() {
    return array(
      'xpath' => 'item',
    );
  }
  public function configForm(&$form_state) {
    $form = array();
    $form['xpath'] = array(
      '#type' => 'textfield',
      '#title' => t('XPath'),
      '#default_value' => $this->config['xpath'],
      '#description' => t('XPath to locate items.'),
    );
    return $form;
  }
}
robbertnl’s picture

Does this work with batch support as wel (see http://drupal.org/node/744660)?

fereira’s picture

I'll toot your horn for you. The latest version of the feeds_xpathparser works really well. It would probably worth going through this issue to see if there is anything that could be added to the xpath parser but as I see it, there isn't any reason to keep this (feeds_xmlparser) around as the feeds_xpathparser does about everything one would need, its' easy to use, and it's even fairly well documented.

twistor’s picture

Status: Needs review » Closed (fixed)

Cleaning old issues.