I have a need to retrieve VMN elements from feeds (sample: http://www.cbsnews.com/feeds/rss/mobile/politics.rss). A simple call to simplexml_load_string() doesn't do the job, you've got to go looking for the particular namespaces referenced. Attached is a first pass at this for the RSS 2.0 feeds using the common syndication parser. I haven't dealt with parsing feeds before, so all possible eyes are welcome on this.

Doing this in all cases adds overhead - it may be a good idea to make this a per-feed option, only enable it if you know you need the additional elements.

Now to figure out how to take advantage of these elements in FEMP...

Comments

moshe weitzman’s picture

the patch is not attached.

mikeryan’s picture

StatusFileSize
new1.74 KB

OK, trying again..

alex_b’s picture

Version: 5.x-1.x-dev » 6.x-1.x-dev
Component: Code » Code parser_common

This looks like a sound approach. Bumping up to 6.x

likewhoa’s picture

StatusFileSize
new1.38 KB

patch in #2 didn't apply cleanly in my drupal-6.11 setup with feedapi, so I am attaching one that should apply cleanly.

alex_b’s picture

Status: Needs review » Needs work

Only seeing this now: #2/#4 don't adhere to http://drupal.org/coding-standards .

Further: does this code not provoke namespace collisions? Harvested items could be keyed by their namespaces on the $item->options->raw array.

cedarm’s picture

StatusFileSize
new642 bytes
new966 bytes

I've been working with simplepie and have the same need. I assume common syndication parser is basically the same in concept, just differing in implementation. I started out with merging namespaces, then though it was probably a bad idea:

  raw->link
  raw->title
  raw->customtaga
  raw->customtagb

So, I agree with keying by namespace, but I'm unsure of how we should handle the feed's namespace. Assume "customns" is a valid namespace URI. Should we end up with

  raw->link
  raw->title
  raw->customns->customtaga
  raw->customns->customtagb

or

  raw->http://purl.org/dc/elements/1.1/->link
  raw->http://purl.org/dc/elements/1.1/->title
  raw->customns->customtaga
  raw->customns->customtagb

or

  raw->link
  raw->title
  rawns->customns->customtaga
  rawns->customns->customtagb

The first option is the most immediately backwards compatible while the second is probably most correct (for a low level function like parser_simplepie_simplify_raw_item()), but (I think) would break things and require some mappings to be redone. I opted for the third with my patch, but it also requires modification to any code which uses parser_simplepie_simplify_raw_item(). feedapi-199360-6b.patch is a patch for Feed Element Mapper.

I'm not saying my way is best... Just looking for some feedback and discussion.

aron novak’s picture

Title: Retrieve namespace elements » Add namespace support to Parser Common Syndication
Status: Needs work » Needs review
StatusFileSize
new2.36 KB

likewhoa: I modified your patch to avoid namespace collisions and enforce drupal conding standards. Also I introduced a function do to this namespace-handling. Can you review it? How does this patch fits your needs?

cedarm: Yes, you're right, this is the same for simplepie. However it would be good to keep the thread on track, i mean this ticket is assigned to "code parser_common" component and i'd say it's not the best to mix patches for different part of code. So let's continue the simplepie-related things here: #552092: Add namespace support to Parser SimplePie

cedarm’s picture

Status: Needs review » Needs work
StatusFileSize
new2.46 KB

No problem. Let's figure out how best to represent the ns data in options, then I'll post an updated patch to the new issue.

If I read your patch correctly, the namespaced data goes into options->ns->{namespace}->? Does it even make sense to try to keep SimplePie to the same structure?

...
Just gave #7 patch a test drive. It looks like _parser_common_syndication_namespaces_data() could use some work. Some items I expected to be in options weren't. I'm no expert with simplexml, so I can't help much here.

Also, it looks like this patch only works for RSS2 feeds, since the new function is only called inside of _parser_common_syndication_RSS20_parse(). I was able to get some atom1.0 things to work.

Patch to my current code is attached, but _parser_common_syndication_namespaces_data() replaced with other code which only works somewhat. Its only value is probably just the first two hunks for atom.

aron novak’s picture

Status: Needs work » Postponed (maintainer needs more info)

You're right, it wasn't nice to ignore non-RSS type feeds

"Some items I expected to be in options weren't." - Can you tell me a bit more about the test drive? Maybe a feed URL can be helpful.

I'm afraid that the solution needs to be more complex. To process an XML tree, a tree traversal needs to be implemented.
So yeah, this definitely needs work.

After i receive more details about the problems that you faced w/ #7, i'm happy to prepare a new patch w/ full tree traversal and so on.

cedarm’s picture

"Some items I expected to be in options weren't." - I don't remember exactly, and the test URL isn't public, but it was an atom feed with some namespaced elements. What I remember from looking at _parser_common_syndication_namespaces_data() (from #7) is that cases weren't covered. I think with this feed is_object($childvalue) always evaluated TRUE, and is_object($children->$childkey) was only sometimes TRUE. So when (is_object($childvalue) && !is_object($children->$childkey)), those items were skipped. I hope that's clearer than mud. Probably a lot of the issue is just how simplexml works.

FWIW parser_simplepie_simplify_raw_item() seems to work quite well, but that's probably due to _feedapi_mapper_obj2array() running before it. I think this avoids the simplexml issues. Perhaps Common Syndication could/should take a similar "obj -> array -> tree traversal" approach?