Project: Feedparser
Version: 4.7.x-1.x-dev
Component: feedaggregator_node
Category: bug report
Priority: normal
Assigned: Unassigned
Status: closed (fixed)

Issue Summary

Starting with the update to the feedparser module that includes feedaggregator_node.module 1.42.2.15, all images get stripped out (which is not really a bad thing) and most HTML elements get stripped as well. I have found <div> pairs in some nodes, though. Special characters get munged because the ampersand gets replaced with &amp;. Paragraphs often get run together, of course, when the <p> elements are removed. Links are removed as well. My input format allows HTML and hasn't changed.

Also, for many items, the full page has the same content as the excerpt even though the XML file shows them as different. But not always. I'm still narrowing this down. Feedburner feeds seem to always get munged. Some others may actually work fine, but I only have about four days' worth of data to compare so far.

The following feedburner feed is the most munged. The version of feedparser that was current about 2 weeks ago was pulling in everything, including images, from this feed, but now it only gets the first 300 to 400 characters and strips out all formatting. http://feeds.feedburner.com/getreligion/DmXm

btw, thanks for a great module. I'll do more looking and add what I find out. But I thought that there must be others affected by this, too. Haven't used the developer module yet, so I'll install that and see if it gives better insight.

bill

Comments

#1

Viewing http://feeds.feedburner.com/getreligion/DmXm in Firefox's built-in RSS formatter highlights a number of double-encoded HTML entities within the content descriptions.

Seems to be related to the fancy Microsoft (non UTF-8 standard) open and close double quotes etc.
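To illustrate what a double-encoded entity looks like, here is a minimal Python sketch using only the standard library. The sample string and the helper name `decode_fully` are made up for illustration; they are not taken from the feed or from the module's code.

```python
import html

def decode_fully(text, max_passes=5):
    """Unescape HTML entities repeatedly until the text stops changing."""
    for _ in range(max_passes):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text

# Hypothetical sample: curly quotes written as numeric entities
# (&#8220; and &#8221;) whose ampersands were then escaped a second time.
double_encoded = "&amp;#8220;severe poverty&amp;#8221;"

once = html.unescape(double_encoded)   # one pass still leaves entities as visible text
fully = decode_fully(double_encoded)   # a further pass yields the real quote characters
```

Decoding repeatedly like this is only a diagnostic aid: it would also over-decode text that is legitimately meant to display an entity literally, so it probably should not be applied blindly to every feed.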

My personal sites running the same Feedparser code are pulling in HTML with images and tags as usual, and I use Drupal's filter to strip out some tags I don't want, too.

I also noticed that Firefox is taking the same text from the RSS feed as Feedparser does: the 'summary' text is used in preference to the full 'content' provided in the feed. This must be something that has recently changed in the SimplePie engine.

#2

. . .highlights a number of double encoded HTML entities within the content descriptions.

Seems to be related to the fancy Microsoft (non UTF-8 standard) open and close double quotes etc.

Yeah, and these seem to have started at about the same time too, which gave the appearance of being related to the module update when they probably are not since I don't see them in any other feed so far. So, it seems that this source has changed some things on his end that are affecting his feed.

Sorry about that. I should have picked another feed as an example, but this one seemed to have all the problems in one place.

Nevertheless, the problem with cutting off the full post remains, and it seems consistent with feedburner and typepad sources. Here's another example that seems to show that only the excerpt is coming in and the full page is dropped. Following is all that gets syndicated from this feed: http://slacktivist.typepad.com/slacktivist/index.rdf

David Kurtz at Talking Points Memo cites a statistic that knocked me back on my heels: Sixteen million Americans live in "severe poverty," defined as individuals making less than $5,080 annually and families of four making less than $9,903. DK...

Am I missing a switch?

#3

Just committed a small change to CVS 4.7-dev which fixes the full content being ignored in feeds. Seems SimplePie introduced two functions for getting content out of an item.
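The committed change is not shown in this thread. As a rough illustration of the idea only (prefer an item's full content and fall back to the summary), here is a Python sketch using the standard library. The sample feed and the helper name `item_body` are hypothetical; this is not the module's code, which is PHP and relies on SimplePie, and it assumes a plain RSS 2.0 feed with a `content:encoded` element rather than an RDF feed like the typepad one above.

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS "content" module, used for full-text item bodies.
CONTENT_NS = "{http://purl.org/rss/1.0/modules/content/}"

# Hypothetical feed item carrying both a short excerpt and the full post.
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Example post</title>
      <description>Short excerpt...</description>
      <content:encoded><![CDATA[<p>Full post.</p><p>Second paragraph.</p>]]></content:encoded>
    </item>
  </channel>
</rss>"""

def item_body(item):
    """Return the full content if present, otherwise the summary."""
    full = item.findtext(CONTENT_NS + "encoded")
    if full and full.strip():
        return full
    return item.findtext("description", default="")

root = ET.fromstring(SAMPLE)
bodies = [item_body(item) for item in root.iter("item")]
```

With this ordering, an item that only has a description still gets its summary, while an item that has both keeps its markup and full text.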

#4

Works great! Many thanks.

#5

Status: active » fixed

#6

Status: fixed » closed (fixed)