| Project: | Feedparser |
| Version: | 4.7.x-1.x-dev |
| Component: | feedaggregator_node |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (fixed) |
Issue Summary
Starting with the update to the feedparser module that includes feedaggregator_node.module 1.42.2.15, all images get stripped out (which is not really a bad thing) and most HTML elements get stripped. I have found <div> pairs in some nodes, though. Special characters get munged because the ampersand gets replaced with &amp;. And paragraphs often get run together, of course, when the <p> elements are removed. Links are removed as well. Also, my input format allows HTML and hasn't changed.
Also, for many items, the full page has the same content as the excerpt even though the XML file shows otherwise. But not always. I'm still narrowing this down. Feedburner feeds seem to always get munged. Some others may actually work fine, but I only have about four days' worth of data to compare so far.
The following Feedburner feed is the most munged. The version of feedparser that was current about two weeks ago was pulling in everything from this feed, including images, but now it only gets the first 300 to 400 characters and strips out all formatting. http://feeds.feedburner.com/getreligion/DmXm
btw, thanks for a great module. I'll do more looking and add what I find out. But I thought that there must be others affected by this, too. Haven't used the developer module yet, so I'll install that and see if it gives better insight.
bill
Comments
#1
Viewing http://feeds.feedburner.com/getreligion/DmXm in Firefox's built-in RSS formatter highlights a number of double-encoded HTML entities within the content descriptions.
Seems to be related to the fancy Microsoft (non UTF-8 standard) open and close double quotes etc.
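To illustrate what double encoding looks like, here is a minimal Python sketch (not the module's PHP code). The sample string is made up for illustration, and `unescape_fully` is a hypothetical helper name; only `html.unescape` comes from the Python standard library.

```python
import html


def unescape_fully(text, max_rounds=5):
    """Unescape HTML entities repeatedly until the text stops changing.

    Hypothetical helper: max_rounds is an arbitrary safety cap.
    """
    for _ in range(max_rounds):
        decoded = html.unescape(text)
        if decoded == text:
            break
        text = decoded
    return text


# Made-up sample: curly quotes escaped as numeric entities, then the
# whole string escaped a second time, so every "&" became "&amp;".
double_encoded = "&amp;#8220;Hello&amp;#8221; &amp;amp; goodbye"

# One pass only undoes the outer layer of escaping.
once = html.unescape(double_encoded)

# Repeated passes recover the intended characters.
fully = unescape_fully(double_encoded)

print(once)
print(fully)
```

Decoding until nothing changes is only safe when the text is known to be over-escaped; applied to correctly escaped content it would also decode entities the author meant to show literally.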
My personal sites running the same Feedparser code are pulling in HTML with images and tags as usual, and I use Drupal's filter to strip out some tags I don't want.
I also noticed that Firefox takes the same text from the RSS feed as Feedparser does: the 'summary' text is used in preference to the full 'content' provided in the feed. This must be something that has recently changed in the SimplePie engine.
#2
Yeah, and these seem to have started at about the same time too, which gave the appearance of being related to the module update when they probably are not since I don't see them in any other feed so far. So, it seems that this source has changed some things on his end that are affecting his feed.
Sorry about that. I should have picked another feed as an example, but this one seemed to have all the problems in one place.
Nevertheless, the problem with cutting off the full post remains, and it seems consistent with Feedburner and TypePad sources. Here's another example that seems to show that only the excerpt is coming in and the full page is dropped. Only the excerpt gets syndicated from this feed: http://slacktivist.typepad.com/slacktivist/index.rdf
Am I missing a switch?
#3
Just committed a small change to CVS 4.7-dev which fixes the full content being ignored in the feed. It seems SimplePie introduced two functions for getting content out of an item.
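The general idea of preferring an item's full content over its summary can be sketched as follows. This is a Python illustration under stated assumptions, not the committed PHP change or SimplePie's API: `item_body` is a hypothetical helper, the sample feed is made up, and it assumes an RSS 2.0 feed that carries the full post in `content:encoded` (an RDF feed would use namespaced item elements instead).

```python
import xml.etree.ElementTree as ET

# Namespace of the RSS content module, which defines content:encoded.
CONTENT_NS = "http://purl.org/rss/1.0/modules/content/"


def item_body(item):
    """Return the full content if present, else fall back to the summary."""
    full = item.findtext(f"{{{CONTENT_NS}}}encoded")
    if full and full.strip():
        return full
    return item.findtext("description") or ""


# Made-up feed: one item with both fields, one with a summary only.
SAMPLE = """<rss version="2.0"
     xmlns:content="http://purl.org/rss/1.0/modules/content/">
  <channel>
    <item>
      <title>Both fields</title>
      <description>Short excerpt.</description>
      <content:encoded><![CDATA[<p>Full post with <a href="http://example.com/">a link</a>.</p>]]></content:encoded>
    </item>
    <item>
      <title>Summary only</title>
      <description>Only an excerpt here.</description>
    </item>
  </channel>
</rss>"""

root = ET.fromstring(SAMPLE)
bodies = [item_body(item) for item in root.iter("item")]

for body in bodies:
    print(body)
```

Checking the full-content field first and falling back to the summary means feeds that only publish excerpts still produce a body, while feeds that publish both are not truncated.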
#4
Works great! Many thanks.