I'm using FeedAPI to create nodes from various RSS feeds.

Some of the feeds produce sloppy node descriptions, so the node body ends up containing many undesirable anchor tags.

What would be the best way to strip these anchor tags out?

Comments

emackn

Did you ever dig up anything about this?

jghyde

I ran into a similar problem, only I wanted to strip out all the image tags embedded in the $content variable of node.tpl.php. So I looked at the raw HTML and saw that each image was wrapped in a div. It looked like this:

<div style="width: 604px" class="image-attach-body"><a href="/image/joe-hyde"><img src="http://www.hydeinteractive.com/sites/default/files/images/me1.jpg" alt="Joe Hyde" title="Joe Hyde"  class="image image-preview " width="604" height="402" /></a></div>

The common traits? Each image was wrapped inside a div with the same class name:

<div ... class="image-attach-body" ...> ... </div>

And so, not wanting to get into the preprocess stuff in template.php, I decided that the quickest way to get rid of those images was the good ol' PHP function preg_replace. All I had to do was find a regex pattern that matched a div with that class name and replace the tag and everything in it (the images) with, well, nuthin' (i.e. "").

I searched for a good replacement regexp (regex) on the Web.

This is a very good regular expression tool to test your output: http://regex.larsolavtorvik.com/
and I found a good pattern here:
http://stackoverflow.com/questions/226562/how-can-i-remove-an-entire-htm...

I then applied the pattern to the $content variable.

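$content = preg_replace('/<div[^>]*class="image-attach-body"[^>]*>(.*?)<\/div>/is', '', $content);

Note: the "i" means to ignore case, and the "s" lets the "." match newlines, so the pattern still works when the div spans several lines. (The "m" modifier only changes how ^ and $ behave, and this pattern doesn't use them.) Because ".*?" stops at the first </div>, this assumes no other divs are nested inside the wrapper.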
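To try the replacement outside of a template, here is a minimal, self-contained sketch. The function name strip_image_attach_divs and the sample markup are made up for illustration. It uses the "s" modifier, which is what allows "." to cross line breaks, and it assumes the wrapper div contains no nested divs.

```php
<?php
// Hypothetical helper name; wraps the preg_replace call so it can be tested on its own.
function strip_image_attach_divs($content) {
  // "i" ignores case; "s" lets "." match newlines so multi-line divs are caught.
  // The lazy ".*?" stops at the first </div>, so nested divs are not handled.
  return preg_replace('/<div[^>]*class="image-attach-body"[^>]*>.*?<\/div>/is', '', $content);
}

// Sample markup (an assumption), with the image div spread over several lines.
$content = "<p>Intro text.</p>\n"
  . "<div style=\"width: 604px\" class=\"image-attach-body\">\n"
  . "<a href=\"/image/joe-hyde\"><img src=\"me1.jpg\" alt=\"Joe Hyde\" /></a>\n"
  . "</div>\n"
  . "<p>Body text.</p>";

// The two paragraphs remain; the image div and everything in it is gone.
print strip_image_attach_divs($content);
```

If the markup on your site nests divs inside the wrapper, a regex like this will cut off too early, and an HTML parser is probably the safer route.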

It worked!

Now, when you

print $content;

inside node.tpl.php, it no longer displays the images!

And I am so stoked.
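For the anchor tags in the original question, the same preg_replace idea should carry over. Below is a rough sketch, not a tested FeedAPI recipe. The function name strip_anchor_tags and the sample body are hypothetical, and where you call it (a template, a node hook, and so on) depends on your site. It removes only the opening and closing a tags and keeps the link text.

```php
<?php
// Hypothetical helper: removes <a ...> and </a> but keeps the text between them.
function strip_anchor_tags($body) {
  // "\b" stops the pattern from matching other tags that start with "a",
  // such as <abbr> or <address>. "i" also catches uppercase <A> tags.
  return preg_replace('/<\/?a\b[^>]*>/i', '', $body);
}

// Sample node body (an assumption) with one link and one <abbr> tag.
$body = '<p>Read <a href="http://example.com/story">the full story</a> or the '
  . '<abbr title="Really Simple Syndication">RSS</abbr> feed.</p>';

// The link markup is removed; the link text and the <abbr> tag are untouched.
print strip_anchor_tags($body);
```

This assumes attribute values never contain a literal ">" character. If you would rather whitelist the tags you want to keep instead of naming the ones to remove, PHP's built-in strip_tags() accepts a list of allowed tags as its second argument.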

Joe
http://www.hydeinteractive.com/

Local News Platform Built on Drupal
http://sanangelolive.com/