This project is not covered by Drupal’s security advisory policy.

This is a parser plugin for Feeds that uses the SimpleHTMLDOM library to extract elements from HTML documents. It can be used to build screen-scraping functionality with Feeds, and to automatically import items from websites that do not provide RSS feeds.

Installation

Download the module and extract it into your site's modules folder. You will also need the Feeds module and its dependencies. Enable the module.

Create a new Feed importer. You will probably want to use the HTTP fetcher to download the web page. Change the importer configuration to use SimpleHTMLDOM as the parser, then configure the extractions you wish to make from the page (see Documentation below). You will then probably want to use the Node Mapper to map the extracted items onto nodes/fields.
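
For orientation, an extraction is essentially a SimpleHTMLDOM selector plus an attribute, applied to the HTML the fetcher downloaded. The sketch below runs the library directly in plain PHP rather than through the module, and the URL, selectors and array keys are made-up examples:

  <?php
  // A minimal sketch, assuming the SimpleHTMLDOM library file is available.
  // The URL, selectors and keys below are hypothetical, not module defaults.
  include 'simple_html_dom.php';

  // What the HTTP fetcher provides: the raw HTML of the page.
  $html = file_get_html('http://example.com/news');

  // Each div.news-item yields a title and a body; a mapper would then
  // place these values onto node fields.
  $items = array();
  foreach ($html->find('div.news-item') as $element) {
    $items[] = array(
      'title' => $element->find('h2 a', 0)->plaintext,
      'body'  => $element->find('div.summary', 0)->innertext,
    );
  }

  $html->clear();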

Documentation

Full documentation doesn't exist yet, but there's a usage example on my website.

You can read more about the syntax used in the configuration on the SimpleHTMLDOM website.
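
As a quick orientation before following that link, SimpleHTMLDOM selectors look much like CSS selectors. The element names, classes and ids below are only illustrative, and $html is assumed to be an already-parsed document:

  <?php
  // Illustrative selector forms accepted by SimpleHTMLDOM's find().
  $html->find('a');              // every link in the document
  $html->find('div.story');      // divs with class "story"
  $html->find('div#content');    // the div with id "content"
  $html->find('img[src]');       // images that have a src attribute
  $html->find('ul.menu li a');   // links nested inside items of ul.menu
  $html->find('div.story', 0);   // only the first matching div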

Notes

These notes are compiled from support requests and may be useful when configuring parsers:

  • It only makes sense to use the "multi-value" option when you are mapping to a field that is configured to accept multiple values. If you are mapping to something like the node body or a single-value text field, you need a single value.
  • For each extraction you need to specify an "attribute" to extract. Use innertext to get the HTML code inside the extracted DOM node, or plaintext to get only the displayable text from the node. You can also return HTML attributes, such as src from images or href from links (see the sketch after this list).
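
To make the attribute choice concrete, here is a hedged sketch of what each option returns for one extracted element. $html is assumed to be a parsed SimpleHTMLDOM document, and the div.teaser selector is invented for the example:

  <?php
  // Assuming $html is a parsed SimpleHTMLDOM document containing div.teaser.
  $element = $html->find('div.teaser', 0);

  $markup = $element->innertext;   // HTML code inside the node, tags included
  $text   = $element->plaintext;   // displayable text only, tags stripped

  // HTML attributes are read as properties of the extracted node.
  $link  = $element->find('a', 0)->href;    // href of the first link inside
  $image = $element->find('img', 0)->src;   // src of the first image inside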

Please continue to post support requests to the issue queue and I'll update this list when possible.
