Download & Extend

A transporter interface to allow for batching

Project:Feeds
Version:6.x-1.x-dev
Component:Code
Category:feature request
Priority:normal
Assigned:Unassigned
Status:closed (duplicate)

Issue Summary

Problem

It's quite hard to work with really big feeds due to timeouts and memory limits etc. Feeds support queuing of feed refreshments with the Queue API (Drupal queue). But that isn't always sufficient. As stated here: #600584: Use Batch API it would be great to be able to use Batch API in some cases . But I think we could aproach this in a more generic and abstracted way.

Solution

Wouldn't it be great to have some kind of transporter API for Feeds. A transporter would simply transport a parsed item to the processor side. Off the top of my head I can come up with three different transporter implementations:

  • Straight passing - this was how it worked in FeedAPI
  • Batch API - parsed items are passed to the processor from a queue in Drupals buildt in Batch API
  • Queue API - parsed items are passed directly to the Queue API (Drupal queue) for later processing by the processor.

While the Batch API implementation wouldn't support refreshing of a feed through a cron job, the Queue API implementation would do that, all the time.

Why this could work

Generally there isn't a problem to parse a big feed, but to process it. As I stated in this FeedAPI issue: #545304: Use Batch API to improve refresh batching about 95% of the time was spent on the processing side when refreshing feeds. So having a transporter API that can split up the processing in an arbitrary way would be a huge improvement.

I haven't had the time to take a closer look at Feeds code base yet so I don't know if this is possible. I shall take a closer look at it, and see if I can come up with a patch for it.

Comments

#1

Status:active» needs review

So, here is a first take at the new FeedsTransporter interface. The patch comes with two implementations of the interface:

  • FeedsDirectTransporter - transports parsed items directly to the processor.
  • FeedsBatchTransporter - uses Drupal's built in Batch API to pass items to the processor.

FeedsDirectTransporter is attached to all importers by default as this is the current behavior of Feeds. So to test FeedsBatchTransporter you have to programmatically attach it to an importer.

The patch still needs some work. I haven't implemented any configurable options for the transporters yet etc so some values like the batch size is hard codes at the moment. Also, transporters should do the work in FeedsImporter::clear() instead of processors handling that by them selfs.

I think it makes sense to ship Feeds with FeedsQueueTransporter also, as Feeds have built in support for queuing whole feeds refreshments with drupal_queue. But I haven't written FeedsQueueTransporter yet.

AttachmentSize
feeds-transporter-626352-1.patch 10.71 KB

#2

Title:Create a transporter API» A transporter interface to allow for batching

#3

Hey, glad you're attacking batching.

I'm trying to understand the underlying assumptions better. Can you explain why you decided to add a separate transporter plugin type instead of adding batch support into FeedsProcessor class?

#4

The main reason why I added a new plugin type is because of semantics and separation of the API.

Importing 5 RSS items a week and importing 500,000 rows CSV a day is actually very similar in one way actually - they both parses items at one end and processes them at the other end. But whats totally different between the cases is the way they need to transport and handle that data. A direct transport of the items under the same HTTP request may be appropriate in the first case. But in the last case, being able to queue processes in the database so several web fronts can hit the queue at the same time may be crusial. That is something FeedsProcessor shouldn't need to care about, imho.

And being able to switch transporter implementation without the need to touch the parsing or processing side can only be a good thing. I think Feeds can (and should) go this way. It would greatly improve the already awesome architecture of this module and prove it self even more as a very powerful importing tool for all kinds of situations and data types.

#5

I'm struggling to see the inherent advantage of making batching a seperate import stage on the same level as fetching, parsing, processing, I do see a couple of problems though:

  1. The batching is a cross section problem, a problem of import execution and not a problem of a particular import stage. While I agree that we should focus our batching work on the processing stage for now, a stage-based like FeedsTransporter approach is painting ourselves into a corner.
  2. The patch introduces a novel pattern: a plugin (transporter) executes another plugin for import. FeedsImporter is the only class involved in the execution of an import so far.
  3. Moreover, using the plugin concept for batching will require switching the specific class in use depending on the context of the import: when importing from cron you'll have to use a different class than when you're importing from through the browser.
  4. A new stage does introduce complexity on the architectural level.

I propose implementing the main batching functionality on the FeedsImporter level while keeping FeedsImporter and plugins agnostic to the batching implementation (Batch API or cron).

  1. Introduce an additional parameter: FeedsProcessor::process($parserResult, $source, $batch = FALSE);
  2. Introduce an additional parameter: FeedsImporter::import($source, $batch = FALSE);
  3. Introduce a return value in FeedsProcessor::process() that indicates whether process() is done: FeedsProcessorResult extends FeedsResult (alternatively, the return value may just be a string that indicates the status).
  4. Have FeedsImporter::import() return the FeedsProcessorResult.
  5. Modify import() so that it does not invoke fetchers/parsers if a batch is active for a given source. This functionality needs to be concurrency safe: Batching must not break if two processes start a batch at the same time or a batch is running while cron updates the same source. Concretely, this means that a) a process is importing and b) the progress of the batch will need to be stored with the source (in feeds_source table).

This approach is designed to fit snugly with the existing architecture of Feeds, it allows us to implement batching with BatchAPI and (with modifications of FeedsScheduler) on cron. It further doesn't limit us to batching only on the processing level (while we're only focussing on processing for now). It uses the existing infrastructure for modifying FeedsImporter behavior by swapping the FeedsImporter class defined in the Drupal variable 'feeds_importer_class'.

Some further considerations:

  • The return value from import() will neatly take care of batching via batch API. Batching via Drupal Queue should not be hard: we just need to add a feed to the queue again if it didn't finish processing. Batching on cron without DrupalQueue will require more refactoring, as FeedsScheduler right now adds a source for next refresh at the time it starts processing it, not at the time it is done processing. This last problem could pose some inefficiencies when using Drupal Queue, too: when a source is scheduled for 'refresh on cron as often as possible' and it is further very large and heavily batched, we will wind up with multiple queue entries for that source.
  • We should time batching (15 to 20 seconds a batch) and not set it to a number limit.
  • Instead of adding an additional parameter to process() and import() we could think about a class FeedsImportContext that carries a source and a batch flag. But I think at this stage, this is overdoing it.

Thoughts?

#6

Status:needs review» closed (duplicate)

This is solved now #600584: Use Batch API