| Project: | Feeds |
| Version: | 6.x-1.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed (duplicate) |
Issue Summary
Problem
It's quite hard to work with really big feeds due to timeouts and memory limits etc. Feeds support queuing of feed refreshments with the Queue API (Drupal queue). But that isn't always sufficient. As stated here: #600584: Use Batch API it would be great to be able to use Batch API in some cases . But I think we could aproach this in a more generic and abstracted way.
Solution
Wouldn't it be great to have some kind of transporter API for Feeds. A transporter would simply transport a parsed item to the processor side. Off the top of my head I can come up with three different transporter implementations:
- Straight passing - this was how it worked in FeedAPI
- Batch API - parsed items are passed to the processor from a queue in Drupals buildt in Batch API
- Queue API - parsed items are passed directly to the Queue API (Drupal queue) for later processing by the processor.
While the Batch API implementation wouldn't support refreshing of a feed through a cron job, the Queue API implementation would do that, all the time.
Why this could work
Generally there isn't a problem to parse a big feed, but to process it. As I stated in this FeedAPI issue: #545304: Use Batch API to improve refresh batching about 95% of the time was spent on the processing side when refreshing feeds. So having a transporter API that can split up the processing in an arbitrary way would be a huge improvement.
I haven't had the time to take a closer look at Feeds code base yet so I don't know if this is possible. I shall take a closer look at it, and see if I can come up with a patch for it.
Comments
#1
So, here is a first take at the new FeedsTransporter interface. The patch comes with two implementations of the interface:
FeedsDirectTransporter is attached to all importers by default as this is the current behavior of Feeds. So to test FeedsBatchTransporter you have to programmatically attach it to an importer.
The patch still needs some work. I haven't implemented any configurable options for the transporters yet etc so some values like the batch size is hard codes at the moment. Also, transporters should do the work in FeedsImporter::clear() instead of processors handling that by them selfs.
I think it makes sense to ship Feeds with FeedsQueueTransporter also, as Feeds have built in support for queuing whole feeds refreshments with drupal_queue. But I haven't written FeedsQueueTransporter yet.
#2
#3
Hey, glad you're attacking batching.
I'm trying to understand the underlying assumptions better. Can you explain why you decided to add a separate transporter plugin type instead of adding batch support into FeedsProcessor class?
#4
The main reason why I added a new plugin type is because of semantics and separation of the API.
Importing 5 RSS items a week and importing 500,000 rows CSV a day is actually very similar in one way actually - they both parses items at one end and processes them at the other end. But whats totally different between the cases is the way they need to transport and handle that data. A direct transport of the items under the same HTTP request may be appropriate in the first case. But in the last case, being able to queue processes in the database so several web fronts can hit the queue at the same time may be crusial. That is something FeedsProcessor shouldn't need to care about, imho.
And being able to switch transporter implementation without the need to touch the parsing or processing side can only be a good thing. I think Feeds can (and should) go this way. It would greatly improve the already awesome architecture of this module and prove it self even more as a very powerful importing tool for all kinds of situations and data types.
#5
I'm struggling to see the inherent advantage of making batching a seperate import stage on the same level as fetching, parsing, processing, I do see a couple of problems though:
I propose implementing the main batching functionality on the FeedsImporter level while keeping FeedsImporter and plugins agnostic to the batching implementation (Batch API or cron).
This approach is designed to fit snugly with the existing architecture of Feeds, it allows us to implement batching with BatchAPI and (with modifications of FeedsScheduler) on cron. It further doesn't limit us to batching only on the processing level (while we're only focussing on processing for now). It uses the existing infrastructure for modifying FeedsImporter behavior by swapping the FeedsImporter class defined in the Drupal variable 'feeds_importer_class'.
Some further considerations:
Thoughts?
#6
This is solved now #600584: Use Batch API