Hello,
Thanks for this great module.
I have a question: I created a fetcher that extends FeedsFileFetcher so that I can configure my "directory path" in the fetcher's settings form rather than in the standalone form. It's the only solution I found to import my XML files periodically.
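Roughly the kind of override I mean looks like this (a minimal sketch, not my exact code; the class name and the 'directory' key are only illustrative, and the overridden methods follow the Feeds 7.x-2.x plugin API):

/**
 * Sketch of a fetcher that exposes the directory path in the importer's
 * fetcher settings form instead of the standalone per-source form.
 */
class MyDirectoryFetcher extends FeedsFileFetcher {

  /**
   * Overrides parent::configDefaults().
   */
  public function configDefaults() {
    return array('directory' => '') + parent::configDefaults();
  }

  /**
   * Overrides parent::configForm().
   */
  public function configForm(&$form_state) {
    $form = parent::configForm($form_state);
    $form['directory'] = array(
      '#type' => 'textfield',
      '#title' => t('Directory path'),
      '#description' => t('Directory that holds the XML files to import.'),
      '#default_value' => $this->config['directory'],
    );
    return $form;
  }

  // fetch() would then read the directory from $this->config['directory']
  // instead of the per-source $source_config['source'].
}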

When I run the import manually (via the standalone form), everything is fine: my 40 XML files are fetched, parsed, and converted to nodes.

But when I set "Periodic import" in "Basic settings" to "as often as possible", only ONE file is imported each time cron fires. I would like all the XML files in my folder to be parsed on each run.

Why is only one XML file imported? How can I change this behavior?

Thanks


Comments

nyl_auster’s picture

PS: My fetcher's code is exactly the same as FeedsFileFetcher's for now:

  public function fetch(FeedsSource $source) {
    $source_config = $source->getConfigFor($this);

    // Just return a file fetcher result if this is a file.
    if (is_file($source_config['source'])) {
      return new FeedsFileFetcherResult($source_config['source']);
    }

    // Batch if this is a directory.
    $state = $source->state(FEEDS_FETCH);
    $files = array();
    if (!isset($state->files)) {
      $state->files = $this->listFiles($source_config['source']);
      $state->total = count($state->files);
    }
    if (count($state->files)) {
      $file = array_shift($state->files);
      $state->progress($state->total, $state->total - count($state->files));
      return new FeedsFileFetcherResult($file);
    }

    throw new Exception(t('Resource is not a file or it is an empty directory: %source', array('%source' => $source_config['source'])));
  }
nyl_auster’s picture

Does nobody have a clue for me?

Cajun’s picture

Hey, did you figure this out? I've got the exact same problem.

juhaniemi’s picture

Category: support » bug

Confirming this issue and setting it as a bug.

janfang’s picture

I have the same problem. Have you found a solution?

valderama’s picture

Seems like we are having the same problem here. Does anyone have any clues?

Thanks,
walter

surf12’s picture

The same problem here. Help us please!
Thanks...

siva.thanush’s picture

Version: 7.x-2.0-alpha4 » 7.x-2.0-alpha5
Priority: Normal » Critical

The same problem persists.
Or is this post a duplicate?
For me this happens in the next version as well.

siva.thanush’s picture

Version: 7.x-2.0-alpha5 » 7.x-2.0-alpha4
Category: bug » task
Priority: Critical » Normal
Status: Active » Fixed

It's working in the latest version, 7.x-2.0-alpha5.
I am not sure about the older one.

franz’s picture

Status: Fixed » Closed (won't fix)
cmarcera’s picture

Version: 7.x-2.0-alpha4 » 7.x-2.0-alpha7

I'm using alpha7 and this bug persists for me. I have cron running every 10 minutes and my periodic import is set to run as often as possible. I'm using the file upload fetcher to parse XML files with the "Supply path to file or directory directly" option checked.

Every 10 minutes, my Feed Importer imports 1 XML file from the directory. After that, the feed is locked and must be unlocked if I want to run it manually. If I run it manually, it imports all of the XML files as expected.

gurrmag’s picture

I'm having this issue too...

I'm using Feeds to take new files uploaded to the server and import them periodically as nodes of a specific content type, with "update existing nodes" selected. However, when Job Scheduler fires, only one node is processed at a time, and when the import is complete it reports that hundreds of nodes have been imported, i.e. the total of all nodes imported so far, rather than just the ten or so new files that are available daily. None of these files are particularly long; many of them are just one paragraph of text.

I have minimised this issue by running Job Scheduler every five minutes; fortunately Feeds is the only thing using it.

cmarcera’s picture

My issue will import 1 item, then lock the feed saying it's XX% done. After the next cron run, it imports another item and increases the percentage done. It's baffling because the percentage clearly knows how many files are in the directory to process; it just stops after one.

cmarcera’s picture

Category: task » bug
Status: Closed (won't fix) » Active

I've now tried various settings in my Feeds importer and none seem to import more than a single item.

Basic settings
• Attached to: [none]
• Periodic import: as often as possible
• Import on submission: Checked
• Process in background: Unchecked

Fetcher
• File upload: Upload content from a local file.

Parser
• XPath XML parser: Parse XML using XPath.

Processor
• Node processor: Create and update nodes.

gurrmag, what settings are you using? Trying to find a common denominator.

gurrmag’s picture

My settings are:
Basic settings
• Attached to: [none]
• Periodic import: 1 day
• Import on submission: Checked
• Process in background: Unchecked

Fetcher
• File upload: Upload content from a local file.

Parser
• XPath XML parser: Parse XML using XPath.

Processor
• Node processor: Create and update nodes.

I've tried a number of different combinations, but haven't found a culprit for this yet either...

dgtlmoon’s picture

What does the content of your job_schedule table look like?

cmarcera’s picture

--
-- Table structure for table `d7_job_schedule`
--

CREATE TABLE IF NOT EXISTS `d7_job_schedule` (
  `item_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Primary Key: Unique item ID.',
  `name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Name of the schedule.',
  `type` varchar(128) NOT NULL DEFAULT '' COMMENT 'Type identifier of the job.',
  `id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Numeric identifier of the job.',
  `period` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Time period after which job is to be executed.',
  `crontab` varchar(255) NOT NULL DEFAULT '' COMMENT 'Crontab line in *NIX format.',
  `data` longblob COMMENT 'The arbitrary data for the item.',
  `expire` int(11) NOT NULL DEFAULT '0' COMMENT 'Timestamp when job expires.',
  `created` int(11) NOT NULL DEFAULT '0' COMMENT 'Timestamp when the item was created.',
  `last` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Timestamp when a job was last executed.',
  `periodic` smallint(5) unsigned NOT NULL DEFAULT '0' COMMENT 'If true job will be automatically rescheduled.',
  `next` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Timestamp when a job is to be executed (next = last + period), used for fast ordering.',
  `scheduled` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Timestamp when a job was scheduled. 0 if a job is currently not scheduled.',
  PRIMARY KEY (`item_id`),
  KEY `name_type_id` (`name`,`type`,`id`),
  KEY `name_type` (`name`,`type`),
  KEY `next` (`next`),
  KEY `scheduled` (`scheduled`)
) ENGINE=InnoDB  DEFAULT CHARSET=utf8 COMMENT='Schedule of jobs to be executed.' AUTO_INCREMENT=53525 ;

--
-- Dumping data for table `d7_job_schedule`
--

INSERT INTO `d7_job_schedule` (`item_id`, `name`, `type`, `id`, `period`, `crontab`, `data`, `expire`, `created`, `last`, `periodic`, `next`, `scheduled`) VALUES
(53522, 'feeds_source_import', 'xml_files_to_stories', 0, 0, '', NULL, 0, 0, 1355433121, 1, 1355433121, 0),
(53523, 'feeds_source_import', 'xml_files_to_pages', 0, 0, '', NULL, 0, 0, 1355433121, 1, 1355433121, 0),
(53524, 'feeds_source_import', 'xml_files_to_editions', 0, 0, '', NULL, 0, 0, 1355433121, 1, 1355433121, 0);
gurrmag’s picture

-- Table structure for table `job_schedule`
--

CREATE TABLE IF NOT EXISTS `job_schedule` (
`item_id` int(10) unsigned NOT NULL AUTO_INCREMENT COMMENT 'Primary Key: Unique item ID.',
`name` varchar(128) NOT NULL DEFAULT '' COMMENT 'Name of the schedule.',
`type` varchar(128) NOT NULL DEFAULT '' COMMENT 'Type identifier of the job.',
`id` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Numeric identifier of the job.',
`period` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Time period after which job is to be executed.',
`crontab` varchar(255) NOT NULL DEFAULT '' COMMENT 'Crontab line in *NIX format.',
`data` longblob COMMENT 'The arbitrary data for the item.',
`expire` int(11) NOT NULL DEFAULT '0' COMMENT 'Timestamp when job expires.',
`created` int(11) NOT NULL DEFAULT '0' COMMENT 'Timestamp when the item was created.',
`last` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Timestamp when a job was last executed.',
`periodic` smallint(5) unsigned NOT NULL DEFAULT '0' COMMENT 'If true job will be automatically rescheduled.',
`next` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Timestamp when a job is to be executed (next = last + period), used for fast ordering.',
`scheduled` int(10) unsigned NOT NULL DEFAULT '0' COMMENT 'Timestamp when a job was scheduled. 0 if a job is currently not scheduled.',
PRIMARY KEY (`item_id`),
KEY `name_type_id` (`name`,`type`,`id`),
KEY `name_type` (`name`,`type`),
KEY `next` (`next`),
KEY `scheduled` (`scheduled`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COMMENT='Schedule of jobs to be executed.' AUTO_INCREMENT=1533 ;

--
-- Dumping data for table `job_schedule`
--

INSERT INTO `job_schedule` (`item_id`, `name`, `type`, `id`, `period`, `crontab`, `data`, `expire`, `created`, `last`, `periodic`, `next`, `scheduled`) VALUES
(1530, 'feeds_source_import', 'psl_xml_importer', 0, 86400, '', NULL, 0, 0, 1355476741, 1, 1355563141, 0),
(1532, 'feeds_source_import', 'news_xml_importer', 0, 0, '', NULL, 0, 0, 1355476801, 1, 1355476801, 0);

DannyPfeiffer’s picture

This is related to the hardcoded time parameter in the queue_info callback: changing the default 15 seconds to something higher (240 in my case) solves this problem.

Lines 85-104 of feeds.module:

/**
 * Implements hook_cron_queue_info().
 */
function feeds_cron_queue_info() {
  $queues = array();
  $queues['feeds_source_import'] = array(
    'worker callback' => 'feeds_source_import',
    'time' => 15,
  );
  $queues['feeds_source_clear'] = array(
    'worker callback' => 'feeds_source_clear',
    'time' => 15,
  );
  $queues['feeds_importer_expire'] = array(
    'worker callback' => 'feeds_importer_expire',
    'time' => 15,
  );
  $queues['feeds_push_unsubscribe'] = array(
    'worker callback' => 'feeds_push_unsubscribe',
    'time' => 15,
  );
  return $queues;
}

hook_cron_queue_info() adds entries to Drupal's queue table, and cron processes each queue for "up to" the time period specified, 15 seconds by default.

If you have a lot of large feed sources to import on each scheduled run (like I did), this will usually only get through one or two feed sources before hitting that limit.

You'll end up with a lot of items stuck in the queue table (mine had 70,000+ rows of duplicate entries). That's because on every scheduled run all your feed sources get added to the queue, but only one or two get processed.

If you increase your time limit to something large and have lots of duplicate rows in your queue table, the next run will process as many of those records as it can, so I'd suggest blanking out the queue table first (make a backup of it in case you need to restore something).
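For instance, clearing only the stale Feeds items from Drupal 7's queue table could look like this (a sketch using the core database API, e.g. run via drush php-eval; back the table up first):

// Remove pending 'feeds_source_import' jobs from the core queue table,
// leaving other modules' queue items untouched.
db_delete('queue')
  ->condition('name', 'feeds_source_import')
  ->execute();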

queenvictoria’s picture

Component: Feeds Import » Code

I also had the issue in #19. After chasing around all over the place setting timeouts in nginx, PHP, and http_request_timeout in the settings file, I've settled on having drush call the Feeds import. I've added some work over here to aid in this task:
http://drupal.org/node/608408

My table had 600k feed imports queued. Nice tip for clearing the table; good idea to back up first. This operation took 5 minutes.
mysql> delete from queue where name = "feeds_source_import";

klausi’s picture

Version: 7.x-2.0-alpha7 » 7.x-2.x-dev
Status: Active » Needs review
FileSize
1012 bytes

Here is a patch that increases the default run time for the feeds import queue to 60 seconds, which is the same value that core's aggregator module uses.

I also modified feeds_source_import() to re-queue itself immediately if the import of a feed has not finished. That allows us to process more items of one particular feed during a single cron run, for example.
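Conceptually, the re-queuing part works along these lines (a sketch of the idea, not the literal diff):

// At the end of the feeds_source_import() queue worker:
if ($source->progressImporting() == FEEDS_BATCH_COMPLETE) {
  // The import finished, so schedule the next periodic run as usual.
  $source->scheduleImport();
}
else {
  // The batch is not done yet; push the job straight back onto the queue so
  // the remaining items can be processed within the same cron run.
  DrupalQueue::get('feeds_source_import')->createItem($job);
}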

Status: Needs review » Needs work

The last submitted patch, feeds-queue-1231332-21.patch, failed testing.

klausi’s picture

Status: Needs work » Needs review
FileSize
2.96 KB

Fixed the test case, since now it is not possible to determine how many items have been processed in one cron run.

twistor’s picture

Assigned: Unassigned » twistor

Nifty. Assigning to myself so I can review it at a normal hour.

Overall, I like the idea.

Making cron non-deterministic is a bit scary. We already have a bunch of problems with it. That said, this would solve a lot of problems with people's expectations. I don't think this will affect sites with a large number of feeds. Well, the re-queuing part won't, but increasing the time limit obviously will.

I kind of like the idea to use the Queue directly, in this case, rather than JobScheduler. There are a couple more places we could do this as well: clearing and expiring.

Could we move the logic back into FeedsSource::scheduleImport()?

lwalley’s picture

I've been running into the same issue described in #19, with 70,000+ queue entries, and I'm wondering if Job Scheduler might be able to help prevent these duplicate jobs. I've added my thoughts to this ticket: #2061647: Rescheduling 'stuck' periodic jobs results in duplicate queue entries?

lalit774’s picture

I have solved this with the following method, so we don't need to hack the Feeds module:

/**
 * Implements hook_cron_queue_info_alter(); replace "hook" with your module's machine name.
 */
function hook_cron_queue_info_alter(&$queues) {
  $queues['feeds_source_import']['worker callback'] = '_custom_function_name_feeds_source_import';
  $queues['feeds_source_import']['time'] = 90;
}

function _custom_function_name_feeds_source_import($job) {
  $source = feeds_source($job['type'], $job['id']);
  try {
    $source->existing()->import();
  }
  catch (FeedsNotExistingException $e) {
    // Do nothing.
  }
  catch (Exception $e) {
    $source->log('import', $e->getMessage(), array(), WATCHDOG_ERROR);
  }
  if ($source->progressImporting() == FEEDS_BATCH_COMPLETE) {
    // Feed import finished, so we schedule the next execution in the future.
    $source->scheduleImport();
  }
  else {
    // Feed is not fully imported yet, so we put this job back in the queue
    // immediately for further processing.
    $queue = DrupalQueue::get('feeds_source_import');
    $queue->createItem($job);
  }
}
klausi’s picture

Issue summary: View changes
FileSize
3.52 KB

The patch does not apply anymore, so I rerolled it. I moved the queuing to scheduleImport() as suggested by twistor.
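Conceptually, the rerolled patch lets FeedsSource::scheduleImport() decide between Job Scheduler and the queue, roughly like this (a sketch of the approach, not the exact committed code):

public function scheduleImport() {
  $job = array(
    'type' => $this->id,
    'id' => $this->feed_nid,
    'period' => $this->importer->config['import_period'],
    'periodic' => TRUE,
  );
  if ($this->progressImporting() === FEEDS_BATCH_COMPLETE) {
    // Import finished: hand the job back to Job Scheduler for the next period.
    JobScheduler::get('feeds_source_import')->set($job);
  }
  else {
    // Import not finished: re-queue immediately so the remaining items are
    // processed without waiting for the next scheduled period.
    DrupalQueue::get('feeds_source_import')->createItem($job);
  }
}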

twistor’s picture

Apologies, this fell off my radar. I really like this patch; just trying to flatten out the logic.

twistor’s picture

Assigned: twistor » Unassigned
Status: Needs review » Fixed

Thanks everybody, especially klausi for coming up with a clever fix.

If somebody wants to try and backport this, they are more than welcome to. But queue usage in D6 is optional, which complicates this a bit.

http://drupalcode.org/project/feeds.git/commit/83f1a1d

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.