Hi,
Every day I do a bulk insert of new nodes from my local database using a custom script. It would be great if, at every cron run or whenever I want, all newly added nodes were 'bulk processed'. What I'm actually asking for is a 'bulk processing' script that I can run with various parameters, e.g. all nodes of type 'x', all nodes with a creation date > yyyymmdd, etc.
Kind regards,
Jacques Bopp

Comments

rares’s picture

There's a very good bulk processing interface in 6.x-3.0. Perhaps all you need to do is figure out what settings you need at admin/settings/calais/bulk-process, and then add something like the following to calais.module:

function calais_cron() {
  // Mirror the settings from admin/settings/calais/bulk-process.
  $values = array();
  $values['calais_bulk_type'] = 'page';
  $values['calais_bulk_limit'] = '2';
  $values['calais_bulk_throttle'] = TRUE;
  $values['calais_bulk_threshold'] = 0.2;

  $form_state = array();
  $form_state['values'] = $values;

  // Drupal 6 submit handlers expect ($form, &$form_state), so pass an
  // empty $form rather than $form_state twice.
  $form = array();
  calais_bulk_process_submit($form, $form_state);
}

I haven't tested this, but it's worth trying.

webchick’s picture

febbraro’s picture

Version: 6.x-2.x-dev » 6.x-3.1
Assigned: Unassigned » febbraro

Hey folks, I was wondering if you knew of some other modules that did bulk processing on cron run that I could use as an example. If there was some bey-oooooo-tiful code already written, it might make getting this in here a helluva lot easier/faster. Figured I'd ask. :)

webchick’s picture

Feed API has a pretty good example, although maybe a bit overly complex since we don't need to go back and repeatedly re-tag nodes, just do it the one time.

Otherwise the code that rares posted looks pretty close. It would require making variables for each of the settings on the current bulk processing page, which would conveniently take care of #433802: Remember last bulk processing settings too. :)
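
Roughly, a variable-backed version of the snippet above might look like this (a sketch only; the calais_bulk_cron_* variable names are made up here, and the submit callback name is taken from rares's snippet, so double-check it against the module):

// Sketch: same idea as the snippet above, but with the bulk-process settings
// stored as variables (names are hypothetical) instead of hard-coded values.
function calais_cron() {
  $form_state = array();
  $form_state['values'] = array(
    'calais_bulk_type'      => variable_get('calais_bulk_cron_type', 'page'),
    'calais_bulk_limit'     => variable_get('calais_bulk_cron_limit', 10),
    'calais_bulk_throttle'  => variable_get('calais_bulk_cron_throttle', TRUE),
    'calais_bulk_threshold' => variable_get('calais_bulk_cron_threshold', 0.2),
  );
  $form = array();
  calais_bulk_process_submit($form, $form_state);
}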

sphism’s picture

Has anyone managed to get a batch process to run during an automatic cron job?

I can get a batch to run when i hit the 'run cron' button.

But on automatic cron I just get a timeout error. The weird thing, though, is that the timeout report is generated within a second of the cron job starting, so it's not a real timeout?

febbraro’s picture

Haven't tried the cron run thing just yet. Sounds like there could also be some inherent issue with how that bulk API works, such that calling it via cron.php won't work but running it logged in as uid = 1 (or whomever) does.

webchick’s picture

Hey, febbraro!

Did you ever get anywhere on this? If not, I'm going to try looking into it tonight/tomorrow. If you have even a partial patch laying around somewhere I'd be happy to help. :)

webchick’s picture

Actually, after reading through #212084: perform bulk updates during cron and/or via the batch API (which is basically the same problem, but for pathauto), the solution proposed about halfway through was a separate script that could be called without invoking the overhead of all the other code that runs during a cron run. I think that actually makes a lot of sense, so I'm going to work in that direction.

webchick’s picture

Title: bulk process on cron run » Create a way to do mass-importing without batch API
Status: Active » Needs review
FileSize
1.12 KB

Here's a first stab. Totally untested.

webchick’s picture

FileSize
1.32 KB

Now with fewer bugs!

I haven't been able to confirm that this works yet because it keeps coming back with empty keywords. I have a strong suspicion that this is because Calais doesn't speak Latin, which is the language Devel Generate speaks. ;)

I'll have to try this with a copy of my production database, but for now there are at least no SQL syntax errors. ;)

webchick’s picture

FileSize
2.43 KB

Ok, I think this is kinda working now. Going to try on a fresh copy of the database next.

I'm not sure that query for selecting un-tagged nodes is going to work, though; if a node makes it through processing and Calais doesn't find any keywords for it, it will keep getting selected and processed over and over again.

webchick’s picture

Oh, and additionally, it doesn't make any allowance for nodes that anonymous users don't have access to. That could be a security risk, depending on your point of view about sending your website's content off to a third-party provider.

febbraro’s picture

Wow Angie, thanks for taking the bull by the horns. I will take some time to review this over the next day. Thanks so much.

KarenS’s picture

Subscribing.

webchick’s picture

FileSize
2.27 KB

Now with fewer stupid bugs!

I actually think this is right, now. But I'd welcome Frank/Irakli's input. :)

irakli’s picture

Angie,

you rock!

Thanks

webchick’s picture

Status: Needs review » Needs work

Yeah, as I thought, this approach does run into problems eventually. I'm going to need to store the last processed node ID and check against that the next time it runs.
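
In pseudo-outline, the idea is something like this (a sketch only; the calais_bulk_last_nid variable name is made up, and the actual patch may select and order nodes differently):

// Sketch: remember the last node ID handled so the next run resumes there
// instead of re-selecting the same nodes forever.
$last_nid = variable_get('calais_bulk_last_nid', 0);
$result = db_query_range("SELECT nid FROM {node} WHERE nid > %d ORDER BY nid ASC", $last_nid, 0, 10);
while ($row = db_fetch_object($result)) {
  $node = node_load($row->nid);
  // ... send $node off to Calais here ...
  variable_set('calais_bulk_last_nid', $node->nid);
}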

webchick’s picture

Status: Needs work » Needs review
FileSize
2.44 KB

Ok, let's try this instead.

webchick’s picture

Hm...

A word to the wise for those using this script: http://opencalais.com/documentation/calais-web-service-api/usage-quotas gives a maximum limit of 40,000 transactions per day and 4 transactions per second. If your cron job runs too often and/or you have too much content, you'll exceed this limit and get a "403 Developer Over Rate" after a while. Sigh. :P

Frank/Irakli: Do you know, if it does this, will the node get processed again tomorrow, or did I just lose my chance to tag it?

webchick’s picture

FileSize
3.23 KB

Ok, here's a new, better version, which I really hope doesn't have major bugs, because it's what we ended up deploying on the live site. ;)

Improvements:
a) It now processes nodes from newest to oldest, rather than oldest to newest. Running this script can take a good week or two, so it's nice to tag the content users are most likely to see first.
b) It sets a variable when it's done and checks that before doing more work. :P
c) It sets a $node property during bulk processing which other modules can check in hook_calais_pre/postprocess to see if they need to do anything special (see the sketch below).
d) It clears the cache at the end. Unfortunately I needed to add this because the Glossary module wasn't quite "getting it" when new terms were added.

Hope this helps someone else!
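
For module authors, point (c) can be used roughly like this (sketch only; the $node->calais_bulk_import property name is a stand-in for whatever the script actually sets, so check the script before relying on it):

// Sketch: keep expensive hooks cheap while the bulk script is tagging nodes.
function mymodule_calais_preprocess(&$node) {
  if (!empty($node->calais_bulk_import)) {
    // This node is being tagged by the bulk script, not a normal save;
    // skip notifications or other per-save extras.
    return;
  }
  // Normal interactive save: do the full work here.
}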

febbraro’s picture

Thanks Angie. You are correct in #19: if a node is missed on the first go-around, you will have to save it again to make sure it gets processed.

However, if I understand the script correctly, it will only process nodes that have not been processed by Calais previously? You're checking for nodes that have no calais_term_node records, in which case it will also pick up those that may have failed previously.

I like the script; I have a great data set that I will need to run this on in the next few weeks.

One question, though: what is the proper approach for putting a PHP file such as this into a module? Should I ship it with a different extension and have the user explicitly rename it to .php to execute it? It seems like it could be dangerous otherwise; couldn't anyone execute it directly from a browser if they knew the full path to it?

patchak’s picture

Hey there, I just tried this script and got the error saying that no nodes are configured for Calais, which doesn't seem right, since I have two content types working with Calais, but using Semantic Proxy.

Is this script supposed to work with Semantic Proxy as well? Is there anything special I need to do to make that happen?

thanks,
Patchak

webchick’s picture

I've never used Semantic Proxy, so I have no idea.

If you read the code, you'll see that it only acts if Calais processing is set to work on every update (CALAIS_PROCESS_AUTO). You might have to tweak it a bit to make the logic fit what you need it to do.
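
The condition in question is roughly this shape (sketch only; the settings lookup below is a placeholder, not the module's real API), and for Semantic Proxy content types it is the test you would loosen:

// Sketch: the bulk script skips any node whose content type is not set to
// process on every update. calais_get_processing_type() is a hypothetical
// helper standing in for however the module exposes that setting.
foreach ($nodes as $node) {
  $process_type = calais_get_processing_type($node->type);
  if ($process_type != CALAIS_PROCESS_AUTO) {
    // Skipped by the script as written; relax this check for Semantic Proxy types.
    continue;
  }
  // ... node gets sent off for tagging here ...
}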

mikeytown2’s picture

Can I give it a list of NIDs and have it process them? I guess I would just rewrite the SQL in calais_bulk_import() to do it, right?

Also, how long would it take to do, say, 100,000 nodes?

deltab’s picture

Should we not make a feature request out of this? Reindexing the existing nodes ought to be a core feature, same as the Drupal search modules.

webchick’s picture

Category: support » feature

I thought about adding this to a cron hook to make it a "core" feature of Calais module, but in the end decided not to because:

1. We had 200K nodes to index, and wanted to do that as fast as possible, so we set this to run around every minute. It completed in about a week.
2. We did not want to run cron.php every minute because it's doing all kinds of other things, such as reindexing the search, XML Sitemap stuff, and lord knows what else, none of which can complete in only a minute, and none of which needs to be run that often.

I agree this is no longer a support request though. I'm not sure if it makes sense to include it in the "proper" module or not, since it was kind of a one-off thing.

deltab’s picture

My need is slightly different: we have around 160,000 nodes to index, but a lot of them are in French and Spanish. When we started with Calais there was no feature to extract terms from these languages; now it seems there is.

Also, the Calais system is improving all the time, so we would like to see how we can periodically resend our data to Calais and get the new, improved metadata.

Do you think it's possible to add a reindex feature without stopping new nodes from being indexed? Or is that more properly a job for another module?

febbraro’s picture

Thanks for all the comments.

@webchick, Again, thanks for putting this together. It has helped, and will help, a TON of people get off the ground with Calais. I think this script should definitely be part of the module, just not as a feature that runs on cron, because, like you said, a ton of other crap happens on cron and every site is different. Ideally this will be available to run via shell/cron and could also be integrated into Drush (my newish sweetheart).

For everyone else: anything like 100k nodes or more could take QUITE some time to process. See @webchick's response above; think in terms of days or weeks, not hours (100,000 nodes at the 40,000-per-day cap is a minimum of two and a half days, and real-world throughput is much lower). The biggest hold-up is the 4-per-second and 40k-per-day limits enforced on any one API key.

@mikeytown2, yeah, you'd have to do something a bit more custom for that, but again, keep the API throttling in mind. The function calais_process_node() will do most of the work for you.
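
For example, a quick one-off for an explicit list of NIDs might look roughly like this (sketch only: the threshold value, the 'update' op, and the sleep-based throttling are assumptions, and the calais_process_node() arguments may differ between module versions):

// Sketch: tag a fixed list of node IDs, staying well under the
// 4-transactions-per-second API limit with a crude sleep.
$nids = array(12, 34, 56); // the NIDs you want processed
foreach ($nids as $nid) {
  $node = node_load($nid, NULL, TRUE);
  if ($node) {
    calais_process_node($node, CALAIS_PROCESS_AUTO, 0.2, 'update');
    sleep(1);
  }
}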

webchick’s picture

Oooh. Drush integration is a GREAT idea! I probably would've done that, but we started while 2.0 was still in progress. Thanks, febbraro!

groovypower’s picture

subscribe

mikeytown2’s picture

Version: 6.x-3.1 » 6.x-3.2
Status: Needs review » Needs work

A couple of notes:
Right now the code doesn't work for me; this check in calais_process_node() is returning early for some reason.

function calais_process_node(&$node, $process_type, $threshold, $op = 'insert') {
  if ($process_type == CALAIS_PROCESS_NO || isset($node->calais->processor))
    return;

I think ASC is better than DESC for the SQL, so new nodes will still get indexed.
Note: the operation is not atomic, so if processing 10 nodes takes over a minute, you can get "stuck".
The line below lets me check the last time it was run. I made a cron hook to start it up again if it has been stopped for over 'X' seconds.

variable_set('calais_bulk_import_last_run', time());

I used some of my tricks and made this call itself in a loop, fixing the atomic issue.
Showing the percentage of nodes done is also nice: $total = db_result(db_query("SELECT COUNT(*) FROM {node}"));

I'm running a hacked version of Calais now, so I don't feel like uploading my version of this, since it won't work without some code changes. Everything I did to my code is documented above.
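
As a rough illustration of that restart-on-stall idea (the 300-second threshold and the request back to the script are assumptions; calais_bulk_import_last_run is the variable set in this comment):

// Sketch: if the self-calling import script hasn't checked in for a while,
// kick it off again from cron. Note drupal_http_request() blocks until the
// script responds, so in practice a non-blocking trigger is nicer.
function mymodule_cron() {
  $last_run = variable_get('calais_bulk_import_last_run', 0);
  if ($last_run && time() - $last_run > 300) {
    drupal_http_request($GLOBALS['base_url'] . '/calais_bulk_import.php');
  }
}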

Here's the top of my PHP file:

<?php
// calais_bulk_import.php -- placed in the Drupal root next to index.php.
define('CALAIS_BATCH_LIMIT', 10);
// Keep running even if the requesting client disconnects.
ignore_user_abort();

// Bootstrap Drupal.
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

echo "Starting... ";
// Record the start time, then run the batch import process.
variable_set('calais_bulk_import_last_run', time());
calais_bulk_import();

echo "Calling Self... ";
// call_self() is my own helper that re-requests this script so the next
// batch runs in a fresh request.
call_self("/calais_bulk_import.php");
echo "Done";

mikeytown2’s picture

The sql join starts to really SLOW down too. I got rid of that, and I'm just going through all the nid's of type X.

mikeytown2’s picture

I put the important parts of the while loop inside a try block, since it keeps bombing on me.

mikeytown2’s picture

If the last node in the batch doesn't load, then the bulk import stops; here's a way around that:

  // Note: nid values coming back from the database are typically strings,
  // so is_numeric() may be a safer check here than is_int().
  if (is_int($node->nid)) {
    variable_set('calais_bulk_import_last_processed', $node->nid);
  }
  else {
    variable_set('calais_bulk_import_last_processed', $last_processed - CALAIS_BATCH_LIMIT);
  }

This is what I found in my DB today (the variable had ended up as a serialized NULL):
calais_bulk_import_last_processed, N;

shunting’s picture

Subscribing. I've got a mere 20,000 nodes or so, and so I let the process run overnight and it got through 58%. Then the browser bombed, and I had to start all over. That's a little frustrating.

Could whatever ends up in Calais core work like re-indexing search? That would seem to be the friendliest. I'd guess this is holding back adoption -- I suspect there are many who have an entire body of work that they'd like to put up, but the bulk misfeatures (?) get in the way.

(I guess the second friendliest would be CLI in drush, with parameters for content type, batch size, and run size)

deltab’s picture

@shunting, this can be very useful.

subscribing!

febbraro’s picture

I think you're right, something like core/apachesolr search indexing might be best. Drush support should be in there too. I don't have a time table for it, but it's something that is on the short list.

deltab’s picture

It could be as simple as a function in bulk processing (set nodes per batch, and run one batch per cron), right?

febbraro’s picture

Status: Needs work » Fixed

Good gravy. At long last I got some time to rework the bulk processing. Now includes Drush support too. Should be in a dev release real soon.

http://drupal.org/cvs?commit=395222

febbraro’s picture

To explain a bit better.

There is now a queue of nodes to be bulk processed. These nodes can be processed either via Drush or on cron. There is an admin interface to add particular node types to the queue, or you can also add nodes to the queue table manually.
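
For the manual route, something along these lines should work (sketch only: the {calais_bulk_queue} table and column names are guesses; check calais.install or the commit above for the real schema):

// Hypothetical sketch: queue every published node of one type for bulk
// processing. Verify the real table and column names in calais.install first.
$result = db_query("SELECT nid FROM {node} WHERE type = '%s' AND status = 1", 'story');
while ($row = db_fetch_object($result)) {
  db_query("INSERT INTO {calais_bulk_queue} (nid) VALUES (%d)", $row->nid);
}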

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.