Hi,
I do a daily bulk insert of new nodes from my local database using a custom script. It would be great if, at every cron run or whenever I want, all added nodes would be 'bulk processed'. What I'm actually asking for is a 'bulk processing' script which I can run with various parameters, e.g. all nodes of type 'x', all nodes with a creation date > yyyymmdd, etc.
Kind regards,
Jacques Bopp
Comment | File | Size | Author |
---|---|---|---|
#20 | calais_bulk_import.php.txt | 3.23 KB | webchick |
#18 | calais_bulk_import.php.txt | 2.44 KB | webchick |
#15 | calais_bulk_import.php.txt | 2.27 KB | webchick |
#11 | calais_bulk_import.php.txt | 2.43 KB | webchick |
#10 | calais_bulk_import.php.txt | 1.32 KB | webchick |
Comments
Comment #1
rares CreditAttribution: rares commented
There's a very good bulk processing interface in 6.x-3.0; perhaps all you need to do is figure out what settings you need at admin/settings/calais/bulk-process, and then add something like the following to calais.module.
I haven't tested this, but it's worth trying.
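The snippet rares refers to wasn't preserved in this thread. A minimal sketch of the idea, assuming Drupal 6 APIs, where calais_bulk_process_nodes() and the variable name are hypothetical stand-ins for whatever the bulk-process form at admin/settings/calais/bulk-process actually invokes, might look like:

```php
/**
 * Implementation of hook_cron().
 *
 * Sketch only: calais_bulk_process_nodes() is a hypothetical callback,
 * not a real function from the module.
 */
function calais_cron() {
  // Reuse the saved bulk-processing settings, if any have been stored.
  $settings = variable_get('calais_bulk_process_settings', array());
  if (!empty($settings)) {
    calais_bulk_process_nodes($settings);
  }
}
```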
Comment #2
webchick
Subscribing.
Comment #3
febbraro CreditAttribution: febbraro commented
Hey folks, I was wondering if you knew of some other modules that did bulk processing on cron run that I could use as an example. If there was some bey-oooooo-tiful code already written, it might make getting this in here a helluva lot easier/faster. Figured I'd ask. :)
Comment #4
webchick
Feed API has a pretty good example, although maybe a bit overly complex since we don't need to go back and repeatedly re-tag nodes, just do it the one time.
Otherwise the code that rares posted looks pretty close. It would require making variables for each of the settings on the current bulk processing page, which would conveniently take care of #433802: Remember last bulk processing settings too. :)
Comment #5
sphism CreditAttribution: sphism commented
Has anyone managed to get a batch process to run during an automatic cron job?
I can get a batch to run when I hit the 'run cron' button.
But on automatic cron I just get a timeout error. The weird thing, though, is that the timeout report is generated within a second of the cron job starting, so it's not a real timeout.
Comment #6
febbraro CreditAttribution: febbraro commented
I have not tried the cron run thing just yet. It sounds like there could also just be some inherent issue with how that bulk API works, such that calling it via cron.php won't work, but running it logged in as uid 1 (or whomever) works.
Comment #7
webchick
Hey, febbraro!
Did you ever get anywhere on this? If not, I'm going to try looking into it tonight/tomorrow. If you have even a partial patch laying around somewhere I'd be happy to help. :)
Comment #8
webchick
Actually, after reading through #212084: perform bulk updates during cron and/or via the batch API, which is basically the same problem but for pathauto, the solution proposed about halfway through was a separate script that could be called without invoking the overhead of all the other code that runs during a cron run. I think that actually makes a lot of sense, so I'm going to work in that direction.
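The separate-script approach means bootstrapping Drupal outside of cron.php. A minimal Drupal 6 skeleton of that idea (this is a sketch of the pattern, not the attached calais_bulk_import.php script itself) might look like:

```php
<?php
// Standalone bulk-processing script, run from the Drupal root, e.g.:
//   php calais_bulk_import.php
// A full bootstrap makes node_load(), db_query(), variable_get() etc.
// available without triggering everything else that hooks into cron.
require_once './includes/bootstrap.inc';
drupal_bootstrap(DRUPAL_BOOTSTRAP_FULL);

// ... select a batch of nodes and run them through Calais here ...
```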
Comment #9
webchick
Here's a first stab. Totally untested.
Comment #10
webchick
Now with fewer bugs!
I haven't been able to confirm that this works yet because it keeps coming back with empty keywords. I have a strong suspicion that this is because Calais doesn't speak Latin, which is the language Devel Generate speaks. ;)
I'll have to try this with a copy of my production database, but for now there are at least no SQL syntax errors. ;)
Comment #11
webchick
Ok, I think this is kinda working now. Going to try on a fresh copy of the database next.
I'm not sure that query for selecting un-tagged nodes is going to work, though; if a node makes it through processing and Calais doesn't find any keywords for it, it will keep getting selected and processed over and over again.
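The concern is easiest to see in a "not yet tagged" query like the sketch below (the join on calais_term_node follows the table name mentioned later in the thread; the exact query in the attachment may differ). A node that was processed but received zero keywords still has no rows in that table, so it matches again on every run:

```php
// Sketch: select nodes with no calais_term_node rows yet. A node that
// was already processed but got no keywords back still has no rows,
// so it is re-selected forever -- hence the later switch to tracking
// the last processed nid instead.
$result = db_query("SELECT n.nid FROM {node} n
  LEFT JOIN {calais_term_node} ctn ON n.nid = ctn.nid
  WHERE ctn.nid IS NULL
  ORDER BY n.nid ASC");
```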
Comment #12
webchick
Oh, and additionally, it's not making any adjustments for nodes that anonymous users don't have access to. That could be a security risk, depending on your point of view about sending your website's content off to a third-party provider.
Comment #13
febbraro CreditAttribution: febbraro commented
Wow Angie, thanks for taking the bull by the horns. I will take some time to review this over the next day. Thanks so much.
Comment #14
KarenS CreditAttribution: KarenS commented
Subscribing.
Comment #15
webchick
Now with fewer stupid bugs!
I actually think this is right, now. But I'd welcome Frank/Irakli's input. :)
Comment #16
irakli CreditAttribution: irakli commented
Angie,
you rock!
Thanks
Comment #17
webchick
Yeah, as I thought, this approach does run into problems eventually. I'm going to need to store the last processed node ID and check against that the next time it runs.
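In Drupal 6 the usual place to persist a checkpoint like this is the variables table. A sketch of the idea (variable name assumed; this ignores the newest-first ordering adopted later in the thread):

```php
// Resume from where the previous run left off.
$last_nid = variable_get('calais_bulk_import_last_processed', 0);
$result = db_query("SELECT nid FROM {node} WHERE nid > %d ORDER BY nid ASC",
  $last_nid);
while ($row = db_fetch_object($result)) {
  // ... send the node to Calais and save its terms ...
  // Record progress after each node so an interrupted run can resume.
  variable_set('calais_bulk_import_last_processed', $row->nid);
}
```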
Comment #18
webchick
Ok, let's try this instead.
Comment #19
webchick
Hm...
A word to the wise for those using this script: http://opencalais.com/documentation/calais-web-service-api/usage-quotas gives a maximum limit of 40,000 transactions per day and 4 transactions per second. If your cron job runs too often and/or you have too much content, you'll exceed this limit and get a "403 Developer Over Rate" after a while. Sigh. :P
Frank/Irakli: do you know, if a node hits this error, will it get processed again tomorrow, or did I just lose my chance to tag it?
Comment #20
webchick
Ok, here's a new, better version, which I really hope doesn't have major bugs because it's what we ended up deploying on the live site. ;)
Improvements:
a) It now processes nodes from newest to oldest, rather than oldest to newest. Running this script can take a good week or two, so it's nice to tag the content users are more likely to see first.
b) It sets a variable when it's done and will check that before doing more work. :P
c) It sets a $node property during bulk processing which other modules can check in hook_calais_pre/postprocess to see if they need to do anything special.
d) Clears cache at the end. Unfortunately I needed to add this because I was having issues with Glossary module not quite "getting it" when new terms were added.
Hope this helps someone else!
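For point (c), modules implementing the Calais pre/postprocess hooks can branch on that flag. A sketch of the consuming side, where the property name $node->calais_bulk_processing and the hook signature are assumptions rather than the script's actual names:

```php
/**
 * Implementation of hook_calais_preprocess() -- sketch.
 *
 * $node->calais_bulk_processing is an assumed flag name set by the
 * bulk script; check the script for the real property.
 */
function mymodule_calais_preprocess(&$node) {
  if (!empty($node->calais_bulk_processing)) {
    // Skip expensive per-save work during a bulk run.
    return;
  }
  // ... normal per-node processing ...
}
```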
Comment #21
febbraro CreditAttribution: febbraro commented
Thanks Angie. You are correct in #19: if a node is missed on the first go-around, you will have to save it again to make sure it gets processed.
However, if I understand the script correctly, it will only process nodes that have not been processed by Calais previously? You're checking for nodes that have no calais_term_node records, in which case it will also handle those that may have failed previously.
I like the script, I have a great data set for this that I will need to run in the next few weeks.
One question, though: what is the proper approach for putting a PHP file such as this into a module? Should I ship it with a different extension and have the user explicitly rename it to .php to execute it? It seems like it could be dangerous otherwise. Couldn't a browser execute it directly if someone knew the full path to it?
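One common guard for scripts shipped inside a module is to refuse to run over the web at all, which sidesteps the renaming question. A sketch:

```php
<?php
// Guard at the top of the script: only allow command-line execution,
// so a browser requesting the file's full path gets nothing useful.
if (php_sapi_name() != 'cli') {
  exit('This script must be run from the command line.');
}
```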
Comment #22
patchak CreditAttribution: patchak commented
Hey there, I just tried this script and got the error saying that no nodes are configured for Calais, which is not right, since I have two content types that are working with Calais, but using Semantic Proxy.
Is this script supposed to work with Semantic Proxy as well? Is there anything special I need to do to make it work with Semantic Proxy?
thanks,
Patchak
Comment #23
webchick
I've never used Semantic Proxy, so I have no idea.
If you read the code, you'll see that it only acts if Calais processing is set to work on every update (CALAIS_PROCESS_AUTO). You might have to tweak it a bit to make the logic fit what you need it to do.
Comment #24
mikeytown2 CreditAttribution: mikeytown2 commented
Can I give it a list of nids and have it process them? I guess I would just rewrite the SQL in calais_bulk_import() to do that, right?
Also, how long would it take to process something like 100,000 nodes?
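For a fixed list of nids, the selection could be replaced with a placeholder-expanded IN query. A Drupal 6 sketch, assuming calais_process_node() accepts a loaded node object (the thread names the function but not its signature):

```php
// Process an explicit list of nids instead of the un-tagged-nodes query.
$nids = array(12, 34, 56);  // your list here
$placeholders = db_placeholders($nids, 'int');
$result = db_query("SELECT nid FROM {node} WHERE nid IN ($placeholders)",
  $nids);
while ($row = db_fetch_object($result)) {
  $node = node_load($row->nid);
  // Rate limits still apply: 4 transactions/second, 40,000/day.
  calais_process_node($node);
}
```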
Comment #25
deltab CreditAttribution: deltab commented
Should we not make a feature request out of this? Reindexing the existing nodes ought to be a core feature, same as the Drupal search modules.
Comment #26
webchick
I thought about adding this to a cron hook to make it a "core" feature of the Calais module, but in the end decided not to because:
1. We had 200K nodes to index, and wanted to do that as fast as possible, so we set this to run around every minute. It completed in about a week.
2. We did not want to run cron.php every minute because it's doing all kinds of other things, such as reindexing the search, XML Sitemap stuff, and lord knows what else, none of which can complete in only a minute, and none of which needs to be run that often.
I agree this is no longer a support request though. I'm not sure if it makes sense to include it in the "proper" module or not, since it was kind of a one-off thing.
Comment #27
deltab CreditAttribution: deltab commented
My need is slightly different: we have around 160,000 nodes to index; however, a lot of them are in French and Spanish. When we started with Calais, there was no feature to extract terms from these languages; now it seems there is.
Also, the Calais system is improving all the time, so we would like to see how we can resend our data to Calais periodically and get the new improved metadata.
Do you think it's possible to add a reindex feature without stopping new nodes from being indexed? Or is that more properly a job for another module?
Comment #28
febbraro CreditAttribution: febbraro commented
Thanks for all the comments.
@webchick, Again, thanks for putting this together. It has and will help a TON of people get off the ground with Calais. I think this script should definitely be part of the module, just not a feature that is run on cron, b/c like you said a ton of other crap happens on cron and every site is different. Ideally this will be available to run via shell/cron and also could be integrated into Drush (my newish sweetheart).
For all others: anything like 100k nodes or more could take QUITE some time to process. See @webchick's response above; think in terms of days or weeks, not hours. The biggest holdup is the 4-per-second and 40k-per-day limits that are enforced on any one API key.
@mikeytown2, yeah, you'd have to do something a bit more custom for that, but again, remember the API throttling. The function calais_process_node() will do most of the work for you.
Comment #29
webchick
Oooh. Drush integration is a GREAT idea! I probably would've done that, but we started while 2.0 was still in progress. Thanks, febbraro!
Comment #30
groovypower CreditAttribution: groovypower commented
subscribe
Comment #31
mikeytown2 CreditAttribution: mikeytown2 commented
A couple of notes:
Right now the code doesn't work; it is returning early for some reason.
I think ASC is better than DESC for the SQL, so new nodes will get indexed too.
Note: the operation is not atomic, so if processing 10 nodes takes over 1 minute, you could get "stuck".
This lets me check the last time it was run. I made a cron hook to start it up again if it has been stopped for over 'X' seconds.
I used some of my tricks and made it call itself in a loop, fixing the atomicity issue.
Showing the % of nodes done is nice:
$total = db_result(db_query("SELECT COUNT(*) FROM {node}"));
I'm running a hacked version of Calais now, so I don't feel like uploading my version of this, since it won't work without some code changes. Everything I did to my code is documented above.
Here's the top of my PHP:
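mikeytown2's snippet isn't preserved in this thread. A sketch of the heartbeat-plus-restart pattern he describes, where the variable name, the 300-second threshold, and the restart callback are all assumptions:

```php
<?php
// At the top of the bulk script: record a heartbeat so hook_cron() can
// detect a stalled run (variable name is an assumption).
variable_set('calais_bulk_import_last_run', time());

// ... process one small batch of nodes, updating the heartbeat ...

/**
 * Implementation of hook_cron() -- restart a stalled bulk run.
 */
function mymodule_cron() {
  $last = variable_get('calais_bulk_import_last_run', 0);
  if (time() - $last > 300) {
    // Stalled for over 'X' (here 300) seconds: kick the import off again.
    // mymodule_restart_bulk_import() is a hypothetical helper.
    mymodule_restart_bulk_import();
  }
}
```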
Comment #32
mikeytown2 CreditAttribution: mikeytown2 commented
The SQL join starts to really SLOW down too. I got rid of that, and I'm just going through all the nids of type X.
Comment #33
mikeytown2 CreditAttribution: mikeytown2 commented
I put the important parts of the while loop inside a try block, since it keeps bombing on me.
Comment #34
mikeytown2 CreditAttribution: mikeytown2 commented
If the last node in the database (from the batch) doesn't load, the bulk import stops; here's a way around that.
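The attached fix isn't shown here; one way to avoid stalling on an unloadable node is to advance the checkpoint before attempting to process it (a sketch, not the original patch; the variable name matches the one quoted from the DB below):

```php
while ($row = db_fetch_object($result)) {
  // Advance the checkpoint first, so one bad node can never halt the
  // whole import by being re-selected forever.
  variable_set('calais_bulk_import_last_processed', $row->nid);
  $node = node_load($row->nid);
  if ($node) {
    calais_process_node($node);
  }
}
```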
This is what I found in my DB today
calais_bulk_import_last_processed, N;
Comment #35
shunting CreditAttribution: shunting commented
Subscribing. I've got a mere 20,000 nodes or so, so I let the process run overnight and it got through 58%. Then the browser bombed, and I had to start all over. That's a little frustrating.
Could whatever ends up in Calais core work like re-indexing search? That would seem to be the friendliest. I'd guess this is holding back adoption -- I suspect there are many who have an entire body of work that they'd like to put up, but the bulk misfeatures (?) get in the way.
(I guess the second friendliest would be CLI in drush, with parameters for content type, batch size, and run size)
Comment #36
deltab CreditAttribution: deltab commented
@shunting, this can be very useful.
subscribing!
Comment #37
febbraro CreditAttribution: febbraro commented
I think you're right; something like core/apachesolr search indexing might be best. Drush support should be in there too. I don't have a timetable for it, but it's something that is on the short list.
Comment #38
deltab CreditAttribution: deltab commented
It could be as simple as a function in bulk processing (set nodes per batch, and run one batch per cron), right?
Comment #39
febbraro CreditAttribution: febbraro commented
Good gravy. At long last I got some time to rework the bulk processing. Now includes Drush support too. Should be in a dev release real soon.
http://drupal.org/cvs?commit=395222
Comment #40
febbraro CreditAttribution: febbraro commented
To explain a bit better:
There is now a queue of nodes to be bulk processed. These nodes can be processed either via Drush or on cron. There is an admin interface to add particular node types to the queue, or you can also add rows to the queue table manually.
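The cron side of that design amounts to draining the queue in small batches. A Drupal 6 sketch, where the table name calais_bulk_queue and the batch size of 10 are assumptions rather than the committed schema:

```php
/**
 * Implementation of hook_cron() -- sketch of draining a bulk queue.
 */
function calais_cron() {
  // Take a small batch off the queue each cron run to stay well under
  // the OpenCalais rate limits.
  $result = db_query_range("SELECT nid FROM {calais_bulk_queue}", 0, 10);
  while ($row = db_fetch_object($result)) {
    $node = node_load($row->nid);
    if ($node) {
      calais_process_node($node);
    }
    // Remove the node from the queue whether or not it loaded.
    db_query("DELETE FROM {calais_bulk_queue} WHERE nid = %d", $row->nid);
  }
}
```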