Currently the module passes a URL to Tika to retrieve the files. When you have a lot of files to index, this can be slow. I've attached a patch that uses the real file path to access the file instead.
I'm not sure whether this can be integrated as is, since it wouldn't work if you want to index remote documents (which is possible in D7). Maybe use realpath() only when it's available on the wrapper, and fall back to the URL when it's not?
Comment | File | Size | Author |
---|---|---|---|
#10 | search_api_attachments-realpath-1148162-10.patch | 2.12 KB | svendecabooter |
#4 | search_api_attachments-realpath-1148162-4.patch | 1.89 KB | floeschie |
 | realpath.patch | 772 bytes | Jax |
Comments
Comment #1
Anonymous (not verified)
Hi,
Do you have any numbers on the performance gain?
I'll try to set it up this weekend and test the patch.
Regards
Tim
Comment #2
Jax
Another reason for using realpath is that indexing currently doesn't work with the "drush core-cron" command: the host part is missing from the URL.
Comment #3
Anonymous (not verified)
Hi,
I've run some tests with realpath and there is no real performance gain: indexing 34 large documents took around 1 minute 31 seconds either way, give or take 1 second.
So I don't think we should change it for performance reasons, but maybe to solve the drush problem.
What do you think?
Tim
Comment #4
floeschie
I'm having the same issue with "drush cron". I applied the patch above as well as the one I submitted in #1365148. Then I cleared the index and ran the "drush cron" command. I still get errors for files which are not parsed by Tika (images, plain text).
So I extended the code a bit more and added a new function which determines a file's realpath.
Comment #5
floeschie
Seems to work, except that German umlauts are removed by escapeshellarg(). If a file has something like "ä ö ü" in its name, Tika throws an exception because it cannot find the file once the umlauts have been removed...
Comment #6
Berdir
Stream wrappers can register whether they are remote or local using the type in hook_stream_wrappers(): http://api.drupal.org/api/drupal/modules--system--system.api.php/functio...
The function added in the patch should probably check the type of the scheme and, if it's local, use realpath(), otherwise getExternalPath(). This makes it possible to support both private:// (which currently doesn't work if anonymous users don't have access) and remote wrappers like s3://.
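To illustrate the local-versus-remote decision described above, here is a minimal, language-neutral sketch in Python. It is not the module's PHP code and not part of any attached patch. The registry, the `Wrapper` class, the function name `resolve_for_tika`, and all paths and URLs are hypothetical stand-ins; on a real site the local/remote information would come from the types declared in hook_stream_wrappers().

```python
import posixpath
from dataclasses import dataclass
from typing import Dict, Optional
from urllib.parse import quote


@dataclass(frozen=True)
class Wrapper:
    """Hypothetical stand-in for a registered stream wrapper."""
    is_local: bool  # mirrors the local/remote type a wrapper would declare
    base: str       # directory for local wrappers, base URL for remote ones


# Hypothetical registry; every path and URL here is made up for illustration.
WRAPPERS: Dict[str, Wrapper] = {
    "public": Wrapper(is_local=True, base="/var/www/sites/default/files"),
    "private": Wrapper(is_local=True, base="/var/private"),
    "s3": Wrapper(is_local=False, base="https://bucket.example.com"),
}


def resolve_for_tika(uri: str) -> Optional[str]:
    """Return a filesystem path for local schemes, a URL for remote ones.

    Returns None when the URI has no scheme or the scheme is unknown.
    """
    scheme, sep, target = uri.partition("://")
    wrapper = WRAPPERS.get(scheme) if sep else None
    if wrapper is None:
        return None
    target = target.lstrip("/")
    if wrapper.is_local:
        # Local wrapper: hand the extractor a real path, so no HTTP request
        # (and no host name, login, or certificate) is involved.
        return posixpath.join(wrapper.base, target)
    # Remote wrapper: fall back to an external URL, percent-encoding the path.
    return wrapper.base + "/" + quote(target)
```

With this registry, `resolve_for_tika("private://docs/report.pdf")` yields a filesystem path, while an `s3://` URI yields a URL. Either way, the caller passes a single string to the extractor.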
Comment #7
wwhurley
Another option is to set $base_url in the appropriate settings.php file. That will also fix URLs in the various emails the system sends on cron, if you ever run into that.
Comment #8
sinasalek
Patch #4 works for me. However, I agree with Berdir that it should be a bit smarter. I had a look at http://api.drupal.org/api/drupal/includes%21stream_wrappers.inc/7 but couldn't find anything that clearly indicates whether a wrapper is local or remote. If there really is no such indicator, a workaround should be used.
One solution would be to check the $wrapper class to see whether it is a subclass of DrupalLocalStreamWrapper.
Comment #9
mfb
(from image.gd.inc)
Comment #10
svendecabooter
Attached is a new version of patch #4, based on the feedback from Berdir & mfb.
Comment #11
osopolar
The patch in #10 works for me, thanks.
Comment #12
sinasalek
I haven't tested patch #10, but the approach looks reasonable to me too.
@osopolar, if it works fine for you, you could mark it as RTBC. I'm doing so on your behalf.
Comment #13
coreycondardo
I applied this patch and it seems to reduce the number of errors produced by the drush command that indexes the site. However, I'm getting a lot of this:
INFO - unsupported/disabled operation: EI
What is that?
--Corey
Comment #14
apanag
The patch worked for me. My case was a password-protected site, so every extraction attempt was getting a 401 error code.
Using the realpath solved the problem, because no URL is required anymore.
Some "INFO - unsupported/disabled operation: EI" messages were also shown, but the vast majority of the PDFs were indexed properly.
Thank you for the patch :-)
Comment #15
mike503
The patch works great, after hours of debugging why Java was dumping with absolutely no reasonable explanation... sigh
Comment #16
torpy
This worked brilliantly for me. I had issues because I was using a self-signed SSL certificate (on my development server), which was causing Tika to error out.
Comment #17
izus
Assigning to myself for testing and very probably merging into the 7.x-1.x branch.
Comment #18
izus
Hi,
Just merged it into the 7.x-1.x branch.
Thanks all!