Currently the module passes a URL to Tika to retrieve the files. When you have a lot of files to index, this can be slow. I've attached a file that uses the real file path to access the file instead.

I'm not sure if this can be integrated somehow since this wouldn't work if you want to index remote documents (which is possible in D7). Maybe only use realpath() when it's available on the wrapper and use the URL if it's not?

Comments

Anonymous’s picture

Hi

Have you got any numbers on the performance gain?
I'll try to set it up this weekend and test the patch.

Regards

Tim

Jax’s picture

Another reason for using realpath is that at the moment the indexing doesn't work with the "drush core-cron" command: the host part is missing from the URL.

Anonymous’s picture

Hi

I've run some tests with realpath and there is no real performance gain: for indexing 34 large documents, both approaches took around 1 minute 31 seconds, give or take 1 second.

So I don't think we should change it for the performance gain, but maybe we should try to solve the drush part.

What's your idea...

Tim

floeschie’s picture

Status: Needs work » Needs review

I'm having the same issue with "drush cron". I applied the patch above as well as the one I submitted in #1365148. Then I cleared the index and ran the "drush cron" command. I still get errors for files which are not parsed by Tika (images, plaintext).

So I extended the code a bit more and added a new function which determines a file's realpath.

floeschie’s picture

Seems to work, except that German umlauts are removed by escapeshellarg(). If a file has something like "ä ö ü" in its name, then Tika throws an exception because it cannot find the file after the umlauts have been removed...
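escapeshellarg() is known to drop multibyte characters when the LC_CTYPE locale is not a UTF-8 one (for example the default "C" locale). A minimal sketch of one possible workaround follows. It assumes that a UTF-8 locale such as en_US.UTF-8 is installed on the server, and example_escape_path_utf8() is a hypothetical helper name, not something taken from the patch:

```php
<?php
/**
 * Hypothetical helper: escape a file path for the shell under a UTF-8 locale.
 *
 * Assumption: the en_US.UTF-8 locale exists on the server. If it does not,
 * setlocale() returns FALSE and the current locale stays in effect.
 */
function example_escape_path_utf8($path) {
  // Passing '0' only reads the current LC_CTYPE setting.
  $previous = setlocale(LC_CTYPE, '0');
  setlocale(LC_CTYPE, 'en_US.UTF-8');
  $escaped = escapeshellarg($path);
  // Restore the previous locale so other code is not affected.
  if ($previous !== FALSE) {
    setlocale(LC_CTYPE, $previous);
  }
  return $escaped;
}
```

Whether this keeps the umlauts depends on the locales available on the host, so it is worth checking with a test file before relying on it.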

Berdir’s picture

Stream wrappers can register if they are remote or local using the type in hook_stream_wrappers(): http://api.drupal.org/api/drupal/modules--system--system.api.php/functio...

The function added in the patch should probably check the type of the scheme and, if it's local, use realpath() and otherwise getExternalUrl(). This allows supporting both private:// (which currently doesn't work if anonymous users don't have access) and remote wrappers like s3://.

wwhurley’s picture

Another option is to set your $base_url in the appropriate settings.php file. That'll also fix URLs in various emails sent by the system on cron, if you ever run into that.

sinasalek’s picture

Patch #4 works for me; however, I agree with Berdir that it should be a bit smarter. I had a look at http://api.drupal.org/api/drupal/includes%21stream_wrappers.inc/7 but couldn't find anything that clearly indicated whether a wrapper is local or remote. If that's the case, then a workaround should be used.
One solution could be to check whether the $wrapper class is a subclass of DrupalLocalStreamWrapper.

mfb’s picture

 $local_wrappers = file_get_stream_wrappers(STREAM_WRAPPERS_LOCAL);
 if (isset($local_wrappers[$scheme])) {
   // Then it's local.
 }

(from image.gd.inc)
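Putting this check together with Berdir's suggestion, the decision between a real path and a URL could look roughly like the sketch below. It assumes a bootstrapped Drupal 7 site; example_tika_file_location() is a hypothetical name and this is not the code from the attached patches. file_create_url() is used for the non-local case because it asks the stream wrapper for its external URL:

```php
<?php
/**
 * Hypothetical helper: return a location that Tika can read for a file URI.
 *
 * Local schemes (public://, private://, ...) resolve to a filesystem path,
 * everything else falls back to a URL.
 */
function example_tika_file_location($uri) {
  $scheme = file_uri_scheme($uri);
  $local_wrappers = file_get_stream_wrappers(STREAM_WRAPPERS_LOCAL);
  if ($scheme && isset($local_wrappers[$scheme])) {
    // drupal_realpath() returns FALSE when the path cannot be resolved.
    $path = drupal_realpath($uri);
    if ($path !== FALSE) {
      return $path;
    }
  }
  // Remote wrapper (e.g. s3://) or unresolvable path: use the external URL.
  return file_create_url($uri);
}
```

With this shape, private:// files are read straight from disk, so neither access checks for anonymous users nor a missing host under drush get in the way.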

svendecabooter’s picture

Attached is a new version of patch #4 based on the feedback by Berdir & mfb

osopolar’s picture

Patch in #10 works for me, Thanks.

sinasalek’s picture

Status: Needs review » Reviewed & tested by the community

Haven't tested patch #10, but the approach looks reasonable to me too.
@osopolar if it works fine for you, you could mark it as RTBC. I'm doing it on your behalf.

coreycondardo’s picture

I applied this patch and it seems to reduce the number of errors that the drush command to index the site produces. However, I'm getting a lot of this:

INFO - unsupported/disabled operation: EI

What is that?

--Corey

apanag’s picture

Patch worked for me. However, my case was a password-protected site, so every extraction attempt was getting a 401 error code.
Using the realpath solved the problem, because no URL is required anymore.

Also some "INFO - unsupported/disabled operation: EI" messages were shown, but the vast majority of the .pdfs were indexed properly.

Thank you for the patch :-)

mike503’s picture

patch works great. after hours of debugging why java was dumping with absolutely no reasonable explanation... sigh

torpy’s picture

This worked brilliantly for me. I had issues where I was using a self-signed SSL certificate (on my development server), which was causing Tika to error out.

izus’s picture

Assigned: Unassigned » izus

Assigning to me for testing and very probably merging into the 7.x-1.x branch.

izus’s picture

Status: Reviewed & tested by the community » Fixed

Hi,
Just merged it into the 7.x-1.x branch.
Thanks all!

izus’s picture

Assigned: izus » Unassigned

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.