Hi, I'm using Drupal 7.14 on a Linux server.

I was able to set up Apache Solr and its Drupal module with no path problems. However, Tika has not been so simple. I have tried installing tika-app-1.2.jar outside my web directory, inside sites/all/library, and in the apachesolr_attachments/tika directory, all to no avail. When I test the settings I keep getting the error message:

Text can not be succesfully extracted. Please check your settings

and in /admin/reports/status:

Error
Apache Solr Attachments Java executable not found
Could not execute a java command. You may need to set the path of the correct java executable as the variable 'apachesolr_attachments_java' in settings.php.

I've tried giving the absolute path as:
/var/home/username/domain.com/www/sites/all/modules/apachesolr_attachments/tika

and the Tika jar file as tika-app-1.2.jar and tika-app-1.1.jar (I downloaded both and put them in the same /tika directory).

I've chosen:

Extract using
Tika (local java application)

I AM able to make Tika run if I simply access the library via the command line. So:

$ java -jar tika-app-1.2.jar -t ../tests/test-tika.pdf

correctly returns:

Testing Apache Solr Attachments text extraction

I'm running out of things to test. I read an older issue about making sure settings.php knows where the java executable is, but typing 'java' on my server already finds the executable without a full path:

$ java -version
java version "1.6.0_24"
Java(TM) SE Runtime Environment (build 1.6.0_24-b07)
Java HotSpot(TM) 64-Bit Server VM (build 19.1-b02, mixed mode)
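
Since java runs fine from my interactive shell, one thing that may be worth ruling out is that the web server process sees a different, shorter PATH than my login shell does. The sketch below is only an illustration: resolve_in_path is a hypothetical helper name, and the PATH value passed to it is an assumption about what a web server might use, not something taken from the module.

```shell
#!/bin/sh
# Hypothetical helper: report where a command resolves when only the
# given PATH is available (a clean environment, similar in spirit to
# what a web server process may see).
resolve_in_path() {
  # $1 = command name, $2 = PATH value to test with
  env -i PATH="$2" /bin/sh -c 'command -v "$1"' sh "$1"
}

# Example: does "java" resolve with a minimal PATH? (assumed value)
if java_bin=$(resolve_in_path java "/usr/bin:/bin"); then
  echo "java resolves to: $java_bin"
else
  echo "java is not on that PATH; use its absolute path instead"
fi
```

If java only resolves with the login PATH, the absolute path printed by "command -v java" in the login shell is probably the value to try for the 'apachesolr_attachments_java' variable.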

Please let me know if there is something else I could try.

Thank you.

Comments

dawnbuie’s picture

Issue tags: +installation, +configuration, +tika, +apach solr, +apach solr attachment, +absolute path

Also note that I've read the readme.txt and have seen this last paragraph:

If you are using Solr to extract your content, you need to copy (or symlink)
the contents of contrib/extraction/lib to a directory named lib under your
solr home, or alter solrconfig.xml to add the original directory as a
lib directory.

I did apply solrconfig.tika.patch to the solrconfig.xml file in my working Solr directory. The patch contents are:

Index: solrconfig.xml
===================================================================
RCS file: /cvs/drupal-contrib/contributions/modules/apachesolr/solrconfig.xml,v
retrieving revision 1.1.2.20
diff -u -p -r1.1.2.20 solrconfig.xml
--- solrconfig.xml	14 Oct 2009 13:28:40 -0000	1.1.2.20
+++ solrconfig.xml	26 Oct 2009 00:12:24 -0000
@@ -357,7 +357,7 @@
     -->
   <requestDispatcher handleSelect="true" >
     <!--Make sure your system has some authentication before enabling remote streaming!  -->
-    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />
+    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="8192" />
 
     <!-- Set HTTP caching related parameters (for proxy caches and clients).
           
@@ -515,6 +515,16 @@
     </lst>
   </requestHandler>
 
+  <!-- An extract-only path for accessing the tika utility -->
+  <requestHandler name="/extract/tika" class="org.apache.solr.handler.extraction.ExtractingRequestHandler" startup="lazy">
+
+    <lst name="defaults">
+    </lst>
+    <!-- This path only extracts - never updates -->
+    <lst name="invariants">
+      <bool name="extractOnly">true</bool>
+    </lst>
+  </requestHandler>
 
   <!--
    Search components are registered to SolrCore and used by Search Handlers

I wonder if I'm missing any other files where I'm supposed to add the following path to my tika app?

/var/home/username/domain.com/www/sites/all/modules/apachesolr_attachments/tika/tika-app-1.2.jar

Thanks again. I'd love to be able to try this excellent module out.

scott.whittaker’s picture

I have the exact same issue as dawnbuie when trying to get Drupal to execute Tika on OS X. Apachesolr works and is indexing content. Tika works on the command line. I keep getting the above message when hitting the "Test your tika extraction" button, and I can't get rid of the "java executable not found" message in the Status report. I have tried setting a variable $apachesolr_attachments_java in settings.php, but nothing I've tried removes that message. I have tried setting it to /usr/bin/java, the location of JAVA_HOME, and the location of tika.jar.

I'm out of ideas.

moehac’s picture

I'm having the same issue. Thanks for your help and consideration.

nick_vh’s picture

Can you please try with Tika 1.0? I'm not sure whether Tika 1.2 introduced some major changes.

nick_vh’s picture

Status: Active » Postponed (maintainer needs more info)

jfhovinne’s picture

For my part, text extraction works. Here is my configuration:

  • Debian
  • Java 1.6.0_26
  • Tomcat 6
  • Solr 3.6.1 in multicore
  • Tika 1.2 with tika-app-1.2.jar in /usr/share/java and symlinked as /usr/share/java/tika-app.jar
  • Drupal apachesolr 7.x-1.1, apachesolr_attachments 7.x-1.2

apachesolr_attachments settings (nothing in settings.php):

  • Extract using: Tika (local java application)
  • Tika directory path: /usr/share/java
  • Tika jar file: tika-app.jar
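
The symlink mentioned in the list above could be created along these lines. This is a rough sketch, not the exact commands I ran: link_tika_jar is a hypothetical helper name, and the example call is left commented out so nothing is changed by accident.

```shell
#!/bin/sh
# Hypothetical helper: point a stable name (tika-app.jar) at a versioned jar
# in the same directory, so the module setting survives Tika upgrades.
link_tika_jar() {
  # $1 = directory holding the jar, $2 = versioned jar file name
  [ -f "$1/$2" ] || { echo "missing: $1/$2" >&2; return 1; }
  # -s symbolic, -f replace an existing link, -n do not follow an existing link
  ln -sfn "$2" "$1/tika-app.jar"
}

# Example call (paths from the list above):
# link_tika_jar /usr/share/java tika-app-1.2.jar
```

With the link in place, the "Tika jar file" setting can stay as tika-app.jar when a newer jar is installed; only the link needs to be recreated.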

I've manually patched solrconfig.xml and restarted Tomcat.

HTH
Jean-François

debzani’s picture

In my case, I had to put the Tika jar in a subfolder of the Apache Solr server installation, and then it worked. The full path is as follows:

~\apache-solr-3.6.0\contrib\extraction\lib\

eminencehealthcare’s picture

I am experiencing this exact same issue.

_randy’s picture

I had this issue as well. I used the tika patch for solrconfig.xml and found myself exactly in the same situation.

I removed the tika handler from the solrconfig.xml file and manually added the configuration to the xml file with the rest of the Request Handlers. It seems that the tika Request Handler setup injected itself in the middle of the <query> xml tags.
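
For anyone wanting to check whether the patch landed in the wrong place, something like the following might help. It is a naive, line-based sketch under my own assumptions: tika_handler_inside_query is a hypothetical name, and the check does not parse XML or skip comments, so treat a result as a hint only.

```shell
#!/bin/sh
# Hypothetical check: succeed (exit 0) if the /extract/tika handler is
# declared between <query> and </query> in the given solrconfig.xml.
# Line-based and naive: it does not parse XML or skip comments.
tika_handler_inside_query() {
  awk '
    /<query>/   { inside = 1 }
    /<\/query>/ { inside = 0 }
    inside && /<requestHandler[^>]*name="\/extract\/tika"/ { found = 1 }
    END { exit !found }
  ' "$1"
}

# Example call (the path is a placeholder for your own Solr conf directory):
# tika_handler_inside_query /path/to/solr/conf/solrconfig.xml \
#   && echo "handler is nested inside <query>; move it next to the other handlers"
```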

paultrotter50’s picture

I am also getting the error message "Text can not be succesfully extracted. Please check your settings" when I press "Test your tika extraction". I have tried with tika-app-1.0.jar and tika-app-1.2.jar. I am using Tomcat 5.5.36 with Solr 3.6.2.

admin/config/search/apachesolr/settings shows my localhost server in green, so that appears to be working.

On /admin/config/search/apachesolr I have noticed that the 'value' of 'Schema' is 'drupal-4.1-solr-3.x' which seems strange as I'm using Drupal 7.

I have used the tika patch for solrconfig.xml.

I would really appreciate any suggestions as to what I might be doing wrong, or what I should try next.

Panther256’s picture

I had to add the following line to my settings.php (at the bottom) to get Tika to work:

$conf['apachesolr_attachments_java'] = 'C:/Progra~1/Java/jre7/bin/java.exe -Xms20m -Xmx64m';

Note: I added just that line; I didn't need to wrap it in PHP tags.

Of course you will need to modify the path to your java.exe for your configuration.

-- Gene

jdu’s picture

I was having this same problem. It turns out that, on OS X, I needed to redirect stderr to stdout in the command passed to shell_exec().

In apachesolr_attachments.index.inc, somewhere around line 135, do this: return shell_exec($cmd.' 2>&1');

This made it work with my MAMP setup, and I have since moved the same code to a LAMP environment with no issues.
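
To show what the redirect changes, here is a small shell sketch. It assumes nothing about the module: fake_java is a made-up stand-in for a java call that fails, and the error text is invented. The point is only that capturing a command's output picks up stdout, so anything written to stderr is lost unless 2>&1 is added.

```shell
#!/bin/sh
# Stand-in for a failing java call: prints its error on stderr only.
fake_java() {
  echo "error: something went wrong" >&2
  return 1
}

# Command substitution captures stdout only, so this stays empty.
stdout_only=$(fake_java) || true

# With 2>&1, stderr is merged into stdout and the message is captured.
with_stderr=$(fake_java 2>&1) || true

echo "stdout only: [$stdout_only]"
echo "with 2>&1:   [$with_stderr]"
```

Even where the redirect does not fix extraction by itself, it should at least make the real java error show up in what shell_exec() returns.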

unqunq’s picture

EDIT: I got it to work on the AWS server by placing the app file just outside the site root folder. I thought I had it configured the same on my local machine but it looks like something was wrong locally. Now it passes the test and indexes all attachments.

I could not get it to work either.

I run a local Drupal 7 instance on my Mac and configured it to use local Tika. The path is correct, and if I run java -jar tika-app-1.4.jar in a terminal, Tika starts:

Welcome to Apache Tika version 1.4!
To see what Tika can do, just drop a file or a URL to this window. Use the View menu to switch views.

Drupal still complains that it cannot extract:

Text can not be successfully extracted. Please check your settings

My java version:

java -version
java version "1.6.0_65"
Java(TM) SE Runtime Environment (build 1.6.0_65-b14-462-11M4609)
Java HotSpot(TM) 64-Bit Server VM (build 20.65-b04-462, mixed mode)

bart atlas’s picture

Panther256's fix in comment #11 worked for me on WAMP. Much obliged!

nibo’s picture

I had the same problem on my system (CentOS 6.3).

To find out what the problem was, I used dpm() to print what shell_exec() was returning in the apachesolr_attachments_extract_using_tika function. The result was:

Error occurred during initialization of VM
Could not reserve enough space for code cache

The real problem, however, was not the memory reservation but SELinux. Disabling it on my dev environment did the trick.
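
A quick way to see whether SELinux is even in play might look like this. It is a sketch under assumptions: selinux_mode is a hypothetical helper name, and it simply reports "unavailable" on systems without the SELinux tools.

```shell
#!/bin/sh
# Report the SELinux mode, or "unavailable" when the SELinux tools are absent.
selinux_mode() {
  if command -v getenforce >/dev/null 2>&1; then
    getenforce
  else
    echo "unavailable"
  fi
}

echo "SELinux mode: $(selinux_mode)"
# On a dev machine, "setenforce 0" (as root) switches to permissive mode
# until the next reboot, which is a milder test than disabling SELinux.
```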

revathi.b’s picture

I had to give permissions on the jar file. This command resolved my problem:
sudo chmod -R 775 tika-app-1.12.jar
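
A narrower variant of the same idea, as a sketch only: make_jar_readable is a hypothetical helper name, and 644 is my assumption about what is sufficient, since a jar run with "java -jar" has to be readable rather than executable.

```shell
#!/bin/sh
# Hypothetical helper: make a jar readable by everyone and writable only by
# its owner. A jar run with "java -jar" needs to be readable, not executable,
# and -R has no effect on a single file.
make_jar_readable() {
  # $1 = path to the jar
  chmod 644 "$1" && ls -l "$1" | cut -c1-10
}

# Example call (file name from the command above):
# make_jar_readable tika-app-1.12.jar
```

The directories leading to the jar also need to be searchable (execute bit) by the web server user, otherwise the file cannot be reached whatever its own mode is.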