I have solr and tika running for my Drupal 7.9 installation which is running on Windows 2003 Server under IIS. Text extraction is working correctly for text files. However pdf and doc file text extraction is not working. For every pdf and doc file I get a notice after the cron job runs.
Notice: Undefined index: filepath in apachesolr_attachments_add_documents() (line 133 of H:\Inetpub\wwwroot\drupal\modules\apachesolr_attachments\apachesolr_attachments.index.inc).
In the audit log I see...
php -> Notice: Undefined index: filepath in apachesolr_attachments_add_documents() (line 133 of H:\Inetpub\wwwroot\drupal\modules\apachesolr_attachments\apachesolr_attachments.index.inc).
Apache Solr Attachments warning -> Could not extract any indexable text from
Notice that nothing follows "from", almost as if there is an issue getting the file name. This is strange because, as I said, text extraction for text files is working. I've tested Tika from a command window and it does extract text from doc and pdf files correctly. Any ideas?
Comments
Comment #1
ds1964 commented: Ditto, having exactly the same problem (running on a remote LAMP shared host). Got to this point after applying a couple of other recent patches.
Thanks for any help with this.
Comment #2
pwolanin commented: Text files don't use Tika; they are just consumed directly.
Comment #3
jh81 commented: Correct, text files don't use Tika, so it certainly seems like a problem with the Apache Solr Attachments module getting Word and PDF documents attached to nodes sent to Tika. I've spent hours on this and still have not been able to resolve it.
Comment #4
jh81 commented: In apachesolr_attachments.index.inc, line 300, the PHP function mb_detect_encoding is used. This is where I was having an issue: it was erroring on this line without logging the error, so the file attachment module would just stop processing on the first file it tried to extract text from. I figured that out by putting watchdog calls around various statements in the apachesolr_attachments_extract_using_tika function. When I searched for mb_detect_encoding, I discovered it is not enabled by default. So I changed my PHP installation to include the EXIF extension (http://php.net/manual/en/install.windows.extensions.php), and that seemed to get the file attachments processed, but I was still receiving the "Could not extract any indexable text from" error. So I added a watchdog call to log the command used to extract the document text with Tika and came up with this...
C:\Progra~1\Java\jre6\bin\java.exe "-Dfile.encoding=UTF8" -cp "C:\Program Files\Apache Software Foundation\Tomcat 5.5\webapps\apache-solr-3.4.0\WEB-INF\lib" -jar "C:\Program Files\Apache Software Foundation\Tomcat 5.5\webapps\apache-solr-3.4.0\WEB-INF\lib\tika-app-1.0.jar" -t "H:\Inetpub\wwwroot\drupal\sites\default\files\report.pdf"
But the command was still returning nothing, hence the "Could not extract any indexable text from" error. I copied and pasted that line into a command window on the server and the text content of the document was output, so Tika was working properly and all the paths were correct.
The command is run using the PHP shell_exec function. I did some looking around and found out that in Windows you have to grant the permissions READ & EXECUTE, READ for IUSR_ on cmd.exe in order for shell_exec to work. I also granted the permissions READ & EXECUTE, READ for IUSR_ on the Tomcat folder, the Java folder, and my files folder under Drupal.
But still no luck. All the documents attached to pages are getting processed but the function apachesolr_attachments_extract_using_tika is still returning nothing for pdf and doc files. I created a test php page and ran the command above which echoed the results to the page and it works fine. So Tika is working, the paths are correct, and the php shell_exec command is running for an anonymous user. I'm guessing this is still related to permissions somehow but not sure what the solution is.
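For anyone retracing these steps, here is a minimal diagnostic sketch in PHP. It checks whether mb_detect_encoding() is available (the function is provided by PHP's mbstring extension) and runs a command through shell_exec() with stderr redirected to stdout, so that a failing command returns its error text rather than nothing. The function names diag_mbstring_available() and diag_run() are hypothetical and are not part of the apachesolr_attachments module.

```php
<?php
// Hypothetical diagnostic helpers; these names are illustrative only and do
// not come from the apachesolr_attachments module.

/**
 * Returns TRUE when mb_detect_encoding() can be called, i.e. the mbstring
 * extension is loaded in this PHP installation.
 */
function diag_mbstring_available() {
  return function_exists('mb_detect_encoding');
}

/**
 * Runs a command with stderr merged into stdout ("2>&1" is understood by
 * both cmd.exe and POSIX shells), so error messages are captured too.
 * Returns an empty string when shell_exec() produces no output at all.
 */
function diag_run($cmd) {
  $output = shell_exec($cmd . ' 2>&1');
  return is_string($output) ? $output : '';
}

// Example (commented out because it needs Java on the PATH):
// echo diag_mbstring_available() ? "mbstring OK\n" : "mbstring missing\n";
// echo diag_run('java -version');
```

Calling diag_run() with the full Tika command line logged above, from a page served by IIS, should show whatever Java or cmd.exe prints on failure. That may help tell a permissions problem apart from a malformed command string.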
Comment #5
jh81 commented: Turns out this was not a permissions issue but a configuration issue.
The Apache Solr Attachments Java variable set up in settings.php was:
$conf = array(
'apachesolr_attachments_java' => 'C:\Progra~1\jre6\bin\java.exe'
);
In apachesolr_attachments.admin.inc, after the line $cmd = escapeshellcmd($java . $java_opts)...,
the variable became 'C: Progra 1 jre6 bin java.exe'.
So it looks like the backslashes and the ~ were being stripped out and replaced with spaces. I changed the path to use forward slashes and moved my Java installation to the root of the drive.
Now my variable looks like this.
$conf = array(
'apachesolr_attachments_java' => 'C:/Java/jre6/bin/java.exe',
);
and it's working.
Check the page http://drupal.org/node/1162492 for where I found the solution. In Windows/IIS you must use forward slashes, not backslashes, when setting the java variable.
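A small sketch of how a path could be checked before it goes into settings.php, assuming the mangling comes from escapeshellcmd() as described above. PHP documents that escapeshellcmd() escapes or replaces characters such as the backslash and ~, while the colon, forward slash, and dot are left alone. The helper names diag_normalize_path() and diag_survives_escaping() are hypothetical and not part of the module.

```php
<?php
// Hypothetical helpers for illustration; not part of apachesolr_attachments.

/**
 * Converts backslashes in a Windows path to forward slashes.
 * Note: this does not remove other problem characters such as "~".
 */
function diag_normalize_path($path) {
  return str_replace('\\', '/', $path);
}

/**
 * Returns TRUE when escapeshellcmd() leaves the string unchanged, meaning
 * the path contains none of the characters that function alters.
 */
function diag_survives_escaping($path) {
  return escapeshellcmd($path) === $path;
}

// Example:
// var_dump(diag_survives_escaping('C:/Java/jre6/bin/java.exe'));
// var_dump(diag_survives_escaping('C:\\Progra~1\\jre6\\bin\\java.exe'));
```

Converting the slashes alone is not enough for a short path like Progra~1, because the ~ is still altered. That is consistent with the Java installation having been moved to a plain directory such as C:/Java.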
Comment #6
Nick_vh