| Project: | Apache Solr Attachments |
| Version: | 6.x-2.0-alpha2 |
| Component: | Documentation |
| Category: | task |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs review |
Issue Summary
New here: just installed a current version of solr 1.4.1 and have it running on D6.19.
I need to search attachments. Going through the documentation here, I'm confused as to what's required/recommended.
1. By default the apachesolr_attachments module UI offers the option of using Tika locally or Solr remotely. I don't care about speed, as not much changes in this collection. If I choose Solr remote, is there anything left to do except celebrate after the next cron run?
2. The UI prints out [tika-0.3.jar] in the Tika Jar File text field. Why? Is it saying this is the version of Tika it has found bundled with the solr 1.4.1 distribution? Or is this just a default message provided for example? And where did it find this jar file? because....
3. Where is the Tika Directory path? Looking through the module documentation I did not get this as a clear picture. There's a tika file at /usr/local/solr/contrib/extraction/lib, but this is just a single jar, not a directory.
4. There's plenty of discussion in this module about higher versions of Tika...the main site is now up to 0.7. Are there significant advantages to the newer version that justify the Maven build/configure process?
Thanks in advance...looking forward to some tasty attachment indexes...
Comments
#1
tika 0.3 was the release when I first wrote the module - certainly not the best choice at this point. Use tika 0.7+
Using Solr remote requires patching the solrconfig.xml and making sure the tika lib jars are in the right place (mostly an issue when using multiple cores). If that doesn't sound trivial, local Tika may be the better option for you.
#2
Howdy and thanks for the reply. I'm surmising the best approach is to get through the local Tika install. Dealing with Maven sounds like an extra opportunity for problems. Is there a pre-built Tika 0.7 around, or is that simply not the way this works? I.e., do we always need to assemble an installation via Maven?
And just to clarify, when you guys are talking about multiple cores, what is that referring to? On any of my VPS environments I'm running half a dozen websites that might want to take advantage of Solr. If Solr (and Tika) are installed someplace back in /usr/local/*, are there special considerations for running multiple independent Lucene indexes?
If I can get this straight I'm happy to write a bit of newbie/clueless documentation to tack onto the existing efforts here.
Thanks for your assistance...
#3
A single Solr can have multiple search indexes (cores). In that case the Tika library files need to be placed differently than in a single search index setup.
I'm not sure there is a public pre-built tika app jar. They are annoying that way. It's 17MB, so can't really attach it here, but if you are desperate, drop me a note.
#4
I did not find any complete documentation for this. Here is the translation of a tutorial I wrote in French at http://techoop.insite.coop
The Apache Solr Attachments module lets you search files attached to content through the optional core Upload module or the contributed File Field module. It relies on Tika, a powerful text extraction tool, which can be installed either on the local server (where Drupal is installed) or on the Solr server (which can be remote). Although attachments then have to be transferred over the network, running Tika on a remote Solr server avoids installing Tika on every local server that uses that Solr server, and makes it possible to contract out all search-related functions. So this tutorial mainly describes how to install Tika on a dedicated server and how to use it with Apache Solr Attachments.
Warning: using Tika on a remote server requires transferring files over the network, so avoid it for bulky attachments.
1. Download Apache Solr Attachments module and extract it in the right directory
2. Copy the patch provided by the module (solrconfig.tika.patch) into the Solr conf directory
3. Apply the patch
root@solr:~# cd /usr/share/solr/cores/my-site/conf/
root@solr:/usr/share/solr/cores/my-site/conf# patch < solrconfig.tika.patch
patching file solrconfig.xml
root@solr:/usr/share/solr/cores/my-site/conf#
4. Create a lib folder in the Solr home (or in the core directory in a multicore setup)
5. Download the latest version of Solr and copy the file dist/apache-solr-cell-1.4.1.jar and all the libraries from contrib/extraction/lib into the new lib folder
6. Restart Tomcat
root@solr:/usr/share/solr/cores/my-site/conf# service tomcat6 restart
Stopping Tomcat servlet engine: tomcat6.
Starting Tomcat servlet engine: tomcat6.
root@solr:/usr/share/solr/cores/my-site/conf#
7. Check that extraction works with any file (e.g. a PDF)
curl "http://localhost:8080/solr/my-site/extract/tika?literal.id=doc1&commit=true" -F "myfile=@path/to/file.pdf"
8. Activate Apache Solr Attachments
9. Go to the module’s administration interface
http://192.168.1.6/~my-site/?q=admin/settings/apachesolr/attachments
10. Select “Extract using Solr (remote server)” and save configuration
11. Reindex attachments if needed
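Steps 4 and 5 above can be sketched as a small shell function. This is only an illustration under stated assumptions: the function name install_extraction_libs and both of its arguments are hypothetical, and the wildcard for the Solr Cell jar assumes the distribution ships exactly one such file in dist/.

```shell
#!/bin/sh
# Sketch of steps 4 and 5: create lib/ and copy the extraction jars into it.
# Both arguments are hypothetical paths; adjust them to your own layout.
#   $1  unpacked Solr distribution (contains dist/ and contrib/)
#   $2  Solr home, or the core directory in a multicore setup
install_extraction_libs() {
  solr_dist="$1"
  target="$2"

  # Refuse to continue if the distribution does not look as expected.
  if [ ! -d "$solr_dist/contrib/extraction/lib" ]; then
    echo "no contrib/extraction/lib under $solr_dist" >&2
    return 1
  fi

  mkdir -p "$target/lib" || return 1

  # Solr Cell (the extraction handler) plus the libraries it depends on.
  cp "$solr_dist"/dist/apache-solr-cell-*.jar "$target/lib/" || return 1
  cp "$solr_dist"/contrib/extraction/lib/*.jar "$target/lib/" || return 1
}
```

With the paths used in this tutorial, the call might look like install_extraction_libs /path/to/apache-solr-1.4.1 /usr/share/solr/cores/my-site, where the first path is a placeholder for wherever the Solr archive was unpacked. Tomcat still has to be restarted afterwards, as in step 6.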
Alternative: Tika on the same server as Drupal
1. Download the latest version of Tika
svn export http://svn.apache.org/repos/asf/tika/trunk/ tika
2. Compile Tika (the maven2 package is required)
cd tika
export MAVEN_OPTS="-Xmx1024m -Xms512m"
mvn install
3. Check that extraction works on a file (e.g. a PDF)
java -jar ./tika-app/target/tika-app-0.8-SNAPSHOT.jar -t [file-url]
4. In the module’s administration interface, select “Tika (local java application)”, fill in the jar file’s path and name and save configuration
5. Reindex attachments if needed
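Before entering the jar path in the module settings, the check from step 3 can be wrapped in a small function that fails loudly when something is missing. This is a sketch, not part of the module: the name tika_smoke_test and the exit codes are hypothetical choices, and only the -t option already shown in step 3 is used.

```shell
#!/bin/sh
# Sketch: verify that a local Tika jar can extract text from a sample file.
#   $1  path to the tika-app jar
#   $2  path to a sample document (e.g. a PDF)
# Exit codes (arbitrary choices for this sketch):
#   0 text extracted, 1 extraction failed or produced nothing,
#   2 jar missing, 3 sample file missing, 4 java not on PATH
tika_smoke_test() {
  jar="$1"
  sample="$2"

  [ -f "$jar" ] || { echo "Tika jar not found: $jar" >&2; return 2; }
  [ -f "$sample" ] || { echo "sample file not found: $sample" >&2; return 3; }
  command -v java >/dev/null 2>&1 || { echo "java not on PATH" >&2; return 4; }

  # -t asks tika-app for plain text output, as in step 3.
  text=$(java -jar "$jar" -t "$sample") || return 1

  # Treat empty output as a failure: nothing would be indexed.
  [ -n "$text" ]
}
```

For the build above, the call might be tika_smoke_test ./tika-app/target/tika-app-0.8-SNAPSHOT.jar path/to/file.pdf; a non-zero exit status means the jar path should not be saved in the module settings yet.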
#5
#6
Barracuda installs 8 cores, in /opt/solr/
site_001
site_002
site_003
.. etc
While the first core is indexing site_001 PDF attachments etc using /opt/tomcat6/lib/tika-app-0.9.jar just fine (very tasty indeed!), site_002's attachments are not being indexed.
Could you please expand a bit on 'the placement of the tika library files needs to differ' when using multiple localhost cores?
Thanks in advance - one of my favorite modules by far :)
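If extraction is done by Solr rather than by a local Tika jar, the tutorial in #4 (step 4) puts the lib folder inside each core directory in a multicore setup. A quick way to see which cores lack one is sketched below. The function name cores_without_lib is hypothetical, and treating any subdirectory that contains conf/solrconfig.xml as a core is an assumption about the layout.

```shell
#!/bin/sh
# Sketch: print every core directory under a Solr home that has no jars in
# its own lib/ folder. A "core" here is any subdirectory containing
# conf/solrconfig.xml, which is an assumption about the layout.
#   $1  Solr home holding the core directories
cores_without_lib() {
  solr_home="$1"
  for core in "$solr_home"/*/; do
    core="${core%/}"
    [ -f "$core/conf/solrconfig.xml" ] || continue
    # ls fails when the glob matches nothing, i.e. no jars are present.
    if ! ls "$core"/lib/*.jar >/dev/null 2>&1; then
      echo "$core"
    fi
  done
}
```

Running cores_without_lib /opt/solr on the layout described above would list the cores that have no jars of their own. Whether that is the cause here depends on whether the module is set to extract through Solr or through the local Tika application.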
#7
The tika-app-0.10.jar file is now provided with no need to compile!
Hopefully they keep it that way in future.
See http://tika.apache.org/download.html