Seems like a no-brainer to switch to using Tika as the default local extraction tool.

Tika can extract many types of document content and can run on Java 1.5+ (or even 1.4). See: http://lucene.apache.org/tika/formats.html

http://lucene.apache.org/tika/gettingstarted.html

The quickest way to get a pre-built Tika is likely a Solr nightly build such as:

http://people.apache.org/builds/lucene/solr/nightly/solr-2009-04-29.zip

Look in apache-solr-nightly/example/solr/lib for the Tika jar.

I can, for example, pull down a PDF file:

wget http://freesoftware.mit.edu/papers/lakhaniwolf.pdf

and extract the text as:

java -jar tika-0.3.jar -t lakhaniwolf.pdf
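
The same command can be wrapped in a small shell helper so the jar path and input file are passed as arguments. This is only a minimal sketch, assuming java is on the PATH and that the jar accepts -t as shown above; the function name extract_text is hypothetical and not part of Tika.

```shell
# Hypothetical helper: extract_text is an illustrative name, not part of Tika.
# Usage: extract_text path/to/tika-0.3.jar input.pdf > output.txt
extract_text() {
  tika_jar="$1"   # path to the Tika jar, e.g. tika-0.3.jar
  input="$2"      # document to extract text from
  # -t requests plain-text output, which goes to stdout
  java -jar "$tika_jar" -t "$input"
}
```

Because the text goes to stdout, the caller can redirect it to a file or pipe it into an indexer.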

Comment  File                             Size      Author
#3       tika-attachments-449214-3.patch  39.49 KB  pwolanin

Comments

pwolanin’s picture

Also, it seems that Java 5 on Mac OS X 10.5 is missing some XML classes, so it cannot decode docx files.

Switching to Java 6 seems to resolve this issue, and also seems to prevent the MAMP bug.

pwolanin’s picture

Status: Active » Fixed
Status  Size
new     39.49 KB

Seems to be basically working. Here's the diff to the branch point showing new code committed to HEAD.

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.