use tika

Switching to Tika as the default local extraction tool seems like a no-brainer.

Tika can extract content from many document types and runs on Java 1.5+ (or even 1.4). See: http://lucene.apache.org/tika/formats.html

The quickest pre-built tika is likely a Solr nightly such as:

look in apache-solr-nightly/example/solr/lib

I can, for example, pull down a PDF file:

and extract the text as:

java -jar tika-0.3.jar -t lakhaniwolf.pdf
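For anyone who wants to script this, here is a rough sketch of wrapping the same command-line call from Python. It is only an illustration, not the module's code: the function names `build_tika_command` and `extract_text`, the 60-second timeout, and the UTF-8 decoding are all assumptions of mine. The only parts taken from above are the jar name and the `-t` flag.

```python
import subprocess
from pathlib import Path


def build_tika_command(jar_path, document_path, java_bin="java"):
    """Return the argument list for the command-line call shown above.

    Hypothetical helper: it only mirrors `java -jar <jar> -t <file>`.
    """
    return [java_bin, "-jar", str(jar_path), "-t", str(document_path)]


def extract_text(jar_path, document_path, java_bin="java", timeout=60):
    """Run the Tika jar on one file and return whatever it prints to stdout.

    Raises FileNotFoundError if the document is missing, and
    subprocess.CalledProcessError if java exits with a non-zero status.
    """
    if not Path(document_path).is_file():
        raise FileNotFoundError(document_path)
    result = subprocess.run(
        build_tika_command(jar_path, document_path, java_bin),
        capture_output=True,
        timeout=timeout,  # assumed limit, so a stuck JVM does not hang the caller
        check=True,
    )
    # Assumption: the output is treated as UTF-8; undecodable bytes are replaced.
    return result.stdout.decode("utf-8", errors="replace")
```

Calling `extract_text("tika-0.3.jar", "lakhaniwolf.pdf")` would then run the same command as above and hand back the text as a string.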

Comment	File	Size	Author
#3	tika-attachments-449214-3.patch	39.49 KB	pwolanin

Comments

pwolanin commented 3 May 2009 at 03:16

pwolanin commented 3 May 2009 at 18:06

Also, it seems that Java 5 on Mac OS X 10.5 is missing some XML classes, so it cannot decode docx files.

Switching to Java 6 seems to resolve this issue, and also seems to prevent the MAMP bug.

pwolanin commented 3 May 2009 at 19:16

Status: Active » Fixed

Status	File	Size
new	tika-attachments-449214-3.patch	39.49 KB

Seems to be basically working. Here's the diff to the branch point showing new code committed to HEAD.

17 May 2009 at 19:20

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.