Tika can extract content from many types of documents and can run on Java 1.5+ (or even 1.4). See: http://lucene.apache.org/tika/formats.html

http://lucene.apache.org/tika/gettingstarted.html

The quickest way to get a pre-built Tika is probably a Solr nightly build, such as:

http://people.apache.org/builds/lucene/solr/nightly/solr-2009-04-29.zip

Look in apache-solr-nightly/example/solr/lib for the Tika jar.

For example, I can pull down a PDF file:

wget http://freesoftware.mit.edu/papers/lakhaniwolf.pdf

and extract the text with:

java -jar tika-0.3.jar -t lakhaniwolf.pdf
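
If it helps, the same call could be wrapped so that a caller only has to pass a filename and read the text from stdout. This is just a rough sketch, not something shipped with Tika or Solr: the function name extract_text and the TIKA_JAR and JAVA_BIN variables are made up for illustration, and it assumes tika-0.3.jar is in the current directory unless TIKA_JAR says otherwise.

```shell
#!/bin/sh
# Hypothetical helper: print the plain text of one document to stdout.
# TIKA_JAR and JAVA_BIN are assumed names chosen for this sketch.
TIKA_JAR="${TIKA_JAR:-tika-0.3.jar}"
JAVA_BIN="${JAVA_BIN:-java}"

extract_text() {
    # Expect exactly one argument: the file to extract.
    if [ "$#" -ne 1 ]; then
        echo "usage: extract_text FILE" >&2
        return 2
    fi
    # Fail early if the file is missing or unreadable.
    if [ ! -r "$1" ]; then
        echo "extract_text: cannot read $1" >&2
        return 1
    fi
    # Same invocation as the command above; the text goes to stdout.
    "$JAVA_BIN" -jar "$TIKA_JAR" -t "$1"
}
```

After sourcing the file, something like `extract_text lakhaniwolf.pdf > lakhaniwolf.txt` should behave the same as the direct command, with error messages kept on stderr so they do not end up mixed into the extracted text.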

Comments

Assigned: Unassigned » thl
Status: Active » Closed (fixed)

If I were able or willing to run Java on my server, I wouldn't need a PHP-based CMS like Drupal :-)

Anyway, as long as a command line can be created that takes a filename as an argument and writes the extracted text to stdout, it could serve as a helper application.

Thanks for the hint.