Tika can extract content from many types of documents and runs on Java 1.5+ (or even 1.4). See: http://lucene.apache.org/tika/formats.html
http://lucene.apache.org/tika/gettingstarted.html
The quickest way to get a pre-built Tika is probably a Solr nightly build, such as:
http://people.apache.org/builds/lucene/solr/nightly/solr-2009-04-29.zip
Look in apache-solr-nightly/example/solr/lib for the Tika jar.
For example, I can pull down a PDF file:
wget http://freesoftware.mit.edu/papers/lakhaniwolf.pdf
and extract its text with:
java -jar tika-0.3.jar -t lakhaniwolf.pdf
Comments
Comment #1
thl commented
If I were able or willing to run Java on my server, I wouldn't need a PHP-based CMS like Drupal :-)
Anyway, as long as a command line can be constructed that takes a filename as an argument and writes the extracted text to stdout, it could serve as a helper application.
Thanks for the hint.
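For what it's worth, the kind of helper described in the comment above (filename in, text on stdout) could be sketched as a small shell function around the Tika command from the original post. This is only a rough sketch: the function name extract_text and the TIKA_JAR variable are made up for illustration, and it assumes java is on the PATH and that the jar is the tika-0.3.jar used above.

```shell
#!/bin/sh
# Hypothetical helper: print the plain text of one document to stdout.
# TIKA_JAR is an assumed variable; it defaults to the jar named in the post.
TIKA_JAR="${TIKA_JAR:-tika-0.3.jar}"

extract_text() {
    # Require exactly one argument that is a readable file.
    if [ "$#" -ne 1 ] || [ ! -r "$1" ]; then
        echo "usage: extract_text FILE" >&2
        return 2
    fi
    # -t asks Tika for plain text, as in the command shown in the post.
    java -jar "$TIKA_JAR" -t "$1"
}
```

Calling extract_text lakhaniwolf.pdf would then write the extracted text to stdout, which is all a helper application needs to provide. A missing or unreadable file gives a usage message on stderr and exit status 2 instead of a Java stack trace.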