Have you considered tika? [#449210]

Tika can extract content from many types of documents and runs on Java 1.5+ (or even 1.4). See: http://lucene.apache.org/tika/formats.html

http://lucene.apache.org/tika/gettingstarted.html

The quickest way to get a pre-built Tika is likely a Solr nightly build, such as:

http://people.apache.org/builds/lucene/solr/nightly/solr-2009-04-29.zip

Look in apache-solr-nightly/example/solr/lib

For example, I can pull down a PDF file:

wget http://freesoftware.mit.edu/papers/lakhaniwolf.pdf

and extract the text as:

java -jar tika-0.3.jar -t lakhaniwolf.pdf

Comments

Comment #1

thl commented 22 August 2009 at 23:50

Assigned:	Unassigned	» thl
Status:	Active	» Closed (fixed)

If I were able or willing to run Java on my server, I wouldn't need a PHP-based CMS like Drupal :-)

Anyway, as long as a command line can be created that takes a filename as an argument and writes the extracted text to stdout, it could serve as a helper application.

Thanks for the hint.
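
A helper of the kind described above (filename in, extracted text on stdout) could be sketched as a small shell function around the Tika command from the original post. This is only a sketch under assumptions: the function name extract_text and the TIKA_JAR variable are made up for illustration, and the jar path defaults to the tika-0.3.jar used above.

```shell
#!/bin/sh
# Hypothetical helper: print the plain text of one document to stdout.
# TIKA_JAR is an assumed variable; point it at whichever Tika jar you have.
TIKA_JAR="${TIKA_JAR:-tika-0.3.jar}"

extract_text() {
    # Expect exactly one argument: the file to extract text from.
    if [ "$#" -ne 1 ]; then
        echo "usage: extract_text FILE" >&2
        return 2
    fi
    # Fail early if the file is missing or unreadable.
    if [ ! -r "$1" ]; then
        echo "extract_text: cannot read $1" >&2
        return 1
    fi
    # -t requests plain-text output, as in the command from the original post.
    java -jar "$TIKA_JAR" -t "$1"
}
```

Calling extract_text lakhaniwolf.pdf would then write the text to stdout, so the output can be piped or redirected like that of any other helper. Java still has to be available on the machine that runs it.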
