If extractOnly is true, one additional input parameter is available:
extractFormat=xml|text - Default is xml. Controls the serialization format of the extracted content. The xml format is actually XHTML, the same as passing the -x option to the tika command line application, while text is the same as passing the -t option.
See also https://issues.apache.org/jira/browse/SOLR-1274.
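To make the parameter concrete, here is a minimal sketch of how an extract-only request URL could be assembled. Only extractOnly and extractFormat come from the description above; the host, port, the /update/extract handler path, and the helper name build_extract_url are assumptions for illustration and will differ per installation. The sketch only builds the URL and does not contact Solr.

```python
from urllib.parse import urlencode


def build_extract_url(base_url, extract_format="xml"):
    """Build the URL for an extract-only request (hypothetical helper).

    base_url is assumed to point at the extracting request handler,
    e.g. http://localhost:8983/solr/update/extract on a local install.
    """
    if extract_format not in ("xml", "text"):
        raise ValueError("extractFormat must be 'xml' or 'text'")
    params = {
        "extractOnly": "true",            # return the extract instead of indexing it
        "extractFormat": extract_format,  # xml (XHTML, the default) or text
    }
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Assumed local Solr URL; adjust to match the actual handler path.
    print(build_extract_url("http://localhost:8983/solr/update/extract", "text"))
```

The file to be extracted would still have to be sent as the request body, which is left out here to keep the focus on the two parameters.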
I had planned to include this for the last few weeks, since I knew my patch had gotten into Solr, but I forgot in the excitement of getting this module working with Solr at all over the last few days.
It probably doesn't matter much since we are stripping out all tags anyway, but it should give even greater consistency between using Tika and using Solr.
Comments
Comment #1
pwolanin commented
Comment #2
pwolanin commented: committed