Just had my first PDF file indexed. Neat! But our national characters (åäö) are shown as question marks.
So I had two test files indexed. In the one with ISO-Latin-1 text the national characters were dropped; the one that is UTF-8 encoded shows correctly in a search.
Does this indicate that I should try a later version of Tika? I'm using tika-0.3 for extraction, as the README recommends.
/BoK
| Comment | File | Size | Author |
|---|---|---|---|
| #8 | force-utf8-622508-8.patch | 1.45 KB | pwolanin |
| #7 | force-utf8-622508-7.patch | 1.3 KB | pwolanin |
Comments
Comment #1
pwolanin CreditAttribution: pwolanin commented

I'm not sure where Tika gets the encoding information. You can try 0.4, but it might be an issue with the encoding specified in the document (or the lack thereof).
An alternative is to try using the extracting request handler, since we are sending the doc as form data (see the curl sketch below), and for form data:
Each part may be encoded and the "Content-Transfer-Encoding" header supplied if the value of that part does not conform to the default (7BIT) encoding (see [RFC2045], section 6)
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
see: http://drupal.org/node/490078
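For reference, exercising the extracting request handler by hand is just a multipart POST. A minimal curl sketch, assuming a handler mapped at /solr/update/extract and an arbitrary form-field name (adjust both to match your solrconfig.xml):

$ curl 'http://localhost:8983/solr/update/extract?extractOnly=true' -F "myfile=@test.pdf"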
Comment #2
cyberguyen CreditAttribution: cyberguyen commented

I don't think it's isolated to PDF, since I have the same problem with doc and docx files regarding the Danish characters æ ø å.
The UTF-8 content in the Drupal site is indexed without a problem, so I think it has something to do with Solr Attachments and the Tika output.
I'm running the following on a Windows 2003 server:
Comment #3
pwolanin CreditAttribution: pwolanin commented

Try extracting the documents with Tika on the command line and look at the output (a sketch follows below).
It could be an issue with Tika detecting the language or encoding. Trying locally, Tika 0.3 and Tika 0.4 report UTF-8 text files as Content-Encoding: ISO-8859-1.
Tika does not report any content encoding for my PDFs.
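To reproduce locally, run the Tika CLI directly. A minimal sketch, assuming a runnable jar named tika-0.4.jar (the jar name depends on your build, and the -m metadata flag is an assumption; check your CLI's usage output):

$ java -jar tika-0.4.jar -t test.txt   # extracted plain text
$ java -jar tika-0.4.jar -m test.txt   # metadata, including the detected Content-Encoding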
see:
https://issues.apache.org/jira/browse/TIKA-209
Also, look at the java source:
apache-tika-0.4/tika-core/src/main/java/org/apache/tika/detect/TypeDetector.java
I just found one interesting fact: extracting with -t ruins UTF-8 content, while extracting with -x as XHTML preserves it.
So this could be a bug in the text serializer.
As a workaround, try changing the code to use "-x" instead of "-t".
To compare the two modes on the same file:
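A side-by-side sketch (same assumed jar name as above):

$ java -jar tika-0.4.jar -t test.txt   # text output: UTF-8 characters come out mangled
$ java -jar tika-0.4.jar -x test.txt   # XHTML output: UTF-8 characters are preserved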
Comment #4
pwolanin CreditAttribution: pwolanin commented

Yes, this is a Tika bug for sure.
A patch for Tika 0.4 that fixes it is attached to https://issues.apache.org/jira/browse/TIKA-324.
Comment #5
pwolanin CreditAttribution: pwolanin commented

Important note: using the extracting request handler in Solr, the content comes out correctly already:
$ curl 'http://localhost:8983/solr/extract/tika?extractOnly=1' -F "test.txt=@test.txt"
{
"responseHeader":{
"status":0,
"QTime":2},
"test.txt":"Iñtërnâtiônàlizætiøn\n\n\n",
"test.txt_metadata":[
"stream_source_info",["test.txt"],
"stream_content_type",["text/plain"],
"stream_size",["28"],
"Content-Encoding",["UTF-8"],
"stream_name",["test.txt"],
"Content-Type",["text/plain"]]}
Comment #6
pwolanin CreditAttribution: pwolanin commented

Ok, see the last comments on the Tika issue: it's as much an issue with the way Java handles the default system encoding.
We can fix the problem in the module by passing
-Dfile.encoding=UTF8
to java as part of the command line, as in the sketch below.
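For example, the corrected invocation would look like this (jar and file names are placeholders):

$ java -Dfile.encoding=UTF8 -jar tika-0.4.jar -t test.pdf   # force the JVM default encoding so -t emits UTF-8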
Comment #7

pwolanin CreditAttribution: pwolanin commented

Comment #8
pwolanin CreditAttribution: pwolanin commented

Oops, working against the DEV version of apachesolr.module; one of the functions was removed.
A quick test looks good to me - committing this patch.