National characters in pdf-files
Bo Kleve - November 3, 2009 - 22:19
| Project: | Apache Solr Attachments |
| Version: | 6.x-2.0-alpha1 |
| Component: | Code |
| Category: | bug report |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | closed |
Description
Just had my first pdf-file indexed. Neat! But our national characters (åäö) are shown as question marks.
So I had two test files indexed. In the one with text encoded as ISO-Latin-1, the national characters were dropped. The one encoded as UTF-8 shows up correctly in a search.
Does this indicate that I should try a later version of Tika? I'm using tika-0.3 for extraction, as given in the README.
/BoK
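
For illustration, here is a minimal Python sketch of two generic ways non-ASCII characters can get lost in a pipeline. It is not a diagnosis of this report: the function names and the sample word "blåbär" are made up for the example, and the real extraction is done by Tika, not Python.

```python
def encode_lossy(text: str, charset: str = "ascii") -> str:
    """Round-trip text through a charset that may not cover every character.

    Characters the charset cannot represent are replaced with '?'.
    """
    return text.encode(charset, errors="replace").decode(charset)


def misread_latin1_as_utf8(text: str) -> str:
    """Encode as ISO-8859-1, then decode as UTF-8, skipping invalid bytes.

    The single-byte ISO-8859-1 codes for å, ä, ö are not valid UTF-8
    sequences on their own, so a lenient decoder drops them.
    """
    return text.encode("iso-8859-1").decode("utf-8", errors="ignore")


if __name__ == "__main__":
    print(encode_lossy("åäö"))               # characters replaced
    print(misread_latin1_as_utf8("blåbär"))  # characters dropped
```

The first case yields replacement characters, the second silently drops characters, which is roughly the difference between the two symptoms described above.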

#1
I'm not sure where Tika gets the encoding information. You can try 0.4, but it might be an issue with the encoding specified in the document (or the lack thereof).
An alternative is to try the extracting request handler, since we are sending the doc as form data, and for form data the HTML spec says:
Each part may be encoded and the "Content-Transfer-Encoding" header supplied if the value of that part does not conform to the default (7BIT) encoding (see [RFC2045], section 6)
http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1
see: http://drupal.org/node/490078
#2
I don't think it's isolated to pdf, since I have the same problem with doc and docx files regarding the Danish characters æ ø å
The UTF-8 content in the Drupal site is indexed without a problem, so I think it has something to do with Solr Attachments and the Tika output.
I'm running the following on Windows 2003 server:
#3
Try extracting the documents with Tika on the command line and look at the output.
It could be an issue with Tika detecting the language or encoding. Trying locally, Tika 0.3 and Tika 0.4 report UTF-8 text files as Content-Encoding: ISO-8859-1.
It does not report any content encoding for my PDFs.
see:
https://issues.apache.org/jira/browse/TIKA-209
Also, look at the java source:
apache-tika-0.4/tika-core/src/main/java/org/apache/tika/detect/TypeDetector.java
I just found one interesting fact: extracting with -t ruins UTF-8 content, while extracting with -x as XHTML preserves it.
So this could be a bug in the text serializer.
As a workaround, try changing the code to use "-x" instead of "-t".
compare:
$ java -jar tika-app-0.4.jar -t ./test.txt
I?t?rn?ti?n?liz?ti?n
$ java -jar tika-app-0.4.jar -x ./test.txt
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
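
If the -x workaround is used, the XHTML has to be reduced back to plain text before indexing. Here is a minimal Python sketch of that step, assuming output shaped like the sample above. The name xhtml_to_text is hypothetical, and this is not the module's code (the module is PHP); it only shows the idea.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_text(xhtml_bytes: bytes) -> str:
    """Return the text inside <body>, one non-empty line per line of output."""
    # Parsing bytes lets the XML parser honour the encoding declaration itself.
    root = ET.fromstring(xhtml_bytes)
    body = root.find(f"{XHTML_NS}body")
    if body is None:
        return ""
    raw = "".join(body.itertext())
    lines = (line.strip() for line in raw.splitlines())
    return "\n".join(line for line in lines if line)


# Sample copied from the -x output above, as UTF-8 bytes.
SAMPLE = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
""".encode("utf-8")

if __name__ == "__main__":
    print(xhtml_to_text(SAMPLE))
```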
#4
Yes - this is a Tika bug for sure.
A patch for Tika 0.4 that fixes it is attached to this issue: https://issues.apache.org/jira/browse/TIKA-324
#5
Important note: using the extracting request handler in Solr, the content comes out correctly already:
$ curl 'http://localhost:8983/solr/extract/tika?extractFtest.txt=@test.txt"=1' -F "m
{
"responseHeader":{
"status":0,
"QTime":2},
"test.txt":"Iñtërnâtiônàlizætiøn\n\n\n",
"test.txt_metadata":[
"stream_source_info",["test.txt"],
"stream_content_type",["text/plain"],
"stream_size",["28"],
"Content-Encoding",["UTF-8"],
"stream_name",["test.txt"],
"Content-Type",["text/plain"]]}
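
The metadata in this response is a flat list of alternating names and value lists. A small Python sketch, assuming the response shape shown above, for reading the extracted text and the reported encoding (metadata_to_dict is a hypothetical helper, not part of Solr or the module):

```python
import json


def metadata_to_dict(flat):
    """Pair up a flat [name, values, name, values, ...] list into a dict."""
    return dict(zip(flat[0::2], flat[1::2]))


# Raw string so that the JSON escape \n reaches the JSON parser unchanged.
RESPONSE_TEXT = r"""
{
"responseHeader":{
"status":0,
"QTime":2},
"test.txt":"Iñtërnâtiônàlizætiøn\n\n\n",
"test.txt_metadata":[
"stream_source_info",["test.txt"],
"stream_content_type",["text/plain"],
"stream_size",["28"],
"Content-Encoding",["UTF-8"],
"stream_name",["test.txt"],
"Content-Type",["text/plain"]]}
"""

response = json.loads(RESPONSE_TEXT)
metadata = metadata_to_dict(response["test.txt_metadata"])

if __name__ == "__main__":
    print(response["test.txt"].strip())
    print(metadata["Content-Encoding"])
```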
#6
Ok, see the last comments on the Tika issue - it's as much an issue with the way Java handles the default system encoding.
We can fix the problem in the module by passing -Dfile.encoding=UTF8 to java as part of the command line.
#7
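
As a sketch of the fix proposed in #6: -Dfile.encoding is a JVM option, so it has to appear before -jar on the command line, otherwise it would be handed to Tika as a program argument. The Python below only illustrates the argument order. The module itself is PHP, and build_tika_command and extract_text are hypothetical names, not the committed patch.

```python
import subprocess
from typing import List


def build_tika_command(jar_path: str, document: str, mode: str = "-t") -> List[str]:
    """Build the java command line with the default encoding forced to UTF-8.

    JVM options such as -D go before -jar; everything after the jar path
    is passed to the Tika application itself.
    """
    return ["java", "-Dfile.encoding=UTF8", "-jar", jar_path, mode, document]


def extract_text(jar_path: str, document: str) -> str:
    """Run Tika and decode its standard output as UTF-8.

    Assumes java is on PATH and jar_path points at a Tika app jar.
    """
    result = subprocess.run(
        build_tika_command(jar_path, document),
        capture_output=True,
        check=True,
    )
    return result.stdout.decode("utf-8")


if __name__ == "__main__":
    print(" ".join(build_tika_command("tika-app-0.4.jar", "./test.txt")))
```

Passing the arguments as a list rather than one shell string also avoids quoting problems with file names that contain spaces or non-ASCII characters.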
#8
Oops - working against the DEV version of apachesolr.module, one of the functions was removed.
A quick test looks good to me - committing this patch.
#9
Automatically closed -- issue fixed for 2 weeks with no activity.