National characters in pdf-files [#622508]

Just had my first pdf-file indexed. Neat! But our national characters (åäö) are shown as question marks.

So I had two test files indexed. In the one with text as ISO-Latin-1 the national characters were dropped. The one that is UTF-8 coded shows correctly on a search.
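
One plausible mechanism for this difference, offered as an assumption rather than a diagnosis: if ISO-Latin-1 bytes are decoded somewhere in the pipeline as UTF-8, the bytes for å, ä and ö are not valid UTF-8 sequences, so a lenient decoder either drops them or substitutes a replacement character. A minimal Python sketch (the sample word and the function name `decode_as_utf8` are made up for illustration):

```python
def decode_as_utf8(raw: bytes, errors: str) -> str:
    """Decode bytes as UTF-8 using the given error policy."""
    return raw.decode("utf-8", errors=errors)


word = "smörgåsbord"  # hypothetical sample text

latin1_bytes = word.encode("iso-8859-1")  # one byte per character
utf8_bytes = word.encode("utf-8")         # two bytes for ö and å

# Latin-1 bytes read as UTF-8: the non-ASCII bytes are invalid sequences.
dropped = decode_as_utf8(latin1_bytes, "ignore")    # invalid bytes vanish
replaced = decode_as_utf8(latin1_bytes, "replace")  # invalid bytes become U+FFFD

# Genuine UTF-8 bytes decode cleanly.
intact = decode_as_utf8(utf8_bytes, "strict")

print(dropped)
print(replaced)
print(intact)
```

Whether the characters end up dropped or shown as a placeholder depends on the error policy of whichever component does the decoding.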

Does this indicate that I should try a later version of Tika? I'm using tika-0.3 for extraction, as the README suggests.

/BoK

Comment	File	Size	Author
#8	force-utf8-622508-8.patch	1.45 KB	pwolanin
#7	force-utf8-622508-7.patch	1.3 KB	pwolanin

Comments

Comment #1

pwolanin commented 6 November 2009 at 02:26

I'm not sure where Tika gets the encoding information. You can try 0.4, but it might be an issue with the encoding specified in the document (or the lack of one).

An alternative is to try the extracting request handler, since we send the doc as form data, and for form data the HTML spec says:

Each part may be encoded and the "Content-Transfer-Encoding" header supplied if the value of that part does not conform to the default (7BIT) encoding (see [RFC2045], section 6)

http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

see: http://drupal.org/node/490078

Comment #2

cyberguyen commented 15 November 2009 at 12:16

I don't think it's isolated to pdf, since I have the same problem with doc and docx files and the Danish characters æ ø å.

The UTF-8 content in the Drupal site is indexed without a problem, so I think it has something to do with Solr Attachments and the Tika output.

I'm running the following on Windows Server 2003:

Drupal 6.14

Apache/2.2.14(Win32) with mod_ssl

PHP 5.2.10

Apache Solr Attachments 6.x-2.0-alpha1

Tika 0.4

Solr Implementation Version: 1.4.0 833479

Comment #3

pwolanin commented 15 November 2009 at 17:19

Try extracting the documents with tika on the command line and look at the output.

It could be an issue with Tika detecting the language or encoding. Trying locally, Tika 0.3 and Tika 0.4 report UTF-8 text files as Content-Encoding: ISO-8859-1

It does not report any content encoding for my PDFs.
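
To see what that misreport would mean in practice, here is a small Python sketch, assuming a consumer trusts a reported Content-Encoding of ISO-8859-1 for bytes that are really UTF-8. Each non-ASCII character turns into two Latin-1 characters, and the text can be recovered only by re-encoding as ISO-8859-1 and decoding again as UTF-8:

```python
text = "Iñtërnâtiônàlizætiøn"

# What is actually on disk: UTF-8 bytes (7 of the 20 characters take two bytes).
utf8_bytes = text.encode("utf-8")

# What a consumer sees if it believes the ISO-8859-1 label.
misread = utf8_bytes.decode("iso-8859-1")

# ISO-8859-1 maps every byte to a character, so the mistake can be undone.
repaired = misread.encode("iso-8859-1").decode("utf-8")

print(len(text), len(misread))
print(repaired == text)
```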

see:

https://issues.apache.org/jira/browse/TIKA-209

Also, look at the java source:

apache-tika-0.4/tika-core/src/main/java/org/apache/tika/detect/TypeDetector.java

I just found one interesting fact: extracting with -t ruins UTF-8 content, while extracting with -x as XHTML preserves it.

So this could be a bug in the text serializer.

As a workaround, try changing the code to use "-x" instead of "-t".
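
If the calling code switches to -x, it then has to pull the plain text back out of the XHTML. A minimal Python sketch of that step, assuming well-formed XHTML shaped like Tika's output for a small text file. The function name `xhtml_to_text` is hypothetical, and this only illustrates the idea, not how the module does it:

```python
import xml.etree.ElementTree as ET

XHTML_NS = "{http://www.w3.org/1999/xhtml}"


def xhtml_to_text(xhtml: bytes) -> str:
    """Return the text content of the <body> element, without markup."""
    root = ET.fromstring(xhtml)  # honours the encoding in the XML declaration
    body = root.find(XHTML_NS + "body")
    if body is None:
        return ""
    return "".join(body.itertext()).strip()


# Sample input: XHTML of the kind printed by -x for a one-line text file.
sample = """<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
""".encode("utf-8")

print(xhtml_to_text(sample))
```

Passing bytes rather than an already-decoded string lets the XML parser apply the declared encoding itself.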

compare:

$ java -jar tika-app-0.4.jar -t ./test.txt 
I?t?rn?ti?n?liz?ti?n

$ java -jar tika-app-0.4.jar -x ./test.txt 
<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title/>
</head>
<body>
<p>Iñtërnâtiônàlizætiøn
</p>
</body>
</html>
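
The question marks in the -t output are what one would expect if the text were written through an encoder that cannot represent the accented characters and substitutes ? for each of them. That is an assumption about the cause, not something confirmed here. The Python sketch below only shows that such an encoder reproduces the same pattern:

```python
text = "Iñtërnâtiônàlizætiøn"

# Encode to a charset without the accented letters, replacing each
# unencodable character with "?" instead of raising an error.
lossy = text.encode("ascii", errors="replace").decode("ascii")

print(lossy)  # I?t?rn?ti?n?liz?ti?n
```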

Comment #4

pwolanin commented 15 November 2009 at 17:54

Yes - this is a tika bug for sure.

A patch for Tika 0.4 that fixes it is attached to this issue: https://issues.apache.org/jira/browse/TIKA-324

Comment #5

pwolanin commented 15 November 2009 at 18:14

Important note: using the extracting request handler in Solr, the content comes out correctly already:

$ curl 'http://localhost:8983/solr/extract/tika?extractFtest.txt=@test.txt"=1' -F "m
{
"responseHeader":{
"status":0,
"QTime":2},
"test.txt":"Iñtërnâtiônàlizætiøn\n\n\n",
"test.txt_metadata":[
"stream_source_info",["test.txt"],
"stream_content_type",["text/plain"],
"stream_size",["28"],
"Content-Encoding",["UTF-8"],
"stream_name",["test.txt"],
"Content-Type",["text/plain"]]}