Just had my first PDF file indexed. Neat! But our national characters (åäö) are shown as question marks.

So I had two test files indexed. In the one with text encoded as ISO-Latin-1, the national characters were dropped. The one that is UTF-8 encoded shows up correctly in a search.

Does this indicate that I should try a later version of Tika? I'm using tika-0.3 for extraction, as given in the README.

/BoK

Comments

pwolanin’s picture

I'm not sure where Tika gets the encoding information. You can try 0.4, but it might be an issue with the encoding specified in the document (or the lack thereof).

An alternative is to try using the extracting request handler, since we would be sending the document as form data, and for that the HTML spec says:

Each part may be encoded and the "Content-Transfer-Encoding" header supplied if the value of that part does not conform to the default (7BIT) encoding (see [RFC2045], section 6)

http://www.w3.org/TR/html401/interact/forms.html#h-17.13.4.1

see: http://drupal.org/node/490078

cyberguyen’s picture

I don't think it's isolated to PDF, since I have the same problem with doc and docx files regarding the Danish characters æ ø å.

The UTF-8 content in the Drupal site is indexed without a problem, so I think it has something to do with Solr Attachments and the Tika output.

I'm running the following on Windows 2003 Server:

  • Drupal 6.14
  • Apache/2.2.14(Win32) with mod_ssl
  • PHP 5.2.10
  • Apache Solr Attachments 6.x-2.0-alpha1
  • Tika 0.4
  • Solr Implementation Version: 1.4.0 833479

pwolanin’s picture

    Try extracting the documents with tika on the command line and look at the output.

    It could be an issue with Tika detecting the language or encoding. Trying locally, Tika 0.3 and Tika 0.4 report UTF-8 text files as Content-Encoding: ISO-8859-1.

    It does not report any content encoding for my PDFs.

    see:

    https://issues.apache.org/jira/browse/TIKA-209

    Also, look at the java source:

    apache-tika-0.4/tika-core/src/main/java/org/apache/tika/detect/TypeDetector.java

    I just found one interesting fact: extracting with -t ruins UTF-8 content, while extracting with -x as XHTML preserves it.

    So this could be a bug in the text serializer.

    As a workaround, try changing the code to use "-x" instead of "-t".

    compare:

    $ java -jar tika-app-0.4.jar -t ./test.txt 
    I?t?rn?ti?n?liz?ti?n
    
    $ java -jar tika-app-0.4.jar -x ./test.txt 
    <?xml version="1.0" encoding="UTF-8"?>
    <html xmlns="http://www.w3.org/1999/xhtml">
    <head>
    <title/>
    </head>
    <body>
    <p>Iñtërnâtiônàlizætiøn
    </p>
    </body>
    </html>
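
To rule out the test file itself as the source of the question marks, it can help to write it with explicit UTF-8 bytes and check its size before running Tika. This is only a sketch: the octal escapes and the byte check are assumptions about how such a test file could be made, not something taken from the module. The 28 bytes agree with the stream_size that Solr reports for test.txt further down in this thread.

```shell
#!/bin/sh
# Write "Iñtërnâtiônàlizætiøn" plus a newline as raw UTF-8 bytes.
# Octal escapes keep the script independent of the terminal's own encoding.
printf 'I\303\261t\303\253rn\303\242ti\303\264n\303\240liz\303\246ti\303\270n\n' > test.txt

# 13 ASCII letters + 7 two-byte characters + 1 newline = 28 bytes.
bytes=$(wc -c < test.txt | tr -d ' ')
echo "test.txt is $bytes bytes"

# Compare the two serializers only if java and the jar are present
# (jar name taken from the commands above).
if command -v java >/dev/null 2>&1 && [ -f tika-app-0.4.jar ]; then
  java -jar tika-app-0.4.jar -t ./test.txt
  java -jar tika-app-0.4.jar -x ./test.txt
fi
```

If the byte count is right and -t still prints question marks, the file is fine and the damage happens on output.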
    
    pwolanin’s picture

    Yes - this is a tika bug for sure.

    Attached to this issue https://issues.apache.org/jira/browse/TIKA-324 is a patch for tika 0.4 that fixes it.

    pwolanin’s picture

    Important note: using the extracting request handler in Solr, the content comes out correctly already:

    $ curl 'http://localhost:8983/solr/extract/tika?extractFtest.txt=@test.txt"=1' -F "m
    {
    "responseHeader":{
    "status":0,
    "QTime":2},
    "test.txt":"Iñtërnâtiônàlizætiøn\n\n\n",
    "test.txt_metadata":[
    "stream_source_info",["test.txt"],
    "stream_content_type",["text/plain"],
    "stream_size",["28"],
    "Content-Encoding",["UTF-8"],
    "stream_name",["test.txt"],
    "Content-Type",["text/plain"]]}

    pwolanin’s picture

    OK, see the last comments on the Tika issue - it's as much an issue with the way Java handles the default system encoding.

    We can fix the problem in the module by passing -Dfile.encoding=UTF8 to java as part of the command line.
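
A rough sketch of what that command line could look like when run by hand. The helper name tika_text_cmd and the TIKA_JAR variable are made up for illustration and are not what the module uses; only the -Dfile.encoding=UTF8 property and the -t option come from this thread.

```shell
#!/bin/sh
# TIKA_JAR is a hypothetical variable; it defaults to the jar name used earlier in the thread.
TIKA_JAR="${TIKA_JAR:-tika-app-0.4.jar}"

# Hypothetical helper: print the command line that forces the JVM
# default encoding to UTF-8 before Tika's text extraction (-t) runs.
tika_text_cmd() {
  printf 'java -Dfile.encoding=UTF8 -jar %s -t %s\n' "$TIKA_JAR" "$1"
}

cmd=$(tika_text_cmd ./test.txt)
echo "$cmd"

# Execute for real only when java, the jar, and the test file all exist.
if command -v java >/dev/null 2>&1 && [ -f "$TIKA_JAR" ] && [ -f ./test.txt ]; then
  java -Dfile.encoding=UTF8 -jar "$TIKA_JAR" -t ./test.txt
fi
```

The property has to come before -jar so that it is read as a JVM option rather than passed to Tika as an argument.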

    pwolanin’s picture

    Status: Active » Needs review
    FileSize
    1.3 KB
    pwolanin’s picture

    Status: Needs review » Fixed
    FileSize
    1.45 KB

    Oops - working against the DEV version of apachesolr.module, one of the functions was removed.

    A quick test looks good to me - committing this patch.

    Status: Fixed » Closed (fixed)

    Automatically closed -- issue fixed for 2 weeks with no activity.