The indexing process was failing when trying to index a document that contained some 4-byte UTF-8 characters.

Watchdog would show errors like this:

"PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value:"

The underlying issue is that MySQL doesn't support 4-byte UTF characters. See http://drupal.org/node/1314214

Attached is a patch to filter out all the 4-byte characters before inserting into the database.

Not necessarily the best way to deal with this issue. The other option though is that you can upgrade to MySQL to 5.5.3 and set the body field to "utf8mb4" encoding.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

milesw’s picture

After some testing we found the method for filtering out 4+ byte characters in the first patch is too inefficient. On a 10mb PDF it was taking 20+ minutes.

This patch just catches the database exception and logs it. The offending text gets sent to Solr just fine, but does not get cached to the db. Until we have a more performant cleaning method, this will keep things running.

kleinmp’s picture

The patch in #1 worked for me. The files that I was trying to index were too big, so I didn't attempt to use the original patch.

milesw’s picture

Status: Active » Needs review

Changing status.

morenstrat’s picture

Status: Needs review » Reviewed & tested by the community

Patch in #1 works for me, too.

dasha_v’s picture

It looks that patch #1 will excluding the whole file from being indexed, that is not ideal solution.
I'd propose to change type of the "body" field from longtext to longblog, that solved the issue for me.

db_query('ALTER TABLE {apachesolr_index_entities_file} MODIFY body longblob');

Any possible objections?

dasha_v’s picture

Status: Reviewed & tested by the community » Needs review
FileSize
1.52 KB

Alternative patch attached.

cinnamon’s picture

Status: Needs review » Reviewed & tested by the community

Patch from #6 works just fine.

That is: before cron would bail out on the apachesolr cronjob when there were 4-byte characters in a document, now it runs just fine.

milesw’s picture

@dasha_v: Good idea. I'll try and test soon.

FYI, patch #1 does not exclude the document from the search index, just the database cache.

MrJambi’s picture

I was having a similar PDO exception trying to index a PDF file:

"General error: 1366 Incorrect string value "\xF0\x9D . . ." for column 'body' . . ."

but in this case, it wasn't a 4-byte character that was the culprit. So far as I can tell, it was simply that neither of those first two bytes is valid UTF-8 (a UTF-8 one-byte character can only go up to 127, or 7F hex).

I didn't apply the patch in #6 but simply ran the suggestion in #5 directly in MySQL:

alter table apachesolr_index_entities_file modify body longblob;

Indexing now proceeds without a hitch. So confirming the idea works -- and thanks.

(Though I wonder if there's any downside -- I've read through http://dev.mysql.com/doc/refman/5.0/en/blob.html but can't make out the significance of the differences between longtext and longblog that are outlined.)

magtak’s picture

Issue summary: View changes

#6 works wonders. Please include it on master :)

Nick_vh’s picture

The patch did not apply. Redid it

Nick_vh’s picture

Status: Reviewed & tested by the community » Fixed

Committed

Nick_vh’s picture

Patch was valid, I was patching the wrong branch. doh

  • Nick_vh committed fe4c87c on 7.x-1.x authored by dasha_v
    Issue #1853836 by Nick_vh, dasha_v, milesw, drasgardian: Fixed failure...

  • Nick_vh committed 2eeb0a9 on master authored by dasha_v
    Issue #1853836 by Nick_vh, dasha_v, milesw, drasgardian: Fixed failure...

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.