The indexing process was failing when trying to index a document that contained some 4-byte UTF-8 characters.
Watchdog would show errors like this:
"PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value:"
The underlying issue is that MySQL's "utf8" character set stores at most three bytes per character, so it cannot hold 4-byte UTF-8 characters. See http://drupal.org/node/1314214
Attached is a patch to filter out all the 4-byte characters before inserting into the database.
This is not necessarily the best way to deal with the issue. The other option is to upgrade MySQL to 5.5.3 or later and set the body field to the "utf8mb4" encoding.
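For illustration only, the filtering idea can be sketched in Python (the attached patch is PHP and may work differently). Characters above U+FFFF are exactly the ones UTF-8 encodes in four bytes, so a single compiled regular expression can drop them in one pass. The function name `strip_four_byte_chars` is hypothetical.

```python
import re

# Code points U+10000..U+10FFFF are the ones UTF-8 encodes in four bytes.
_FOUR_BYTE = re.compile("[\U00010000-\U0010FFFF]")


def strip_four_byte_chars(text: str) -> str:
    """Return text with every character above U+FFFF removed."""
    return _FOUR_BYTE.sub("", text)


sample = "math \U0001D400 and emoji \U0001F600 end"
cleaned = strip_four_byte_chars(sample)

# Every remaining character fits in at most three UTF-8 bytes.
assert all(len(ch.encode("utf-8")) <= 3 for ch in cleaned)
```

Note that this discards the characters rather than replacing them, so the stored text no longer matches the source document exactly.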
Comment | File | Size | Author |
---|---|---|---|
#11 | body_field_longblob-1853836-11.patch | 1.68 KB | Nick_vh |
#6 | body_field_longblob-1853836-6.patch | 1.52 KB | dasha_v |
#1 | 1853836-1-apachesolr-attachments-4bytechars.patch | 1.09 KB | milesw |
| | apachesolr_attachments-4bytechar.patch | 940 bytes | drasgardian |
Comments
Comment #1
milesw commented: After some testing, we found that the method for filtering out 4+ byte characters in the first patch is too inefficient. On a 10 MB PDF it was taking 20+ minutes.
This patch just catches the database exception and logs it. The offending text gets sent to Solr just fine, but does not get cached to the db. Until we have a more performant cleaning method, this will keep things running.
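As a rough illustration of the catch-and-log approach (a Python sketch, not the PHP in the patch): the cache write is wrapped so that a failure is logged and indexing carries on. `write_cache`, `CacheWriteError`, and `cache_extracted_text` are hypothetical stand-ins; the real error in the report is a PDOException.

```python
import logging

logger = logging.getLogger("attachments_cache")


class CacheWriteError(Exception):
    """Stand-in for the database exception (a PDOException in the report)."""


def write_cache(store: dict, key: str, text: str) -> None:
    """Hypothetical cache write that rejects 4-byte characters, like a utf8 column."""
    if any(ord(ch) > 0xFFFF for ch in text):
        raise CacheWriteError("Incorrect string value")
    store[key] = text


def cache_extracted_text(store: dict, key: str, text: str) -> bool:
    """Try to cache the text; on failure, log it and carry on.

    Returns True if the text was cached, False otherwise.
    """
    try:
        write_cache(store, key, text)
        return True
    except CacheWriteError as exc:
        logger.warning("Could not cache %s: %s", key, exc)
        return False
```

The caller can still send the text to the search index whether or not caching succeeded; the trade-off is that uncached files are extracted again on the next run.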
Comment #2
kleinmp commented: The patch in #1 worked for me. The files that I was trying to index were too big, so I didn't attempt to use the original patch.
Comment #3
milesw commented: Changing status.
Comment #4
morenstrat commented: Patch in #1 works for me, too.
Comment #5
dasha_v commented: It looks like patch #1 will exclude the whole file from being indexed, which is not an ideal solution.
I'd propose changing the type of the "body" field from longtext to longblob; that solved the issue for me.
Any possible objections?
Comment #6
dasha_v commented: Alternative patch attached.
Comment #7
cinnamon commented: Patch from #6 works just fine.
That is: previously cron would bail out on the apachesolr cron job when a document contained 4-byte characters; now it runs just fine.
Comment #8
milesw commented: @dasha_v: Good idea. I'll try to test soon.
FYI, patch #1 does not exclude the document from the search index, just the database cache.
Comment #9
MrJambi commented: I was having a similar PDO exception trying to index a PDF file:
"General error: 1366 Incorrect string value "\xF0\x9D . . ." for column 'body' . . ."
but in this case, it wasn't a 4-byte character that was the culprit. So far as I can tell, it was simply that neither of those first two bytes is valid UTF-8 (a UTF-8 one-byte character can only go up to 127, or 7F hex).
I didn't apply the patch in #6 but simply ran the suggestion in #5 directly in MySQL:
alter table apachesolr_index_entities_file modify body longblob;
Indexing now proceeds without a hitch. So confirming the idea works -- and thanks.
(Though I wonder if there's any downside -- I've read through http://dev.mysql.com/doc/refman/5.0/en/blob.html but can't make out the significance of the differences between longtext and longblob that are outlined.)
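One way to picture the difference, as a loose Python analogy rather than a description of MySQL internals: a blob column holds raw bytes with no character set or collation, so any byte sequence is accepted, but comparisons are made byte by byte. A text column has a character set and collation, which is what allows case-insensitive matching. The helper names below are hypothetical, and `casefold()` only approximates a case-insensitive collation.

```python
def to_blob(text: str) -> bytes:
    """Encode text as UTF-8 bytes, as an application might before a blob write."""
    return text.encode("utf-8")


def from_blob(raw: bytes) -> str:
    """Decode bytes read back from a blob; the application owns this step."""
    return raw.decode("utf-8")


def text_equal_ci(a: str, b: str) -> bool:
    """Approximate a case-insensitive text collation with casefold()."""
    return a.casefold() == b.casefold()


body = "Solr \U0001D400"

# 4-byte characters survive a round trip through raw bytes.
assert from_blob(to_blob(body)) == body

# Text comparison can ignore case; byte comparison cannot.
assert text_equal_ci("Solr", "solr")
assert to_blob("Solr") != to_blob("solr")
```

For a column that is only used as a cache of extracted text, the loss of collation-aware comparison may not matter, but that depends on how the column is queried.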
Comment #10
magtak commented: #6 works wonders. Please include it on master :)
Comment #11
Nick_vh commented: The patch did not apply. Redid it.
Comment #12
Nick_vh commented: Committed.
Comment #13
Nick_vh commented: Patch was valid; I was patching the wrong branch. Doh.