Comments

kid_icarus’s picture

morbus iff’s picture

This is a bit of a known issue, though never really publicized.
There was some attempt to fix a similar issue long ago on line 82+:

http://drupalcode.org/project/bot.git/blob/refs/heads/7.x-1.x:/bot.smart...

But it didn't catch all of them, of course. I don't currently have a fix in mind.

morbus iff’s picture

For fire, the error I'm seeing is:

PDOException: SQLSTATE[HY000]: General error: 1366 Incorrect string value: '\xF0\x9F\x94\xA5' for column 'message' at row 1: INSERT INTO {bot_log} (type, timestamp, channel, nick, message) VALUES (:db_insert_placeholder_0, :db_insert_placeholder_1, :db_insert_placeholder_2, :db_insert_placeholder_3, :db_insert_placeholder_4); Array ( [:db_insert_placeholder_0] => 2 [:db_insert_placeholder_1] => 1357946360 [:db_insert_placeholder_2] => #drupal-colorado [:db_insert_placeholder_3] => BLadwin [:db_insert_placeholder_4] =>

morbus iff’s picture

Based on http://stackoverflow.com/questions/11936950/inserting-utf-8-encoded-stri...,
it seems that the version of MySQL that the bot server is running is at fault.

dave reid’s picture

Converting the tables and database that run the bot to utf8 charset and utf8_general_ci collation should fix it.

morbus iff’s picture

Dave: The tables and DB already are UTF8, as far as I can tell. SELECT character_set_name FROM information_schema.`COLUMNS` C WHERE table_schema = "druplicon_drupal" AND table_name = "bot_log" AND column_name = "message";" and SHOW CREATE TABLE bot_log both indicate utf8.

Why don't you think it's the four-byte-character problem above?

kid_icarus’s picture

Just to test, on a vanilla drupal 7 install with bot module, I tested the bot_log module and simply threw in a panda face into the logged channel. My bot crashed.

Then I verified that the db and tables were using the utf8 charset and utf8_general_ci collation.

I then ran the following query to convert the db and tables to 4-byte utf8:

ALTER DATABASE testdb CHARACTER SET = utf8mb4 COLLATE utf8mb4_unicode_ci;
ALTER TABLE bot_log CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

I then restarted the bot, inserted many panda faces, and the bot didn't crash.

I'm not sure what we can do to check if input is outside of the 3-byte utf8 range. The utf8mb4 character set is unfortunately only available to MySQL >= 5.5

kid_icarus’s picture

This is an interesting article for context: http://mathiasbynens.be/notes/mysql-utf8mb4

morbus iff’s picture

This is almost beginning to sound like a default worthy of core.

morbus iff’s picture

Covered in #1314214, it seems.

Drave Robber’s picture

Would brutally removing all astral plane characters - $data->message = preg_replace('/[\x{10000}-\x{10FFFF}]/u', '_', $data->message); - solve this?

(replacing with underscore instead of removing in order not to end up with empty messages)

On a second thought, this probably should be configurable - "turn this on if you cannot make your db utf8mb4 but still don't want wannabe vandals to take down your bot".

Drave Robber’s picture

Status: Active » Needs review
StatusFileSize
new2.22 KB

A patch to that effect attached.

Works for me™; whether the default should be ON or OFF is a potential matter of discussion.

sreynen’s picture

Status: Needs review » Needs work

#1314214: MySQL driver does not support full UTF-8 (emojis, asian symbols, mathematical symbols) looks like it will add a setting to change the collation from utf8 to utf8mb4, but we'll still have databases using utf8, so bot.module needs to deal with that. However, this doesn't seem like something that needs to be exposed to end users. Couldn't we just check the collation on the column and strip 4-byte characters automatically where we know they won't save?

Drave Robber’s picture

Makes sense.
I will need to read into PDO docs to figure out how to accomplish this.

kid_icarus’s picture

Status: Needs work » Needs review
StatusFileSize
new1.27 KB

So this seems to work, but it's limited to MySQL. I don't think all this is doable in PDO. Any thoughts?

kid_icarus’s picture

Whoops, one line was in the wrong place.

kid_icarus’s picture

That regex wasn't working, but this new one is.

See http://stackoverflow.com/a/10568589, essentially we're replacing all "Private Use Area" unicode characters. The original regex ended up replacing just about every character with '_'.

bgm’s picture

I also ran into this problem while writing a small module that allows me to post a new node from IRC.

Solved it by adding a try/catch statement around my function, ex:

  try {
     // main code here
  }
  catch (Exception $e) {
    bot_message($to, 'The operation was interrupted. See watchdog for more information.');
    watchdog('bot_post', 'An exception was caught, likely an SQL query problem, see: do#1886646. Error: !error', array('!error' => $e->getMessage()));
  }

This bug affects bot_factoid, bot_karma and bot_tell as well.

Drave Robber’s picture

Cross-linking for reference - Twitter also meets this issue: #1910376: SQL error when importing tweet with emoji

florisg’s picture

Confirmed #17 solves the issue.
php smartIRC does crash on any character above FFFD (in theory it could be FFFF but the last two are not defined.

So i replaced at line 107 of bot.smartirc.inc:

          $data->message = preg_replace('/[\x{FFFD}-\x{1F5FF}]/u', '_', $data->message);
          $data->rawmessage = preg_replace('/[\x{FFFD}-\x{1F5FF}]/u', '_', $data->rawmessage);

Tested and it works flawlesly
Great work kid_icarus for finding the solution
Cudo's to Morbus Iff

markhalliwell’s picture

Referencing https://drupal.org/node/2043439#comment-8213973:

function remove_emoji($text){
  return preg_replace('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F6FF}][\x{FE00}-\x{FEFF}]?/u', '', $text);
}

We should probably rename this to _bot_db_sanitize or something and wrap all the sub-modules that process the raw text before sending to the DB.

edit: The reason this is longer is because it includes most of the emojicon unicodes, not just a select few.

markhalliwell’s picture

Several of the sub-modules need this fix before it's saved in the DB.

Posting on behalf of Jeremy.

pwolanin’s picture

Just ran into this as well - Emoji seem to cause Druplicon to leave the channel (crash)

cafuego’s picture

Status: Needs review » Reviewed & tested by the community

Patch #23 works for me! :-)

a.ross’s picture

Err, yeah. I'm really sorry for crashing you today, Druplicon! Please do accept my apologies.

a.ross’s picture

+++ b/bot.module
@@ -435,3 +435,10 @@ function bot_settings() {
+function bot_strip_unwanted_characters($message) {
+  return preg_replace('/([0-9|#][\x{20E3}])|[\x{00ae}|\x{00a9}|\x{203C}|\x{2047}|\x{2048}|\x{2049}|\x{3030}|\x{303D}|\x{2139}|\x{2122}|\x{3297}|\x{3299}][\x{FE00}-\x{FEFF}]?|[\x{2190}-\x{21FF}][\x{FE00}-\x{FEFF}]?|[\x{2300}-\x{23FF}][\x{FE00}-\x{FEFF}]?|[\x{2460}-\x{24FF}][\x{FE00}-\x{FEFF}]?|[\x{25A0}-\x{25FF}][\x{FE00}-\x{FEFF}]?|[\x{2600}-\x{27BF}][\x{FE00}-\x{FEFF}]?|[\x{2900}-\x{297F}][\x{FE00}-\x{FEFF}]?|[\x{2B00}-\x{2BF0}][\x{FE00}-\x{FEFF}]?|[\x{1F000}-\x{1F6FF}][\x{FE00}-\x{FEFF}]?/u', '_', $message);
+}

Hmm, what if we add a module_exists('transliteration') here, and use that if it exists?

joelpittet’s picture

How do you get it to come back after it's killed? :( Sorry druplicon!

#mayo

cafuego’s picture

Someone with access needs to restart it on whatever host it runs on, I guess. Webmasters queue?

ciss’s picture

Shouldn't this be handled in bot_smartirc_wrapper::invoke()?

dave reid’s picture

Title: Fire kills bot » Fire (Non UTF8 characters) kills bot
dave reid’s picture

Priority: Normal » Major

Bumping priority.

morbus iff’s picture

See also https://www.drupal.org/node/2488180

At this point, I'm not interested in stripping characters - I'd rather the underlying MySQL drivers and database support the characters.

heddn’s picture

@Morbus Iff, any chance we can get #2488180: Support full UTF-8 (emojis, Asian symbols, mathematical symbols) on MySQL and other database drivers when they are configured to allow it applied to the bot infrastructure so we can get it RTBCed? Or #23 applied? It just went down (and restarted itself again).

rfay’s picture

And of course now, since Drupal 7.50, the bot Drupal instance "just" needs to be updated to 7.50 and the instructions over in Multi-byte UTF-8 support in Drupal 7 have to be completed. But YAY that the underlying issue has finally been solved.

ricardoamaro’s picture

note: to review

morbus iff’s picture

Status: Reviewed & tested by the community » Closed (won't fix)

Closing, as it's been fixed upstream in Drupal core.