Hi,
I added CJK diff support to l10n_server. Here is my code.
(I attached a test file and a screenshot.)

Strings used to TEST:

Your server has been successfully tested to support this feature.
你的伺服器已經成功地通過測試,可以支援此功能。
サーバがこの機能をサポートすることが正常に検証されました。
서버가 해당 기능을 지원하는 것이 정상적으로 검증되었습니다.

CHANGED Strings:

Your CHANGE has been successfully tested to support this feature.
你的CHANGE已經中文地通過測試。可以支援此功能。
サーCHANGEこの日本語をサポートすることが正常に検証されました。
서버CHANGE당기한국어지원하는 것이 정상적으로 검증되었습니다.

Added the following ranges to the split characters:

-\u3000-\u303F-\u3040-\u309F-\u30A0-\u30FF-\u4E00-\u9FFF-\uAC00-\uD7AF-\uF900-\uFAFF-\uFF00-\uFFEF

Only these blocks are included:

CJK Symbols and Punctuation
3000-303F
Hiragana (Japanese)
3040-309F
Katakana (Japanese)
30A0-30FF
CJK Unified Ideographs
4E00-9FFF
Hangul Syllables (Japanese)
AC00-D7AF
CJK Compatibility Ideographs
F900-FAFF
Halfwidth and Fullwidth Forms
FF00-FFEF
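
As a rough illustration of how the blocks listed above could be turned into a regex character class, here is a minimal JavaScript sketch. The names `CJK_BLOCKS`, `toEscape`, `buildCharClass` and `cjkChar` are made up for this example and are not taken from the attached patch.

```javascript
// Hypothetical table of the Unicode blocks listed above (name, start, end).
const CJK_BLOCKS = [
  ['CJK Symbols and Punctuation', 0x3000, 0x303F],
  ['Hiragana', 0x3040, 0x309F],
  ['Katakana', 0x30A0, 0x30FF],
  ['CJK Unified Ideographs', 0x4E00, 0x9FFF],
  ['Hangul Syllables', 0xAC00, 0xD7AF],
  ['CJK Compatibility Ideographs', 0xF900, 0xFAFF],
  ['Halfwidth and Fullwidth Forms', 0xFF00, 0xFFEF],
];

// Format a code point as a \uXXXX escape for use in a regex source string.
function toEscape(codePoint) {
  return '\\u' + codePoint.toString(16).toUpperCase().padStart(4, '0');
}

// Join all blocks into one character class: [\u3000-\u303F\u3040-\u309F...].
function buildCharClass(blocks) {
  const ranges = blocks.map(([, lo, hi]) => toEscape(lo) + '-' + toEscape(hi));
  return '[' + ranges.join('') + ']';
}

const cjkChar = new RegExp(buildCharClass(CJK_BLOCKS));

console.log(buildCharClass(CJK_BLOCKS));
// One character from each test string, plus a Latin letter for contrast.
console.log(cjkChar.test('伺'), cjkChar.test('サ'), cjkChar.test('서'), cjkChar.test('a'));
// true true true false
```

Keeping the ranges in a table like this makes it easier to review which blocks are covered, at the cost of building the regex at runtime.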

ref:
http://triggertek.com/r/unicode/

Comments

droplet’s picture

Status  Size
new     83.13 KB
new     7.33 KB

Attaching the files.

gábor hojtsy’s picture

Status: Active » Needs review

This looks great. Any feedback from people actually speaking any of these languages? Would we need anything else?

droplet’s picture

Correcting my remark above:
Hangul Syllables is the Korean alphabet, not Japanese.

And this block may be missing:
3130-318F Hangul Compatibility Jamo
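
A quick way to see the gap is to test one compatibility jamo against the class with and without the extra range. This is only a sketch: the two regexes below are assembled from the ranges quoted in this issue, and the variable names are invented for the example.

```javascript
// Ranges from the first patch, without and with Hangul Compatibility Jamo.
const original = /[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF\uFF00-\uFFEF]/;
const withJamo = /[\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u3130-\u318F\u4E00-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF\uFF00-\uFFEF]/;

const jamo = '\u3131';     // ㄱ, HANGUL LETTER KIYEOK (a compatibility jamo)
const syllable = '\uAC00'; // 가, a precomposed Hangul syllable

console.log(original.test(jamo));     // false: falls between the covered blocks
console.log(withJamo.test(jamo));     // true
console.log(original.test(syllable)); // true: syllables were already covered
```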

Another good reference:
http://unicode.org/charts/

(I speak Chinese and English, and know a bit of Japanese.)

droplet’s picture

OK. We can take some code from the patch in #768040: truncate_utf8() only works for latin languages (and drupal_substr has a bug) once it has been reviewed and committed to HEAD.

+define('PREG_CLASS_UNICODE_WORD_BOUNDARY',
+  '\x{0}-\x{2F}\x{3A}-\x{40}\x{5B}-\x{60}\x{7B}-\x{A9}\x{AB}-\x{B1}\x{B4}' .
+  '\x{B6}-\x{B8}\x{BB}\x{BF}\x{D7}\x{F7}\x{2C2}-\x{2C5}\x{2D2}-\x{2DF}' .
+  '\x{2E5}-\x{2EB}\x{2ED}\x{2EF}-\x{2FF}\x{375}\x{37E}-\x{385}\x{387}\x{3F6}' .
+  '\x{482}\x{55A}-\x{55F}\x{589}-\x{58A}\x{5BE}\x{5C0}\x{5C3}\x{5C6}' .
+  '\x{5F3}-\x{60F}\x{61B}-\x{61F}\x{66A}-\x{66D}\x{6D4}\x{6DD}\x{6E9}' .
+  '\x{6FD}-\x{6FE}\x{700}-\x{70F}\x{7F6}-\x{7F9}\x{830}-\x{83E}' .
+  '\x{964}-\x{965}\x{970}\x{9F2}-\x{9F3}\x{9FA}-\x{9FB}\x{AF1}\x{B70}' .
+  '\x{BF3}-\x{BFA}\x{C7F}\x{CF1}-\x{CF2}\x{D79}\x{DF4}\x{E3F}\x{E4F}' .
+  '\x{E5A}-\x{E5B}\x{F01}-\x{F17}\x{F1A}-\x{F1F}\x{F34}\x{F36}\x{F38}' .
+  '\x{F3A}-\x{F3D}\x{F85}\x{FBE}-\x{FC5}\x{FC7}-\x{FD8}\x{104A}-\x{104F}' .
+  '\x{109E}-\x{109F}\x{10FB}\x{1360}-\x{1368}\x{1390}-\x{1399}\x{1400}' .
+  '\x{166D}-\x{166E}\x{1680}\x{169B}-\x{169C}\x{16EB}-\x{16ED}' .
+  '\x{1735}-\x{1736}\x{17B4}-\x{17B5}\x{17D4}-\x{17D6}\x{17D8}-\x{17DB}' .
+  '\x{1800}-\x{180A}\x{180E}\x{1940}-\x{1945}\x{19DE}-\x{19FF}' .
+  '\x{1A1E}-\x{1A1F}\x{1AA0}-\x{1AA6}\x{1AA8}-\x{1AAD}\x{1B5A}-\x{1B6A}' .
+  '\x{1B74}-\x{1B7C}\x{1C3B}-\x{1C3F}\x{1C7E}-\x{1C7F}\x{1CD3}\x{1FBD}' .
+  '\x{1FBF}-\x{1FC1}\x{1FCD}-\x{1FCF}\x{1FDD}-\x{1FDF}\x{1FED}-\x{1FEF}' .
+  '\x{1FFD}-\x{206F}\x{207A}-\x{207E}\x{208A}-\x{208E}\x{20A0}-\x{20B8}' .
+  '\x{2100}-\x{2101}\x{2103}-\x{2106}\x{2108}-\x{2109}\x{2114}' .
+  '\x{2116}-\x{2118}\x{211E}-\x{2123}\x{2125}\x{2127}\x{2129}\x{212E}' .
+  '\x{213A}-\x{213B}\x{2140}-\x{2144}\x{214A}-\x{214D}\x{214F}' .
+  '\x{2190}-\x{244A}\x{249C}-\x{24E9}\x{2500}-\x{2775}\x{2794}-\x{2B59}' .
+  '\x{2CE5}-\x{2CEA}\x{2CF9}-\x{2CFC}\x{2CFE}-\x{2CFF}\x{2E00}-\x{2E2E}' .
+  '\x{2E30}-\x{3004}\x{3008}-\x{3020}\x{3030}\x{3036}-\x{3037}' .
+  '\x{303D}-\x{303F}\x{309B}-\x{309C}\x{30A0}\x{30FB}\x{3190}-\x{3191}' .
+  '\x{3196}-\x{319F}\x{31C0}-\x{31E3}\x{3200}-\x{321E}\x{322A}-\x{3250}' .
+  '\x{3260}-\x{327F}\x{328A}-\x{32B0}\x{32C0}-\x{33FF}\x{4DC0}-\x{4DFF}' .
+  '\x{A490}-\x{A4C6}\x{A4FE}-\x{A4FF}\x{A60D}-\x{A60F}\x{A673}\x{A67E}' .
+  '\x{A6F2}-\x{A716}\x{A720}-\x{A721}\x{A789}-\x{A78A}\x{A828}-\x{A82B}' .
+  '\x{A836}-\x{A839}\x{A874}-\x{A877}\x{A8CE}-\x{A8CF}\x{A8F8}-\x{A8FA}' .
+  '\x{A92E}-\x{A92F}\x{A95F}\x{A9C1}-\x{A9CD}\x{A9DE}-\x{A9DF}' .
+  '\x{AA5C}-\x{AA5F}\x{AA77}-\x{AA79}\x{AADE}-\x{AADF}\x{ABEB}' .
+  '\x{D800}-\x{F8FF}\x{FB29}\x{FD3E}-\x{FD3F}\x{FDFC}-\x{FDFD}' .
+  '\x{FE10}-\x{FE19}\x{FE30}-\x{FE6B}\x{FEFF}-\x{FF0F}\x{FF1A}-\x{FF20}' .
+  '\x{FF3B}-\x{FF40}\x{FF5B}-\x{FF65}\x{FFE0}-\x{FFFD}');
+
+/**

gábor hojtsy’s picture

I'd second @droplet's call for reviews. I'm still not knowledgeable in CJK languages, so looking for people who actually have experience there.

Antoine Lafontaine’s picture

@droplet
I guess that "where the changes were done" was arbitrary, right? The changes are overlapping words and articles... but I guess this is not the point of this test.

If so, I can confirm that the changes in the Japanese string are well identified. (from the screen capture)

gábor hojtsy’s picture

@droplet: so what do you suggest given @Antoine's verification? Commit as it is or wait to integrate more improvements from the core patch?

droplet’s picture

I'm getting a headache after reading more Unicode docs.

Comparing l10n_server with that core patch: l10n_server may include more (non-CJK) ranges than the core one. I don't really know which one is better. For CJK, the following are safe to add. They may not cover all CJK characters, but they are enough for common usage.

CJK Symbols and Punctuation
3000-303F
Hiragana (Japanese)
3040-309F
Katakana (Japanese)
30A0-30FF
CJK Unified Ideographs
4E00-9FFF
Hangul Syllables (Korean)
AC00-D7AF
CJK Compatibility Ideographs
F900-FAFF
Halfwidth and Fullwidth Forms
FF00-FFEF
Hangul Compatibility Jamo
3130-318F

\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF\uFF00-\uFFEF\u3130-\u318F

Also, anyone can review this patch by copying the test strings from #1. The expected behavior:
Latin words are split on spaces and punctuation.
CJK text is split into individual characters.
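
The two rules can be sketched as a small tokenizer. This is an illustration only, not the code in the patch (which extends the `nonWord` pattern of the existing word diff); `CJK`, `TOKEN` and `tokenize` are hypothetical names, and treating ASCII punctuation as the only Latin separators is an assumption made to keep the example short.

```javascript
// CJK ranges copied from the list above, as regex source text.
const CJK = '\\u3000-\\u303F\\u3040-\\u309F\\u30A0-\\u30FF\\u4E00-\\u9FFF' +
            '\\uAC00-\\uD7AF\\uF900-\\uFAFF\\uFF00-\\uFFEF\\u3130-\\u318F';

// A token is, in order of preference:
//   1. a single CJK character,
//   2. a run of characters that are neither CJK, whitespace nor ASCII punctuation,
//   3. any other single character (a space or a punctuation mark).
const TOKEN = new RegExp(
  '[' + CJK + ']|[^\\s' + CJK + '!-/:-@\\[-`{-~]+|[\\s\\S]', 'g');

function tokenize(text) {
  return text.match(TOKEN) || [];
}

console.log(tokenize('Your server, 你的伺服器'));
// Latin words stay whole, separators and CJK characters become single tokens:
// 'Your', ' ', 'server', ',', ' ', '你', '的', '伺', '服', '器'
console.log(tokenize('サーCHANGEこの'));
// 'サ', 'ー', 'CHANGE', 'こ', 'の'
```

Splitting CJK text per character means a diff can highlight a single changed ideograph instead of marking a whole unspaced sentence as changed.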

I will test it once more in the next few days.

dokumori’s picture

@droplet: Thanks for your effort on this. I looked at the charts 'CJK Unified Ideographs (Han)', 'CJK Compatibility Ideographs', 'Hiragana' and 'Katakana' that were available at http://unicode.org/ and can confirm you've covered (probably) all the characters we use in Japanese.

droplet’s picture

Status  Size
new     8.77 KB
new     7.53 KB

This is a new patch with the same results as above. I also added some JS to better detect input when typing with an IME. I need a jQuery expert to improve my JS code.

Thanks!

kkaefer’s picture

cjk2.patch shouldn't duplicate the function. Instead, store the function in a variable and attach it to both events. Unfortunately, I can't comment much on whether this will work because I'm not very familiar with CJK languages. My understanding is that there are very few split characters since there are no spaces between most characters. Does it actually make sense to only use the space/comma characters for CJK languages?
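
A minimal sketch of that suggestion, using the standard `EventTarget` API so it runs without a page. The event names (`keyup`, `input`), the `target` object standing in for the textarea, and the `updateDiff` name are assumptions; the thread does not show which events cjk2.patch binds.

```javascript
// `target` stands in for the translation textarea.
const target = new EventTarget();
let calls = 0;

// Define the handler once and keep a reference to it...
const updateDiff = function (event) {
  calls += 1;
  console.log('diff refreshed on', event.type);
};

// ...then attach that same reference to both events instead of copying its body.
['keyup', 'input'].forEach(function (name) {
  target.addEventListener(name, updateDiff);
});

target.dispatchEvent(new Event('keyup'));
target.dispatchEvent(new Event('input'));
console.log(calls); // 2
```

With jQuery the same idea applies: pass one function reference to `.bind()` for both event names rather than two copies of an inline function.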

droplet’s picture

Sometimes there are no spaces or commas between CJK sentences. :)

gábor hojtsy’s picture

@droplet: why did you submit two different patches? What's the goal? Can you act on @kkaefer's feedback on the JS duplication if we need the functionality from cjk2.patch? I admit I don't entirely understand where you are leading us with this issue, but I would greatly love to submit improvements that add this functionality.

droplet’s picture

Status  Size
new     68.26 KB
new     9.68 KB

Hmm. It does not work well in Firefox when using an IME (http://en.wikipedia.org/wiki/Input_method_editor), so I patched editor.js to make it behave the same in all well-known browsers. I don't know how to explain it... see ime.jpg, which may give you an idea.
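
The editor.js changes themselves are not shown in this thread. As a sketch of one common way to deal with IME input, the standard `compositionstart` and `compositionend` DOM events can be used to skip updates while a composition is in progress. `attachImeAwareHandler` and `field` are hypothetical names, and browser support and event ordering vary, so this is not a drop-in fix.

```javascript
// Skip updates while an IME composition is open and refresh once it is committed.
function attachImeAwareHandler(target, onChange) {
  let composing = false;

  target.addEventListener('compositionstart', function () {
    composing = true;
  });
  target.addEventListener('compositionend', function () {
    composing = false;
    onChange('compositionend'); // the composed text is final at this point
  });
  target.addEventListener('input', function () {
    if (!composing) {
      onChange('input');
    }
  });
}

// Simulated event sequence; a real browser fires these on the textarea.
const field = new EventTarget();
const seen = [];
attachImeAwareHandler(field, function (source) { seen.push(source); });

field.dispatchEvent(new Event('input'));            // plain typing
field.dispatchEvent(new Event('compositionstart')); // IME opens
field.dispatchEvent(new Event('input'));            // ignored while composing
field.dispatchEvent(new Event('compositionend'));   // IME commits the text
console.log(seen.join(', ')); // input, compositionend
```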

The new patch below also caches the function.

Thanks, all.

gábor hojtsy’s picture

The search module might also be a useful source of input. It was recently updated in #493770: Search incorrectly splits some katakana words.

droplet’s picture

Nothing for the search module, but there is one for autocomplete: #812354: Handle IME input in autocomplete better

droplet’s picture

I wrote a GM (Greasemonkey) script. Use it if you can't wait for LDO (localize.drupal.org) to get this committed :)


// ==UserScript==
// @name           localize
// @namespace      localize
// @include        http://localize.drupal.org/translate/languages/*/translate*
// ==/UserScript==

    
    var JScode = new Array();

    JScode.push('$.wordDiff.nonWord = ');
    JScode.push(' /(&.+?;|[\u0000-\u0040\u005B-\u0060\u007B-\u00A9\u00AB-\u00B4\u00B6-\u00B9\u00BB-\u00BF\u00D7\u00F7\u02C2-\u02C5\u02D2-\u02DF\u02E5-\u02EB\u02ED\u02EF-\u036F\u0375\u037E\u0384\u0385\u0387\u03F6\u0482-\u0489\u055A-\u055F\u0589\u058A\u0591-\u05C7\u05F3\u05F4\u0600-\u0603\u0606-\u061B\u061E\u061F\u064B-\u065E\u0660-\u066D\u0670\u06D4\u06D6-\u06E4\u06EA-\u06ED\u06F0-\u06F9\u06FD\u06FE\u0700-\u070D\u070F\u0711\u0730-\u074A\u07A6-\u07B0\u07C0-\u07C9\u07EB-\u07F3\u07F6-\u07F9\u0901-\u0903\u093C\u093E-\u094D\u0951-\u0954\u09E2\u0962-\u0970\u06E7-\u06E9\u0981-\u0983\u09BC\u09BE-\u09C4\u09C7\u09C8\u09CB-\u09CD\u09D7\u09E3\u09E6-\u09EF\u09F2-\u09FA\u0A01-\u0A03\u0A3C\u0A3E-\u0A42\u0A47\u0A48\u0A4B-\u0A4D\u0A51\u0A66-\u0A71\u0A75\u0A81-\u0A83\u0ABC\u0ABE-\u0AC5\u0AC7-\u0AC9\u0ACB-\u0ACD\u0AE2\u0AE3\u0AE6-\u0AEF\u0AF1\u0B01-\u0B03\u0B3C\u0B3E-\u0B44\u0B47\u0B48\u0B4B-\u0B4D\u0B56\u0B57\u0B62\u0B63\u0B66-\u0B70\u0B82\u0BBE-\u0BC2\u0BC6-\u0BC8\u0BCA-\u0BCD\u0BD7\u0BE6-\u0BFA\u0C01-\u0C03\u0C3E-\u0C44\u0C46-\u0C48\u0C4A-\u0C4D\u0C55\u0C56\u0C62\u0C63\u0C66-\u0C6F\u0C78-\u0C7F\u0C82\u0C83\u0CBC\u0CBE-\u0CC4\u0CC6-\u0CC8\u0CCA-\u0CCD\u0CD5\u0CD6\u0CE2\u0CE3\u0CE6-\u0CEF\u0CF1\u0CF2\u0D02\u0D03\u0D3E-\u0D44\u0D46-\u0D48\u0D4A-\u0D4D\u0D57\u0D62\u0D63\u0D66-\u0D75\u0D79\u0D82\u0D83\u0DCA\u0DCF-\u0DD4\u0DD6\u0DD8-\u0DDF\u0DF2-\u0DF4\u0E31\u0E34-\u0E3A\u0E3F\u0E47-\u0E5B\u0EB1\u0EB4-\u0EB9\u0EBB\u0EBC\u0EC8-\u0ECD\u0ED0-\u0ED9\u0F01-\u0F3F\u0F71-\u0F87\u0F90-\u0F97\u0F99-\u0FBC\u0FBE-\u0FCC\u0FCE-\u0FD4\u102B-\u103E\u1040-\u104F\u1056-\u1059\u105E-\u1060\u1062-\u1064\u1067-\u106D\u1071-\u1074\u1082-\u108D\u108F-\u1099\u109E\u109F\u10FB\u135F-\u137C\u1390-\u1399\u166D\u166E\u1680\u169B\u169C\u16EB-\u16F0\u1712-\u1714\u1732-\u1736\u1752\u1753\u1772\u1773\u17B4-\u17D6\u17D8-\u17DB\u17DD\u17E0-\u17E9\u17F0-\u17F9\u1800-\u180E\u1810-\u1819\u18A9\u1920-\u192B\u1930-\u193B\u1940\u1944-\u194F\u19B0-\u19C0\u19C8\u19C9\u19D0-\u19D9\u19DE-\u19FF\u1A17-\u1A1B\u1A1E\u1A1F\u1B00-\u1B04\u1B34-\u1B44\u1B50-\u1B7C\u1B80-\u1B82\u1BA1-\u1BAA\u1BB0-\u1BB9\u1C24-\u1C37\u1C3B-\u1C49\u1C50-\u1C59\u1C7E\u1C7F\u1DC0-\u1DE6\u1DFE\u1DFF\u1FBD\u1FBF-\u1FC1\u1FCD-\u1FCF\u1FDD-\u1FDF\u1FED-\u1FEF\u1FFD\u1FFE\u2000-\u2064\u206A-\u2070\u2074-\u207E\u2080-\u208E\u20A0-\u20B5\u20D0-\u20F0\u2100\u2101\u2103-\u2106\u2108\u2109\u2114\u2116-\u2118\u211E-\u2123\u2125\u2127\u2129\u212E\u213A\u213B\u2140-\u2144\u214A-\u214D\u214F\u2153-\u2182\u2185-\u2188\u2190-\u23E7\u2400-\u2426\u2440-\u244A\u2460-\u269D\u26A0-\u26BC\u26C0-\u26C3\u2701-\u2704\u2706-\u2709\u270C-\u2727\u2729-\u274B\u274D\u274F-\u2752\u2756\u2758-\u275E\u2761-\u2794\u2798-\u27AF\u27B1-\u27BE\u27C0-\u27CA\u27CC\u27D0-\u2B4C\u2B50-\u2B54\u2CE5-\u2CEA\u2CF9-\u2CFF\u2DE0-\u2E2E\u2E30\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u2FF0-\u2FFB\u3000-\u3004\u3007-\u3030\u3036-\u303A\u303D-\u303F\u3099-\u309C\u30A0\u30FB\u3190-\u319F\u31C0-\u31E3\u3200-\u321E\u3220-\u3243\u3250-\u32FE\u3300-\u33FF\u4DC0-\u4DFF\uA490-\uA4C6\uA60D-\uA60F\uA620-\uA629\uA66F-\uA673\uA67C-\uA67E\uA700-\uA716\uA720\uA721\uA789\uA78A\uA802\uA806\uA80B\uA823-\uA82B\uA874-\uA877\uA880\uA881\uA8B4-\uA8C4\uA8CE-\uA8D9\uA900-\uA909\uA926-\uA92F\uA947-\uA953\uA95F\uAA29-\uAA36\uAA43\uAA4C\uAA4D\uAA50-\uAA59\uAA5C-\uAA5F\uD800\uDB7F\uDB80\uDBFF\uDC00\uDFFF\uE000\uF8FF\uFB1E\uFB29\uFD3E\uFD3F\uFDFC\uFDFD\uFE00-\uFE19\uFE20-\uFE26\uFE30-\uFE52\uFE54-\uFE66\uFE68-\uFE6B\uFEFF\uFF01-\uFF20\uFF3B-\uFF40\uFF5B-\uFF65\uFFE0-\uFFE6\uFFE8-\uFFEE\uFFF9-\uFFFD\u3000-\u303F\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF\uAC00-\uD7AF\uF900-\uFAFF\uFF00-\uFFEF\u3130-\u318F])/');
    JScode.push(';');
    
    var script = document.createElement('script'); 
    script.innerHTML = JScode.join('\n'); 
    delete JScode;
    document.getElementsByTagName('head')[0].appendChild(script); 


** It does not implement the IME fixes yet.

SebCorbin’s picture

Version: 6.x-2.x-dev » 7.x-1.x-dev
Status  Size
new     9.19 KB

Re-rolled against 7.x.

Will commit if it doesn't break the current worddiff.

SebCorbin’s picture

Status: Needs review » Fixed

  • SebCorbin committed 19378b9 on 7.x-1.x
    Issue #754784 by droplet, SebCorbin: Add CJK diff support
    

Status: Fixed » Closed (fixed)

Automatically closed - issue fixed for 2 weeks with no activity.