I'm migrating an Arabic phpBB 2.x forum to Drupal 4.7 using phpbb2drupal.module. The original phpBB forum is encoded in Windows-1256 (Arabic). I converted the data to UTF-8 before running phpbb2drupal. The migration itself completes without errors, but the correct character encoding is lost.

Can you tell me what I should do?

Comments

beginner’s picture

Hello Hakeem,

How did you convert phpBB to UTF-8?
Is your Drupal site encoded in UTF-8, too (look at the page header)?

phpbb2drupal currently doesn't handle encoding problems at all, but it should.
I'll be away from my computer during the next couple of days, so I won't be able to help, but I'll try to fix this problem next week.

Thanks for reporting.

beginner’s picture

Title: Migration is fine; encoding is lost! » Encoding change during the migration.
Assigned: hakeem » beginner
Status: Active » Fixed

I just updated the module.

Please try the latest version and test.

If everything works as expected, please let me know.

Reopen this issue if you find any problem with the encoding of any part of the forum (title, content, forum name, private messages, etc...). My test DB is in English, so it's difficult for me to make exhaustive tests.

thanks.

hakeem’s picture

Dear beginner,

I tested the module on another server and the encoding wasn't changed! Maybe it is because of something wrong in the encoding-related settings of the environment (the database itself, or the MySQL server).

Anyway, thanks a lot for your support!

beginner’s picture

Category: bug » support
Priority: Critical » Normal
Status: Fixed » Active

Hakeem,

Can you describe precisely what you tried to do?
Maybe you didn't select the proper encodings in the settings.

Can you answer these questions:
Did you download the latest module, with the encoding support?
What MySQL version do you use?
What is the encoding used in your phpBB board? (Look at the headers.) Can I have a link to the board?
What is the encoding used in your Drupal installation?
Do you have the mbstring module installed? (The settings page of the module should tell you that.)
Did you see any error messages on the settings page?

With my setup, I can only do limited tests. I don't have a test phpBB database encoded differently to test with.
If you think you did everything right, and selected the right FROM and TO encoding, and it still doesn't work, I need a copy of your database to test with.
http://drupal.org/user/23181/contact
Please contact me through the link above if you are willing to give me a copy of the data to test with.

yours,
Beginner.

beginner’s picture

Status: Active » Fixed

At last, I used my own module on my own live web site:
http://www.reuniting.info/forum/
At the same time, I did a small test to check the encoding settings, and it worked as expected.

I cannot help without being given more details.

Anonymous’s picture

Status: Fixed » Closed (fixed)

beginner’s picture

Version: » master
Status: Closed (fixed) » Active

Hello Hakeem,
I just received your private message. I reply here in case whatever fix we find for your case will be helpful for other people.

Your situation:
charset=windows-1256 (= Arabic)
MySQL 4.1.19

Now, I just found out that your encoding is not supported by mbstring.
In the page below, you can see the list of all supported encodings:
http://php.net/mbstring

Since your encoding is not supported by mbstring, there is little I can do at the phpbb2drupal module level.

Now, maybe there is another solution: you would have to convert the encoding of the phpBB database itself BEFORE you attempt the migration. I have been searching the net for a solution, but have not found one yet. That doesn't mean that it doesn't exist.

Another solution would be to keep the encoding. Try installing Drupal and changing the theme so that the page header declares your Arabic encoding instead of UTF-8. I don't know if the Drupal Arabic translation would work then (actually, I am pretty sure it won't, but you can try). If you don't mind being stuck with the 1256 encoding and English navigation, then you can go this way and completely circumvent the conversion problem.

Still, there must be a tool somewhere that should allow you to convert your DB before the migration.

beginner’s picture

I may have found an easy solution:
http://www.php.net/manual/en/function.iconv.php

Which PHP version do you use?

Can you create a file named test.php with the following content:

 <?php
 // Convert a plain test string; a fatal error here means iconv() is missing.
 echo iconv("ISO-8859-1", "UTF-8", "This is a test.");
 ?>

Then upload the file to the server and access it with your browser: do you get any error?
Try both at home and on the remote server.
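
A slightly fuller version of this check, as a rough sketch only: it first verifies that iconv() exists, then converts a short Windows-1256 byte string to UTF-8. The sample bytes and the variable name are illustrative assumptions, not part of phpbb2drupal.

```php
<?php
// Hypothetical test script; the sample bytes are an assumption for illustration.
if (!function_exists('iconv')) {
  // Some PHP builds ship without the iconv extension.
  echo "iconv() is not available on this server.\n";
  $utf8 = FALSE;
}
else {
  // "\xE3\xD1\xCD\xC8\xC7" spells an Arabic greeting in Windows-1256.
  $utf8 = iconv('WINDOWS-1256', 'UTF-8', "\xE3\xD1\xCD\xC8\xC7");
  // Print the result as hex so the browser's own charset cannot hide a problem.
  echo ($utf8 === FALSE) ? "Conversion failed.\n" : bin2hex($utf8) . "\n";
}
```

If the function is missing, the script says so instead of stopping with a fatal error, which makes the result easier to report back here.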

KMG’s picture

The same thing happened to me, so what is the solution?
Where do I have to upload this test file?
I am using Drupal 5.2 and phpBB2, and the language is Arabic.

beginner’s picture

Assigned: beginner » Unassigned

naheemsays’s picture

Status: Active » Needs review
Attached files: 1.59 KB (new), 577 bytes (new)

Attached is a patch to use iconv instead of mbstring. I have also removed the mbstring check as iconv is available as standard - no need to check for it.

Finally, I have added Windows-1256 to the options for conversion.

@Beginner - any reason Windows1251 and 1252 have a (CP1252) and (CP1251) in that array? I have not added a corresponding (CP1256) to my addition as I have no idea what it is for.

'Windows-1251 (CP1251)' => 'Windows-1251 (CP1251)',
'Windows-1252 (CP1252)' => 'Windows-1252 (CP1252)',
'Windows-1256' => 'Windows-1256',

beginner’s picture

If I remember correctly, CP1251 is an alternative name for Windows-1251 (and likewise CP1252 for Windows-1252).

naheemsays’s picture

Attached file: 1.99 KB (new)

Attached is the updated patch (added CP1256 to the description; rerolled without the split I had planned for the module).

naheemsays’s picture

Status: Needs review » Fixed

Patch has been committed to head and Drupal-5 branch.

beginner’s picture

The reason I hadn't made the change to iconv() earlier is that I was not sure it would be installed on every server.

The guy never replied to my question in #8.

I just tested iconv() on my computer and I get:
Fatal error: Call to undefined function: iconv() .

Should this issue be re-opened to get user feedback about the existence of iconv on their systems?

beginner’s picture

Status: Fixed » Active

Also, iconv() doesn't seem to support multibyte strings, which was the reason the earlier function was used. As such, the module wouldn't work for users of CJK languages.

naheemsays’s picture

Yes, a few issues have cropped up.

1. According to the PHP manual page, some platforms call the function libiconv (http://uk3.php.net/manual/en/function.iconv.php). It also shows a workaround, but this function should be available on all platforms in one form or another since PHP 4.0.5. Is there somewhere else I can look for corroboration? I can add an option to use mbstring where available, but that would leave the original bug of not being able to convert from encodings like Arabic (CP1256).

2. From reading the php Manuals, iconv *should* handle multibyte strings. It is even used in some other examples to convert from multibyte to Unicode so that other functions can use the string.

3. I have also noticed another problem with the change: it cuts off at the first character it cannot encode, thus losing the rest of that node's data. I need to add //TRANSLIT after the output charset string to fix this and get the best-match character.

4a. According to the manual, there is a bug or a feature where iconv will work even if the input charset is not defined. This will need further investigation, but if it works, I think removing the input encoding option would be a good thing.

4b. Does Drupal use other charsets apart from UTF-8? Just wondering if the output charset is needed as an option, or whether it can be fixed to UTF-8?

4c. The charset options may need to be changed to put "CP1256" etc instead of "Windows-1256 (CP1256)"

EDIT @ Beginner - What system are you using?

naheemsays’s picture

Just updated the drupal-5--3 branch to check if the functions iconv or libiconv exist, and to also use best match when an exact match is not available instead of cutting the string at the first illegal character.
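
Roughly, the idea looks like the sketch below. It is not the committed code: the function name example_to_utf8 is hypothetical, and the fallback to mbstring is an assumption about how the two libraries could be combined.

```php
<?php
// Hypothetical helper for illustration only; not the module's actual code.
function example_to_utf8($text, $from) {
  if (function_exists('iconv')) {
    // //TRANSLIT asks iconv for a close match when a character has no
    // exact equivalent, instead of stopping at that character.
    $out = @iconv($from, 'UTF-8//TRANSLIT', $text);
    if ($out !== FALSE) {
      return $out;
    }
  }
  if (function_exists('mb_convert_encoding')) {
    // Note: mbstring rejects encodings it does not know, so this branch
    // only helps for charsets on its supported list.
    return mb_convert_encoding($text, 'UTF-8', $from);
  }
  // No conversion library is available.
  return FALSE;
}

echo bin2hex(example_to_utf8("\xE9", 'ISO-8859-1')), "\n";
```

Returning FALSE when nothing is available lets the caller decide whether to import the text unconverted or to stop with a message.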

beginner’s picture

Maybe you can postpone this issue until you get some feedback from the users.

If there is a problem, you can make a configuration setting, giving the choice between the two. A switch would use either one or the other function according to the setting.

Drupal uses UTF-8 by default everywhere, i.e. in the theme (see headers) and in the DB (see encoding setting). I don't see a good reason for people to change the default.

Libiconv was not installed by default on my development platform, but when I noticed it, it was easy to install the missing package (php-iconv).

I am using Mandriva but I plan to switch to Debian, when I can.

naheemsays’s picture

heh, just looking at the API, I found this:

http://api.drupal.org/api/function/drupal_convert_to_utf8/6

I have changed HEAD to use this. However, this function will bail out if it cannot convert a character using the iconv function, instead of finding the next best match (or even ignoring that character) and moving on. I will need to file a bug report to fix this.

naheemsays’s picture

Just been looking at phpbb3 to see how it changes from phpbb2(many encodings) to phpbb3 (utf-8) as I figure they would be the experts for their encodings.

The main encoding is iso-8859-1 (for English at least), but it forces the recoding to actually convert from cp1252.

It also has the following commented-out section listing other encodings (includes/utf/utf_tools.php):

	/*static $lang_enc_array = array(
		'korean'						=> 'euc-kr',
		'serbian'						=> 'windows-1250',
		'polish'						=> 'iso-8859-2',
		'kurdish'						=> 'windows-1254',
		'slovak'						=> 'Windows-1250',
		'russian'						=> 'windows-1251',
		'estonian'						=> 'iso-8859-4',
		'chinese_simplified'			=> 'gb2312',
		'macedonian'					=> 'windows-1251',
		'azerbaijani'					=> 'UTF-8',
		'romanian'						=> 'iso-8859-2',
		'romanian_diacritice'			=> 'iso-8859-2',
		'lithuanian'					=> 'windows-1257',
		'turkish'						=> 'iso-8859-9',
		'ukrainian'						=> 'windows-1251',
		'japanese'						=> 'shift_jis',
		'hungarian'						=> 'ISO-8859-2',
		'romanian_no_diacritics'		=> 'iso-8859-2',
		'mongolian'						=> 'UTF-8',
		'slovenian'						=> 'windows-1250',
		'bosnian'						=> 'windows-1250',
		'czech'							=> 'Windows-1250',
		'farsi'							=> 'Windows-1256',
		'croatian'						=> 'windows-1250',
		'greek'							=> 'iso-8859-7',
		'russian_tu'					=> 'windows-1251',
		'sakha'							=> 'UTF-8',
		'serbian_cyrillic'				=> 'windows-1251',
		'bulgarian'						=> 'windows-1251',
		'chinese_traditional_taiwan'	=> 'big5',
		'chinese_traditional'			=> 'big5',
		'arabic'						=> 'windows-1256',
		'hebrew'						=> 'WINDOWS-1255',
		'thai'							=> 'windows-874',
		//'chinese_traditional_taiwan'	=> 'utf-8' // custom modified, we may have to do an include :-(
	);*/

The actual convert function after this is similar to what we have in Drupal (but it also has many manual recoders for cases where none of the functions we use [Iconv, mbstring and recode, in that order.] exist. )

phpBB is also under the GPL. Can I borrow the above table to replace the one we have now?

We also have an option to "automate" all this by getting the "default language" from the phpbb_config table. (maybe leave an option to encode or not for those who do not have any encoding functions available.)
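
A minimal sketch of that lookup, under stated assumptions: the function name is hypothetical, only a few rows of the table above are copied, and the cp1252 fallback simply follows the observation that phpBB3 recodes iso-8859-1 boards from cp1252. The language name is assumed to come from the default_lang entry of phpbb_config.

```php
<?php
// Hypothetical lookup; only a few entries from the phpBB3 table are shown.
function example_phpbb_charset($default_lang) {
  $map = array(
    'arabic'   => 'windows-1256',
    'farsi'    => 'windows-1256',
    'russian'  => 'windows-1251',
    'greek'    => 'iso-8859-7',
    'japanese' => 'shift_jis',
  );
  // Normalise the value so 'Arabic' and ' arabic ' both match.
  $key = strtolower(trim($default_lang));
  // Assumed fallback: treat unknown languages as cp1252.
  return isset($map[$key]) ? $map[$key] : 'windows-1252';
}

echo example_phpbb_charset('Arabic'), "\n";
```

The returned charset could then be offered as the preselected FROM encoding, while still letting the admin override it or skip the conversion.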

beginner’s picture

That's great. Obviously, phpBB knows better how its own data is encoded. It's all GPL so you are free to borrow any code you like, if it can help a user migrate from phpBB2 to Drupal.

naheemsays’s picture

Not sure if it is connected, but I keep getting errors with "smart quotes". “ gets changed to â€œ and ” to â€. The - even borked the conversion as an illegal character!

I see there is a function to turn similar things into html characters. Probably need to see why it is not working.

naheemsays’s picture

The issue in #23 is the line $text = html_entity_decode($text, ENT_QUOTES); in the encoding function. Do we really need this? I think everything would work just as well without it.

beginner’s picture

What is the encoding of your data?
Is it iso-8859-1 or windows-1252?
The latter adds the em-dash and fancy quotes, which are not part of the basic iso-8859-1 encoding.
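
To see the difference, here is a small sketch (assuming iconv is installed; the variable names are only for illustration). It converts the Windows-1252 byte for a left curly quote to UTF-8, then converts the result a second time to show the kind of garbage reported above.

```php
<?php
// In Windows-1252, byte 0x93 is a left curly quote; in ISO-8859-1 the
// same byte is a control code, not a printable character.
$quote = iconv('WINDOWS-1252', 'UTF-8', "\x93");

// Converting text that is already UTF-8 a second time turns the one
// quote into three unrelated characters.
$twice = iconv('WINDOWS-1252', 'UTF-8', $quote);

echo bin2hex($quote), "\n";
echo $twice, "\n";
```

If your imported text shows that three-character sequence, the data was probably converted once more than it should have been, or read with the wrong source encoding.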

Anyhow, security-wise, I think it should be OK to remove that line of code, as long as the INSERTs follow the proper Drupal API (they do).
Do check, though. _phpbb2drupal_text_encode() is called in many places.

beginner’s picture

about #24.
Here is how you can do some investigative work.

1) go to http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/phpbb2drupa...
2) go to the 'annotate' view. http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/phpbb2drupa...
3) check the line of code you wonder about:
1424 : augustin 1.41 $text = html_entity_decode($text, ENT_QUOTES);
you see that this line was added by myself in version 1.41.
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/phpbb2drupa...

You go back to the log view, and you see:

Revision 1.41 - (view) (download) (annotate) - [select for diffs] 
 Sat Jun 2 04:45:56 2007 UTC (7 months, 1 week ago) by augustin 
Branch: MAIN 
Changes since 1.40: +2 -1 lines 
Diff to previous 1.40 
#114451 Topics imported with HTML entities escaped. Thanks Webchick.

which gives you the reference to the issue:
http://drupal.org/node/114451

That's why it's important you always reference the issue with a # sign each time you commit some code: http://drupal.org/project/cvs/45403
You never know when you (or the next maintainer) will wonder about a change.

When I did my migration, I didn't have that line of code, and I remember that I did have problems with quotes in titles, etc.
And according to the issue referenced above, html_entity_decode() is necessary for node titles and comment titles... but apparently not for the body.

You need to test this (node title, comment title, body, with and without the extra code). Add quotation marks in your titles for testing.
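
For a quick test outside the module, something like the sketch below shows what the call does to an escaped title. The sample title is made up, and passing 'UTF-8' explicitly as the third argument is an assumption of this example; the module's line does not pass it.

```php
<?php
// A made-up title as it might be stored with its entities escaped.
$title = '&quot;Hello&quot; &amp; &#039;goodbye&#039;';

// ENT_QUOTES makes the call decode both double and single quotes.
$decoded = html_entity_decode($title, ENT_QUOTES, 'UTF-8');

echo $decoded, "\n";
```

Running the same title through the import with and without the line should show whether titles and bodies really behave differently.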

naheemsays’s picture

Thanks for giving me more details on how to investigate. Very helpful.

I will take this issue to the right place so as not to pollute this topic:

http://drupal.org/node/114451

Evance’s picture

This module is so surprising...

I really need it, but my forum is based on phpBB3.

How can I modify it to make it work?

beginner’s picture

@Evance: this is the wrong issue. There is another issue about phpbb3. Don't add noise here.

phpbb3 is currently unsupported.

If you want support for phpbb3, you can either provide a patch (see patching guidelines in handbook), or pay nbz some money for him to support it soon.
Don't reply here but in the proper issue.

naheemsays’s picture

Status: Active » Fixed

I have reverted some changes and borrowed code from the drupal_convert_to_utf8 function, as I needed it to behave slightly differently: it uses //TRANSLIT for iconv encoding to avoid data loss, which also makes iconv behave like mbstring. I have supplied a patch for the main function in another issue:

http://drupal.org/node/205406

Once/if that is fixed, I can go back to using the function directly.

briinums’s picture

Version: master » 5.x-2.0
Status: Fixed » Active

Hello!

I have phpBB2, and the encoding I use on the site is UTF-8 (I write in Latvian). phpMyAdmin says the MySQL encoding is UTF-8, but the collation of the table fields is "latin1_swedish_ci" (according to php.net it is windows-1252).
I tried to import the data:
1) without encoding
2) encoding from isoblablabla-1
3) encoding from UTF-8
4) encoding from windows-1252

None of these worked :/ The non-English characters are screwed up.

My server HAS the mbstring module.
I have no idea how to use the .patch files you posted above, so I haven't tried them (for iconv).
I have no shell access to my server, just FTP.

I hope you can help me.

beginner’s picture

Status: Active » Fixed

You need to fix your phpBB DB first.
See this: http://drupal.org/node/187689

This is a separate issue.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.