Again, this could be my DragonflyCMS thing being weird. If so, please mark this as won't fix.

Topics are being imported as:
"Smokin& #039; Aces: Theatrical Review"

They are actually stored this way in the database; however, since Drupal does filtering on output, they're being double-escaped.

This patch fixes the problem by decoding the titles before they're put in the db. Bodies don't have this problem, since they're handled by BBCode.
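A minimal sketch of the double-escaping described above, not the patch itself. The function `example_escape_output()` is a hypothetical stand-in for Drupal's output filtering, assuming it behaves like `htmlspecialchars()` with `ENT_QUOTES`. The title is written with a plain `&#039;` entity.

```php
<?php
// Hypothetical stand-in for Drupal's output escaping (assumption:
// it behaves like htmlspecialchars() with ENT_QUOTES).
function example_escape_output(string $text): string {
  return htmlspecialchars($text, ENT_QUOTES, 'UTF-8');
}

// Title as stored by the source forum: the apostrophe is already an entity.
$stored = 'Smokin&#039; Aces: Theatrical Review';

// Imported as-is, then escaped on output: the ampersand gets escaped again.
$double_escaped = example_escape_output($stored);

// Decoded before import, then escaped on output: escaped exactly once.
$decoded = html_entity_decode($stored, ENT_QUOTES, 'UTF-8');
$escaped_once = example_escape_output($decoded);

echo $double_escaped, "\n"; // Smokin&amp;#039; Aces: Theatrical Review
echo $decoded, "\n";        // Smokin' Aces: Theatrical Review
echo $escaped_once, "\n";   // Smokin&#039; Aces: Theatrical Review
```

A browser renders the first string as the literal text `&#039;`, and the last one as a normal apostrophe.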

Comment / File / Size / Author
#1: html-entities_0.patch, 622 bytes, webchick
html-entities.patch, 700 bytes, webchick

Comments

webchick’s picture

Status: new
File size: 622 bytes

Whoops. Forgot about comment titles as well. And this is probably a better approach.

beginner’s picture

Version: master » 4.7.x-1.0
Status: Needs review » Fixed

committed. thanks.

Anonymous’s picture

Status: Fixed » Closed (fixed)

naheemsays’s picture

Version: 4.7.x-1.0 » master
Status: Closed (fixed) » Active

In http://drupal.org/node/67068#comment-688227 I wrote:

Not sure if it is connected, but I keep getting errors with "smart quotes". “ gets changed to â€œ and ” to â€. The - even borked the conversion as an illegal character!

I see there is a function to turn similar things into html characters. Probably need to see why it is not working.

and:

the issue with #23 is the line $text = html_entity_decode($text, ENT_QUOTES); in the encoding function. Do we really need this? I think everything will work just as well without it?

Beginner replied:

What is the encoding of your data?
Is it iso-8859-1 or windows-1252?
The latter adds the em-dash and fancy quotes, which are not part of the basic iso-8859-1 encoding.
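To illustrate the difference between the two encodings, here is a small sketch using PHP's `iconv()`. It assumes the iconv extension is available and that it knows the charset name `CP1252` (windows-1252). The variable names are made up for the example.

```php
<?php
// Byte 0x93 is a left curly quote in windows-1252,
// but a C1 control code in iso-8859-1.
$byte = "\x93";

$as_cp1252 = iconv('CP1252', 'UTF-8', $byte);
$as_latin1 = iconv('ISO-8859-1', 'UTF-8', $byte);

echo bin2hex($as_cp1252), "\n"; // e2809c (U+201C, left curly quote)
echo bin2hex($as_latin1), "\n"; // c293   (U+0093, control character)

// The reverse mistake: the UTF-8 bytes of a left curly quote,
// misread as windows-1252, turn into three separate characters.
$mojibake = iconv('CP1252', 'UTF-8', "\xE2\x80\x9C");

echo $mojibake, "\n"; // â€œ
```

The last case looks like the garbling reported for smart quotes, which would suggest the data is being converted from the wrong source encoding somewhere along the way.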

Anyhow, security-wise, I think it should be ok to remove that line of code, as long as the INSERTs follow the proper Drupal API (they do).
Do check, though. _phpbb2drupal_text_encode() is called in many places.

about #24.
Here is how you can do some investigative work.

1) go to http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/phpbb2drupa...
2) go to the 'annotate' view. http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/phpbb2drupa...
3) check the line of code you wonder about:
1424 : augustin 1.41 $text = html_entity_decode($text, ENT_QUOTES);
You see that this line was added by me in revision 1.41.
http://cvs.drupal.org/viewvc.py/drupal/contributions/modules/phpbb2drupa...

You go back to the log view, and you see:

Revision 1.41 - (view) (download) (annotate) - [select for diffs]
Sat Jun 2 04:45:56 2007 UTC (7 months, 1 week ago) by augustin
Branch: MAIN
Changes since 1.40: +2 -1 lines
Diff to previous 1.40
#114451 Topics imported with HTML entities escaped. Thanks Webchick.

which gives you the reference to the issue:
http://drupal.org/node/114451

That's why it's important you always reference the issue with a # sign each time you commit some code: http://drupal.org/project/cvs/45403
You never know when you (or the next maintainer) will wonder about a change.

When I did my migration, I didn't have that line of code, and I remember that I did have problems with quotes in titles, etc.
And according to the issue referenced above, html_entity_decode() is necessary for node titles and comment titles... but apparently not for the body.

You need to test this (node title, comment title, body, with and without the extra code). Add quotation marks in your titles for testing.

I think the first approach from the two patches should fix this. I will need to test it later on.

naheemsays’s picture

Just tested removing this patch, but no luck. I think it may just be some corrupted data on my end.

beginner’s picture

Where do you have the problem?
In the node title or within the node body?
In Drupal, html entities are not allowed within the title.

naheemsays’s picture

Within the body.

I tried to change it so that only the html entities are escaped for the titles (like the first patch, but also for comments, polls etc), but not for the node body.

I do not get this everywhere, only in one specific post in my test database (of about 110,000 posts, so there may be more). It may just be data corruption.

naheemsays’s picture

Status: Active » Fixed

The original (committed) patch decoded the characters to iso-8859-1 by default, and iso-8859-1 cannot represent these special characters.

$text = html_entity_decode($text, ENT_QUOTES);

Should be

$text = html_entity_decode($text, ENT_QUOTES, 'utf-8');

(I also have moved the decoding of html entities to after the conversion of everything as this seems like a safer bet for other encodings.)
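A small sketch of why the third argument matters, assuming the target database expects UTF-8. Passing `'ISO-8859-1'` explicitly reproduces the default that PHP versions before 5.4 used when no charset was given. The `&eacute;` entity is just an example input.

```php
<?php
$entity = '&eacute;';

// Explicit ISO-8859-1 mimics the old default of PHP versions before 5.4.
$latin1 = html_entity_decode($entity, ENT_QUOTES, 'ISO-8859-1');
$utf8 = html_entity_decode($entity, ENT_QUOTES, 'UTF-8');

echo bin2hex($latin1), "\n"; // e9   (one byte, not valid UTF-8 on its own)
echo bin2hex($utf8), "\n";   // c3a9 (two-byte UTF-8 sequence)

// preg_match() with the /u modifier fails on subjects that are not valid UTF-8.
$latin1_is_utf8 = preg_match('//u', $latin1) === 1;
$utf8_is_utf8 = preg_match('//u', $utf8) === 1;

var_dump($latin1_is_utf8); // bool(false)
var_dump($utf8_is_utf8);   // bool(true)
```

The single-byte result is what ends up corrupting text once it is stored in a UTF-8 column.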

This did affect titles too. Fixed in both HEAD and drupal 5.x-3.x-dev branch.

Anonymous’s picture

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for two weeks with no activity.