WordPress often auto-converts (it would seem) normal quotes into smart quotes, and it seems this is tripping up WordPress Migrate a bit. For example, one of the imported posts has a title of: Sponsor Fill PDF's "Send as Email" Feature

Overall, the module did an impressive job of migrating! I think there might have been a comment order issue as well, but I'll research/report that separately if so.

If I get to it, I will try to patch this issue myself, but wanted to report it in case it was known or buried in another issue I couldn't find.

Support from Acquia helps fund testing for Drupal Acquia logo

Comments

wizonesolutions’s picture

Title: WordPress smart quotes are not translated in titles » Decode HTML entities in WordPress titles when importing
Status: Active » Needs review
FileSize
712 bytes

I've updated this title to better reflect the issue. Not only that; I've actually attached a patch!

Please review. I'm not sure if my solution is too liberal or would let someone put something naughty in on the WordPress side and cause issues Drupal-side. But this satisfies my use-case of what will be a one-time migration, after which the URL that currently hosts the WP site will host this site, and I've created all of the content, so I know what's there.

I think it's pretty solid.

wizonesolutions’s picture

Title: Decode HTML entities in WordPress titles when importing » Do not escape amphersands referencing real HTML entities

Further investigation shows that my previous patch was naïve. The actual underlying issue was this block of code in wordpress_migrate.pages.inc:

    // The excerpt namespace is sometimes omitted, stuff it in if necessary
    $wxr_string = file_get_contents($destination);
    $excerpt_ns = 'xmlns:excerpt="http://wordpress.org/export/1.0/excerpt/"';
    $excerpt_signature = 'xmlns:excerpt="http://wordpress.org/export/';
    $content_ns = 'xmlns:content="http://purl.org/rss/1.0/modules/content/"';
    if (!strpos($wxr_string, $excerpt_signature)) {
      $wxr_string = str_replace($content_ns, $excerpt_ns . "\n\t" . $content_ns, $wxr_string);
    }
    // Add the Atom namespace, in case it's referenced
    $atom_ns = 'xmlns:atom="http://www.w3.org/2005/Atom"';
    $wxr_string = str_replace($content_ns, $atom_ns . "\n\t" . $content_ns, $wxr_string);

    // Fix unencoded ampersands
    $wxr_string = str_replace('&', '&', $wxr_string);
    $wxr_string = str_replace('&', '&', $wxr_string);
    file_put_contents($destination, $wxr_string);

This does not take HTML entities into account (and in fact messes them up).

The previous patch is a stopgap for titles...but I'll try to figure out a way to achieve what this is trying to do without doing it so destructively.

@mikeryan or others...what is the purpose of this code? Do unencoded amphersands wreak havoc at some point in the process? I certainly can imagine that being the case...although running the import through Drush seems to work just fine, so the effects must be somewhat specific.

wizonesolutions’s picture

Priority: Normal » Major
FileSize
752 bytes

Here's a proposal of a better way to do this. It doesn't really try to address the initial str_replace that un-fixes any encoded amphersands, but it works pretty well for what it does.

Bumping to major because this is going to mess up a lot of people's imports if not addressed.

kimmel’s picture

FileSize
692 bytes

I am not sure the patch for #3 is correct. Why do you need to match the equal symbol or an empty space before the semicolon?

I have attached a patch for how I think it should be fixed.

wizonesolutions’s picture

That was just me engaging in overkill, as you mentioned on IRC. I don't see any flaws in the way you proposed, so I will concede to your patch.

The only hole in both of our approaches is that something like &fishandchips; or other fake entities would not be encoded. However, I don't think such strings will occur without manual editing to the export file because WordPress would escape them upon save, I'm pretty sure. So until that use case comes up...I think this fix should be good.

Nice one on getting rid of the initial str_replace.

I noticed that &_whatever; is a valid entity with your regex because you use \w instead of A-za-z0-9 - not sure if that actually IS valid in character classes. I'll leave this point to the maintainer or you if you want to tweak.

P.S. The logic behind the = and space was just adding in additional characters to serve as boundaries for the HTML entity...but they were actually unnecessary because the pattern matching fails if a non-numeric non-alphabetic (and non-underscore) non-# character is used anyway. To my credit, it was 2:30 AM :)

kimmel’s picture

Yeah we would need to follow the entity list here: http://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_refer... and that would further complicate this module.

A possible way to test if this could be a problem would be to run a parser over the input data from wordpress and verify that it is compliant. If the wordpress exporter's output is good then this kind of edge case should not be a problem. We would need multiple different wordpress exports to verify one way or the other.

wizonesolutions’s picture

@mikeryan, have you seen this issue? I know you don't have time to fix everything - just curious as there have been other commits since.

dalin’s picture

I think the better way might be to simply:

$wxr_string = htmlspecialchars($wxr_string, ENT_QUOTES | ENT_XML1, 'UTF-8', FALSE);

dalin’s picture

Status: Needs review » Reviewed & tested by the community

Actually we won't be able to use htmlspecialchars() until PHP 5.4. The patch in #4 is the best that we can do ATM.

mikeryan’s picture

Title: Do not escape amphersands referencing real HTML entities » Do not escape ampersands referencing real HTML entities

OK, this project is officially back on track. I've cut a new 7.x-2.x branch (to reflect compatibility with Migrate 7.x-2.x) and will start by committing the most promising patches people have submitted (I'm not doing thorough testing, just a quick code review for now). From there some refactoring will be in order - the plan is to have a solid RC release by 2/17.

So, this patch is committed, thanks!

mikeryan’s picture

Status: Reviewed & tested by the community » Fixed

Status: Fixed » Closed (fixed)

Automatically closed -- issue fixed for 2 weeks with no activity.