Option to Make Email Addresses spamspan Compatible. [#1124308]

I would like an option to convert the email addresses in the static file to a format compatible with the spamspan module. Here is the code I am using in my tweaked version of import_html to do this:

// Needed because spamspan otherwise generates invalid HTML when the linked text is a name rather than an
// email address. This customization replaces each mailto link with plain text. If the linked text differs
// from the email address, the text is placed before the address. It is left to spamspan to create the
// markup for the email address.
  $offset = 0;
  $insert = "";
  while (preg_match('@<A[\s]*?href=[\s]*?\"?[\s]*?mailto:[\s]*?([\S]*?)[\s]*?\".*?>[\s]*?([\S].*?)[\s]*?</A>@mi', $xmlsource, $email, PREG_OFFSET_CAPTURE, $offset)) {
    if ($email[1][0] != $email[2][0]) {
      // The linked text is most likely not an email address, so keep it as
      // plain text and put the email address after it:
      $insert = $email[2][0] . ' ' . $email[1][0];
    }
    else {
      $insert = $email[1][0];
    }
    // Replace the whole A element with the plain text.
    $xmlsource = substr_replace($xmlsource, $insert, $email[0][1], strlen($email[0][0]));
    // Resume the search just past the inserted text.
    $offset = $email[0][1] + strlen($insert);
  }
  $xmlsource = spamspan($xmlsource);

I am aware of two problems with the above code, each of which produces corrupt HTML that fails to import:

If the A element has a title attribute whose value contains an email address, corrupt HTML is generated and the document fails to import. Code needs to be added to delete titles containing the at sign (@).
If the linked text of the A element contains other elements, these need to be moved outside the A element. For example:
```
<A href="mailto:user@example.com"><font size="4">user@example.com</font></A>
```
needs to be changed by the code to:
```
<font size="4"><A href="mailto:user@example.com">user@example.com</A></font>
```
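
Both problems might be avoided by dropping the A wrapper entirely while keeping whatever markup it contained: nested elements then no longer sit inside a link, and any title attribute on the link disappears with it. The following is only a sketch under that assumption. `example_unwrap_mailto_links()` and its callback are hypothetical names, not part of import_html or spamspan, and a regular expression will still miss unusual markup (for instance a `>` inside an attribute value).

```php
<?php
/**
 * Hypothetical helper: replace every mailto link with its contents so that
 * only plain text and inner markup remain for spamspan to process.
 */
function example_unwrap_mailto_links($html) {
  // Group 1: optional opening quote, group 2: address, group 3: link contents.
  $pattern = '@<a\b[^>]*?\bhref\s*=\s*(["\']?)\s*mailto:\s*([^"\'\s>]+)\1[^>]*>(.*?)</a>@is';
  return preg_replace_callback($pattern, 'example_unwrap_mailto_callback', $html);
}

/**
 * Callback for example_unwrap_mailto_links().
 */
function example_unwrap_mailto_callback($match) {
  // Drop any "?subject=..." query part from the address.
  $parts = explode('?', $match[2], 2);
  $address = $parts[0];
  $inner = $match[3];
  $text = trim(strip_tags($inner));

  if ($text === '') {
    // Nothing readable inside the link (e.g. only an image): keep the address.
    return $address;
  }
  if (strcasecmp($text, $address) === 0) {
    // The visible text already is the address: keep the inner markup as is.
    return $inner;
  }
  // The visible text is a name or label: keep it and append the address.
  return $inner . ' ' . $address;
}
```

The result could then be handed to spamspan in place of the loop above, e.g. `$xmlsource = spamspan(example_unwrap_mailto_links($xmlsource));`.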

Comments

Comment #1

dman commented 4 March 2012 at 15:37

Status: Active » Closed (won't fix)

This sort of text processing falls outside of what import_html is good at.
Early on I did attempt to do almost every type of clean up or transformation in one go during the import. But since then I've found it's more sane to divide the tasks up and just use import_html to get text into the database, including *some* stripping of useless bits and some rewriting.

Fundamentally changing or repairing code however can be safely done later using either filters or node_action processes. This is one of those tasks.

I've had occasion to knock up a number of small processes like this. Image Ownage is not one of the small examples, but illustrates how a clean-up job can be separated from the import jobs.

The SMALL example I'm talking about looks a little like this:


/**
 * Implementation of hook_node_operations().
 * 
 * Let this process be done from the content management screen
 */
function custom_cleanups_node_operations() {
  $operations = array(
    'custom_cleanups_check_content' => array(
      'label' => t('Check that the page content is formatted right'),
      'callback' => 'custom_cleanups_check_content_operation',
    ),
  );
  return $operations;
}

/**
 * Callback for hook_node_operations().
 * 
 */
function custom_cleanups_check_content_operation($nids = array()) {
  foreach ($nids as $nid) {
    $node = node_load($nid);
    $modified = custom_cleanups_check_content_action($node);
    if ($modified) {
      node_save($node);
    }
  }
}

/**
 * Scan a node and check the content
 * 
 * This action does not include a re-save of the given node.
 * Save it yourself if the return flag is set.
 * (this allows it to be called as part of larger processes)
 */
function custom_cleanups_check_content_action(&$node, $settings = array()) {
  // 1. Check to see if you need to make a change
  $modified = preg_match('@bad@', $node->body);
  // 2. Make that change to the $node->body
  $node->body = preg_replace('@bad@', 'good', $node->body);
  // 3. Remember to rebuild the teaser or strange double-ups occur
  if ($modified) {
    $node->teaser = node_teaser($node->body);
  }
  return $modified;
}
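
For the filter route mentioned earlier, a comparable sketch might look like the following. It assumes the Drupal 6 `hook_filter()` signature and reuses the hypothetical `custom_cleanups` module name; the 'bad' to 'good' rewrite is the same placeholder as in the node operation. Unlike the node operation, a filter leaves the stored body untouched and rewrites the text only when it is rendered.

```php
<?php
/**
 * Implementation of hook_filter() (Drupal 6 signature).
 *
 * Sketch only: the filter name and description are placeholders.
 */
function custom_cleanups_filter($op, $delta = 0, $format = -1, $text = '') {
  switch ($op) {
    case 'list':
      // One filter provided by this module, keyed by delta.
      return array(0 => t('Custom clean-ups'));

    case 'description':
      return t('Rewrites known-bad markup when the content is displayed.');

    case 'process':
      // Placeholder rewrite, mirroring the node operation example.
      return preg_replace('@bad@', 'good', $text);

    default:
      // Other operations ('prepare', 'settings', ...) pass the text through.
      return $text;
  }
}
```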
