I would like an option to convert the email addresses in the static file to a format compatible with the spamspan module. Here is the code I am using in my tweaked version of import_html to do this:
// Needed because spamspan otherwise generates illegal HTML when the linked text is a name rather than
// an email address. This customization removes the mailto link from the HTML code. If the linked text
// differs from the email address, the text is placed before the email address. It is left to spamspan
// to create the code for the email address.
$offset = 0;
$insert = "";
while( preg_match( '@<A[\s]*?href=[\s]*?\"?[\s]*?mailto:[\s]*?([\S]*?)[\s]*?\".*?>[\s]*?([\S].*?)[\s]*?</A>@mi', $xmlsource, $email, PREG_OFFSET_CAPTURE, $offset ) )
{
  if( $email[1][0] != $email[2][0] )
  {
    // The linked text is most likely not an email address, so keep it as plain text
    // and put the email address after it:
    $insert = $email[2][0] . ' ' . $email[1][0];
  }
  else
  {
    $insert = $email[1][0];
  }
  // Replace the whole matched link and continue searching after the inserted text.
  $xmlsource = substr_replace( $xmlsource, $insert, $email[0][1], strlen( $email[0][0] ) );
  $offset = $email[0][1] + strlen( $insert );
}
$xmlsource = spamspan($xmlsource);
I am aware of two problems with the above code, both of which produce corrupt HTML that fails to import:
- If the element has a title attribute whose value contains an email address, corrupt HTML is generated and the document fails to import. Code needs to be added to delete titles containing the at sign (@).
- If the A element containing the email link has other elements inside its link text, these need to be moved outside the A element. For example:
<A href="mailto:user@example.com"><font size="4">user@example.com</font></A>
needs to be changed by the code to:
<font size="4"><A href="mailto:user@example.com">user@example.com</A></font>
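A rough sketch of how both problems might be handled, not tested against import_html itself. The function name example_flatten_mailto_links is made up for illustration, and the closure syntax assumes PHP 5.3 or later. It first drops title attributes whose value contains an @ sign, then replaces each mailto link with its inner markup (so wrappers such as font survive) and appends the address only when the visible text differs from it. The result would still be passed to spamspan() afterwards, as in the code above.

```php
<?php
// Hypothetical helper, not part of import_html or spamspan.
function example_flatten_mailto_links($html) {
  // Problem 1: remove title attributes whose value contains an @ sign,
  // so spamspan never rewrites an address inside an attribute.
  $html = preg_replace(
    '~\s+title\s*=\s*(?:"[^"]*@[^"]*"|\'[^\']*@[^\']*\')~i',
    '',
    $html
  );

  // Problem 2: match mailto links even when other attributes precede href
  // or when the link text is wrapped in further elements.
  $pattern = '~<a\b[^>]*?\bhref\s*=\s*["\']?\s*mailto:\s*([^"\'\s>?]+)[^>]*>(.*?)</a>~is';

  return preg_replace_callback($pattern, function ($m) {
    $address = $m[1];
    $inner = trim($m[2]);
    // Compare on the visible text only, ignoring wrapper tags such as <font>.
    $text = trim(strip_tags($inner));

    if ($text === '') {
      // Empty link: leave just the address for spamspan.
      return $address;
    }
    if (strcasecmp($text, $address) === 0) {
      // The text already is the address: keep the wrappers, drop the link.
      return $inner;
    }
    // The text is a name or similar: keep it and put the address after it.
    return $inner . ' ' . $address;
  }, $html);
}
```

With the example above, the font-wrapped link comes out as the font element containing the bare address, and a link such as "Jane Doe" pointing at jane@example.com becomes "Jane Doe jane@example.com". Being regex-based, this will still trip over a > inside an attribute value, and the title clean-up would also hit text outside of tags that happens to look like a title attribute.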
Comments
Comment #1
dman commented
This sort of text processing falls outside of what import_html is good at.
Early on I did attempt to do almost every type of clean up or transformation in one go during the import. But since then I've found it's more sane to divide the tasks up and just use import_html to get text into the database, including *some* stripping of useless bits and some rewriting.
Fundamentally changing or repairing code, however, can safely be done later using either filters or node_action processes. This is one of those tasks.
I've had occasion to knock up a number of small processes like this. Image Ownage is not one of the small examples, but illustrates how a clean-up job can be separated from the import jobs.
The SMALL example I'm talking about looks a little like this: