digg_title (and other html) indexed in drupal search

jrguitar21 - January 9, 2009 - 23:07
Project:Digg this
Version:5.x-1.3
Component:Code
Category:bug report
Priority:normal
Assigned:yaph
Status:closed
Description

When I search for terms, if the terms happen to appear in a title of a node, then

1) they get placed near the top of the results page (good)
2) the "content" snippet that shows the emboldened terms ends up showing lots of escaped HTML (bad)

and, in particular the digg_title attribute usually comes in as part of the content snippet with the bold term i searched for, because its part of the title of the node.

Very ugly. Any ideas how to fix this?

#1

yaph - January 9, 2009 - 23:37
Assigned to:Anonymous» yaph

That's indeed ugly. The button should only displayed in a full page view not in search results according to the documentation of hook_nodeapi (http://api.drupal.org/api/function/hook_nodeapi/5).
Maybe using hook_view instead of hook_nodeapi in the Digg this code could fix this.

#2

yaph - January 10, 2009 - 00:30

I just created a new branch for the Drupal 5 version called DRUPAL-5--2 where I replaced hook_nodeapi with hook_view. Can you check out this branch and test whether the bug still occurrs?

#3

yaph - January 10, 2009 - 00:31

Forget my second comment. hook_view isn't called so this does not work at all. There has to be another solution.

#4

jrguitar21 - January 10, 2009 - 00:57

Hmm, good quick analysis, but Im not sure how moving the code to hook_view would work? (Oops i didnt see comment #2 or #3 when i wrote this) Perhaps another possible fix might be to create a handler for hook_nodeapi('update index') to grep out the "digg stuff" that was added during the 'view' case.

it looks like the flow goes something like this (i could be wrong here):

hook_view()
hook_nodeapi( 'view')
hook_nodeapi( 'update index')

Anyway just moving code around, or grepping out the content afterwords seems kinda hackish and processor intensive to me. The real problem to track down is why is the added digg html not getting cleaned out by the search indexer code? Somewhere its getting converted to escaped html and included in the page content body sent for indexing as plain text.

maybe theres something useful in the documentation of hook_update_index on how to properly send the digg div through the search_index.

--edited, flubbed the flow order...
--edited, to pardon my "attack" against something you already stated didnt work :/

#5

jrguitar21 - January 10, 2009 - 01:16

more leads: http://api.drupal.org/api/function/node_update_index/5

Looks like i was kinda wrong based on how the flow of node_update_index works...

Heres the key code snippet where important stuff happens:

<?php
    $node
= node_load($node->nid);

   
// Build the node body.
   
$node = node_build_content($node, FALSE, FALSE);
   
$node->body = drupal_render($node->content);

   
// Allow modules to modify the fully-built node.
   
node_invoke_nodeapi($node, 'alter');

   
$text = '<h1>'. check_plain($node->title) .'</h1>'. $node->body;

   
// Fetch extra data normally not visible
   
$extra = node_invoke_nodeapi($node, 'update index');
    foreach (
$extra as $t) {
     
$text .= $t;
    }

   
// Update index
   
search_index($node->nid, 'node', $text);
?>

From this i can see that you dont need to implement 'update index'... the content is already there and has been rendered. But now after examining your code, it looks like the problem is plain; because you're using <script> $digg_variables = "stuff" </script> the only thing i can deduce is that the <script> tags get stripped by the search_index but the stuff in the middle doesnt....

digg_url = '$url';
digg_title = $title;
digg_bodytext = $teaser;
$bg_string
digg_skin = '$skin';

is all left intact. It means that inline javascript isnt stripped by the search_index function....

I think I talked myself into the solution already, but basically its that you need to properly escape the inline javascript.

http://www.webdevout.net/articles/escaping-style-and-script-data

#6

jrguitar21 - January 10, 2009 - 01:40

I think a safe and sufficient approach might be:

<?php
  $digg_js
=<<<EOF
<script type="text/javascript">
//<![CDATA[
digg_url = '$url';
digg_title = $title;
digg_bodytext = $teaser;
$bg_string
digg_skin = '$skin';
//]]>
</script>
<script src="http://digg.com/tools/diggthis.js" type="text/javascript"></script>
EOF;
?>

Provided that none of the $variables you are using there have < or > in them, the search_index() function would strip everything but the "//" preceding <![CDATA

As for those $variables though, we must make sure they are either checked plain, properly html escaped or perhaps better yet hex encoded (if the digg api accepts or likes to be sent html markup).

I see that you properly strip all the variables in the code using strip_tags and drupal_to_js so its all good!

#7

jrguitar21 - January 10, 2009 - 01:47

Google seems to use an simpler format for its adsense code snippets, albeit not valid xhtml. If we do it their way, everything would be stripped:

<script type="text/javascript"><!--
google_ad_client = "pub-XXXXXXXXX";
/* Sidebar Link Items 200x90 */
google_ad_slot = "XXXXXXXX";
google_ad_width = 200;
google_ad_height = 90;
//-->
</script>

#8

yaph - January 10, 2009 - 20:36

Thanks a lot for your research and help James. I just did a new commit into the DRUPAL-5--2 branch where I added the HTML comments around the JS variables and also fixed some other bugs in the JS.

You can see the commit message for more info:
http://drupal.org/cvs?commit=163889

I haven't tested the code yet, because I need to set up a Drupal 5 installation first.

This bug is also on the DRUPAL-6 branch, I'll check whether this has the desired effect, before creating a new release.

#9

yaph - January 12, 2009 - 19:17

As far as I have tested the new code now, it looks as if adding the HTML comments prevents the JS variables from being displayed in search results. I created two new release candidates for Drupal 5 and Drupal 6.

Thanks again for reporting the bug and providing a solution!

#10

jrguitar21 - January 14, 2009 - 05:16

Great, i'm really glad to have helped.

As a follow up i'd like to note that I also tested on D5, on my live site. (not your release, but just the js comments in the module), marked the Search index to be reindexed and ran cron like a mad man for about an hour. All the nodes were updated and subsequent execution of a search for a term that i KNOW was a keyword in a page title doesnt show messy digg_title="....BOLD KEYWORD..." anymore in the search results!

Id say you're safe on D5 too.

#11

yaph - January 14, 2009 - 09:03

Yes, I also tested with a fresh D5 install, where I generated content with the devel module and the JS variables didn't show in the search results. I should have made it clear that I also tested D5, sorry for your additional efforts.

#12

yaph - January 16, 2009 - 22:26
Status:active» fixed

#13

System Message - January 30, 2009 - 22:30
Status:fixed» closed

Automatically closed -- issue fixed for 2 weeks with no activity.

 
 

Drupal is a registered trademark of Dries Buytaert.