Problem/Motivation

By default GSA indexes all page text. Many page layouts have common elements that would be better off not being indexed.

For example, I use taxonomy menu to generate a hierarchical menu of categories of content (and, of course, the terms are some of the most important and common terms on the whole site). The menu is displayed on all pages on my site. Searching for a term found in the menu returns nearly every page on the site.

This makes searching for common terms nearly meaningless. What I want is to index only the main content of each page, excluding text from the header, sidebar and footer regions.

Proposed resolution

GSA can exclude unwanted text from its index using HTML comments such as <!--googleoff: index--> and <!--googleon: index-->. Perhaps a google_appliance module configuration option could specify page regions (CSS classes?) that would or would not be indexed.
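A minimal sketch of the idea, assuming a hypothetical helper named example_gsa_exclude_from_index() (it is not part of the google_appliance module): any markup passed in is wrapped in the googleoff/googleon comment pair so the crawler skips it.

```php
<?php
/**
 * Hypothetical helper (not part of google_appliance): wraps markup in the
 * googleoff/googleon comment pair so the GSA crawler leaves it out of the index.
 */
function example_gsa_exclude_from_index($markup) {
  return '<!--googleoff: index-->' . $markup . '<!--googleon: index-->';
}

// Example: a site-wide category menu that should not match every search.
echo example_gsa_exclude_from_index('<ul class="menu"><li>Category A</li></ul>');
```

How the module would decide which regions to wrap (CSS classes, regions, blocks) is the open question; the sketch only shows the wrapping itself.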

To be honest, I'm not sure what the best implementation is. There are so many ways to modify page layout (templates, panels, delta, etc.) and so many ways to modify rendering (templates, render arrays) that I get a bit lost.

Remaining tasks

tbd

User interface changes

A configuration text field, "Comma-separated list of CSS classes for which text will not be indexed".

There are other indexing options supported by the GSA indexer (googleoff:anchor, googleoff:snippet, googleoff:all), which could have similar configuration, but I think it would be best to solve this problem for text first.

API changes

Don't know.


Comments

rt_davies commented:

A small update: I've implemented a workaround by modifying my theme template files directly (I'm using Omega). I'm including the code below in case it helps anyone who wants a quick hack. But I think an implementation in the google_appliance module that is independent of a specific theme, and can be turned on and off with the module, would be better.

In php.tpl.php

at the top...
  <!--googleoff: index-->

...rest of page...

  <!--googleon: index-->
...end

In region--content.tpl.php

...just after the opening div tag...
    <?php $indexing_on = in_array('region-content', $classes_array); ?>
    <?php if($indexing_on): ?>
      <!--googleon: index-->
    <?php endif; ?>

...just before the closing div tag...
    <?php if($indexing_on): ?>
      <!--googleoff: index-->
    <?php endif; ?>

With this hack in place search results are exactly what I want! Only pages whose main content is relevant to the searched terms appear in the search results.

mpgeek commented:

Interesting. I think that using the googleon/googleoff tags is content- and theme-specific. If you put those tags into such places, the wrapped content won't be indexed, so it won't be found by search either. IMO, controlling what ends up in the index should be handled at the content level. Perhaps you can convince me otherwise? Perhaps others have an opinion as well?

mpgeek commented:

Status: Active » Postponed (maintainer needs more info)

rt_davies commented:

"IMO, controlling what ends up in the index should be handled at the content level"

As I tried to make clear at the end of the OP, I don't know enough about module design, content design, or theming best practices to have an opinion. Your assertion sounds right to me to the extent that I understand it.

My OP is merely a "report from the field" indicating that I really like the module, but it fell short of my needs in this specific way. And my follow-up was simply to post a hackish workaround, knowing full well that it was not a good long-term solution.

I'd be happy to provide more info, if I felt I had any, but I'm really hoping that those with more Drupal chops will be able to say whether this feature request is worthy of consideration, and if so, how best to implement it.

Thanks.

mpgeek commented:

Status: Postponed (maintainer needs more info) » Active

Ah, I see. I'll leave this request open and see if we get any input. Thanks for contributing your ideas.

mpgeek commented:

Status: Active » Postponed (maintainer needs more info)

Soliciting opinions on this one.

iamEAP commented:

The Drupal 6 version of this module attempted to handle this to some extent by throwing an extra field on block edit forms (via form alter) that served as a flag. The implementation was pretty severely flawed, but the intention was to wrap the block content with the googleon/off tags based on the flag via a block preprocess.

See #1430656: Hiding blocks from GSA crawler has no effect

I suppose the same concept could be extended to cover fields/entities/menus/taxonomies as well. An improvement would be support for any combination of tags (index/anchor/snippet/all).
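A rough sketch of that concept for blocks, written as plain PHP so it can run outside Drupal. The function names, the flag storage (example_gsa_block_flags()), and the "example:sidebar_menu" block key are all hypothetical; the only assumption about Drupal 7 is that a block preprocess function receives the block object in $variables['block'] and the rendered output in $variables['content']. It applies one tag type per block.

```php
<?php
/**
 * Hypothetical helper: wrap content in one googleoff/googleon pair.
 * $tag must be one of the four GSA tag types; anything else is ignored.
 */
function example_gsa_wrap($content, $tag) {
  $allowed = array('index', 'anchor', 'snippet', 'all');
  if (!in_array($tag, $allowed, TRUE)) {
    return $content;
  }
  return '<!--googleoff: ' . $tag . '-->' . $content . '<!--googleon: ' . $tag . '-->';
}

/**
 * Hypothetical flag storage, keyed by "module:delta". A real module would
 * read this from its saved settings instead of a hard-coded array.
 */
function example_gsa_block_flags() {
  return array('example:sidebar_menu' => 'index');
}

/**
 * Sketch of a block preprocess function: wrap flagged blocks only.
 */
function example_preprocess_block(&$variables) {
  $flags = example_gsa_block_flags();
  $key = $variables['block']->module . ':' . $variables['block']->delta;
  if (isset($flags[$key])) {
    $variables['content'] = example_gsa_wrap($variables['content'], $flags[$key]);
  }
}
```

Keeping the wrapping in its own function means the same helper could later serve fields, menus, or other render targets.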

The question would be, do we want this module to try and cover all use cases, most normal use cases, or do we leave it as the responsibility of the user?

This isn't critical for me, but I'd be willing to help put this together.

mpgeek commented:

I like to keep the things that apply to all use cases in the module, as we only cover the bases that are necessary for anyone to use the module. Extension is normally what I would do to cover corner cases. Feel free to put a patch together...

iamEAP commented:

Version: 7.x-1.4 » 7.x-1.x-dev
Status: Postponed (maintainer needs more info) » Needs review
Attached file sizes: 7.09 KB, 54.42 KB, 75.17 KB

Adds some complexity, but the interface is pretty clean and the data storage is pretty clean too. Patch attached.

[Image: Interface Changes (Block Configuration Form)]

[Image: Block Visibility Interface Changes]

mpgeek commented:

Status: Needs review » Active

@iamEAP, are the choices in the form mutually exclusive? That is, would checking "dissociate anchor text" AND "exclude block contents from snippets" be a valid use case if there were checkboxes? Does it make more sense to use checkboxes to progressively exclude content? Like so:

Google Appliance Crawler Visibility
[ ] Dissociate anchor text from target pages within this block
[ ] Exclude block contents from search snippets
[ ] Exclude block contents from index

iamEAP commented:

Unfortunately, I was unable to find any documentation regarding the use of multiple tags around the same content. I can certainly see the use cases (and I like your order better), but I'm just not certain it's supported.

mpgeek commented:

@iamEAP, getting closer on this one. If I apply the patch and try to change the setting for any given block, I get "Cannot use string offset as an array in /Users/eric/mamp-localhost/d7-sandbox/sites/all/modules/contrib/google_appliance/google_appliance.module on line 286". Something is fishy with accessing the settings['block_visibilty_settings'] array.

iamEAP commented:

Status: Active » Needs review
Attached file size: 8 KB

Ah. The last patch attempted to store the block visibility settings as a serialized string because of the way the module's variable defaults are set up (in PHP, constants must be scalar), and it was missing an unserialize in the settings getter. On second thought, I'm not a big fan of storing all of those settings as a serialized string.

Here's more or less the same patch, that instead relies on PHP's dynamic typing to allow the default to be an empty string, but turn it into an empty array wherever necessary.
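A minimal sketch of that normalization, assuming a hypothetical helper name (the patch itself may do this differently): whatever comes back from storage is passed through a guard that turns the scalar default into an empty array before anything indexes into it.

```php
<?php
/**
 * Hypothetical helper: the stored default is '' (a scalar), so coerce any
 * non-array value to an empty array before code treats it as an array.
 * In the module, $stored would be the result of a variable_get() call.
 */
function example_block_visibility_settings($stored) {
  return is_array($stored) ? $stored : array();
}

// The default empty string becomes a usable array.
$settings = example_block_visibility_settings('');
$settings['example_block'] = 'index';  // Hypothetical block key and value.
```

Without a guard like this, nested writes against the string default are what produce errors such as "Cannot use string offset as an array".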

Note also that I'm moving the trim() that used to wrap all of the variable_get calls in the settings getter to the settings submit handler (because you can't call trim() on an array). There's a slim possibility this could break some sites because their GSA host had a trailing space. If we end up going with this, it may be worth adding an update function that trim()s all of the relevant variables with a note in the release notes.
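Roughly, the submit-time trimming could look like the sketch below. The function name and the setting keys are hypothetical; the point is only that string values get trim()med while array values (such as the block visibility settings) are left alone.

```php
<?php
/**
 * Hypothetical helper: trim string settings before they are saved and
 * leave non-string values (arrays, integers, booleans) untouched.
 */
function example_trim_settings(array $values) {
  foreach ($values as $key => $value) {
    if (is_string($value)) {
      $values[$key] = trim($value);
    }
  }
  return $values;
}

// Hypothetical submitted values: a host name with stray whitespace
// plus an array-valued setting that trim() must not touch.
$clean = example_trim_settings(array(
  'hostname' => ' http://gsa.example.com ',
  'block_settings' => array('example_block' => 'index'),
));
```

An update function for existing sites could run already-stored values through the same helper once.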

mpgeek commented:

Status: Needs review » Needs work

Yes, this one works much better. Just a few minor notes:

+++ b/google_appliance.module
@@ -218,6 +218,130 @@ function google_appliance_block_form_submit($form, &$form_state) {
+        '<p>' . t('For more information, see') . ' ' . l($read_more_link . '#pagepart', $read_more_link, array(

Can the link text/label be a few linked words instead of a full URL?

+++ b/google_appliance.admin.inc
@@ -177,4 +177,4 @@ function google_appliance_admin_settings_submit($form, &$form_state) {
\ No newline at end of file

Add the newline.

+++ b/google_appliance.helpers.inc
@@ -139,4 +142,28 @@ function _google_appliance_log_search_error($search_keys = NULL, $error_string =
\ No newline at end of file

Add the newline.

+++ b/google_appliance.module
@@ -218,6 +218,130 @@ function google_appliance_block_form_submit($form, &$form_state) {
+  $read_more_link = 'https://developers.google.com/search-appliance/documentation/50/admin_crawl/Preparing';
...
+ * @see https://developers.google.com/search-appliance/documentation/50/admin_crawl/Preparing#pagepart

Does it make any sense to point to the most-recent version of the docs here? https://developers.google.com/search-appliance/documentation/64/admin_cr...

I was just getting ready to tag out a stable release, so if we can get this one in soon, that'd be groovy.

mpgeek commented:

Oops, missed coding standards. Here's another small edit:

+++ b/google_appliance.module
@@ -218,6 +218,130 @@ function google_appliance_block_form_submit($form, &$form_state) {
+          'fragment' => 'pagepart', 'attributes' => array('target' => '_blank',)

Stray comma here.

iamEAP commented:

Status: Needs work » Needs review
Attached file size: 8.03 KB

Got rid of the stray comma, switched the link text to "Google Search Appliance documentation" instead of the full URL, and linked to the most recent documentation. Those newline warnings are due to the diff: the old files didn't end with a newline; the new ones do.

Looking forward to the new release.

mpgeek commented:

Status: Needs review » Reviewed & tested by the community

@iamEAP, I always wondered why the huzzah those newlines came through on all my patches... thanks for the tip. Anyways, looks good, and committed to dev: http://drupalcode.org/project/google_appliance.git/commit/64e007a. Stable release shortly.

mpgeek commented:

Status: Reviewed & tested by the community » Closed (fixed)

Fixed. See 7.x-1.7.