Add config option to restrict by subcollection

SnowyR - August 11, 2009 - 19:06
Project: Google Search Appliance
Version: 6.x-2.x-dev
Component: Code
Category: feature request
Priority: normal
Assigned: Unassigned
Status: active
Description

I'm not sure how this works in other Google Mini configurations (those using multiple collections), or whether it would help with future changes to your module...

In my case I have one licenced collection and use the subcollections feature to filter search results per website, so that results from unapproved/protected websites outside of our corporate intranet are not shown. This is on one of the early Google Mini models (MID-00235).

Modifying your 6.x-2.x-dev module (which works very well):
- introduced two site variables:
google_appliance_default_subcollection
google_appliance_restrict_by_subcollection_enabled
- this update is geared towards, and tested in, situations where we only have one collection in the Google Mini and we restrict access by using the subcollection option
- enabling the 'restrict by subcollection' option adds the parameter
&restrict=subcollectionname
which filters the results to the subcollection named in 'default_subcollection'
- if 'default_subcollection' is empty, the results are unaffected
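
The behaviour described in the list above could be sketched roughly as follows. This is not the attached modification itself: the function name example_build_search_query() and the 'q' parameter are assumptions made for illustration, and the settings are passed in as a plain array so the snippet runs on its own (inside Drupal 6 the two values would presumably be read with variable_get()).

```php
<?php
/**
 * Hypothetical helper (not from the attached file): builds the query string
 * sent to the Google Mini, adding 'restrict' only when the option is enabled
 * and a default subcollection has been set.
 */
function example_build_search_query($keys, array $settings) {
  // 'q' as the search-terms parameter is an assumption for this sketch.
  $params = array('q' => $keys);

  $enabled = !empty($settings['google_appliance_restrict_by_subcollection_enabled']);
  $subcollection = isset($settings['google_appliance_default_subcollection'])
    ? trim($settings['google_appliance_default_subcollection'])
    : '';

  // An empty default subcollection leaves the results unaffected.
  if ($enabled && $subcollection !== '') {
    $params['restrict'] = $subcollection;
  }

  return http_build_query($params, '', '&');
}

// Example: option enabled, subcollection named 'intranet'.
$query = example_build_search_query('annual report', array(
  'google_appliance_restrict_by_subcollection_enabled' => 1,
  'google_appliance_default_subcollection' => 'intranet',
));
// $query is now "q=annual+report&restrict=intranet"
```

Keeping the check in one place means that disabling the option, or clearing the subcollection name, falls back to the unrestricted query without any other code changes.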

Note:
- it might be helpful to add to the module notes that CURL and SimpleXML are dependencies

Attachment Size
subcollection-modifications-to-6x-2xdev.txt 7.3 KB

#1

larskleiner - August 12, 2009 - 09:54

I can't find a subcollection feature in my Google Mini M2 model. Could it be that this feature got replaced/renamed or maybe moved into GSA? If we need to address different features for different Google Mini (MID and M2) and potentially GSA versions, I suggest adding something like a select box to the admin area to specify the Mini/GSA version, so that features can be enabled and disabled specific to that version.

It would be great if you can supply the changes as a patch so that we can apply and test it.

The branch tag will most likely stay at 6.x-2.x-dev unless your patch brings about really major changes that will justify a version number increment.

I'd rather leave

$payload = simplexml_load_string($resultXML);

in GoogleMini.php as it is because it's better for debugging. Maybe we should only leave it for debug levels 1 and 2 but change it to

@$payload = simplexml_load_string($resultXML);

when debug is disabled?
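
A minimal sketch of that idea, assuming a numeric debug level where 0 means debugging is disabled. The wrapper name example_parse_result_xml() is hypothetical and is not part of GoogleMini.php.

```php
<?php
/**
 * Hypothetical wrapper: parses the result XML, suppressing PHP warnings
 * only when debugging is off.
 */
function example_parse_result_xml($resultXML, $debug_level = 0) {
  if ($debug_level > 0) {
    // Debug levels 1 and 2: let PHP report malformed XML as warnings.
    return simplexml_load_string($resultXML);
  }
  // Debugging disabled: hide the warnings. A parse failure is still
  // signalled by the FALSE return value, so callers should check for it.
  return @simplexml_load_string($resultXML);
}

$payload = example_parse_result_xml('<GSP><RES></RES></GSP>', 0);
if ($payload === FALSE) {
  // Handle the failed parse here, e.g. show "no results".
}
```

Either way the caller still has to test for FALSE, since suppressing the warnings does not make a malformed response parse.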

The curl and SimpleXML dependencies should also be added to the README.txt. This may be part of your patch as well.
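
Besides documenting them, the two dependencies could also be checked at runtime. The following is only a sketch: the function name example_missing_dependencies() is made up, and in a Drupal 6 module a check like this would more likely be reported through hook_requirements().

```php
<?php
/**
 * Hypothetical check: returns the labels of required PHP extensions
 * whose key functions are not available.
 */
function example_missing_dependencies() {
  // Map a human-readable label to a function that extension provides.
  $required = array(
    'cURL' => 'curl_init',
    'SimpleXML' => 'simplexml_load_string',
  );

  $missing = array();
  foreach ($required as $label => $function) {
    if (!function_exists($function)) {
      $missing[] = $label;
    }
  }
  return $missing;
}

$missing = example_missing_dependencies();
if (!empty($missing)) {
  print 'Missing PHP extensions: ' . implode(', ', $missing) . "\n";
}
```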

#2

SnowyR - August 24, 2009 - 22:56

Sure, sounds good. I'll see if I can get that (patch file) in over the next week or so.

I'm interested in what you mentioned about your Google Mini not having subcollections (which will be a real drag if we have to replace the old server). I'm fairly sure the feature is visible from all Google user account roles.

For interest, this is where I find it on my system. When I log into the Google Mini, from the 'Main' window, under 'Manage Existing Collections',
I select a collection and click 'View/Edit'.
From this view I have four main tabs:
'Configure Crawl', 'Configure Serving', 'System Status', and 'Reporting'

Under the first tab 'Configure Crawl' I then see five links:
'URLs to Crawl', 'Crawler Parameters', 'Crawler Access', 'Subcollections', and 'Schedule'

Under the 'Subcollections' link it allows for searching within a subset of the collection which you can configure with various URL pattern matches.

#3

larskleiner - August 27, 2009 - 14:10

It's completely different here on my Google Mini M2; see the attached screen grab. I can match up the collections configured here with the URL patterns specified under "Crawls URLs". It looks like Google dropped the concept of subcollections in favour of multiple collections.

Attachment Size
gm.gif 42.76 KB
 
 
