Add config option to restrict by subcollection
| Project: | Google Search Appliance |
| Version: | 6.x-2.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | active |
Not sure how this works in other GoogleMini configurations (using multiple collections) and if this would assist with future changes to your module...
In my case I have one licenced collection and use the subcollections feature to restrict/filter search results for various websites so as not to show all the search results from unapproved/protected websites outside of our corporate intranet. This is also using one of the early GoogleMini models (MID-00235).
Modifying your 6.x-2.x-dev module (which works very well):
- introduced two site variables
google_appliance_default_subcollection
google_appliance_restrict_by_subcollection_enabled
- this update is geared/tested towards situations where we only have one collection in Google Mini and we restrict access by using the subcollection option
- enabling the 'restrict by subcollection' option adds the parameter
&restrict=subcollectionname
which filters the results to the subcollection named in the 'default_subcollection'
- if 'default_subcollection' is empty, the results are unaffected
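For illustration, the behaviour described in the list above could be sketched roughly as follows. This is not the code from the attached file: the function name `example_build_search_query` is hypothetical, and the `site`, `client`, and `output` values are assumptions about what a typical query to the appliance looks like. In the module itself the last two arguments would presumably be read with `variable_get()` from the two site variables named above.

```php
<?php
// Hypothetical helper, not the module's actual code.
// In Drupal 6 the last two arguments would presumably come from
// variable_get('google_appliance_restrict_by_subcollection_enabled', FALSE)
// and variable_get('google_appliance_default_subcollection', '').
function example_build_search_query($keys, $collection, $restrict_enabled, $default_subcollection) {
  // Base parameters; the 'client' and 'output' values are assumed.
  $params = array(
    'q' => $keys,
    'site' => $collection,
    'client' => 'default_frontend',
    'output' => 'xml_no_dtd',
  );
  // Only add &restrict=... when the option is enabled AND a
  // subcollection name is set; otherwise results are unaffected.
  if ($restrict_enabled && $default_subcollection !== '') {
    $params['restrict'] = $default_subcollection;
  }
  // Force a plain '&' separator regardless of the php.ini setting.
  return http_build_query($params, '', '&');
}

echo example_build_search_query('drupal', 'default_collection', TRUE, 'intranet'), "\n";
```

With the option enabled and a subcollection named `intranet`, the query string ends in `&restrict=intranet`; with an empty subcollection name the parameter is left out.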
Note:
- it might be helpful to add to the module notes that CURL and SimpleXML are dependencies
| Attachment | Size |
|---|---|
| subcollection-modifications-to-6x-2xdev.txt | 7.3 KB |

#1
I can't find a subcollection feature in my Google Mini M2 model. Could it be that this feature got replaced/renamed or maybe moved into GSA? If we need to address different features for different Google Mini (MID and M2) and potentially GSA versions, I suggest adding something like a select box to the admin area to specify the Mini/GSA version, so that features can be enabled and disabled for that version.
It would be great if you can supply the changes as a patch so that we can apply and test it.
The branch tag will most likely stay at 6.x-2.x-dev unless your patch brings about really major changes that will justify a version number increment.
I'd rather leave
$payload = simplexml_load_string($resultXML);
in GoogleMini.php as it is because it's better for debugging. Maybe we should only leave it for debug levels 1 and 2 but change it to
@$payload = simplexml_load_string($resultXML);
when debug is disabled?
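One possible way to switch on the debug level is sketched below. The helper name `example_parse_result_xml` and the `$debug_level` argument are hypothetical, with 0 assumed to mean that debugging is disabled. Note that this sketch swaps the `@` operator for PHP's `libxml_use_internal_errors()`, which keeps parse warnings out of the page without silencing unrelated errors; it is an alternative to consider, not what the module currently does.

```php
<?php
// Hypothetical helper; $debug_level is assumed to mirror the module's
// debug levels (0 = disabled, 1 and 2 = debugging on).
function example_parse_result_xml($resultXML, $debug_level) {
  if ($debug_level > 0) {
    // Debugging on: let PHP emit XML parse warnings as usual.
    return simplexml_load_string($resultXML);
  }
  // Debugging off: collect libxml errors internally instead of
  // printing warnings, then restore the previous setting.
  $previous = libxml_use_internal_errors(TRUE);
  $payload = simplexml_load_string($resultXML);
  libxml_clear_errors();
  libxml_use_internal_errors($previous);
  // FALSE on malformed XML, a SimpleXMLElement otherwise.
  return $payload;
}
```

Either way the caller still has to check for a `FALSE` return value before using `$payload`.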
The curl and SimpleXML dependencies should also be added to the README.txt. This may be part of your patch as well.
#2
sure sounds good - I'll see if I can get that (patch file) in the next week or so.
I'm interested in what you mentioned about your Google Mini not having subcollections (which will be a real drag if we have to replace the old server). I'm fairly sure the feature is visible to all Google user account roles.
For interest, this is where I find it on my system when I log into the Google Mini. From the 'Main' window, under 'Manage Existing Collections':
Select a collection and click 'View/Edit'
From this view I have four main tabs:
'Configure Crawl', 'Configure Serving', 'System Status', and 'Reporting'
Under the first tab 'Configure Crawl' I then see five links:
'URLs to Crawl', 'Crawler Parameters', 'Crawler Access', 'Subcollections', and 'Schedule'
Under the 'Subcollections' link it allows for searching within a subset of the collection which you can configure with various URL pattern matches.
#3
Completely different here on my Google Mini M2, see attached screen grab. I can match up the collections configured here with URL patterns specified under "Crawls URLs". Looks like Google dropped the concept of subcollections in favour of multiple collections.