Posted by danithaca on July 6, 2009 at 2:03pm
| Project: | Drupal.org infrastructure |
| Component: | Drupal.org module |
| Category: | feature request |
| Priority: | normal |
| Assigned: | danithaca |
| Status: | needs review |
Issue Summary
Pivots currently uses its own indexer to detect module mentions in forum discussions. We plan to migrate it to ApacheSolr.
Two possible ways to do it:
1) Query the Solr server directly from the pivots program running on scratch.d.o. This would not affect d.o performance, but the program would need to read the Solr connection parameters, which might introduce security concerns.
2) Query the Solr server using the ApacheSolr module API on d.o. This might add slightly more workload to d.o.
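As a rough illustration of option 1, a script on scratch.d.o could build a request against Solr's standard /select handler directly. This is only a sketch: the host solr.example.org, the port, and the function name are placeholders and not the real d.o setup. The parameter names (q, fq, fl, rows, wt) are standard Solr query parameters.

```php
<?php
/**
 * Build a URL for Solr's /select handler.
 *
 * $base_url is a placeholder; the real Solr host is not given in this thread.
 */
function example_solr_select_url($base_url, $query, array $params = array()) {
  $defaults = array(
    'q' => $query,
    'fl' => 'id,nid,title',
    'rows' => 8,
    'wt' => 'json',
  );
  // Caller-supplied parameters override the defaults.
  $all = array_merge($defaults, $params);
  // PHP_QUERY_RFC3986 encodes spaces as %20 rather than "+".
  return rtrim($base_url, '/') . '/select?' . http_build_query($all, '', '&', PHP_QUERY_RFC3986);
}

$url = example_solr_select_url(
  'http://solr.example.org:8983/solr',
  '"project/cck" "cck module"',
  array('fq' => 'type:forum')
);
// Fetching is left out on purpose; an HTTP client call would go here.
print $url . "\n";
```

Option 2 would skip the URL handling entirely, since the ApacheSolr module on d.o already knows how to reach the server.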
Comments? Thanks.
Comments
#1
Narayan and Damien know the setup better, but I think the security is just IP/firewall based. The main concern with adding it to scratch, or to some other server that might get new development code, is that it would be better to prohibit updates to the index. If the firewall can do path matching, that should be pretty easy: allow only /solr/select.
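The read-only idea comes down to an allowlist on the request path. The real rule would live in the firewall or a proxy in front of Solr, whose configuration is not shown in this thread, so the following is only a sketch of the matching logic with a hypothetical function name.

```php
<?php
/**
 * Return TRUE only for requests to the read-only /solr/select handler.
 *
 * Anything else (for example /solr/update) is rejected.
 */
function example_is_allowed_solr_path($request_uri) {
  $path = parse_url($request_uri, PHP_URL_PATH);
  if (!is_string($path)) {
    return FALSE;
  }
  // Exact match, so "/solr/selectx" or "/solr/select/../update" do not pass.
  return $path === '/solr/select';
}
```

An exact match is stricter than a prefix match, which seems like the safer default when the goal is to block index updates.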
#2
Does solution 2 allow us to completely skip the pivot program when fetching results to be displayed on a project page? If so, let's do that!
#3
I wrote the code. On a project page, the code sends a query to ApacheSolr and displays forum posts that mention the project. It looks exactly like the pivots "Related discussions" block, but uses the ApacheSolr index.
#4
This is great news. Perhaps we can incorporate both Pivots and additional Solr techniques for improving recommendations.
#5
Indeed - we want to retain as much specificity as possible. We are indexing lots of data in Solr that the normal search may not be using - in particular, the tags_a field in the schema might be useful for finding posts where a poster has actually pasted a link to the project.
#6
Thanks for the suggestion, pwolanin. The thing is, we are not only looking for links to modules, but also for aliases of the modules (e.g. "ga module" for the "google analytics" module). We thought those aliases would help detect more module mentions in forum posts.
Attached is the latest code. I also pasted it to http://danithaca.pastebin.com/m3bad4da5 for easy viewing.
@pwolanin @DamZ: when you have time, would you please review the code? Then I'll add it to CVS and replace the "Related discussions" block on d.o. Thanks!
#7
Subscribe. This is awesome. I would much rather not run this on scratch, but on d.o proper.
#8
code working on http://staging8.drupal.org/project/cck.
solr index doesn't seem to work well on staging8. but the basic code should work.
after code review, I guess it can be deployed to d.o. to replace the current "related discussion" block.
#9
The index seems to be working at: http://staging8.drupal.org/admin/settings/apachesolr/index "Your site has contacted the Apache Solr server." "The search index is generated by running cron. 100% of the site content has been sent to the server. There are 0 items left to send." This indicates that Solr is configured correctly.
When I look at: http://drupal.org/project/cck there is a recommendations block with both related discussions and related projects. Let's get related projects and related documentation working on staging8.drupal.org before pushing to drupal.org.
#10
We are now waiting for the staging8.drupal.org settings.php file to be updated with database information to access the pivots database on master-other.drupal.org.
We should also estimate the resources that pivots will consume when it uses the Solr indexer. Overall, switching from two indexers to one should reduce resource use: the pivots indexer runs on scratchvm and Solr runs on two servers. But it would be nice to know how many queries are currently being made against the existing pivots index.
Also, will the batch updating of the pivots database from the Solr index still require a JVM on stagingvm?
How many index resources does updating the database require today? I know that the pivots block currently implements caching, so it should be manageable. We just want to avoid any performance or scalability surprises.
#11
@Amazon: The "related discussions" block runs entirely on the Solr index. It doesn't require a JVM on stagingvm/scratchvm at all. It also has the "block cache" enabled so that it won't issue too many Solr queries.
Also, our research team proposed separating the "Related discussions" block from the "Related projects" block. The "Related discussions" block would be a standalone module running on the Solr index. The code is ready and is waiting for code review and testing on scratch.d.o; after that it could be deployed on d.o.
The "Related projects" block would be implemented using the new algorithm described in #479812: Web 2.0 enhancement to the "Related Modules" block
#12
I'm pasting the code here to facilitate code review.
The code is working on staging8, e.g., http://staging8.drupal.org/project/cck. It shows 2 blocks "Related discussions" and "Related documentation".
<?php
// $Id$

/**
 * This module was developed with support from the National Science
 * Foundation under award IIS-0812042. Any opinions, findings, and conclusions
 * or recommendations expressed or embodied in this software are those of the
 * author(s) and do not necessarily reflect the views of the National Science
 * Foundation.
 */

/**
 * Implementation of hook_block().
 */
function pivots_discussion_block($op = 'list', $delta = 0, $edit = array()) {
  switch ($op) {
    case 'list':
      $blocks[0]['info'] = t('Pivots: related discussions');
      $blocks[0]['cache'] = BLOCK_CACHE_PER_PAGE; // cache per page regardless of users
      $blocks[1]['info'] = t('Pivots: related documentation');
      $blocks[1]['cache'] = BLOCK_CACHE_PER_PAGE; // cache per page regardless of users
      return $blocks;

    case 'view':
      if ($delta == 0) {
        $block['subject'] = t('Related discussions');
      }
      elseif ($delta == 1) {
        $block['subject'] = t('Related documentation');
      }
      $block['content'] = pivots_discussion_output($delta);
      return $block;
  }
}

/**
 * Build the block content: a list of forum posts ($delta 0) or handbook
 * pages ($delta 1) that mention the project shown on the current page.
 */
function pivots_discussion_output($delta = 0) {
  // skip nid=3060, the drupal project
  if (($node = project_get_project_from_menu()) && ($node->nid != 3060)) {
    // fixed heuristics for a module/theme match
    $search_terms = array(
      "project/{$node->project['uri']}",
      "node/{$node->nid}",
      "{$node->title} module",
      "module {$node->title}",
      "{$node->title} theme",
      "theme {$node->title}",
    );

    // module aliases manually added by researchers, e.g. "AdvForums" for the Advanced Forums module.
    // we use the module aliases as Solr search query terms as well.
    db_set_active('pivots'); // NOTE: here we activate the pivots database.
    $aliases = @db_query("SELECT DISTINCT alias FROM {pivots_alias} WHERE nid=%d", $node->nid);
    db_set_active(); // switch back to the default database.

    // add aliases to the search terms
    while ($alias = @db_fetch_array($aliases)) {
      $search_terms[] = $alias['alias'];
    }

    // compile search terms into query string
    $keys = implode(' ', array_map('_pivots_discussion_wrap_term', $search_terms));
    $params = array(
      'fl' => 'id,nid,title',
      'start' => 0,
      'rows' => variable_get('pivots_discussion_max_display', 8),
      'mm' => 1, // search terms are ORed; matching one is enough.
    );
    if ($delta == 0) {
      $params['fq'] = 'type:forum'; // limit items to forum posts only
      $params['sort'] = 'last_comment_or_change desc'; // ordered by last_comment_or_change
    }
    elseif ($delta == 1) {
      $params['fq'] = 'type:book'; // limit to handbook documentation only
      $params['sort'] = 'score desc'; // ordered by relevancy score
    }

    // search for module/theme mentions in forum posts or documentation. Solr errors are ignored.
    try {
      // search with solr
      $solr = apachesolr_get_solr();
      $response = $solr->search($keys, $params['start'], $params['rows'], $params);
      // get responses from solr
      $total = $response->response->numFound;
      if ($total > 0) {
        $docs = $response->response->docs;
      }
    }
    catch (Exception $e) {
      // if anything goes wrong with solr, just return nothing.
      return;
    }

    // output results list.
    $items = array();
    if (!empty($docs)) {
      foreach ($docs as $doc) {
        $items[] = l($doc->title, "node/{$doc->nid}");
      }
      return theme('item_list', $items);
    }
  }
}

// wrap search terms in double quotes for phrase search
function _pivots_discussion_wrap_term($term) {
  return '"'. check_plain($term) .'"';
}
?>
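One detail in _pivots_discussion_wrap_term() may be worth a second look during review: check_plain() escapes for HTML output, not for the Solr query syntax. Inside a quoted Solr/Lucene phrase, the characters that need escaping are the double quote and the backslash. A possible alternative is shown below as a standalone sketch with a hypothetical function name; it is not a tested patch against the module.

```php
<?php
/**
 * Wrap a term in double quotes for a Solr phrase search.
 *
 * Backslashes and double quotes inside the term are escaped with a
 * backslash so they cannot end the phrase early.
 */
function example_solr_phrase($term) {
  $escaped = preg_replace('/(["\\\\])/', '\\\\$1', $term);
  return '"' . $escaped . '"';
}
```

With this approach an alias such as "ga module" passes through unchanged apart from the surrounding quotes, while an alias containing a quote character stays inside a single phrase.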
#13
#14
SVN access:
https://svn.drupal.org/drupal/redesign_modules_sandbox/staging8/custom/p...