Posted by wuwei23 on August 28, 2009 at 3:45am
| Project: | Apache Solr Attachments |
| Version: | 6.x-1.x-dev |
| Component: | Code |
| Category: | feature request |
| Priority: | normal |
| Assigned: | Unassigned |
| Status: | needs review |
Issue Summary
Hey everyone,
Fantastic module that is working rather nicely in our dev environment.
For this project, however, the clients want the site to search within attachments but to return the nodes they're attached to rather than direct links to the matching attachments.
I'm happy to take a shot at coding this myself, but thought I should check here first to make sure I haven't missed anything. Any pointers on where to start would likewise be appreciated :)
Thanks.
Comments
#1
This is pretty easy - you just need to alter each result to make a link from the nid, rather than using the link to the file.
#2
Hey pwolanin,
Thanks for replying. This is the approach I've settled on for the moment and will do for the short term.
Unfortunately, the simple approach results in duplicate entries, as Solr can match on both the node & the attachment if both contain the search terms. And as we're attaching a lot of metadata to the node, we need to be able to search both. So I'm currently stripping dupes from the results just prior to rendering, but this has the downside of making the reported numbers (count of matches, terms matched, etc.) incorrect.
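A minimal sketch of that render-time stripping, assuming each result is an array carrying the nid of the node it belongs to. The function name and the result shape are hypothetical, not the module's actual structures:

```php
<?php
// Hypothetical helper: keep only the first result seen for each nid.
// The 'nid' and 'type' keys are assumptions made for illustration.
function example_dedupe_results_by_nid(array $results) {
  $seen = array();
  $unique = array();
  foreach ($results as $result) {
    $nid = $result['nid'];
    if (isset($seen[$nid])) {
      // A node or attachment result for this nid was already kept.
      continue;
    }
    $seen[$nid] = TRUE;
    $unique[] = $result;
  }
  return $unique;
}

// Node 12 matches twice: once as the node, once as its attachment.
$results = array(
  array('nid' => 12, 'type' => 'node'),
  array('nid' => 12, 'type' => 'file'),
  array('nid' => 40, 'type' => 'file'),
);
$unique = example_dedupe_results_by_nid($results);
```

The count mismatch shows up directly: the search still reports `count($results)` matches while only `count($unique)` rows are rendered.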
I can't focus on this right now - I've got to demo the base functionality early next week - but I'm wondering if dealing with this at the point of indexing might not be better.
#3
Other people are taking the option of indexing all the extracted attachment text directly in a single document with the node text, so there is never more than one result.
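As a rough sketch of that idea, assuming the attachment text has already been extracted to plain strings (the function name is hypothetical and no Solr or Drupal API is involved), the node text and attachment text are joined into one body before being indexed as a single document:

```php
<?php
// Hypothetical helper: build one text body for a single Solr document
// from the node's own text plus the text of each attachment.
function example_combine_text($node_text, array $attachment_texts) {
  $parts = array_merge(array($node_text), $attachment_texts);
  // Trim each part and drop the ones that end up empty.
  $parts = array_filter(array_map('trim', $parts), 'strlen');
  return implode("\n", $parts);
}

$combined = example_combine_text('Node body', array(" PDF text \n", ''));
```

Because there is only one document per node, a search term found in either the node or one of its attachments produces a single result.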
#4
Ah cheers, this was the approach I'd originally planned to take until I discovered this module :) As it stands, there may be a new requirement for some of the content to be stored off-site but still searchable.
pwolanin, when you say "other people", are you referring to any public modules? Or do you mean hand-rolled, stand-alone code?
#5
The main person I know trying this is EclipseGC. I'm not sure about others - you might also talk to janusman.
#6
Thanks, pwolanin, I'll chase them up. I appreciate the feedback.
Should I close this? Or should I add what I learn to it?
#7
Feel free to add - then we can at least improve the documentation, etc.
#10
Hi guys -- checking to see if there's any progress here or if you can provide more pointers on how to do indexing differently or somesuch. The problem I have on the site I'm developing is the client has different content types (say Article, Report, Editorial for instance) where the "full" versions of the items are actually in attached PDFs. They would like users to be able to narrow by type while searching. Unfortunately, if you choose the Report content type filter, for example, while searching for a given keyword, the PDFs aren't included. Any help, ideas for things to look at, etc. would be greatly appreciated.
Thanks!
#11
In the interest of being clear, here's an example:
I have an "Article" item with an attached file in a field called Content PDF. The title of the Article includes the word "foo" as does the PDF file. However, only the PDF file contains the word "bar".
If I search for "foo", I'll see two results -- one links to the Article item, and the other to the PDF (because both contain foo, and presumably they are indexed separately). While this is technically a duplicate result in our case (and I think the sort of thing that started this thread), we can live with that for now if we must.
However, if at this point we look at the facet links on the results page to "Filter by type", the "Article" filter link only shows a count of 1. If indeed we click that link, we'll only see the Article item (the PDF result is no longer included). This problem gets worse when we search for "bar" and choose to filter by the "Article" type: we get 0 results because the term isn't part of the non-attached item.
I hope this makes sense and better illustrates the problem in our particular case.
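The behaviour can be reproduced with a small stand-alone simulation. This is only a sketch: it assumes the attachment is indexed as its own document with no content type, and the document arrays and function name are hypothetical rather than taken from the module:

```php
<?php
// Two index entries: the Article node and its separately indexed PDF.
// Assumption: the attachment document carries no content type.
$docs = array(
  array('id' => 'node/1', 'type' => 'article', 'text' => 'foo title'),
  array('id' => 'file/7', 'type' => NULL, 'text' => 'foo bar pdf text'),
);

// Hypothetical search: keyword match plus an optional type filter.
function example_search(array $docs, $keyword, $type = NULL) {
  $matches = array();
  foreach ($docs as $doc) {
    if (strpos($doc['text'], $keyword) === FALSE) {
      continue;
    }
    if ($type !== NULL && $doc['type'] !== $type) {
      // The type filter drops every document without that type,
      // including the attachment.
      continue;
    }
    $matches[] = $doc['id'];
  }
  return $matches;
}

$foo_all = example_search($docs, 'foo');
$foo_article = example_search($docs, 'foo', 'article');
$bar_article = example_search($docs, 'bar', 'article');
```

Here `$foo_all` holds both entries, `$foo_article` holds only the node, and `$bar_article` is empty, which matches the three cases described above.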
#13
For what it's worth, I worked around the issue by writing a module that implements hook_nodeapi() and responds to the "update index" operation to allow the contents of my attached files to be indexed as part of the node they're attached to.
From the manual (http://api.drupal.org/api/function/hook_nodeapi):
"'update index': The node is being indexed. If you want additional
information to be indexed which is not already visible through
nodeapi "view", then you should return it here."
I borrowed code from apachesolr_attachments_add_documents so that, in theory, I could disable indexing the attachments separately and resolve the "duplicate" issue as well (though we've not chosen to implement that as yet).
I suppose if we were looking at solving the problem generically, we could decide how attachments would be indexed on a per item type or even per field basis. I don't know if there's a better way to handle it in order to give more control over the weight the attachments would carry in determining search results.
In any case -- I hope this helps, and would appreciate feedback/concerns with this approach.
<?php
/*
 * Implementation of hook_nodeapi() that returns contents of file attachments
 * borrowed from apachesolr_attachments_add_documents
 */
function my_attachments_nodeapi( &$node, $op, $a3 = NULL, $a4 = NULL ){
  $rv = '';
  if( $op == 'update index' ){
    include_once(drupal_get_path('module', 'apachesolr') .'/apachesolr.index.inc');
    // Since there is no notification for an attachment being unassociated with a
    // node (but that action will trigger it to be indexed again), we check for
    // fids that were added before but no longer present on this node.
    $fids = array();
    $result = db_query("SELECT fid FROM {apachesolr_attachments_files} WHERE nid = %d", $node->nid);
    while ($row = db_fetch_array($result)) {
      $fids[$row['fid']] = $row['fid'];
    }
    $files = _asa_get_indexable_files($node);
    // Find deleted files.
    $missing_fids = array_diff_key($fids, $files);
    if ($missing_fids) {
      db_query("UPDATE {apachesolr_attachments_files} SET removed = 1 WHERE fid IN (". db_placeholders($missing_fids) .")", $missing_fids);
    }
    $new_files = array_diff_key($files, $fids);
    // Add new files.
    foreach ($new_files as $file) {
      db_query("INSERT INTO {apachesolr_attachments_files} (fid, nid, removed, sha1) VALUES (%d, %d, 0, '')", $file->fid, $node->nid);
    }
    foreach ($files as $file) {
      // error_log( "my_attachments_nodeapi extracting content from: " . $file->filepath );
      $rv .= "\r\n " . _asa_get_attachment_text($file);
    }
    // $breakpoint = strlen($rv);
    // if( $breakpoint > 100 ){
    //   $breakpoint = 100;
    // }
    // error_log( "my_attachments_nodeapi returning for node " . $node->nid . ": " . substr($rv, 0, $breakpoint) . "..." );
  }
  return $rv;
}
?>
#14
I'm also looking into a drupal-friendly way of having a single solr document to cover both the parent node and attached files.
Last night, I had to very quickly solve the problem outlined above where attachments disappear from search results when filtering by content type by using the modify query hook. I was already using it to provide OR functionality for type filters. I've left that in as it may also be useful to people. It's pretty rough and ready as I'm only using it so I can search using a querystring like this:
filters=tid:32&or_filters=type:news_article type:press_release
<?php
function modulename_apachesolr_modify_query(&$query, &$params, $caller) {
  // Guard against a missing parameter to avoid an undefined index notice.
  $or_filters = isset($_GET['or_filters']) ? $_GET['or_filters'] : '';
  if (!empty($or_filters)) {
    // this should be more generic rather than specifying 'type' explicitly
    $filters = Solr_Base_Query::filter_extract($or_filters, 'type');
    $subquery = apachesolr_drupal_query();
    foreach($filters as $filter) {
      list($key, $val) = explode(':', $filter['#query']);
      $subquery->add_filter($key, $val);
      // this bit adds a new OR filter based on the mime facet which is used by solr documents created for attachments
      if ($val == 'document') {
        $subquery->add_filter('ss_filemime', '[* TO *]');
      }
    }
    if (!empty($subquery)) {
      $query->add_subquery($subquery, 'OR');
    }
  }
}
?>
On the subject of consolidating attachments and parent nodes in the index, I can see how Brant's code above would work -- I'm considering something similar myself -- but what is a 'nice' way of suppressing the attachments module so it doesn't add the attachment to the index too but keeps the other bits?
It would be handy to have an admin setting (or maybe one per content type) to dictate whether attachments should be indexed separately or combined with the parent. Do you guys reckon this would be a good idea and worth me developing a patch for?
#15
We needed to add the attachments to each content type, not as separate entities, so here's an initial stab at adding a 'per content type indexing' setting.
This is against 6.x-1.0
#16
#17
oops - posted to wrong issue
#18
#19
Indexing will fail on cron because the apachesolr_clean_text() function is needed. Attached patch sorts this by including apachesolr.index.inc.
#20
I had the same requirement as the OP, but we 'remodeled' the returned attachment result to resemble a node result but with the attachment details beneath. We now have the duplicate result challenge, so really it makes sense in our use case to filter out the node results where an attachment result is present, or to index the attachment content with the node and return a result that contains a link and description of the attachment as it currently does (so that the process is shortcut from search to attachment download).
Thoughts?
#21
@Dave the Brave - patch #19 provides the means to index the attachments 'onto' the nodes. I'm not sure if it is possible to ask solr to return the attachments to a node so that they could just be themed onto the search results, but that would be a way to give a direct link to the attachment, in the same way that you get a direct link to the parent node on attachment results today.
#22
I've been playing with this and have cleaned it up a bit.
#23
Thanks for pushing this forward.
I'm not so fond of APACHESOLR_ATTACHMENTS_MODE_SEPARATE_ENTITY type constants given that they are only used a couple times - maybe readable strings would make more sense
also, almost seems like people might want a per-node option. Not in the UI necessarily, but a hook so it would be possible?
also please fix up:
$rv .= "\r\n " . apachesolr_attachments_get_attachment_text($file);
variable naming and why the \r?
#24
Ok, this should address the issues brought up in #23.
Talked this through with pwolanin on IRC; there was some discussion of maybe storing it differently on the solr objects, so there may be more work to come.
#25
oh and uninstall hook.
#26
#27
@swati_patel_8497 it looks like maybe you have confused modules. This issue is for apachesolr_attachments and it looks like you're using search_files. Also, your question sounds like a support request unrelated to this issue, which is developing a new feature.
#28
I applied the patch in #25 to version 6.x-1.0-beta2 and it didn't work - still displaying attachment file instead of node or both attachment and parent node.
I applied the patch manually on apachesolr_attachments.admin.inc due to conflicts, but I'm pretty sure I applied it in the right places. I deleted files from the index and also deleted the cached file text before reindexing.
Any idea?
#29
I'll see about providing an updated patch and see if it helps.
#30
I applied the patch in #25 and selected and set "Attachments as part of parent node" in admin/settings/apachesolr/attachments/content_type for a CCK content type and then I reindexed all the file attachments by clicking the button in admin/settings/apachesolr/attachments.
When I run cron manually I get a blank page (not even a redirect) and the warning message "Cron run exceeded the time limit and was aborted." appears in the logs. This happens very quickly, before the cron run should timeout.
I'm using tika-0.3-standalone.jar to index files, although I've also tried tika-0.3.jar and tika-app-0.7.jar with the same results.
#31
@neclimdul it will be a great help! Thanks.
#32
Patch in #25 works great for me with 1.0-beta2.
@#28, After applying the patch, did you clear your caches, and configure the per-content type settings?
@#30, I don't think it's a problem with the patch, per se. You'll need to increase your server's PHP time limits, and/or reduce the number of items to index per cron run at admin/settings/apachesolr.
#33
I found a solution to this with theming functions, and documented it here:
https://foss.stat.ubc.ca/ubc-dug/blog/mjoyce/customizing-apache-solr-sea...
If there's a better way to accomplish this please let me know.
#34
I am indexing nodes and attachments. Unfortunately I get duplicate results because (I believe) the hits are often for attachments AND nodes because they contain those matching attachments. It's 2 lines of code to cull the search results in the theme, but I am convinced this is the wrong place to do this since the filters still show the wrong counts. It needs to happen in a Solr-related module. I have installed the patch and it did not change anything... because I believe my issue is different from the one that is the basis for this thread. Is that correct? If so, should I start another issue?
#35
No, that does sound like the issue. My guess is you probably didn't rebuild your index. To clarify, the patch basically adds options on how solr indexes your attachments. One is to attach the documents to the node so you're searching both at the same time and only have one result. This sounds like what you want. You'll have to change this in the settings. After you do this change you'll probably want to drop and rebuild your index. You at /least/ want to rebuild but I don't remember what would happen to the dangling attachment items so a full drop of the index might be best.
#36
Is there a patch for the 6.x-2.x-dev branch to return the node the file is attached to, without displaying duplicate results for the node and/or attachment?
#37
#38
Reroll
@michael121 - no I haven't looked at doing that
@jpmckinney was there a reason you moved it down to needs review or just because?
#39
@neclimdul It takes more than one person to get from "needs work" to "reviewed & tested by the community" :) However, it looks like it had been incorrectly left as "needs work", when it should have been "needs review", so maybe it is "reviewed & tested by the community".
#40
Hi,
Is there a version of this patch for the 2011-May-26 6.x-1.0-beta3 release, or what do you recommend?
thanks!
#41
this isn't a bug with -beta3. it really needs to be rerolled against the latest -dev.
#42
Rerolled for 6.x-2.
#43
This feature is a must. I've ported it to Drupal 7, might need more testing, but it seems to be working.
#44
@franz when making a patch against an issue that's for a different branch, could you name the patch accordingly so it's clear? http://drupal.org/patch/submit#patch_naming
#45
When I thought about it, it was too late...
#46
Had to fix some things. I tested using tika 0.10 pre-built, it is working fine so far.
#48
I created a solution for this some time ago, though I did not think I was solving an actual problem. I overrode theme_preprocess_search_results() to find when a result was not a bona fide node, and ignored it if it wasn't. Some may not think of this as a wholly correct solution, but it worked very well on quite a large (in the thousands) database of documents.
Just my 2 cents.
#49
It doesn't really work because, as with anything when trimming at the theme layer, you end up with odd page lengths and broken paging functionality.
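A small stand-alone sketch of why that happens, assuming the search backend has already split the hits into pages before the theme layer drops the non-node results. The hit list and function name are made up for illustration:

```php
<?php
// Six hits as returned by the search backend, three per page.
$hits = array('node', 'file', 'file', 'node', 'node', 'node');
$per_page = 3;

// Hypothetical theme-layer filter: keep only bona fide node results.
function example_keep_nodes(array $page) {
  return array_values(array_filter($page, function ($kind) {
    return $kind === 'node';
  }));
}

// Paging happens first, filtering second, so page lengths vary.
$page_sizes = array();
foreach (array_chunk($hits, $per_page) as $page) {
  $page_sizes[] = count(example_keep_nodes($page));
}
```

With this data the first page renders one row and the second renders three, while the pager is still built from the original six hits.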
#50
Yeah, I'm back here because I spoke too soon. I just realized what you wrote above, went back to an old project and found the d6 patch I had applied. Thankfully, someone's already working on a d7 version :)
#51
nevermind...
I realized that the D7 patch gives the admin options on how to select what combination of node and/or attached files are indexed.
If you apply this patch and it's still not working, see: admin/config/search/apachesolr/attachments/content_type
#52
nevermind...
#53
Rerolled the patch for 6.x-1.x-dev if anyone is interested.
#54
Oops, missed a line that's sort of important. Round two.
#55
I don't see a 6.x-2.x-dev version of this module, so not sure why this thread is geared toward that. Maybe I'm missing something and someone can point it out for me.
I installed the patch (apachesolr-attachments-attach-to-node-56182-6x1x-01.patch) on the 6.x-1.x-dev version and everything appears to have run correctly but I still don't see an option to show the node that the attachment pertains to. Where can I find that setting or is there something else I need to do?
#56
As mentioned on the Apache Solr Search Integration module page, 2.x has been abandoned for some time.
#57
Thanks, I guess I overlooked that.
Still, any idea why the patch might not be working for 6.x-1.x-dev version?
#58
That is the exact patch I'm using on my site right now, so not really. I applied it against the git branch 6.x-1.x, so if 6.x-1.x-dev is not the same as the current git repo then you may have issues. The other thing that I did (but it shouldn't affect you) is make the changes discussed here: #1387240: File attached to multiple nodes causes failure. I can provide that code if you'd like, but again, it shouldn't cause any errors for you unless you have one file attached to two or more nodes.
The only thing I'd suggest is to make sure that you clear your file AND node index. Now, in order to index a file, you must be indexing its corresponding node, so you'll need to re-index all your nodes.