this is a stub. I am currently running a few sites using the Apache Solr module ecosystem, based on Solr cores generated from Nutch web crawls, and I want to explore Search API Solr integration
the Nutch module I currently use is
http://drupalcode.org/project/apachesolr_examples.git/tree/HEAD:/apaches...
based on an earlier sandbox
https://drupal.org/sandbox/cilefen/1858412
I am bouncing around the Apache Solr ecosystem without getting much traction, and although I guess the Nutch schema doesn't work out of the box with Search API, I'd rather spend my energy and time getting Search API working than digging deeper into the ApacheSolr module
for one thing, Nutch now has a pluggable search backend, i.e. Elasticsearch or Solr are currently implemented
also, the ApacheSolr module seems to be a maze of implementations and legacy
my own research on this is at http://www.bigdatadrupal.com/ and http://niccolox.org/blog/big-data-drupal
will post more as I do further research in coming weeks
Comments
Comment #1
niccolox commented
this is the juice, btw: solrindex-mapping.xml
http://drupalcode.org/project/apachesolr_examples.git/blob/HEAD:/apaches...
Comment #2
drunken monkey commented
Have you tried Sarnia? That's an extension of this module which can work with arbitrary Solr indexes.
Otherwise, if the schema is fixed, you could implement a datasource for adding a type with that schema to the Search API. For natively working with the Solr search backend, though, you'd have to add some more <copyField> directives to make all the data available in the fields in which the Solr backend expects it. Or you could implement hook_search_api_solr_field_mapping_alter(), which might be easier, and would certainly save memory.
Comment #3
niccolox commented
personally I haven't tried Sarnia, but the original sandbox developer had this feedback:
https://drupal.org/node/1910876#comment-7358620
thanks for your suggestions, will see what I can do
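For reference, the &lt;copyField&gt; approach suggested in #2 would look something like this in schema.xml; the Nutch source fields and the Search API dynamic-field targets shown here are assumptions for illustration, not the actual mapping:

```xml
<!-- Hypothetical schema.xml fragment: copy the fields Nutch populates
     into the dynamic fields the Search API Solr backend expects to
     query. Field names are illustrative only. -->
<copyField source="title" dest="tm_title"/>
<copyField source="content" dest="tm_body"/>
<copyField source="url" dest="ss_url"/>
```

The tm_/ss_ prefixes follow the Search API Solr module's dynamic-field convention (multivalued text and single string, respectively); the real target names would depend on how the index's fields are defined.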
Comment #4
niccolox commented
btw, this is the original sandboxed schema mapping:
http://drupalcode.org/sandbox/cilefen/1858412.git/blob/dc446f05218147185...
Comment #5
niccolox commented
just did an initial test with plain Search API and didn't get any results; no error messages either
just a note: the Nutch-specific schema is actually the Drupal Solr schema, so I was hoping it would "just work"
am I too optimistic? is the common schema project good for this?
Comment #6
niccolox commented
as a follow-up question: would it be appropriate to create a sandbox project, something like Search API Nutch Solr?
I am having a little trouble placing this as a Sarnia, Search API, or other type of project
Comment #7
niccolox commented
another thought:
if I can get some traction on the Search API Nutch Solr path, I think it would be best to use the full Search API ecosystem, and that means going as far upstream as possible and making changes there
i.e. go back to the Solr 3.6.2 core and use a Search API Drupal-specific Solr 3.6 conf set, adjusted for Nutch, using the Nutch Solr examples from cilefen and pwolan as starting points
then do Nutch crawls and send those to a Search API Drupal Solr core, which Search API can read as a native Solr core
personally, I have ZERO resources other than my spare time, which is limited, so I am trying for the biggest bang for my null bucks
I have only small indexes, approx 200 000 documents, which I could do again, this time on Hadoop
I really want the flexibility of Search API Views and as much of the Search API Solr ecosystem as possible
examples of working prototypes with Nutch Solr ApacheSolr suite at http://permaculture.coop
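The crawl-then-index step described above would look roughly like this in Nutch 1.x terms; the seed directory, crawl depth, and Solr core URL are assumptions:

```shell
# Hypothetical Nutch 1.6 one-shot crawl that posts the results straight
# into a Solr core configured with the Search API / Drupal schema.
bin/nutch crawl urls \
  -dir crawl \
  -solr http://localhost:8983/solr/drupal \
  -depth 3 -topN 1000
```

The `-solr` option makes the crawl command run the solrindex step at the end, so the core is ready for Search API to read as soon as the crawl finishes.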
Comment #8
niccolox commented
ok, I have created a sandbox project, Search API Nutch Solr, at https://drupal.org/sandbox/niccolo/2084191
not sure exactly what shape it will take, options open at this stage
might start with a configuration-type project like the early Nutch Solr sandbox, i.e. simply config and a readme; will see if that's all I need to do if I start with a Search API Solr conf core and send Nutch crawls to that index.
http://drupalcode.org/sandbox/cilefen/1858412.git/commit/dc446f052181471...
fingers crossed
will shift my thinking out loud to that sandbox project's issue list and update here when I have something substantive
cheers
-N
Comment #9
drunken monkey commented
How exactly did you try it? With Sarnia? Or did you create a normal Search API Solr server – and what kind of index?
Sarnia should work in principle, so that would be my first try. If it doesn't work, just try to debug it; I don't see why it shouldn't.
For using a "real" index (which would of course offer more features than Sarnia), as said, you'd have to write your own datasource controller defining an item type for Nutch-indexed Solr documents, and use hook_search_api_solr_field_mapping_alter() to map the Search API fields to the correct fields in the Solr index.
But yes, please report back here if you find anything interesting!
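A minimal sketch of that hook, assuming a hypothetical module, index machine name, and Nutch field names (`title`, `content`, `url`) – none of which come from this thread:

```php
<?php

/**
 * Implements hook_search_api_solr_field_mapping_alter().
 *
 * Points the Search API's field names at the raw fields Nutch wrote into
 * the Solr index, instead of the dynamic-field names (tm_*, ss_*) that
 * the Solr backend would otherwise generate. All names here are
 * illustrative assumptions.
 */
function mymodule_search_api_solr_field_mapping_alter(SearchApiIndex $index, array &$fields) {
  // Only touch the hypothetical Nutch-backed index.
  if ($index->machine_name == 'nutch_index') {
    $fields['title'] = 'title';
    $fields['body']  = 'content';
    $fields['url']   = 'url';
  }
}
```

With this in place, queries built through the Search API would be translated to the Nutch-written Solr fields rather than the module's default dynamic fields.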
Comment #10
niccolox commented
very quickly tried both Sarnia and Search API Solr, both connecting to the same Solr 3.6.2 core which was set up for ApacheSolr module integration... as far as I know, the Nutch config maps directly to the ApacheSolr Drupal schema with no alterations
Sarnia gave this dismax error:
SearchApiException while executing a search: An error occurred while trying to search with Solr: "400" Status: unknown handler: dismax. HTTP ERROR 400: Problem accessing /solr/permaculturebigcrawl30aug2013/select. Reason: unknown handler: dismax. in SearchApiSolrService->search() (line 948 of /home/drupalpro/websites/pcoop4.dev/sites/all/modules/search_api_solr/includes/service.inc).
Comment #11
niccolox commented
so, my next steps are to try again, more carefully, with the same core, for both Search API Solr and Sarnia
I am assuming, maybe incorrectly, that it's better to use Search API Solr than Sarnia, for access to the full range of Search API modules. I am also assuming that the config/schema differences between Search API and ApacheSolr are small, so any changes to the mapping from Nutch-to-ApacheSolr to Nutch-to-Search API Solr will be small as well
then I'll try to set up a new Solr core using the Search API schema and try to map Nutch to that, using the mapping above, probably https://drupal.org/node/2083017#comment-7835711
I could simply be making silly mistakes here; I need to work on this when I have more time to focus, grabbing moments between other offline duties
Comment #12
niccolox commented
I am thinking I might not need to do this
I need to check the terminology and references more closely, but it's my understanding that the Nutch Solr config uses the Drupal ApacheSolr module schema without modification
so, Nutch indexes sent to Solr cores are actually not new datasources
I was assuming the common schema between Search API / Apache Solr was close enough to allow a pretty direct port
I'll try to set up semi-public working environments so I am not just talking
cheers
Comment #13
niccolox commented
the Sarnia index is in READ ONLY mode, indexing these terms
added a node-type Search Index, which gave that dismax error
Comment #14
niccolox commented
tried again: created a Sarnia-based search page and got the 400 dismax error; am looking into https://drupal.org/node/1409198
Comment #15
drunken monkey commented
It seems the Sarnia module is out of date there. The new configs ("new" – soon a year old, already) don't have a handler named "dismax" anymore; it's called "pinkPony" now (for perfectly valid reasons) and is already the default, so passing this shouldn't be necessary. If they require "dismax" specifically, and not "edismax", passing it as the defType would be the way to go.
There's actually already an issue for that – #1409198: '400' Status: Bad Request within View. However, it seems Sarnia is virtually unmaintained at the moment, so if you wanted to work with it, taking over maintainership (at least temporarily, until it's compatible with current versions of the Search API again) would probably be necessary.
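For reference, one server-side way to keep dismax-style clients working against a newer config is to set a default query parser on the select handler in solrconfig.xml; the handler name and defaults shown here are assumptions, not the shipped Drupal config:

```xml
<!-- Hypothetical solrconfig.xml fragment: make edismax the default
     query parser for the standard select handler, so requests that no
     longer match a removed "dismax" handler still parse as expected. -->
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```

The client-side alternative, as suggested above, is simply to pass defType=edismax with the query instead of naming a dedicated handler.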
Sorry, I hadn't noticed it was in such a bad state, or I wouldn't have suggested you use it. However, it might still be easier to fix Sarnia and use it (there are patches for most issues already, it seems) than to go another route.
Yes, you should probably get a bit more familiar with the Search API architecture and terminology – this handbook page might help (I really hope). While the compatible schema (and, even more importantly, solrconfig.xml) of course helps with the integration with Solr, the Search API itself doesn't know about any of this. To use the Search API, you'll have to create an index, which defines the fields that can be used for searching, filtering, sorting, etc. And since you want to use the fields and data from Nutch, not from nodes or other entities on the site itself, you'll have to define a new item type which defines appropriate fields for this. Solr then translates these fields into Solr fields with its own mapping – and since that will probably differ from the mapping used by Nutch (because nearly all fields use dynamic field prefixes), you'll have to use hook_search_api_solr_field_mapping_alter() to change it. Only with both of these will you have proper communication between the Search API and a Solr server with external data (in this case, indexed by Nutch).
Sarnia solves this by defining a new entity type (instead of a Search API-specific item type, which would be the better solution here) which has all the fields that it sees stored in the Solr index it is pointed to.
Since it can't have the metadata of what these Solr fields actually mean, though, the mapping is necessarily only very basic. Customizing this to the data (structure) indexed by Nutch would improve the mapping and offer better-suited functionality for the individual fields.
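Defining such an item type would start with something like this; the module name, type key, and controller class are all hypothetical:

```php
<?php

/**
 * Implements hook_search_api_item_type_info().
 *
 * Registers a hypothetical "Nutch document" item type. Its datasource
 * controller class would describe the fields found in the Nutch-built
 * index so the Search API can offer them for indexes, filters, etc.
 */
function mymodule_search_api_item_type_info() {
  return array(
    'nutch_document' => array(
      'name' => t('Nutch document'),
      'datasource controller' => 'MymoduleNutchDatasourceController',
    ),
  );
}
```

The controller class (extending the Search API's datasource controller base class) is where the per-field metadata would live; this fragment only registers the type.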
Comment #16
niccolox commented
Just had a thought to also try Sarnia with Cloudera Solr
Thanks for the feedback; will be onto this later in the week
Comment #17
jason.fisher commented
I am using Nutch 2.x/Solr/Drupal, but I have written a custom iterator for HBase to inject sites from Drupal and custom recrawl batches with custom Solr calls to create the index as I want. I then use Sarnia to create a View that browses this index. At this point, I basically just use Nutch for the crawling threads and HTML capture, then pre-filter results based on predefined keywords (taxonomy terms), compare with previously crawled deltas, and inject that into Solr.
It needs quite a bit of work to be community-friendly in any way, but the workflow wasn't difficult once you see what the fields are used for. Using Nutch with a MySQL backend would make this even easier. Set your own batchID using an UPDATE statement, call Nutch against it a few times, then index from it into Solr. This also allowed me to avoid the Solr index being completely rewritten, so I can use ajax callbacks from the View to update fields in the Solr index.
This workflow along with the new Gora-backend for Nutch 2.x let me ignore most of the features that were lost from 1.6.
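The batch workflow described above might be sketched like this for a Nutch 2.x MySQL backend; the database, table, and column names ("webpage", "batchId") follow the Gora SQL store convention but are assumptions to verify against your own setup, as are the batch ID, URL filter, and Solr URL:

```shell
# Hypothetical sketch: claim a set of rows as one batch directly in MySQL,
# then run the Nutch 2.x phases against that batch only.
mysql nutch -e "UPDATE webpage SET batchId = 'drupal-batch-1' \
  WHERE baseUrl LIKE 'http://example.org%';"

bin/nutch generate -batchId drupal-batch-1
bin/nutch fetch drupal-batch-1
bin/nutch parse drupal-batch-1
bin/nutch updatedb
bin/nutch solrindex http://localhost:8983/solr/drupal drupal-batch-1
```

Because only the named batch is re-indexed, the rest of the Solr index stays untouched, which is what makes the ajax-callback field updates described above possible.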
Regards,
Jason Fisher
Comment #18
niccolox commented
ok, that's fascinating; will explore in that direction once I get my new server cluster online and get the basic Hadoop > Nutch 1.6 > Solr set-up working again
thanks for the inspiration, Jason
Comment #19
OanaIlea commented
This issue was closed due to lack of activity over a long period of time. If the issue is still acute for you, feel free to reopen it and describe the current state.