this is a stub. I am currently running a few sites using the Apache Solr module ecosystem, based on Solr cores generated from Nutch web crawls, and I want to explore Search API Solr integration
the Nutch module I currently use is
http://drupalcode.org/project/apachesolr_examples.git/tree/HEAD:/apaches...
based on an earlier sandbox
https://drupal.org/sandbox/cilefen/1858412
I am bouncing around the Apache Solr ecosystem without getting much traction, and although I guess the Nutch schema doesn't work out of the box with Search API, I'd rather spend my energy and time getting Search API working than digging deeper into the ApacheSolr module
for one thing, Nutch now has a pluggable search backend, i.e. Elasticsearch or Solr are currently implemented
also, the ApacheSolr module seems to be a maze of implementations and legacy
my own research on this is at http://www.bigdatadrupal.com/ and http://niccolox.org/blog/big-data-drupal
will post more as I do further research in coming weeks
Comments
Comment #1
niccolox commented
this is the juice, btw: solrindex-mapping.xml
http://drupalcode.org/project/apachesolr_examples.git/blob/HEAD:/apaches...
Comment #2
drunken monkey commented
Have you tried Sarnia? That's an extension of this module which can work with arbitrary Solr indexes.
Otherwise, if the schema is fixed, you could implement a datasource for adding a type with that schema to the Search API. For natively working with the Solr search backend, though, you'd have to add some more <copyField> directives to make all the data available in the fields in which the Solr backend expects it. Or you could implement hook_search_api_solr_field_mapping_alter(), which might be easier, and would certainly save memory.
Comment #3
niccolox commented
personally I haven't tried Sarnia, but the original sandbox developer had this feedback:
https://drupal.org/node/1910876#comment-7358620
thanks for your suggestions, will see what I can do
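For reference, the &lt;copyField&gt; approach suggested in #2 would look something like this in schema.xml; the Nutch source fields and the Search API dynamic-field targets shown here are assumptions for illustration, not the actual mapping:

```xml
<!-- Hypothetical schema.xml fragment: copy the fields Nutch populates
     into the dynamic fields the Search API Solr backend expects to
     query. Field names are illustrative only. -->
<copyField source="title" dest="tm_title"/>
<copyField source="content" dest="tm_body"/>
<copyField source="url" dest="ss_url"/>
```

The tm_/ss_ prefixes follow the Search API Solr module's dynamic-field convention (multivalued text and single string, respectively); the real target names would depend on how the index's fields are defined.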
Comment #4
niccolox commented
btw, this is the original sandboxed schema mapping:
http://drupalcode.org/sandbox/cilefen/1858412.git/blob/dc446f05218147185...
Comment #5
niccolox commented
just did an initial test with plain Search API and didn't get any results; no error messages either
just a note: the Nutch-specific schema is actually the Drupal Solr schema, so I was hoping it would "just work"
am I too optimistic? is the common schema project good for this?
Comment #6
niccolox commented
as a follow-up question: would it be appropriate to create a sandbox project, something like Search API Nutch Solr?
I am having a little trouble placing this as a Sarnia, Search API, or other type of project
Comment #7
niccolox commented
another thought:
if I can get some traction on the Search API Nutch Solr path, I think it would be best to use the full Search API ecosystem, and that means going as far upstream as possible and making changes there
i.e. go back to the Solr 3.6.2 core and use a Search API Drupal-specific Solr 3.6 conf set, adjusted for Nutch, using the Nutch Solr examples from cilefen and pwolan as starting points
then do Nutch crawls and send those to a Search API Drupal Solr core, which Search API can read as a native Solr core
personally, I have ZERO resources other than my spare time, which is limited, so I am trying for the biggest bang for my null bucks
I have only small indexes, approx 200 000 documents, which I could do again, this time on Hadoop
I really want the flexibility of Search API Views and as much of the Search API Solr ecosystem as possible
examples of working prototypes with Nutch Solr ApacheSolr suite at http://permaculture.coop
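The crawl-then-index step described above would look roughly like this in Nutch 1.x terms; the seed directory, crawl depth, and Solr core URL are assumptions:

```shell
# Hypothetical Nutch 1.6 one-shot crawl that posts the results straight
# into a Solr core configured with the Search API / Drupal schema.
bin/nutch crawl urls \
  -dir crawl \
  -solr http://localhost:8983/solr/drupal \
  -depth 3 -topN 1000
```

The `-solr` option makes the crawl command run the solrindex step at the end, so the core is ready for Search API to read as soon as the crawl finishes.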
Comment #8
niccolox commented
ok, I have created a sandbox project, Search API Nutch Solr, at https://drupal.org/sandbox/niccolo/2084191
not sure exactly what shape it will take, options open at this stage
might start with a configuration-type project like the early Nutch Solr sandbox, i.e. simply config and a readme; will see if that's all I need to do if I start with a Search API Solr conf core and send Nutch crawls to that index.
http://drupalcode.org/sandbox/cilefen/1858412.git/commit/dc446f052181471...
fingers crossed
will shift my thinking out loud to that sandbox project's issue list and update here when I have something substantive
cheers
-N
Comment #9
drunken monkey commented
How exactly did you try it? With Sarnia? Or did you create a normal Search API Solr server – and what kind of index?
Sarnia should work in principle, so that would be my first try. If it doesn't work, just try to debug it; I don't see why it shouldn't.
For using a "real" index (which would of course offer more features than Sarnia), as said, you'd have to write your own datasource controller defining an item type for Nutch-indexed Solr documents, and use hook_search_api_solr_field_mapping_alter() to map the Search API fields to the correct fields in the Solr index.
But yes, please report back here if you find anything interesting!
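A minimal sketch of that hook, assuming a hypothetical module, index machine name, and Nutch field names (`title`, `content`, `url`) – none of which come from this thread:

```php
<?php

/**
 * Implements hook_search_api_solr_field_mapping_alter().
 *
 * Points the Search API's field names at the raw fields Nutch wrote into
 * the Solr index, instead of the dynamic-field names (tm_*, ss_*) that
 * the Solr backend would otherwise generate. All names here are
 * illustrative assumptions.
 */
function mymodule_search_api_solr_field_mapping_alter(SearchApiIndex $index, array &$fields) {
  // Only touch the hypothetical Nutch-backed index.
  if ($index->machine_name == 'nutch_index') {
    $fields['title'] = 'title';
    $fields['body']  = 'content';
    $fields['url']   = 'url';
  }
}
```

With this in place, queries built through the Search API would be translated to the Nutch-written Solr fields rather than the module's default dynamic fields.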
Comment #10
niccolox commented
very quickly tried both Sarnia and Search API Solr, both connecting to the same Solr 3.6.2 core which was set up for ApacheSolr module integration... as far as I know, the Nutch config maps directly to the ApacheSolr Drupal schema with no alterations
Sarnia gave this dismax error:
SearchApiException while executing a search: An error occurred while trying to search with Solr: "400" Status: unknown handler: dismax. HTTP ERROR 400: Problem accessing /solr/permaculturebigcrawl30aug2013/select. Reason: unknown handler: dismax. in SearchApiSolrService->search() (line 948 of /home/drupalpro/websites/pcoop4.dev/sites/all/modules/search_api_solr/includes/service.inc).
Comment #11
niccolox commented
so, my next steps are to try again, more carefully, with the same core, for both Search API Solr and Sarnia
I am assuming, maybe incorrectly, that it's better to use Search API Solr than Sarnia, for access to the full range of Search API modules. I am also assuming that the config/schema differences between Search API and ApacheSolr are small, so any changes to the mapping from Nutch-to-ApacheSolr to Nutch-to-Search API Solr will be small as well
then I'll try to set up a new Solr core using the Search API schema and try to map Nutch to that, using the mapping above, probably https://drupal.org/node/2083017#comment-7835711
I could simply be making silly mistakes here; I need to work on this when I have more time to focus, grabbing moments between other offline duties
Comment #12
niccolox commented
I am thinking I might not need to do this
I need to check the terminology and references more closely, but it's my understanding that the Nutch Solr config uses the Drupal ApacheSolr module schema without modification
so, Nutch indexes sent to Solr cores are actually not new datasources
I was assuming the common schema between Search API / Apache Solr was close enough to allow a pretty direct port
I'll try to set up semi-public working environments so I am not just talking
cheers
Comment #13
niccolox commented
the Sarnia index is in READ ONLY mode, indexing these terms
added a node-type Search Index, which gave that dismax error
Comment #14
niccolox commented
tried again: created a Sarnia-based search page and got the 400 dismax error; am looking into https://drupal.org/node/1409198
Comment #15
drunken monkey commented
It seems the Sarnia module is out of date there. The new configs ("new" – soon a year old, already) don't have a handler named "dismax" anymore; it's called "pinkPony" now (for perfectly valid reasons) and is already the default, so passing this shouldn't be necessary. If they require "dismax" specifically, and not "edismax", passing it as the defType would be the way to go.
There's actually already an issue for that – #1409198: '400' Status: Bad Request within View. However, it seems Sarnia is virtually unmaintained at the moment, so if you wanted to work with it, taking over maintainership (at least temporarily, until it's compatible with current versions of the Search API again) would probably be necessary.
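For reference, one server-side way to keep dismax-style clients working against a newer config is to set a default query parser on the select handler in solrconfig.xml; the handler name and defaults shown here are assumptions, not the shipped Drupal config:

```xml
<!-- Hypothetical solrconfig.xml fragment: make edismax the default
     query parser for the standard select handler, so requests that no
     longer match a removed "dismax" handler still parse as expected. -->
<requestHandler name="/select" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="echoParams">explicit</str>
  </lst>
</requestHandler>
```

The client-side alternative, as suggested above, is simply to pass defType=edismax with the query instead of naming a dedicated handler.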
Sorry, I hadn't noticed it was in such a bad state, or I wouldn't have suggested you use it. However, it might still be easier to fix Sarnia and use it (there are patches for most issues already, it seems) than to go another route.
Yes, you should probably get a bit more familiar with the Search API architecture and terminology – this handbook page might help (I really hope). While the compatible schema (and, even more importantly, solrconfig.xml) of course helps with the integration with Solr, the Search API itself doesn't know about any of this. To use the Search API, you'll have to create an index, which defines the fields that can be used for searching, filtering, sorting, etc. And since you want to use the fields and data from Nutch, not from nodes or other entities on the site itself, you'll have to define a new item type which defines appropriate fields for this. Solr then translates these fields into Solr fields with its own mapping – and since that will probably differ from the mapping used by Nutch (because nearly all fields use dynamic field prefixes), you'll have to use hook_search_api_solr_field_mapping_alter() to change it. Only with both of these will you have proper communication between the Search API and a Solr server with external data (in this case, indexed by Nutch).
Sarnia solves this by defining a new entity type (instead of a Search API-specific item type, which would be the better solution here) which has all the fields that it sees stored in the Solr index it is pointed to.
Since it can't have the metadata of what these Solr fields actually mean, though, the mapping is necessarily only very basic. Customizing this to the data (structure) indexed by Nutch would improve the mapping and offer better-suited functionality for the individual fields.
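Defining such an item type would start with something like this; the module name, type key, and controller class are all hypothetical:

```php
<?php

/**
 * Implements hook_search_api_item_type_info().
 *
 * Registers a hypothetical "Nutch document" item type. Its datasource
 * controller class would describe the fields found in the Nutch-built
 * index so the Search API can offer them for indexes, filters, etc.
 */
function mymodule_search_api_item_type_info() {
  return array(
    'nutch_document' => array(
      'name' => t('Nutch document'),
      'datasource controller' => 'MymoduleNutchDatasourceController',
    ),
  );
}
```

The controller class (extending the Search API's datasource controller base class) is where the per-field metadata would live; this fragment only registers the type.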
Comment #16
niccolox commented
Just had a thought to also try Sarnia with Cloudera Solr
Thanks for the feedback; will be onto this later in the week
Comment #17
jason.fisher commented
I am using Nutch 2.x/Solr/Drupal, but I have written a custom iterator for HBase to inject sites from Drupal and custom recrawl batches with custom Solr calls to create the index as I want. I then use Sarnia to create a View that browses this index. At this point, I basically just use Nutch for the crawling threads and HTML capture, then pre-filter results based on predefined keywords (taxonomy terms), compare with previously crawled deltas, and inject that into Solr.
It needs quite a bit of work to be community-friendly in any way, but the workflow wasn't difficult once you see what the fields are used for. Using Nutch with a MySQL backend would make this even easier. Set your own batchID using an UPDATE statement, call Nutch against it a few times, then index from it into Solr. This also allowed me to avoid the Solr index being completely rewritten, so I can use ajax callbacks from the View to update fields in the Solr index.
This workflow along with the new Gora-backend for Nutch 2.x let me ignore most of the features that were lost from 1.6.
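The batch workflow described above might be sketched like this for a Nutch 2.x MySQL backend; the database, table, and column names ("webpage", "batchId") follow the Gora SQL store convention but are assumptions to verify against your own setup, as are the batch ID, URL filter, and Solr URL:

```shell
# Hypothetical sketch: claim a set of rows as one batch directly in MySQL,
# then run the Nutch 2.x phases against that batch only.
mysql nutch -e "UPDATE webpage SET batchId = 'drupal-batch-1' \
  WHERE baseUrl LIKE 'http://example.org%';"

bin/nutch generate -batchId drupal-batch-1
bin/nutch fetch drupal-batch-1
bin/nutch parse drupal-batch-1
bin/nutch updatedb
bin/nutch solrindex http://localhost:8983/solr/drupal drupal-batch-1
```

Because only the named batch is re-indexed, the rest of the Solr index stays untouched, which is what makes the ajax-callback field updates described above possible.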
Regards,
Jason Fisher
Comment #18
niccolox commented
ok, that's fascinating; will explore in that direction once I get my new server cluster online and get the basic Hadoop > Nutch 1.6 > Solr set-up working again
thanks for the inspiration, Jason
Comment #19
OanaIlea commented
This issue was closed due to lack of activity over a long period of time. If the issue is still acute for you, feel free to reopen it and describe the current state.