I've been working on setting up Solr and Nutch for the past couple of days, and I finally seem to have the Nutch crawler data working with Drupal.

My problem is that I can't seem to get the admin UI working for the Nutch crawler. I have to execute the crawl commands manually through the Linux terminal. Is there any special configuration I need to have? When I hit start crawl in debug mode, it says everything completed successfully, but no data was fetched.

I've set the proper paths to Nutch and Java.

I merged the schema.xml required for Nutch and the schema.xml required for Drupal. I set up some copyFields to store the data in the same way that Drupal's nodes are, and changed the uniqueKey to be the URL because Nutch doesn't send an id. Is there anything else required?
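
A rough sketch of what that kind of change might look like in the merged schema.xml. The field names "content", "body" and "url" are assumptions for illustration, not a copy of a working file, and both fields named in the copyField must already be declared in the schema:

```xml
<!-- Sketch only: these elements sit directly inside <schema>, after the
     closing </fields> tag of the merged schema.xml. -->

<!-- Copy the page text sent by Nutch into the field that Drupal searches.
     "content" and "body" are assumed field names. -->
<copyField source="content" dest="body"/>

<!-- Use the crawled URL as the unique key, since Nutch does not send an id. -->
<uniqueKey>url</uniqueKey>
```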

Also, when you view the search results the url for the data retrieved from Nutch has NUTCH_VIRTUAL_NODE_PATH instead of...what's supposed to be there?

Thanks

Comment  File         Size       Author
#64      seeds.patch  810 bytes  robertdouglass

Comments

dstuart’s picture

Assigned: Unassigned » dstuart
Category: support » bug
Priority: Normal » Critical

Hi,

Yep, looks like a bug. I will try to get to that soon. Re NUTCH_VIRTUAL_NODE_PATH: that is supposed to be the URL of the site you have crawled. That's a bug too, so I will try to fix both.

Cheers,

Dave

karljohann’s picture

"I merged the schema.xml required for Nutch and the schema.xml required for Drupal. Setup some copyfields to store the data in the same way that Drupal's nodes are and changed the uniqueKey to be the URL because Nutch doesn't send and id. Is there anything else required?"

Are these things required to make this work? Is there any documentation on the way? Just very basic and/or raw instructions would do.

dnett123’s picture

I couldn't make it work properly without merging the two schema.xml files. I'm not sure if I did it properly but I was able to make it work to the point that nutch is working with solr and the search results are showing up in Drupal.

I had another question: is it possible to have those Nutch pages created as their own content type ("external", for example)?

karljohann’s picture

It seems to be working now that I've merged the schema.xml files. Thanks for that.

I guess I'm just gonna have to study Nutch a bit more to get an understanding of what it is that I'm doing though :)

dstuart’s picture

Hey, I wrote a patch for Nutch a while back that was released in Nutch 1.1. You can use conf/solrindex-mapping.xml in the Nutch directory; mine looks like the one below. This means that you don't have to use copyField, which means you can have a merged Nutch and Drupal index.

<mapping>
        <!-- Simple mapping of fields created by Nutch IndexingFilters
             to fields defined (and expected) in Solr schema.xml.

             Any fields in NutchDocument that match a name defined
             in field/@source will be renamed to the corresponding
             field/@dest.
             Additionally, if a field name (before mapping) matches
             a copyField/@source then its values will be copied to 
             the corresponding copyField/@dest.

             uniqueKey has the same meaning as in Solr schema.xml
             and defaults to "id" if not defined.
         -->
        <fields>
                <field dest="site" source="site"/>
                <field dest="title" source="title"/>
                <field dest="host" source="host"/>
                <field dest="segment" source="segment"/>
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
                <field dest="id" source="url"/>
                <field dest="body" source="content"/>
                <copyField source="url" dest="url"/>
        </fields>
        <uniqueKey>id</uniqueKey>
</mapping>

dstuart’s picture

"I had another question is it possible to have those nutch pages created as their own content type (external for example)?"

I've been thinking about this one for a while now, and I think you could do it in one of two ways.

- You could have a node which is literally a reference to a Solr document and holds little information other than its id.
- You could write an import from Solr that effectively reverses the indexing process to create nodes. I had to do this once when my database got corrupted, and re-imported all of the data from my Solr instance.

karljohann’s picture

One question: is there any way of knowing whether the module itself is working? When I press "Start crawl" I just get

* Starting Nutch Crawl.
* 0

That isn't very informative, and I assume it means that it didn't work. However, when I use the commands from the dry run/debug crawl, it works fine. (Except for one thing: the segments folder from --- Beginning crawl at depth 2 of 2 --- is the same as for 1 of 2, but if I correct that then it's fine.)

dstuart’s picture

You should see the Nutch process in your process list, but I'll have a chance to look at this tomorrow and try to roll up a bunch of bug fixes I've been doing.

Regards,

David

ataneja’s picture

Hi!

Can anyone please tell me exactly what needs to be done to show Nutch results in Drupal?

How do I need to merge the schema files? Do I need to copy and paste the above patch provided by David Stuart into the Nutch schema.xml?
What else needs to be done?

I have one more doubt: it seems we have to change the schema of the Solr server twice.
First, we need to copy the Nutch schema to the Solr schema so that the index from Nutch gets transferred to Solr.
Second, we need to copy the schema of the Solr module to the schema of the Solr server so that the Solr module can connect to the Solr server.

Is there any other way? Please reply.

Thanks

ataneja’s picture

Hey karljohann and dnett123, can you tell me exactly what you did to get it to work?

Please reply. Thanks

dstuart’s picture

Hi ataneja,

Could you outline what you have done so far?

The rough steps are as follows (this assumes you are running on Linux):

- Get Nutch 1.1 from nutch.apache.org and unpack it to a location of your choice.
- In the Nutch directory, under conf, copy the mapping above into solrindex-mapping.xml. This allows you to map Nutch fields onto the Solr schema, e.g. the Nutch content field data will be copied into the body field in Solr.
- Obviously you will need to have Solr installed, with the Drupal schema.xml in place. All of the fields are in that schema except for those listed below, which you will have to add:

  <field dest="host" source="host"/>
  <field dest="segment" source="segment"/>
  <field dest="boost" source="boost"/>
  <field dest="digest" source="digest"/>
  <field dest="tstamp" source="tstamp"/>
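
Note that the lines above use the solrindex-mapping.xml syntax (dest/source). In schema.xml the same fields would be declared with name and type attributes. A minimal sketch, assuming the types and the indexed/stored flags shown here; string, float and long are types defined in the Drupal schema, but which one suits each field is a guess that should be checked against your own setup:

```xml
<!-- Sketch only: add inside the <fields> section of Drupal's schema.xml.
     The type and indexed/stored choices are assumptions, not taken from
     the Nutch module. -->
<field name="host" type="string" indexed="true" stored="false"/>
<field name="segment" type="string" indexed="false" stored="true"/>
<field name="boost" type="float" indexed="false" stored="true"/>
<field name="digest" type="string" indexed="false" stored="true"/>
<field name="tstamp" type="long" indexed="false" stored="true"/>
```

Solr only reads schema.xml at startup, so it needs a restart after the change.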

If you can provide a little more information on what's going wrong, then I can expand this step by step and add it to the module install.

Regards,

David

ataneja’s picture

First of all I must say that you are doing a wonderful job in developing the nutch module for drupal.

Coming to my problem:

I have installed Nutch 1.1 and crawled some websites.
I sent the index to the Solr server (which I have installed on a dedicated server).
Then I installed the Solr module in Drupal, which was able to communicate with that Solr server.

Now, the problem is that the index which was sent from Nutch is not showing up in Drupal search results via the Solr module.
I believe the problem is that Nutch and the Solr module index their data in different ways.
I don't know how to make them compatible. I guess I need to merge the schemas and add some copyFields, but I don't know exactly what needs to be done. Please tell me as soon as possible.

Thanx in advance

dstuart’s picture

Hi Abhishek,

It looks as though you don't have the right schema file in the conf dir of Solr. When you downloaded the apachesolr Drupal module, there should have been a schema.xml in it. You need to put that in the Apache Solr conf folder and restart the Solr server. You also need to add, just before the closing </fields> tag, all of the fields mentioned here: http://drupal.org/node/811062#comment-3240566

Regards,

Dave

dstuart’s picture

For completeness, I am posting the final resolution.

After following the steps above, disabling the Solr node access option in Drupal is also required at the moment. We can work around it, but Apache Solr seems to specifically namespace things that it has indexed with node access on, which is quite limiting in my opinion. There is also a quick fix to hook_apachesolr_process_results.

In nutch.module, replace

  /**
   * Implementation of hook_apachesolr_process_results().
   */
  function nutch_apachesolr_process_results(&$results) {
    foreach ($results as $i => $result) {
      if (isset($result['node']->teaser)) {
        drupal_set_message(strip_tags($result['node']->teaser));
        $results[$i]['snippet'] = strip_tags($result['node']->teaser);
        $results[$i]['node']->teaser = strip_tags($result['node']->teaser);
      }
      if (isset($result['node']->digest)) {
        /*
         * TODO
         * Hard code at first but in the nutch settings I have to allow the ability to specify
         * a view that is to handle the nutch virtual node status. That will allow you to grab the
         * path settings out of the view and prepend to the link
         */
        $results[$i]['link'] .= NUTCH_VIRTUAL_NODE_PATH . $result['node']->digest;
      }
    }
  }


with

  /**
   * Implementation of hook_apachesolr_process_results().
   */
  function nutch_apachesolr_process_results(&$results) {
    foreach ($results as $i => $result) {
      if (isset($result['node']->teaser)) {
        drupal_set_message(strip_tags($result['node']->teaser));
        $results[$i]['snippet'] = strip_tags($result['node']->teaser);
        $results[$i]['node']->teaser = strip_tags($result['node']->teaser);
      }
      if (isset($result['node']->url)) {
        $results[$i]['link'] = $result['node']->url;
      }
    }
  }

Hopefully that should sort the problem

Regards,

Dave

savannah_beckett’s picture

Category: bug » support

I am trying to merge the schema.xml from my Solr/Nutch setup with the one from the Drupal Apache Solr module. I encountered a field that is not mergeable.
From drupal module:

From solr/nutch setup:
required="true"/>

I am not sure whether there is anything else like this that is not mergeable.

Is there an easy way to deal with schema.xml?
Thanks.

savannah_beckett’s picture

I reread your comment 11. I already had the Nutch/Solr setup working. Does your comment mean I should keep the Drupal module's solrconfig.xml and remove the one in my Solr/Nutch setup? And remove the schema.xml in my Solr/Nutch setup, keep the Drupal module's schema.xml, and add entries corresponding to the following?





So after this, no need to merge the schema.xml further?

dstuart’s picture

Yes, that is correct. You can use Drupal's solrconfig.xml and schema.xml, add in the fields described (or map them to other fields using solrindex-mapping.xml), and away you go. The url field can be of type string unless you really need URL validation (which I imagine Drupal would mess up).

savannah_beckett’s picture

Does this module support faceted search, or do I have to download another module called Apache Solr Facet Builder? I want to use several custom fields that I defined in the Solr index as part of my faceted search. I tried for a long time to get the Apache Solr Facet Builder module to work, and I played around with the Views module, but so far with no result. There are no instructions available for custom fields in the Solr index.

savannah_beckett’s picture

I am able to get search results from the index with this module, but the URL of each search result points to the homepage of my Drupal site. Why?

scotjam’s picture

Hi all

Can anyone suggest web links to tutorials that help with 1) installing Nutch on Windows and 2) getting Nutch and Apache Solr working together?

There are plenty of instructions online, but I'm not sure which ones to follow. I don't know which steps are generic to Nutch and what needs to be done specifically for Apache Solr and Drupal.

Which instructions have worked for you?

e.g. Found this one. Do I follow every step here? Or does a drupal setup of apache solr using the nutch module need different steps? http://wiki.apache.org/nutch/RunningNutchAndSolr

cheers
scotjam

dstuart’s picture

Hi scotjam,

This is a good article about it http://wiki.apache.org/nutch/GettingNutchRunningWithWindows

Note: If you're just interested in a basic installation on Windows and are not interested in knowing the details of how it is done, you might want to check whether the WhelanLabs SearchEngine Manager (http://www.whelanlabs.com/content/SearchEngineManager.htm) fits your needs. It is a free installer for Nutch on Windows.

Regards,

Dave

dstuart’s picture

Hi Savannah,

See this comment http://drupal.org/node/811062#comment-3251604

Hope it helps

Regards,

Dave

karljohann’s picture

By the way, I got #7 fixed. The nutch/crawl folder didn't have the right permissions.

nitinbh77’s picture

Hi,

I am trying to integrate Nutch 1.1, Solr 1.4 and Drupal 6. I am able to fetch the Nutch and Drupal results and view them from the Solr admin screen. However, when I try to search the Nutch contents from the Drupal Solr search module, it shows no results.
I was able to use solrindex-mapping to map the fields. I hope I did it correctly, as I can see the Nutch results in Solr. I have no idea why it is not showing me the results in Drupal.

please help

Regards
Nitin

karljohann’s picture

Does it show no results or no results from Nutch? Did you change the solr/example/conf/schema.xml and solrconfig.xml for the Drupal Apachesolr module ones?

nitinbh77’s picture

It shows me the Nutch results when I query from the Solr admin screen. I did copy the schema.xml and solrconfig.xml from the Drupal Apachesolr module.

The problem is this: if I create some content in Drupal and index it, I am able to search for it via the Drupal Apachesolr module. However, if I crawl some website using Nutch, I cannot query it from the Drupal Apachesolr module. The Nutch results are present in Solr, though.

Thanks
Nitin

karljohann’s picture

I actually had the same problem but I honestly can't remember how I fixed it. Just try going through all the steps again, copy the schema.xml and solrconfig.xml files and change the solrindex-mapping file like so. I can't remember if you have to add these fields to the schema.xml but I have them there anyway.

If you are, however, using the Apache Solr Views integration then I'm still having that problem and really am of no use.

nitinbh77’s picture

I believe I am missing some field mappings in the solrindex-mapping file. I added some fields to the mapping file, but it seems the Drupal Solr module is looking for some more while searching. I am not using the Views integration, but it would be great if you could paste your mapping file here for quick reference. Thanks again for the prompt response.

Regards
Nitin

karljohann’s picture

Mine is identical to this one from dstuart

nitinbh77’s picture

OK, here are the steps I have done again:

1. Copied the solrconfig.xml and schema.xml provided with the Drupal apachesolr module into Solr1.4/conf

2. Copied the solrconfig.xml and schema.xml provided with the Drupal apachesolr module into Nutch1.1/conf

3. Edited solrindex-mapping.xml in Nutch1.1/conf and added entries like this:
http://drupal.org/node/811062#comment-3154622

4. Edited Nutch1.1/schema.xml and added:
http://drupal.org/node/811062#comment-3240566

After that I restarted Solr, crawled the website, and sent the data to Solr.

I can still see the crawled data indexed in Solr, but if I search for it from Drupal using apachesolr, it is still not visible.

dstuart’s picture

Hi nitinbh77,

Have you got the Apache Solr access control module on? The current version of the Nutch module doesn't support this feature; turn it off and try a search. If that doesn't work

Regards
Dave

nitinbh77’s picture

Hi Dave,

You got right to the heart of the issue. I just disabled the access control module and I am now able to view the Nutch search results in Drupal.

Thanks a ton for your help.

Regards
Nitin Bhardwaj

suneethark’s picture

Hi all,

I installed the Nutch module "6.x" version with Nutch 1.0 and Solr 1.4. I was able to run Nutch on my Windows machine successfully, but I am not able to crawl external sites using the Nutch module. When I press "Start crawl" I just get

* Starting Nutch Crawl.
* 0

It didn't work. When I selected dry run/debug crawl, it didn't work either.
The output was as follows:

* http://www.9isolutions.com
* Starting Nutch Crawl.
* C:/xampp/htdocs/dprsearch/sites/all/modules/contrib/nutch/runbot -n "C:/cygwin/home/nutch-1.0" -j "C:/Program Files/Java/jdk1.6.0_21" -s "http://localhost:8983/solr" -u "http://www.9isolutions.com" -c "1" -f "100" -d "1"

Beyond that it doesn't do anything. I guess the "exec" command is not calling the external program. I am unable to trace where I am going wrong. Can anyone please suggest how to crawl external sites with this module? Am I doing it the right way?

Regards,
Suneetha.

robertdouglass’s picture

ApacheSolr 6.x-2.x has an entity field. Does a syntax like the following work?
EDIT: meant "nutch"
<field dest="entity">nutch</field>

dstuart’s picture

No, it currently doesn't; it requires both a source and a dest. You could add a default value to the field definition in your Solr schema.xml:

From:
<field name="entity" type="string" indexed="true" stored="true"/>
To:
<field name="entity" type="string" indexed="true" stored="true" default="node"/>

robertdouglass’s picture

I'm trying to make it so that I can recognise the nutch documents as such. If I made entity default to "nutch", or something else that identifies them as a like group, this would superficially solve the problem, as the nutch documents are the only docs without an entity value. However it would also open the door to having other documents mislabelled as being from "nutch" if they somehow omit the entity field value.

I guess this is going to merit a patch to Nutch to allow for default values in the configuration xml.

dstuart’s picture

Hey Robert,

As I wrote the original mapping patch for Nutch, I'll take a stab at the change. It should be quite minor, but it may not be in a stable release for a while.

Regards

Dave

robertdouglass’s picture

Dave, thanks. Would you mind posting your work here as well? I can see already that I'm going to have to extend it even further to be a Drupal specific Solr writer so that we can be compatible with apachesolr_multisite. The hash field, for example, has to be computed.

dstuart’s picture

Hey Robert,

On reflection, with respect to #36, I think the way to go is the creation of a new mapping field in Nutch's solrindex-mapping.xml:

<staticField dest="entity">nutch</staticField>

Here is a link to the JIRA issue and the patch: https://issues.apache.org/jira/browse/NUTCH-924. I haven't posted the patch here as I wasn't sure about licensing.

Regards,

Dave

broncomania’s picture

Hallo,

I am also struggling with Drupal, Nutch and Solr. I have spent several days trying to get them running. I found some problems with the Nutch runbot, and especially with the missing documentation. I'm not a professional programmer, but I really need to configure Nutch and Solr. I see a lot of people here who, like me, like this module but can't get it running. I really think it's time to collect all the information on this topic and create at least one piece of documentation that really works. From that starting point it would be much easier for others to work through the problems, whatever their level of knowledge.

My point is: can someone who got it running give a step-by-step explanation of what changes are necessary for the Nutch module, especially in the conf section with the schema.xml? The next thing is the Solr schema.xml. Do I just copy the apachesolr module files to both the Nutch schema and the Solr one, or only to the Solr system?

The next thing is the mapping. There are some code examples here, but where must I put them? In the schema.xml in both the Nutch and Solr folders, or only in the Solr folder? And where exactly do these mappings go? Between which tags? These are a lot of simple questions, but without knowing the answers it's really hard to get it working.

I am really sure someone knows this, and it's time to collect this information.
So I am willing to spend some time investigating the problems to find a way of making a working Nutch/Solr system.

Thx in advance for this cool module.

maxmmize’s picture

Posted by karljohann on August 19, 2010 at 9:48am
By the way, I got #7 fixed. The nutch/crawl folder didn't have the right permissions.

Err, what are the right permissions?

The nutch/crawl folder

Err, where is this folder?

broncomania’s picture

@maxmmize: You have to create this folder. Look here: http://drupal.org/node/950766. Hope this helps a little bit.

So I tried to merge the Solr schema.xml with the Nutch schema.xml and extend it with the mapping that was posted in this thread. Is this code right? If yes, maybe it could be copied into the Nutch module as a working starting point for others. I mean, Nutch has a schema and Solr has a schema, so why shouldn't we make one as well?

I chose the schema from the apachesolr 2.0 version. Solr works like a charm with this. Now it's about Nutch, and I just want to be sure that this config is right.

<?xml version="1.0" encoding="UTF-8" ?>
<!-- $Id: schema.xml,v 1.1.2.1.2.32.2.6 2010/06/11 21:59:52 pwolanin Exp $ -->

<!--
 This is the Solr schema file. This file should be named "schema.xml" and
 should be in the conf directory under the solr home
 (i.e. ./solr/conf/schema.xml by default)
 or located where the classloader for the Solr webapp can find it.

 For more information, on how to customize this file, please see
 http://wiki.apache.org/solr/SchemaXml
-->

<schema name="drupal-1.9.6" version="1.2">
    <!-- attribute "name" is the name of this schema and is only used for display purposes.
         Applications should change this to reflect the nature of the search collection.
         version="1.2" is Solr's version number for the schema syntax and semantics.  It should
         not normally be changed by applications.
         1.0: multiValued attribute did not exist, all fields are multiValued by nature
         1.1: multiValued attribute introduced, false by default 
         1.2: omitTermFreqAndPositions attribute introduced, true by default except for text fields.
       -->
  <types>
    <!-- field type definitions. The "name" attribute is
       just a label to be used by field definitions.  The "class"
       attribute and any other attributes determine the real
       behavior of the fieldType.
         Class names starting with "solr" refer to java classes in the
       org.apache.solr.analysis package.
    -->

    <!-- The StrField type is not analyzed, but indexed/stored verbatim.
       - StrField and TextField support an optional compressThreshold which
       limits compression (if enabled in the derived fields) to values which
       exceed a certain size (in characters).
    -->
    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

    <!-- boolean type: "true" or "false" -->
    <fieldType name="boolean" class="solr.BoolField" sortMissingLast="true" omitNorms="true"/>

    <!-- The optional sortMissingLast and sortMissingFirst attributes are
         currently supported on types that are sorted internally as strings.
       - If sortMissingLast="true", then a sort on this field will cause documents
         without the field to come after documents with the field,
         regardless of the requested sort order (asc or desc).
       - If sortMissingFirst="true", then a sort on this field will cause documents
         without the field to come before documents with the field,
         regardless of the requested sort order.
       - If sortMissingLast="false" and sortMissingFirst="false" (the default),
         then default lucene sorting will be used which places docs without the
         field first in an ascending sort and last in a descending sort.
    -->


    <!-- numeric field types that store and index the text
         value verbatim (and hence don't support range queries, since the
         lexicographic ordering isn't equal to the numeric ordering) -->
    <fieldType name="integer" class="solr.IntField" omitNorms="true"/>
    <fieldType name="long" class="solr.LongField" omitNorms="true"/>
    <fieldType name="float" class="solr.FloatField" omitNorms="true"/>
    <fieldType name="double" class="solr.DoubleField" omitNorms="true"/>


    <!--
      Note:
      These should only be used for compatibility with existing indexes (created with older Solr versions)
      or if "sortMissingFirst" or "sortMissingLast" functionality is needed. Use Trie based fields instead.

      Numeric field types that manipulate the value into
      a string value that isn't human-readable in its internal form,
      but with a lexicographic ordering the same as the numeric ordering,
      so that range queries work correctly.
    -->
    <fieldType name="sint" class="solr.SortableIntField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="slong" class="solr.SortableLongField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sfloat" class="solr.SortableFloatField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="sdouble" class="solr.SortableDoubleField" sortMissingLast="true" omitNorms="true"/>

    <!--
     Numeric field types that index each value at various levels of precision
     to accelerate range queries when the number of values between the range
     endpoints is large. See the javadoc for NumericRangeQuery for internal
     implementation details.

     Smaller precisionStep values (specified in bits) will lead to more tokens
     indexed per value, slightly larger index size, and faster range queries.
     A precisionStep of 0 disables indexing at different precision levels.
    -->
    <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tfloat" class="solr.TrieFloatField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tlong" class="solr.TrieLongField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>
    <fieldType name="tdouble" class="solr.TrieDoubleField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>


    <!-- The format for this date field is of the form 1995-12-31T23:59:59Z, and
         is a more restricted form of the canonical representation of dateTime
         http://www.w3.org/TR/xmlschema-2/#dateTime
         The trailing "Z" designates UTC time and is mandatory.
         Optional fractional seconds are allowed: 1995-12-31T23:59:59.999Z
         All other components are mandatory.

         Expressions can also be used to denote calculations that should be
         performed relative to "NOW" to determine the value, ie...

               NOW/HOUR
                  ... Round to the start of the current hour
               NOW-1DAY
                  ... Exactly 1 day prior to now
               NOW/DAY+6MONTHS+3DAYS
                  ... 6 months and 3 days in the future from the start of
                      the current day

         Consult the DateField javadocs for more information.
      -->
    <fieldType name="date" class="solr.DateField" sortMissingLast="true" omitNorms="true"/>

    <!-- A Trie based date field for faster date range queries and date faceting. -->
    <fieldType name="tdate" class="solr.TrieDateField" omitNorms="true" precisionStep="6" positionIncrementGap="0"/>

    <!-- solr.TextField allows the specification of custom text analyzers
         specified as a tokenizer and a list of token filters. Different
         analyzers may be specified for indexing and querying.

         The optional positionIncrementGap puts space between multiple fields of
         this type on the same document, with the purpose of preventing false phrase
         matching across fields.

         For more info on customizing your analyzer chain, please see
         http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters
     -->

    <!-- One can also specify an existing Analyzer class that has a
         default constructor via the class attribute on the analyzer element
    <fieldType name="text_greek" class="solr.TextField">
      <analyzer class="org.apache.lucene.analysis.el.GreekAnalyzer"/>
    </fieldType>
    -->

    <!-- A text field that only splits on whitespace for exact matching of words -->
    <fieldType name="text_ws" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      </analyzer>
    </fieldType>

    <!-- A text field that uses WordDelimiterFilter to enable splitting and matching of
        words on case-change, alpha numeric boundaries, and non-alphanumeric chars,
        so that a query of "wifi" or "wi fi" could match a document containing "Wi-Fi".
        Synonyms and stopwords are customized by external files, and stemming is enabled.
        Duplicate tokens at the same position (which may result from Stemmed Synonyms or
        WordDelim parts) are removed.
        -->
    <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
      <analyzer type="index">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <!-- in this example, we will only use synonyms at query time
        <filter class="solr.SynonymFilterFactory" synonyms="index_synonyms.txt" ignoreCase="true" expand="false"/>
        -->
        <!-- Case insensitive stop word removal.
          add enablePositionIncrements=true in both the index and query
          analyzers to leave a 'gap' for more accurate phrase queries.
        -->
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="1"
                catenateNumbers="1"
                catenateAll="0"
                splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
      <analyzer type="query">
        <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-ISOLatin1Accent.txt"/>
        <tokenizer class="solr.WhitespaceTokenizerFactory"/>
        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
        <filter class="solr.StopFilterFactory"
                ignoreCase="true"
                words="stopwords.txt"
                enablePositionIncrements="true"
                />
        <filter class="solr.WordDelimiterFilterFactory"
                generateWordParts="1"
                generateNumberParts="1"
                catenateWords="0"
                catenateNumbers="0"
                catenateAll="0"
                splitOnCaseChange="1"
                preserveOriginal="1"/>
        <filter class="solr.LowerCaseFilterFactory"/>
        <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt"/>
        <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>


    <!-- Edge N gram type - for example for matching against queries with results 
        KeywordTokenizer leaves input string intact as a single term.
        see: http://www.lucidimagination.com/blog/2009/09/08/auto-suggest-from-popular-queries-using-edgengrams/
   -->
    <fieldType name="edge_n2_kw_text" class="solr.TextField" positionIncrementGap="100">
     <analyzer type="index">
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
       <filter class="solr.EdgeNGramFilterFactory" minGramSize="2" maxGramSize="25" />
     </analyzer>
     <analyzer type="query">
       <tokenizer class="solr.KeywordTokenizerFactory"/>
       <filter class="solr.LowerCaseFilterFactory"/>
     </analyzer>
    </fieldType>
   <!--  Setup simple analysis for spell checking -->
    
   <fieldType name="textSpell" class="solr.TextField" positionIncrementGap="100">
     <analyzer>
       <tokenizer class="solr.StandardTokenizerFactory" />
       <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt"/>
       <filter class="solr.LengthFilterFactory" min="4" max="20" />
       <filter class="solr.LowerCaseFilterFactory" /> 
       <filter class="solr.RemoveDuplicatesTokenFilterFactory" /> 
     </analyzer>
   </fieldType>
  
    <!-- This is an example of using the KeywordTokenizer along
         With various TokenFilterFactories to produce a sortable field
         that does not include some properties of the source text
      -->
    <fieldType name="sortString" class="solr.TextField" sortMissingLast="true" omitNorms="true">
      <analyzer>
        <!-- KeywordTokenizer does no actual tokenizing, so the entire
             input string is preserved as a single token
          -->
        <tokenizer class="solr.KeywordTokenizerFactory"/>
        <!-- The LowerCase TokenFilter does what you expect, which can be
             useful when you want your sorting to be case insensitive
          -->
        <filter class="solr.LowerCaseFilterFactory" />
        <!-- The TrimFilter removes any leading or trailing whitespace -->
        <filter class="solr.TrimFilterFactory" />
        <!-- The PatternReplaceFilter gives you the flexibility to use
             Java Regular expression to replace any sequence of characters
             matching a pattern with an arbitrary replacement string,
             which may include back references to portions of the original
             string matched by the pattern.

             See the Java Regular Expression documentation for more
             information on pattern and replacement string syntax.

             http://java.sun.com/j2se/1.5.0/docs/api/java/util/regex/package-summary.html

        <filter class="solr.PatternReplaceFilterFactory"
                pattern="(^\p{Punct}+)" replacement="" replace="all"
        />
        -->
      </analyzer>
    </fieldType>

    <!-- A random sort type -->
    <fieldType name="rand" class="solr.RandomSortField" indexed="true" />

    <!-- since fields of this type are by default not stored or indexed, any data added to
         them will be ignored outright
     -->
    <fieldtype name="ignored" stored="false" indexed="false" class="solr.StrField" />
    <!-- NUTCH -->
    <fieldType name="url" class="solr.TextField" positionIncrementGap="100">
      <analyzer>
         <tokenizer class="solr.StandardTokenizerFactory"/>
         <filter class="solr.LowerCaseFilterFactory"/>
         <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
      </analyzer>
    </fieldType>
    <!-- NUTCH -->
 </types>


 <fields>
   <!-- Valid attributes for fields:
     name: mandatory - the name for the field
     type: mandatory - the name of a previously defined type from the <types> section
     indexed: true if this field should be indexed (searchable or sortable)
     stored: true if this field should be retrievable
     compressed: [false] if this field should be stored using gzip compression
       (this will only apply if the field type is compressable; among
       the standard field types, only TextField and StrField are)
     multiValued: true if this field may contain multiple values per document
     omitNorms: (expert) set to true to omit the norms associated with
       this field (this disables length normalization and index-time
       boosting for the field, and saves some memory).  Only full-text
       fields or fields that need an index-time boost need norms.
   -->

<!-- The document id is derived from a site-specific key (hash) and the node ID like:
     $document->id = $hash . '/node/' . $node->nid; -->

   <field name="id" type="string" indexed="true" stored="true" required="true" />

<!-- These are the fields that correspond to a Drupal node. The beauty of having
     Lucene store title, body, type, etc., is that we retrieve them with the search
     result set and don't need to go to the database with a node_load. -->

   <field name="site" type="string" indexed="true" stored="true"/>
   <field name="hash" type="string" indexed="true" stored="true"/>
   <field name="url" type="string" indexed="true" stored="true"/>
   <field name="title" type="text" indexed="true" stored="true" termVectors="true" omitNorms="true"/>
   <field name="sort_title" type="sortString" indexed="true" stored="false"/>
   <field name="body" type="text" indexed="true" stored="true" termVectors="true"/>
   <field name="teaser" type="text" indexed="false" stored="true"/>
   <!-- entity is 'node', 'file', 'user', or some other Drupal object type -->
   <field name="entity" type="string" indexed="true" stored="true"/>
   <!-- type is a node type, or can be used flexibly for other entity types -->
   <field name="type" type="string" indexed="true" stored="true"/>
   <field name="type_name" type="string" indexed="true" stored="true"/>
   <field name="path" type="string" indexed="true" stored="true"/>
   <field name="path_alias" type="text" indexed="true" stored="true" termVectors="true"/>
   <field name="uid"  type="integer" indexed="true" stored="true"/>
   <field name="name" type="text" indexed="true" stored="true" termVectors="true"/>
   <field name="sname" type="string" indexed="true" stored="false"/>
   <field name="sort_name" type="sortString" indexed="true" stored="false"/>
   <field name="created" type="date" indexed="true" stored="true"/>
   <field name="changed" type="date" indexed="true" stored="true"/>
   <field name="last_comment_or_change" type="date" indexed="true" stored="true"/>
   <field name="nid"  type="integer" indexed="true" stored="true"/>
   <field name="status" type="boolean" indexed="true" stored="true"/>
   <field name="promote" type="boolean" indexed="true" stored="true"/>
   <field name="moderate" type="boolean" indexed="true" stored="true"/>
   <field name="sticky" type="boolean" indexed="true" stored="true"/>
   <field name="tnid"  type="integer" indexed="true" stored="true"/>
   <field name="translate" type="boolean" indexed="true" stored="true"/>
   <field name="language" type="string" indexed="true" stored="true"/>
   <field name="comment_count" type="integer" indexed="true" stored="true"/>
   <field name="tid"  type="integer" indexed="true" stored="true" multiValued="true"/>
   <field name="vid"  type="integer" indexed="true" stored="true" multiValued="true"/>
   <field name="taxonomy_names" type="text" indexed="true" stored="false" termVectors="true" multiValued="true" omitNorms="true"/>
   <!-- The string version of the title is used for sorting -->
   <copyField source="title" dest="sort_title"/>
   <!-- The string versions of the name used for sorting/multi-site facets -->
   <copyField source="name" dest="sname"/>
   <copyField source="name" dest="sort_name"/>
   <!-- Copy terms to a single field that contains all taxonomy term names -->
   <copyField source="ts_vid_*" dest="taxonomy_names"/>
  
   <!-- A set of fields to contain text extracted from tag contents which we
        can boost at query time. -->
   <field name="tags_h1" type="text" indexed="true" stored="false" omitNorms="true"/>
   <field name="tags_h2_h3" type="text" indexed="true" stored="false" omitNorms="true"/>
   <field name="tags_h4_h5_h6" type="text" indexed="true" stored="false" omitNorms="true"/>
   <field name="tags_a" type="text" indexed="true" stored="false" omitNorms="true"/>
   <!-- Inline tags are typically u, b, i, em, strong -->
   <field name="tags_inline" type="text" indexed="true" stored="false" omitNorms="true"/>

   <!-- Here, default is used to create a "timestamp" field indicating
        when each document was indexed.-->
   <field name="timestamp" type="date" indexed="true" stored="true" default="NOW" multiValued="false"/>

	<!-- This field is used to build the spellchecker index -->
   <field name="spell" type="textSpell" indexed="true" stored="true" multiValued="true"/>
  
  <!-- copyField commands copy one field to another at the time a document
        is added to the index.  It's used either to index the same field differently,
        or to add multiple fields to the same field for easier/faster searching.  -->
   <copyField source="title" dest="spell"/>
   <copyField source="body" dest="spell"/>

   <!-- Dynamic field definitions.  If a field name is not found, dynamicFields
        will be used if the name matches any of the patterns.
        RESTRICTION: the glob-like pattern in the name attribute must have
        a "*" only at the start or the end.
        EXAMPLE:  name="*_i" will match any field ending in _i (like myid_i, z_i)
        Longer patterns will be matched first.  if equal size patterns
        both match, the first appearing in the schema will be used.  -->

   <dynamicField name="is_*"  type="integer" indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="im_*"  type="integer" indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="sis_*" type="sint"    indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="sim_*" type="sint"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="sm_*"  type="string"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="tm_*"  type="text"    indexed="true"  stored="true" multiValued="true" termVectors="true"/>
   <dynamicField name="ss_*"  type="string"    indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="ts_*"  type="text"    indexed="true"  stored="true" multiValued="false" termVectors="true"/>
   <dynamicField name="tsen2k_*" type="edge_n2_kw_text" indexed="true" stored="true" multiValued="false" omitNorms="true" omitTermFreqAndPositions="true" />
   <dynamicField name="ds_*" type="date"    indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="dm_*" type="date"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="tds_*" type="tdate"    indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="tdm_*" type="tdate"    indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="bm_*"  type="boolean" indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="bs_*"  type="boolean" indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="fs_*"  type="sfloat"  indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="fm_*"  type="sfloat"  indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="ps_*"  type="sdouble" indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="pm_*"  type="sdouble" indexed="true"  stored="true" multiValued="true"/>

   <dynamicField name="tis_*"  type="tint"  indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="tim_*"  type="tint"  indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="tls_*"  type="tlong" indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="tlm_*"  type="tlong" indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="tfs_*"  type="tfloat"  indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="tfm_*"  type="tfloat"  indexed="true"  stored="true" multiValued="true"/>
   <dynamicField name="tps_*"  type="tdouble" indexed="true"  stored="true" multiValued="false"/>
   <dynamicField name="tpm_*"  type="tdouble" indexed="true"  stored="true" multiValued="true"/>
   <!-- Sortable version of the dynamic string field -->
   <dynamicField name="sort_ss_*" type="sortString" indexed="true" stored="false"/>
   <copyField source="ss_*" dest="sort_ss_*"/>
  <!-- A random sort field -->
   <dynamicField name="random_*" type="rand" indexed="true" stored="true"/>
   <!-- This field is used to store node access records, as opposed to CCK field data -->
   <dynamicField name="nodeaccess*" type="integer" indexed="true" stored="false" multiValued="true"/>

   <!-- The following causes solr to ignore any fields that don't already match an existing
        field name or dynamic field, rather than reporting them as an error.
        Alternately, change the type="ignored" to some other type e.g. "text" if you want
        unknown fields indexed and/or stored by default -->
   <dynamicField name="*" type="ignored" multiValued="true" />
   
   
   
   <!-- BACKWARDS COMPATIBILITY -->
   <!-- Here is where we store fields which are no longer used -->
   
   <!-- Fields previously used for sorting -->
   <field name="stitle" type="string" indexed="true" stored="true"/>
   <field name="title_sort" type="sortString" indexed="true" stored="false"/>

   <field name="name_sort" type="sortString" indexed="true" stored="false"/>
    
   <!-- NUTCH -->
   <field name="segment" type="string" stored="true" indexed="false"/>
   <field name="digest" type="string" stored="true" indexed="false"/>
   <field name="boost" type="float" stored="true" indexed="false"/>
   <field name="host" type="url" stored="false" indexed="true"/>
   <field name="content" type="text" stored="false" indexed="true"/>
   <field name="cache" type="string" stored="true" indexed="false"/>
   <field name="tstamp" type="long" stored="true" indexed="false"/>
   <field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
   <field name="contentLength" type="long" stored="true" indexed="false"/>
   <field name="lastModified" type="long" stored="true" indexed="false"/>
   <field name="date" type="string" stored="true" indexed="true"/>
   <field name="lang" type="string" stored="true" indexed="true"/>
   <field name="subcollection" type="string" stored="true" indexed="true" multiValued="true"/>
   <field name="author" type="string" stored="true" indexed="true"/>
   <field name="tag" type="string" stored="true" indexed="true"/>
   <field name="feed" type="string" stored="true" indexed="true"/>
   <field name="publishedDate" type="string" stored="true" indexed="true"/>
   <field name="updatedDate" type="string" stored="true" indexed="true"/>
   <!-- NUTCH -->
   <!-- /BACKWARDS COMPATIBILITY -->
<mapping>
        <!-- Simple mapping of fields created by Nutch IndexingFilters
             to fields defined (and expected) in Solr schema.xml.

             Any fields in NutchDocument that match a name defined
             in field/@source will be renamed to the corresponding
             field/@dest.
             Additionally, if a field name (before mapping) matches
             a copyField/@source then its values will be copied to
             the corresponding copyField/@dest.

             uniqueKey has the same meaning as in Solr schema.xml
             and defaults to "id" if not defined.
         -->
        <fields>
                <field dest="site" source="site"/>
                <field dest="title" source="title"/>
                <field dest="host" source="host"/>
                <field dest="segment" source="segment"/>
                <field dest="boost" source="boost"/>
                <field dest="digest" source="digest"/>
                <field dest="tstamp" source="tstamp"/>
                <field dest="id" source="url"/>
                <field dest="body" source="content"/>
                <copyField source="url" dest="url"/>
        </fields>
        <uniqueKey>id</uniqueKey>
</mapping>
 </fields>

 <!-- Field to use to determine and enforce document uniqueness.
      Unless this field is marked with required="false", it will be a required field
   -->
 <uniqueKey>id</uniqueKey>

 <!-- field for the QueryParser to use when an explicit fieldname is absent -->
 <defaultSearchField>body</defaultSearchField>

 <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
 <solrQueryParser defaultOperator="AND"/>

</schema>
karljohann’s picture

@broncomania: In all fairness, this module is still in alpha, although it has perhaps been so for a while now. You can essentially get all the information you need to install this module successfully in this very thread. If you have any problems, then you should be very specific about what the problem is and you'll probably get help fixing it.

@maxmmize: The nutch/crawl folder is in the nutch installation folder. For example /usr/local/nutch/crawl. The right permissions would be any that allow the runbot to write into the folder, so if you're running apache then setting the owner to apache is probably your best bet and then setting the permissions to, for example, 755.
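
A minimal sketch of that permission setup, for illustration only. The directory here is a throwaway created with mktemp so the commands can be tried safely; on a real system it would be the crawl folder under the Nutch installation (for example /usr/local/nutch/crawl). The web server user name (apache, www-data, ...) is an assumption and varies by distribution.

```shell
#!/bin/sh
# Stand-in for the Nutch installation folder (hypothetical; e.g. /usr/local/nutch).
NUTCH_HOME="$(mktemp -d)"
CRAWL_DIR="$NUTCH_HOME/crawl"

# Create the crawl folder and make it writable by its owner,
# readable and traversable by everyone else (rwxr-xr-x).
mkdir -p "$CRAWL_DIR"
chmod 755 "$CRAWL_DIR"

# On a real server the owner should be the user that runs the runbot,
# typically the web server user. This needs root, so it is commented out:
# chown -R apache:apache "$CRAWL_DIR"

# Show the resulting mode and owner.
ls -ld "$CRAWL_DIR"
```

If the owner shown by ls is not the user the web server runs as, the runbot started from the admin UI will probably not be able to write into the folder.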

broncomania’s picture

Yes, I know that this is alpha. I am also doing my very best to fix problems and find solutions. I have been searching Google for several days over small problems during the installation, and I assumed others were running into the same problems. I am also willing to help develop this module so that it can maybe reach a second alpha version. See http://drupal.org/node/950766 or http://drupal.org/node/950722 or my question above. These are my problems for the moment... I am sure the next ones will come. See my posting above; I think I was writing it at the same moment you wrote your comment.

By the way, karljohann, I read in http://drupal.org/node/811062#comment-3153582 that you already got it working, so maybe you can review the schema.xml I posted.

Obviously I am completely confused now. I have read this thread again and again and I still don't get it.
I will start with what I understand and what I did.

1. Solr is running with the original schema.xml provided by the Drupal ApacheSolr module.
Then I made the changes in the XML posted above. Is this right??? Or should I use the original ApacheSolr schema without any changes?

2. Now, after reading this complete thread again and again, I think I have to extend the Nutch schema with the mapping posted here by dstuart. I am completely confused. If I am right, I just have to copy the solrindex-mapping into the Nutch schema in front of the

Just to make it clear: can someone post the changed Nutch and ApacheSolr schemas here? Just to see where to make the changes and to understand why you made them.

I hope my problem is clear now.

Thanx in advance

dstuart’s picture

Hey all,

As this thread is getting really long and will probably have conflicting information, I will try to consolidate the how-to bits into a readme file with the module. Further to that, I will try to make the module a little more user friendly with some basic checks on your Nutch setup. I have some time tomorrow, so let's see where I can get to.

Broncomania, hopefully I'll answer all your questions in the readme.

Cheers

Dave

broncomania’s picture

Oh man, great!! I hope you see my other issues and can integrate the fix for the notice and, last but not least, the

#!/bin/sh
#!/bin/bash

problem in the runbot.

Did I already say thank you?? Really a lovely module. I hope I can help a little bit in the development process.

maxmmize’s picture

@broncomania - geez, thanks! How did I miss that post!?

I added the values in step one to the xml. Added the folders and files. chown and chmod etc.

What do I have to put in for the http.agent.name and the other values?

maxmmize’s picture

Consolidated my post to below. Sorry.

broncomania’s picture

Okay, update: I got it running!! It's so easy if you know what you are doing. I will explain it further in my post here: http://drupal.org/node/950766

@dstuart Yes, the documentation is a really helpful idea. Now that I have it running, I see my mistakes.

maxmmize’s picture

For search purposes I am leaving my issues here; others may benefit. Any help with my Outstanding Errors would be appreciated.



Fixed Errors:

ERROR crawl.Injector - Injector: java.io.IOException: Not a file: file:/home/xxxx/lib/nutch/seed/urls
fixed by rm -r urls and creating a file called urls

Input path does not exist: file:/home/nolosear/lib/nutch/crawl/linkdb/current
fixed by creating dir current

[Sun Oct 24 21:27:57 2010] [error] [client X.x.x.x] sh: /home/xxxx/public_html/modules/nutch/runbot: Permission denied, referer: http://xxxxx.com/admin/settings/nutch/crawl
Fixed with chmod 755 on the runbot file

[Sun Oct 24 21:45:49 2010] [error] [client xxxx] sh: /home/xxxx/public_html/modules/nutch/runbot: /bin/bash^M: bad interpreter: No such file or directory, referer: http://xxxx.com/admin/settings/nutch/crawl
Fixed by editing the file in vi: type :set fileformat=unix and press Enter, then type :wq! and press Enter
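
For anyone who wants to try these fixes without touching a real install, here is a rough sketch under stated assumptions: the working directory is a temporary stand-in for the Nutch home, the runbot file is a two-line dummy (not the module's real script), and tr is used in place of the vi command to strip the Windows line endings.

```shell
#!/bin/sh
# Temporary stand-in for the Nutch home directory (hypothetical).
WORK="$(mktemp -d)"
SEED_DIR="$WORK/seed"
mkdir -p "$SEED_DIR"

# The seed list must be a regular FILE named "urls", not a directory.
printf '%s\n' 'http://www.example.com/' > "$SEED_DIR/urls"

# Simulate a script that was saved with Windows (CRLF) line endings,
# which is what causes the "/bin/bash^M: bad interpreter" message.
printf '#!/bin/sh\r\necho ok\r\n' > "$WORK/runbot"

# Strip the carriage returns (same effect as :set fileformat=unix in vi).
tr -d '\r' < "$WORK/runbot" > "$WORK/runbot.unix"
mv "$WORK/runbot.unix" "$WORK/runbot"

# Make the script executable for the web server user as well.
chmod 755 "$WORK/runbot"

# Run it through sh to confirm the output no longer carries a stray CR.
RESULT="$(sh "$WORK/runbot")"
printf '%s\n' "$RESULT"
```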


Debunking Madness:
Unable to search crawled and indexed content in Solr
Turn off Apache Solr node access under Modules



Perplexing Solutions:

2010-10-24 23:15:01,551 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
Add more than one URL into your Nutch Module for crawling



Outstanding Errors:

2010-10-24 22:56:41,324 ERROR crawl.Generator - Generator: java.io.IOException: lock file /home/xxxxx/lib/nutch/crawl/crawldb/.locked already exists.

and

2010-10-24 22:56:48,026 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/xxxx/lib/nutch/crawl/segments/*/crawl_fetch matches 0 files

and

2010-10-24 22:56:45,882 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/xxxx/lib/nutch/crawl/segments/*/parse_data matches 0 files

and

ls: /home/xxxx/lib/nutch/crawl/segments/*: No such file or directory, referer: http://xxxxx.com/admin/settings/nutch/crawl

and

/home/xxxx/public_html/modules/nutch/runbot: line 88: /home/nolosear/lib/nutch/seed/urls: Permission denied, referer: http://xxxxx.com/admin/settings/nutch/crawl

Any help would be appreciated.
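
A small diagnostic sketch for the first few errors above, offered as an assumption-laden helper rather than anything shipped with the module. The function name diagnose_crawl_dir is made up for this example, and the crawl directory is a temporary stand-in. It only reports what it finds; a leftover .locked file is usually safe to delete by hand only when no Nutch job is still running.

```shell
#!/bin/sh
# Hypothetical helper: report a leftover crawldb lock and an empty segments folder.
diagnose_crawl_dir() {
  crawl="$1"
  if [ -e "$crawl/crawldb/.locked" ]; then
    echo "lock file present: $crawl/crawldb/.locked"
  fi
  # ls -A prints nothing for an empty (or missing) directory.
  if [ -z "$(ls -A "$crawl/segments" 2>/dev/null)" ]; then
    echo "no segments under $crawl/segments"
  fi
  return 0
}

# Demo on a throwaway directory that mimics the failing state.
CRAWL="$(mktemp -d)"
mkdir -p "$CRAWL/crawldb" "$CRAWL/segments"
: > "$CRAWL/crawldb/.locked"

REPORT="$(diagnose_crawl_dir "$CRAWL")"
printf '%s\n' "$REPORT"
```

An empty segments folder means the generate/fetch steps produced nothing, which would explain why the later steps complain that segments/* matches 0 files.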

broncomania’s picture

[Sun Oct 24 22:08:28 2010] [error] [client xxxx] /home/xxxx/public_html/modules/nutch/runbot: line 88: /home/xxxx/lib/nutch/seed/urls: Is a directory, referer: http://nolosearch.com/admin/settings/nutch/crawl

This looks like you created a folder named urls? That is not what I described in my posting; it is a file.

/opt/nutch-1.2/seed/urls << the urls is a file !!!!

Can you post the command that you use for calling the runbot?

If you use the command line, it should look like this:
/home/YOURUSER/public_html/sites/all/modules/nutch/runbot -n '/opt/nutch' -j '/usr/lib/jvm/java-6-sun' -s 'http://localhost:8080/solr' -u 'http://www.example.com!http://www.example1.com' -c '1' -f '100' -d '0'
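
Note that in the command above the seed URLs are passed to -u as one string joined with "!". A tiny sketch of building that argument from a list follows; join_seed_urls is a made-up helper name for this example and is not part of the runbot.

```shell
#!/bin/sh
# Hypothetical helper: join all arguments with "!" as the -u option expects.
join_seed_urls() {
  joined=""
  for url in "$@"; do
    if [ -z "$joined" ]; then
      joined="$url"
    else
      joined="$joined!$url"
    fi
  done
  printf '%s\n' "$joined"
}

SEEDS="$(join_seed_urls 'http://www.example.com' 'http://www.example1.com')"

# Print the option as it would appear on the runbot command line.
echo "-u '$SEEDS'"
```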

broncomania’s picture

I also got these error messages!! I didn't find a solution for these problems!!
My solution was rm -fr /nutch-1.2
and then tar xzf apache-nutch-1.2-bin.tar.gz, starting from scratch until I made no mistakes during the configuration. Sorry, but that was the way for me... maybe someone else can give you and me a hint!! At least it worked for me. Don't make mistakes :-)

PS: I updated my posting with all the configuration that is necessary for a running Ubuntu system. Maybe this helps you again.

maxmmize’s picture

Hm, well. I made the changes as noted and added a second URL to the nutch and it crawled. I crawled 1000 URLS in an hour.

Solr now has 4535 documents in index but I can't get them to display in the search. :-\

dstuart’s picture

Hey maxmmize,

As per comment #33 do you have solr access control turned on?

Rgds,

Dave

maxmmize’s picture

@dstuart Thank you very much. I did. I do remember seeing that post and I should have made a note to revisit it. I turned it off and now I can search the engine.

broncomania’s picture

So, now that I have it running under Ubuntu 10.04 (http://drupal.org/node/950766), I come to my needs and questions.

1.: I developed a module in which users can add one or several domains to their profile, which should get crawled after being added. I also created my own node type called domain. Is it possible to tag the crawled domains with the Drupal uid, so that I can search only the websites of this user? I have to build a connection between the domain and the user. Okay, I could grab the search result URL and compare it with the information stored in my database to find the owner of the domain, but I would prefer to tag the domain. Is this possible???

2.: I already read in #6 that someone asked whether it is possible to give the crawled pages their own node type, in my case domain. How exactly can I do this, dstuart? Have you got a code example for this?

3.: If a user creates an account and adds one domain, for example, how can I add it to seed/urls so that this domain gets crawled instantly or, if the crawler is already running, is put in the waiting queue?

4.: If a user deletes his domain, how can I delete this domain from the search index?

Help and ideas are needed.

Thank you for reading :-)

maxmmize’s picture

After I received a memory error I increased my mem_limit from 200megs to 2gigs.

My crawler stopped working. My logs show:

[Mon Oct 25 11:49:51 2010] [error] [client xxxx] /home/nolosear/public_html/modules/nutch/runbot: line 88: /home/nolosear/lib/nutch/seed/urls: Permission denied, referer: http://nolosearch.com/admin/settings/nutch/crawl

I ran ls -la

drwxrwxrwx  2 nolosear nolosear 4096 Oct 24 23:09 ./
drwxrwxrwx 12 nolosear nolosear 4096 Oct 25 03:40 ../
-rwxr-xr-x  1 root     root       24 Oct 24 23:40 urls*

Should urls be owned by nolosear:nolosear? Do I have the right permissions? Do you have any idea why my crawler would just stop working?

Below is my debug:

http://www.oyez.org/
Starting Nutch Crawl.
/home/nolosear/public_html/modules/nutch/runbot -n '/home/nolosear/lib/nutch' -j '/usr' -s 'http://localhost:8983/solr' -u 'http://www.oyez.org/' -c '1' -f '50' -d '1'
Nutch Home: /home/nolosear/lib/nutch
JAVA HOME: /usr
URLs to Crawl: 50
Commit to Solr: 1
Solr URL: http://localhost:8983/solr
Seed URL: http://www.oyez.org/
Debug: 1
-- Seed Urls --
http://www.oyez.org/
runbot: /home/nolosear/public_html/modules/nutch/runbot found environment variable NUTCH_HOME=/home/nolosear/lib/nutch
----- Inject (Step 1 of 5) -----
/home/nolosear/lib/nutch/bin/nutch inject /home/nolosear/lib/nutch/crawl/crawldb /home/nolosear/lib/nutch/seed
----- Generate, Fetch, Parse, Update (Step 2 of 5) -----
--- Beginning crawl at depth 1 of 2 ---
/home/nolosear/lib/nutch/bin/nutch generate /home/nolosear/lib/nutch/crawl/crawldb /home/nolosear/lib/nutch/crawl/segments --topN -adddays 5
/home/nolosear/lib/nutch/bin/nutch fetch -threads 50
/home/nolosear/lib/nutch/bin/nutch updatedb /home/nolosear/lib/nutch/crawl/crawldb
--- Beginning crawl at depth 2 of 2 ---
/home/nolosear/lib/nutch/bin/nutch generate /home/nolosear/lib/nutch/crawl/crawldb /home/nolosear/lib/nutch/crawl/segments --topN -adddays 5
/home/nolosear/lib/nutch/bin/nutch fetch -threads 50
/home/nolosear/lib/nutch/bin/nutch updatedb /home/nolosear/lib/nutch/crawl/crawldb
----- Merge Segments (Step 3 of 5) -----
/home/nolosear/lib/nutch/bin/nutch mergesegs /home/nolosear/lib/nutch/crawl/MERGEDsegments /home/nolosear/lib/nutch/crawl/segments/*
mv -v /home/nolosear/lib/nutch/crawl/MERGEDsegments/* /home/nolosear/lib/nutch/crawl/segments
rmdir /home/nolosear/lib/nutch/crawl/MERGEDsegments
----- Invert Links (Step 4 of 5) -----
/home/nolosear/lib/nutch/bin/nutch invertlinks /home/nolosear/lib/nutch/crawl/linkdb /home/nolosear/lib/nutch/crawl/segments/*
----- Index and send to solr (Step 5 of 5) -----
/home/nolosear/lib/nutch/bin/nutch solrindex http://localhost:8983/solr /home/nolosear/lib/nutch/crawl/crawldb /home/nolosear/lib/nutch/crawl/linkdb /home/nolosear/lib/nutch/crawl/segments/*
runbot: FINISHED: Crawl completed!
maxmmize’s picture

So, I chowned the file urls to nolosear:nolosear and I don't receive that error anymore. I still get these errors:

[Mon Oct 25 19:36:33 2010] [error] [client] rmdir: /home/nolosear/lib/nutch/crawl/MERGEDsegments: No such file or directory, referer: http://xxxxx.com/admin/settings/nutch/crawl
[Mon Oct 25 19:36:33 2010] [error] [client] mv: cannot stat `/home/nolosear/lib/nutch/crawl/MERGEDsegments/*': No such file or directory, referer: http://xxxx.com/admin/settings/nutch/crawl
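
These two messages come from the mv and rmdir in the merge step running even though mergesegs produced no MERGEDsegments folder, so they look like a symptom of an earlier step fetching nothing rather than a problem of their own. A guarded version of that step might look like the sketch below. This is not the module's actual runbot code; promote_merged_segments is a hypothetical name and the crawl directory is a temporary stand-in.

```shell
#!/bin/sh
# Hypothetical guard: only move merged segments when the merge produced some.
promote_merged_segments() {
  crawl_dir="$1"
  merged="$crawl_dir/MERGEDsegments"
  if [ -d "$merged" ] && [ -n "$(ls -A "$merged" 2>/dev/null)" ]; then
    mv "$merged"/* "$crawl_dir/segments/" && rmdir "$merged"
  else
    echo "no merged segments in $merged; skipping mv/rmdir" >&2
    return 1
  fi
}

# Demo on a throwaway crawl directory.
CRAWL="$(mktemp -d)"
mkdir -p "$CRAWL/segments"

# First call: nothing was merged, so the guard skips the move.
promote_merged_segments "$CRAWL" || echo "nothing to promote"

# Second call: pretend mergesegs created one segment folder.
mkdir -p "$CRAWL/MERGEDsegments/20101025193000"
promote_merged_segments "$CRAWL" && echo "promoted"
```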

maxmmize’s picture

Filter URLs: *

I see the example, but what exactly does it do in the Nutch module?

broncomania’s picture

Has anyone found out how to map the Nutch n-gram language information to the Solr index?
I tried to add this information to the mapping, but Solr ignores it!

maxmmize’s picture

<field dest="host" source="host"/>
<field dest="segment" source="segment"/>
<field dest="boost" source="boost"/>
<field dest="digest" source="digest"/>
<field dest="tstamp" source="tstamp"/>

Where exactly do these go in the schema.xml file?

robertdouglass’s picture

FIXED: ERROR solr.DrupalSolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/robert/lib/nutch_1_2/crawl/segments/*/crawl_fetch matches 0 files

This error was solved for me by checking my seed URLs (q=admin/settings/nutch/seed) and making sure they are prefixed with http://

For example, when I had example.com I got the above error, but updating that to http://example.com fixed it and I was able to crawl.
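
A quick way to catch this before starting a crawl is to scan the seed list for entries without a scheme. The sketch below is only an illustration: check_seed_urls is a made-up helper, and the seed file is a temporary file rather than the module's real seed/urls.

```shell
#!/bin/sh
# Hypothetical check: report seed entries that lack an http:// or https:// prefix.
check_seed_urls() {
  bad=0
  while IFS= read -r line || [ -n "$line" ]; do
    case "$line" in
      '') ;;                               # ignore empty lines
      http://*|https://*) ;;               # looks fine
      *) echo "missing scheme: $line"; bad=1 ;;
    esac
  done < "$1"
  return "$bad"
}

# Demo with one bad and one good entry.
SEEDS="$(mktemp)"
printf '%s\n' 'example.com' 'http://example.com' > "$SEEDS"

REPORT="$(check_seed_urls "$SEEDS" || true)"
printf '%s\n' "$REPORT"
```

The function returns a non-zero status when any entry is flagged, so it could also be used as a gate in front of the crawl.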

I'm going to add some text to the documentation to clarify this.

robertdouglass’s picture

Attached file: status new, size 810 bytes

Here's the documentation change I just committed.

robertdouglass’s picture

Take that back ... hasn't been committed yet.

dstuart’s picture

Hey Robert,

I have committed patch #64.

Regards,

Dave

broncomania’s picture

I have Nutch working right now and I found a problem. It's about the type field (node, story, ...): Nutch doesn't fill it with a default value or anything similar. This is not a problem if you use the standard installation. I extended mine with the apachesolr multilanguage module. It still works, but if I change the language from German to English I get several errors from check_plain complaining that it received an array. So I investigated, and the reason was that Nutch doesn't set the type.

So I extended the Solr schema.xml: in the type field I added default="nutch" and the errors are gone. I now think this is not the smartest way to fix it. Is it possible to extend the Solr mapping with the type field and submit a standard value like "nutch" or something else? That would be much better than setting a default value.

Any ideas are welcome

Bronco

d0t101101’s picture

Thanks broncomania,

Your suggestion here (#67) was helpful. I ran into the same problem, but I am only using the Apache Solr and Nutch modules (not the multilanguage module you mentioned). I had to customize the schema.xml further for Nutch's Solr indexing step to complete successfully; it kept complaining about non-multiValued fields.

In my case, specifying the 'default' value for this field wasn't enough; I also had to change the type field to allow multiple values (multiValued="true"). I'm uncertain what other implications this change may have, though.

The data is now being indexed by Solr, but is not searchable via Drupal. To help me diagnose this, would you kindly share the schema.xml you're using today? Is it much different from your earlier post (#43)?

Best regards,
.

ssedume’s picture

Hi buddy, did you ever manage to make this work?

mac_perlinski’s picture

As far as I can see, NUTCH_VIRTUAL_NODE_PATH is followed by the value of the digest field. I also read your comment in the code that this part still needs to be done.

If we want to create NUTCH_VIRTUAL_NODE_PATH, for example:
http://example.com/nutchnode/59a5ec5b86f2fd6552f8433bba963089 where NUTCH_VIRTUAL_NODE_PATH is replaced by a nutchnode callback, we need to change the digest field to be indexed so that we can retrieve the document with a given digest key.

What's the status of this feature?

avpaderno’s picture

Category: Support request » Bug report
Issue summary: View changes
Status: Active » Closed (outdated)

I am closing this issue, since Drupal 6 isn't supported anymore.