I've been working on setting up Solr and Nutch for the past couple days and I seem to finally have the nutch crawler data working with Drupal.
My problem is that I can't seem to get the admin UI working for the Nutch crawler. I have to execute the crawl commands manually through the Linux terminal. Is there any special configuration I need to have? When I hit "Start crawl" in debug mode it says everything completed successfully, but no data was fetched.
I've set the proper paths to Nutch and Java.
I merged the schema.xml required for Nutch and the schema.xml required for Drupal. I set up some copyFields to store the data in the same way that Drupal's nodes are stored, and changed the uniqueKey to be the URL because Nutch doesn't send an id. Is there anything else required?
Also, when you view the search results the url for the data retrieved from Nutch has NUTCH_VIRTUAL_NODE_PATH instead of...what's supposed to be there?
Thanks
| Comment | File | Size | Author |
|---|---|---|---|
| #64 | seeds.patch | 810 bytes | robertdouglass |
Comments
Comment #1
dstuart commented
Hi,
Yep, looks like a bug. I will try to get to that soon. Re NUTCH_VIRTUAL_NODE_PATH: that is supposed to be the URL of the site you have crawled. That's a bug too, so I will try to fix both.
Cheers,
Dave
Comment #2
karljohann commented
"I merged the schema.xml required for Nutch and the schema.xml required for Drupal. Setup some copyfields to store the data in the same way that Drupal's nodes are and changed the uniqueKey to be the URL because Nutch doesn't send an id. Is there anything else required?"
Are these things required to make this work? Is there any documentation on the way? Just very basic and/or raw instructions would do.
Comment #3
dnett123 commented
I couldn't make it work properly without merging the two schema.xml files. I'm not sure if I did it properly, but I was able to get it to the point that Nutch is working with Solr and the search results are showing up in Drupal.
I had another question: is it possible to have those Nutch pages created as their own content type ("external", for example)?
Comment #4
karljohann commented
It seems to be working now that I've merged the schema.xml files. Thanks for that.
I guess I'm just gonna have to study Nutch a bit more to get an understanding of what it is that I'm doing though :)
Comment #5
dstuart commented
Hey, I wrote a patch for Nutch a while back that has been released in Nutch 1.1. You can use the conf/solrindex-mapping.xml file in Nutch; mine looks like:
This means that you don't have to use copyField, which means you can have a merged Nutch and Drupal index.
Comment #6
dstuart commented
"I had another question is it possible to have those nutch pages created as their own content type (external for example)?"
I've been thinking about this one for a while now and think you could do it one of two ways.
- You could have a node which is literally a reference to a Solr document and would hold little information other than its id.
- You could write an import from Solr that effectively reverses the indexing process to create nodes. I had to do this one time when my database got corrupted and I re-imported all of the data from my Solr instance.
Comment #7
karljohann commented
One question: is there any way of knowing whether the module itself is working? When I press "Start crawl" I just get
* Starting Nutch Crawl.
* 0
Which isn't very informative and I assume means that it didn't work. When I however use the code from the dry run/debug crawl it works fine. (Except for one thing, the segments folder from --- Beginning crawl at depth 2 of 2 --- is the same as 1 of 2, but if I correct that then it's fine.)
Comment #8
dstuart commented
You should see the Nutch process in your process list, but I'll have a chance to look at this tomorrow and try to roll up a bunch of bug fixes I've been doing.
Regards,
David
Comment #9
ataneja commented
Hi!
Can anyone please tell me exactly what needs to be done to show Nutch results in Drupal?
How do I need to merge the schema files? Do I need to copy and paste the above patch provided by David Stuart into the Nutch schema.xml?
What else needs to be done?
I have one more doubt: it seems we have to change the schema of the Solr server twice.
Firstly, we need to copy the Nutch schema to the Solr schema so that the index from Nutch gets transferred to Solr.
Secondly, we need to copy the schema of the Solr module to the schema of the Solr server so that the Solr module can connect to the Solr server.
Is there any other way out? Please reply.
Thanks
Comment #10
ataneja commented
Hey karljohann and dnett123, can you tell me exactly what you did to get it to work?
Please reply. Thanks
Comment #11
dstuart commented
Hi ataneja,
Could you outline what you have done so far?
The rough steps are as follows (this assumes you are running on Linux):
Get Nutch 1.1 from nutch.apache.org and unpack it to a location of your choice.
In the Nutch source, under the conf directory, copy the mapping above into solrindex-mapping.xml. This allows you to do mapping against the Solr schema,
e.g. the Nutch content field data will be copied into the body field in Solr.
Obviously you will have to have Solr installed with the Drupal schema.xml in place. All of the fields are in that schema except for those listed below, which you will have to add.
If you can provide a little more information on what's going wrong, then I can expand this step by step and add it to the module install instructions.
Regards,
David
Comment #12
ataneja commented
First of all I must say that you are doing a wonderful job in developing the Nutch module for Drupal.
coming to my problem.
See I have installed nutch 1.1, crawled some websites.
Sent the index to solr server (which I have installed at a dedicated server).
Then I installed solr module in drupal which was able to communicate with the above solr server.
Now, the problem is that the index which was sent from the nutch is not showing in drupal search results via solr module.
And I believe the problem is that Nutch and the Solr module index their data differently.
So, I don't know how to make them compatible. I guess I need to merge the schemas and add some copyFields, but I don't know what exactly needs to be done. Please tell me as soon as possible.
Thanx in advance
Comment #13
dstuart commented
Hi Abhishek,
It looks as though you don't have the right schema file in the conf dir of Solr. When you downloaded the apachesolr Drupal module there should have been a schema.xml in it. You need to put that in the Apache Solr conf folder and restart the Solr server. You also need to add, just before the tag, all of the fields mentioned here: http://drupal.org/node/811062#comment-3240566
Regards,
Dave
Comment #14
dstuart commented
For completeness, I have posted the final resolution.
After following the steps above, disabling the Solr node access option in Drupal is also required at the moment. We can work around it, but Apache Solr seems to like to specifically namespace things that it has indexed with node access on, which is quite limiting in my opinion. Also needed is a quick fix to hook_apachesolr_process_results.
Hopefully that should sort the problem
Regards,
Dave
Comment #15
savannah_beckett commented
I am trying to merge the schema.xml from my Solr/Nutch setup with the one from the Drupal Apache Solr module. I encountered a field that is not mergeable.
From drupal module:
From solr/nutch setup:
required="true"/>
I am not sure if there is any more stuff like this that is not mergeable.
Is there an easy way to deal with schema.xml?
Thanks.
Comment #16
savannah_beckett commented
I reread your comment #11. I already had a Nutch/Solr setup working. Does your comment mean I should keep the Drupal module's solrconfig.xml and remove the one in my Solr/Nutch setup? And remove the schema.xml in my Solr/Nutch setup, keep the Drupal module's schema.xml, and add the following to it?
So after this, no need to merge the schema.xml further?
Comment #17
dstuart commented
Yes, that is correct. You can use Drupal's solrconfig.xml and schema.xml, add in the fields described (or map them to other fields using solrindex-mapping.xml), and away you go. The url field can be of type string unless you really need URL validation (which I imagine Drupal would mess up).
Comment #18
savannah_beckett commented
Does this module support faceted search, or do I have to download another module called Apache Solr Facet Builder? I want to use several custom fields that I defined in the Solr index as part of my faceted search. I tried to get the Apache Solr Facet Builder module to work for a long time, and I played around with the Views module, but so far no result. There are no instructions available for custom fields in the Solr index.
Comment #19
savannah_beckett commented
I am able to get search results from the index with this module, but the URL of each search result points to the homepage of my Drupal site. Why?
Comment #20
scotjam commented
Hi all
Can anyone suggest web links to tutorials that help with 1) installing Nutch on Windows and 2) getting Nutch and Apache Solr working together?
There are plenty of instructions online but I'm not sure which ones to follow. I don't know which steps are generic to Nutch and what needs to be done specifically for Apache Solr and Drupal.
Which instructions have worked for you?
e.g. Found this one. Do I follow every step here? Or does a drupal setup of apache solr using the nutch module need different steps? http://wiki.apache.org/nutch/RunningNutchAndSolr
cheers
scotjam
Comment #21
dstuart commented
Hi scotjam,
This is a good article about it: http://wiki.apache.org/nutch/GettingNutchRunningWithWindows
Note: If you're just interested in a basic installation on Windows and are not interested in knowing the details of how it is done, you might want to check whether the WhelanLabs SearchEngine Manager (http://www.whelanlabs.com/content/SearchEngineManager.htm) fits your needs. It is a free installer for Nutch on Windows.
Regards,
Dave
Comment #22
dstuart commented
Hi Savannah,
See this comment: http://drupal.org/node/811062#comment-3251604
Hope it helps
Regards,
Dave
Comment #23
karljohann commented
By the way, I got #7 fixed. The nutch/crawl folder didn't have the right permissions.
Comment #24
nitinbh77 commented
Hi,
I am trying to integrate Nutch 1.1, Solr 1.4 and Drupal 6. I am able to index both the Nutch and Drupal results and to view them from the Solr admin screen. However, when I try to search the Nutch content from the Drupal Solr search module it shows no results.
I was able to use solrindex-mapping to map the fields. I hope I did it correctly, as I can see the Nutch results in Solr. I have no idea why it is not showing me the results in Drupal.
Please help
Regards
Nitin
Comment #25
karljohann commented
Does it show no results at all, or just no results from Nutch? Did you swap the solr/example/conf/schema.xml and solrconfig.xml for the ones from the Drupal Apachesolr module?
Comment #26
nitinbh77 commented
It shows me the Nutch results when I query from the Solr admin screen. I did copy the schema.xml and solrconfig.xml from the Drupal Apachesolr module.
The problem is that if I create some content in Drupal and index it, I am able to search it via the Drupal Apachesolr module. However, if I crawl some website using Nutch, I cannot query it from the Drupal Apachesolr module. The Nutch results are present in Solr though.
Thanks
Nitin
Comment #27
karljohann commented
I actually had the same problem but I honestly can't remember how I fixed it. Just try going through all the steps again: copy the schema.xml and solrconfig.xml files and change the solrindex-mapping file like so. I can't remember if you have to add these fields to the schema.xml, but I have them there anyway.
If you are, however, using the Apache Solr Views integration then I'm still having that problem and really am of no use.
Comment #28
nitinbh77 commented
I believe I am missing some field mappings in the solrindex-mapping file. I added some fields to the mapping file, but it seems the Drupal Solr module is looking for some more while searching. I am not using the Views integration, but it would be great if you could paste your mapping file here for quick reference. Thanks again for the prompt response.
Regards
Nitin
Comment #29
karljohann commented
Mine is identical to this one from dstuart
Comment #30
nitinbh77 commented
OK, here are the steps I have done again:
1. Copied the solrconfig.xml and schema.xml provided with the Drupal apachesolr module into Solr1.4/conf
2. Copied the solrconfig.xml and schema.xml provided with the Drupal apachesolr module into Nutch1.1/conf
3. Edited solrindex-mapping.xml in Nutch1.1/conf and added the mapping from
http://drupal.org/node/811062#comment-3154622
4. Edited Nutch1.1/schema.xml and added the fields from
http://drupal.org/node/811062#comment-3240566
After that I restarted Solr, crawled the website, and sent the data to Solr.
I can still see the crawled data indexed in Solr, but if I search for it from Drupal using apachesolr it is still not visible.
Comment #31
dstuart commented
Hi nitinbh77,
Have you got the Apache Solr access control module on? The current version of the Nutch module doesn't support this feature. Turn it off and try a search. If that doesn't work
Regards
Dave
Comment #32
nitinbh77 commented
Hi Dave,
You hit exactly the nerve of the issue. I just disabled the access control module and I am able to view the Nutch search results in Drupal now.
Thanks a ton for your help.
Regards
Nitin Bhardwaj
Comment #33
dstuart commented
Hi nitinbh77,
Have you got the Apache Solr access control module on? The current version of the Nutch module doesn't support this feature. Turn it off and try a search. If that doesn't work
Regards
Dave
Comment #34
suneethark commented
Hi all,
I installed the "6.x" version of the Nutch module with Nutch 1.0 and Solr 1.4. I was able to run Nutch on my Windows machine successfully, but I am not able to crawl external sites using the Nutch module. When I press "Start crawl" I just get
* Starting Nutch Crawl.
* 0
It didn't work. When I select dry run/debug crawl, it doesn't work either.
The output was as follows:
* http://www.9isolutions.com
* Starting Nutch Crawl.
* C:/xampp/htdocs/dprsearch/sites/all/modules/contrib/nutch/runbot -n "C:/cygwin/home/nutch-1.0" -j "C:/Program Files/Java/jdk1.6.0_21" -s "http://localhost:8983/solr" -u "http://www.9isolutions.com" -c "1" -f "100" -d "1"
Beyond that it does not do anything. I guess the "exec" command is not calling the external program. I am unable to trace where I am going wrong. Can anyone please suggest how to crawl external sites with this module? Am I doing it the right way?
Regards,
Suneetha.
Comment #35
robertdouglass commented
ApacheSolr 6.x-2.x has an entity field. Does a syntax like the following work?
EDIT: meant "nutch"
<field dest="entity">nutch</field>
Comment #36
dstuart commented
No, it currently doesn't; it requires both a source and a dest. You could add a default value to your field definition in your Solr config.
Comment #37
robertdouglass commented
I'm trying to make it so that I can recognise the Nutch documents as such. If I made entity default to "nutch", or something else that identifies them as a like group, this would superficially solve the problem, as the Nutch documents are the only docs without an entity value. However, it would also open the door to having other documents mislabelled as being from "nutch" if they somehow omit the entity field value.
I guess this is going to merit a patch to Nutch to allow for default values in the configuration xml.
Comment #38
dstuart commented
Hey Robert,
As I wrote the original mapping patch for Nutch, I'll take a stab at the change. It should be quite minor, but it may not be in a stable release for a while.
Regards
Dave
Comment #39
robertdouglass commented
Dave, thanks. Would you mind posting your work here as well? I can see already that I'm going to have to extend it even further to be a Drupal-specific Solr writer so that we can be compatible with apachesolr_multisite. The hash field, for example, has to be computed.
Comment #40
dstuart commented
Hey Robert,
On reflection, in respect to #36, I think the creation of a new mapping field in Nutch's solrindex-mapping.xml
Here is a link to the Jira issue and the patch: https://issues.apache.org/jira/browse/NUTCH-924. I haven't posted the patch here as I wasn't sure about licensing.
Regards,
Dave
Comment #41
broncomania commented
Hello,
I am also struggling with Drupal, Nutch and Solr. I have been trying to get them running for several days and many hours. I found some problems with the Nutch runbot and especially with the missing documentation. I'm not a professional programmer but I really need to configure Nutch and Solr. I see a lot of people here who like this module, as I do, and can't get it running. I really think it's time to collect all the information around this topic and create at least one piece of documentation that really works. From that point it would be much easier for others, with their different levels of knowledge, to work through the problems.
My point is: can someone who got it running give a step-by-step explanation of what changes are necessary in the Nutch module, especially the conf section with the schema.xml? The next thing is the Solr schema.xml. Do I just copy the apachesolr module files to both the Nutch schema and Solr, or only to the Solr system?
The next thing is the mapping. There are some code examples here, but where must I put them? In the schema.xml in both the Nutch and Solr folders, or only in the Solr folder? And where exactly do the mappings go? Between which tags? A lot of simple questions, but without knowing this it's really hard to get it to work.
I am really sure someone knows this and it's time to collect this information.
So I am willing to spend some time investigating the problems to find a way of making a working Nutch/Solr system.
Thanks in advance for this cool module.
Comment #42
maxmmize commented
Err, what are the right permissions?
The nutch/crawl folder
Err, where is this folder at?
Comment #43
broncomania commented
@maxmmize You have to create this folder; look here: http://drupal.org/node/950766. Hope this helps a little bit.
So I tried to merge the Solr schema.xml with the Nutch schema.xml and extend it with the mapping that is posted in this thread. Is this code right? If yes, maybe it could be copied into the Nutch module as a working starting base for others. I mean, Nutch has a schema and Solr has a schema, so why shouldn't we make one as well?
I chose the schema from the apachesolr 2.0 version. Solr works like a charm with this. Now it's about Nutch, and just making sure that this config is right.
Comment #44
karljohann commented
@broncomania: In all fairness this module is still in alpha, although it has perhaps been so for a while now. You can essentially get all the information you need to install this module successfully in this very thread. If you have any problems then you should be very specific about what the problem is and you'll probably get help fixing it.
@maxmmize: The nutch/crawl folder is in the Nutch installation folder, for example /usr/local/nutch/crawl. The right permissions would be any that allow the runbot to write into the folder, so if you're running Apache then setting the owner to apache is probably your best bet, and then setting the permissions to, for example, 755.
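A minimal sketch of that permission setup, assuming the example path /usr/local/nutch and a web server user named apache (both are assumptions; on some distributions the user is www-data instead). `can_write_dir` is a hypothetical helper name; it only checks that whoever runs it can enter the folder and create files in it.

```shell
#!/usr/bin/env bash
# Hypothetical values: adjust both to your own install.
NUTCH_HOME="${NUTCH_HOME:-/usr/local/nutch}"
WEB_USER="${WEB_USER:-apache}"

# Succeeds only if the directory exists and the current user can
# enter it and create files in it.
can_write_dir() {
  local dir="$1"
  [ -d "$dir" ] && [ -w "$dir" ] && [ -x "$dir" ]
}

# One-time setup, run as root (left commented out so nothing changes by accident):
#   chown -R "$WEB_USER" "$NUTCH_HOME/crawl"
#   chmod -R 755 "$NUTCH_HOME/crawl"
```

To check from the web server's point of view, the same test can be run as that user, for example `sudo -u apache test -w /usr/local/nutch/crawl && echo writable`.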
Comment #45
broncomania commented
Yes, I know that this is alpha. I am also doing my very best to fix problems and find solutions. I have been searching Google for several days over small problems during the installation, and I thought that others would hit the same problems. I am also willing to help develop this module to maybe get to a second alpha version. See here http://drupal.org/node/950766 or here http://drupal.org/node/950722 or my question above. These are my problems for the moment... I am sure the next problems will come. See my posting above; I think I wrote it at the same moment as you wrote your comment.
By the way karljohann, I read in http://drupal.org/node/811062#comment-3153582 that you already got it working. Maybe you can approve my posted schema.xml.
Obviously I am completely confused now. I read this thread again and again and I didn't get it.
I will start with what I understand and what I did.
1. Solr is running with the original schema.xml provided by the Drupal ApacheSolr module.
Then I made the changes in the XML posted above. Is this right? Or should I only use the original ApacheSolr schema without any changes?
2. Now, after reading this complete thread again and again, I think I have to extend the Nutch schema with the mapping posted here by dstuart. I am completely confused. If I am right, I just have to copy the solrindex-mapping into the Nutch schema in front of the
Just to make it clear: can someone post the changed Nutch and ApacheSolr schemas here? Just to see where to make the changes and to get an understanding of why you did this.
I hope my problem is clear now.
Thanks in advance
Comment #46
dstuart commented
Hey all,
As this thread is getting really long and will probably have conflicting information, I will try to consolidate the how-to bits into a readme file with the module. Further to that, I will try to make the module a little more user friendly with some basic checks on your Nutch setup. I have some time tomorrow, so let's see where I can get to.
Broncomania, hopefully I'll answer all your questions in the readme.
Cheers
Dave
Comment #47
broncomania commented
Oh man, great!! I hope you see my other issues and can integrate the notices I solved and, last but not least, the problem in the runbot.
Did I already say thank you? Really a lovely module. I hope I can help a little bit in the development process.
Comment #48
maxmmize commented
@broncomania - geez, thanks! How did I miss that post!?
I added the values in step one to the xml. Added the folders and files. chown and chmod etc.
What do I have to put in for the http.agent.name and the other values?
Comment #49
maxmmize commented
Consolidated my post to below. Sorry.
Comment #50
broncomania commented
Okay, update: I got it running!! It's so easy if you know what you are doing. I will explain it further in my post here: http://drupal.org/node/950766
@dstuart Yes, the documentation is a really helpful idea. Now that I got it running I see my mistakes.
Comment #51
maxmmize commented
For search purposes I am leaving my issues here. Others may benefit. For my outstanding errors, any help would be appreciated.
Fixed Errors:
ERROR crawl.Injector - Injector: java.io.IOException: Not a file: file:/home/xxxx/lib/nutch/seed/urls
Fixed by rm -r urls and creating a file called urls.
Input path does not exist: file:/home/nolosear/lib/nutch/crawl/linkdb/current
Fixed by creating the dir current.
[Sun Oct 24 21:27:57 2010] [error] [client X.x.x.x] sh: /home/xxxx/public_html/modules/nutch/runbot: Permission denied, referer: http://xxxxx.com/admin/settings/nutch/crawl
Fixed with chmod 755.
[Sun Oct 24 21:45:49 2010] [error] [client xxxx] sh: /home/xxxx/public_html/modules/nutch/runbot: /bin/bash^M: bad interpreter: No such file or directory, referer: http://xxxx.com/admin/settings/nutch/crawl
Fixed by editing the file in vi: type :set fileformat=unix and press Enter, then :wq! and press Enter.
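The `/bin/bash^M: bad interpreter` message points at Windows (CRLF) line endings in the runbot script. As an alternative to the vi fix, here is a small sketch that strips carriage returns from a script in place. `strip_cr` is a hypothetical helper name, and it removes every `\r` byte in the file, which is fine for a shell script but not something to run on binary files.

```shell
#!/usr/bin/env bash
# Remove all carriage-return bytes from a text file, in place.
strip_cr() {
  local file="$1" tmp
  tmp="$(mktemp)" || return 1
  # Write the cleaned copy back with cat so the original file keeps
  # its owner and permission bits (e.g. the 755 set earlier).
  if tr -d '\r' < "$file" > "$tmp" && cat "$tmp" > "$file"; then
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}

# Usage (the path is only an example):
#   strip_cr /path/to/modules/nutch/runbot
```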
Debunking Madness:
Unable to search crawled and indexed content in Solr
Turn off Apache Solr node access under Modules.
Perplexing Solutions:
2010-10-24 23:15:01,551 WARN regex.RegexURLNormalizer - can't find rules for scope 'inject', using default
Add more than one URL into your Nutch module for crawling.
Outstanding Errors:
2010-10-24 22:56:41,324 ERROR crawl.Generator - Generator: java.io.IOException: lock file /home/xxxxx/lib/nutch/crawl/crawldb/.locked already exists.
and
2010-10-24 22:56:48,026 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/xxxx/lib/nutch/crawl/segments/*/crawl_fetch matches 0 files
and
2010-10-24 22:56:45,882 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/xxxx/lib/nutch/crawl/segments/*/parse_data matches 0 files
and
ls: /home/xxxx/lib/nutch/crawl/segments/*: No such file or directory, referer: http://xxxxx.com/admin/settings/nutch/crawl
and
/home/xxxx/public_html/modules/nutch/runbot: line 88: /home/nolosear/lib/nutch/seed/urls: Permission denied, referer: http://xxxxx.com/admin/settings/nutch/crawl
Any help would be appreciated.
Comment #52
broncomania commented
This looks like you created a folder named urls? That is not what I described in my posting; it's a file.
Can you post the command that you use for calling the runbot?
If you use the command line it should look like this:
/home/YOURUSER/public_html/sites/all/modules/nutch/runbot -n '/opt/nutch' -j '/usr/lib/jvm/java-6-sun' -s 'http://localhost:8080/solr' -u 'http://www.example.com!http://www.example1.com' -c '1' -f '100' -d '0'
Comment #53
broncomania commented
I also got these error messages, and I didn't find a solution for them!
My solution was
rm -fr /nutch-1.2
and then
tar xzf apache-nutch-1.2-bin.tar.gz
and to start from scratch until I made no mistakes during the configuration. Sorry, but that was the way for me... maybe someone else can give you and me a hint! At least it worked for me. Don't make mistakes :-)
PS: I updated my posting with all the configuration that is necessary for a running Ubuntu system. Maybe this helps you again.
Comment #54
maxmmize commented
Hm, well. I made the changes as noted and added a second URL to Nutch, and it crawled. I crawled 1000 URLs in an hour.
Solr now has 4535 documents in the index but I can't get them to display in the search. :-\
Comment #55
dstuart commented
Hey maxmmize,
As per comment #33, do you have Solr access control turned on?
Rgds,
Dave
Comment #56
maxmmize commented
@dstuart Thank you very much. I did. I do remember seeing that post and I should have made a note to revisit it. I turned it off and now I can search the engine.
Comment #57
broncomania commented
So, now that I got it running under Ubuntu 10.04 (http://drupal.org/node/950766), I will come to my needs and questions.
1.: I developed a module in which users can add one or several domains to their profile, and these should get crawled after being added. I also created my own node type called domain. Is it possible to tag the crawled domains with the Drupal uid, so that I can search only the websites of this user? I have to build a connection between the domain and the user. Okay, I can grab the search result URL and compare it with the info stored in my db to find the owner of the domain, but I would prefer to tag the domain. Is this possible?
2.: I already read in #6 that someone asked if it's possible to attach the crawled pages to a node type, in my case domain. How exactly can I do this, dstuart? Have you got a code example for this?
3.: Say a user creates an account and adds one domain. How can I add it to seed/urls so that this domain gets crawled instantly or, if the crawler is already running, gets put in the waiting queue?
4.: If a user deletes their domain, how can I delete this domain from the search index?
Help and ideas are needed.
Thank you for reading :-)
Comment #58
maxmmize commented
After I received a memory error I increased my mem_limit from 200 MB to 2 GB.
My crawler stopped working. My logs show:
[Mon Oct 25 11:49:51 2010] [error] [client xxxx] /home/nolosear/public_html/modules/nutch/runbot: line 88: /home/nolosear/lib/nutch/seed/urls: Permission denied, referer: http://nolosearch.com/admin/settings/nutch/crawl
I ran ls -la
Should urls be nolosear nolosear? Do I have the right permissions? Do you have any idea why my crawler would just stop working?
Below is my debug:
Comment #59
maxmmize commented
So, I chowned the file urls to nolosear:nolosear and I don't receive that error anymore. I still get these errors:
[Mon Oct 25 19:36:33 2010] [error] [client] rmdir: /home/nolosear/lib/nutch/crawl/MERGEDsegments: No such file or directory, referer: http://xxxxx.com/admin/settings/nutch/crawl
[Mon Oct 25 19:36:33 2010] [error] [client] mv: cannot stat `/home/nolosear/lib/nutch/crawl/MERGEDsegments/*': No such file or directory, referer: http://xxxx.com/admin/settings/nutch/crawl
Comment #60
maxmmize commented
Filter URLs: *
I see the example, but what exactly does it do in the Nutch module?
Comment #61
broncomania commented
Has anyone found out how to map the Nutch ngram language information to the Solr index?
I tried to add this information to the mapping, but Solr ignores it!
Comment #62
maxmmize commented
Where exactly do these go in the schema.xml file?
Comment #63
robertdouglass commented
FIXED:
ERROR solr.DrupalSolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input Pattern file:/home/robert/lib/nutch_1_2/crawl/segments/*/crawl_fetch matches 0 files
This error was solved for me by checking my seed URLs (q=admin/settings/nutch/seed) and making sure they are prefixed with http://
For example, when I had
example.com
I got the above error, but updating that to
http://example.com
fixed it and I was able to crawl.
I'm going to add some text to the documentation to clarify this.
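A small pre-check along these lines could catch seeds without a scheme before a crawl is started. This is only a sketch: `normalize_seed` is a hypothetical helper, not part of the Nutch module, and it assumes plain http:// is the right default for a bare host name.

```shell
#!/usr/bin/env bash
# Print the seed URL unchanged if it already has an http(s) scheme,
# otherwise prefix it with http://.
normalize_seed() {
  case "$1" in
    http://*|https://*) printf '%s\n' "$1" ;;
    *)                  printf 'http://%s\n' "$1" ;;
  esac
}

# Usage: normalize every line of a seed list read from stdin.
normalize_seeds() {
  local line
  while IFS= read -r line; do
    [ -n "$line" ] && normalize_seed "$line"
  done
}
```

For example, `printf 'example.com\n' | normalize_seeds` prints `http://example.com`, while a seed that already starts with http:// or https:// passes through untouched.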
Comment #64
robertdouglass commented
Here's the documentation change I just committed.
Comment #65
robertdouglass commented
Take that back ... hasn't been committed yet.
Comment #66
dstuart commented
Hey Robert,
I have committed patch #64.
Regards,
Dave
Comment #67
broncomania commented
I have Nutch working now and I found a problem. It's about the type field (node, story, ...): Nutch doesn't fill it with a default value or anything similar. This is not a problem if you use the standard installation. I extended mine with the apachesolr multilanguage module. It still works, but if I change the language from German to English I get several errors from check_plain saying that it was given an array. So I started investigating, and the reason was that Nutch didn't set the type.
So I extended the Solr schema.xml. In the type field I added default="nutch" and the errors are gone. I don't think this is the smartest way to fix it, though. Is it possible to extend the solrindex-mapping with the type field and submit a fixed value like "nutch" or something else? That would be much better than setting a default value in the schema.
Any ideas are welcome
Bronco
Comment #68
d0t101101 commented
Thanks broncomania,
Your suggestion here (#67) was helpful. I ran into the same problem, but I am only using the Apache Solr and Nutch modules (not the multilanguage module you mentioned). I had to customize the schema.xml further for Nutch's Solr indexing step to complete successfully; it kept complaining about non-multiValued fields.
In my case, specifying the 'default' value for this field wasn't enough. I also had to change the type field to allow multiple values (multiValued="true"). I'm uncertain what other implications this change may have though.
The data is now being indexed by Solr, but is not searchable via Drupal. To help me diagnose this, would you kindly share the schema.xml you're using today? Is it much different from your earlier post (#43)?
Best regards,
.
Comment #69
ssedume commented
Hi buddy, did you ever manage to make this work?
Comment #70
mac_perlinski commented
As far as I can see, NUTCH_VIRTUAL_NODE_PATH is followed by the value of the digest field. I also read your comment in the code saying that this part still needs to be done.
If we want to create NUTCH_VIRTUAL_NODE_PATH, for example:
http://example.com/nutchnode/59a5ec5b86f2fd6552f8433bba963089 where NUTCH_VIRTUAL_NODE_PATH is substituted by a nutchnode callback, we need to change the digest field to be indexed so we can retrieve the document with a certain digest key.
What's the status of this feature?
Comment #71
avpaderno
I am closing this issue, since Drupal 6 isn't supported anymore.