When I started, I ran into a lot of problems. I followed several pieces of documentation, but nothing worked, and I felt really sad!! Such a cool module, but f***k, I am a carpenter and not a programmer. Now I have found the issues and I want to share them with others.
Okay, here are the steps you have to follow to get it running!
Download Nutch and extract it.
I chose /opt. After extraction you get a folder called nutch-1.2 in /opt, i.e. /opt/nutch-1.2.
After that you have to copy this
<property>
<name>http.agent.name</name>
<value></value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
NOTE: You should also check other related properties:
http.robots.agents
http.agent.description
http.agent.url
http.agent.email
http.agent.version
and set their values appropriately.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>*</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
<property>
<name>http.robots.403.allow</name>
<value>true</value>
<description>Some servers return HTTP status 403 (Forbidden) if
/robots.txt doesn't exist. This should probably mean that we are
allowed to crawl the site nonetheless. If this is set to false,
then such sites will be treated as forbidden.</description>
</property>
<property>
<name>http.agent.description</name>
<value></value>
<description>Further description of our bot- this text is used in
the User-Agent header. It appears in parentheses after the agent name.
</description>
</property>
<property>
<name>http.agent.url</name>
<value></value>
<description>A URL to advertise in the User-Agent header. This will
appear in parentheses after the agent name. Custom dictates that this
should be a URL of a page explaining the purpose and behavior of this
crawler.
</description>
</property>
<property>
<name>http.agent.email</name>
<value></value>
<description>An email address to advertise in the HTTP 'From' request
header and User-Agent header. A good practice is to mangle this
address (e.g. 'info at example dot com') to avoid spamming.
</description>
</property>
<property>
<name>http.agent.version</name>
<value>Nutch-1.2</value>
<description>A version string to advertise in the User-Agent
header.</description>
</property>
<property>
<name>http.agent.host</name>
<value></value>
<description>Name or IP address of the host on which the Nutch crawler
would be running. Currently this is used by 'protocol-httpclient'
plugin.
</description>
</property>
<property>
<name>http.timeout</name>
<value>10000</value>
<description>The default network timeout, in milliseconds.</description>
</property>
<property>
<name>http.max.delays</name>
<value>100</value>
<description>The number of times a thread will delay when trying to
fetch a page. Each time it finds that a host is busy, it will wait
fetcher.server.delay. After http.max.delays attempts, it will give
up on the page for now.</description>
</property>
into /opt/nutch-1.2/conf/nutch-site.xml, between the <configuration> and </configuration> tags.
Then you have to fill in the values, such as http.agent.name and the others.
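The description above says http.agent.name MUST NOT be empty, so a quick check before the first crawl can save a debug run. This is only a minimal sketch and not part of Nutch or the Drupal module: the function name agent_name_is_set is made up, and it assumes each <name> and <value> tag sits on its own line, as in the snippet above.

```shell
#!/bin/sh
# Hypothetical helper (not part of Nutch): succeeds only if the file given
# as $1 has a non-empty <value> following <name>http.agent.name</name>.
# Assumes one tag per line, as in the snippet above.
agent_name_is_set() {
  awk '
    BEGIN { status = 1 }
    /<name>http\.agent\.name<\/name>/ { found = 1; next }
    found && /<value>/ {
      v = $0
      sub(/.*<value>/, "", v)      # drop everything up to the opening tag
      sub(/<\/value>.*/, "", v)    # drop the closing tag and what follows
      gsub(/[ \t\r]/, "", v)       # treat whitespace-only values as empty
      if (v != "") status = 0
      exit
    }
    END { exit status }
  ' "$1"
}

# Example call with the path used in this tutorial:
# agent_name_is_set /opt/nutch-1.2/conf/nutch-site.xml || echo "http.agent.name is empty"
```

A value that spans several lines would not be recognised by this check.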
Now you have to create some folders and files.
/opt/nutch-1.2/seed
/opt/nutch-1.2/seed/urls << urls is a file, not a folder!
/opt/nutch-1.2/logs/
/opt/nutch-1.2/logs/hadoop.log << this is a file too!
/opt/nutch-1.2/crawl
/opt/nutch-1.2/crawl/segments
/opt/nutch-1.2/crawl/crawldb
/opt/nutch-1.2/crawl/linkdb
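The folders and files listed above can be created in one go. The following is a rough sketch; the function name create_nutch_layout is made up for illustration, and it takes the Nutch directory as its argument so it can be tried on a scratch folder first.

```shell
#!/bin/sh
# Hypothetical helper: create the folders and empty files listed above
# under the Nutch directory given as $1.
create_nutch_layout() {
  mkdir -p "$1/seed" "$1/logs" \
           "$1/crawl/segments" "$1/crawl/crawldb" "$1/crawl/linkdb" || return 1
  # urls and hadoop.log are files, not folders
  touch "$1/seed/urls" "$1/logs/hadoop.log"
}

# Example call with the path from this tutorial:
# create_nutch_layout /opt/nutch-1.2
```

Running it a second time is harmless, because mkdir -p and touch leave existing folders and file contents alone.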
Now we make a symbolic link to the folder. This makes some things easier.
ln -s /opt/nutch-1.2 /opt/nutch
After you have created the folders you have to set the "right" permissions.
By the way, I have absolutely no idea whether this is right or safe, but it works!!!
At this point I need some advice from some permission gurus.
Okay, Nutch needs the permissions of the user that runs Nutch. In my case this means the web server user.
chown myuser:myuser -R /opt/nutch
chmod 777 -R /opt/nutch
Now we have to change the runbot script in the Drupal Nutch module.
Change this
#!/bin/sh
to this
#!/bin/bash
This is important!!!!!! Otherwise the crawler will never ever run correctly.
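The shebang change can also be scripted. A likely reason it matters, though nothing in this thread confirms it, is that /bin/sh on Ubuntu points to dash, which lacks some bash features. The sketch below only touches the first line of the file; the function name use_bash_shebang is made up, and you have to pass the path to the runbot script yourself.

```shell
#!/bin/sh
# Hypothetical helper: rewrite a "#!/bin/sh" first line to "#!/bin/bash".
# $1 is the path to the script; any other first line is left untouched.
use_bash_shebang() {
  _shebang_tmp="$1.tmp.$$"
  # Only line 1 is edited; writing back with cat keeps the file's permissions.
  sed '1s|^#!/bin/sh *$|#!/bin/bash|' "$1" > "$_shebang_tmp" &&
    cat "$_shebang_tmp" > "$1" &&
    rm -f "$_shebang_tmp"
}

# Example call (adjust the path to where your module lives):
# use_bash_shebang /home/YOURUSER/public_html/sites/all/modules/nutch/runbot
```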
Okay, we go on. Now we just fix a small problem in nutch.admin.inc.
It can cause a PHP notice. Not important, but hey, WE CAN FIX THIS.
Add this at line 170:
$rtn_output = '';
It's a minor thing.
I really hope I didn't forget something...
Configure Nutch
NUTCH_HOME = /opt/nutch
JAVA_HOME = depends on your installation; check /usr/lib/jvm/JAVAVERSION
SOLR URL = http://localhost:8080/solr (mostly)
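Before entering these values it can help to check that both paths really point at what the module expects. This is a small sketch under the assumption that Nutch ships its launcher as bin/nutch and that the JDK has bin/java below JAVA_HOME; the function name check_nutch_env is made up.

```shell
#!/bin/sh
# Hypothetical pre-flight check for the two paths entered in the module settings.
# $1 = NUTCH_HOME, $2 = JAVA_HOME. Prints what is missing and fails if anything is.
check_nutch_env() {
  _env_status=0
  [ -x "$1/bin/nutch" ] || { echo "no executable bin/nutch in $1"; _env_status=1; }
  [ -x "$2/bin/java" ]  || { echo "no executable bin/java in $2"; _env_status=1; }
  return "$_env_status"
}

# Example call with the paths used in this thread:
# check_nutch_env /opt/nutch /usr/lib/jvm/java-6-sun
```

The Solr URL is not checked here; opening it in a browser is the simplest test.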
If you now try the debug crawl you should see no errors anymore! Feel free to submit mistakes or changes to this posting. I hope this helps to get Nutch running! I will now go on to integrate Nutch with Solr. If I find a way I will post it here as well.
| Comment | File | Size | Author |
|---|---|---|---|
| #15 | filepermcheck.patch | 7.3 KB | broncomania |
| #9 | Nutch crawler 1289125929485.png | 35.62 KB | broncomania |
| #10 | Nutch crawler 1289290284951.png | 62.73 KB | broncomania |
Comments
Comment #1
karljohann commented
Very nice to see some documentation actually :)
Here are a few comments:
"nutch-site.xml
Then you have to fill in the values like http.agent.name and the others."
- It's not necessary to fill out all these values and most of them won't affect whether Nutch will run or not.
"Okay Nutch need the permissions of the user that uses nutch. This means the webserver user in my case.
chown myuser:myuser -R /opt/nutch
chmod 777 -R /opt/nutch"
- You need to change the permissions so the user who is running the runbot script (usually the web server) can write in the crawl directory. 777 means that anybody and everybody can write in this folder, which is obviously not recommended. I make the web server the owner of the crawl folder (chown -R nginx nutch/crawl) and then set the permissions to 755 (owner can write, group and everybody can read).
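As a rough sketch of the permission advice above: the helper below uses symbolic modes instead of a literal 755, which is a small deviation so that plain files do not become executable, while directories still end up as 755. The function name restrict_crawl_permissions is made up, and nginx is just the example user from this comment.

```shell
#!/bin/sh
# Hypothetical helper: owner may write, group and others may only read
# (and enter directories). $1 is the crawl directory.
restrict_crawl_permissions() {
  # Capital X adds execute only to directories and already-executable files.
  chmod -R u+rwX,go+rX,go-w "$1"
}

# The ownership change needs root, so it is shown here only as a comment:
# chown -R nginx /opt/nutch/crawl
# restrict_crawl_permissions /opt/nutch/crawl
```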
"Now we have to change the runbot in the Drupal Nutch Modul.
Change this
#!/bin/sh
to this
#!/bin/bash
This is important!!!!!! otherwise the crawler will never ever run correctly."
- I assume this is different between distros, but definitely worth a try if it isn't working.
You also have to copy the solrindex-mapping.xml to the conf folder (http://drupal.org/node/811062#comment-3154622) as well as merge the schema.xml and solrconfig.xml (use the ones from the apachesolr module and then check this: http://drupal.org/node/811062#comment-3240566)
Hope this helps a bit.
Comment #2
broncomania commented
DEAR READER, IGNORE THIS INFORMATION, IT'S NOT CORRECT. ONLY USE conf!! :-)
PART 2
After several attempts I got it, and here is the second part:
Don't forget that I installed Solr and the Nutch program in the /opt folder. If your setup is different, you must adjust the paths!!!
1.: Copy the schema.xml from the Drupal ApacheSolr module (version 2) over the Solr schema.xml. You will find the Solr schema.xml here:
/opt/solr/config/schema.xml
2.: Now copy the ApacheSolr module's solrconfig.xml over the file
/opt/solr/config/solrconfig.xml
3.: For Nutch version 1.2, open solrindex-mapping.xml. You will find the file here:
/opt/nutch-1.2/conf/solrindex-mapping.xml
Now change this entry
<field dest="content" source="content"/>
to this
<field dest="body" source="content"/>
4.: Restart Tomcat
/etc/init.d/tomcat6 restart
This is necessary so that the new information gets into the Solr module.
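The edit in step 3 could be scripted roughly like this. It is only a sketch: the function name map_content_to_body is made up, it expects the entry to be written exactly as shown in step 3, and it keeps a backup copy next to the original file.

```shell
#!/bin/sh
# Hypothetical helper: point the Nutch "content" field at the Solr "body" field.
# $1 is the path to solrindex-mapping.xml; a copy is kept as $1.bak.
map_content_to_body() {
  cp "$1" "$1.bak" || return 1
  sed 's|<field dest="content" source="content"/>|<field dest="body" source="content"/>|' \
    "$1.bak" > "$1"
}

# Example call with the path from step 3:
# map_content_to_body /opt/nutch-1.2/conf/solrindex-mapping.xml
```

If the entry is formatted differently in your file, nothing is changed and you have to edit it by hand.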
And that's it!!! Quite easy, isn't it?
Now let the bot run with a command like this. Remember that I use the paths of my installation!!!
/home/YOURUSER/public_html/sites/all/modules/nutch/runbot -n '/opt/nutch' -j '/usr/lib/jvm/java-6-sun' -s 'http://localhost:8080/solr' -u 'http://www.example.com!http://www.example1.com' -c '1' -f '100' -d '0'
@karljohann You are right about the crawler information in nutch-site.xml, but you get errors if you don't fill in http.agent.name. The crawler may run, but it throws an exception without it. So just for correctness we fill in the info. :-)
Thanks also for the info that the sh / bash issue depends on the distro. I didn't know that. Here I just want to give a running example for Ubuntu 10.04, and many thanks for the short tutorial about the permissions of the user who runs the bot and which folders need which permissions.
It is not necessary to merge files or to copy the solrindex-mapping.xml somewhere. This information is just confusing. It's not necessary for the installation I described here!
Comment #3
karljohann commented
Well, it might be confusing but you still need to do it.
Comment #4
broncomania commented
So in my case it works without that! I followed the steps in the other posting, but it didn't work. That was the reason for me to write this step-by-step tutorial. Why is it necessary to get it to work, and why does it work in my case without this step if it is necessary? Ask yourself why so many people would post and keep asking the same questions if the tutorial were right and straightforward. Anyway, maybe it is necessary on a different distro. Which distro did you use and where did you put the solrindex-mapping.xml? How did you integrate it into the Solr system, or how could Solr know that you copied the file there? You see, even I, who got it working, have a lot of questions about that. Can you please clarify why? The other readers and I really don't understand it.
Best regards
Frank
Comment #5
karljohann commented
If you want the results to show up in Solr you have to copy the solrconfig.xml and schema.xml as well as modify the solrindex-mapping.xml. If you got it to work without doing this then I am clearly misunderstanding something.
Comment #6
broncomania commented
Yes, it was a misunderstanding! I agree with all your steps. They are the same ones I described in the step-by-step tutorial. So readers can follow this tutorial and get it to work.
Comment #7
maxmmize commented
Do you know how to change the teaser length, so that instead of 200 characters in my search results I get 400 or more?
Is this something I have to tell Nutch, Solr or ApacheSolr to do, and where?
Comment #8
broncomania commented
I don't know how to do it at the moment, because I have other problems. But if you find a solution, please post it here. Maybe we can collect snippets for the things users most often want to change.
I found a nice article about optimizing search for the German language.
German word compound Solr index optimizer
http://www.early-dance.de/news/9189-apachesolr-issues-german-and-other-g...
Comment #9
broncomania commented
Update, permissions problem:
I ran into the problem of getting the permissions right, so I wrote this small snippet after I got it running.
Add the following to the nutch.admin.inc file to get a result like in the attached picture.
Add this code to the nutch_admin_settings() function before the
Add this function to the end of the nutch.admin.inc file.
That's it. In my case the crawler runs when the result matches my attached picture.
I hope this makes it a little bit easier to locate problems and finds its way into the next alpha??
Comment #10
broncomania commented
I added some more checks to the code above. Look at the attached picture if you want to know what it looks like.
Comment #11
dstuart commented
Hey broncomania,
This looks good. Is it possible to get it as a patch instead of pasted into a comment?
Regards,
Dave
Comment #12
broncomania commented
Ah, I have never built a patch, but I will try. Give me a few days...
Comment #13
maxmmize commented
@Bronco - Looks really good man! I did find the solution to changing my title and snippet length, but like you, have never written a patch before. I can post the links if you like but it looks like you have your hands full with Dave's request.
Good job, the easier we make it for novices the more people will get into the development and use.
Comment #14
dstuart commented
Hey,
To help out a little, read up here: http://drupal.org/patch/create
You will need to check out the latest copy of the codebase:
cvs -z6 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib checkout -d nutch-HEAD contributions/modules/nutch/
Then port your changes across to the checkout and use the instructions from the link above.
Cheers,
Dave
Comment #15
broncomania commented
@maxmmize It's not so hard to make a patch; it takes about 1 hour. It looks like we are collecting really cool info and functions to get this crawler running. Please don't hesitate. Post your links here, maybe I can help you!
@dstuart Thanks for the short hint. This saved me a lot of time.
So here is the patch, untested.
Have fun
Frank
Comment #16
marco69 commented
Dear Broncomania,
I have followed your tutorial step by step ... great effort, really appreciated.
How do I then see the results from the Nutch crawl? I assumed that Solr would index them and show them in the results, but this is not happening. Any help here?
Thank you so much.
Ciao Marco
Comment #17
avpaderno
I am closing this issue, since Drupal 6 isn't supported anymore.