When I started, I ran into a lot of problems. I followed several pieces of documentation, but nothing worked, and I felt really sad! Such a cool module, but f***k, I am a carpenter and not a programmer. Now I have found the issues and I want to share them with others.

Here are the steps you have to follow to get it running.

Download Nutch and extract it.
I chose /opt. After extraction you get a folder called nutch-1.2 inside /opt, i.e. /opt/nutch-1.2.
After that, copy the following properties

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.2</value>
  <description>A version string to advertise in the User-Agent 
   header.</description>
</property>

<property>
  <name>http.agent.host</name>
  <value></value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

into /opt/nutch-1.2/conf/nutch-site.xml, inside the <configuration> element.
Then fill in the values such as http.agent.name and the others. http.agent.name must not be empty.
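
As a minimal sketch of what the file can look like when only the mandatory value is set: the properties sit inside the <configuration> root element, and http.agent.name gets a non-empty value. The function name write_nutch_site and the agent name MyCrawler are placeholders I made up, not something Nutch ships.

```shell
# Hypothetical helper (not part of Nutch): writes a minimal nutch-site.xml.
# $1 = conf folder, e.g. /opt/nutch-1.2/conf
# $2 = value for http.agent.name (must not be empty)
write_nutch_site() {
  conf_dir="$1"
  agent_name="$2"
  cat > "$conf_dir/nutch-site.xml" <<EOF
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>$agent_name</value>
  </property>
</configuration>
EOF
}

# Example call (careful: it overwrites an existing nutch-site.xml):
# write_nutch_site /opt/nutch-1.2/conf MyCrawler
```

The other properties from the list above can be added the same way, each as its own <property> block inside <configuration>.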

Now you have to create some folders and files:
/opt/nutch-1.2/seed
/opt/nutch-1.2/seed/urls (urls is a file, not a folder!)
/opt/nutch-1.2/logs/
/opt/nutch-1.2/logs/hadoop.log (this is a file, too!)
/opt/nutch-1.2/crawl
/opt/nutch-1.2/crawl/segments
/opt/nutch-1.2/crawl/crawldb
/opt/nutch-1.2/crawl/linkdb
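
The list above can be created in one go. This is only a sketch; the function name create_nutch_layout is made up, and the Nutch folder is passed in so you can adjust it to your installation.

```shell
# Hypothetical helper: creates the folders and empty files listed above.
# $1 = Nutch folder, e.g. /opt/nutch-1.2
create_nutch_layout() {
  nutch_home="$1"
  # mkdir -p also creates missing parent folders and does not fail on existing ones
  mkdir -p "$nutch_home/seed" \
           "$nutch_home/logs" \
           "$nutch_home/crawl/segments" \
           "$nutch_home/crawl/crawldb" \
           "$nutch_home/crawl/linkdb"
  # urls and hadoop.log are files, not folders
  touch "$nutch_home/seed/urls" "$nutch_home/logs/hadoop.log"
}

# Example call:
# create_nutch_layout /opt/nutch-1.2
```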

Now we make a symbolic link to the folder. This makes some things easier.
ln -s /opt/nutch-1.2 /opt/nutch
After you have created the folders, you have to set the "right" permissions.
By the way, I have absolutely no idea whether this is right, safe or whatever, but it works!
At this point I need some advice from permission gurus.
Nutch needs the permissions of the user that runs Nutch. In my case that is the web server user.

chown myuser:myuser -R /opt/nutch
chmod 777 -R /opt/nutch
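
A less permissive variant, following the advice in the comments on this post (make the bot user the owner and use 755 instead of 777). It is only a sketch: the function name is made up, and applying it to seed, crawl and logs is my assumption, based on the folders the bot touches in this tutorial.

```shell
# Hypothetical helper: gives the working folders to the user that runs the bot
# and sets 755 (owner can write, group and everybody else can only read).
# $1 = Nutch folder, $2 = user that runs the runbot script (usually the web server user)
restrict_nutch_permissions() {
  nutch_home="$1"
  bot_user="$2"
  # Assumption: only these three folders have to be writable for the bot
  for dir in seed crawl logs; do
    chown -R "$bot_user" "$nutch_home/$dir"
    chmod -R 755 "$nutch_home/$dir"
  done
}

# Example call, usually as root; www-data is just an example user name:
# restrict_nutch_permissions /opt/nutch www-data
```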

Now we have to change the runbot script in the Drupal Nutch module.
Change this
#!/bin/sh
to this
#!/bin/bash
This is important! Otherwise the crawler will never run correctly.
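
If you prefer not to edit the file by hand, the change can be sketched like this. The function name use_bash_shebang is made up, and the path in the example call is the one from my installation.

```shell
# Hypothetical helper: switches the first line of a script from /bin/sh to /bin/bash.
# $1 = path to the runbot script
use_bash_shebang() {
  script="$1"
  tmp_file=$(mktemp)
  # Only line 1 is touched, and only if it is exactly "#!/bin/sh"
  sed '1s|^#!/bin/sh$|#!/bin/bash|' "$script" > "$tmp_file"
  # Write the result back into the same file so its executable bit is kept
  cat "$tmp_file" > "$script"
  rm -f "$tmp_file"
}

# Example call:
# use_bash_shebang /home/YOURUSER/public_html/sites/all/modules/nutch/runbot
```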

Okay, let's go on. Now we fix a small problem in nutch.admin.inc.
It can cause a PHP notice. Not important, but hey, WE CAN FIX THIS.
Add this at line 170:

$rtn_output = '';

It's a minor thing.

I really hope I didn't forget something...
Configure Nutch:
NUTCH_HOME = /opt/nutch
JAVA_HOME = depends on your installation; check /usr/lib/jvm/JAVAVERSION
SOLR URL = http://localhost:8080/solr (in most cases)
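
To find out what to put in for JAVAVERSION, you can list the folders under /usr/lib/jvm that actually contain a java binary. A small sketch; the function name list_java_homes is made up, and the idea that a usable JAVA_HOME has an executable bin/java inside is my assumption.

```shell
# Hypothetical helper: prints every folder below the JVM base folder
# that contains an executable bin/java.
# $1 = base folder (optional, defaults to /usr/lib/jvm)
list_java_homes() {
  jvm_base="${1:-/usr/lib/jvm}"
  for candidate in "$jvm_base"/*; do
    if [ -x "$candidate/bin/java" ]; then
      echo "$candidate"
    fi
  done
}

# Example call:
# list_java_homes
```

Each printed line is a candidate value for JAVA_HOME.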

If you now try the debug crawl, you should see no more errors! Feel free to submit mistakes or changes to this posting. I hope this helps to get Nutch running! I will now go on to integrate Nutch with Solr. If I find a way, I will post it here as well.

Comments

karljohann

Very nice to see some documentation, actually :)

Here are a few comments:

"nutch-site.xml
Then you have to fill in the values like http.agent.name and the others."
- It's not necessary to fill out all these values and most of them won't affect whether Nutch will run or not.

"Okay Nutch need the permissions of the user that uses nutch. This means the webserver user in my case.
chown myuser:myuser -R /opt/nutch
chmod 777 -R /opt/nutch"
- You need to change the permissions so the user who is running the runbot script (usually the web server) can write in the crawl directory. 777 means that anybody and everybody can write in this folder, which is obviously not recommended. I make the web server the owner of the crawl folder (chown -R nginx nutch/crawl) and then set the permissions to 755 (owner can write, group and everybody can read).

"Now we have to change the runbot in the Drupal Nutch Modul.
Change this
#!/bin/sh
to this
#!/bin/bash
This is important!!!!!! otherwise the crawler will never ever run correctly."
- I assume this is different between distros, but definitely worth a try if it isn't working.

You also have to copy the solrindex-mapping.xml to the conf folder (http://drupal.org/node/811062#comment-3154622) as well as merge the schema.xml and solrconfig.xml (use the ones from the apachesolr module and then check this: http://drupal.org/node/811062#comment-3240566)

Hope this helps a bit.

broncomania

DEAR READER, IGNORE THIS INFORMATION, IT'S NOT CORRECT. ONLY conf IS USED!! :-)

You also have to copy the solrindex-mapping.xml to the conf folder (http://drupal.org/node/811062#comment-3154622) as well as merge the schema.xml and solrconfig.xml (use the ones from the apachesolr module and then check this: http://drupal.org/node/811062#comment-3240566)

PART 2

So after several attempts I got it and here is the second part:

Don't forget that I installed Solr and Nutch in the /opt folder. If your setup differs, you must adjust the paths!

1.: Copy the schema.xml from the Drupal ApacheSolr module (version 2) over the Solr schema.xml. You will find the Solr schema.xml here:
/opt/solr/config/schema.xml

2.: Now copy the solrconfig.xml from the ApacheSolr module over the file
/opt/solr/config/solrconfig.xml

3.: For Nutch version 1.2, open solrindex-mapping.xml. You will find the file here:
/opt/nutch-1.2/conf/solrindex-mapping.xml
Now change this entry
<field dest="content" source="content"/>
to this
<field dest="body" source="content"/>

4.: Restart Tomcat:
/etc/init.d/tomcat6 restart
This is necessary so that Solr picks up the new information.
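
The change from step 3 can also be scripted. This is a sketch only: the function name is made up, and it assumes the entry is written exactly as shown in step 3 (same attribute order and spacing). If your file differs, edit it by hand.

```shell
# Hypothetical helper: maps the Nutch "content" field to the Solr "body" field.
# $1 = path to solrindex-mapping.xml
map_content_to_body() {
  mapping_file="$1"
  tmp_file=$(mktemp)
  # Replaces only the exact entry from step 3; all other lines stay as they are
  sed 's|<field dest="content" source="content"/>|<field dest="body" source="content"/>|' \
    "$mapping_file" > "$tmp_file"
  cat "$tmp_file" > "$mapping_file"
  rm -f "$tmp_file"
}

# Example call:
# map_content_to_body /opt/nutch-1.2/conf/solrindex-mapping.xml
```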

And that's it! Quite easy, isn't it?

Now let the bot run with a command like this. Remember that I use the paths of my installation!
/home/YOURUSER/public_html/sites/all/modules/nutch/runbot -n '/opt/nutch' -j '/usr/lib/jvm/java-6-sun' -s 'http://localhost:8080/solr' -u 'http://www.example.com!http://www.example1.com' -c '1' -f '100' -d '0'
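
In the command above the start URLs in the -u value are separated by "!". If you have a longer list, a small helper can build that value for you. This is a sketch; join_urls is a made-up name and not part of the runbot script.

```shell
# Hypothetical helper: joins all arguments with "!" as in the -u value above.
join_urls() {
  # "$*" joins the arguments with the first character of IFS;
  # the subshell keeps the changed IFS away from the rest of the script
  ( IFS='!'; printf '%s\n' "$*" )
}

# Example call:
# join_urls http://www.example.com http://www.example1.com
```

The printed string can then be passed to runbot as the -u value.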

@karljohann You are right about the crawler information in nutch-site.xml, but you get errors if you don't fill in http.agent.name. The crawler may run, but it throws an exception without it. So, just for correctness, we fill in the values. :-)
Thanks also for the sh / bash info, that it depends on the distro. I didn't know that. Here I just want to give a running example for Ubuntu 10.04. Many thanks as well for the short tutorial about the permissions of the user who runs the bot and which folders need which permissions.

It is not necessary to merge files or to copy the solrindex-mapping.xml somewhere. This information is just confusing. It's not necessary for the installation I described here!

karljohann

Well it might be confusing but you still need to do it.

broncomania

In my case it works without that! I followed the steps in the other posting, but it didn't work. That was my reason for writing this step-by-step tutorial. Why is it necessary to get it working, and why does it work in my case without this step if it is necessary? Ask yourself why so many people would keep posting and asking the same questions if the tutorial were right and straightforward. Anyway, maybe it is necessary on a different distro. Which distro did you use, and where did you put the solrindex-mapping.xml? How did you integrate it into Solr, or how could Solr know that you copied the file there? You see, even I, who got it working, have a lot of questions about that. Can you please clarify why? The other readers and I really don't understand it.

Best regards
Frank

karljohann

If you want the results to show up in Solr, you have to copy the solrconfig.xml and schema.xml as well as modify the solrindex-mapping.xml. If you got it to work without doing this, then I am clearly misunderstanding something.

broncomania

Yes, it was a misunderstanding! I agree with all your steps. They are the same ones I described in the step-by-step tutorial, so readers can follow this tutorial and get it to work.

maxmmize

Do you know how to change the length of the teaser results, so that instead of 200 characters my search results display 400 or more?

Is this something I have to tell Nutch, Solr or ApacheSolr to do, and where?

broncomania

I don't know how to do it at the moment, because I have other problems. But if you find a solution, please post it here. Maybe we can collect snippets for the things users most often want to change.

I found a nice article about optimizing the search for the German language.

german word compound solr index optimizer

http://www.early-dance.de/news/9189-apachesolr-issues-german-and-other-g...

broncomania

Attachment (new): 35.62 KB

Update permissions problem:

I ran into problems with getting the permissions right, so I wrote this small snippet once I got it running.

Add the following to the nutch.admin.inc file to get a result like the one in the attached picture.

Add this code to the nutch_admin_settings() function, before the line

return system_settings_form($form);
/**
  * Nutch folder permissions and owner check
  * @ seed/urls    exists and writable?
  * @ crawl/crawldb   exists and writable?
  * @ logs/hadoop.log
  */	
  $NUTCH_HOME = variable_get('nutch_nutch_dir', '/usr/local/nutch');
  $check_true = '<span style="color:green;">Ok</span>';
     
  $form['permissions_check'] = array(
     '#type' => 'fieldset',
     '#title' => 'Folder permissions',
     '#description' => t('Folder permissions and owner check. If something is not okay Nutch will not run properly!'),  
     '#collapsible' => true,
     '#collapsed' => true,
  );	

  $apache_uid = posix_getuid();
  $form['permissions_check']['apache_uid'] = array(
    '#title' => 'Apache Uid',
     '#value' => $apache_uid, 
     '#prefix' => '<div>'.t('Apache UID').':',
     '#suffix' => '</div>',
  );	
  
  
  /**
  * Seed
  */
  $form['permissions_check']['seed'] = array(
     '#type' => 'fieldset',
     '#title' => 'Seed check',
     '#description' => t('Overview about your seeds folder and seed/urls file'),  
  );	  
  
  $seed_folder_exist = is_dir($NUTCH_HOME.'/seed');
  $form['permissions_check']['seed']['seed_folder_exist'] = array(
    '#title' => 'seed',
     '#value' => ($seed_folder_exist) ? $check_true : nutch_nutch_check_false(), 
     '#prefix' => '<div>'.t('Seed folder exist check').':',
     '#suffix' => '</div>',
  );	  

  
 $seed_owner_uid = fileowner($NUTCH_HOME.'/seed');
  $form['permissions_check']['seed']['seed_owner'] = array(
    '#title' => 'seed',
     '#value' => ($seed_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($seed_owner_uid), 
     '#prefix' => '<div>'.t('Seed owner check').':',
     '#suffix' => '</div>',
  );	  



  $seed_urls_owner_uid = fileowner($NUTCH_HOME.'/seed/urls');
  $form['permissions_check']['seed']['seed_urls_owner'] = array(
    '#title' => 'seed/urls',
     '#value' => ($seed_urls_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($seed_urls_owner_uid), 
     '#prefix' => '<div>'.t('Seed / urls owner check').':',
     '#suffix' => '</div>',
  );	

  $seed_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/seed')), -4); 
  $form['permissions_check']['seed']['seed_permission'] = array(
    '#title' => 'seed',
     '#value' => $seed_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/seed owner permission').':',
     '#suffix' => '</div>',
  );	

  $seed_url_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/seed/urls')), -4); 
  $form['permissions_check']['seed']['seed_url_permission'] = array(
    '#title' => 'seed/urls',
     '#value' => $seed_url_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/seed/urls owner permission').':',
     '#suffix' => '</div>',
  );	


  /**
  * crawl
  */
  $form['permissions_check']['crawl'] = array(
     '#type' => 'fieldset',
     '#title' => 'Crawl check',
     '#description' => t('Overview about your crawl folder'),  
  );	  
  
  $crawldb_owner_uid = fileowner($NUTCH_HOME.'/crawl/crawldb');
  $form['permissions_check']['crawl']['crawldb_owner'] = array(
    '#title' => 'crawl/crawldb ',
     '#value' => ($crawldb_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($crawldb_owner_uid), 
     '#prefix' => '<div>'.t('Crawldb owner check').':',
     '#suffix' => '</div>',
  );
   

 $linkdb_owner_uid = fileowner($NUTCH_HOME.'/crawl/linkdb');
  $form['permissions_check']['crawl']['linkdb_owner'] = array(
    '#title' => 'crawl/linkdb ',
     '#value' => ($linkdb_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($linkdb_owner_uid), 
     '#prefix' => '<div>'.t('Linkdb owner check').':',
     '#suffix' => '</div>',
  );  

  $segments_owner_uid = fileowner($NUTCH_HOME.'/crawl/segments');
  $form['permissions_check']['crawl']['segments_owner'] = array(
    '#title' => 'crawl/segments',
     '#value' => ($segments_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($segments_owner_uid), 
     '#prefix' => '<div>'.t('Segments owner check').':',
     '#suffix' => '</div>',
  );
 


  $crawldb_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/crawl/crawldb')), -4); 
  $form['permissions_check']['crawl']['crawldb_permission'] = array(
    '#title' => 'crawl/crawldb ',
     '#value' => $crawldb_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/crawl/crawldb permission check').':',
     '#suffix' => '</div>',
  );
  
  $linkdb_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/crawl/linkdb')), -4); 
  $form['permissions_check']['crawl']['linkdb_permission'] = array(
    '#title' => 'crawl/linkdb ',
     '#value' => $linkdb_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/crawl/linkdb permission check').':',
     '#suffix' => '</div>',
  );

  $segments_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/crawl/segments')), -4); 
  $form['permissions_check']['crawl']['segments_permission'] = array(
    '#title' => 'crawl/segments ',
     '#value' => $segments_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/crawl/segments permission check').':',
     '#suffix' => '</div>',
  );


  /**
  * Hadoop
  */
  $form['permissions_check']['hadoop'] = array(
     '#type' => 'fieldset',
     '#title' => 'Hadoop check',
     '#description' => t('Overview about your hadoop folder.'),  
  );	  
  
  
  $hadoop_owner_uid = fileowner($NUTCH_HOME.'/logs/hadoop.log');
  $form['permissions_check']['hadoop']['hadoop_owner'] = array(
    '#title' => 'logs/hadoop.log',
     '#value' => ($hadoop_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($hadoop_owner_uid), 
     '#prefix' => '<div>'.t('Hadoop owner check').':',
     '#suffix' => '</div>',
  );
   
  
   $hadoop_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/logs/hadoop.log')), -4); 
   $form['permissions_check']['hadoop']['hadoop_permission'] = array(
    '#title' => 'logs/hadoop.log',
     '#value' => $hadoop_permission,
     '#prefix' => '<div>'.$NUTCH_HOME.t('/logs/hadoop.log permission check').':',
     '#suffix' => '</div>',
  );
  

Add this function to the end of the nutch.admin.inc file.

function  nutch_nutch_check_false($permissions = '')
{  
  $output = '<span style="color:red;">Wrong '.$permissions.'</span>';	
  
  return $output;
}
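
For a quick look from the command line, without touching nutch.admin.inc, roughly the same owner and permission information can be printed with ls. A sketch only; report_owner_and_mode is a made-up name, and it shows the mode as a string like drwxr-xr-x instead of the octal value the PHP code prints.

```shell
# Hypothetical helper: prints "<path> <owner uid> <mode>" for each path given.
report_owner_and_mode() {
  for path in "$@"; do
    # ls -ldn prints the mode string in column 1 and the numeric owner uid in column 3
    ls -ldn "$path" | awk -v p="$path" '{ print p, $3, substr($1, 1, 10) }'
  done
}

# Example call:
# report_owner_and_mode /opt/nutch/seed /opt/nutch/seed/urls /opt/nutch/crawl/crawldb /opt/nutch/logs/hadoop.log
```

Compare the printed uid with the uid of the user that runs the bot (id -u shows the uid of the current user).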

That's it. In my case the crawler runs when the result looks like the one in my attached picture.

I hope this makes it a little bit easier to locate the problems. Maybe it will find its way into the next alpha?

broncomania

Attachment (new): 62.73 KB

I added some more checks to the code above. Look at the attached picture if you want to know what it looks like.

dstuart

Hey broncomania,

This looks good. Is it possible to get it as a patch instead of pasted into a comment?

Regards,

Dave

broncomania

Ah, I have never built a patch, but I will try. Give me a few days...

maxmmize

@Bronco - Looks really good man! I did find the solution to changing my title and snippet length, but like you, have never written a patch before. I can post the links if you like but it looks like you have your hands full with Dave's request.

Good job, the easier we make it for novices the more people will get into the development and use.

dstuart

Hey,

To help out a little, read up here: http://drupal.org/patch/create

You will need to check out the latest copy of the codebase:
cvs -z6 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib checkout -d nutch-HEAD contributions/modules/nutch/
Then port your changes across to the checkout and use the instructions from the link above.

Cheers,

Dave

broncomania

Attachment (new): 7.3 KB

@maxmmize It's not so hard to make a patch; it takes about an hour. It looks like we are collecting really cool info and functions to get this crawler running. Please don't hesitate: post your links here, maybe I can help you!

@dstuart Thanks for the short hint. This saved me a lot of time.

So here is the patch, untested.

So have fun
Frank

marco69

Dear Broncomania,

I have followed your tutorial step by step ... great effort, really appreciated.

How do I then see the results from the Nutch crawl? I assumed that Solr would index them and show them in the results, but this is not happening. Any help here?

Thank you so much.

Ciao Marco

avpaderno

Assigned: broncomania » Unassigned
Status: Active » Closed (outdated)

I am closing this issue, since Drupal 6 isn't supported anymore.