The way it works!!! Getting Nutch 1.2 to run on Ubuntu 10.04 [#950766]

Okay, I ran into a lot of problems when I started. I followed several pieces of documentation, but nothing worked and I felt really sad!! Such a cool module, but f***k, I am a carpenter and not a programmer. Now I have found the issues and I want to share them with others.

Okay, here are the steps you have to follow to get it running!

Download Nutch and extract it.
I chose /opt. After extraction you have a folder called nutch-1.2 in /opt, i.e. /opt/nutch-1.2.
After that you have to copy this

<property>
  <name>http.agent.name</name>
  <value></value>
  <description>HTTP 'User-Agent' request header. MUST NOT be empty - 
  please set this to a single word uniquely related to your organization.

  NOTE: You should also check other related properties:

	http.robots.agents
	http.agent.description
	http.agent.url
	http.agent.email
	http.agent.version

  and set their values appropriately.

  </description>
</property>

<property>
  <name>http.robots.agents</name>
  <value>*</value>
  <description>The agent strings we'll look for in robots.txt files,
  comma-separated, in decreasing order of precedence. You should
  put the value of http.agent.name as the first agent name, and keep the
  default * at the end of the list. E.g.: BlurflDev,Blurfl,*
  </description>
</property>

<property>
  <name>http.robots.403.allow</name>
  <value>true</value>
  <description>Some servers return HTTP status 403 (Forbidden) if
  /robots.txt doesn't exist. This should probably mean that we are
  allowed to crawl the site nonetheless. If this is set to false,
  then such sites will be treated as forbidden.</description>
</property>

<property>
  <name>http.agent.description</name>
  <value></value>
  <description>Further description of our bot- this text is used in
  the User-Agent header.  It appears in parenthesis after the agent name.
  </description>
</property>

<property>
  <name>http.agent.url</name>
  <value></value>
  <description>A URL to advertise in the User-Agent header.  This will 
   appear in parenthesis after the agent name. Custom dictates that this
   should be a URL of a page explaining the purpose and behavior of this
   crawler.
  </description>
</property>

<property>
  <name>http.agent.email</name>
  <value></value>
  <description>An email address to advertise in the HTTP 'From' request
   header and User-Agent header. A good practice is to mangle this
   address (e.g. 'info at example dot com') to avoid spamming.
  </description>
</property>

<property>
  <name>http.agent.version</name>
  <value>Nutch-1.2</value>
  <description>A version string to advertise in the User-Agent 
   header.</description>
</property>

<property>
  <name>http.agent.host</name>
  <value></value>
  <description>Name or IP address of the host on which the Nutch crawler
  would be running. Currently this is used by 'protocol-httpclient'
  plugin.
  </description>
</property>

<property>
  <name>http.timeout</name>
  <value>10000</value>
  <description>The default network timeout, in milliseconds.</description>
</property>

<property>
  <name>http.max.delays</name>
  <value>100</value>
  <description>The number of times a thread will delay when trying to
  fetch a page.  Each time it finds that a host is busy, it will wait
  fetcher.server.delay.  After http.max.delays attempts, it will give
  up on the page for now.</description>
</property>

into /opt/nutch-1.2/conf/nutch-site.xml, inside the <configuration> element.
Then fill in the values for http.agent.name and the others. At the very least http.agent.name must not be empty.

Now you have to create some folders and files.
/opt/nutch-1.2/seed
/opt/nutch-1.2/seed/urls << urls is a file, not a folder!
/opt/nutch-1.2/logs/
/opt/nutch-1.2/logs/hadoop.log << this is a file too!
/opt/nutch-1.2/crawl
/opt/nutch-1.2/crawl/segments
/opt/nutch-1.2/crawl/crawldb
/opt/nutch-1.2/crawl/linkdb
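
The list above can be created in one go. This is only a sketch: create_nutch_layout is a helper name made up for this example, not something that ships with Nutch or the Drupal module, and writing below /opt normally needs root.

```shell
#!/bin/bash
# create_nutch_layout NUTCH_HOME
# Hypothetical helper: create the folders and empty files from the list above.
create_nutch_layout() {
  local nutch_home="$1"
  mkdir -p "$nutch_home/seed" \
           "$nutch_home/logs" \
           "$nutch_home/crawl/segments" \
           "$nutch_home/crawl/crawldb" \
           "$nutch_home/crawl/linkdb"
  # seed/urls and logs/hadoop.log are plain files, not directories
  touch "$nutch_home/seed/urls" "$nutch_home/logs/hadoop.log"
}

# Usage (as root or with sudo):
# create_nutch_layout /opt/nutch-1.2
```

mkdir -p does not complain if a folder already exists, so the helper can be run again after a partial setup.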

Now we make a symbolic link to the folder. This makes some things easier.
ln -s /opt/nutch-1.2 /opt/nutch
After you have created the folders you have to set the "right" permissions.
By the way, I have absolutely no idea whether this is right, safe or whatever, but it works!!!
At this point I need some advice from some permission gurus.
Okay, Nutch needs to be accessible to the user that runs it. In my case this means the web server user.

chown myuser:myuser -R /opt/nutch
chmod 777 -R /opt/nutch
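
A tighter alternative to the 777 above, along the lines of what comment #1 suggests (the web server user owns the folders, nobody else can write). Only a sketch: grant_to_webserver is a made-up helper name, www-data is assumed to be the web server user (the Ubuntu Apache default), and including seed and logs next to crawl is an assumption based on the files this setup writes to.

```shell
#!/bin/bash
# grant_to_webserver OWNER DIR...
# Hypothetical helper: hand each DIR to OWNER and remove group/world write access.
grant_to_webserver() {
  local owner="$1"
  shift
  local dir
  for dir in "$@"; do
    chown -R "$owner" "$dir"
    # u=rwX,go=rX: directories end up 755; files get 644 unless they
    # were already executable, in which case they end up 755 as well
    chmod -R u=rwX,go=rX "$dir"
  done
}

# Usage (as root), assuming www-data runs the runbot script:
# grant_to_webserver www-data /opt/nutch-1.2/crawl /opt/nutch-1.2/seed /opt/nutch-1.2/logs
```

The usage line passes the real /opt/nutch-1.2 paths rather than the /opt/nutch symlink, so the recursive commands do not depend on how symlinks given on the command line are handled.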

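Now we have to change the runbot script in the Drupal Nutch module.
Change this
#!/bin/sh
to this
#!/bin/bash
This is important!!!!!! Otherwise the crawler will never run correctly. On Ubuntu, /bin/sh is dash, not bash, so a script that relies on bash features does not behave the same when started through it.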

Okay, we go on. Now we fix a small problem in nutch.admin.inc.
It can cause a PHP notice. Not important, but hey, WE CAN FIX THIS.
Add this at line 170:

$rtn_output = '';

It's a minor thing.

I really hope I didn't forget something...
Configure Nutch in the module settings:
NUTCH_HOME = /opt/nutch
JAVA_HOME = depends on your installation, check /usr/lib/jvm/JAVAVERSION
SOLR URL = http://localhost:8080/solr (mostly)
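
If you are not sure which folder below /usr/lib/jvm is the right JAVA_HOME, one way to find a candidate is to start from the java binary itself. A small sketch under assumptions: java_home_from_binary is a made-up helper name, and it only strips the usual /bin/java (and /jre) suffix, so check the result against your own installation.

```shell
#!/bin/bash
# java_home_from_binary PATH_TO_JAVA
# Hypothetical helper: strip the trailing /bin/java (and /jre, if present)
# from the resolved path of the java binary to get a JAVA_HOME candidate.
java_home_from_binary() {
  local home="${1%/bin/java}"
  home="${home%/jre}"
  printf '%s\n' "$home"
}

# Usage: resolve the symlink chain behind `java`, then strip the suffix.
# java_home_from_binary "$(readlink -f "$(command -v java)")"
```

For example, a binary at /usr/lib/jvm/java-6-sun/jre/bin/java gives /usr/lib/jvm/java-6-sun, which is the value used for -j in the runbot command in comment #2.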

So if you try the debug crawl you should see no errors anymore! Feel free to submit corrections or changes to this post. I hope this helps to get Nutch running! Next I will integrate Nutch with Solr. If I find a way, I will post it here as well.

Comment	File	Size	Author
#15	filepermcheck.patch	7.3 KB	broncomania
#9	Nutch crawler 1289125929485.png	35.62 KB	broncomania
#10	Nutch crawler 1289290284951.png	62.73 KB	broncomania

Comments

Comment #1

karljohann commented 24 October 2010 at 22:57

Very nice to see some documentation actually :)

Here are a few comments:

"nutch-site.xml
Then you have to fill in the values like http.agent.name and the others."
- It's not necessary to fill out all these values and most of them won't affect whether Nutch will run or not.

"Okay Nutch need the permissions of the user that uses nutch. This means the webserver user in my case.
chown myuser:myuser -R /opt/nutch
chmod 777 -R /opt/nutch"
- You need to change the permissions so the user who is running the runbot script (usually the web server) can write in the crawl directory. 777 means that anybody and everybody can write in this folder, which is obviously not recommended. I make the web server the owner of the crawl folder (chown -R nginx nutch/crawl) and then set the permissions to 755 (owner can write, group and everybody can read).

"Now we have to change the runbot in the Drupal Nutch Modul.
Change this
#!/bin/sh
to this
#!/bin/bash
This is important!!!!!! otherwise the crawler will never ever run correctly."
- I assume this is different between distros, but definitely worth a try if it isn't working.

You also have to copy the solrindex-mapping.xml to the conf folder (http://drupal.org/node/811062#comment-3154622) as well as merge the schema.xml and solrconfig.xml (use the ones from the apachesolr module and then check this: http://drupal.org/node/811062#comment-3240566)

Hope this helps a bit.

Comment #2

broncomania commented 25 October 2010 at 10:33

DEAR READER, IGNORE THIS INFORMATION, IT'S NOT CORRECT, ONLY conf USING!! :-)

You also have to copy the solrindex-mapping.xml to the conf folder (http://drupal.org/node/811062#comment-3154622) as well as merge the schema.xml and solrconfig.xml (use the ones from the apachesolr module and then check this: http://drupal.org/node/811062#comment-3240566)

PART 2

So after several attempts I got it working, and here is the second part:

Don't forget that I installed Solr and Nutch in the /opt folder. If your setup is different, you must adjust the paths!!!

1.: Copy the schema.xml from the Drupal ApacheSolr module (version 2) over the Solr schema.xml. You will find the Solr schema.xml here:
/opt/solr/config/schema.xml

2.: Now copy the ApacheSolr module's solrconfig.xml over the file
/opt/solr/config/solrconfig.xml

3.: In Nutch version 1.2, open solrindex-mapping.xml. You will find the file here:
/opt/nutch-1.2/conf/solrindex-mapping.xml
Now change this entry
<field dest="content" source="content"/>
to this
<field dest="body" source="content"/>

4.: Restart Tomcat
/etc/init.d/tomcat6 restart
This is necessary so that Solr picks up the new information.
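
The edit from step 3 can also be scripted. A sketch only: map_content_to_body is a made-up helper name, and it assumes the entry in your solrindex-mapping.xml is written exactly as shown in step 3 (same attribute order and spacing), otherwise nothing is changed.

```shell
#!/bin/bash
# map_content_to_body FILE
# Hypothetical helper: send the Nutch "content" field to the Solr "body" field.
# The original file is kept next to it as FILE.bak.
map_content_to_body() {
  sed -i.bak \
    's|<field dest="content" source="content"/>|<field dest="body" source="content"/>|' \
    "$1"
}

# Usage:
# map_content_to_body /opt/nutch-1.2/conf/solrindex-mapping.xml
```

Afterwards a quick grep for dest="body" in the file shows whether the replacement took place.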

And that's it!!! Quite easy, isn't it???

Now let the bot run with a command like this. Remember that I use the paths of my installation!!!
/home/YOURUSER/public_html/sites/all/modules/nutch/runbot -n '/opt/nutch' -j '/usr/lib/jvm/java-6-sun' -s 'http://localhost:8080/solr' -u 'http://www.example.com!http://www.example1.com' -c '1' -f '100' -d '0'

@karljohann You are right about the crawler information in nutch-site.xml, but you get errors if you don't fill in http.agent.name. The crawler may run, but it throws an exception without it. So just for correctness we fill in the info. :-)
Thanks also for the info that sh / bash depends on the distro. I didn't know that. Here I just want to give a running example for Ubuntu 10.04. Many thanks as well for the short tutorial about the permissions: which user runs the bot and which folders need which permissions.

It is not necessary to merge files or to copy solrindex-mapping.xml somewhere. This information is just confusing. It's not necessary for the installation I described here!

Comment #3

karljohann commented 25 October 2010 at 11:50

Well it might be confusing but you still need to do it.

Comment #4

broncomania commented 25 October 2010 at 15:11

So in my case it works without that! I followed the steps in the other posting, but it didn't work. That was the reason for me to write this step-by-step tutorial. Why is it necessary to get it to work, and why does it work in my case without this step if it is necessary? Ask yourself: why would so many people post and always ask the same questions if the tutorial was right and straightforward? Anyway, maybe it is necessary on a different distro. Which distro did you use and where did you put solrindex-mapping.xml? How did you integrate that into Solr, or how would Solr know that you copied the file there? You see, even I, who got it working, have a lot of questions about that. Can you please clarify why? The other readers and I really don't understand it.

Best regards
Frank

Comment #5

karljohann commented 25 October 2010 at 22:38

If you want the results to show up in Solr you have to copy the solrconfig.xml and schema.xml as well as modify the solrindex-mapping.xml. If you got it to work without doing this then I am clearly misunderstanding something.

Comment #6

broncomania commented 26 October 2010 at 19:04

Yes, it was a misunderstanding! I agree with all your steps. They are the same ones I described in the step-by-step tutorial. So readers can follow this tutorial and get it to work.

Comment #7

maxmmize commented 31 October 2010 at 01:20

Do you know how to change the teaser length, so that instead of 200 characters in my search results I can get 400 or more?

Is this something I have to tell Nutch, Solr or ApacheSolr to do, and where?

Comment #8

broncomania commented 1 November 2010 at 00:27

I don't know how to do it at the moment, because I have other problems. But if you find a solution for this, please post it here. Maybe we can collect snippets for the things users most want to change.

I found a nice article about optimizing the search for the German language.

german word compound solr index optimizer

http://www.early-dance.de/news/9189-apachesolr-issues-german-and-other-g...

Comment #9

broncomania commented 9 November 2010 at 08:47

Status	File	Size
new	Nutch crawler 1289125929485.png	35.62 KB

Update permissions problem:

I ran into the permissions problem, so I wrote this small snippet after I got it running.

Add the following to the nutch.admin.inc file to get a result like in the attached picture.

Add this code to the nutch_admin_settings() function before the

return system_settings_form($form);

/**
  * Nutch folder permissions and owner check
  * @ seed/urls       exists and is writable?
  * @ crawl/crawldb   exists and is writable?
  * @ logs/hadoop.log
  */
  $NUTCH_HOME = variable_get('nutch_nutch_dir', '/usr/local/nutch');
  $check_true = '<span style="color:green;">Ok</span>';
     
  $form['permissions_check'] = array(
     '#type' => 'fieldset',
     '#title' => 'Folder permissions',
     '#description' => t('Folder permissions and owner check. If something is not okay Nutch will not run properly!'),  
     '#collapsible' => true,
     '#collapsed' => true,
  );	

  $apache_uid = posix_getuid();
  $form['permissions_check']['apache_uid'] = array(
    '#title' => 'Apache Uid',
     '#value' => $apache_uid, 
     '#prefix' => '<div>'.t('Apache UID').':',
     '#suffix' => '</div>',
  );	
  
  
  /**
  * Seed
  */
  $form['permissions_check']['seed'] = array(
     '#type' => 'fieldset',
     '#title' => 'Seed check',
     '#description' => t('Overview about your seeds folder and seed/urls file'),  
  );	  
  
  $seed_folder_exist = is_dir($NUTCH_HOME.'/seed');
  $form['permissions_check']['seed']['seed_folder_exist'] = array(
    '#title' => 'seed',
     '#value' => ($seed_folder_exist) ? $check_true : nutch_nutch_check_false(), 
     '#prefix' => '<div>'.t('Seed folder exist check').':',
     '#suffix' => '</div>',
  );	  

  
 $seed_owner_uid = fileowner($NUTCH_HOME.'/seed');
  $form['permissions_check']['seed']['seed_owner'] = array(
    '#title' => 'seed',
     '#value' => ($seed_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($seed_owner_uid), 
     '#prefix' => '<div>'.t('Seed owner check').':',
     '#suffix' => '</div>',
  );	  



  $seed_urls_owner_uid = fileowner($NUTCH_HOME.'/seed/urls');
  $form['permissions_check']['seed']['seed_urls_owner'] = array(
    '#title' => 'seed/urls',
     '#value' => ($seed_urls_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($seed_urls_owner_uid),
     '#prefix' => '<div>'.t('Seed / urls owner check').':',
     '#suffix' => '</div>',
  );	

  $seed_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/seed')), -4); 
  $form['permissions_check']['seed']['seed_permission'] = array(
    '#title' => 'seed',
     '#value' => $seed_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/seed owner permission').':',
     '#suffix' => '</div>',
  );	

  $seed_url_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/seed/urls')), -4); 
  $form['permissions_check']['seed']['seed_url_permission'] = array(
    '#title' => 'seed/urls',
     '#value' => $seed_url_permission,
     '#prefix' => '<div>'.$NUTCH_HOME.t('/seed/urls owner permission').':',
     '#suffix' => '</div>',
  );	


  /**
  * crawl
  */
  $form['permissions_check']['crawl'] = array(
     '#type' => 'fieldset',
     '#title' => 'Crawl check',
     '#description' => t('Overview about your crawl folder'),  
  );	  
  
  $crawldb_owner_uid = fileowner($NUTCH_HOME.'/crawl/crawldb');
  $form['permissions_check']['crawl']['crawldb_owner'] = array(
    '#title' => 'crawl/crawldb ',
     '#value' => ($crawldb_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($crawldb_owner_uid), 
     '#prefix' => '<div>'.t('Crawldb owner check').':',
     '#suffix' => '</div>',
  );
   

 $linkdb_owner_uid = fileowner($NUTCH_HOME.'/crawl/linkdb');
  $form['permissions_check']['crawl']['linkdb_owner'] = array(
    '#title' => 'crawl/linkdb ',
     '#value' => ($linkdb_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($linkdb_owner_uid), 
     '#prefix' => '<div>'.t('Linkdb owner check').':',
     '#suffix' => '</div>',
  );  

  $segments_owner_uid = fileowner($NUTCH_HOME.'/crawl/segments');
  $form['permissions_check']['crawl']['segments_owner'] = array(
    '#title' => 'crawl/segments',
     '#value' => ($segments_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($segments_owner_uid), 
     '#prefix' => '<div>'.t('Segments owner check').':',
     '#suffix' => '</div>',
  );
 


  $crawldb_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/crawl/crawldb')), -4); 
  $form['permissions_check']['crawl']['crawldb_permission'] = array(
    '#title' => 'crawl/crawldb ',
     '#value' => $crawldb_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/crawl/crawldb permission check').':',
     '#suffix' => '</div>',
  );
  
  $linkdb_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/crawl/linkdb')), -4); 
  $form['permissions_check']['crawl']['linkdb_permission'] = array(
    '#title' => 'crawl/linkdb ',
     '#value' => $linkdb_permission,
     '#prefix' => '<div>'.$NUTCH_HOME.t('/crawl/linkdb permission check').':',
     '#suffix' => '</div>',
  );

  $segments_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/crawl/segments')), -4); 
  $form['permissions_check']['crawl']['segments_permission'] = array(
    '#title' => 'crawl/segments ',
     '#value' => $segments_permission, 
     '#prefix' => '<div>'.$NUTCH_HOME.t('/crawl/segments permission check').':',
     '#suffix' => '</div>',
  );


  /**
  * Hadoop
  */
  $form['permissions_check']['hadoop'] = array(
     '#type' => 'fieldset',
     '#title' => 'Hadoop check',
     '#description' => t('Overview about your hadoop folder.'),  
  );	  
  
  
  $hadoop_owner_uid = fileowner($NUTCH_HOME.'/logs/hadoop.log');
  $form['permissions_check']['hadoop']['hadoop_owner'] = array(
    '#title' => 'logs/hadoop.log',
     '#value' => ($hadoop_owner_uid == $apache_uid) ? $check_true : nutch_nutch_check_false($hadoop_owner_uid), 
     '#prefix' => '<div>'.t('Hadoop owner check').':',
     '#suffix' => '</div>',
  );
   
  
   $hadoop_permission = substr(sprintf('%o', fileperms($NUTCH_HOME.'/logs/hadoop.log')), -4); 
   $form['permissions_check']['hadoop']['hadoop_permission'] = array(
    '#title' => 'logs/hadoop.log',
     '#value' => $hadoop_permission,
     '#prefix' => '<div>'.$NUTCH_HOME.t('/logs/hadoop.log permission check').':',
     '#suffix' => '</div>',
  );

Add this function to the end of the nutch.admin.inc file.

function  nutch_nutch_check_false($permissions = '')
{  
  $output = '<span style="color:red;">Wrong '.$permissions.'</span>';	
  
  return $output;
}

That's it. In my case the crawler runs when the result looks like my attached picture.

I hope this makes it a little bit easier to locate the problems, and that it finds its way into the next alpha??
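
The same kind of check can be done from the shell before touching the Drupal side. This is only a sketch and not part of the module or the patch: check_nutch_paths is a made-up helper name, and it tests against the user who runs it, so it has to be run as the web server user to tell you anything useful.

```shell
#!/bin/bash
# check_nutch_paths NUTCH_HOME
# Hypothetical helper: report every path from the tutorial that is missing,
# not owned by the current user, or not writable by the current user.
# Returns non-zero if at least one path has a problem.
check_nutch_paths() {
  local nutch_home="$1"
  local rc=0
  local rel target
  for rel in seed seed/urls crawl/crawldb crawl/linkdb crawl/segments logs/hadoop.log; do
    target="$nutch_home/$rel"
    if [ ! -e "$target" ]; then
      echo "MISSING   $target"; rc=1
    elif [ ! -O "$target" ]; then
      echo "NOT OWNED $target"; rc=1
    elif [ ! -w "$target" ]; then
      echo "READ-ONLY $target"; rc=1
    else
      echo "ok        $target"
    fi
  done
  return "$rc"
}
```

To use it, save the function in a script, add a last line such as check_nutch_paths /opt/nutch, and run the script as the web server user (for example with sudo -u www-data, if www-data is that user on your system).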

Comment #10

broncomania commented 9 November 2010 at 08:13

Status	File	Size
new	Nutch crawler 1289290284951.png	62.73 KB

I added some more checks to the code above. Look at the attached picture if you want to know what it looks like.

Comment #11

dstuart commented 9 November 2010 at 08:32

Hey broncomania,

This looks good. Is it possible to get it as a patch instead of pasted into a comment?

Regards,

Dave

Comment #12

broncomania commented 15 November 2010 at 11:41

Ah, I have never built a patch, but I will try. Give me a few days...

Comment #13

maxmmize commented 15 November 2010 at 16:22

@Bronco - Looks really good man! I did find the solution to changing my title and snippet length, but like you, have never written a patch before. I can post the links if you like but it looks like you have your hands full with Dave's request.

Good job, the easier we make it for novices the more people will get into the development and use.

Comment #14

dstuart commented 16 November 2010 at 05:04

Hey,

To help out a little, read up here: http://drupal.org/patch/create

You will need to check out the latest copy of the codebase:
cvs -z6 -d:pserver:anonymous:anonymous@cvs.drupal.org:/cvs/drupal-contrib checkout -d nutch-HEAD contributions/modules/nutch/
Then port your changes across to the checkout and use the instructions from the link above.

Cheers,

Dave

Comment #15

broncomania commented 16 November 2010 at 12:21

Status	File	Size
new	filepermcheck.patch	7.3 KB

@maxmmize It's not so hard to make a patch, it takes roughly 1 hour. It looks like we are collecting really cool info and functions to get this crawler running. Please don't hesitate. Post your links here, maybe I can help you!

@dstuart Thanks for the short hint. This saved me a lot of time.

So here is the patch, untested.

So have fun
Frank

Comment #16

marco69 commented 31 May 2011 at 13:27

Dear Broncomania,

I have followed your tutorial step by step... great effort, really appreciated.

How do I then see the results from the Nutch crawl? I assumed that Solr would index them and show them in the results, but this is not happening. Any help here?

Thank you so much.

Ciao Marco

Comment #17

avpaderno commented 23 September 2019 at 12:41

Assigned:	broncomania	» Unassigned
Status:	Active	» Closed (outdated)

I am closing this issue, since Drupal 6 isn't supported anymore.
