I am assigning myself this task. I will make a patch that allows us to archive hadoop.log with a timestamp, or to delete the archives or the current hadoop.log.

This lets us clean up the Hadoop log from the interface so we don't have to log in over SSH. It also gives the option to archive the log.
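
A minimal sketch of the archive step described above, in plain PHP with no Drupal calls. The function name example_archive_log and the Ymd-His timestamp format are assumptions made for illustration; neither comes from the module.

```php
<?php
/**
 * Hypothetical helper, not part of the Nutch module: copy a log to a
 * timestamped sibling file, then empty the original.
 *
 * Returns the archive path on success, or FALSE if nothing was archived.
 */
function example_archive_log($log_file, $timestamp = NULL) {
  if (!is_file($log_file) || !is_writable($log_file)) {
    return FALSE;
  }
  $timestamp = ($timestamp === NULL) ? time() : $timestamp;
  // Builds a name such as 20100411-234026.hadoop.log next to the original.
  $base = dirname($log_file) . '/' . date('Ymd-His', $timestamp);
  $archive = $base . '.' . basename($log_file);
  // Avoid overwriting an archive created in the same second.
  for ($i = 1; file_exists($archive); $i++) {
    $archive = $base . '_' . $i . '.' . basename($log_file);
  }
  if (!copy($log_file, $archive)) {
    return FALSE;
  }
  // Truncate instead of deleting, so the log file itself keeps existing.
  file_put_contents($log_file, '');
  return $archive;
}
```

Calling example_archive_log('/usr/local/nutch/logs/hadoop.log') would leave an empty hadoop.log and a timestamped copy beside it. Deleting an archive is then a plain unlink() on the returned path.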

Comments

dstuart’s picture

Hey maxmmize,

Note that Nutch already does this automatically on a daily basis. Do you envision it needing to be done more regularly than that?

Regards,

Dave

maxmmize’s picture

Well, for me, I monitor every crawl. I have to run a small crawl with a max of about 10 links for 1 URL, then about 20-40, just to make sure the URL is being crawled correctly. I suppose once I get my fetch lists down after a month or so it won't be necessary.

Since I am always monitoring my crawls, I never ran into the function you described.

Now that I have your information, it seems either redundant or a blessing, depending on what stage of crawling and indexing you are at.

On the other hand, for testing crawl scripts and such it seems like a good admin tool, not that a shell script couldn't achieve the same thing though.

Final thoughts?

dstuart’s picture

Hey Maxmmize,

By all means, it's a useful feature that has a good use case. I just wanted to make sure you weren't wasting effort if the above fit your requirements.

Regards,

David

maxmmize’s picture

Here it is, kind of heavy and a bit out of shape, but it does more or less what I wanted to achieve. There is probably a better way to display it though.

Start a crawl and click on hadoop.log. When you submit a new crawl, the current hadoop.log is timestamped and moved, and then hadoop.log is cleared. It needs more logic, like a delete-log option and perhaps a drop-down list instead. I have to work with it a bit to see what I really want. Anyway, gimme feedback when you get a chance.
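
A rough sketch of the two missing pieces mentioned above (a drop-down list of logs and a delete action), again in plain PHP. The names example_log_options and example_delete_log are hypothetical and do not exist in the module. The returned array has the filename => filename shape that a Drupal select element takes in '#options'.

```php
<?php
/**
 * Hypothetical helper, not part of the Nutch module: list the *.log files
 * in a directory as a filename => filename map.
 */
function example_log_options($log_dir) {
  $options = array();
  $entries = is_dir($log_dir) ? scandir($log_dir) : array();
  foreach ($entries as $name) {
    // Keep regular files ending in .log; this also skips '.' and '..'.
    if (substr($name, -4) === '.log' && is_file($log_dir . '/' . $name)) {
      $options[$name] = $name;
    }
  }
  return $options;
}

/**
 * Hypothetical helper: delete one log, but only if its name is among the
 * listed options, so a crafted value such as '../conf/nutch-site.xml'
 * is rejected.
 */
function example_delete_log($log_dir, $name) {
  $options = example_log_options($log_dir);
  if (!isset($options[$name])) {
    return FALSE;
  }
  return unlink($log_dir . '/' . $name);
}
```

Checking the submitted name against the listing, rather than building a path straight from the request, keeps the delete action confined to the logs directory.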

nutch.admin.inc

<?php
// $Id: nutch.admin.inc,v 1.1 2010/04/11 23:40:26 dstuart Exp $

/**
 * @file
 * Implements admin functionality relating to Nutch setup and running
 */


/**
 * Settings form for Nutch setup
 * @TODO Need to add all of the extra crawl options 
 *        - topN: -topN, then instead of all unfetched urls you only get N urls with the highest score
 *        - threads
 *        - depth
 *        - adddays
 */
function nutch_admin_settings() {
  if(!file_check_directory(variable_get('nutch_nutch_dir', '/usr/local/nutch'))){
    drupal_set_message(t('The Directory %directory either does not exist or is not writable by the webserver.', array('%directory' => variable_get('nutch_nutch_dir', '/usr/local/nutch'))), 'error');
  }
  $form = array();
  $form['system'] = array(
    '#type' => 'fieldset',
    '#title' => t('System settings'),
    '#collapsible' => TRUE,
    '#collapsed' => FALSE,
  );

  $form['system']['nutch_nutch_dir'] = array(
    '#type' => 'textfield',
    '#title' => t('NUTCH_HOME'),
    '#description' => t('The absolute path to where the Nutch distribution is found, so that $NUTCH_HOME/bin/nutch resolves to the main Nutch script.'),
    '#default_value' => variable_get('nutch_nutch_dir', '/usr/local/nutch'),
    '#required' => TRUE,
  );
  $form['system']['nutch_java'] = array(
    '#type' => 'textfield',
    '#title' => t('JAVA_HOME'),
    '#description' => t('The absolute path to a directory which is JAVA_HOME. $JAVA_HOME/bin/java should resolve to the Java binary that will run Nutch.'),
    '#default_value' => variable_get('nutch_java', '/usr'),
    '#required' => TRUE,
  );

  if (!$nutch_solr  = variable_get('nutch_solr', FALSE)) {
    $host = variable_get('apachesolr_host', 'localhost');
    $port = variable_get('apachesolr_port', '8983');
    $path = variable_get('apachesolr_path', '/solr');
    $nutch_solr = sprintf('http://%s:%s%s',$host, $port, $path);
  }
  $form['system']['nutch_solr'] = array(
    '#type' => 'textfield',
    '#title' => t('SOLR URL'),
    '#description' => t('The url of your apache solr instance.'),
    '#default_value' => $nutch_solr,
    '#required' => TRUE,
  );
  $form['system']['nutch_crawl_depth'] = array(
    '#type' => 'select',
    '#title' => t('Per host URL fetch limit'),
    '#options' => drupal_map_assoc(array(1, 2, 5, 10)),
    // Read back the variable this element saves to, not nutch_urls_to_fetch.
    '#default_value' => variable_get('nutch_crawl_depth', 2),
    '#description' => t('The depth to follow links on a page'),
  );
  $form['system']['nutch_topN'] = array(
    '#type' => 'select',
    '#title' => t('Top URLs by Score'),
    '#options' => drupal_map_assoc(array(1, 5, 10, 20, 50, 100)),
    // Read back the variable this element saves to, not nutch_urls_to_fetch.
    '#default_value' => variable_get('nutch_topN', 100),
    '#description' => t('Grab the top N URLs based on nutch Score.'),
  );   
  $form['system']['nutch_urls_to_fetch'] = array(
    '#type' => 'select',
    '#title' => t('The number of URLs to fetch and index'),
    '#options' => drupal_map_assoc(array(10, 20, 50, 100, 200, 500, 750, 1000, 2000, 5000, 10000)),
    '#default_value' => variable_get('nutch_urls_to_fetch', 100),
    '#description' => t('When crawl is run, this number of URLs will be fetched and indexed. It is important to know how long this takes on your equipment so as to avoid starting overlapping jobs. Always start with small numbers (hundreds) and profile how long it takes before moving up.'),
  ); 
  $form['system']['nutch_commit'] = array(
    '#type' => 'checkbox',
    '#title' => t('Send to Solr'),
    '#description' => t('Send to Solr at the end of the Nutch crawl'),
    '#default_value' => variable_get('nutch_commit', 1),
  );    
  $form['system']['nutch_debug'] = array(
    '#type' => 'checkbox',
    '#title' => t('Dry run/debug crawl'),
    '#default_value' => variable_get('nutch_debug', 0),
  );    
  $form['system']['nutch_crawl_on_cron'] = array(
    '#type' => 'checkbox',
    '#title' => t('Run crawl on cron'),
    '#default_value' => variable_get('nutch_crawl_on_cron', 0),
  );
  return system_settings_form($form);
}

function nutch_admin_crawl() {
  if(!file_check_directory(variable_get('nutch_nutch_dir', '/usr/local/nutch'))){
    drupal_set_message(t('The Directory %directory either does not exist or is not writable by the webserver.', array('%directory' => variable_get('nutch_nutch_dir', '/usr/local/nutch'))), 'error');
  }  
  $form = array();
  $form['controls'] = array(
    '#type' => 'fieldset',
    '#title' => t('crawl controls'),
    '#collapsible' => FALSE,
    '#collapsed' => FALSE,
  );
  $form['controls']['nutch_commit'] = array(
    '#type' => 'checkbox',
    '#title' => t('Send to Solr'),
    '#description' => t('Send to Solr at the end of the Nutch crawl'),
    '#default_value' => variable_get('nutch_commit', 1),
  );    
  $form['controls']['nutch_debug'] = array(
    '#type' => 'checkbox',
    '#title' => t('Dry run/debug crawl'),
    '#default_value' => variable_get('nutch_debug', 0),
  );  
  $form['controls']['crawl'] = array(
    '#type' => 'submit',
    '#value' => t('Start Crawl'),
  );
  return $form;
}

function nutch_admin_crawl_submit($form, &$form_state) {
  nutch_start_crawl(
           variable_get('nutch_nutch_dir', '/usr/local/nutch'), 
           variable_get('nutch_java', '/usr'), 
           variable_get('nutch_solr', FALSE), 
           (@$form['#post']['nutch_commit']?$form['#post']['nutch_commit']:variable_get('nutch_commit', 1)), 
           variable_get('nutch_urls_to_fetch', 100), 
           variable_get('nutch_seed_url', 'http://localhost'), 
           (@$form['#post']['nutch_debug']?$form['#post']['nutch_debug']:variable_get('nutch_debug', 0))
  );
}

function nutch_start_crawl($nutch_home, $java_home, $solr_url, $commit=0, $fetch_amount=100, $seed_urls='', $debug=0) {
  
  $command =  $_SERVER['DOCUMENT_ROOT'] . base_path() . drupal_get_path('module', 'nutch') .'/'.'runbot';

  if (!empty($nutch_home)){
    $command .= ' -n '. escapeshellarg($nutch_home);
  }
  else {
    drupal_set_message(t("You must supply a nutch home directory; crawl aborted."));
    return 0;
  }
  
  if (!empty($java_home)) {
    $command .= ' -j '. escapeshellarg($java_home);
  }
  else {
    drupal_set_message(t("You must supply a java home; crawl aborted."));
    return 0;
  }  

  if (!empty($solr_url)) {
    $command .= ' -s '. escapeshellarg($solr_url);
  }
  else {
    drupal_set_message(t("You must supply the URL of your Solr instance; crawl aborted."));
    return 0;
  }

  if (!empty($seed_urls)) {
    /* 
     * Replace all of the line feeds with ! because the
     * shell script doesn't recognise them when adding them
     * to the urls file. We could run echo -e but it's not
     * a universal command 
     */
    $seed_urls_replaced = str_replace("\n", '!', $seed_urls);
    if($debug == 1) drupal_set_message($seed_urls_replaced);
    $command .= ' -u '. escapeshellarg($seed_urls_replaced);
  }
  
  $command .= ' -c '. escapeshellarg($commit);
  $command .= ' -f '. escapeshellarg($fetch_amount);
  $command .= ' -d '. escapeshellarg($debug);
  
  drupal_set_message(t("Starting Nutch Crawl."));
  if($debug == 1){
    drupal_set_message($command);
    $rtn = exec($command . ' &', $output);
    // Start from an empty string so the message holds only the command output.
    $rtn_output = '';
    foreach($output as $line){
      $rtn_output .= $line . "<br/>";
    }
    drupal_set_message($rtn_output);     
  }
  else{
    $tmp = array();
    $process = proc_open($command . ' &', array(), $tmp);
    drupal_set_message($process);
    $rtn = proc_close($process);
  }
  if (!$rtn) {
      nutch_admin_logs_archive();
  }
}

function nutch_admin_seed() {
  $form = array();
  $form['controls'] = array(
    '#type' => 'fieldset',
    '#title' => t('seed controls'),
    '#collapsible' => TRUE,
    '#collapsed' => FALSE,
  );
  /* @TODO DO a compare of the current seed files on disk and that saved in the variables table
             if different set a message */ 
  $form['controls']['nutch_seed_url'] = array(
    '#type' => 'textarea',
    '#title' => t('Seed URLs'),
    '#description' => t('The URL(s) Nutch uses to seed the crawl. Please put each URL on a separate line.'),
    '#default_value' => variable_get('nutch_seed_url', 'http://localhost'),
    '#required' => TRUE,
  );
  $form['controls']['nutch_url_filters'] = array(
    '#type' => 'textarea',
    '#title' => t('Filter URLs'),
    '#description' => t('Filter regexes defining your crawl criteria. Please put each filter on a separate line.'),
    '#default_value' => variable_get('nutch_url_filters', "+^http://localhost\n-."),
    '#required' => TRUE,
  );  
  $form['controls']['nutch_mimetype_blacklist'] = array(
    '#type' => 'textarea',
    '#title' => t('Mimetype Blacklist'),
    '#description' => t('List of mime types to ignore'),
    '#default_value' => variable_get('nutch_mimetype_blacklist', "swf|SWF|doc|DOC|mp3|MP3|WMV|wmv|txt|TXT|rtf|RTF|avi|AVI|m3u|M3U|flv|FLV|WAV|wav|mp4|MP4|avi|AVI|rss|RSS|xml|XML|pdf|PDF|js|JS|gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP"),
    '#required' => TRUE,
  );   
  $form['controls']['nutch_protocol_blacklist'] = array(
    '#type' => 'textfield',
    '#title' => t('Protocol Blacklist'),
    '#description' => t('List of protocol links to ignore'),
    '#default_value' => variable_get('nutch_protocol_blacklist', "https|telnet|file|ftp|mailto"),
    '#required' => TRUE,
  ); 	 
  return system_settings_form($form);
}


function nutch_admin_inject_submit($form, &$form_state) {
    drupal_set_message(t("Nutch is injecting URLs."));
    nutch_do_inject(variable_get('nutch_nutch_dir', '/usr/local/nutch'), variable_get('nutch_crawl_dir', '/usr/local/nutch/crawl'), variable_get('nutch_urls_dir', '/usr/local/nutch/seed'));
}


function nutch_admin_logs() {
  $output = '';
  $log_file = variable_get('nutch_nutch_dir', '/usr/local/nutch') .'/logs/hadoop.log';
  if (!is_readable($log_file)) {
    drupal_set_message(t('Cannot access %log: either it does not exist or it is not readable by the webserver.', array('%log' => $log_file)), 'error');
  } else {
    $output = check_plain(file_get_contents($log_file));
  }
  
  /* {log modification 1 - start} */
  $output = 'select a log on the right...';
  $list = scandir(str_replace(basename($log_file),'',$log_file));
  $logs = '';
  foreach ($list as $k => $v) {
      if ($v != '.' && $v != '..' 
      //&& $v != 'hadoop.log'
      ) $logs .= '<a href="?log='.$k.'">'.$v.'</a><br />';
      if (!empty($_REQUEST['log']) && $_REQUEST['log']==$k) $output = check_plain(file_get_contents(str_replace(basename($log_file),'',$log_file).$v));
  }
  $form = array();
  $form['before']=array('#value'=>'<div style="float:left;width:70%;">');
  $form['log'] = array('#type' => 'textarea','#title' => $log_file,'#value' => $output,'#rows' => 30);
  $form['between']=array('#value'=>'</div><div style="float:left;width:30%;"><div class="form-item"><label>Log Files</label>');
  $form['menu']=array('#value'=>$logs);
  $form['after']=array('#value'=>'</div></div>');
  return $form;
  /* {log modification 1 - end} */
}

function nutch_get_current_seed_urls(){
  $seed_file = variable_get('nutch_nutch_dir', '/usr/local/nutch') .'/seed/urls';
  $output = check_plain(file_get_contents($seed_file));
  if (!$output) {
    $output = '';
  }
  return $output;
}


/* {log modification 2 - start} */
function nutch_admin_logs_archive($log_file=null) {
    if (!$log_file) $log_file = variable_get('nutch_nutch_dir', '/usr/local/nutch') .'/logs/hadoop.log';
    if (!is_readable($log_file) || !is_writable($log_file)) return 0;
    $dir = dirname($log_file);
    $file = basename($log_file);
    // Each crawl run starts with this Injector line; use it to split the log.
    $split = 'crawl.Injector - Injector: starting';
    $logs = array();
    $buffer = '';
    // Walk the log in its original order so each archive reads top to bottom.
    foreach (explode("\n", file_get_contents($log_file)) as $line) {
        // A new run starts here: close off the previous segment first.
        if (strpos($line, $split) !== FALSE && trim($buffer) !== '') {
            $logs[] = $buffer;
            $buffer = '';
        }
        $buffer .= $line."\n";
    }
    if (trim($buffer) !== '') $logs[] = $buffer;

    $i = 1;
    $archived = 0;
    foreach ($logs as $log) {
        // Pick an unused timestamped name, e.g. 114026pm-041110_001.hadoop.log.
        do $new = $dir.'/'.date('hisa-mdy').'_'.sprintf('%03d', $i++).'.'.$file;
        while (file_exists($new));
        // Leave hadoop.log untouched if a segment cannot be written.
        if (file_put_contents($new, $log) === FALSE) return $archived;
        $archived++;
    }
    // Every segment is archived, so the current log can be cleared.
    file_put_contents($log_file, '');
    return $archived;
}
function nutch_admin_conf() {
    $output = '';
    
    $default_file = variable_get('nutch_nutch_dir', '/usr/local/nutch') .'/conf/nutch-default.xml';
    if(!is_readable($default_file)){
        drupal_set_message(t('Cannot access %default: either it does not exist or it is not readable by the webserver.', array('%default' => $default_file)), 'error');
    }else{
        $site_default = html_entity_decode(check_plain(file_get_contents($default_file)));
    }
    
    $config_file = variable_get('nutch_nutch_dir', '/usr/local/nutch') .'/conf/nutch-site.xml';
    if(!is_writable($config_file)){
        drupal_set_message(t('Cannot access  %config either it does not exist or is not writable by the webserver.', array('%config' => $config_file)), 'error');
    }else{
        $site_config = html_entity_decode(check_plain(file_get_contents($config_file)));
    }
    
    $form=array();
    $form['open']=array('#value'=>'<div style="width:50%;float:left;">');
    $form['path']=array(
    '#type'=>'hidden',
    '#value'=>$config_file,
    );
    $form['site_config']= array(
    '#type' => 'textarea',
    '#title' => 'Site Config',
    '#value' => $site_config,
    '#rows' => 30,//count(explode("\n", $default_file)),
    );
    $form['middle']=array('#value'=>'</div><div style="width:50%;float:left;">');
    $form['site_default'] = array(
    '#type' => 'textarea',
    '#title'=>'Site Default',
    '#value' => $site_default,
    '#rows'=>30,
    );
    $form['end']=array('#value'=>'</div><div style="clear:both;"></div>');
    $form['controls']['save'] = array('#type' => 'submit','#value' => t('Save Config'));
    return $form;
}
function nutch_admin_conf_submit($form, &$form_state) {
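    // Always write to the configured nutch-site.xml. Taking the path from the
    // submitted form would let a crafted POST overwrite arbitrary files.
    $path = variable_get('nutch_nutch_dir', '/usr/local/nutch') .'/conf/nutch-site.xml';
    $content = $form['#post']['site_config'];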
    
    if (!file_exists($path)) {
        drupal_set_message('Failed to save site configuration file, '.$path.' doesn\'t exist.', 'error');
        return 0;
    } elseif (!is_writable($path)) {
        drupal_set_message('Failed to save site configuration file, '.$path.' isn\'t writable.', 'error');
        return 0;
    } elseif (!file_put_contents($path, $content)) {
        drupal_set_message('Failed to save site configuration file, unknown reason; path was '.$path.'.', 'error');
        return 0;
    } else {
        drupal_set_message('Site configuration has been saved.');
    }
}
/* {log modification 2 - end} */

maxmmize’s picture

Has anyone tried this out yet? It works for me but maybe others have some opinion on functionality.

avpaderno’s picture

Status: Active » Closed (outdated)

I am closing this issue, since Drupal 6 isn't supported anymore.