I'm having weird issues.

If I follow the process of rm -fd -R'ing crawl/*, and then index with http://example1.com as the seed, the crawl runs and everything is good. If I repeat (deleting crawl/*) and try with http://example2.com, I get the error below.

Anybody know what the error means, or what the difference between the URLs could possibly be?

2010-11-11 19:23:23,349 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2010-11-11 19:23:23,349 INFO  crawl.LinkDb - LinkDb: URL filter: true
2010-11-11 19:23:23,361 INFO  crawl.LinkDb - LinkDb: adding segment: /home/robert/lib/nutch_1_2/crawl/segments/20101111192321
2010-11-11 19:23:23,699 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111192321/parse_data
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
	at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

Comments

robertdouglass

More info from the log:

2010-11-11 20:11:04,828 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111201101/crawl_parse
Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111201101/parse_data
Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111201101/parse_text
Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/linkdb/current

I think that the site I'm crawling does some redirecting for SSO purposes that might be killing the crawling.

dstuart

Hey Robert,

You get this error if your crawl didn't find any pages. Currently the runbot script isn't intelligent enough to deal with a zero result; it just carries on, assuming that there is data.
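As a rough sketch of the kind of guard that could avoid this, a script might check that a segment actually contains the parse output named in the errors above before running the link inversion and indexing steps. The function name segment_has_parse_output is hypothetical, and the commented usage assumes the standard bin/nutch invertlinks command and the crawl/ layout shown in the log; it is not taken from the runbot script itself.

```shell
#!/bin/sh
# Hypothetical helper (not part of the runbot script): return success only
# if the segment contains the parse output directories that the LinkDb and
# SolrIndexer jobs in the log above complain about.
segment_has_parse_output() {
  segment="$1"
  for sub in crawl_parse parse_data parse_text; do
    if [ ! -d "$segment/$sub" ]; then
      # Report which directory is missing on stderr and signal failure.
      echo "Segment $segment has no $sub; nothing was fetched or parsed." >&2
      return 1
    fi
  done
  return 0
}

# Usage sketch (assumed paths, adjust to your install):
# segment=crawl/segments/20101111192321
# if segment_has_parse_output "$segment"; then
#   bin/nutch invertlinks crawl/linkdb "$segment"
# else
#   echo "Crawl produced no pages; check the seed URL and URL filters." >&2
# fi
```

With a check like this the script would stop at the empty segment and say so, rather than failing later with InvalidInputException.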

The next question is the root cause: have you altered your regex-urlfilter.txt to support the new site?

Regards,

Dave

dstuart

Hey Robert,

Any update on this one? Did it fix the problem?

Regards,

Dave

avpaderno

Status: Active » Closed (outdated)

I am closing this issue, since Drupal 6 isn't supported anymore.