One URL will crawl, and the other won't. [#969108]

I'm having weird issues.

If I follow the process of rm -fd -R'ing crawl/* and then index with http://example1.com as the seed, the crawl runs and everything is good. If I repeat (deleting crawl/*) and try with http://example2.com, I get the error below.

Does anybody know what the error means, or what the difference between the two URLs could be?

2010-11-11 19:23:23,349 INFO  crawl.LinkDb - LinkDb: URL normalize: true
2010-11-11 19:23:23,349 INFO  crawl.LinkDb - LinkDb: URL filter: true
2010-11-11 19:23:23,361 INFO  crawl.LinkDb - LinkDb: adding segment: /home/robert/lib/nutch_1_2/crawl/segments/20101111192321
2010-11-11 19:23:23,699 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111192321/parse_data
	at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
	at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
	at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
	at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
	at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
	at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
	at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
	at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
	at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)

Comments

Comment #1

robertdouglass commented 11 November 2010 at 19:12

More info from the log:

2010-11-11 20:11:04,828 ERROR solr.SolrIndexer - org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111201101/crawl_parse
Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111201101/parse_data
Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111201101/parse_text
Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/linkdb/current

I think the site I'm crawling does some redirecting for SSO purposes, and that might be killing the crawl.
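One way to test the redirect theory is to look at what the seed URL returns before Nutch gets involved. This is only a sketch: it assumes curl is installed, and show_redirect is a made-up helper name, not part of Nutch or the runbot script.

```shell
# Hypothetical helper (not part of Nutch): print the HTTP status code and the
# redirect target, if any, that the server returns for a URL.
show_redirect() {
    # -s: silent, -I: send a HEAD request, -o /dev/null: discard the headers,
    # -w: print only the status code and the Location target curl would follow.
    curl -s -o /dev/null -I -w '%{http_code} %{redirect_url}\n' "$1"
}

# Example (left commented out so the snippet makes no network request on its own):
# show_redirect "http://example2.com"
```

A 3xx status with a login or SSO host in the second column would support the idea that the crawler never reaches a parseable page.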

Comment #2

dstuart commented 11 November 2010 at 19:26

Hey Robert,

You get this error if your crawl didn't find any pages. Currently the runbot script isn't intelligent enough to deal with a zero result; it just carries on, assuming that there is data.
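A guard along the following lines could make a script stop when a segment is empty. Treat it as a rough sketch rather than the actual runbot code: segment_has_parse_output and the SEGMENT variable are hypothetical names, and the three directory names are taken from the "Input path does not exist" messages above.

```shell
# Hypothetical helper: succeed only if the segment contains the parse output
# directories that the LinkDb and SolrIndexer errors above complain about.
segment_has_parse_output() {
    segment_dir=$1
    for sub in crawl_parse parse_data parse_text; do
        if [ ! -d "$segment_dir/$sub" ]; then
            return 1    # at least one expected directory is missing
        fi
    done
    return 0
}

# Possible use in a runbot-style script (SEGMENT is an assumed variable name):
# if segment_has_parse_output "$SEGMENT"; then
#     bin/nutch invertlinks crawl/linkdb "$SEGMENT"
# else
#     echo "No parsed pages in $SEGMENT; skipping link inversion and indexing." >&2
# fi
```

With a check like this, a crawl that fetched nothing would end with a clear message instead of the InvalidInputException stack trace.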

The next question is the root cause: have you altered your regex-urlfilter.txt to allow the new site?

Regards,

Dave