I'm having weird issues.
If I follow the process of rm -fd -R'ing crawl/*, and then index with http://example1.com as the seed, the crawl runs and everything is good. If I repeat (deleting crawl/*) and try with http://example2.com, I get the error below.
Anybody know what the error means, or what the difference between the URLs could possibly be?
2010-11-11 19:23:23,349 INFO crawl.LinkDb - LinkDb: URL normalize: true
2010-11-11 19:23:23,349 INFO crawl.LinkDb - LinkDb: URL filter: true
2010-11-11 19:23:23,361 INFO crawl.LinkDb - LinkDb: adding segment: /home/robert/lib/nutch_1_2/crawl/segments/20101111192321
2010-11-11 19:23:23,699 ERROR crawl.LinkDb - LinkDb: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/robert/lib/nutch_1_2/crawl/segments/20101111192321/parse_data
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:190)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:44)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:201)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:810)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:781)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:730)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1249)
at org.apache.nutch.crawl.LinkDb.invert(LinkDb.java:175)
at org.apache.nutch.crawl.LinkDb.run(LinkDb.java:292)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.LinkDb.main(LinkDb.java:255)
Comments
Comment #1
robertdouglass commented
More info from the log:
I think that the site I'm crawling does some redirecting for SSO purposes that might be killing the crawling.
Comment #2
dstuart commented
Hey Robert,
You get this error if your crawl didn't find any pages. Currently the runbot script isn't intelligent enough to deal with a zero result; it just carries on, assuming that there is data.
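As a rough illustration of the missing check, the sketch below looks for a parse_data directory in each segment before the link inversion step would run. It is not part of the runbot script; the function names are made up for this example, and the only directory it checks is the one named in the error above.

```python
from pathlib import Path

# Subdirectory the LinkDb step failed to find in the error above.
REQUIRED = ("parse_data",)


def usable_segments(segments_dir):
    """Return the segment directories that contain every required subdirectory."""
    root = Path(segments_dir)
    if not root.is_dir():
        return []
    return sorted(
        seg
        for seg in root.iterdir()
        if seg.is_dir() and all((seg / name).is_dir() for name in REQUIRED)
    )


def should_invert_links(segments_dir):
    """True only when at least one segment actually has parsed data."""
    return bool(usable_segments(segments_dir))
```

A wrapper script could call should_invert_links("crawl/segments") and stop with a "no pages fetched" message instead of handing an empty segment to LinkDb.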
The next question is the root cause: have you altered your regex-urlfilter.txt to support the new site?
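One way to sanity-check the filter outside of Nutch is to replay the rules against the seed URL. The sketch below assumes the usual regex-urlfilter.txt conventions (each rule is a + or - followed by a regex, the first matching rule decides, and a URL that matches nothing is rejected). The sample rules are hypothetical, and Python's re module is close to, but not identical with, the Java regex engine Nutch uses, so treat the result as a hint only.

```python
import re

# Hypothetical rules for illustration; substitute the contents of your own file.
SAMPLE_RULES = r"""
# skip file:, ftp:, and mailto: URLs
-^(file|ftp|mailto):
# accept only hosts under example1.com
+^http://([a-z0-9]*\.)*example1\.com/
"""


def load_rules(text):
    """Parse +/- prefixed regex lines, skipping blanks and # comments."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        sign, pattern = line[0], line[1:]
        if sign not in "+-":
            raise ValueError(f"rule must start with + or -: {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """First matching rule wins; a URL matching no rule is rejected."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False
```

With rules like the sample, a seed on a different host is rejected by every rule, which would leave the crawl with nothing to fetch.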
Regards,
Dave
Comment #3
dstuart commented
Hey Robert,
Any update on this one? Did it fix the problem?
Regards,
Dave
Comment #4
avpaderno
I am closing this issue, since Drupal 6 isn't supported anymore.