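With its default settings, Nutch does not process truncated pages (pages delivered in segments). To handle them, edit conf/nutch-site.xml and add a parser.skip.truncated property set to false (the default, true, makes the parser skip truncated content):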
<property>
<name>parser.skip.truncated</name>
<value>false</value>
</property>
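Before crawling again, it can help to confirm that the override is actually in the file. The following is a minimal sketch, not part of Nutch: the function name skip_truncated_disabled is hypothetical, and it assumes the <value> element directly follows the <name> element, as in the snippet above.

```shell
# Sketch: succeed (exit 0) if the given nutch-site.xml sets
# parser.skip.truncated to false. Hypothetical helper, not a Nutch command.
skip_truncated_disabled() (
  # Strip all whitespace so a <property> spread over several lines
  # collapses into one string, then look for the name/value pair.
  tr -d '\n\r\t ' < "$1" |
    grep -qF '<name>parser.skip.truncated</name><value>false</value>'
)

# Example usage:
# skip_truncated_disabled conf/nutch-site.xml && echo "truncated pages will be parsed"
```

This is a plain text match, so it would also match a property that sits inside an XML comment.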
Then crawl again with bin/nutch crawl urls -dir crawl, and these pages will be fetched.
If you crawl and index straight into Solr, the following error may appear:
Exception in thread "main" java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
Solution:
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf
Then add the following fields to ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml (skip any field that schema.xml already defines):
<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
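To see which of these five fields still need to be added, a small check like the one below may be useful. It is only a sketch: missing_fields is a hypothetical helper name, and it assumes each <field> declaration sits on a single line, as in the listing above.

```shell
# Sketch: print the names of the Nutch index fields that are NOT yet
# declared in the given schema.xml. Hypothetical helper, not part of Solr.
missing_fields() (
  for name in host digest segment boost tstamp; do
    # Match <field ... name="NAME" ...>; the whitespace after "<field"
    # keeps <fieldType ...> declarations from matching.
    if ! grep -Eq "<field[[:space:]][^>]*name=\"${name}\"" "$1"; then
      echo "$name"
    fi
  done
)

# Example usage (path taken from the step above):
# missing_fields "${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml"
```

If the function prints nothing, all five fields are already present. A field that appears only inside an XML comment would still count as present, so a quick manual look at the file is worthwhile.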
If the ${APACHE_SOLR_HOME}/example/solr/collection1/data folder exists, delete it, then restart Solr.
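The cleanup step can be sketched as a small function. The name reset_solr_data is hypothetical, and the function assumes its argument is the Solr install directory used in the paths above; it does not restart Solr.

```shell
# Sketch: delete the old index directory so Solr rebuilds it against
# the updated schema. Hypothetical helper, not a Solr command.
reset_solr_data() (
  # ${1:?...} aborts with a message if no directory was passed, so the
  # rm below can never run against a path starting at "/".
  data_dir="${1:?usage: reset_solr_data SOLR_HOME}/example/solr/collection1/data"
  if [ -d "$data_dir" ]; then
    rm -rf "$data_dir"
    echo "removed $data_dir"
  else
    echo "nothing to remove at $data_dir"
  fi
)

# Example usage:
# reset_solr_data "${APACHE_SOLR_HOME}"
```

Stop Solr before running this, then start it again in whatever way you normally do.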