Nutch "Job failed" exception

2019-03-27 01:09 | Source: Internet

By default, Nutch does not process truncated pages (pages whose content was only fetched in part). To have them handled, edit conf/nutch-site.xml and add a parser.skip.truncated property:

<property>
<name>parser.skip.truncated</name>
<value>false</value>
</property>
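To confirm the override is in place before crawling, the value can be read back from the file. The following is a minimal sketch, not a Nutch tool: `property_value` is a hypothetical helper name, and it assumes the simple layout shown above, with each `<name>` and `<value>` element written on a single line.

```shell
#!/bin/sh
# property_value FILE NAME
# Hypothetical helper: prints the <value> that follows <name>NAME</name>
# in a Hadoop-style configuration file such as conf/nutch-site.xml.
# Assumes each <name> and <value> element sits on one line.
property_value() {
    awk -v name="$2" '
        /<name>/ {
            # Extract the text between <name> and </name> and compare it.
            n = $0
            sub(/.*<name>/, "", n)
            sub(/<\/name>.*/, "", n)
            found = (n == name)
        }
        /<value>/ && found {
            # Print the text between <value> and </value>, then stop.
            sub(/.*<value>/, "")
            sub(/<\/value>.*/, "")
            print
            exit
        }
    ' "$1"
}
```

For the property shown above, `property_value conf/nutch-site.xml parser.skip.truncated` should print `false`. If the property is absent, nothing is printed.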

Then run the crawl again with bin/nutch crawl urls -dir crawl, and the pages are fetched.

If the crawl also indexes the fetched pages directly into Solr, the following error may appear:

Exception in thread "main" java.io.IOException: Job failed!
        at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:123)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:81)
        at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:65)
        at org.apache.nutch.crawl.Crawl.run(Crawl.java:155)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
        at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)

Solution:

cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf

Then add the following fields to ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml (fields that schema.xml already contains do not need to be added again)

If the ${APACHE_SOLR_HOME}/example/solr/collection1/data directory exists, delete it, then restart Solr.
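The copy and cleanup steps above can be wrapped in a small script. This is a sketch under stated assumptions: `sync_nutch_schema` is a hypothetical helper name (not part of Nutch or Solr), and the directory layout is the `example/solr/collection1` one used in this post. It does not edit schema.xml and does not restart Solr.

```shell
#!/bin/sh
# sync_nutch_schema NUTCH_HOME SOLR_HOME
# Hypothetical helper: copies Nutch's schema.xml into the Solr core's conf
# directory and removes the core's existing data directory, if any.
sync_nutch_schema() {
    nutch_home="$1"   # e.g. ${NUTCH_RUNTIME_HOME}
    solr_home="$2"    # e.g. ${APACHE_SOLR_HOME}
    core_dir="$solr_home/example/solr/collection1"

    # Stop early if either side of the copy is missing.
    if [ ! -f "$nutch_home/conf/schema.xml" ]; then
        echo "missing: $nutch_home/conf/schema.xml" >&2
        return 1
    fi
    if [ ! -d "$core_dir/conf" ]; then
        echo "missing: $core_dir/conf" >&2
        return 1
    fi

    # Replace Solr's schema with the one shipped with Nutch.
    cp "$nutch_home/conf/schema.xml" "$core_dir/conf/schema.xml" || return 1

    # Delete the old index data so Solr starts from an empty index.
    if [ -d "$core_dir/data" ]; then
        rm -rf "$core_dir/data"
    fi
}
```

A typical call would be `sync_nutch_schema "$NUTCH_RUNTIME_HOME" "$APACHE_SOLR_HOME"`, followed by restarting Solr in whatever way it was originally started. Note that deleting the data directory discards any documents already indexed in that core.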

Reposted from: http://www.cnblogs.com/Jrain/p/3553034

