Integrating Nutch 1.3 with Hadoop 0.20.203.0

2019-03-28 14:20 | Source: Internet

1. Install Hadoop. See http://www.linuxidc.com/Linux/2011-10/45730.htm

2. Download and install Nutch 1.3

svn co http://svn.apache.org/repos/asf/nutch/branches/branch-1.3/ ~/nutch

Alternatively, download a release directly from http://labs.renren.com/apache-mirror//nutch/ ; I used version 1.3.

3. Edit nutch-site.xml under conf/

<configuration>
<property>
<name>http.agent.name</name>
<value>HD nutch agent</value>
<description>HTTP 'User-Agent' request header. MUST NOT be empty -
please set this to a single word uniquely related to your organization.
</description>
</property>
<property>
<name>http.robots.agents</name>
<value>HD nutch agent</value>
<description>The agent strings we'll look for in robots.txt files,
comma-separated, in decreasing order of precedence. You should
put the value of http.agent.name as the first agent name, and keep the
default * at the end of the list. E.g.: BlurflDev,Blurfl,*
</description>
</property>
</configuration>
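As a quick sanity check on the edit above, the following sketch prints the value configured for a property. The helper name get_property is hypothetical, and it assumes the <name> and <value> elements sit on consecutive lines, as they do in the snippet above.

```shell
# Print the <value> that follows a given <name> in a Hadoop-style XML file.
# Hypothetical helper; assumes <name> and <value> are on consecutive lines.
get_property() {
    # $1 = property name, $2 = path to the XML file
    sed -n "/<name>$1<\/name>/{n;s:.*<value>\(.*\)</value>.*:\1:p;}" "$2"
}

# Example call (the path assumes the ~/nutch checkout from step 2):
#   get_property http.agent.name ~/nutch/conf/nutch-site.xml
```

If nothing is printed, the property is missing or formatted differently, which is worth fixing before rebuilding.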

4. Copy all files from Hadoop's conf directory into Nutch's conf directory.
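A minimal sketch of the copy step follows. The function name copy_hadoop_conf is hypothetical, and the directory locations in the example call are assumptions (a HADOOP_HOME variable and the ~/nutch checkout from step 2). Only regular files are copied, and a file with the same name in the Nutch directory is overwritten.

```shell
# Copy every regular file from Hadoop's conf directory into Nutch's conf directory.
# Hypothetical helper; subdirectories are skipped, same-named files are overwritten.
copy_hadoop_conf() {
    # $1 = Hadoop conf directory, $2 = Nutch conf directory
    for f in "$1"/*; do
        if [ -f "$f" ]; then
            cp "$f" "$2"/ || return 1
        fi
    done
}

# Example call (both paths are assumptions):
#   copy_hadoop_conf "$HADOOP_HOME/conf" "$HOME/nutch/conf"
```

Backing up Nutch's conf directory first is a reasonable precaution, since the copy replaces any file that exists in both places.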

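5. Rebuild Nutch with ant. If ant is not installed, install it with apt-get install ant.

Note: without a rebuild, the changes to nutch-site.xml have no effect, and the crawl fails with the error "Nutch Fetcher: No agents listed in 'http.agent.name' property".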
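The rebuild can be wrapped in a small function that also confirms the deploy launcher used in the next step exists afterwards. This is only a sketch: the name rebuild_nutch is hypothetical, and it runs ant with no explicit target, so it relies on whatever default target the checkout's build file defines.

```shell
# Rebuild Nutch from its source directory, then verify the deploy launcher exists.
# Hypothetical helper; runs ant with the build file's default target.
rebuild_nutch() {
    # $1 = Nutch source directory (e.g. the ~/nutch checkout)
    ( cd "$1" && ant ) || return 1
    if [ ! -f "$1/runtime/deploy/bin/nutch" ]; then
        echo "build finished but runtime/deploy/bin/nutch is missing" >&2
        return 1
    fi
}

# Example call:
#   rebuild_nutch "$HOME/nutch"
```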

6. Change into runtime/deploy/bin and start the crawl:

./nutch crawl hdfs://localhost:9000/user/fzuir/urls.txt -dir hdfs://localhost:9000/user/fzuir/crawled -depth 3 -topN 10
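The crawl command reads its seed list from HDFS, so the list has to be there first. The sketch below, run from runtime/deploy/bin, uploads a local seed file with hadoop fs -put and then starts the crawl with the same options as above. The function name crawl_on_hdfs and the local file name are hypothetical, and it assumes the hadoop command is on the PATH.

```shell
# Upload a local seed list to HDFS, then run the crawl with the options used above.
# Hypothetical helper; must be called from the directory containing ./nutch.
crawl_on_hdfs() {
    # $1 = local seed file, $2 = HDFS base URI (e.g. hdfs://localhost:9000/user/fzuir)
    hadoop fs -put "$1" "$2/urls.txt" || return 1
    # -depth 3: run three crawl rounds; -topN 10: fetch at most ten URLs per round
    ./nutch crawl "$2/urls.txt" -dir "$2/crawled" -depth 3 -topN 10
}

# Example call (seeds.txt is a placeholder name):
#   crawl_on_hdfs seeds.txt hdfs://localhost:9000/user/fzuir
```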

At this point the crawl fails with another error:

NullPointerException at org.apache.nutch.fetcher.FetcherOutputFormat.checkOutputSpecs.

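This is caused by a bug in Nutch 1.3. The Nutch website notes that it is fixed in version 1.4, but 1.4 had not been released at the time of writing, so I followed the guidance on the website, modified two Java files myself, and rebuilt.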

