Installing Nutch and integrating it with Solr

2019-03-27 01:17 | Source: Internet

Install ant and svn on Linux.
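Before building, it can help to confirm that the required tools are on the PATH. The following is a minimal sketch, not part of the original article; check_tools is a hypothetical helper name, and java is added to the list on the assumption that both ant and Solr's start.jar need a JVM.

```shell
# check_tools: hypothetical helper that reports which of the given
# commands are missing from PATH. Returns 0 if all are present, 1 otherwise.
check_tools() {
  missing=0
  for tool in "$@"; do
    if ! command -v "$tool" >/dev/null 2>&1; then
      echo "missing: $tool" >&2
      missing=1
    fi
  done
  return "$missing"
}

# ant and svn come from the article; java is an assumption (see above).
check_tools ant svn java || echo "install the missing tools before building"
```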

Check out the Nutch 1.8 source code with svn:

svn co http://svn.apache.org/repos/asf/nutch/tags/release-1.8/

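Change into the checkout with cd ./release-1.8 and run ant. The build downloads the jar files that Nutch depends on.

Nutch manages its dependencies through Ivy.

After the ant build finishes, two directories are generated: build and runtime. runtime covers the two ways of running Nutch, deploy and local.

The runtime directory contains local and deploy. Under local there are bin, conf, lib, plugins, test and urls, and the bin directory holds the nutch and crawl commands.

Both commands are shell scripts, so you can open them with vi to see exactly what they do.

The deploy directory contains apache-nutch-1.8.job and bin. When a command is run from deploy, apache-nutch-1.8.job is submitted to the Hadoop JobTracker and executed as MapReduce jobs.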

Run the crawl command in /usr/local/release-1.8/runtime/local/bin/:

./crawl
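Missing seedDir : crawl <seedDir> <crawlDir> <solrURL> <numberOfRounds>

Run without arguments, the script prints the usage shown above. seedDir is the directory holding the seed URLs to crawl, crawlDir is the directory where the crawl data is stored, solrURL is the URL of the Solr instance, and numberOfRounds plays the role that depth had in earlier versions.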
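To make the argument order concrete, here is a small dry-run sketch. build_crawl_cmd is a hypothetical helper, not part of Nutch: it only checks that four arguments were given and that the last one is a number, then prints the command line without running it.

```shell
# build_crawl_cmd: hypothetical dry-run helper. It validates the four
# positional arguments of the crawl script and prints the resulting
# command line instead of executing it.
build_crawl_cmd() {
  if [ "$#" -ne 4 ]; then
    echo "usage: build_crawl_cmd <seedDir> <crawlDir> <solrURL> <numberOfRounds>" >&2
    return 1
  fi
  case "$4" in
    ''|*[!0-9]*)
      echo "numberOfRounds must be a non-negative integer" >&2
      return 1
      ;;
  esac
  printf './crawl %s %s %s %s\n' "$1" "$2" "$3" "$4"
}

# Same arguments the article uses later on.
build_crawl_cmd ../urls ../data http://master:8983/solr/ 2
```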

In the local directory, create the seed directory urls (with mkdir) and, inside it, a file url.txt containing http://www.163.com
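The seed setup can be sketched as follows. NUTCH_LOCAL is an assumed variable introduced here for illustration; it should point at runtime/local of your build and defaults to the current directory. Note that mkdir creates the directory, while the seed file itself is written with a redirect.

```shell
# NUTCH_LOCAL is an assumption: the runtime/local directory of the build.
# It defaults to the current directory.
NUTCH_LOCAL="${NUTCH_LOCAL:-$PWD}"

# Create the seed directory, then write one seed URL per line into url.txt.
mkdir -p "$NUTCH_LOCAL/urls"
printf '%s\n' 'http://www.163.com' > "$NUTCH_LOCAL/urls/url.txt"

# Show what the crawl script will read.
cat "$NUTCH_LOCAL/urls/url.txt"
```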

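Running the Nutch 1.8 crawl also requires a Solr installation; Solr 4.8.1 is used here.

Download solr-4.8.1 from http://mirror.bit.edu.cn/apache/lucene/solr/4.8.1/, unzip it into a directory of your choice, and set SOLR_HOME to that directory.

Then cd ${SOLR_HOME}/example and run java -jar start.jar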

5. Verify the Solr installation

After installing solr-4.8.1, open the following URL to verify that the installation succeeded:

http://localhost:8983/solr/#/
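The same check can be done from the command line. This is a sketch under assumptions: solr_is_up is a hypothetical helper, curl is assumed to be installed, and the check uses the CoreAdmin status endpoint (admin/cores?action=STATUS) instead of the browser UI.

```shell
# solr_is_up: hypothetical helper. Succeeds if the Solr CoreAdmin status
# endpoint under the given base URL answers with an HTTP success code.
solr_is_up() {
  base_url="${1%/}"   # strip one trailing slash, if any
  curl -s -f -o /dev/null --max-time 5 "$base_url/admin/cores?action=STATUS"
}

if solr_is_up "http://localhost:8983/solr/"; then
  echo "Solr is reachable"
else
  echo "Solr is not reachable; is java -jar start.jar still running?"
fi
```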

With Solr running, we can crawl the seed URLs in the given directory. In /usr/local/release-1.8/runtime/local/bin/, run:

./crawl ../urls ../data http://master:8983/solr/ 2

6. Integrate Solr and Nutch

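Both Nutch and Solr are now installed and set up correctly, and Nutch has already created crawl data from the seed URL(s). The steps below delegate searching to Solr so that the crawled links become searchable. Here ${APACHE_SOLR_HOME} is the Solr directory (SOLR_HOME above) and ${NUTCH_RUNTIME_HOME} is the Nutch runtime directory (runtime/local in this build):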

mv ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.org
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml ${APACHE_SOLR_HOME}/example/solr/collection1/conf/
vi ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml
Add exactly this at line 351: <field name="_version_" type="long" indexed="true" stored="true"/>
Restart Solr with the command "java -jar start.jar" under ${APACHE_SOLR_HOME}/example
Run the Solr index command:

bin/nutch solrindex http://127.0.0.1:8983/solr/ ../data/crawldb -linkdb ../data/linkdb ../data/segments/*

The call signature for running the solrindex has changed. The linkdb is now optional, so you need to denote it with a "-linkdb" flag on the command line.

This will send all crawl data to Solr for indexing. For more information please see bin/nutch solrindex
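The manual schema edit in the steps above can also be scripted. The sketch below swaps in a different technique from the article: instead of editing at line 351, it inserts the _version_ field just before the closing </fields> tag, on the assumption that the copied schema.xml has one. add_version_field is a hypothetical helper, and the demo runs against a tiny stand-in file, not a real Nutch schema.

```shell
# add_version_field: hypothetical helper. Inserts the _version_ field
# before the first </fields> line unless the field is already defined.
# A backup of the original file is kept as <file>.bak.
add_version_field() {
  schema="$1"
  if grep -q 'name="_version_"' "$schema"; then
    return 0
  fi
  cp "$schema" "$schema.bak"
  awk '/<\/fields>/ && !inserted {
         print "    <field name=\"_version_\" type=\"long\" indexed=\"true\" stored=\"true\"/>"
         inserted = 1
       }
       { print }' "$schema.bak" > "$schema"
}

# Demo on a minimal stand-in file (not the real schema.xml).
demo_schema="$(mktemp)"
printf '%s\n' \
  '<schema>' \
  '  <fields>' \
  '    <field name="id" type="string" stored="true" indexed="true"/>' \
  '  </fields>' \
  '</schema>' > "$demo_schema"

add_version_field "$demo_schema"
cat "$demo_schema"
```

For the real file, the call would be add_version_field ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml, followed by the Solr restart described above.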

If all has gone to plan, we are now ready to search with http://localhost:8983/solr/admin/. If you want to see the raw HTML indexed by Solr, change the content field definition in schema.xml to:

<field name="content" type="text" stored="true" indexed="true"/>

Reposted from: http://blog.csdn.net/tbdp6411/article/details/26700347
