Nutch 1-build

2019-03-27 01:11 | Source: Internet

1. Install software

Cygwin, JDK, Ant, Nutch

2. Configure

Set the environment variables

JAVA_HOME = C:\PROGRA~1\Java\jdk1.7.0_45

ANT_HOME = C:\PROGRA~1\Ant\apache-ant-1.9.3

PATH = ...
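
Before building, it can help to confirm that the variables above are visible inside the Cygwin shell. The following is a minimal sketch, not part of the original post; the helper name `check_var` is hypothetical, and it only reports whether a variable is set, without assuming what `PATH` should contain.

```shell
#!/bin/sh
# check_var NAME: report whether the environment variable NAME is set and non-empty.
# (check_var is an illustrative helper, not part of Nutch or Ant.)
check_var() {
  eval "value=\${$1:-}"
  if [ -n "$value" ]; then
    printf '%s = %s\n' "$1" "$value"
  else
    printf '%s is not set\n' "$1"
    return 1
  fi
}

# Check the two variables configured above; keep going even if one is missing.
for name in JAVA_HOME ANT_HOME; do
  check_var "$name" || true
done
```

If either variable is reported as not set, the Windows environment settings have probably not reached the Cygwin session yet; opening a new Cygwin terminal is usually worth trying first.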

Copy the source files

Copy the apache-nutch-2.2.1-src folder into your Cygwin home directory.

Build

Change into the apache-nutch-2.2.1-src directory under your home directory, then run the build:

ant

Downloading the dependencies takes about half an hour.
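
Once `ant` finishes, the build output can be checked before moving on. This sketch assumes the build produces a `runtime/local` directory containing `bin/nutch` under the source folder, which is the layout shown in the listing in step 3; the function name `build_ready` and the default path are illustrative assumptions.

```shell
#!/bin/sh
# build_ready DIR: succeed if DIR looks like a built Nutch source tree,
# i.e. DIR/runtime/local/bin/nutch exists (layout taken from the listing in step 3).
build_ready() {
  [ -f "$1/runtime/local/bin/nutch" ]
}

# Assumed location: the source folder copied into the Cygwin home directory.
# Pass a different directory as the first argument if yours differs.
src_dir="${1:-$HOME/apache-nutch-2.2.1-src}"

if build_ready "$src_dir"; then
  echo "build output found in $src_dir/runtime/local"
else
  echo "no build output in $src_dir yet; run ant there first"
fi
```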

3. Test

Stan@Stan-PC ~/nutch/runtime/local
$ ls
bin  conf  lib  plugins  test

Stan@Stan-PC ~/nutch/runtime/local
$ bin/nutch
Usage: nutch COMMAND
where COMMAND is one of:
 inject         inject new urls into the database
 hostinject     creates or updates an existing host table from a text file
 generate       generate new batches to fetch from crawl db
 fetch          fetch URLs marked during generate
 parse          parse URLs marked during fetch
 updatedb       update web table after parsing
 updatehostdb   update host table after parsing
 readdb         read/dump records from page database
 readhostdb     display entries from the hostDB
 elasticindex   run the elasticsearch indexer
 solrindex      run the solr indexer on parsed batches
 solrdedup      remove duplicates from solr
 parsechecker   check the parser for a given url
 indexchecker   check the indexing filters for a given url
 plugin         load a plugin and run one of its classes main()
 nutchserver    run a (local) Nutch server on a user defined port
 junit          runs the given JUnit test
 or
 CLASSNAME      run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

Stan@Stan-PC ~/nutch/runtime/local
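
The usage listing can also be checked from a script, for example to confirm that the Solr indexer command is available. This is a minimal sketch, assuming it is run from `runtime/local` and that `bin/nutch` prints its usage text when called without arguments, as in the session above; `has_command` is a hypothetical helper, not part of Nutch.

```shell
#!/bin/sh
# has_command USAGE NAME: succeed if the usage text lists NAME as a command,
# i.e. some line has NAME as its first whitespace-separated word.
has_command() {
  printf '%s\n' "$1" | awk -v name="$2" '$1 == name { found = 1 } END { exit !found }'
}

# Capture the usage text only if bin/nutch is present. stderr is included
# because a usage message may be written to either stream, and a non-zero
# exit status from the bare command is tolerated.
if [ -x bin/nutch ]; then
  usage=$(bin/nutch 2>&1) || true
  for cmd in inject generate fetch parse updatedb solrindex; do
    if has_command "$usage" "$cmd"; then
      echo "$cmd: available"
    else
      echo "$cmd: missing"
    fi
  done
fi
```

Matching on the first word of each line keeps a command name that merely appears inside another command's description from counting as a hit.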

To be continued...

Reposted from: http://www.cnblogs.com/harrysun/p/3516783

