Nutch Study Notes 4: Indexing in Nutch 1.7 (ElasticSearch)

2019-03-27 01:21 | Source: Internet

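The previous post walked through the crawl and parse flow. The most important takeaway was:

During parsing, Nutch looks up the parsers registered for the page's ContentType,

and calls them one by one. As soon as one parser succeeds, its result is returned; otherwise the next parser is tried.

Before returning, of course, the result still passes through every registered HtmlParseFilter, at least in the case of HtmlParser.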
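The fallback behaviour described above can be sketched roughly as follows. This is only an illustration: `SimpleParser`, `parseWithFallback` and the string-based results are hypothetical stand-ins, not Nutch's real Parser or HtmlParseFilter types.

```java
import java.util.List;
import java.util.Optional;
import java.util.function.UnaryOperator;

/** Hypothetical stand-ins for illustration; these are not Nutch's real types. */
final class ParserChainSketch {

    /** A parser returns a result, or empty when it cannot handle the content. */
    interface SimpleParser {
        Optional<String> parse(String content);
    }

    /** Tries each parser in order; the first success is run through every filter and returned. */
    static Optional<String> parseWithFallback(String content,
                                              List<SimpleParser> parsers,
                                              List<UnaryOperator<String>> filters) {
        for (SimpleParser parser : parsers) {
            Optional<String> result = parser.parse(content);
            if (result.isPresent()) {
                String filtered = result.get();
                for (UnaryOperator<String> filter : filters) {
                    filtered = filter.apply(filtered);
                }
                return Optional.of(filtered);
            }
            // This parser could not handle the content, so try the next one.
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        SimpleParser htmlOnly = c -> c.startsWith("<html")
                ? Optional.of("html:" + c.length()) : Optional.empty();
        SimpleParser anyText = c -> Optional.of("text:" + c.length());
        UnaryOperator<String> upperCase = String::toUpperCase;

        // htmlOnly declines plain text, so anyText produces the result.
        System.out.println(parseWithFallback("plain words",
                List.of(htmlOnly, anyText), List.of(upperCase)));
    }
}
```

In this sketch the filters only run on the result of the parser that succeeded, which mirrors the "filter before returning" step mentioned above.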

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Now let's look at the indexing process.

When we type

./bin/nutch index

what happens?

Look at the nutch script in version 1.7:

elif [ "$COMMAND" = "index" ] ; then
  CLASS=org.apache.nutch.indexer.IndexingJob

So let's look at the IndexingJob class.

Start with the method

public int run(String[] args) throws Exception {

This method first parses the arguments.

It then calls

try {
    index(crawlDb, linkDb, segments, noCommit, deleteGone, params,
            filter, normalize);
    return 0;
} catch (final Exception e) {
    LOG.error("Indexer: " + StringUtils.stringifyException(e));
    return -1;
}

So we move on to

public void index(Path crawlDb, Path linkDb, List<Path> segments,
            boolean noCommit, boolean deleteGone, String params,
            boolean filter, boolean normalize) throws IOException {

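Look at line 100 of IndexingJob.java:

IndexWriters writers = new IndexWriters(getConf());
LOG.info(writers.describe());

If you look at the IndexWriters constructor, you will find that it too obtains all available IndexWriter implementations through the plugin mechanism.

PS: Nutch only prescribes a sequence of stages; each stage can be hooked into through a plugin.

This makes it very convenient to inject our own logic later, for example to meet custom requirements.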

~~~~~~~~~~~~~

If you run the following at this point

./bin/nutch index ./data/crawldb/ -linkdb ./data/linkdb/ -dir ./data/segments/ -deleteGone -filter -normalize

the system reports an error:

Indexer: starting at 2014-06-26 14:49:35
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Indexer: java.lang.RuntimeException: Missing SOLR URL. Should be set via -D solr.server.url
SOLRIndexWriter
	solr.server.url : URL of the SOLR instance (mandatory)
	solr.commit.size : buffer size when sending to SOLR (default 1000)
	solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
	solr.auth : use authentication (default false)
	solr.auth.username : use authentication (default false)
	solr.auth : username for authentication
	solr.auth.password : password for authentication

	at org.apache.nutch.indexwriter.solr.SolrIndexWriter.setConf(SolrIndexWriter.java:192)
	at org.apache.nutch.plugin.Extension.getExtensionInstance(Extension.java:160)
	at org.apache.nutch.indexer.IndexWriters.<init>(IndexWriters.java:57)
	at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:100)
	at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:185)
	at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
	at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:195)

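Why is that? If you understand Nutch, you can track down the cause yourself.

That is the nice thing about learning open source: read the code and you know where the problem is.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

The failure lands in Solr because the indexer-solr plugin is enabled in the configuration file.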
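The stack trace shows the exception coming out of `setConf` while `IndexWriters` is being constructed, that is, before any document is indexed. A minimal sketch of that fail-fast pattern follows. All class and method names here are hypothetical, and the example URL `http://localhost:8983/solr` is an assumption; only the property name and the error message are taken from the log above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of fail-fast writer loading; this is not Nutch's code. */
final class WriterLoadingSketch {

    interface SketchWriter {
        /** Validates configuration; throws if a mandatory setting is missing. */
        void setConf(Map<String, String> conf);
    }

    static final class SolrLikeWriter implements SketchWriter {
        private String serverUrl;

        @Override
        public void setConf(Map<String, String> conf) {
            serverUrl = conf.get("solr.server.url");
            if (serverUrl == null) {
                throw new RuntimeException(
                        "Missing SOLR URL. Should be set via -D solr.server.url");
            }
        }
    }

    /** Instantiates and configures every enabled writer, as a plugin loader might. */
    static List<SketchWriter> loadWriters(List<String> enabledPlugins,
                                          Map<String, String> conf) {
        List<SketchWriter> writers = new ArrayList<>();
        for (String plugin : enabledPlugins) {
            if (plugin.equals("indexer-solr")) {
                SketchWriter writer = new SolrLikeWriter();
                // Throws here if the URL is missing, before any document is indexed.
                writer.setConf(conf);
                writers.add(writer);
            }
        }
        return writers;
    }

    public static void main(String[] args) {
        try {
            loadWriters(List.of("indexer-solr"), Map.of());
        } catch (RuntimeException e) {
            System.out.println("Indexer: " + e.getMessage());
        }
        // Assumed example URL; with the setting present, loading succeeds.
        List<SketchWriter> ok = loadWriters(List.of("indexer-solr"),
                Map.of("solr.server.url", "http://localhost:8983/solr"));
        System.out.println("writers loaded: " + ok.size());
    }
}
```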

Before any change, the plugin setting in nutch-default.xml is

<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  <description>Regular expression naming plugin directory names to
  include.  Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins. In order to use HTTPS please enable
  protocol-httpclient, but be aware of possible intermittent problems with the
  underlying commons-httpclient library.
  </description>
</property>

As you can see, indexer-solr is configured right here.

Copy this property into nutch-site.xml, then remove indexer-solr and put indexer-elastic in its place.
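Since the description says plugin.includes is a regular expression over plugin directory names, the effect of that edit can be checked with plain `java.util.regex`. The sketch below assumes the whole directory name must match the expression; `isIncluded` is a hypothetical helper, not a Nutch API.

```java
import java.util.regex.Pattern;

/** Hypothetical helper showing how the plugin.includes expression selects plugins. */
final class PluginIncludesSketch {

    // The default value quoted from nutch-default.xml above.
    static final String DEFAULT_INCLUDES =
            "protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
            + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** True when the whole plugin directory name matches the expression (an assumption). */
    static boolean isIncluded(String includes, String pluginName) {
        return Pattern.compile(includes).matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String edited = DEFAULT_INCLUDES.replace("indexer-solr", "indexer-elastic");

        System.out.println(isIncluded(DEFAULT_INCLUDES, "indexer-solr"));    // true
        System.out.println(isIncluded(DEFAULT_INCLUDES, "indexer-elastic")); // false
        System.out.println(isIncluded(edited, "indexer-elastic"));           // true
        System.out.println(isIncluded(edited, "indexer-solr"));              // false
    }
}
```

Note that `index-(basic|anchor)` names indexing filters and does not match `indexer-solr`, so the two groups are selected independently.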

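Run the command again:

./bin/nutch index ./data/crawldb/ -linkdb ./data/linkdb/ -dir ./data/segments/ -deleteGone -filter -normalize

The system reports another error, this time about missing configuration items. Resolving it is left to you.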

~~~~~~~~~~~~~~~~~~~~~~~~~~

The official descriptions of IndexWriter and IndexingFilter:

IndexWriter -- Writes crawled data to a specific indexing backend (Solr, ElasticSearch, a CSV file, etc.).
IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).

~~~~~~~~~~~~~~~~~~~~~~~~~~

How IndexWriter and IndexingFilter are connected

Line 272 of IndexerMapReduce.java:

// run indexing filters
      doc = this.filters.filter(doc, parse, key, fetchDatum, inlinks);

This simply calls each IndexingFilter in turn; as long as the result is not null, the next filter is called.

The final filtered result then goes through

public void open(JobConf job, String name) throws IOException {
		for (int i = 0; i < this.indexWriters.length; i++) {
			try {
				this.indexWriters[i].open(job, name);
			} catch (IOException ioe) {
				throw ioe;
			}
		}
	}

	public void write(NutchDocument doc) throws IOException {
		for (int i = 0; i < this.indexWriters.length; i++) {
			try {
				this.indexWriters[i].write(doc);
			} catch (IOException ioe) {
				throw ioe;
			}
		}
	}

which likewise call every IndexWriter, each one writing to its own indexing backend such as Solr or ElasticSearch.
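Put together, the two steps look roughly like the sketch below: a filter chain in which a null result drops the document, followed by a fan-out to every writer. The names are hypothetical and documents are plain strings rather than NutchDocument, so treat it as an illustration of the control flow only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;
import java.util.function.UnaryOperator;

/** Hypothetical sketch: documents are plain strings, not NutchDocument. */
final class IndexPipelineSketch {

    /** Runs filters in order; a null result drops the document and stops the chain. */
    static String applyFilters(String doc, List<UnaryOperator<String>> filters) {
        for (UnaryOperator<String> filter : filters) {
            doc = filter.apply(doc);
            if (doc == null) {
                return null;
            }
        }
        return doc;
    }

    /** Filters the document, then hands the survivor to every writer. */
    static boolean index(String doc,
                         List<UnaryOperator<String>> filters,
                         List<Consumer<String>> writers) {
        String filtered = applyFilters(doc, filters);
        if (filtered == null) {
            return false; // rejected by a filter, nothing is written
        }
        for (Consumer<String> writer : writers) {
            writer.accept(filtered);
        }
        return true;
    }

    public static void main(String[] args) {
        List<String> solrLike = new ArrayList<>();
        List<String> elasticLike = new ArrayList<>();
        UnaryOperator<String> dropEmpty = d -> d.isEmpty() ? null : d;
        UnaryOperator<String> addField = d -> d + " [extra-field]";
        Consumer<String> toSolr = solrLike::add;
        Consumer<String> toElastic = elasticLike::add;

        index("page-1", List.of(dropEmpty, addField), List.of(toSolr, toElastic));
        index("", List.of(dropEmpty, addField), List.of(toSolr, toElastic));

        // Both backends received the same single document; the empty one was dropped.
        System.out.println(solrLike);
        System.out.println(elasticLike);
    }
}
```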

With that, we have analyzed the whole parsing and indexing process and the principles behind it.

Reposted from: http://my.oschina.net/qiangzigege/blog/284443
