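Looking back at how Nutch 1.2 worked, its built-in indexing actually indexed the entire web page. In version 1.3, Nutch strips the unneeded HTML tag information before it calls the Solr service (the internal mechanism is not discussed here), so Solr receives only the "body text" of the page, which is the content of the Content tag seen in the picture above. The core of our work is to hand the cached copy of the whole page to Solr as well, and to have it returned as part of the result when Solr is queried.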

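First, download the Nutch 1.3 development environment from http://www.apache.org/dist//nutch/. Building the project is tedious, so you can also download the project I have already built: http://download.csdn.net/detail/Nightbreeze/3667744. JDK 1.6 is required.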

In the project, find the indexSolr method of the SolrIndexer class:

public void indexSolr(String solrUrl, Path crawlDb, Path linkDb,
    List<Path> segments) throws IOException {
  SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");
  long start = System.currentTimeMillis();
  LOG.info("SolrIndexer: starting at " + sdf.format(start));

  // configure the MapReduce job that feeds documents to Solr
  final JobConf job = new NutchJob(getConf());
  job.setJobName("index-solr " + solrUrl);
  IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job);
  job.set(SolrConstants.SERVER_URL, solrUrl);
  NutchIndexWriterFactory.addClassToConf(job, SolrWriter.class);
  job.setReduceSpeculativeExecution(false);

  // temporary output directory, removed in the finally block
  final Path tmp = new Path("tmp_" + System.currentTimeMillis() + "-" +
      new Random().nextInt());
  FileOutputFormat.setOutputPath(job, tmp);

  try {
    JobClient.runJob(job);
    // do the commits once and for all the reducers in one go
    SolrServer solr = new CommonsHttpSolrServer(solrUrl);
    solr.commit();
    long end = System.currentTimeMillis();
    LOG.info("SolrIndexer: finished at " + sdf.format(end) + ", elapsed: " + TimingUtil.elapsedTime(start, end));
  }
  catch (Exception e){
    LOG.error(e);
  } finally {
    FileSystem.get(job).delete(tmp, true);
  }
}
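The control flow of indexSolr is easier to see with the Nutch and Hadoop types stripped away: run the job, commit to Solr once after all reducers have finished, and delete the temporary output directory whether or not anything failed. The sketch below models only that ordering. IndexFlowSketch and its three Runnable hooks are hypothetical stand-ins introduced for illustration, not Nutch classes.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model of the run / commit / cleanup ordering in indexSolr.
class IndexFlowSketch {

    // Runs the three steps and returns the order in which they completed.
    static List<String> run(Runnable job, Runnable commit, Runnable cleanup) {
        List<String> log = new ArrayList<String>();
        try {
            job.run();            // stands in for JobClient.runJob(job)
            log.add("job");
            commit.run();         // stands in for solr.commit()
            log.add("commit");
        } catch (RuntimeException e) {
            // like the original, the failure is recorded rather than rethrown
            log.add("error: " + e.getMessage());
        } finally {
            cleanup.run();        // stands in for deleting the tmp directory
            log.add("cleanup");
        }
        return log;
    }

    public static void main(String[] args) {
        Runnable noop = new Runnable() {
            public void run() { }
        };
        System.out.println(run(noop, noop, noop));
    }
}
```

Because the cleanup sits in the finally block, a failed job skips the commit but still removes the temporary directory.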

Nutch uses Hadoop's distributed computing mechanism here. Let's step into the IndexerMapReduce.initMRJob(crawlDb, linkDb, segments, job) method:

public static void initMRJob(Path crawlDb, Path linkDb,
    Collection<Path> segments,
    JobConf job) {
  LOG.info("IndexerMapReduce: crawldb: " + crawlDb);
  LOG.info("IndexerMapReduce: linkdb: " + linkDb);

  for (final Path segment : segments) {
    LOG.info("IndexerMapReduces: adding segment: " + segment);
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.FETCH_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, CrawlDatum.PARSE_DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseData.DIR_NAME));
    FileInputFormat.addInputPath(job, new Path(segment, ParseText.DIR_NAME));
  }

  FileInputFormat.addInputPath(job, new Path(crawlDb, CrawlDb.CURRENT_NAME));
  FileInputFormat.addInputPath(job, new Path(linkDb, LinkDb.CURRENT_NAME));
  job.setInputFormat(SequenceFileInputFormat.class);

  job.setMapperClass(IndexerMapReduce.class);
  job.setReducerClass(IndexerMapReduce.class);

  job.setOutputFormat(IndexerOutputFormat.class);
  job.setOutputKeyClass(Text.class);
  job.setMapOutputValueClass(NutchWritable.class);
  job.setOutputValueClass(NutchWritable.class);
}

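As you can see, the FileInputFormat.addInputPath(...) calls register only crawl_fetch, crawl_parse, parse_data and parse_text under each segment folder. Of the page itself, that covers just the parsed output in "parse_data" and "parse_text"; the raw page is not among the inputs.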
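To make the gap concrete, here is a minimal, Hadoop-free sketch of which segment subdirectories end up as job inputs. The directory names are the values behind the constants used in initMRJob, plus content, the segment subdirectory where Nutch keeps the raw fetched page. The class SegmentInputs and the includeRawContent flag are hypothetical and exist only for illustration. Whether registering content as an input is sufficient on its own depends on the mapper and reducer, which this sketch does not cover.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: models the per-segment input paths without Hadoop.
class SegmentInputs {

    // Subdirectories that initMRJob registers for every segment.
    static final String[] INDEXED_DIRS = {
        "crawl_fetch", "crawl_parse", "parse_data", "parse_text"
    };

    // Subdirectory holding the raw fetched page; initMRJob does not register it.
    static final String RAW_CONTENT_DIR = "content";

    // Returns the input paths for one segment, optionally including the raw page.
    static List<File> inputsFor(File segment, boolean includeRawContent) {
        List<File> inputs = new ArrayList<File>();
        for (String dir : INDEXED_DIRS) {
            inputs.add(new File(segment, dir));
        }
        if (includeRawContent) {
            inputs.add(new File(segment, RAW_CONTENT_DIR));
        }
        return inputs;
    }

    public static void main(String[] args) {
        // The segment path is an invented example.
        File segment = new File("crawl/segments/20111010120000");
        for (File input : inputsFor(segment, true)) {
            System.out.println(input.getPath());
        }
    }
}
```

With includeRawContent set to false the list matches what initMRJob registers today; setting it to true adds the one directory that would carry the page snapshot.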

This article comes from the “果壳中的宇宙” (The Universe in a Nutshell) blog. Please contact the author before reposting!

Source: http://williamx.blog.51cto.com/3629295/722707
