Notes on Integrating Nutch with Qidian R3 (Part 2)

2019-03-27 01:21 | Source: web

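Part 1 of these notes covered adding the index fields that Nutch needs to Qidian R3. Once those fields exist, Nutch can crawl one or more websites and push the content into the Qidian R3 index with bin/nutch solrindex.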

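III. Installing and configuring Nutch

1. Installing Nutch

First download Nutch 1.3 from http://www.apache.org/dist//nutch/apache-nutch-1.3-bin.zip and unpack it. Nutch can run on Linux, on Windows, or inside Eclipse after being imported as a project.

Installing on Linux is the simplest. Upload the contents of the unpacked runtime/local directory to a directory on the Linux machine, for example /opt/nutch1.3. Then copy nutch-1.3.jar from /opt/nutch1.3/lib into /opt/nutch1.3, rename the copy to nutch-1.3.job, and make the launcher script executable (chmod +x /opt/nutch1.3/bin/nutch). A JDK is also required: set JAVA_HOME in profile and make sure the JDK's bin directory is on PATH. From the /opt/nutch1.3 directory, type bin/nutch and the following usage message appears: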

 

[root@test nutch-1.3]# bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  solrdedup         remove duplicates from solr
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.

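This means the installation succeeded. To run in Hadoop mode, you also need to download some of Hadoop's run scripts and copy them into the bin directory.

On Windows you must install Cygwin, which provides a Linux-like runtime environment; download it from http://www.cygwin.org/cygwin/. Running Nutch under Cygwin needs the same setup as on Linux: JAVA_HOME, PATH and so on.

For Eclipse, there are many guides online on importing Nutch 1.3, but a lot of them are wrong. One important step is to put conf first in the build path order:

[Figure: Eclipse build path order with conf at the top]

Then create a Java application run configuration whose main class is org.apache.nutch.crawl.Crawl:

[Figure: Eclipse run configuration with main class org.apache.nutch.crawl.Crawl]

The corresponding program arguments are set as follows:

[Figure: program arguments of the run configuration]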

2. Configuring nutch-site.xml

Whether you run on Linux, under Cygwin, or in Eclipse, first edit nutch-site.xml in the conf directory and add:

 <property>
  <name>http.agent.name</name>
  <value>nutch-1.3</value>
 </property>

<property>
  <name>http.robots.agents</name>
  <value>nutch-1.3,*</value>
</property>


<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip|swf|rss)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>

In the Eclipse environment you also need to add the following to nutch-site.xml:

 

<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located.  Each
  element may be a relative or absolute path.  If absolute, it is used
  as is.  If relative, it is searched for on the classpath.</description>
</property>
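
As a quick sanity check on these settings, here is a minimal Python sketch that reads the properties and flags the two mistakes that seem most common: an empty http.agent.name (as far as I know, Nutch will not fetch without it) and an http.robots.agents value that does not list the agent name first. The SAMPLE string and both helper functions are hypothetical; a real nutch-site.xml wraps its property elements in a configuration root element, which the sample mirrors.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mirroring the properties above. A real nutch-site.xml
# wraps its <property> elements in a <configuration> root element.
SAMPLE = """<configuration>
 <property>
  <name>http.agent.name</name>
  <value>nutch-1.3</value>
 </property>
 <property>
  <name>http.robots.agents</name>
  <value>nutch-1.3,*</value>
 </property>
</configuration>"""


def load_properties(xml_text):
    """Return {name: value} for every <property> in a Hadoop-style config."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = (prop.findtext("name") or "").strip()
        value = (prop.findtext("value") or "").strip()
        if name:
            props[name] = value
    return props


def check_agent_settings(props):
    """Return a list of problems found in the agent-related settings."""
    problems = []
    agent = props.get("http.agent.name", "")
    if not agent:
        problems.append("http.agent.name is empty")
    robots = [a.strip() for a in props.get("http.robots.agents", "").split(",")]
    if agent and robots[0] != agent:
        problems.append("http.robots.agents should list http.agent.name first")
    return problems


if __name__ == "__main__":
    found = check_agent_settings(load_properties(SAMPLE))
    print(found or "agent settings look consistent")
```

To check a real file, pass the text of conf/nutch-site.xml to load_properties instead of SAMPLE.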

3. Configuring solrindex-mapping.xml

Next, edit nutch1.3/conf/solrindex-mapping.xml so that Nutch's index fields map onto the index fields defined in Qidian R3. The file content is:

 

<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->

<mapping>
	<!-- Simple mapping of fields created by Nutch IndexingFilters
	     to fields defined (and expected) in Solr schema.xml.

             Any fields in NutchDocument that match a name defined
             in field/@source will be renamed to the corresponding
             field/@dest.
             Additionally, if a field name (before mapping) matches
             a copyField/@source then its values will be copied to 
             the corresponding copyField/@dest.

             uniqueKey has the same meaning as in Solr schema.xml
             and defaults to "id" if not defined.
         -->
	<fields>
		<field dest="title" source="title"/>
		<field dest="text" source="content"/>
		<field dest="lastModified" source="lastModified"/>	
		<field dest="type" source="type"/>	
		<field dest="site" source="site"/>
		<field dest="anchor" source="anchor"/>
		<field dest="host" source="host"/>		
		<field dest="segment" source="segment"/>
		<field dest="boost" source="boost"/>
		<field dest="tstamp" source="tstamp"/>
		<field dest="url" source="url"/>
		<field dest="id" source="digest"/>
		<copyField source="digest" dest="digest"/>
	</fields>
	<uniqueKey>id</uniqueKey>
</mapping>
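
To make the effect of this mapping concrete, the sketch below applies the rules described in the file's own comment to a plain dictionary: fields whose name matches a field source are renamed to the dest, and fields whose name matches a copyField source are also copied to that dest. It is only an illustration in Python, not Nutch's actual code; the apply_mapping helper, the shortened MAPPING_XML and the sample document are all hypothetical, and passing unmapped fields through unchanged is my assumption.

```python
import xml.etree.ElementTree as ET

# Shortened, hypothetical copy of the mapping above (only a few fields kept).
MAPPING_XML = """<mapping>
  <fields>
    <field dest="title" source="title"/>
    <field dest="text" source="content"/>
    <field dest="url" source="url"/>
    <field dest="id" source="digest"/>
    <copyField source="digest" dest="digest"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>"""


def apply_mapping(mapping_xml, doc):
    """Rename and copy the fields of `doc` according to the mapping rules.

    Returns (mapped_document, unique_key). Fields with no rule are passed
    through unchanged, which is an assumption of this sketch.
    """
    root = ET.fromstring(mapping_xml)
    renames = {f.get("source"): f.get("dest") for f in root.iter("field")}
    copies = [(c.get("source"), c.get("dest")) for c in root.iter("copyField")]

    mapped = {}
    for name, value in doc.items():
        # Rename: source field name -> dest field name.
        mapped[renames.get(name, name)] = value
        # Copy: matched against the field name *before* renaming.
        for src, dest in copies:
            if src == name:
                mapped[dest] = value

    # uniqueKey defaults to "id" when the element is missing.
    unique_key = root.findtext("uniqueKey") or "id"
    return mapped, unique_key


if __name__ == "__main__":
    nutch_doc = {
        "title": "Sina",
        "content": "page text",
        "url": "http://www.sina.com.cn/",
        "digest": "abc123",
    }
    print(apply_mapping(MAPPING_XML, nutch_doc))
```

With this mapping the page digest ends up both as the unique id and, through the copyField, as a separate digest field, while content is stored under text.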

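4. Configuring regex-urlfilter.txt

Edit the URL filter so that the site you want to crawl is not filtered out. For example, to crawl the Sina website, add this rule to regex-urlfilter.txt in Nutch's conf directory: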

+^http://www.sina

The full regex-urlfilter.txt then looks like this:

 

# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept URLs starting with http://www.sina. and reject anything else
+^http://www.sina.
-.
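
The first-match behaviour described in the file's comments can be tried out with a small Python sketch before starting a crawl. This is only an approximation: parse_rules and accepts are hypothetical helpers, Python's re module stands in for Java's regex engine, and I am assuming each pattern is searched for anywhere in the URL rather than having to match the whole URL.

```python
import re

# The active (non-comment) rules from the file above, in the same order.
RULES_TEXT = r"""
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://www.sina.
-.
"""


def parse_rules(text):
    """Turn the filter text into a list of (allow, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError("rule must start with '+' or '-': " + line)
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(rules, url):
    """Apply the first matching rule; a URL matching no rule is ignored."""
    for allow, pattern in rules:
        if pattern.search(url):
            return allow
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES_TEXT)
    for url in (
        "http://www.sina.com.cn/",
        "http://www.sina.com.cn/logo.png",
        "http://www.example.com/",
    ):
        print(url, accepts(rules, url))
```

Because the first matching rule wins, the order matters: the +^http://www.sina. line has to come before the final -. line, otherwise every URL would be rejected.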

5. In the nutch1.3 directory, create a url directory (at the same level as conf), then create a url.txt file inside it whose content is http://www.sina.com.cn .
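
If you prefer to script this step, a minimal Python sketch could look like the following. The write_seed_list helper is hypothetical, and the demo writes into a temporary directory instead of a real Nutch install; point nutch_home at your own nutch1.3 directory to use it for real.

```python
from pathlib import Path
import tempfile


def write_seed_list(nutch_home, urls, dir_name="url", file_name="url.txt"):
    """Create <nutch_home>/url/url.txt with one seed URL per line."""
    seed_dir = Path(nutch_home) / dir_name
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / file_name
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


if __name__ == "__main__":
    # A temporary directory stands in for the nutch1.3 directory here.
    with tempfile.TemporaryDirectory() as home:
        seed = write_seed_list(home, ["http://www.sina.com.cn"])
        print(seed.read_text(encoding="utf-8"), end="")
```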

       

 

 

 


Reposted from: http://my.oschina.net/sprint/blog/28717
