Notes on Integrating Nutch with Qidian R3 (起点R3), Part 2
2019-03-27 01:21 | Source: Internet
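Part 1 of these notes described how to add the index fields that Nutch needs to Qidian R3. Once those fields exist, Nutch can crawl one or more websites and send the content to the Qidian R3 index with bin/nutch solrindex.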
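III. Installing and Configuring Nutch
1. Install Nutch
First download Nutch 1.3 from http://www.apache.org/dist//nutch/apache-nutch-1.3-bin.zip and unpack it. Nutch can run on Linux, on Windows, or inside Eclipse after being imported as a project.
Installation on Linux is the simplest. Upload the contents of the unpacked runtime/local directory to a directory on the Linux machine, such as /opt/nutch1.3. Copy nutch-1.3.jar from /opt/nutch1.3/lib into /opt/nutch1.3 and name the copy nutch-1.3.job, then make the launcher script executable with chmod +x /opt/nutch1.3/bin/nutch. A JDK is also required: set JAVA_HOME in the profile and make sure the JDK's bin directory is on PATH. From /opt/nutch1.3, run bin/nutch; it should print the following: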
[root@test nutch-1.3]# bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  solrdedup         remove duplicates from solr
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
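This output means the installation succeeded. To run in Hadoop mode, you also need to obtain some of the Hadoop launch scripts from the web and copy them into the bin directory.
On Windows you must install cygwin, which emulates a Linux runtime environment; download it from http://www.cygwin.org/cygwin/. Running Nutch under cygwin needs the same configuration as on Linux: JAVA_HOME, PATH, and so on.
For Eclipse, there are many guides online on importing Nutch 1.3, but many of them are wrong. One important step is to place conf first in the build path order, as in the figure below:
[Figure: Eclipse build path order with conf listed first]
Then create a Java application run configuration whose main class is org.apache.nutch.crawl.Crawl, as in the figure below:
[Figure: Eclipse run configuration with main class org.apache.nutch.crawl.Crawl]
The corresponding program arguments are set as follows:
[Figure: program arguments of the run configuration]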
2. Configure nutch-site.xml
Whether you run on Linux, under cygwin, or in Eclipse, the first step is to edit nutch-site.xml in the conf directory and add the following properties:
<property>
  <name>http.agent.name</name>
  <value>nutch-1.3</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nutch-1.3,*</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip|swf|rss)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
In Eclipse, you also need to add the following to nutch-site.xml:
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
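One way to confirm that the properties above actually ended up in the file is to parse it and look for them. The following is a minimal Python sketch and not part of Nutch: the helper names read_properties and missing_properties are hypothetical, the sample document is made up for illustration, and wrapping the properties in a configuration root element is an assumption about the file layout.

```python
import xml.etree.ElementTree as ET


def read_properties(xml_text):
    """Return {name: value} for every <property> element in the document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is not None:
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def missing_properties(props, required):
    """List the names in `required` that are absent or have an empty value."""
    return [name for name in required if not props.get(name)]


# Hypothetical sample: two of the three properties from this section.
SAMPLE = """<configuration>
  <property><name>http.agent.name</name><value>nutch-1.3</value></property>
  <property><name>http.robots.agents</name><value>nutch-1.3,*</value></property>
</configuration>"""

# The names this section asks you to set (chosen here for illustration).
REQUIRED = ("http.agent.name", "http.robots.agents", "plugin.includes")

if __name__ == "__main__":
    props = read_properties(SAMPLE)
    print("missing:", missing_properties(props, REQUIRED))
```

For a real installation, the text passed to read_properties would come from conf/nutch-site.xml rather than from the sample string.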
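3. Configure solrindex-mapping.xml
Next, edit solrindex-mapping.xml in the Nutch 1.3 conf directory so that Nutch's index fields map to the index fields defined in Qidian R3. The contents are as follows: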
<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements. See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License. You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<mapping>
  <!-- Simple mapping of fields created by Nutch IndexingFilters
       to fields defined (and expected) in Solr schema.xml.

       Any fields in NutchDocument that match a name defined
       in field/@source will be renamed to the corresponding
       field/@dest.
       Additionally, if a field name (before mapping) matches
       a copyField/@source then its values will be copied to
       the corresponding copyField/@dest.

       uniqueKey has the same meaning as in Solr schema.xml
       and defaults to "id" if not defined.
  -->
  <fields>
    <field dest="title" source="title"/>
    <field dest="text" source="content"/>
    <field dest="lastModified" source="lastModified"/>
    <field dest="type" source="type"/>
    <field dest="site" source="site"/>
    <field dest="anchor" source="anchor"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="url" source="url"/>
    <field dest="id" source="digest"/>
    <copyField source="digest" dest="digest"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
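The comment inside the mapping file describes two rules: a field whose name matches field/@source is renamed to field/@dest, and a field whose original name matches copyField/@source is also copied to copyField/@dest. The Python sketch below illustrates those two rules on a plain dictionary. It is not Nutch's own code: load_mapping and apply_mapping are hypothetical helpers, the sample document is invented, and leaving unmapped fields under their original name is an assumption.

```python
import xml.etree.ElementTree as ET

# A shortened copy of the mapping from this section.
MAPPING_XML = """<mapping>
  <fields>
    <field dest="title" source="title"/>
    <field dest="text" source="content"/>
    <field dest="url" source="url"/>
    <field dest="id" source="digest"/>
    <copyField source="digest" dest="digest"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>"""


def load_mapping(xml_text):
    """Return (renames, copies, unique_key) read from a mapping document."""
    root = ET.fromstring(xml_text)
    renames = {f.get("source"): f.get("dest") for f in root.iter("field")}
    copies = {}
    for c in root.iter("copyField"):
        copies.setdefault(c.get("source"), []).append(c.get("dest"))
    # uniqueKey defaults to "id" when the element is missing.
    unique_key = root.findtext("uniqueKey", default="id").strip()
    return renames, copies, unique_key


def apply_mapping(doc, renames, copies):
    """Rename fields by source name, then copy values for matching copyFields."""
    out = {}
    for name, value in doc.items():
        # Assumption: a field with no mapping entry keeps its own name.
        out[renames.get(name, name)] = value
        for dest in copies.get(name, []):
            out[dest] = value
    return out


if __name__ == "__main__":
    renames, copies, unique_key = load_mapping(MAPPING_XML)
    crawled = {"content": "hello", "digest": "abc", "url": "http://www.sina.com.cn/"}
    print(apply_mapping(crawled, renames, copies), "unique key:", unique_key)
```

With this mapping the page digest becomes the id field, which is also the uniqueKey, while the copyField keeps the digest available under its own name.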
4. Configure regex-urlfilter.txt
Edit the URL filter so that the site you want to crawl is not filtered out. For example, to crawl the Sina site, add this rule to regex-urlfilter.txt in Nutch's conf directory:
+^http://www.sina
The contents of regex-urlfilter.txt are as follows:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements. See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'. The first matching pattern in the file
# determines whether a URL is included or ignored. If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://www.sina.
-.
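The file's own comments state the rule: the first pattern that matches decides whether a URL is included, and a URL that matches nothing is ignored. The Python sketch below mimics that rule with the active (uncommented) lines of the file above, so you can check a URL before starting a crawl. It is an approximation and not Nutch's filter: parse_rules and accepts are hypothetical helpers, and Python's re.search stands in for the Java regex engine that Nutch uses.

```python
import re

# The active rules from the file above, in the same order.
RULES = r"""
-^(file|ftp|mailto):
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-.*(/[^/]+)/[^/]+\1/[^/]+\1/
+^http://www.sina.
-.
"""


def parse_rules(text):
    """Turn filter text into a list of (accept, compiled_pattern) pairs."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments carry no rule
        sign, pattern = line[0], line[1:]
        if sign not in ("+", "-"):
            raise ValueError(f"rule must start with '+' or '-': {line!r}")
        rules.append((sign == "+", re.compile(pattern)))
    return rules


def accepts(url, rules):
    """Apply the first matching rule; a URL that matches nothing is ignored."""
    for accept, pattern in rules:
        if pattern.search(url):
            return accept
    return False


if __name__ == "__main__":
    rules = parse_rules(RULES)
    for url in ("http://www.sina.com.cn/",
                "http://www.sina.com.cn/logo.gif",
                "http://example.com/"):
        print(url, "accepted" if accepts(url, rules) else "rejected")
```

Order matters here: because the suffix rule comes before the +^http://www.sina. rule, image URLs on the Sina site are still rejected, and the final -. line rejects every URL that no earlier rule accepted.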
5. Create a url directory under the nutch1.3 directory (at the same level as conf), then create a url.txt file in it whose content is http://www.sina.com.cn.
Reposted from: http://my.oschina.net/sprint/blog/28717
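Step 5 can also be scripted. The sketch below is a hedged Python example: write_seed_file is a hypothetical helper, and a temporary directory stands in for the Nutch home directory (such as /opt/nutch1.3) so that the example runs anywhere. It writes one seed URL per line.

```python
import tempfile
from pathlib import Path


def write_seed_file(nutch_home, urls, dirname="url", filename="url.txt"):
    """Create <nutch_home>/<dirname>/<filename> holding one seed URL per line."""
    seed_dir = Path(nutch_home) / dirname
    seed_dir.mkdir(parents=True, exist_ok=True)
    seed_file = seed_dir / filename
    seed_file.write_text("\n".join(urls) + "\n", encoding="utf-8")
    return seed_file


if __name__ == "__main__":
    # Assumption: a temporary directory replaces the real Nutch home here.
    with tempfile.TemporaryDirectory() as nutch_home:
        seed = write_seed_file(nutch_home, ["http://www.sina.com.cn"])
        print(seed.read_text(encoding="utf-8"), end="")
```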