Setting up Nutch 2.1 with MySQL to handle UTF-8

2019-03-27 00:49 | Source: Internet

Original article: http://nlp.solutions.asia/?p=180

These instructions assume Ubuntu 12.04 and Java 6 or 7 installed and JAVA_HOME configured.

Install MySQL Server and MySQL Client using the Ubuntu software center or sudo apt-get install mysql-server mysql-client at the command line.

As MySQL defaults to latin1 (are we still in the 1990s?) we need to edit the configuration with sudo vi /etc/mysql/my.cnf and add the following under [mysqld]:

innodb_file_format=barracuda
innodb_file_per_table=true
innodb_large_prefix=true
character-set-server=utf8mb4
collation-server=utf8mb4_unicode_ci
max_allowed_packet=500M

The innodb options work around MySQL's small limit on primary key size. The max_allowed_packet option is so you don’t run into issues as your database and the pages you store in it get larger. Restart your machine (or just the MySQL service) for the changes to take effect.
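
To see why a key length of 767 works once these options are set, here is a rough sketch of the arithmetic. It assumes the commonly documented InnoDB limits for MySQL 5.5/5.6: index keys are capped at 767 bytes by default, and at 3072 bytes when innodb_large_prefix is enabled together with the Barracuda file format and a DYNAMIC or COMPRESSED row format. The function name index_key_bytes is made up for illustration.

```shell
# Hypothetical helper: worst-case bytes needed to index a VARCHAR column.
# Arguments: number of characters, maximum bytes per character.
index_key_bytes() {
  echo $(( $1 * $2 ))
}

DEFAULT_PREFIX_LIMIT=767   # assumed InnoDB limit without innodb_large_prefix (bytes)
LARGE_PREFIX_LIMIT=3072    # assumed InnoDB limit with innodb_large_prefix (bytes)

# utf8mb4 stores up to 4 bytes per character.
needed=$(index_key_bytes 767 4)
echo "varchar(767) in utf8mb4 needs up to $needed bytes"

if [ "$needed" -le "$LARGE_PREFIX_LIMIT" ]; then
  echo "fits under the large-prefix limit of $LARGE_PREFIX_LIMIT bytes"
fi
if [ "$needed" -gt "$DEFAULT_PREFIX_LIMIT" ]; then
  echo "exceeds the default limit of $DEFAULT_PREFIX_LIMIT bytes"
fi
```

In other words, a 767-character utf8mb4 key only fits because the large prefix is enabled; without it the same column could be indexed only if every character took a single byte.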

Check that MySQL is running by typing sudo netstat -tap | grep mysql and you should see something like

tcp        0      0 localhost:mysql         *:*                     LISTEN

We need to set up the nutch database manually, as the db schema currently generated by Nutch/Gora/MySQL defaults to latin1. Log into MySQL at the command line with the MySQL id and password you set up earlier by typing

mysql -u xxxxx -p

then at the MySQL prompt type the following:

CREATE DATABASE nutch DEFAULT CHARACTER SET utf8mb4 DEFAULT COLLATE utf8mb4_unicode_ci;

press Enter, then type

use nutch;

press Enter again, then copy and paste the following all together:

CREATE TABLE `webpage` (
`id` varchar(767) NOT NULL,
`headers` blob,
`text` mediumtext DEFAULT NULL,
`status` int(11) DEFAULT NULL,
`markers` blob,
`parseStatus` blob,
`modifiedTime` bigint(20) DEFAULT NULL,
`score` float DEFAULT NULL,
`typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
`baseUrl` varchar(767) DEFAULT NULL,
`content` longblob,
`title` varchar(2048) DEFAULT NULL,
`reprUrl` varchar(767) DEFAULT NULL,
`fetchInterval` int(11) DEFAULT NULL,
`prevFetchTime` bigint(20) DEFAULT NULL,
`inlinks` mediumblob,
`prevSignature` blob,
`outlinks` mediumblob,
`fetchTime` bigint(20) DEFAULT NULL,
`retriesSinceFetch` int(11) DEFAULT NULL,
`protocolStatus` blob,
`signature` blob,
`metadata` blob,
PRIMARY KEY (`id`)
) ENGINE=InnoDB
ROW_FORMAT=COMPRESSED
DEFAULT CHARSET=utf8mb4;

Then press Enter. You are done setting up the MySQL database for Nutch.
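
If you want to double-check that the table really ended up with a utf8mb4 collation, a minimal sketch is shown here. It only prints a query against the standard information_schema.TABLES view; the helper name table_collation_sql is an assumption made for this example.

```shell
# Hypothetical helper: prints a query that lists every table in a schema
# together with its collation.
table_collation_sql() {
  printf "SELECT TABLE_NAME, TABLE_COLLATION FROM information_schema.TABLES WHERE TABLE_SCHEMA = '%s';\n" "$1"
}

# Print the query for the nutch database created above.
table_collation_sql nutch
```

To actually run it, pipe the output into the client, for example table_collation_sql nutch | mysql -u xxxxx -p. The collation reported for webpage should start with utf8mb4.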

 

Set up Nutch 2.1 by downloading the latest version from http://www.apache.org/dyn/closer.cgi/nutch/. Untar the contents of the file you just downloaded and going forward we will refer to this folder as ${APACHE_NUTCH_HOME}.

From inside the nutch folder, ensure the MySQL dependency for Nutch is available by uncommenting the following in ${APACHE_NUTCH_HOME}/ivy/ivy.xml

<!-- Uncomment this to use MySQL as database with SQL as Gora store. -->
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>

 

Edit the ${APACHE_NUTCH_HOME}/conf/gora.properties file, either deleting the Default SqlStore Properties or commenting them out using #. Then add the MySQL properties below, replacing xxxxx with the user and password you set up when installing MySQL earlier.

###############################
# MySQL properties            #
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://localhost:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=xxxxx
gora.sqlstore.jdbc.password=xxxxx

Edit the ${APACHE_NUTCH_HOME}/conf/gora-sql-mapping.xml file, changing the length of the primarykey from 512 to 767 in both places.
<primarykey column="id" length="767"/>

Configure ${APACHE_NUTCH_HOME}/conf/nutch-site.xml by putting a name in the value field under http.agent.name. It can be anything but cannot be left blank. Add additional languages if you want (I have added Japanese ja-jp below) and set utf-8 as the default encoding as well. You must specify SqlStore as the data store.

<property>
<name>http.agent.name</name>
<value>Your Nutch Spider</value>
</property>

<property>
<name>http.accept.language</name>
<value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
<description>Value of the "Accept-Language" request header field.
This allows selecting non-English language as default one to retrieve.
It is a useful setting for search engines build for certain national group.
</description>
</property>

<property>
<name>parser.character.encoding.default</name>
<value>utf-8</value>
<description>The character encoding to fall back to when no other information
is available</description>
</property>

<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.sql.store.SqlStore</value>
<description>The Gora DataStore class for storing and retrieving data.
Currently the following stores are available: ….
</description>
</property>

Install ant using the Ubuntu software center or sudo apt-get install ant at the command line.

From the command line, cd to your nutch folder and type ant runtime
This may take a few minutes to compile.

 

Start your first crawl by typing the lines below at the terminal (replace ‘http://nutch.apache.org/’ with whatever site you want to crawl):

cd ${APACHE_NUTCH_HOME}/runtime/local
mkdir -p urls
echo 'http://nutch.apache.org/' > urls/seed.txt
bin/nutch crawl urls -depth 3 -topN 5

You can easily add more URLs to crawl by hand in seed.txt if you want. For the crawl, depth is the number of rounds of generate/fetch/parse/update you want to do (not depth of links as you might think at first) and topN is the maximum number of links you want to actually parse in each round. Note, however, that Nutch keeps track of all links it encounters in the webpage table; topN only limits how many it actually parses, so don’t be surprised to see many more rows in the webpage table than topN would suggest.
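
To make the meaning of depth and topN concrete, the sketch below prints the sequence of individual steps that a crawl with -depth 3 -topN 5 roughly corresponds to. It is a dry run that only echoes commands. The function name print_crawl_plan is hypothetical, and the subcommand arguments shown (inject urls, generate -topN, fetch -all, parse -all, updatedb) are an assumption about the Nutch 2.x command line that may differ between releases, so check the usage output of bin/nutch before running any of them by hand.

```shell
# Hypothetical dry-run helper: prints one inject step followed by
# <depth> rounds of generate/fetch/parse/update.
# Arguments: depth (number of rounds), topN (max links per round).
print_crawl_plan() {
  depth="$1"
  top_n="$2"
  echo "bin/nutch inject urls"
  round=1
  while [ "$round" -le "$depth" ]; do
    # Each round selects at most topN links, fetches and parses them,
    # then writes newly discovered links back to the webpage table.
    echo "bin/nutch generate -topN $top_n"
    echo "bin/nutch fetch -all"
    echo "bin/nutch parse -all"
    echo "bin/nutch updatedb"
    round=$((round + 1))
  done
}

print_crawl_plan 3 5
```

Since each round handles at most topN links, three rounds with topN 5 parse at most 15 pages, while the update step in every round can still add many more discovered URLs as rows.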

Check your crawl results by looking at the webpage table in the nutch database.

mysql -u xxxxx -p
use nutch;
SELECT * FROM nutch.webpage;

You should see the results of your crawl (around 159 rows). It will be hard to read the columns so you may want to install MySQL Workbench via sudo apt-get install mysql-workbench and use that instead for viewing the data. You may also want to run the following SQL command select * from webpage where status = 2; to limit the rows in the webpage table to only urls that were actually parsed.
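
A quick way to get an overview without scrolling through every column is to count rows per status value. The sketch below only prints the query; the helper name status_count_sql is made up for this example.

```shell
# Hypothetical helper: prints a query that counts rows per status value
# in the given table.
status_count_sql() {
  printf "SELECT status, COUNT(*) AS pages FROM %s GROUP BY status;\n" "$1"
}

# Print the query for the webpage table in the nutch database.
status_count_sql nutch.webpage
```

Run it with status_count_sql nutch.webpage | mysql -u xxxxx -p. The count shown for status 2 should match the number of rows returned by the status = 2 query above.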

 

Set up and index with Solr. If you are using Nutch 2.1 at this time you are on the bleeding edge and probably want the latest version of Solr 4.0 as well. Untar it to $HOME/apache-solr-4.0.0-XXXX. This folder will now be referred to as ${APACHE_SOLR_HOME}.
Download http://nlp.solutions.asia/wp-content/uploads/2012/08/schema.xml and use it to replace ${APACHE_SOLR_HOME}/example/solr/collection1/conf/schema.xml.

From the terminal start solr:
cd ${APACHE_SOLR_HOME}/example
java -jar start.jar

You can check this is running by opening http://localhost:8983/solr in your web browser.

Leave that terminal running and from a different terminal type the following:
cd ${APACHE_NUTCH_HOME}/runtime/local/
bin/nutch solrindex http://127.0.0.1:8983/solr/ -reindex

You can now run Solr queries against your crawled content. Open http://localhost:8983/solr/#/collection1/query and, assuming you have crawled nutch.apache.org, enter text:nutch in the input box titled “q” to run a search. You should see the matching pages from your crawl in the results.
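
The same search can also be issued from the command line. The following is a minimal sketch that builds the request URL for the select handler of the collection1 core in the Solr example setup; the helper name solr_query_url is hypothetical, and it does no URL encoding, so it is only suitable for simple queries without spaces or special characters.

```shell
# Hypothetical helper: builds a Solr select URL for a simple query string.
# No URL encoding is performed.
solr_query_url() {
  printf 'http://localhost:8983/solr/collection1/select?q=%s&wt=json&indent=true\n' "$1"
}

# Print the URL for the text:nutch query.
solr_query_url 'text:nutch'
```

With Solr still running in the other terminal, curl "$(solr_query_url 'text:nutch')" should return the matching documents as JSON.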

 

There remains a lot to configure to get a good web search going but you are at least started.


Reposted from: http://www.cnblogs.com/AloneSword/p/3798126
