How to fetch information in XML format from Nutch spidered webpages database

I'm trying to build a book aggregation portal. Nutch provides me with an excellent web crawler, but I want very specific information such as book title, book price, ISBN, and author. How can I extract that information from the crawled pages? I would like to fetch this information in XML format if possible.
In addition to the above, I would like to ask whether this is the right approach. Can it be done in a better way with other open source software?

Source: https://stackoverflow.com/questions/15909558

Updated: 2021-11-24 16:11

Best answer

For scripting purposes, don't use git log and other porcelain commands. In this case, git rev-list --count HEAD yields the same result using a plumbing command.
Anyway, this sounds like an excellent task for git describe. If you use tags, it reports the closest tag and how many commits ahead of that tag the current commit is. It also includes the abbreviated SHA-1 of the commit, so the version isn't ambiguous the way a pure commit count is.
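To make the shape of a git describe string concrete, here is a minimal Python sketch that splits one into its parts. It assumes the default output form, <tag>-<count>-g<abbreviated sha>, or just the tag name when the commit sits exactly on a tag. The names Described and parse_describe are made up for illustration and are not part of git, and suffixes such as --dirty are not handled.

```python
import re
from typing import NamedTuple, Optional


class Described(NamedTuple):
    """Parts of a `git describe` string (hypothetical helper type)."""
    tag: str
    commits_ahead: int
    sha: Optional[str]


# Default describe form: <tag>-<count>-g<abbreviated sha>.
# The tag group is greedy so tags that contain hyphens stay intact.
_DESCRIBE_RE = re.compile(r"^(?P<tag>.+)-(?P<count>\d+)-g(?P<sha>[0-9a-f]+)$")


def parse_describe(text: str) -> Described:
    """Split a describe string into tag, commit distance, and abbreviated SHA."""
    text = text.strip()
    match = _DESCRIBE_RE.match(text)
    if match is None:
        # Assumption: no suffix means the commit is exactly on the tag.
        return Described(text, 0, None)
    return Described(match.group("tag"), int(match.group("count")), match.group("sha"))


if __name__ == "__main__":
    # Sample strings are typed in by hand, not produced by running git.
    for sample in ("v1.0.4-14-g2414721", "release-2.0-3-gabc1234", "v2.0"):
        print(sample, "->", parse_describe(sample))
```

In a real script the input string would come from running git describe in the repository. A tag whose own name happens to end in something like -3-gabc would be misread by this sketch.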

Related Q&A

Generating version number in MSBuild [2024-05-10]

The MSBuild Community Tasks contains a task named Version, which provides some algorithms to generate a version number. It is very easy to use and customize. IMHO it is best to use a number that ties the whole SDLC together, so that you can trace deployed products back to build results, those back to the VCS, and so on. As Christopher Painter did, I would suggest using the Jenkins internal number.
Using git as part of an automated script - pausing the script for a rebase [2023-12-22]

You could split your script in two, with the first part calling the second one after a conflict-free rebase. If manual intervention is needed, the operator would call the second part after completing the merge.
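How to properly release scripts from SVN with version number [2023-09-07]

Since the keyword substitution you refer to uses propset, you should be able to set it on all scripts using the recursive option (-R), provided they are in a common directory tree: svn propset foo -R some-value . The above sets the property 'foo' to the value 'some-value' recursively on the current directory. In your case, replace 'foo' with 'svn:keywords' and 'some-value' with 'Id'. Personally, I'm not a big fan of keyword substitution, but in the case of Perl scripts I can see it being handy. As for how to release them? I would create a tag as a first step. ...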
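How to use sonarQube from Maven to gather incremental analysis prior to commit [2022-08-08]

The key to using it successfully is twofold: 1) make sure the "Issues Report" plugin is installed on the sonarQube server; 2) specify sonar.exclusion to make sure generated sources are excluded. With this and a "sonar" profile in place, users only need to add the "sonar" profile and the sonar:sonar goal to the maven command. This changes: mvn clean install -P myProfile to: mvn clean install sonar:sonar -P myProfile,sonar The maven profile can bind the sonar goal to the verify phase, so that adding the profile removes the need to specify the goal. ...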
How to get a part of the path where the script is located? [2021-06-12]

You can use Bash parameter expansion: PROJECTNAME=${PWD##*/} To get the parent directory requires two steps: PROJECTNAME=${PWD%/*} PROJECTNAME=${PROJECTNAME##*/}
Automatic updating build number before commit to git [2022-10-03]

The general way of doing this would be with a pre-commit hook.
Create directory with incremental version [2021-12-02]

Thanks for the help!! I figured it out. I'm posting it here in case anyone needs it. `@echo on setlocal EnableDelayedExpansion set max_number=0 For /f "tokens=2-4 delims=/ " %%a in ('date /t') do (set mydate=%%c%%a%%b) for /d %%d in (destination_location\folder_name_ABC%max_number%_%mydate%%) do ( set c ...
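Need help with version number schemes [2022-02-23]

Use SVN commit numbers, or other commit IDs (we usually use git describe output, which is perfect in most cases). Builds should be traceable: if you want to actually be able to determine what is running, it is important to build only from committed sources. Use the time/date at second precision (UNIX epoch time). This will work for you until 2038. If the numbers are too large, use a different epoch (for example, the year 2000). Also, a Uint32 is not limited to 2^16 (65535); its 32 bits give you 2^32, or about 4 billion.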
Integrating an incremental version number as part of a commit script [2022-07-03]

For scripting purposes don't use git log and other porcelain commands. In this ca ...
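Script for automatic commit and increase of build number [2022-07-03]

It is better to keep the build number in its own text file and leave the script that maintains the file unchanged, rather than modifying the script itself. #!/bin/bash # Initialize build.txt if it doesn't exist if [[ ! -f build.txt ]]; then echo 0 >> build.txt fi let build=$(cat build.txt)+1 echo $build > build.txt echo $build Why not write self-modifying code? It is harder to reason about, and it looks like the fact that it modifies itself is what prevents you from checking it ...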

