Elasticsearch indexing fails after successful Nutch crawl

I'm not sure why, but Nutch 1.13 is failing to index data into ES (v2.3.3). The crawl itself works fine, but when it comes time to index into ES, it gives me this error message:

Indexer: java.io.IOException: Job failed!
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:865)
at org.apache.nutch.indexer.IndexingJob.index(IndexingJob.java:147)
at org.apache.nutch.indexer.IndexingJob.run(IndexingJob.java:230)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.nutch.indexer.IndexingJob.main(IndexingJob.java:239)

Right before that, it prints this:

elastic.bulk.close.timeout : elastic timeout for the last bulk in seconds. (default 600)

I'm not sure whether the timeout has anything to do with the job failing.

I've run Nutch v1.10 many times with no problems, but decided to upgrade now. I never had this error until the upgrade.

EDIT: After closer inspection of the error message:

    Error running:
    /home/david/tutorials/nutch/nutch-1.13/runtime/local/bin/nutch index -Delastic.server.url=http://localhost:9300/search-index/ searchcrawl//crawldb -linkdb searchcrawl//linkdb searchcrawl//segments/20170519125546

It seems to be failing on that particular segment. What does that mean? I only know the basics of how to use Nutch and am by no means an expert. Is it failing on a link?


Source: https://stackoverflow.com/questions/44074819
Updated: 2022-03-21 22:03

Best answer

Rearrange the terms in 1 - pchisq(3.841459, 1, 10.50742) = 0.9 to get pchisq(3.841459, 1, x) - 0.1 = 0, then wrap abs() around the left-hand side to construct a function to minimize:

optim(1, function(x) abs(pchisq(3.841459, 1, ncp = x) - 0.1))
#-------
$par
[1] 10.50742

$value
[1] 1.740301e-08

$counts
function gradient 
      56       NA 

$convergence
[1] 0

$message
NULL

To do a sensitivity analysis, you can vary the other parameters one at a time, for example the critical value:

for (crit.val in seq(2.5, 3.5, by = 0.1)) {
  print(optim(1,
              function(x) abs(pchisq(crit.val, 1, ncp = x) - 0.1),
              method = "Brent", lower = 0, upper = 20)$par)
}
[1] 8.194852
[1] 8.375145
[1] 8.553901
[1] 8.731204
[1] 8.907135
[1] 9.081764
[1] 9.255156
[1] 9.427372
[1] 9.598467
[1] 9.768491
[1] 9.937492
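
Since the rearranged equation is a one-dimensional root-finding problem, a root finder can be used in place of minimizing the absolute value. The following is only a sketch of that alternative, not part of the original answer: it uses base R's uniroot, and the names power.gap and sol are made up for illustration. The interval [0, 20] is an assumption borrowed from the Brent call above.

```r
# Critical value used in the answer above (chi-square, 1 df).
crit.val <- 3.841459

# Hypothetical helper: P(X <= crit.val) minus the target type II error of 0.1.
# Its root is the noncentrality parameter that gives power 0.9.
power.gap <- function(ncp) pchisq(crit.val, df = 1, ncp = ncp) - 0.1

# power.gap(0) is positive (about 0.85) and power.gap(20) is negative,
# so [0, 20] brackets the root as uniroot requires.
sol <- uniroot(power.gap, lower = 0, upper = 20, tol = 1e-9)
print(sol$root)
```

The root should agree with the $par value of about 10.50742 that optim reports above. Unlike the abs() objective, the function passed to uniroot changes sign at the solution, which is what the bracketing interval relies on.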
