首页 \ 问答 \ hadoop reducer上的JVM崩溃(JVM crash on hadoop reducer)

hadoop reducer上的JVM崩溃(JVM crash on hadoop reducer)

我在hadoop上运行java代码,但遇到此错误:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2ffe7e1904, pid=31718, tid=139843231057664
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x813904]  PhaseIdealLoop::build_loop_late_post(Node*)+0x144
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/hs_err_pid31718.log
#
# Compiler replay data is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/replay_pid31718.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp

当我转到节点管理器时,由于yarn.log-aggregation-enable is true ,因此无法找到所有日志,并且找不到log hs_err_pid31718.log和replay_pid31718.log。

通常1)JVM在减速器几分钟后崩溃,2)有时减速器的自动重试可以成功,3)一些减速器可以成功而不会失败。

Hadoop版本是2.6.0,Java是Java8。 这不是新环境,我们在群集上运行了大量作业。

我的问题:

  1. 在聚合日志并删除文件夹后,我可以在任何地方找到hs_err_pid31718.log吗? 或者是否有设置保留所有本地日志,以便我可以检查hs_err_pid31718.log,同时按纱线聚合日志?

  2. 缩小深潜范围的常见步骤是什么? 由于jvm崩溃了,我在代码中看不到任何异常。 我已经尝试了-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp这个args但是在没有reduce任务的主机上没有堆转储。

谢谢你的任何建议。


I am running java codes on hadoop, but encounter this error:

#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2ffe7e1904, pid=31718, tid=139843231057664
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x813904]  PhaseIdealLoop::build_loop_late_post(Node*)+0x144
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/hs_err_pid31718.log
#
# Compiler replay data is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/replay_pid31718.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp

When I go to the node manager, all the logs are aggregated since yarn.log-aggregation-enable is true, and log hs_err_pid31718.log and replay_pid31718.log cannot be found.

Normally 1) the JVM crashes after several minutes of reducer, 2) sometimes the auto-retry of reducer can succeeds, 3) some reducers can succeed without failure.

Hadoop version is 2.6.0, Java is Java8. This is not a new environments, we have lots of jobs running on the cluster.

My questions:

  1. Can I find hs_err_pid31718.log anywhere after yarn aggregate the log and remove the folder? Or is there a setting to keep all the local logs so I can check the hs_err_pid31718.log while aggregating logs by yarn?

  2. What's the common steps to narrow down the deep dive scope? Since the jvm crashed, I cannot see any exception in code. I have tried -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp this args but there is no heap dumped on the host failing the reduce tasks.

Thanks for any suggestion.


原文:https://stackoverflow.com/questions/45038077
更新时间:2022-08-24 19:08

相关文章

更多

最新问答

更多
  • 获取MVC 4使用的DisplayMode后缀(Get the DisplayMode Suffix being used by MVC 4)
  • 如何通过引用返回对象?(How is returning an object by reference possible?)
  • 矩阵如何存储在内存中?(How are matrices stored in memory?)
  • 每个请求的Java新会话?(Java New Session For Each Request?)
  • css:浮动div中重叠的标题h1(css: overlapping headlines h1 in floated divs)
  • 无论图像如何,Caffe预测同一类(Caffe predicts same class regardless of image)
  • xcode语法颜色编码解释?(xcode syntax color coding explained?)
  • 在Access 2010 Runtime中使用Office 2000校对工具(Use Office 2000 proofing tools in Access 2010 Runtime)
  • 从单独的Web主机将图像传输到服务器上(Getting images onto server from separate web host)
  • 从旧版本复制文件并保留它们(旧/新版本)(Copy a file from old revision and keep both of them (old / new revision))
  • 西安哪有PLC可控制编程的培训
  • 在Entity Framework中选择基类(Select base class in Entity Framework)
  • 在Android中出现错误“数据集和渲染器应该不为null,并且应该具有相同数量的系列”(Error “Dataset and renderer should be not null and should have the same number of series” in Android)
  • 电脑二级VF有什么用
  • Datamapper Ruby如何添加Hook方法(Datamapper Ruby How to add Hook Method)
  • 金华英语角.
  • 手机软件如何制作
  • 用于Android webview中图像保存的上下文菜单(Context Menu for Image Saving in an Android webview)
  • 注意:未定义的偏移量:PHP(Notice: Undefined offset: PHP)
  • 如何读R中的大数据集[复制](How to read large dataset in R [duplicate])
  • Unity 5 Heighmap与地形宽度/地形长度的分辨率关系?(Unity 5 Heighmap Resolution relationship to terrain width / terrain length?)
  • 如何通知PipedOutputStream线程写入最后一个字节的PipedInputStream线程?(How to notify PipedInputStream thread that PipedOutputStream thread has written last byte?)
  • python的访问器方法有哪些
  • DeviceNetworkInformation:哪个是哪个?(DeviceNetworkInformation: Which is which?)
  • 在Ruby中对组合进行排序(Sorting a combination in Ruby)
  • 网站开发的流程?
  • 使用Zend Framework 2中的JOIN sql检索数据(Retrieve data using JOIN sql in Zend Framework 2)
  • 条带格式类型格式模式编号无法正常工作(Stripes format type format pattern number not working properly)
  • 透明度错误IE11(Transparency bug IE11)
  • linux的基本操作命令。。。