JVM crash on Hadoop reducer
I am running Java code on Hadoop, but encounter this error:
#
# A fatal error has been detected by the Java Runtime Environment:
#
#  SIGSEGV (0xb) at pc=0x00007f2ffe7e1904, pid=31718, tid=139843231057664
#
# JRE version: Java(TM) SE Runtime Environment (8.0_72-b15) (build 1.8.0_72-b15)
# Java VM: Java HotSpot(TM) 64-Bit Server VM (25.72-b15 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V  [libjvm.so+0x813904]  PhaseIdealLoop::build_loop_late_post(Node*)+0x144
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/hs_err_pid31718.log
#
# Compiler replay data is saved as:
# /hadoop/nm-local-dir/usercache/ihradmin/appcache/application_1479451766852_3736/container_1479451766852_3736_01_000144/replay_pid31718.log
#
# If you would like to submit a bug report, please visit:
#   http://bugreport.java.com/bugreport/crash.jsp
#
When I go to the node manager, all the logs have already been aggregated since yarn.log-aggregation-enable is true, and hs_err_pid31718.log and replay_pid31718.log cannot be found. Normally 1) the JVM crashes after the reducer has been running for several minutes, 2) sometimes the automatic retry of the reducer succeeds, and 3) some reducers succeed without any failure.
The Hadoop version is 2.6.0 and Java is Java 8. This is not a new environment; we have lots of jobs running on the cluster.
My questions:
Can I find hs_err_pid31718.log anywhere after YARN aggregates the logs and removes the container folder? Or is there a setting that keeps all the local logs, so I can check hs_err_pid31718.log while YARN still aggregates the logs?
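For illustration, this is the kind of setting I am looking for -- a sketch assuming the standard yarn-site.xml property yarn.nodemanager.delete.debug-delay-sec, which (as I understand it) delays the NodeManager's cleanup of a finished container's local directory; the 600-second value is only an example:

<!-- yarn-site.xml: keep the container's local working directory
     (including any hs_err_pid*.log written there) for 10 minutes
     after the container finishes, instead of deleting it immediately -->
<property>
  <name>yarn.nodemanager.delete.debug-delay-sec</name>
  <value>600</value>
</property>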
What are the common steps to narrow down the scope of the investigation? Since the JVM crashed, I cannot see any exception in my code. I have tried the args -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp, but no heap dump appears on the host that failed the reduce tasks. Thanks for any suggestions.
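For reference, this is roughly how I am passing those flags to the reduce-task JVMs, plus the -XX:ErrorFile option I am considering so the fatal-error report is written outside the container directory (a sketch only; mapreduce.reduce.java.opts is the standard MRv2 property, and the /var/tmp path is just an example):

<!-- mapred-site.xml or per-job configuration: JVM options for reduce tasks.
     HeapDumpOnOutOfMemoryError/HeapDumpPath are what I already tried;
     ErrorFile would redirect hs_err_pid<pid>.log to a path that survives
     container cleanup (%p is expanded by HotSpot to the process id). -->
<property>
  <name>mapreduce.reduce.java.opts</name>
  <value>-XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/tmp -XX:ErrorFile=/var/tmp/hs_err_pid%p.log</value>
</property>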
Original: https://stackoverflow.com/questions/45038077