Nutch: Job Failed

I have a problem when running the Nutch inject step. This is the command I am running:

bin/nutch inject bin/crawl/crawldb bin/urls

After running the above command, I get the following error:

Injector: starting at 2014-04-02 13:02:29
Injector: crawlDb: bin/crawl/crawldb
Injector: urlDir: bin/urls/seed.txt
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 2
Injector: total number of urls injected after normalization and filtering: 0
Injector: Merging injected urls into crawl db.
Injector: overwrite: false
Injector: update: false
Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:294)
    at org.apache.nutch.crawl.Injector.run(Injector.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:306)

I am running Nutch for the first time. I have checked that Solr and Nutch are installed properly.

The details below are from the log file:

java.io.IOException: The temporary job-output directory file:/usr/share/apache-nutch-1.8/bin/crawl/crawldb/1639805438/_temporary doesn't exist!
    at org.apache.hadoop.mapred.FileOutputCommitter.getWorkPath(FileOutputCommitter.java:250)
    at org.apache.hadoop.mapred.FileOutputFormat.getTaskOutputPath(FileOutputFormat.java:244)
    at org.apache.hadoop.mapred.MapFileOutputFormat.getRecordWriter(MapFileOutputFormat.java:46)
    at org.apache.hadoop.mapred.ReduceTask$OldTrackingRecordWriter.<init>(ReduceTask.java:449)
    at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:491)
    at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:421)
    at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:398)
2014-04-02 12:54:46,251 ERROR crawl.Injector - Injector: java.io.IOException: Job failed!
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1357)
    at org.apache.nutch.crawl.Injector.inject(Injector.java:294)
    at org.apache.nutch.crawl.Injector.run(Injector.java:316)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Injector.main(Injector.java:306)

Source: https://stackoverflow.com/questions/22804590
Updated: 2022-03-16 08:03

Best answer

I assume that your class loading works by pushing class files into the class loader. You should rather let the class loader pull them. To explain what I mean by this, let us look at a short example of normal Java class loading:

class Main {
  public static void main(String[] args) {
    new B();
  }
}
class B extends A { }
class A { }

When creating a new B(), the class loader of Main basically executes classLoader.loadClass("B"). At this point, B's superclass A is not yet loaded, and the class loader cannot know in advance that B has A as its superclass. Thus, the class loader takes responsibility for loading A by calling classLoader.loadClass("A") on itself before the class loading of B is completed.

Let us assume that the class loader did not know about either A or B, but had a method classLoader.inject(String, byte[]) for explicitly loading classes that it receives from an external entity. This calling sequence would then fail:

classLoader.inject("B", bBytes);
classLoader.inject("A", aBytes);

because while loading B, the class loader would not yet know about A.
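The ordering problem can be reproduced in a small, self-contained sketch. Everything here is made up for illustration: PushClassLoader plays the role of the hypothetical loader with an inject method, demo.Sub and demo.Base stand in for B and A, and emptyClass assembles the bytes of an empty class by hand so that no compiler or bytecode library is needed.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical push-style loader: every injected class is defined immediately.
class PushClassLoader extends ClassLoader {
    Class<?> inject(String name, byte[] bytes) {
        return defineClass(name, bytes, 0, bytes.length);
    }
}

class PushOrderDemo {
    // Assembles a minimal class file (no fields, no methods) by hand.
    // Names use the internal form, e.g. "demo/Sub" and "java/lang/Object".
    static byte[] emptyClass(String name, String superName) {
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        byte[] s = superName.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(64 + n.length + s.length);
        buf.putInt(0xCAFEBABE).putShort((short) 0).putShort((short) 52); // magic, minor, major (Java 8)
        buf.putShort((short) 5);                                 // constant pool count (4 entries + 1)
        buf.put((byte) 7).putShort((short) 2);                   // #1 Class -> #2
        buf.put((byte) 1).putShort((short) n.length).put(n);     // #2 Utf8: this class name
        buf.put((byte) 7).putShort((short) 4);                   // #3 Class -> #4
        buf.put((byte) 1).putShort((short) s.length).put(s);     // #4 Utf8: superclass name
        buf.putShort((short) 0x0021);                            // ACC_PUBLIC | ACC_SUPER
        buf.putShort((short) 1).putShort((short) 3);             // this_class, super_class
        buf.putShort((short) 0).putShort((short) 0);             // no interfaces, no fields
        buf.putShort((short) 0).putShort((short) 0);             // no methods, no attributes
        return Arrays.copyOf(buf.array(), buf.position());
    }

    // Injects the subclass first; returns true if defining it fails
    // because the superclass cannot be found.
    static boolean subclassFirstFails() {
        PushClassLoader loader = new PushClassLoader();
        try {
            loader.inject("demo.Sub", emptyClass("demo/Sub", "demo/Base"));
            return false;
        } catch (NoClassDefFoundError e) {
            return true;
        }
    }

    // Injects the superclass first; returns the superclass name of demo.Sub.
    static String superclassFirstWorks() {
        PushClassLoader loader = new PushClassLoader();
        loader.inject("demo.Base", emptyClass("demo/Base", "java/lang/Object"));
        Class<?> sub = loader.inject("demo.Sub", emptyClass("demo/Sub", "demo/Base"));
        return sub.getSuperclass().getName();
    }

    public static void main(String[] args) {
        System.out.println("subclass first fails: " + subclassFirstFails());
        System.out.println("superclass of demo.Sub: " + superclassFirstWorks());
    }
}
```

With a push-style loader the caller has to get the order right for every superclass and interface, which is exactly the bookkeeping the JVM would otherwise do for you.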

What you need to do when implementing your own class loader is to store the class bytes in some sort of map and to implement the class loader's findClass method along these lines:

protected Class<?> findClass(String name) throws ClassNotFoundException {
  byte[] bytes = map.get(name);
  if (bytes != null) {
    return defineClass(name, bytes, 0, bytes.length);
  } else {
    throw new ClassNotFoundException(name);
  }
}

By allowing the class loader to determine the loading order, you avoid this problem altogether.
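As a rough, self-contained sketch of this pull approach: MapClassLoader, addClass, demo.Sub and demo.Base are names invented for the example, and emptyClass again builds the bytes of an empty class by hand instead of reading real class files. The subclass is requested first, and the superclass is pulled from the map on demand.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical pull-style loader: classes are only registered up front
// and defined lazily when they are first requested.
class MapClassLoader extends ClassLoader {
    private final Map<String, byte[]> classes = new HashMap<>();

    void addClass(String name, byte[] bytes) {
        classes.put(name, bytes);
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = classes.get(name);
        if (bytes == null) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytes, 0, bytes.length);
    }
}

class PullOrderDemo {
    // Assembles a minimal class file (no fields, no methods) by hand.
    // Names use the internal form, e.g. "demo/Sub" and "java/lang/Object".
    static byte[] emptyClass(String name, String superName) {
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        byte[] s = superName.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(64 + n.length + s.length);
        buf.putInt(0xCAFEBABE).putShort((short) 0).putShort((short) 52); // magic, minor, major (Java 8)
        buf.putShort((short) 5);                                 // constant pool count (4 entries + 1)
        buf.put((byte) 7).putShort((short) 2);                   // #1 Class -> #2
        buf.put((byte) 1).putShort((short) n.length).put(n);     // #2 Utf8: this class name
        buf.put((byte) 7).putShort((short) 4);                   // #3 Class -> #4
        buf.put((byte) 1).putShort((short) s.length).put(s);     // #4 Utf8: superclass name
        buf.putShort((short) 0x0021);                            // ACC_PUBLIC | ACC_SUPER
        buf.putShort((short) 1).putShort((short) 3);             // this_class, super_class
        buf.putShort((short) 0).putShort((short) 0);             // no interfaces, no fields
        buf.putShort((short) 0).putShort((short) 0);             // no methods, no attributes
        return Arrays.copyOf(buf.array(), buf.position());
    }

    // Registers both classes, then asks for the subclass first.
    // Returns the name of the superclass the loader pulled in on its own.
    static String loadSubclassFirst() {
        MapClassLoader loader = new MapClassLoader();
        loader.addClass("demo.Sub", emptyClass("demo/Sub", "demo/Base"));
        loader.addClass("demo.Base", emptyClass("demo/Base", "java/lang/Object"));
        try {
            Class<?> sub = loader.loadClass("demo.Sub");
            return sub.getSuperclass().getName();
        } catch (ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("superclass of demo.Sub: " + loadSubclassFirst());
    }
}
```

Registration order no longer matters here, because nothing is defined until loadClass is called and the superclass lookup goes through the same findClass.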

To be even more precise, you need to do manipulation and loading in two steps where a pseudo algorithm would look something like this:

MyClassLoader classLoader = new MyClassLoader();
// First we generate ALL classes that the class loader is supposed to load.
// We then make these classes accessible to the class loader.
Enumeration<JarEntry> entries = jarFile.entries();
while (entries.hasMoreElements()) {
  JarEntry element = entries.nextElement();
  if (element.getName().endsWith(".class")) {
     // Class Manipulation via ASM
     classLoader.addClass( ... );
  }
}
// Now that the class loader knows about all classes that are to be loaded
// we trigger the loading process. That way, the class loader can query
// itself about ANY class that it should know.
// An Enumeration can only be traversed once, so we request a fresh one.
entries = jarFile.entries();
while (entries.hasMoreElements()) {
  JarEntry element = entries.nextElement();
  if (element.getName().endsWith(".class")) {
     classLoader.loadClass( ... );
  }
}
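A runnable variant of the two steps might look like the sketch below. It rests on several assumptions: JarMapClassLoader is a made-up stand-in for MyClassLoader, the jar is a throwaway file written by the demo itself with the subclass entry placed before its superclass, the ASM manipulation step is left out, and readAllBytes requires Java 9 or later.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Enumeration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.jar.JarEntry;
import java.util.jar.JarFile;
import java.util.jar.JarOutputStream;

// Hypothetical stand-in for the MyClassLoader of the pseudo algorithm.
class JarMapClassLoader extends ClassLoader {
    private final Map<String, byte[]> classes = new HashMap<>();

    void addClass(String name, byte[] bytes) {
        classes.put(name, bytes);
    }

    @Override
    protected Class<?> findClass(String name) throws ClassNotFoundException {
        byte[] bytes = classes.get(name);
        if (bytes == null) {
            throw new ClassNotFoundException(name);
        }
        return defineClass(name, bytes, 0, bytes.length);
    }
}

class TwoPhaseJarDemo {
    // Assembles a minimal class file (no fields, no methods) by hand;
    // names use the internal form, e.g. "demo/Sub".
    static byte[] emptyClass(String name, String superName) {
        byte[] n = name.getBytes(StandardCharsets.US_ASCII);
        byte[] s = superName.getBytes(StandardCharsets.US_ASCII);
        ByteBuffer buf = ByteBuffer.allocate(64 + n.length + s.length);
        buf.putInt(0xCAFEBABE).putShort((short) 0).putShort((short) 52); // magic, minor, major
        buf.putShort((short) 5);                                 // constant pool count
        buf.put((byte) 7).putShort((short) 2);                   // #1 Class -> #2
        buf.put((byte) 1).putShort((short) n.length).put(n);     // #2 Utf8: this class name
        buf.put((byte) 7).putShort((short) 4);                   // #3 Class -> #4
        buf.put((byte) 1).putShort((short) s.length).put(s);     // #4 Utf8: superclass name
        buf.putShort((short) 0x0021).putShort((short) 1).putShort((short) 3); // flags, this, super
        buf.putShort((short) 0).putShort((short) 0).putShort((short) 0).putShort((short) 0);
        return Arrays.copyOf(buf.array(), buf.position());
    }

    // Writes a throwaway jar whose subclass entry precedes its superclass entry.
    static Path writeDemoJar() throws IOException {
        Path jar = Files.createTempFile("two-phase-demo", ".jar");
        jar.toFile().deleteOnExit();
        try (JarOutputStream out = new JarOutputStream(Files.newOutputStream(jar))) {
            out.putNextEntry(new JarEntry("demo/Sub.class"));
            out.write(emptyClass("demo/Sub", "demo/Base"));
            out.closeEntry();
            out.putNextEntry(new JarEntry("demo/Base.class"));
            out.write(emptyClass("demo/Base", "java/lang/Object"));
            out.closeEntry();
        }
        return jar;
    }

    // Step 1 registers every class of the jar, step 2 loads them in jar order.
    static List<String> loadAll() {
        JarMapClassLoader loader = new JarMapClassLoader();
        List<String> names = new ArrayList<>();
        List<String> loaded = new ArrayList<>();
        try (JarFile jarFile = new JarFile(writeDemoJar().toFile())) {
            Enumeration<JarEntry> entries = jarFile.entries();
            while (entries.hasMoreElements()) {
                JarEntry entry = entries.nextElement();
                String path = entry.getName();
                if (path.endsWith(".class")) {
                    String name = path.substring(0, path.length() - 6).replace('/', '.');
                    try (InputStream in = jarFile.getInputStream(entry)) {
                        loader.addClass(name, in.readAllBytes()); // Java 9+
                    }
                    names.add(name);
                }
            }
            for (String name : names) {
                loaded.add(loader.loadClass(name).getName());
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
        return loaded;
    }
}
```

Calling TwoPhaseJarDemo.loadAll() should return the class names in jar order, even though demo.Sub is stored before demo.Base, because every class is registered before the first one is defined.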
