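Hadoop 0.20.2 has no new-API version of MultipleOutputFormat, the class used to write job output to multiple files. The old class from 0.19.2, org.apache.hadoop.mapred.lib.MultipleOutputFormat, still works in 0.20.2, but everything under org.apache.hadoop.mapred is marked deprecated and may no longer be usable in the next Hadoop release. Hadoop 0.20.2 recommends Configuration in place of JobConf, yet the old org.apache.hadoop.mapred.lib.MultipleOutputFormat still depends on JobConf, so there is no replacement API for it yet.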

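Hadoop 0.20.2 is also only an intermediate release. Not every API has been migrated to the new form, so anything that is missing has to be written by hand.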

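Rewriting MultipleOutputFormat takes two classes:

LineRecordWriter

MultipleOutputFormat

PartitionByFilenameOutputFormat is the custom class needed for this experiment; it writes the results for each file to its own output file.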
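To make the idea of per-file output concrete, here is a minimal sketch that uses only the JDK and no Hadoop classes. It assumes the usual approach: derive a file name from each record, then keep one lazily created writer per name. The class MultiOutputSketch, the two-argument generateFileNameForKeyValue, and the "key.txt" naming rule are all hypothetical and only loosely modeled on the old API; in-memory buffers stand in for HDFS output streams.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the Hadoop class: routes records to
// per-name outputs, which is the core idea behind a multiple-output format.
class MultiOutputSketch {
    // One in-memory "file" per generated name; a real job would hold
    // one HDFS output stream (wrapped in a RecordWriter) per name.
    private final Map<String, ByteArrayOutputStream> outputs = new TreeMap<>();

    // Assumed naming rule for this sketch: one file per distinct key.
    String generateFileNameForKeyValue(String key, String value) {
        return key + ".txt";
    }

    // Looks up (or lazily creates) the output for this record and appends
    // the record as a tab-separated line.
    void write(String key, String value) {
        String name = generateFileNameForKeyValue(key, value);
        ByteArrayOutputStream out =
                outputs.computeIfAbsent(name, n -> new ByteArrayOutputStream());
        byte[] line = (key + "\t" + value + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(line, 0, line.length);
    }

    // Returns everything written to the named output, or null if it was never opened.
    String contentsOf(String name) {
        ByteArrayOutputStream out = outputs.get(name);
        return out == null ? null : new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    int fileCount() {
        return outputs.size();
    }
}
```

Writing the records (a, 1), (b, 2), (a, 3) through this sketch opens two outputs, a.txt and b.txt, with both "a" records landing in the first. Creating the writers lazily means no empty files are produced for names that never occur.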

LineRecordWriter：

package cn.xmu.dm;

import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UnsupportedEncodingException;

import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;

// Writes each record as one text line: key, separator, value, line ending.
public class LineRecordWriter<K, V> extends RecordWriter<K, V> {
    private static final String utf8 = "UTF-8";
    protected DataOutputStream out;
    private final byte[] keyValueSeparator;

    public LineRecordWriter(DataOutputStream out, String keyValueSeparator) {
        this.out = out;
        try {
            this.keyValueSeparator = keyValueSeparator.getBytes(utf8);
        } catch (UnsupportedEncodingException uee) {
            throw new IllegalArgumentException("can't find " + utf8
                    + " encoding");
        }
    }

    // Defaults to a tab separator (the original "/t" wrote a literal slash and 't').
    public LineRecordWriter(DataOutputStream out) {
        this(out, "\t");
    }

    // Text already holds UTF-8 bytes, so write them directly;
    // anything else is converted through toString().
    private void writeObject(Object o) throws IOException {
        if (o instanceof Text) {
            Text to = (Text) o;
            out.write(to.getBytes(), 0, to.getLength());
        } else {
            out.write(o.toString().getBytes(utf8));
        }
    }

    @Override
    public synchronized void write(K key, V value) throws IOException {
        boolean nullKey = key == null || key instanceof NullWritable;
        boolean nullValue = value == null || value instanceof NullWritable;
        // Nothing to write if both key and value are absent.
        if (nullKey && nullValue) {
            return;
        }
        if (!nullKey) {
            writeObject(key);
        }
        // The separator is written only when both key and value are present.
        if (!(nullKey || nullValue)) {
            out.write(keyValueSeparator);
        }
        if (!nullValue) {
            writeObject(value);
        }
        // Records end with CRLF, encoded explicitly rather than with the platform default.
        out.write("\r\n".getBytes(utf8));
    }

    @Override
    public synchronized void close(TaskAttemptContext context)
            throws IOException {
        out.close();
    }
}
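
The line format that LineRecordWriter produces can be checked without a Hadoop cluster. The sketch below is a JDK-only stand-in that mirrors the null handling and separator logic of write(); it is not the Hadoop class. The names LineFormatDemo, writeRecord, and render are hypothetical, and plain null checks replace the NullWritable checks.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in that reproduces the record layout of LineRecordWriter.
class LineFormatDemo {
    // Writes key, separator (only when both sides are present), value, then CRLF.
    static void writeRecord(DataOutputStream out, Object key, Object value,
                            String separator) throws IOException {
        boolean nullKey = key == null;
        boolean nullValue = value == null;
        if (nullKey && nullValue) {
            return; // nothing to emit
        }
        if (!nullKey) {
            out.write(key.toString().getBytes(StandardCharsets.UTF_8));
        }
        if (!nullKey && !nullValue) {
            out.write(separator.getBytes(StandardCharsets.UTF_8));
        }
        if (!nullValue) {
            out.write(value.toString().getBytes(StandardCharsets.UTF_8));
        }
        out.write("\r\n".getBytes(StandardCharsets.UTF_8));
    }

    // Renders a single record into a String so the exact bytes can be inspected.
    static String render(Object key, Object value, String separator) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buffer)) {
            writeRecord(out, key, value, separator);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String(buffer.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Show the escaped form so the separator and line ending are visible.
        String line = render("word", 3, "\t");
        System.out.println(line.replace("\t", "\\t").replace("\r\n", "\\r\\n"));
    }
}
```

Rendering the same record with "/t" as the separator yields the two literal characters '/' and 't' between key and value, which is why the default separator has to be the escape sequence "\t".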
