Implementing Custom Multi-File Output in MapReduce

2019-03-28 13:33 | Source: Internet

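An ordinary MapReduce job usually has two phases, map and reduce. With no extra configuration, the results are written as multiple files named part-000*, one file per reduce task, and the format of the file contents cannot be chosen freely. This makes downstream processing of the results inconvenient.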
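To make the default naming concrete, here is a minimal sketch in plain Java (no Hadoop dependency) that mimics how the default part files are named. The class and method names are hypothetical. It assumes the new-API pattern `part-r-NNNNN`; the old API produces `part-NNNNN` instead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class DefaultPartNames {
    /** Mimics the default reducer output name for one partition, e.g. part-r-00000. */
    public static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-r-%05d", partition);
    }

    /** One output file per reduce task, so the file count equals the reducer count. */
    public static List<String> outputFiles(int numReduceTasks) {
        List<String> names = new ArrayList<>();
        for (int p = 0; p < numReduceTasks; p++) {
            names.add(partFileName(p));
        }
        return names;
    }

    public static void main(String[] args) {
        // prints [part-r-00000, part-r-00001, part-r-00002]
        System.out.println(outputFiles(3));
    }
}
```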

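In Hadoop, a reduce task can write to multiple output files, and the file names can be controlled: extend the MultipleTextOutputFormat class and override its generateFileNameForKeyValue method. If all you need is control over the output file names, implement your own LogNameMultipleTextOutputFormat class and register it with jobconf.setOutputFormat(LogNameMultipleTextOutputFormat.class);. This approach only works with the old Hadoop API, however. If you want to use the new API, or have further requirements such as a custom format for the output contents, you have to rewrite some of the Hadoop API classes yourself.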
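In the old API, the naming override receives the key, the value and the default leaf name (such as part-00000), and returns the file name to use. The naming rule itself is ordinary string logic, so it can be sketched and tested without Hadoop. The class below is a hypothetical example of such a rule (one file per log name, falling back to the default name). It is an illustration under those assumptions, not the rule the original LogNameMultipleTextOutputFormat used.

```java
public class LogFileNaming {
    /**
     * Hypothetical naming rule: one output file per log name.
     * Characters that are unsafe in file names are replaced with '_'.
     * Falls back to the default leaf name when the log name is empty.
     */
    public static String fileNameFor(String logName, String defaultName) {
        if (logName == null || logName.trim().isEmpty()) {
            return defaultName;
        }
        String safe = logName.trim().replaceAll("[^A-Za-z0-9._-]", "_");
        return safe + ".log";
    }

    public static void main(String[] args) {
        System.out.println(fileNameFor("access/2012-07-08", "part-00000")); // access_2012-07-08.log
        System.out.println(fileNameFor("", "part-00000"));                  // part-00000
    }
}
```

An override of the naming method would then simply delegate to a rule like this, passing the key's string form and the default name.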

First, build your own MultipleOutputFormat class that extends FileOutputFormat (note that this is the FileOutputFormat in the org.apache.hadoop.mapreduce.lib.output package).

import java.io.DataOutputStream;
import java.io.IOException;
import java.util.HashMap;
import java.util.Iterator;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Writable;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.compress.CompressionCodec;
import org.apache.hadoop.io.compress.GzipCodec;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.RecordWriter;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.ReflectionUtils;
/**
 * This abstract class extends FileOutputFormat, allowing the output data to be
 * written to different output files.
 * Created on 2012-07-08
 *
 * @author zhoulongliu
 * @param <K> output key type
 * @param <V> output value type
 */
public abstract class MultipleOutputFormat<K extends WritableComparable<?>, V extends Writable> extends
FileOutputFormat<K, V> {
    // Abstract class: the calling program must implement generateFileNameForKeyValue to supply the file name
    private MultiRecordWriter writer = null;

    @Override
    public RecordWriter<K, V> getRecordWriter(TaskAttemptContext job) throws IOException, InterruptedException {
        // Create the multi-file writer once and reuse it for the rest of the task
        if (writer == null) {
            writer = new MultiRecordWriter(job, getTaskOutputPath(job));
        }
        return writer;
    }

    /**
     * Get the task output path.
     *
     * @param conf the task attempt context
     * @return the committer's work path, or the job output path if the
     *         committer is not a FileOutputCommitter
     * @throws IOException if no job output path is defined
     */
    private Path getTaskOutputPath(TaskAttemptContext conf) throws IOException {
        Path workPath = null;
        OutputCommitter committer = super.getOutputCommitter(conf);
        if (committer instanceof FileOutputCommitter) {
            workPath = ((FileOutputCommitter) committer).getWorkPath();
        } else {
            Path outputPath = super.getOutputPath(conf);
            if (outputPath == null) {
                throw new IOException("Undefined job output-path");
            }
            workPath = outputPath;
        }
        return workPath;
    }

    /**
     * Determine the output file name (including extension) from the key,
     * value and conf.
     *
     * @param key the key of the output data
     * @param value the value of the output data
     * @param conf the configuration object
     * @return generated file name
     */
    protected abstract String generateFileNameForKeyValue(K key, V value, Configuration conf);

    /**
     * RecordWriter implementation (inner class).
     *
     * @author zhoulongliu
     */
    public class MultiRecordWriter extends RecordWriter<K, V> {
        /** Cache of RecordWriters, keyed by output file name */
        private HashMap<String, RecordWriter<K, V>> recordWriters = null;
        private TaskAttemptContext job = null;
        /** Output directory */
        private Path workPath = null;

        public MultiRecordWriter(TaskAttemptContext job, Path workPath) {
            super();
            this.job = job;
            this.workPath = workPath;
            recordWriters = new HashMap<String, RecordWriter<K, V>>();
        }

        @Override
        public void close(TaskAttemptContext context) throws IOException, InterruptedException {
            // Close every cached writer, then empty the cache
            Iterator<RecordWriter<K, V>> values = this.recordWriters.values().iterator();
            while (values.hasNext()) {
                values.next().close(context);
            }
            this.recordWriters.clear();
        }

        @Override
        public void write(K key, V value) throws IOException, InterruptedException {
            // Get the output file name
            String baseName = generateFileNameForKeyValue(key, value, job.getConfiguration());
            // If recordWriters has no writer for this file name, create one; otherwise write directly
            RecordWriter<K, V> rw = this.recordWriters.get(baseName);
            if (rw == null) {
                rw = getBaseRecordWriter(job, baseName);
                this.recordWriters.put(baseName, rw);
            }
            rw.write(key, value);
        }

        // ${mapred.out.dir}/_temporary/_${taskid}/${nameWithExtension}
        private RecordWriter<K, V> getBaseRecordWriter(TaskAttemptContext job, String baseName) throws IOException,
                InterruptedException {
            Configuration conf = job.getConfiguration();
            // Check whether output compression is enabled
            boolean isCompressed = getCompressOutput(job);
            String keyValueSeparator = ",";
            RecordWriter<K, V> recordWriter = null;
            if (isCompressed) {
                // Use the configured codec (Gzip by default) and append its file extension
                Class<? extends CompressionCodec> codecClass = getOutputCompressorClass(job, GzipCodec.class);
                CompressionCodec codec = ReflectionUtils.newInstance(codecClass, conf);
                Path file = new Path(workPath, baseName + codec.getDefaultExtension());
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                // Here I use my own custom writer (LineRecordWriter)
                recordWriter = new LineRecordWriter<K, V>(new DataOutputStream(codec.createOutputStream(fileOut)),
                        keyValueSeparator);
            } else {
                Path file = new Path(workPath, baseName);
                FSDataOutputStream fileOut = file.getFileSystem(conf).create(file, false);
                // Here I use my own custom writer (LineRecordWriter)
                recordWriter = new LineRecordWriter<K, V>(fileOut, keyValueSeparator);
            }
            return recordWriter;
        }
    }
}
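
The core idea of MultiRecordWriter, a cache of per-file writers created lazily from the generated file name, can be tried out without a cluster. The following is a simplified plain-Java model of that idea, not the Hadoop class itself: MultiWriterSketch and its methods are hypothetical names, in-memory buffers stand in for HDFS files, and a lambda stands in for generateFileNameForKeyValue. Lines are written as key, separator, value, newline, matching the "," separator used above.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.BiFunction;

public class MultiWriterSketch {
    /** Cache of open outputs, keyed by file name (in-memory stand-ins for files). */
    private final Map<String, ByteArrayOutputStream> outputs = new LinkedHashMap<>();
    /** Stand-in for generateFileNameForKeyValue: maps (key, value) to a file name. */
    private final BiFunction<String, String, String> namer;
    private final String separator;

    public MultiWriterSketch(BiFunction<String, String, String> namer, String separator) {
        this.namer = namer;
        this.separator = separator;
    }

    /** Routes one record to the output chosen by the namer, creating it on first use. */
    public void write(String key, String value) {
        String name = namer.apply(key, value);
        ByteArrayOutputStream out = outputs.computeIfAbsent(name, n -> new ByteArrayOutputStream());
        byte[] line = (key + separator + value + "\n").getBytes(StandardCharsets.UTF_8);
        out.write(line, 0, line.length);
    }

    /** Names of all outputs created so far, in creation order. */
    public List<String> fileNames() {
        return new ArrayList<>(outputs.keySet());
    }

    /** Text written to one output so far, or null if it was never created. */
    public String contents(String name) {
        ByteArrayOutputStream out = outputs.get(name);
        return out == null ? null : new String(out.toByteArray(), StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        MultiWriterSketch writer = new MultiWriterSketch((k, v) -> k + ".csv", ",");
        writer.write("error", "disk full");
        writer.write("info", "job started");
        writer.write("error", "timeout");
        System.out.println(writer.fileNames());             // [error.csv, info.csv]
        System.out.print(writer.contents("error.csv"));     // two lines, both with key "error"
    }
}
```

Because the cache holds one open writer per distinct file name, a naming rule that yields many distinct names per task keeps that many outputs open at once, which is worth keeping in mind when choosing the rule.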
