public interface InputFormat<K, V> {
    InputSplit[] getSplits(JobConf job, int numSplits) throws IOException;

    RecordReader<K, V> getRecordReader(InputSplit split,
                                       JobConf job,
                                       Reporter reporter) throws IOException;
}

These two methods do the following work:

The getSplits method cuts the input data into splits. The number of splits is the number of map tasks, and by default each split is the size of one block, i.e. 64 MB.
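To make the split arithmetic concrete, here is a minimal sketch that does not use Hadoop at all: it cuts a file of a given length into consecutive byte ranges of at most one block each. The names SplitSketch, Range and computeSplits are hypothetical, and the sketch deliberately ignores the requested split count (numSplits) and any minimum split size that a real implementation may also take into account.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified, Hadoop-free illustration of cutting a file into splits.
// SplitSketch and Range are hypothetical names, not Hadoop classes.
class SplitSketch {

    // A split described as the byte range [start, start + length).
    static final class Range {
        final long start;
        final long length;

        Range(long start, long length) {
            this.start = start;
            this.length = length;
        }
    }

    // Cut fileLength bytes into consecutive ranges of at most blockSize bytes.
    static List<Range> computeSplits(long fileLength, long blockSize) {
        if (blockSize <= 0) {
            throw new IllegalArgumentException("blockSize must be positive");
        }
        List<Range> splits = new ArrayList<>();
        for (long start = 0; start < fileLength; start += blockSize) {
            splits.add(new Range(start, Math.min(blockSize, fileLength - start)));
        }
        return splits;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;   // 64 MB block
        long fileLength = 200L * 1024 * 1024; // a 200 MB input file
        for (Range r : computeSplits(fileLength, blockSize)) {
            System.out.println("start=" + r.start + " length=" + r.length);
        }
    }
}
```

For a 200 MB file and a 64 MB block this sketch produces four ranges (three full blocks plus an 8 MB remainder), so under the one-split-per-map-task rule four map tasks would run.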

The getRecordReader method parses each split into records, and then parses each record in turn into a <K,V> pair.

In other words, InputFormat performs the following conversion:

InputFile --> splits --> <K,V>

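Which InputFormat implementations does the system commonly provide?

Of these, TextInputFormat is the most commonly used. Its <K,V> pair stands for <offset of the line, content of the line>.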
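The <offset, line content> pairing can be illustrated with a small Hadoop-free sketch. OffsetLineSketch and Record are hypothetical names, the input is an in-memory byte array rather than a real split, and only "\n" is treated as a line terminator; it is meant to show the shape of the pairs, not how TextInputFormat is implemented.

```java
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hadoop-free illustration of <byte offset, line content> pairs.
// OffsetLineSketch and Record are hypothetical names.
class OffsetLineSketch {

    static final class Record {
        final long offset; // byte offset of the first byte of the line
        final String line; // line content without the trailing newline

        Record(long offset, String line) {
            this.offset = offset;
            this.line = line;
        }
    }

    // Scan the bytes once and emit one record per '\n'-terminated line.
    static List<Record> readRecords(byte[] data) {
        List<Record> records = new ArrayList<>();
        int lineStart = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == '\n') {
                records.add(new Record(lineStart,
                        new String(data, lineStart, i - lineStart, StandardCharsets.UTF_8)));
                lineStart = i + 1;
            }
        }
        // A final line that has no trailing newline is still a record.
        if (lineStart < data.length) {
            records.add(new Record(lineStart,
                    new String(data, lineStart, data.length - lineStart, StandardCharsets.UTF_8)));
        }
        return records;
    }

    public static void main(String[] args) {
        byte[] data = "hello\nworld\nhadoop".getBytes(StandardCharsets.UTF_8);
        for (Record r : readRecords(data)) {
            System.out.println(r.offset + " -> " + r.line);
        }
    }
}
```

For the input "hello\nworld\nhadoop" the keys are 0, 6 and 12, because each of the first two lines occupies five bytes plus one newline byte.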

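However, the fixed ways the system provides for converting an InputFile into <K,V> pairs sometimes do not meet our needs.

In that case we need to define a custom InputFormat, so that the Hadoop framework parses the InputFile into <K,V> pairs in the way we specify.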

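Before working through a custom InputFormat, we need to understand the following abstract class, interfaces and classes, and how they relate to one another:

InputFormat (interface), FileInputFormat (abstract class), TextInputFormat (class), RecordReader (interface), LineRecordReader (class)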

FileInputFormat implements InputFormat

TextInputFormat extends FileInputFormat

TextInputFormat.getRecordReader calls LineRecordReader

LineRecordReader implements RecordReader
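The four relationships above can be mirrored in a toy hierarchy. Everything in the sketch below is a hypothetical stand-in (the Mini* types are not Hadoop classes, splits are plain strings, and a blank line is an arbitrary toy rule for where one split ends); it only shows which type implements or extends which, and where the record reader gets created.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical stand-ins that mirror the relationships only; not Hadoop's API.
interface MiniRecordReader {
    // Returns the next record, or null when the split is exhausted.
    String next();
}

interface MiniInputFormat {
    List<String> getSplits(String input);

    MiniRecordReader getRecordReader(String split);
}

// Plays the role of FileInputFormat: provides getSplits, leaves getRecordReader open.
abstract class MiniFileInputFormat implements MiniInputFormat {
    @Override
    public List<String> getSplits(String input) {
        // Toy rule for this sketch: a blank line separates two splits.
        return Arrays.asList(input.split("\n\n"));
    }
}

// Plays the role of LineRecordReader: one record per line of the split.
class MiniLineRecordReader implements MiniRecordReader {
    private final Iterator<String> lines;

    MiniLineRecordReader(String split) {
        this.lines = Arrays.asList(split.split("\n")).iterator();
    }

    @Override
    public String next() {
        return lines.hasNext() ? lines.next() : null;
    }
}

// Plays the role of TextInputFormat: only supplies the record reader.
class MiniTextInputFormat extends MiniFileInputFormat {
    @Override
    public MiniRecordReader getRecordReader(String split) {
        return new MiniLineRecordReader(split);
    }
}

class HierarchySketch {
    // Walks every split and collects every record, as a framework would.
    static List<String> readAll(MiniInputFormat format, String input) {
        List<String> records = new ArrayList<>();
        for (String split : format.getSplits(input)) {
            MiniRecordReader reader = format.getRecordReader(split);
            for (String r = reader.next(); r != null; r = reader.next()) {
                records.add(r);
            }
        }
        return records;
    }
}
```

A custom format in this toy model would be another subclass of MiniFileInputFormat that returns a different MiniRecordReader, which is the same division of labour the real classes follow.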

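The InputFormat interface has already been described in detail above.

Now look at FileInputFormat. It implements the getSplits method of the InputFormat interface and leaves getRecordReader and isSplitable to concrete classes such as TextInputFormat. isSplitable usually does not need to be changed, so a custom InputFormat only has to implement the getRecordReader method. In TextInputFormat the core of that method is to call LineRecordReader: the LineRecordReader class is what actually does the work of "parsing each split into records, and then parsing each record in turn into a <K,V> pair". That class implements the RecordReader interface: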

public interface RecordReader<K, V> {
    boolean next(K key, V value) throws IOException;
    K createKey();
    V createValue();
    long getPos() throws IOException;
    public void close() throws IOException;
    float getProgress() throws IOException;
}
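The method set above implies a particular calling pattern: the caller creates one key object and one value object, then repeatedly asks next to refill them until it returns false. The sketch below illustrates that pattern without Hadoop. SketchRecordReader is a stand-in that copies the method list shown above, TabSeparatedReader and RecordReaderLoopSketch are hypothetical, StringBuilder is used only as a convenient mutable holder, and getPos here counts records rather than bytes.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Stand-in with the same method set as the RecordReader interface listed above.
interface SketchRecordReader<K, V> {
    boolean next(K key, V value) throws IOException;
    K createKey();
    V createValue();
    long getPos() throws IOException;
    void close() throws IOException;
    float getProgress() throws IOException;
}

// Hypothetical custom reader: turns "key<TAB>value" lines into <key, value> pairs.
class TabSeparatedReader implements SketchRecordReader<StringBuilder, StringBuilder> {
    private final String[] lines;
    private int index = 0;

    TabSeparatedReader(String splitContent) {
        this.lines = splitContent.isEmpty() ? new String[0] : splitContent.split("\n");
    }

    @Override
    public boolean next(StringBuilder key, StringBuilder value) {
        if (index >= lines.length) {
            return false; // split exhausted
        }
        String line = lines[index++];
        int tab = line.indexOf('\t');
        // Reuse the caller's objects instead of allocating new ones per record.
        key.setLength(0);
        value.setLength(0);
        if (tab < 0) {
            key.append(line); // no tab: whole line is the key, value stays empty
        } else {
            key.append(line, 0, tab);
            value.append(line, tab + 1, line.length());
        }
        return true;
    }

    @Override public StringBuilder createKey() { return new StringBuilder(); }
    @Override public StringBuilder createValue() { return new StringBuilder(); }
    @Override public long getPos() { return index; } // records consumed in this sketch
    @Override public void close() { }
    @Override public float getProgress() {
        return lines.length == 0 ? 1.0f : (float) index / lines.length;
    }
}

class RecordReaderLoopSketch {
    // Drives a reader the way a caller is expected to: create once, refill in a loop.
    static List<String> drain(SketchRecordReader<StringBuilder, StringBuilder> reader) {
        List<String> pairs = new ArrayList<>();
        try {
            StringBuilder key = reader.createKey();
            StringBuilder value = reader.createValue();
            try {
                while (reader.next(key, value)) {
                    pairs.add(key + "=" + value); // a map function would be called here
                }
            } finally {
                reader.close();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return pairs;
    }
}
```

Because the same key and value objects are refilled on every call, the loop copies them into a String before storing them; keeping references to the holders themselves would leave every stored entry pointing at the last record.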
