Comparing two text files of almost 3GB using Map Reduce (Cloudera Hadoop 0.20.2)

I'm trying to do the following in Hadoop map/reduce (written in Java, running on Linux):
 
 The text files 'rules-1' and 'rules-2' (3GB in total) contain rules, one per line, so they can be read with readLine(). 
 Both files need to be available in full from HDFS to every map function in my cluster, i.e. they cannot be split across map functions. 
 The input to the mapper's map function is a text file called 'record' (one record per line), from which we get the (key, value) pairs. This file is splittable and can be divided among the map functions used in the job. 
 Each value (i.e. each line of the record file) needs to be compared with the rules in 'rules-1' and 'rules-2'. 
 
The problem is that if I read every line of rules-1 and rules-2 into a static ArrayList only once, so that each mapper can share the same ArrayList, and then compare its elements with each input value from the record file, I get an out-of-memory error, because 3GB cannot be held in the ArrayList at once. 
Alternatively, if I load only a few lines of rules-1 and rules-2 at a time and compare them with each value, the map/reduce job takes a very long time to finish. 
Could you suggest any alternative way to do this without running out of memory? Would it help to put those two files into a database that runs on HDFS, or something similar? I'm running out of ideas and would really appreciate your suggestions.
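
For illustration only, the second approach described above (loading a few rule lines at a time) might look roughly like the plain-Java sketch below. It deliberately leaves Hadoop out so it can be run on its own. The class name RuleBatchMatcher, the method matchesAnyRule, and the batchSize parameter are hypothetical, and since the question does not say what "compare" means, exact string equality is assumed.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not a Hadoop API): compares one record against a
// rules file while holding at most batchSize rule lines in memory.
class RuleBatchMatcher {

    // Assumption: a record "matches" when it is equal to a rule line.
    static boolean matchesAnyRule(String record, Path rulesFile, int batchSize)
            throws IOException {
        if (batchSize < 1) {
            throw new IllegalArgumentException("batchSize must be >= 1");
        }
        List<String> batch = new ArrayList<>(batchSize);
        try (BufferedReader reader =
                     Files.newBufferedReader(rulesFile, StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                batch.add(line);
                if (batch.size() == batchSize) {
                    // Compare against the current batch, then discard it.
                    if (batch.contains(record)) {
                        return true;
                    }
                    batch.clear();
                }
            }
        }
        // Check the final, possibly partial, batch.
        return batch.contains(record);
    }

    public static void main(String[] args) throws IOException {
        // Tiny stand-in for 'rules-1'; the real file would live on HDFS.
        Path rules = Files.createTempFile("rules-1", ".txt");
        Files.write(rules, Arrays.asList("rule-a", "rule-b", "rule-c"),
                StandardCharsets.UTF_8);
        System.out.println(matchesAnyRule("rule-c", rules, 2)); // prints true
        System.out.println(matchesAnyRule("rule-x", rules, 2)); // prints false
        Files.delete(rules);
    }
}
```

Memory stays bounded by batchSize, but every record triggers a fresh pass over the rules file, so the work grows with (number of records) × (number of rules). That is consistent with the long running time reported above.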

Source: https://stackoverflow.com/questions/5606927

Updated: 2023-08-31 07:08

Best answer

Short answer: No, project.lock.json should not be checked into source control. Configure your version control system to ignore it (e.g. add it to .gitignore if you're using git). 
Long answer: project.lock.json contains a snapshot of the project's whole dependency tree: not just the packages listed in the "dependencies" sections, but also all resolved dependencies of those dependencies, and so on. But it is not like Ruby's Gemfile.lock. Unlike Gemfile.lock, project.lock.json doesn't tell dotnet restore which exact package versions to restore; it simply gets overwritten. As such, it should be treated like a cache file and never checked into source control. 
If you do check it into version control, then most likely on another machine: 
 
 dotnet will think that all packages are restored when in fact some may be missing, and the build will fail without hinting that the developer should run dotnet restore 
 project.lock.json will be overwritten during dotnet restore and in most cases will differ from the version stored in source control, so it will be modified in almost every commit 
 project.lock.json will cause conflicts during merges
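
As a minimal sketch of the ignore step from the short answer, assuming git is the version control system, the .gitignore entry could look like this:

```gitignore
# Assumption: ignore the generated lock file in every project directory
project.lock.json
```

A pattern without a slash matches the file at any depth in the repository. If the file has already been committed, the ignore rule alone does not untrack it; it would also need to be removed from the index, for example with git rm --cached.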