Hadoop backend with millions of records insertion

I am new to Hadoop. Can someone suggest how to upload millions of records to Hadoop? Can I do this with Hive, and where can I see my Hadoop records?
So far I have used Hive to create the database on Hadoop, and I am accessing it at localhost:50070. However, I am unable to load data from a CSV file into Hadoop from the terminal; it gives me this error:

 FAILED: Error in semantic analysis: Line 2:0 Invalid path ''/user/local/hadoop/share/hadoop/hdfs'': No files matching path hdfs://localhost:54310/usr/local/hadoop/share/hadoop/hdfs

Can anyone suggest a way to resolve it?

Source: https://stackoverflow.com/questions/32835802

Updated: 2023-05-01 10:05

Best answer

About "which files have been loaded in a partition":

If you had used an EXTERNAL TABLE and simply uploaded your raw data files into the HDFS directory mapped to its LOCATION, then you could either

(a) run hdfs dfs -ls on that directory from the command line (or use the equivalent Java API call), or
(b) run a Hive query such as select distinct INPUT__FILE__NAME from (...)
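
Both options can be sketched as small shell functions. This is only an illustration: the table name raw_logs and the directory /data/raw_logs are hypothetical and do not come from the question, and `hive -e` is used here simply to run a query string from the command line.

```shell
# Hypothetical names: replace with your own external table and its LOCATION.
TABLE="${TABLE:-raw_logs}"
LOCATION="${LOCATION:-/data/raw_logs}"

# (a) List the raw files sitting in the HDFS directory behind the table.
list_files_in_hdfs() {
    hdfs dfs -ls "$1"
}

# (b) Ask Hive which files the table's rows are read from.
list_files_via_hive() {
    hive -e "SELECT DISTINCT INPUT__FILE__NAME FROM $1"
}

# Usage (needs a working Hadoop/Hive installation):
#   list_files_in_hdfs "$LOCATION"
#   list_files_via_hive "$TABLE"
```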
 
In your case, however, you copy the data into a "managed" table, so there is no way to retrieve the data lineage (i.e. which log file was used to create each managed data file)...
...unless you explicitly add the original file name inside the log file, of course (either in a "special" header record or at the beginning of each record, which can be done with good old sed).
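
A rough sketch of the sed approach, prefixing every record with the name of the file it came from. It assumes comma-separated records and a file name that contains none of the characters '|', '&' or '\' (which sed would treat specially); the function name tag_records is made up for this example.

```shell
# Prefix every record of a file with the file's own base name, so the
# lineage travels with the data. The tagged copy is written to stdout;
# the original file is left untouched.
tag_records() {
    name=$(basename "$1")
    # Insert "<file name>," at the start of each line.
    sed "s|^|${name},|" "$1"
}
```

For example, tag_records server1.log > server1.tagged.log produces a copy whose first column is server1.log, which can then be uploaded in place of the raw file.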
 
 
About "how to automagically avoid duplication on INSERT": there is a way, but it would require quite a bit of re-engineering and would cost you processing time (an extra Map step plus a MapJoin):

 1. Map your log file to an EXTERNAL TABLE so that you can run an INSERT-SELECT query.
 2. Load the original file name into your managed table, using the INPUT__FILE__NAME pseudo-column as the source.
 3. Add a WHERE NOT EXISTS clause with a correlated sub-query, so that if the source file name is already present in the target, nothing more is loaded.

 INSERT INTO TABLE Target
 SELECT ColA, ColB, ColC, INPUT__FILE__NAME AS SrcFileName
 FROM Source src
 WHERE NOT EXISTS (
   SELECT DISTINCT 1
   FROM Target trg
   WHERE trg.SrcFileName = src.INPUT__FILE__NAME
 )

 Note the silly DISTINCT, which is actually required to avoid blowing up the RAM in your Mappers; it would be useless with a mature DBMS like Oracle, but the Hive optimizer is still rather crude...
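
The first step, mapping the raw files to an external table, might look roughly like the sketch below, which prints a HiveQL statement for a given table name and HDFS directory. The STRING column types, the comma delimiter, the directory /data/source and the function name external_table_ddl are all assumptions made for illustration; only the column names ColA, ColB, ColC and the table name Source come from the query above.

```shell
# Print HiveQL that maps an HDFS directory to an external table.
# $1 = table name, $2 = HDFS directory holding the raw files.
# Column types and the field delimiter are assumed, not taken from the post.
external_table_ddl() {
    cat <<EOF
CREATE EXTERNAL TABLE IF NOT EXISTS $1 (
  ColA STRING,
  ColB STRING,
  ColC STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '$2';
EOF
}

# Usage (needs Hive):
#   external_table_ddl Source /data/source > create_source.hql
#   hive -f create_source.hql
```

Because the table is external, dropping it later removes only the metadata and leaves the raw files in place.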
