Best practice for loading multiple clients' data into Hadoop

We are building a proof of concept (POC) on Hadoop with Cloudera CDH, and we want to load data from multiple clients into Hive tables.
Right now, each client has a separate database on SQL Server. That infrastructure will stay in place for OLTP; Hadoop will be used for OLAP. All client databases have exactly the same schema, and some of the main dimension tables are the same for every client, with the same primary key values. Until now this was fine because each client had its own database. Now we are trying to load data from multiple clients into the same data container (Hive tables). If we load data directly into Hive from multiple SQL Server databases through Sqoop jobs, we will end up with multiple rows that share the same primary key value. I am thinking of using a surrogate key in the Hive tables. Hive does not support auto-increment, but it can be achieved with a UDF.
We don't want to modify the SQL Server data because it is live production data.
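To make the collision concrete, here is a minimal sketch in plain Python (not Hive or Sqoop). The table contents, the column names `product_id` and `client_id`, and the client labels are all hypothetical and exist only for illustration. It shows that the source primary key stops being unique once rows from two clients are merged, and that pairing it with a client identifier keeps rows distinguishable. That pairing is shown as one possible alternative to a generated surrogate key, not as an established solution.

```python
# Hypothetical rows as they might arrive from two client databases that share
# one schema; table, column, and client names are illustrative assumptions.
client_a_rows = [{"product_id": 1, "name": "Widget"}, {"product_id": 2, "name": "Gadget"}]
client_b_rows = [{"product_id": 1, "name": "Sprocket"}, {"product_id": 2, "name": "Gizmo"}]


def tag_with_client(rows, client_id):
    """Return copies of the rows with a client_id column added."""
    return [{"client_id": client_id, **row} for row in rows]


def has_duplicate_keys(rows, key_columns):
    """Return True if any two rows share the same values for key_columns."""
    keys = [tuple(row[col] for col in key_columns) for row in rows]
    return len(keys) != len(set(keys))


# Merge both clients into one list, as if loaded into a single shared table.
merged = tag_with_client(client_a_rows, "client_a") + tag_with_client(client_b_rows, "client_b")

# The source primary key alone is no longer unique after merging ...
collides_on_pk = has_duplicate_keys(merged, ["product_id"])
# ... but the pair (client_id, product_id) still identifies each row.
collides_on_composite = has_duplicate_keys(merged, ["client_id", "product_id"])

print(collides_on_pk, collides_on_composite)
```

The source rows are never modified here; the client identifier is attached only to the copies that are merged, which matches the constraint that the SQL Server data stays untouched.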
a. What is the standard/generic way to load data from multiple clients into the Hadoop ecosystem?
b. How can the primary key of a SQL Server table be mapped easily to a Hadoop Hive table?
c. How can we ensure that one client is never able to see another client's data?
Thanks

Source: https://stackoverflow.com/questions/35034754

Updated: 2022-07-30 09:07

Best answer

I remember facing a similar situation.
As advised by @nulltoken, you have to Dispose() the Repository before trying to delete the files it is holding.
A using block should be the best option.
using (var repo = new Repository(repositoryPath))
{
  // Your repo-specific implementation.
}
// The repository has been disposed once the using block exits.

// Code to delete your local temp dir
 
Reference: Clone Fixture from LibGit2Sharp
