Huge insert to HBase

I have an issue when I try to insert data into HBase.
I have a Spark DataFrame of 12 million rows with 2 fields:
* KEY, an MD5 hash
* MATCH, a boolean stored as a string ("1" or "0")
 
I need to store it in an HBase table, where KEY is the rowkey and MATCH is a column.
I created the table pre-split on the rowkey:
create 'GTH_TEST', 'GTH_TEST', {SPLITS=> ['10000000000000000000000000000000',
'20000000000000000000000000000000','30000000000000000000000000000000',
'40000000000000000000000000000000','50000000000000000000000000000000',
'60000000000000000000000000000000','70000000000000000000000000000000',
'80000000000000000000000000000000','90000000000000000000000000000000',
'a0000000000000000000000000000000','b0000000000000000000000000000000',
'c0000000000000000000000000000000','d0000000000000000000000000000000',
'e0000000000000000000000000000000','f0000000000000000000000000000000']}
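
The 15 split points above follow a regular pattern: one hex digit from 1 to f followed by 31 zeros, which gives 16 regions over the 32-character hex form of an MD5 hash. As a minimal sketch, the same list can be generated rather than typed by hand. The object name `SplitKeys` is hypothetical and only used for illustration.

```scala
object SplitKeys {
  // One hex digit (1..f) followed by 31 zeros: 15 split points, hence 16 regions.
  val splitKeys: Seq[String] =
    (1 to 15).map(i => Integer.toHexString(i) + "0" * 31)

  def main(args: Array[String]): Unit = {
    // Rebuild the HBase shell statement shown in the question from the generated keys.
    val quoted = splitKeys.map(k => s"'$k'").mkString(",")
    println(s"create 'GTH_TEST', 'GTH_TEST', {SPLITS=> [$quoted]}")
  }
}
```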
 
I use the Hortonworks HBase connector (shc) like this:
df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
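
The definition of `cat_matrice` is not shown in the question. As a rough sketch, assuming it follows the JSON catalog layout described in the shc README and maps the table and column family `GTH_TEST` from the `create` statement, it might look like the following. The namespace `default`, the string type for MATCH, and the object name `CatalogSketch` are assumptions, not taken from the original post.

```scala
object CatalogSketch {
  // Assumed catalog: KEY is the row key, MATCH is a string column in family GTH_TEST.
  // The namespace "default" and both column types are guesses, not from the question.
  val cat_matrice: String =
    """{
      |  "table": {"namespace": "default", "name": "GTH_TEST"},
      |  "rowkey": "key",
      |  "columns": {
      |    "KEY":   {"cf": "rowkey",   "col": "key",   "type": "string"},
      |    "MATCH": {"cf": "GTH_TEST", "col": "MATCH", "type": "string"}
      |  }
      |}""".stripMargin

  def main(args: Array[String]): Unit = println(cat_matrice)
}
```

In this layout the special column family name `rowkey` marks the field that becomes the HBase row key; every other entry maps a DataFrame field to a column family and qualifier.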
 
This code never finishes. It starts inserting data into HBase and runs indefinitely (at least 35 hours before I killed it). It always stalls at the same point, with 11984 of 16000 tasks completed.
I made a single change:
df.limit(Int.MaxValue)
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
 
With limit(Int.MaxValue), inserting the 12 million rows takes 4 to 5 minutes.
Can somebody explain this behaviour? Is there a maximum number of connections on the HBase side? Is there some tuning to do on the HBase or Spark side?
Thanks!
Geoffrey

Source: https://stackoverflow.com/questions/37968215

Updated: 2023-07-15 11:07

Accepted answer

I've just found the solution myself: the key is to map myCollection into [AnyObject] and vice versa, like so:
class MyClass: NSObject, NSCoding {
    let myCollection: [MyProtocol]

    init(myCollection: [MyProtocol]) {
        self.myCollection = myCollection

        super.init()
    }

    required convenience init?(coder aDecoder: NSCoder) {
        let collection1 = aDecoder.decodeObjectForKey("collection") as! [AnyObject]

        let collection2: [MyProtocol] = collection1.map { $0 as! MyProtocol }

        self.init(myCollection: collection2)
    }

    func encodeWithCoder(aCoder: NSCoder) {
        let aCollection: [AnyObject] = myCollection.map { $0 as! AnyObject }

        aCoder.encodeObject(aCollection, forKey: "collection")
    }      
}
