首页 \ 问答 \ 巨大的插入HBase(Huge insert to HBase)

巨大的插入HBase(Huge insert to HBase)

当我尝试向HBase插入数据时遇到问题。

我有一个1200万行Spark DataFrame有2个字段:

* KEY, a md5 hash
* MATCH, a boolean ("1" or "0")

我需要将它存储在HBase表中,KEY是rowkey,MATCH是列。

我在rowkey上创建了一个分割表:

create 'GTH_TEST', 'GTH_TEST', {SPLITS=> ['10000000000000000000000000000000',
'20000000000000000000000000000000','30000000000000000000000000000000',
'40000000000000000000000000000000','50000000000000000000000000000000',
'60000000000000000000000000000000','70000000000000000000000000000000',
'80000000000000000000000000000000','90000000000000000000000000000000',
'a0000000000000000000000000000000','b0000000000000000000000000000000',
'c0000000000000000000000000000000','d0000000000000000000000000000000',
'e0000000000000000000000000000000','f0000000000000000000000000000000']}

我使用Hortonworks的HBase shc连接器,如下所示:

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

此代码永不结束。 它开始向HBase插入数据并永久运行(至少在我杀死它之前35小时)。 它执行11984/16000个任务,总是相同数量的任务。

我做了一个单独的更改:

df.limit(Int.MaxValue)
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

使用限制(Int.MaxValue) ,需要4/5分钟才能插入1200万行。

有人可以解释这种行为吗? HBase端是否有max_connexions? HBase或Spark方面有一些调整吗?

谢谢 !

杰弗里


I have an issue when I try to insert data to HBase.

I have a 12 million lines Spark DataFrame with 2 fields :

* KEY, a md5 hash
* MATCH, a boolean ("1" or "0")

I need to store it in an HBase table, KEY is the rowkey and MATCH is a column.

I created the table with a split on rowkey :

create 'GTH_TEST', 'GTH_TEST', {SPLITS=> ['10000000000000000000000000000000',
'20000000000000000000000000000000','30000000000000000000000000000000',
'40000000000000000000000000000000','50000000000000000000000000000000',
'60000000000000000000000000000000','70000000000000000000000000000000',
'80000000000000000000000000000000','90000000000000000000000000000000',
'a0000000000000000000000000000000','b0000000000000000000000000000000',
'c0000000000000000000000000000000','d0000000000000000000000000000000',
'e0000000000000000000000000000000','f0000000000000000000000000000000']}

I use the HBase shc connector from Hortonworks like this :

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

This code never ends. It starts inserting data to HBase and runs forever (at least 35 hours before I killed it). It performs 11984/16000 tasks, always the same number of tasks.

I made a single change :

df.limit(Int.MaxValue)
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()

With the limit(Int.MaxValue), it takes 4/5 minutes to insert 12 million lines.

Can somebody explain this behaviour ? Is there a max_connexions on HBase side ? Is there some tuning to do on HBase or Spark side ?

Thanks !

Geoffrey


原文:https://stackoverflow.com/questions/37968215
更新时间:2023-07-15 11:07

最满意答案

我自己刚刚找到了解决方案:关键是将myCollection映射到[AnyObject] ,反之亦然,如下所示:

class MyClass: NSObject, NSCoding {
    let myCollection: [MyProtocol]

    init(myCollection: [MyProtocol]) {
        self.myCollection = myCollection

        super.init()
    }

    required convenience init?(coder aDecoder: NSCoder) {
        let collection1 = aDecoder.decodeObjectForKey("collection") as! [AnyObject]

        let collection2: [MyProtocol] = collection1.map { $0 as! MyProtocol }


        self.init(myCollection: collection2)
    }

    func encodeWithCoder(aCoder: NSCoder) {
        let aCollection: [AnyObject] = myCollection.map { $0 as! AnyObject }

        aCoder.encodeObject(aCollection, forKey: "collection")
    }      
}

I've just found the solution myself: The key is to map myCollection into [AnyObject] and vice-versa, like so:

class MyClass: NSObject, NSCoding {
    let myCollection: [MyProtocol]

    init(myCollection: [MyProtocol]) {
        self.myCollection = myCollection

        super.init()
    }

    required convenience init?(coder aDecoder: NSCoder) {
        let collection1 = aDecoder.decodeObjectForKey("collection") as! [AnyObject]

        let collection2: [MyProtocol] = collection1.map { $0 as! MyProtocol }


        self.init(myCollection: collection2)
    }

    func encodeWithCoder(aCoder: NSCoder) {
        let aCollection: [AnyObject] = myCollection.map { $0 as! AnyObject }

        aCoder.encodeObject(aCollection, forKey: "collection")
    }      
}

相关问答

更多

相关文章

更多

最新问答

更多
  • 您如何使用git diff文件,并将其应用于同一存储库的副本的本地分支?(How do you take a git diff file, and apply it to a local branch that is a copy of the same repository?)
  • 将长浮点值剪切为2个小数点并复制到字符数组(Cut Long Float Value to 2 decimal points and copy to Character Array)
  • OctoberCMS侧边栏不呈现(OctoberCMS Sidebar not rendering)
  • 页面加载后对象是否有资格进行垃圾回收?(Are objects eligible for garbage collection after the page loads?)
  • codeigniter中的语言不能按预期工作(language in codeigniter doesn' t work as expected)
  • 在计算机拍照在哪里进入
  • 使用cin.get()从c ++中的输入流中丢弃不需要的字符(Using cin.get() to discard unwanted characters from the input stream in c++)
  • No for循环将在for循环中运行。(No for loop will run inside for loop. Testing for primes)
  • 单页应用程序:页面重新加载(Single Page Application: page reload)
  • 在循环中选择具有相似模式的列名称(Selecting Column Name With Similar Pattern in a Loop)
  • System.StackOverflow错误(System.StackOverflow error)
  • KnockoutJS未在嵌套模板上应用beforeRemove和afterAdd(KnockoutJS not applying beforeRemove and afterAdd on nested templates)
  • 散列包括方法和/或嵌套属性(Hash include methods and/or nested attributes)
  • android - 如何避免使用Samsung RFS文件系统延迟/冻结?(android - how to avoid lag/freezes with Samsung RFS filesystem?)
  • TensorFlow:基于索引列表创建新张量(TensorFlow: Create a new tensor based on list of indices)
  • 企业安全培训的各项内容
  • 错误:RPC失败;(error: RPC failed; curl transfer closed with outstanding read data remaining)
  • C#类名中允许哪些字符?(What characters are allowed in C# class name?)
  • NumPy:将int64值存储在np.array中并使用dtype float64并将其转换回整数是否安全?(NumPy: Is it safe to store an int64 value in an np.array with dtype float64 and later convert it back to integer?)
  • 注销后如何隐藏导航portlet?(How to hide navigation portlet after logout?)
  • 将多个行和可变行移动到列(moving multiple and variable rows to columns)
  • 提交表单时忽略基础href,而不使用Javascript(ignore base href when submitting form, without using Javascript)
  • 对setOnInfoWindowClickListener的意图(Intent on setOnInfoWindowClickListener)
  • Angular $资源不会改变方法(Angular $resource doesn't change method)
  • 在Angular 5中不是一个函数(is not a function in Angular 5)
  • 如何配置Composite C1以将.m和桌面作为同一站点提供服务(How to configure Composite C1 to serve .m and desktop as the same site)
  • 不适用:悬停在悬停时:在元素之前[复制](Don't apply :hover when hovering on :before element [duplicate])
  • 常见的python rpc和cli接口(Common python rpc and cli interface)
  • Mysql DB单个字段匹配多个其他字段(Mysql DB single field matching to multiple other fields)
  • 产品页面上的Magento Up出售对齐问题(Magento Up sell alignment issue on the products page)