Huge insert to HBase
I have an issue when I try to insert data into HBase.
I have a Spark DataFrame with 12 million rows and 2 fields:
* KEY, an MD5 hash
* MATCH, a boolean ("1" or "0")
I need to store it in an HBase table, with KEY as the rowkey and MATCH as a column.
I created the table pre-split on the rowkey:
create 'GTH_TEST', 'GTH_TEST', {SPLITS => [
  '10000000000000000000000000000000', '20000000000000000000000000000000',
  '30000000000000000000000000000000', '40000000000000000000000000000000',
  '50000000000000000000000000000000', '60000000000000000000000000000000',
  '70000000000000000000000000000000', '80000000000000000000000000000000',
  '90000000000000000000000000000000', 'a0000000000000000000000000000000',
  'b0000000000000000000000000000000', 'c0000000000000000000000000000000',
  'd0000000000000000000000000000000', 'e0000000000000000000000000000000',
  'f0000000000000000000000000000000'
]}
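For reference, these 15 boundaries are just each leading hex digit followed by 31 zeros, which divides the 32-character MD5 keyspace into 16 evenly sized regions; a quick sketch of how they can be generated:

```python
# Generate the 15 split boundaries: one per leading hex digit (1..f),
# each padded with zeros to the full 32-character MD5 width.
splits = ["%x%s" % (i, "0" * 31) for i in range(1, 16)]
print(splits)  # ['10000000000000000000000000000000', ..., 'f0000000000000000000000000000000']
```

Because MD5 hashes are effectively uniformly distributed, this pre-splitting spreads the write load across 16 regions from the start instead of hot-spotting a single region.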
I use the HBase shc connector from Hortonworks like this:

df.write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
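The `cat_matrice` catalog itself is not shown in the post; with shc it would be a JSON document roughly along these lines (a sketch only — the namespace, the `"string"` types, and the choice of column qualifier are assumptions; the table name and column family `GTH_TEST` are taken from the create statement above):

```json
{
  "table":   {"namespace": "default", "name": "GTH_TEST"},
  "rowkey":  "key",
  "columns": {
    "KEY":   {"cf": "rowkey",   "col": "key",   "type": "string"},
    "MATCH": {"cf": "GTH_TEST", "col": "MATCH", "type": "string"}
  }
}
```

In shc's catalog format the DataFrame column mapped to `"cf": "rowkey"` becomes the HBase rowkey, and every other column is written under its declared column family and qualifier.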
This code never ends. It starts inserting data into HBase and runs forever (at least 35 hours before I killed it). It always stalls at the same point: 11984 of 16000 tasks completed.
I made a single change:

df.limit(Int.MaxValue)
  .write
  .options(Map(HBaseTableCatalog.tableCatalog -> cat_matrice))
  .format("org.apache.spark.sql.execution.datasources.hbase")
  .save()
With the limit(Int.MaxValue), it takes 4 to 5 minutes to insert the 12 million rows.
Can somebody explain this behaviour? Is there a maximum number of connections on the HBase side? Is there some tuning to do on the HBase or Spark side?
Thanks!
Geoffrey
Source: https://stackoverflow.com/questions/37968215