首页 \ 问答 \ 如何在Spark中使用DenseVector作为键将groupByKey组为RDD?(How to groupByKey a RDD, with DenseVector as key, in Spark?)

如何在Spark中使用DenseVector作为键将groupByKey组为RDD?(How to groupByKey a RDD, with DenseVector as key, in Spark?)

我创建了一个RDD,每个成员都是一个键值对,键是DenseVector ,值是int 。 例如

[(DenseVector([3,4]),10),  (DenseVector([3,4]),20)]

现在我想按键k1分组: DenseVector([3,4]) 。 我希望行为将键k1所有值分组为1020 。 但我得到的结果是

[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)] 

代替

[(DenseVector([3,4]), [10,20])]

如果我错过了什么,请告诉我。

相同的代码是:

#simplified version of code
#rdd1 is an rdd containing [(DenseVector([3,4]),10),  (DenseVector([3,4]),20)]
rdd1.groupByKey().map(lambda x : (x[0], list(x[1])))
print(rdd1.collect())

I have created an RDD with each member being a key value pair with the key being a DenseVector and value being an int. e.g.

[(DenseVector([3,4]),10),  (DenseVector([3,4]),20)]

Now I want to group by the key k1: DenseVector([3,4]). I expect the behaviour to be grouping all the values of the key k1 which are 10 and 20. But the result I get is

[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)] 

instead of

[(DenseVector([3,4]), [10,20])]

Please let me know if I am missing something.

The code for the same is :

#simplified version of code
#rdd1 is an rdd containing [(DenseVector([3,4]),10),  (DenseVector([3,4]),20)]
rdd1.groupByKey().map(lambda x : (x[0], list(x[1])))
print(rdd1.collect())

原文:https://stackoverflow.com/questions/31449412
更新时间:2023-03-19 11:03

最满意答案

我想你可能会在这里混淆“异步”。 创建元组的过程将始终阻止。 因此,您可能想要做的是创建一个算法,该算法仅在需要时根据某些参数生成元组,然后将其缓存以供日后使用。

既然你已将其标记为node.js,我将假设这是感兴趣的编程语言。 基于这个假设,以及您实际上不希望阻塞的假设,您最好的选择是生成多个进程并管理创建这些元组的过程。 这是一个非常粗略的示例脚本(强调粗略 ):

var cluster = require('cluster');
var names = ['Jon', 'Stewart', 'Oliver'];

if (cluster.isWorker) {
  var count = +process.env.tupple_count;
  var tuples = [];

  // Process tuple here, then return it.

  process.send(JSON.stringify(tuples));
  return;
}

cluster.fork({ tupple_count: 2 }).on('message', function(msg) {
  // Receive tuple here:
  var tuple = JSON.parse(msg);
  console.log(tuple);
});

// Go about my life.

然后你可以写一个通用算法来返回这些。 以下是关于如何执行此操作的良好链接: 从n返回k个元素的所有组合的算法


I think you might be confusing "asynchronous" here. The process of creating the tuples will always block. So possibly what you'll want to do is create an algorithm that only generates a tuple when it's required, based on some parameters, then cache it for later.

Since you've tagged this as node.js I'm going to assume that's the programming language of interest. Based on that assumption, and the assumption that you actually don't want this to be blocking, your best bet is to spawn multiple processes and pipe out the process creating these tuples. Here's a very rough example script (emphasis on rough):

var cluster = require('cluster');
var names = ['Jon', 'Stewart', 'Oliver'];

if (cluster.isWorker) {
  var count = +process.env.tupple_count;
  var tuples = [];

  // Process tuple here, then return it.

  process.send(JSON.stringify(tuples));
  return;
}

cluster.fork({ tupple_count: 2 }).on('message', function(msg) {
  // Receive tuple here:
  var tuple = JSON.parse(msg);
  console.log(tuple);
});

// Go about my life.

Then you could write a general algorithm to return these. Here's a good link on how to do this: Algorithm to return all combinations of k elements from n

相关问答

更多

相关文章

更多

最新问答

更多
  • 您如何使用git diff文件,并将其应用于同一存储库的副本的本地分支?(How do you take a git diff file, and apply it to a local branch that is a copy of the same repository?)
  • 将长浮点值剪切为2个小数点并复制到字符数组(Cut Long Float Value to 2 decimal points and copy to Character Array)
  • OctoberCMS侧边栏不呈现(OctoberCMS Sidebar not rendering)
  • 页面加载后对象是否有资格进行垃圾回收?(Are objects eligible for garbage collection after the page loads?)
  • codeigniter中的语言不能按预期工作(language in codeigniter doesn' t work as expected)
  • 在计算机拍照在哪里进入
  • 使用cin.get()从c ++中的输入流中丢弃不需要的字符(Using cin.get() to discard unwanted characters from the input stream in c++)
  • No for循环将在for循环中运行。(No for loop will run inside for loop. Testing for primes)
  • 单页应用程序:页面重新加载(Single Page Application: page reload)
  • 在循环中选择具有相似模式的列名称(Selecting Column Name With Similar Pattern in a Loop)
  • System.StackOverflow错误(System.StackOverflow error)
  • KnockoutJS未在嵌套模板上应用beforeRemove和afterAdd(KnockoutJS not applying beforeRemove and afterAdd on nested templates)
  • 散列包括方法和/或嵌套属性(Hash include methods and/or nested attributes)
  • android - 如何避免使用Samsung RFS文件系统延迟/冻结?(android - how to avoid lag/freezes with Samsung RFS filesystem?)
  • TensorFlow:基于索引列表创建新张量(TensorFlow: Create a new tensor based on list of indices)
  • 企业安全培训的各项内容
  • 错误:RPC失败;(error: RPC failed; curl transfer closed with outstanding read data remaining)
  • C#类名中允许哪些字符?(What characters are allowed in C# class name?)
  • NumPy:将int64值存储在np.array中并使用dtype float64并将其转换回整数是否安全?(NumPy: Is it safe to store an int64 value in an np.array with dtype float64 and later convert it back to integer?)
  • 注销后如何隐藏导航portlet?(How to hide navigation portlet after logout?)
  • 将多个行和可变行移动到列(moving multiple and variable rows to columns)
  • 提交表单时忽略基础href,而不使用Javascript(ignore base href when submitting form, without using Javascript)
  • 对setOnInfoWindowClickListener的意图(Intent on setOnInfoWindowClickListener)
  • Angular $资源不会改变方法(Angular $resource doesn't change method)
  • 在Angular 5中不是一个函数(is not a function in Angular 5)
  • 如何配置Composite C1以将.m和桌面作为同一站点提供服务(How to configure Composite C1 to serve .m and desktop as the same site)
  • 不适用:悬停在悬停时:在元素之前[复制](Don't apply :hover when hovering on :before element [duplicate])
  • 常见的python rpc和cli接口(Common python rpc and cli interface)
  • Mysql DB单个字段匹配多个其他字段(Mysql DB single field matching to multiple other fields)
  • 产品页面上的Magento Up出售对齐问题(Magento Up sell alignment issue on the products page)