How to groupByKey an RDD, with DenseVector as key, in Spark?
I have created an RDD in which each member is a key-value pair, with a DenseVector as the key and an int as the value, e.g. [(DenseVector([3,4]),10), (DenseVector([3,4]),20)].

Now I want to group by the key k1: DenseVector([3,4]). I expect the two values for key k1, 10 and 20, to be grouped together. But the result I get is

    [(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]

instead of

    [(DenseVector([3,4]), [10,20])]

Please let me know if I am missing something.

The code is:

    # simplified version of code
    # rdd1 is an rdd containing [(DenseVector([3,4]),10), (DenseVector([3,4]),20)]
    rdd1.groupByKey().map(lambda x : (x[0], list(x[1])))
    print(rdd1.collect())
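Grouping by key relies on the keys' hash and equality, so two keys that merely look the same are grouped together only if they hash and compare equal. The sketch below illustrates that in plain Python, without Spark. FakeVector and group_by_key are hypothetical names introduced for illustration, and whether DenseVector hashes by identity in a given Spark version is an assumption, not something the question confirms.

```python
from collections import defaultdict


class FakeVector:
    """Hypothetical stand-in for a key type that hashes by object identity."""

    def __init__(self, values):
        self.values = list(values)


def group_by_key(pairs, key_fn=lambda k: k):
    """Group values by key_fn(key), relying on hash and equality like a dict."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key_fn(key)].append(value)
    return dict(groups)


pairs = [(FakeVector([3, 4]), 10), (FakeVector([3, 4]), 20)]

# Identity-hashed keys: the two equal-looking keys land in separate groups.
by_object = group_by_key(pairs)

# Tuple keys hash by content, so both values share one group.
by_tuple = group_by_key(pairs, key_fn=lambda v: tuple(v.values))

print(len(by_object))  # 2
print(by_tuple)        # {(3, 4): [10, 20]}
```

If the same cause applies in PySpark, one option would be to convert each key to a tuple before grouping, for example rdd1.map(lambda kv: (tuple(kv[0].toArray()), kv[1])).groupByKey(). Note also that RDD transformations return a new RDD, so the snippet above prints the unchanged rdd1 rather than the grouped result; the grouped RDD has to be assigned to a variable and collected.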
Source: https://stackoverflow.com/questions/31449412
Best answer
I think you might be misunderstanding "asynchronous" here. The process of creating the tuples will always block. So what you'll probably want to do is create an algorithm that generates a tuple only when it's required, based on some parameters, and then caches it for later.
Since you've tagged this as node.js, I'm going to assume that's the programming language of interest. Based on that assumption, and the assumption that you actually don't want this to block, your best bet is to spawn multiple processes and hand the work of creating these tuples off to them. Here's a very rough example script (emphasis on rough):

    var cluster = require('cluster');
    var names = ['Jon', 'Stewart', 'Oliver'];

    if (cluster.isWorker) {
      var count = +process.env.tupple_count;
      var tuples = [];
      // Process tuple here, then return it.
      process.send(JSON.stringify(tuples));
      return;
    }

    cluster.fork({ tupple_count: 2 }).on('message', function(msg) {
      // Receive tuple here:
      var tuple = JSON.parse(msg);
      console.log(tuple);
    });

    // Go about my life.

Then you could write a general algorithm to return these. Here's a good link on how to do this: Algorithm to return all combinations of k elements from n
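The idea of generating tuples only when they are needed and caching them for later can be sketched as follows. This is a minimal illustration in Python rather than Node.js, to match the language of the question at the top of the page. iter_tuples and first_tuples are hypothetical names, and the use of k-element combinations is an assumption based on the linked title.

```python
from functools import lru_cache
from itertools import combinations, islice

names = ("Jon", "Stewart", "Oliver")


def iter_tuples(items, k):
    """Lazily yield k-element combinations; nothing is computed until asked."""
    return combinations(items, k)


@lru_cache(maxsize=None)
def first_tuples(items, k, limit):
    """Compute only the first `limit` tuples and cache them for later calls."""
    return tuple(islice(iter_tuples(items, k), limit))


print(first_tuples(names, 2, 2))  # (('Jon', 'Stewart'), ('Jon', 'Oliver'))
```

Because lru_cache keys on the arguments, items has to be hashable, which is why names is a tuple rather than a list. A repeated call with the same arguments returns the cached result without regenerating anything.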