How to groupByKey an RDD, with DenseVector as key, in Spark?

I have created an RDD in which each element is a key-value pair: the key is a DenseVector and the value is an int. For example:
[(DenseVector([3,4]),10),  (DenseVector([3,4]),20)]
 
Now I want to group by the key k1, DenseVector([3,4]). I expect all the values for k1, namely 10 and 20, to be grouped together. But the result I get is
[(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)] 
 
instead of  
[(DenseVector([3,4]), [10,20])]
 
Please let me know if I am missing something.
The code is:
# Simplified version of the code.
# rdd1 is an RDD containing [(DenseVector([3,4]), 10), (DenseVector([3,4]), 20)]
rdd1.groupByKey().map(lambda x: (x[0], list(x[1])))
print(rdd1.collect())
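
A minimal sketch of one possible workaround, under two assumptions that the post does not confirm: that the PySpark version in use does not hash DenseVector keys by value, and that the grouped result (not rdd1 itself) was meant to be printed. It groups by a tuple built from the key's contents. Plain lists stand in for DenseVector so the sketch runs without Spark, and `group_by_vector_key` is a hypothetical helper name. The commented lines show an assumed RDD equivalent, with the result assigned to a new variable because RDD transformations return a new RDD rather than changing rdd1.

```python
from collections import defaultdict


def group_by_vector_key(pairs):
    """Group values by the contents of a vector-like key.

    Each key is converted to a tuple of floats so that equal vectors
    hash and compare equal, which grouping by key relies on.
    """
    groups = defaultdict(list)
    for key, value in pairs:
        groups[tuple(float(x) for x in key)].append(value)
    # Sort by key only to make the printed order deterministic.
    return sorted(groups.items())


# Plain lists stand in for DenseVector([3, 4]) (assumption for this sketch).
pairs = [([3, 4], 10), ([3, 4], 20), ([1, 2], 5)]
grouped = group_by_vector_key(pairs)
print(grouped)

# Assumed PySpark equivalent (not executed here):
# grouped_rdd = (rdd1.map(lambda kv: (tuple(kv[0].toArray()), kv[1]))
#                    .groupByKey()
#                    .mapValues(list))
# print(grouped_rdd.collect())
```

If the vector form of the key is needed afterwards, the tuple can be turned back into a DenseVector in a final map step.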

Source: https://stackoverflow.com/questions/31449412

Updated: 2023-03-19 11:03

Best answer

I think you may be misunderstanding what "asynchronous" means here. The process of creating the tuples will always block. So what you probably want is an algorithm that generates a tuple only when it is required, based on some parameters, and then caches it for later use.
Since you've tagged this node.js, I'll assume that is the programming language of interest. Given that, and assuming you really don't want this to block, your best bet is to spawn multiple processes and hand the tuple creation off to them. Here's a very rough example script (emphasis on rough):
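const cluster = require('cluster');
const names = ['Jon', 'Stewart', 'Oliver'];

if (cluster.isWorker) {
  // Worker: read tuple_count from the environment passed to fork().
  const count = +process.env.tuple_count;
  const tuples = [];

  // Process the tuples here (from `names`, using `count`).

  // Send the result back to the primary process.
  process.send(JSON.stringify(tuples));
} else {
  // Primary: fork a worker and pass tuple_count through its environment.
  cluster.fork({ tuple_count: 2 }).on('message', function (msg) {
    // Receive the tuples here.
    const tuples = JSON.parse(msg);
    console.log(tuples);
  });

  // Go about my life.
}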
 
Then you could write a general algorithm to generate the tuples. A good reference for this is the question "Algorithm to return all combinations of k elements from n".
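
The generate-on-demand-then-cache idea described above can be sketched as follows. This is an illustration only, written in Python to match the question's code rather than the Node.js script; `iter_tuples` and `first_tuples` are hypothetical names, and `itertools.combinations` stands in for the combination algorithm the referenced question describes.

```python
from functools import lru_cache
from itertools import combinations, islice

NAMES = ("Jon", "Stewart", "Oliver")


def iter_tuples(items, k):
    """Return a lazy iterator over the k-element combinations of items."""
    return combinations(items, k)


@lru_cache(maxsize=None)
def first_tuples(items, k, limit):
    """Materialise the first `limit` combinations and cache the result.

    Arguments must be hashable (hence a tuple of names, not a list).
    """
    return tuple(islice(iter_tuples(items, k), limit))


pairs = first_tuples(NAMES, 2, 2)
print(pairs)

# A repeated call with the same arguments is served from the cache.
again = first_tuples(NAMES, 2, 2)
print(again is pairs)
```

Only as many combinations as `limit` asks for are ever built, so the blocking work stays proportional to what the caller actually needs.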
