
How to join documents after a pipeline stage in the MongoDB aggregation framework

Let's say that after the first stage of the aggregation I have grouped all the documents by center, so I have something like this:

{
center:"A",
gender:"Male",
count:50
}
{
center:"A",
gender:"Female",
count:20
}

I want to join these two documents so that the final document looks something like this:

{
center:"A",
Male:50,
Female:20
}
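
To make the requested reshaping concrete, here is a minimal sketch in plain Python that merges the per-gender documents into one document per center. The function name `pivot_by_center` and the variable names are hypothetical, chosen only for illustration. The `pivot_stages` list at the end is an untested assumption about how the same reshaping might be expressed as aggregation stages, on a server recent enough to support `$arrayToObject` and `$mergeObjects`; it is not taken from the original thread.

```python
from collections import defaultdict


def pivot_by_center(docs):
    """Merge per-gender count documents into one document per center."""
    merged = defaultdict(dict)
    for doc in docs:
        # Use the gender value as the new field name and the count as its value.
        merged[doc["center"]][doc["gender"]] = doc["count"]
    return [{"center": center, **counts} for center, counts in merged.items()]


# Output of the first aggregation stage, as shown in the question.
stage_output = [
    {"center": "A", "gender": "Male", "count": 50},
    {"center": "A", "gender": "Female", "count": 20},
]

result = pivot_by_center(stage_output)
print(result)  # [{'center': 'A', 'Male': 50, 'Female': 20}]

# Assumption (not run against a server): extra stages that might perform
# the same pivot inside the aggregation pipeline.
pivot_stages = [
    {"$group": {
        "_id": "$center",
        "counts": {"$push": {"k": "$gender", "v": "$count"}},
    }},
    {"$replaceRoot": {"newRoot": {"$mergeObjects": [
        {"center": "$_id"},
        {"$arrayToObject": "$counts"},
    ]}}},
]
```

The plain-Python version only illustrates the target shape; doing the pivot inside the pipeline avoids pulling the intermediate documents into the client.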

Source: https://stackoverflow.com/questions/34627212
Updated: 2023-07-28 06:07

Best answer

You can use csv.reader with skipinitialspace=True to skip the repeated spaces, then zip the rows to get the columns. Use itertools.zip_longest (named izip_longest in Python 2) rather than zip, because a value in the last column is missing. Convert each column to a set and take the intersection with set.intersection:

from itertools import zip_longest
import csv

with open('test', newline='') as f:
    # Fields are separated by runs of spaces, so skip the extra ones.
    reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
    # Transpose rows into columns; short rows are padded with None.
    cols = [set(col) for col in zip_longest(*reader)]

# Values that appear in every column.
print(set.intersection(*cols))

Be aware that your file is not really a CSV file: if a value is missing in any column other than the last one, the values after it shift left and the input is misread. Consider at least using a delimiter other than a space.
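
The caveat about missing values can be seen in a small sketch. It is written for Python 3, where the function is named `zip_longest`. The helper `read_columns` and the sample data are hypothetical, made up for this illustration; the second half assumes the same data rewritten with commas as the delimiter.

```python
import csv
import io
from itertools import zip_longest


def read_columns(text, **reader_kwargs):
    """Parse text with csv.reader and return its columns as lists."""
    rows = list(csv.reader(io.StringIO(text), **reader_kwargs))
    # Transpose rows into columns; short rows are padded with None.
    return [list(col) for col in zip_longest(*rows)]


# Space-aligned data where the middle value of the last row is missing.
space_text = (
    "col1  col2  col3\n"
    "x     y     z\n"
    "p           r\n"
)
space_cols = read_columns(space_text, delimiter=' ', skipinitialspace=True)
print(space_cols[1])  # ['col2', 'y', 'r']    'r' has shifted into col2
print(space_cols[2])  # ['col3', 'z', None]

# The same data with commas keeps the empty field in place.
comma_text = "col1,col2,col3\nx,y,z\np,,r\n"
comma_cols = read_columns(comma_text)
print(comma_cols[1])  # ['col2', 'y', '']
print(comma_cols[2])  # ['col3', 'z', 'r']
```

With a space delimiter the parser cannot tell an empty field from padding, which is why a non-space delimiter is the safer choice.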

Example

Using io.StringIO to parse a string shows that this works for the test case (Python 3):

from itertools import zip_longest
import csv
import io

data = '''table1    table2    table3  table4   table5
paper     paper     pen     book     book
pen       pencil    pencil  charger  apple
apple     pen       charger beatroot sandle
beatroot  mobile    apple   pen      paper
sandle    book      paper   paper'''

f = io.StringIO(data)
reader = csv.reader(f, delimiter=' ', skipinitialspace=True)
# Transpose rows into columns; the short last row is padded with None.
cols = [set(col) for col in zip_longest(*reader)]

print(set.intersection(*cols))

Output

{'paper'}
