How to solve the "Digg" problem in MongoDB
A while back, a Digg developer posted this blog entry, http://about.digg.com/blog/looking-future-cassandra, in which he described one of the issues that were not optimally solved in MySQL. This was cited as one of the reasons for their move to Cassandra.
I have been playing with MongoDB and would like to understand how to implement the MongoDB collections for this problem.
From the article, the schema for this information in MySQL:
CREATE TABLE `Diggs` (
  `id` INT(11),
  `itemid` INT(11),
  `userid` INT(11),
  `digdate` DATETIME,
  PRIMARY KEY (`id`),
  KEY `user` (`userid`),
  KEY `item` (`itemid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `Friends` (
  `id` INT(10) AUTO_INCREMENT,
  `userid` INT(10),
  `username` VARCHAR(15),
  `friendid` INT(10),
  `friendname` VARCHAR(15),
  `mutual` TINYINT(1),
  `date_created` DATETIME,
  PRIMARY KEY (`id`),
  UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
  KEY `Friend_friend` (`friendid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
This problem is ubiquitous in social networking implementations. People befriend a lot of people, who in turn digg a lot of things. Quickly showing a user what his/her friends are up to is critical.
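To make the lookup concrete, here is a minimal in-memory sketch in Python of the question the schema has to answer: which of a user's friends dugg a given item. The field names come from the `Friends` and `Diggs` tables above; the sample rows and the `friends_who_dugg` helper are hypothetical, and the sketch says nothing about how MongoDB or MySQL would actually execute the query.

```python
# Hypothetical sample rows; field names follow the Friends and Diggs tables.
friends = [
    {"userid": 1, "friendid": 2},
    {"userid": 1, "friendid": 3},
    {"userid": 2, "friendid": 3},
]
diggs = [
    {"itemid": 100, "userid": 2},
    {"itemid": 100, "userid": 3},
    {"itemid": 101, "userid": 3},
    {"itemid": 100, "userid": 4},
]

def friends_who_dugg(userid, itemid):
    """Return the sorted ids of userid's friends who dugg itemid."""
    # Everyone this user has befriended.
    friend_ids = {f["friendid"] for f in friends if f["userid"] == userid}
    # Everyone who dugg the item.
    diggers = {d["userid"] for d in diggs if d["itemid"] == itemid}
    # The answer is the intersection of the two sets.
    return sorted(friend_ids & diggers)

print(friends_who_dugg(1, 100))  # [2, 3]
```

Both sets can be large in practice, which is why the intersection is the expensive step and why the choice of indexes or data model matters.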
I understand that several blogs have since provided a pure RDBMS solution with indexes for this issue; however, I am curious how this could be solved in MongoDB.
Source: https://stackoverflow.com/questions/2823165
Best answer
The documentation of itertools says that product is "roughly equivalent to nested for-loops in a generator expression". So itertools.product is never an enemy of memory, but if you store its results in a list, that list is. Therefore:
for element in itertools.product(...):
    print(element)
is okay, but
myList = [element for element in itertools.product(...)]
or the equivalent loop of
for element in itertools.product(...):
    myList.append(element)
is not! So you want itertools to generate results for you, but you don't want to store them, rather use them as they are generated. Think about this line of your code:
c = [str(x)+y for x,y in itertools.product(nums,chars)]
Given that nums and chars can be huge lists, building another gigantic list of all combinations on top of them is definitely going to choke your system.
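As a rough illustration of the difference, the sketch below builds the same combinations once as a list and once as a generator expression. The small `nums` and `chars` values are stand-ins chosen for the example, not the inputs from the question.

```python
import itertools
import sys

# Small stand-in inputs (assumed for illustration only).
nums = range(100)
chars = "abcdefghij"

# Eager: every combined string is built and held in memory at once.
eager = [str(x) + y for x, y in itertools.product(nums, chars)]

# Lazy: a generator expression builds each string only when it is requested.
lazy = (str(x) + y for x, y in itertools.product(nums, chars))

print(len(eager))                                  # 1000
print(sys.getsizeof(eager) > sys.getsizeof(lazy))  # True
first = next(lazy)
print(first)                                       # 0a
```

The list grows with the number of combinations, while the generator object stays the same small size however many items it will eventually produce.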
Now, as mentioned in the comments, if you replace all the lists that are too fat to fit into memory with generators (functions that just yield), memory is not going to be a concern anymore.
Here is my full code. I basically changed your lists of chars and nums to generators, and got rid of the final list of c.
import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

def char(a):
    # Lazily yield every 4-letter string over the range a[0]..a[1].
    for combination in itertools.product(char_range(str(a[0]),str(a[1])),repeat=4):
        yield ''.join(map(str, combination))

def num(n):
    # Lazily yield every 4-digit string over range(n).
    for combination in itertools.product(range(n),repeat=4):
        yield ''.join(map(str, combination))

def final(one,two):
    # Print each result as soon as it is generated; nothing is stored.
    for foo in char(one):
        for bar in num(two):
            print(str(bar)+str(foo))
Now let's ask what every combination of ['a','b'] and range(2) is:
final(['a','b'],2)
Produces this:
0000aaaa 0001aaaa 0010aaaa 0011aaaa 0100aaaa 0101aaaa 0110aaaa 0111aaaa 1000aaaa 1001aaaa 1010aaaa 1011aaaa 1100aaaa 1101aaaa 1110aaaa 1111aaaa 0000aaab 0001aaab 0010aaab 0011aaab 0100aaab 0101aaab 0110aaab 0111aaab 1000aaab 1001aaab 1010aaab 1011aaab 1100aaab 1101aaab 1110aaab 1111aaab 0000aaba 0001aaba 0010aaba 0011aaba 0100aaba 0101aaba 0110aaba 0111aaba 1000aaba 1001aaba 1010aaba 1011aaba 1100aaba 1101aaba 1110aaba 1111aaba 0000aabb 0001aabb 0010aabb 0011aabb 0100aabb 0101aabb 0110aabb 0111aabb 1000aabb 1001aabb 1010aabb 1011aabb 1100aabb 1101aabb 1110aabb 1111aabb 0000abaa 0001abaa 0010abaa 0011abaa 0100abaa 0101abaa 0110abaa 0111abaa 1000abaa 1001abaa 1010abaa 1011abaa 1100abaa 1101abaa 1110abaa 1111abaa 0000abab 0001abab 0010abab 0011abab 0100abab 0101abab 0110abab 0111abab 1000abab 1001abab 1010abab 1011abab 1100abab 1101abab 1110abab 1111abab 0000abba 0001abba 0010abba 0011abba 0100abba 0101abba 0110abba 0111abba 1000abba 1001abba 1010abba 1011abba 1100abba 1101abba 1110abba 1111abba 0000abbb 0001abbb 0010abbb 0011abbb 0100abbb 0101abbb 0110abbb 0111abbb 1000abbb 1001abbb 1010abbb 1011abbb 1100abbb 1101abbb 1110abbb 1111abbb 0000baaa 0001baaa 0010baaa 0011baaa 0100baaa 0101baaa 0110baaa 0111baaa 1000baaa 1001baaa 1010baaa 1011baaa 1100baaa 1101baaa 1110baaa 1111baaa 0000baab 0001baab 0010baab 0011baab 0100baab 0101baab 0110baab 0111baab 1000baab 1001baab 1010baab 1011baab 1100baab 1101baab 1110baab 1111baab 0000baba 0001baba 0010baba 0011baba 0100baba 0101baba 0110baba 0111baba 1000baba 1001baba 1010baba 1011baba 1100baba 1101baba 1110baba 1111baba 0000babb 0001babb 0010babb 0011babb 0100babb 0101babb 0110babb 0111babb 1000babb 1001babb 1010babb 1011babb 1100babb 1101babb 1110babb 1111babb 0000bbaa 0001bbaa 0010bbaa 0011bbaa 0100bbaa 0101bbaa 0110bbaa 0111bbaa 1000bbaa 1001bbaa 1010bbaa 1011bbaa 1100bbaa 1101bbaa 1110bbaa 1111bbaa 0000bbab 0001bbab 0010bbab 0011bbab 0100bbab 0101bbab 0110bbab 0111bbab 1000bbab 1001bbab 1010bbab 1011bbab 1100bbab 1101bbab 
1110bbab 1111bbab 0000bbba 0001bbba 0010bbba 0011bbba 0100bbba 0101bbba 0110bbba 0111bbba 1000bbba 1001bbba 1010bbba 1011bbba 1100bbba 1101bbba 1110bbba 1111bbba 0000bbbb 0001bbbb 0010bbbb 0011bbbb 0100bbbb 0101bbbb 0110bbbb 0111bbbb 1000bbbb 1001bbbb 1010bbbb 1011bbbb 1100bbbb 1101bbbb 1110bbbb 1111bbbb
This is exactly the result you are looking for. Each element of this result is generated on the fly, so it never creates a memory problem. You can now try much bigger operations such as final(['a','z'],10) and see that they are CPU-friendly.
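If you only want to peek at the start of a very large run instead of printing all of it, itertools.islice can cut a lazy stream short. The sketch below assumes a hypothetical `combos` generator that yields the same strings, in the same order, as the printing loop in final.

```python
import itertools

def combos(letters, n, repeat=4):
    """Lazily yield digit-string + letter-string combinations."""
    for foo in itertools.product(letters, repeat=repeat):
        for bar in itertools.product(range(n), repeat=repeat):
            yield "".join(map(str, bar)) + "".join(foo)

# Take only the first three results; the rest are never generated.
preview = list(itertools.islice(combos("abcdefghijklmnopqrstuvwxyz", 10), 3))
print(preview)  # ['0000aaaa', '0001aaaa', '0002aaaa']

# Size of the full run for 26 letters and 10 digits.
print(26**4 * 10**4)  # 4569760000
```

The full run would yield about 4.57 billion strings, so while memory stays flat, printing or processing every one of them still takes time proportional to that count.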