首页 \ 问答 \ 如何解决MongoDB中的“Digg”问题(how to Solve the “Digg” problem in MongoDB)

如何解决MongoDB中的“Digg”问题(how to Solve the “Digg” problem in MongoDB)

不久前,一位Digg开发者发布了这个博客“ http://about.digg.com/blog/looking-future-cassandra ”,他描述了MySQL中没有最佳解决的问题之一。 这被认为是他们搬到卡桑德拉的原因之一。

我一直在玩MongoDB,我想了解如何

为此问题实现MongoDB集合

从文章中,MySQL中此信息的架构:

CREATE TABLE `Diggs` (
  `id`      INT(11),
  `itemid`  INT(11),
  `userid`  INT(11),
  `digdate` DATETIME,
  PRIMARY KEY (`id`),
  KEY `user`  (`userid`),
  KEY `item`  (`itemid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `Friends` (
  `id`           INT(10) AUTO_INCREMENT,
  `userid`       INT(10),
  `username`     VARCHAR(15),
  `friendid`     INT(10),
  `friendname`   VARCHAR(15),
  `mutual`       TINYINT(1),
  `date_created` DATETIME,
  PRIMARY KEY                (`id`),
  UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
  KEY        `Friend_friend` (`friendid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

这个问题在社交网络场景实现中无处不在。 人们与很多人交往,他们反过来挖掘很多东西。 快速向用户展示他/她的朋友所做的事情是非常关键的。

据我所知,自那时以来,有几个博客提供了一个纯RDBM解决方案,并为此问题提供了索引。 但我很好奇如何在MongoDB中解决这个问题。


A while back,a Digg developer had posted this blog ,"http://about.digg.com/blog/looking-future-cassandra", where the he described one of the issues that were not optimally solved in MySQL. This was cited as one of the reasons for their move to Cassandra.

I have been playing with MongoDB and I would like to understand how to

implement the MongoDB collections for this problem

From the article, the schema for this information in MySQL :

CREATE TABLE `Diggs` (
  `id`      INT(11),
  `itemid`  INT(11),
  `userid`  INT(11),
  `digdate` DATETIME,
  PRIMARY KEY (`id`),
  KEY `user`  (`userid`),
  KEY `item`  (`itemid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

CREATE TABLE `Friends` (
  `id`           INT(10) AUTO_INCREMENT,
  `userid`       INT(10),
  `username`     VARCHAR(15),
  `friendid`     INT(10),
  `friendname`   VARCHAR(15),
  `mutual`       TINYINT(1),
  `date_created` DATETIME,
  PRIMARY KEY                (`id`),
  UNIQUE KEY `Friend_unique` (`userid`,`friendid`),
  KEY        `Friend_friend` (`friendid`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

This problem is ubiquitous in social networking scenario implementation. People befriend a lot of people and they in turn digg a lot of things. Quickly showing a user what his/her friends are up to is very critical.

I understand that several blogs have since then provided a pure RDBMs solution with indexes for this issue; however I am curious as to how this could be solved in MongoDB.


原文:https://stackoverflow.com/questions/2823165
更新时间:2022-06-15 10:06

最满意答案

itertools的文档说“它大致相当于生成器表达式中的嵌套for循环”。 因此,itertools.product永远不会成为内存的敌人,但如果将结果存储在列表中,那么该列表就是。 因此:

for element in itertools.product(...):
    print element

没关系,但是

myList = [element for itertools.product(...)]

或等效的循环

for element in itertools.product(...):
    myList.append(element)

不是! 所以你希望itertools为你生成结果,但是你不想存储它们,而是在它们生成时使用它们。 想想这行代码:

c = [str(x)+y for x,y in itertools.product(nums,chars)]

鉴于nums和chars可以是巨大的列表,在它们之上构建另一个巨大的所有组合列表肯定会扼杀你的系统。

现在,正如评论中所提到的,如果你用生成器替换掉所有太胖而无法容纳到内存中列表 (刚刚产生的函数),内存就不再是一个问题了。

这是我的完整代码。 我基本上将你的字符和nums列表更改为生成器,并删除了c的最终列表。

import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

def char(a):
    for combination in itertools.product(char_range(str(a[0]),str(a[1])),repeat=4):
        yield ''.join(map(str, combination))

def num(n):
    for combination in itertools.product(range(n),repeat=4):
        yield ''.join(map(str, combination))

def final(one,two):
    for foo in char(one):
        for bar in num(two):
            print str(bar)+str(foo)

现在让我们问一下['a','b']和范围(2)的每个组合是什么:

final(['a','b'],2)

产生这个:

0000aaaa
0001aaaa
0010aaaa
0011aaaa
0100aaaa
0101aaaa
0110aaaa
0111aaaa
1000aaaa
1001aaaa
1010aaaa
1011aaaa
1100aaaa
1101aaaa
1110aaaa
1111aaaa
0000aaab
0001aaab
0010aaab
0011aaab
0100aaab
0101aaab
0110aaab
0111aaab
1000aaab
1001aaab
1010aaab
1011aaab
1100aaab
1101aaab
1110aaab
1111aaab
0000aaba
0001aaba
0010aaba
0011aaba
0100aaba
0101aaba
0110aaba
0111aaba
1000aaba
1001aaba
1010aaba
1011aaba
1100aaba
1101aaba
1110aaba
1111aaba
0000aabb
0001aabb
0010aabb
0011aabb
0100aabb
0101aabb
0110aabb
0111aabb
1000aabb
1001aabb
1010aabb
1011aabb
1100aabb
1101aabb
1110aabb
1111aabb
0000abaa
0001abaa
0010abaa
0011abaa
0100abaa
0101abaa
0110abaa
0111abaa
1000abaa
1001abaa
1010abaa
1011abaa
1100abaa
1101abaa
1110abaa
1111abaa
0000abab
0001abab
0010abab
0011abab
0100abab
0101abab
0110abab
0111abab
1000abab
1001abab
1010abab
1011abab
1100abab
1101abab
1110abab
1111abab
0000abba
0001abba
0010abba
0011abba
0100abba
0101abba
0110abba
0111abba
1000abba
1001abba
1010abba
1011abba
1100abba
1101abba
1110abba
1111abba
0000abbb
0001abbb
0010abbb
0011abbb
0100abbb
0101abbb
0110abbb
0111abbb
1000abbb
1001abbb
1010abbb
1011abbb
1100abbb
1101abbb
1110abbb
1111abbb
0000baaa
0001baaa
0010baaa
0011baaa
0100baaa
0101baaa
0110baaa
0111baaa
1000baaa
1001baaa
1010baaa
1011baaa
1100baaa
1101baaa
1110baaa
1111baaa
0000baab
0001baab
0010baab
0011baab
0100baab
0101baab
0110baab
0111baab
1000baab
1001baab
1010baab
1011baab
1100baab
1101baab
1110baab
1111baab
0000baba
0001baba
0010baba
0011baba
0100baba
0101baba
0110baba
0111baba
1000baba
1001baba
1010baba
1011baba
1100baba
1101baba
1110baba
1111baba
0000babb
0001babb
0010babb
0011babb
0100babb
0101babb
0110babb
0111babb
1000babb
1001babb
1010babb
1011babb
1100babb
1101babb
1110babb
1111babb
0000bbaa
0001bbaa
0010bbaa
0011bbaa
0100bbaa
0101bbaa
0110bbaa
0111bbaa
1000bbaa
1001bbaa
1010bbaa
1011bbaa
1100bbaa
1101bbaa
1110bbaa
1111bbaa
0000bbab
0001bbab
0010bbab
0011bbab
0100bbab
0101bbab
0110bbab
0111bbab
1000bbab
1001bbab
1010bbab
1011bbab
1100bbab
1101bbab
1110bbab
1111bbab
0000bbba
0001bbba
0010bbba
0011bbba
0100bbba
0101bbba
0110bbba
0111bbba
1000bbba
1001bbba
1010bbba
1011bbba
1100bbba
1101bbba
1110bbba
1111bbba
0000bbbb
0001bbbb
0010bbbb
0011bbbb
0100bbbb
0101bbbb
0110bbbb
0111bbbb
1000bbbb
1001bbbb
1010bbbb
1011bbbb
1100bbbb
1101bbbb
1110bbbb
1111bbbb

这是您正在寻找的确切结果。 此结果的每个元素都是动态生成的,因此永远不会产生内存问题。 您现在可以尝试看到更大的操作,例如final(['a','z'],10)对CPU友好。


The documentation of itertools says that "it is roughly equivalent to nested for-loops in a generator expression". So itertools.product is never an enemy of memory, but if you store its results in a list, that list is. Therefore:

for element in itertools.product(...):
    print element

is okay, but

myList = [element for itertools.product(...)]

or the equivalent loop of

for element in itertools.product(...):
    myList.append(element)

is not! So you want itertools to generate results for you, but you don't want to store them, rather use them as they are generated. Think about this line of your code:

c = [str(x)+y for x,y in itertools.product(nums,chars)]

Given that nums and chars can be huge lists, building another gigantic list of all combinations on top of them is definitely going to choke your system.

Now, as mentioned in the comments, if you replace all the lists that are too fat to fit into the memory with generators (functions that just yield), memory is not going to be a concern anymore.

Here is my full code. I basically changed your lists of chars and nums to generators, and got rid of the final list of c.

import itertools

def char_range(c1, c2):
    """Generates the characters from `c1` to `c2`"""
    for c in range(ord(c1), ord(c2)+1):
        yield chr(c)

def char(a):
    for combination in itertools.product(char_range(str(a[0]),str(a[1])),repeat=4):
        yield ''.join(map(str, combination))

def num(n):
    for combination in itertools.product(range(n),repeat=4):
        yield ''.join(map(str, combination))

def final(one,two):
    for foo in char(one):
        for bar in num(two):
            print str(bar)+str(foo)

Now let's ask what every combination of ['a','b'] and range(2) is:

final(['a','b'],2)

Produces this:

0000aaaa
0001aaaa
0010aaaa
0011aaaa
0100aaaa
0101aaaa
0110aaaa
0111aaaa
1000aaaa
1001aaaa
1010aaaa
1011aaaa
1100aaaa
1101aaaa
1110aaaa
1111aaaa
0000aaab
0001aaab
0010aaab
0011aaab
0100aaab
0101aaab
0110aaab
0111aaab
1000aaab
1001aaab
1010aaab
1011aaab
1100aaab
1101aaab
1110aaab
1111aaab
0000aaba
0001aaba
0010aaba
0011aaba
0100aaba
0101aaba
0110aaba
0111aaba
1000aaba
1001aaba
1010aaba
1011aaba
1100aaba
1101aaba
1110aaba
1111aaba
0000aabb
0001aabb
0010aabb
0011aabb
0100aabb
0101aabb
0110aabb
0111aabb
1000aabb
1001aabb
1010aabb
1011aabb
1100aabb
1101aabb
1110aabb
1111aabb
0000abaa
0001abaa
0010abaa
0011abaa
0100abaa
0101abaa
0110abaa
0111abaa
1000abaa
1001abaa
1010abaa
1011abaa
1100abaa
1101abaa
1110abaa
1111abaa
0000abab
0001abab
0010abab
0011abab
0100abab
0101abab
0110abab
0111abab
1000abab
1001abab
1010abab
1011abab
1100abab
1101abab
1110abab
1111abab
0000abba
0001abba
0010abba
0011abba
0100abba
0101abba
0110abba
0111abba
1000abba
1001abba
1010abba
1011abba
1100abba
1101abba
1110abba
1111abba
0000abbb
0001abbb
0010abbb
0011abbb
0100abbb
0101abbb
0110abbb
0111abbb
1000abbb
1001abbb
1010abbb
1011abbb
1100abbb
1101abbb
1110abbb
1111abbb
0000baaa
0001baaa
0010baaa
0011baaa
0100baaa
0101baaa
0110baaa
0111baaa
1000baaa
1001baaa
1010baaa
1011baaa
1100baaa
1101baaa
1110baaa
1111baaa
0000baab
0001baab
0010baab
0011baab
0100baab
0101baab
0110baab
0111baab
1000baab
1001baab
1010baab
1011baab
1100baab
1101baab
1110baab
1111baab
0000baba
0001baba
0010baba
0011baba
0100baba
0101baba
0110baba
0111baba
1000baba
1001baba
1010baba
1011baba
1100baba
1101baba
1110baba
1111baba
0000babb
0001babb
0010babb
0011babb
0100babb
0101babb
0110babb
0111babb
1000babb
1001babb
1010babb
1011babb
1100babb
1101babb
1110babb
1111babb
0000bbaa
0001bbaa
0010bbaa
0011bbaa
0100bbaa
0101bbaa
0110bbaa
0111bbaa
1000bbaa
1001bbaa
1010bbaa
1011bbaa
1100bbaa
1101bbaa
1110bbaa
1111bbaa
0000bbab
0001bbab
0010bbab
0011bbab
0100bbab
0101bbab
0110bbab
0111bbab
1000bbab
1001bbab
1010bbab
1011bbab
1100bbab
1101bbab
1110bbab
1111bbab
0000bbba
0001bbba
0010bbba
0011bbba
0100bbba
0101bbba
0110bbba
0111bbba
1000bbba
1001bbba
1010bbba
1011bbba
1100bbba
1101bbba
1110bbba
1111bbba
0000bbbb
0001bbbb
0010bbbb
0011bbbb
0100bbbb
0101bbbb
0110bbbb
0111bbbb
1000bbbb
1001bbbb
1010bbbb
1011bbbb
1100bbbb
1101bbbb
1110bbbb
1111bbbb

Which is the exact result you are looking for. Each element of this result is generated on the fly, hence never creates a memory problem. You can now try and see that much bigger operations such as final(['a','z'],10) are CPU-friendly.

相关问答

更多

相关文章

更多

最新问答

更多
  • 获取MVC 4使用的DisplayMode后缀(Get the DisplayMode Suffix being used by MVC 4)
  • 如何通过引用返回对象?(How is returning an object by reference possible?)
  • 矩阵如何存储在内存中?(How are matrices stored in memory?)
  • 每个请求的Java新会话?(Java New Session For Each Request?)
  • css:浮动div中重叠的标题h1(css: overlapping headlines h1 in floated divs)
  • 无论图像如何,Caffe预测同一类(Caffe predicts same class regardless of image)
  • xcode语法颜色编码解释?(xcode syntax color coding explained?)
  • 在Access 2010 Runtime中使用Office 2000校对工具(Use Office 2000 proofing tools in Access 2010 Runtime)
  • 从单独的Web主机将图像传输到服务器上(Getting images onto server from separate web host)
  • 从旧版本复制文件并保留它们(旧/新版本)(Copy a file from old revision and keep both of them (old / new revision))
  • 西安哪有PLC可控制编程的培训
  • 在Entity Framework中选择基类(Select base class in Entity Framework)
  • 在Android中出现错误“数据集和渲染器应该不为null,并且应该具有相同数量的系列”(Error “Dataset and renderer should be not null and should have the same number of series” in Android)
  • 电脑二级VF有什么用
  • Datamapper Ruby如何添加Hook方法(Datamapper Ruby How to add Hook Method)
  • 金华英语角.
  • 手机软件如何制作
  • 用于Android webview中图像保存的上下文菜单(Context Menu for Image Saving in an Android webview)
  • 注意:未定义的偏移量:PHP(Notice: Undefined offset: PHP)
  • 如何读R中的大数据集[复制](How to read large dataset in R [duplicate])
  • Unity 5 Heighmap与地形宽度/地形长度的分辨率关系?(Unity 5 Heighmap Resolution relationship to terrain width / terrain length?)
  • 如何通知PipedOutputStream线程写入最后一个字节的PipedInputStream线程?(How to notify PipedInputStream thread that PipedOutputStream thread has written last byte?)
  • python的访问器方法有哪些
  • DeviceNetworkInformation:哪个是哪个?(DeviceNetworkInformation: Which is which?)
  • 在Ruby中对组合进行排序(Sorting a combination in Ruby)
  • 网站开发的流程?
  • 使用Zend Framework 2中的JOIN sql检索数据(Retrieve data using JOIN sql in Zend Framework 2)
  • 条带格式类型格式模式编号无法正常工作(Stripes format type format pattern number not working properly)
  • 透明度错误IE11(Transparency bug IE11)
  • linux的基本操作命令。。。