Memory Efficient Alternatives to Python Dictionaries

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc.
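For reference, the three-level layout described above might look like the following minimal sketch. It assumes "default dictionary" means `collections.defaultdict`; the helper names `make_triplet_counts` and `count_triplets` and the sample sentence are made up for illustration.

```python
from collections import defaultdict


def make_triplet_counts():
    # Three levels deep: word1 -> word2 -> word3 -> count.
    return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def count_triplets(words, counts):
    # Slide a window of three consecutive words over the token list.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        counts[w1][w2][w3] += 1
    return counts


words = "the cat sat on the cat sat down".split()
topDict = count_triplets(words, make_triplet_counts())

print(topDict["the"]["cat"]["sat"])   # times "the cat sat" appears
print(sorted(topDict["cat"]["sat"]))  # words seen after "cat sat"
```

Every distinct word1 and every distinct (word1, word2) pair gets its own dictionary object here, which is where the overhead discussed next comes from.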

This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead.

My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries.
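One rough way to check that suspicion is to add up the size of the dictionary objects alone. The sketch below is only an approximation: `sys.getsizeof` reports the container itself, not the keys or counts it refers to, and the numbers vary by Python version and platform. The helper name `dict_overhead` is hypothetical.

```python
import sys
from collections import defaultdict


def dict_overhead(d):
    """Return (number of dicts, total container bytes) for nested dicts.

    Only the dict objects themselves are measured; the strings used as
    keys and the integer counts are deliberately left out.
    """
    n_dicts, total_bytes = 1, sys.getsizeof(d)
    for value in d.values():
        if isinstance(value, dict):
            sub_n, sub_bytes = dict_overhead(value)
            n_dicts += sub_n
            total_bytes += sub_bytes
    return n_dicts, total_bytes


nested = defaultdict(lambda: defaultdict(lambda: defaultdict(int)))
nested["a"]["b"]["c"] += 1
nested["a"]["d"]["e"] += 1

n_dicts, total_bytes = dict_overhead(nested)
print(n_dicts, "dicts,", total_bytes, "bytes of container overhead")
```

Two triplets already need four dictionary objects in this layout, so the per-dictionary cost is paid many times over on a real corpus.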

From what I know of data structures, a balanced binary search tree using something like red-black or AVL would probably be ideal, but I would really prefer not to implement them myself. If possible, I'd prefer to stick with standard python libraries, but I'm definitely open to other alternatives if they would work best.

So, does anyone have any suggestions for me?

Edited to add:

Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).
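The two-level variant mentioned here, with the first two words condensed into a tuple key, could be sketched as follows. It keeps the `topDict[word1, word2].keys()` lookup; the sample words are made up, and nothing here is claimed to reduce memory use (the post reports that it did not help much).

```python
from collections import defaultdict

# (word1, word2) -> word3 -> count
topDict = defaultdict(lambda: defaultdict(int))

words = "a b c a b d a b c".split()
for w1, w2, w3 in zip(words, words[1:], words[2:]):
    topDict[w1, w2][w3] += 1

# All third words seen after the pair ("a", "b").
print(sorted(topDict["a", "b"].keys()))
print(topDict["a", "b"]["c"])
```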

The current dataset I am playing around with is the most recent version of Wikipedia For Schools. The results of parsing the first thousand pages, for example, is something like 11MB for a text file where each line is the three words and the count all tab separated. Storing the text in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.


Source: https://stackoverflow.com/questions/327223
Updated: 2023-11-25 13:11

Best answer

Those symbols do not carry any meaning of their own; they are only there to make the output of the tree easier to read!

Here's a more complex output, for a spring-webmvc dependency, that shows better what they do:

[INFO] +- org.springframework:spring-webmvc:jar:4.2.2.RELEASE:compile
[INFO] |  +- org.springframework:spring-beans:jar:4.2.2.RELEASE:compile
[INFO] |  +- org.springframework:spring-context:jar:4.2.2.RELEASE:compile
[INFO] |  |  \- org.springframework:spring-aop:jar:4.2.2.RELEASE:compile
[INFO] |  |     \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  +- org.springframework:spring-core:jar:4.2.2.RELEASE:compile
[INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  +- org.springframework:spring-expression:jar:4.2.2.RELEASE:compile

Consider the dependency tree as levels: the first level corresponds to the direct dependencies; the second level corresponds to the transitive dependencies of those direct dependencies, etc.

Basically, if an artifact has more than one dependency on the same level, each dependency that is followed by another sibling is shown with +-, and the last one is shown with \-, indicating an "end" of that branch of the tree (i.e. a path leading to a leaf).
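The convention can be reproduced in a few lines. This is only an illustrative sketch of the drawing rule, not how Maven implements it; the function name `render_tree` and the shortened artifact names are assumptions made for the example.

```python
def render_tree(children, prefix=""):
    """Return the output lines for a list of (name, sub_children) pairs."""
    lines = []
    for index, (name, sub_children) in enumerate(children):
        is_last = index == len(children) - 1
        # The last sibling closes the branch with "\-"; the others use "+-".
        connector = "\\- " if is_last else "+- "
        lines.append(prefix + connector + name)
        # Under a non-last node, keep a vertical bar so that the siblings
        # that follow still look attached to the same parent.
        child_prefix = prefix + ("   " if is_last else "|  ")
        lines.extend(render_tree(sub_children, child_prefix))
    return lines


tree = [
    ("spring-webmvc", [
        ("spring-beans", []),
        ("spring-context", [("spring-aop", [("aopalliance", [])])]),
        ("spring-core", [("commons-logging", [])]),
        ("spring-expression", []),
    ]),
]

lines = render_tree(tree)
print("\n".join(lines))
```

In this sketch spring-expression is the last child, so it is drawn with \-, whereas any node with siblings after it gets +-.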
