
Memory Efficient Alternatives to Python Dictionaries

In one of my current side projects, I am scanning through some text looking at the frequency of word triplets. In my first go at it, I used the default dictionary three levels deep. In other words, topDict[word1][word2][word3] returns the number of times these words appear in the text, topDict[word1][word2] returns a dictionary with all the words that appeared following words 1 and 2, etc. 
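For illustration, the three-level structure described above might look roughly like the sketch below. The helper names (make_top_dict, count_triplets) and the sample sentence are made up for this example and are not taken from the actual project.

```python
from collections import defaultdict


def make_top_dict():
    # Three levels of nesting: word1 -> word2 -> word3 -> count
    return defaultdict(lambda: defaultdict(lambda: defaultdict(int)))


def count_triplets(words, top_dict):
    # Slide a window of three consecutive words over the token list.
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        top_dict[w1][w2][w3] += 1
    return top_dict


words = "the cat sat on the cat sat down".split()
top_dict = count_triplets(words, make_top_dict())

print(top_dict["the"]["cat"]["sat"])   # number of times "the cat sat" occurs
print(sorted(top_dict["cat"]["sat"]))  # every word seen after "cat sat"
```

Each distinct word1 and each distinct (word1, word2) pair gets its own dictionary object here, which is where the per-dictionary overhead adds up.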
This functions correctly, but it is very memory intensive. In my initial tests it used something like 20 times the memory of just storing the triplets in a text file, which seems like an overly large amount of memory overhead. 
My suspicion is that many of these dictionaries are being created with many more slots than are actually being used, so I want to replace the dictionaries with something else that is more memory efficient when used in this manner. I would strongly prefer a solution that allows key lookups along the lines of the dictionaries. 
From what I know of data structures, a balanced binary search tree such as a red-black or AVL tree would probably be ideal, but I would really prefer not to implement one myself. If possible, I'd prefer to stick with the standard Python library, but I'm definitely open to other alternatives if they would work best. 
So, does anyone have any suggestions for me? 
Edited to add: 
Thanks for the responses so far. A few of the answers so far have suggested using tuples, which didn't really do much for me when I condensed the first two words into a tuple. I am hesitant to use all three as a key since I want it to be easy to look up all third words given the first two. (i.e. I want something like the result of topDict[word1, word2].keys()).  
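To make the tuple variant concrete, here is a minimal sketch of the two-level layout, with the first two words condensed into a tuple key. The names count_by_pair and pair_dict are hypothetical and only serve this example.

```python
from collections import defaultdict


def count_by_pair(words):
    # Outer key is the (word1, word2) tuple, inner key is word3.
    pair_dict = defaultdict(lambda: defaultdict(int))
    for w1, w2, w3 in zip(words, words[1:], words[2:]):
        pair_dict[w1, w2][w3] += 1
    return pair_dict


pair_dict = count_by_pair("a b c a b d".split())

# All third words seen after the pair ("a", "b").
print(sorted(pair_dict["a", "b"].keys()))
```

This keeps the lookup of all third words for a given pair cheap, but there is still one inner dictionary per distinct pair, so the saving over three levels is limited.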
The current dataset I am playing around with is the most recent version of Wikipedia For Schools. Parsing the first thousand pages, for example, produces a text file of about 11MB in which each line holds the three words and the count, all tab-separated. Storing the same data in the dictionary format I am now using takes around 185MB. I know that there will be some additional overhead for pointers and whatnot, but the difference seems excessive.

Source: https://stackoverflow.com/questions/327223

Updated: 2023-11-25 13:11

Best answer

Those symbols carry no meaning of their own; they are only there to make the tree output easier to read. 
Here is a more complex output, for a spring-webmvc dependency, that shows the behavior more clearly: 
[INFO] +- org.springframework:spring-webmvc:jar:4.2.2.RELEASE:compile
[INFO] |  +- org.springframework:spring-beans:jar:4.2.2.RELEASE:compile
[INFO] |  +- org.springframework:spring-context:jar:4.2.2.RELEASE:compile
[INFO] |  |  \- org.springframework:spring-aop:jar:4.2.2.RELEASE:compile
[INFO] |  |     \- aopalliance:aopalliance:jar:1.0:compile
[INFO] |  +- org.springframework:spring-core:jar:4.2.2.RELEASE:compile
[INFO] |  |  \- commons-logging:commons-logging:jar:1.2:compile
[INFO] |  +- org.springframework:spring-expression:jar:4.2.2.RELEASE:compile
 
Think of the dependency tree in levels: the first level corresponds to the direct dependencies, the second level corresponds to the transitive dependencies of those direct dependencies, and so on. 
Basically, if more dependencies follow on the same level under the same artifact, a +- is shown; otherwise a \- is shown, indicating an "end" of the tree (i.e. a path leading to a leaf).
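As a rough illustration of the level idea, the sketch below reads one line of such output and reports its level and whether it is the last entry on that level. It assumes each indentation step is exactly three characters wide, as in the sample output above; parse_tree_line is a hypothetical helper, not part of Maven.

```python
PREFIX = "[INFO] "


def parse_tree_line(line):
    # Drop the log prefix if present.
    if line.startswith(PREFIX):
        line = line[len(PREFIX):]
    # Locate whichever branch marker appears on this line.
    positions = [p for p in (line.find("+- "), line.find("\\- ")) if p != -1]
    if not positions:
        return None  # no marker, e.g. the root artifact line
    pos = min(positions)
    level = pos // 3 + 1            # assumption: 3 characters per indent step
    is_last = line[pos] == "\\"     # "\-" closes the current level
    return level, is_last, line[pos + 3:]


print(parse_tree_line(r"[INFO] +- org.springframework:spring-webmvc:jar:4.2.2.RELEASE:compile"))
print(parse_tree_line(r"[INFO] |  |  \- org.springframework:spring-aop:jar:4.2.2.RELEASE:compile"))
```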

