Optimize Python CSV processing into parent and EAV child table

There have been several similar questions online about processing large CSV files into multiple PostgreSQL tables with Python. However, none seem to address a couple of concerns around optimizing database reads/writes and system memory/processing.

Say I have a row of product data that looks like this:

name,sku,datetime,decimal,decimal,decimal,decimal,decimal,decimal

The name and sku are stored in one table (the parent), and each decimal field is stored in a child EAV table that essentially contains the decimal, parent_id, and datetime.

Let's say I have 20,000 of these rows in a CSV file, so I end up chunking them. Right now, I take chunks of 2,000 rows and loop over each chunk line by line. Each iteration checks whether the product exists, creates it if not, and retrieves the parent_id. Then I generate a large list of insert statements for the child table with the decimal values. If the user has chosen to overwrite only non-modified decimal values, each individual decimal value is also checked to see whether it has been modified before it is added to the insert list.
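The chunking step described above can be sketched roughly as follows. This is a minimal illustration, not the asker's actual code: the helper name `iter_chunks` and the sample rows are hypothetical, and the chunk size is set to 2 only to keep the example small.

```python
import csv
import io
from itertools import islice


def iter_chunks(rows, size):
    """Yield lists of at most `size` rows from any row iterator."""
    rows = iter(rows)
    while True:
        chunk = list(islice(rows, size))
        if not chunk:
            return
        yield chunk


# Hypothetical sample data in the layout name,sku,datetime,decimal x 6.
sample = io.StringIO(
    "Widget,W-1,2017-06-29T00:00:00,1.0,2.0,3.0,4.0,5.0,6.0\n"
    "Gadget,G-1,2017-06-29T00:00:00,1.5,2.5,3.5,4.5,5.5,6.5\n"
    "Widget,W-1,2017-06-30T00:00:00,7.0,8.0,9.0,10.0,11.0,12.0\n"
)

# csv.reader is lazy, so only one chunk of rows is held in memory at a time.
chunk_sizes = [len(chunk) for chunk in iter_chunks(csv.reader(sample), 2)]
print(chunk_sizes)
```

Because the reader is consumed lazily, peak memory depends on the chunk size rather than on the size of the file.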

In this example, in the worst case I'd end up doing 160,000 database reads and anywhere from 10 to 20,010 writes. I'd also be storing up to 12,000 insert statements in a list in memory for each chunk (however, this would only be one list, so that part isn't as bad).
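One way to arrive at these figures is sketched below. The breakdown is an assumption, since the question does not spell it out: two product reads per row (an existence check plus fetching the parent_id), one read per decimal value for the "modified" check, one batched child write per chunk, and at most one product insert per row.

```python
ROWS = 20_000
CHUNK_SIZE = 2_000
DECIMALS_PER_ROW = 6

chunks = ROWS // CHUNK_SIZE
inserts_per_chunk = CHUNK_SIZE * DECIMALS_PER_ROW

# Assumption: 2 reads per row for the product, plus 1 read per decimal value.
PRODUCT_READS_PER_ROW = 2
worst_case_reads = ROWS * (PRODUCT_READS_PER_ROW + DECIMALS_PER_ROW)

# Assumption: one batched child write per chunk; products add 0..ROWS writes.
min_writes = chunks
max_writes = chunks + ROWS

print(chunks, inserts_per_chunk, worst_case_reads, min_writes, max_writes)
```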

My main question is:

  • How can I optimize this to be faster, use fewer database operations (since this also affects network traffic), and use less processing and memory? I'd rather have slower processing if it saves on the other two, as those cost more money when translated to server/database processing pricing in something like AWS.

Some sub questions are:

  • Is there a way I can combine all the product reads/writes and replace them in the file before doing the decimals?
  • Should I use a smaller chunk size to help with memory?
  • Should I use threads or keep it linear?
  • Could I build a more efficient SQL query that creates the product if it does not exist and references it inline, thus moving some of the processing into SQL rather than Python?
  • Could I optimize the child insert statements to do something better than thousands of INSERT INTO statements?
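To make the batching idea behind the sub-questions concrete, here is a minimal sketch of one possible approach, not a definitive answer. It uses `sqlite3` from the standard library as a stand-in for PostgreSQL so that it runs anywhere; the table names, the `attribute` column, and the sample rows are all hypothetical. It does not cover the "only overwrite non-modified values" check.

```python
import sqlite3

# Hypothetical parsed chunk: (name, sku, datetime, [six decimal strings]).
rows = [
    ("Widget", "W-1", "2017-06-29T00:00:00", ["1.0", "2.0", "3.0", "4.0", "5.0", "6.0"]),
    ("Gadget", "G-1", "2017-06-29T00:00:00", ["1.5", "2.5", "3.5", "4.5", "5.5", "6.5"]),
    ("Widget", "W-1", "2017-06-30T00:00:00", ["7.0", "8.0", "9.0", "10.0", "11.0", "12.0"]),
]

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (
        id INTEGER PRIMARY KEY,
        name TEXT NOT NULL,
        sku TEXT NOT NULL UNIQUE
    );
    CREATE TABLE product_value (
        product_id INTEGER NOT NULL REFERENCES product(id),
        attribute TEXT NOT NULL,
        recorded_at TEXT NOT NULL,
        value TEXT NOT NULL
    );
    """
)

# Step 1: deduplicate products in Python, then create the missing ones in one batch.
products = {sku: name for name, sku, _, _ in rows}
conn.executemany(
    "INSERT OR IGNORE INTO product (name, sku) VALUES (?, ?)",
    [(name, sku) for sku, name in products.items()],
)

# Step 2: a single read maps every sku in the chunk to its parent id.
placeholders = ",".join("?" * len(products))
ids = dict(
    conn.execute(
        f"SELECT sku, id FROM product WHERE sku IN ({placeholders})",
        list(products),
    )
)

# Step 3: build parameter tuples (not SQL strings) and insert them in one batch.
child_rows = [
    (ids[sku], f"decimal_{i}", recorded_at, value)
    for _, sku, recorded_at, values in rows
    for i, value in enumerate(values, start=1)
]
conn.executemany(
    "INSERT INTO product_value (product_id, attribute, recorded_at, value) "
    "VALUES (?, ?, ?, ?)",
    child_rows,
)
conn.commit()

product_count = conn.execute("SELECT COUNT(*) FROM product").fetchone()[0]
value_count = conn.execute("SELECT COUNT(*) FROM product_value").fetchone()[0]
print(product_count, value_count)
```

On PostgreSQL the rough equivalents would be `INSERT ... ON CONFLICT DO NOTHING` for step 1 and a multi-row insert (for example `psycopg2.extras.execute_values`) or `COPY` for step 3. Whether that actually reduces cost for a given workload would need measuring.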

Source: https://stackoverflow.com/questions/44829082
Updated: 2023-08-10 18:08

Best answer

Which version of phpunit should I use with zf-1.12, and what new features won't I be able to use with an older version of phpunit?

You said your research came up with phpunit 3.4 for zf 1.12. I don't think you will notice much difference; all the main assertions will be there.

If I upgrade my code to ZF2, which version of phpunit would I be able to use?

If this is an option, do it! Not just for a newer version of phpunit, but for a newer version of Zend. You will also be able to use composer to manage your dependencies.

Please also let me know if there is a better way to unit test a zf application than using phpunit.

Phpunit is the standard and recommended way. For mocking, you could use Mockery with phpunit.

However, as an alternative, you could use phpspec.
