According to the CUDA 5.0 Programming Guide, if I am using both L1 and L2 caching (on Fermi or Kepler), all global memory operations are performed using 128-byte memory transactions. However, if I am using L2 only, 32-byte memory transactions are used (section F.4.2).

Let us assume that all caches are empty. If every thread of a warp accesses a single 4-byte word in a perfectly aligned fashion, this will result in one 128-byte transaction in the L1+L2 case and four 32-byte transactions in the L2-only case. Is that right?

My question is: are the four 32-byte transactions any slower than a single 128-byte transaction? My intuition from pre-Fermi hardware suggests that they would be slower, but perhaps this is no longer true on newer hardware? Or should I just look at bandwidth utilization to judge the efficiency of my memory accesses?
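
To make the two cases concrete, here is a minimal host-side sketch of the counting I have in mind. It is only a simplified model, not a simulation of the hardware: it assumes one transaction per distinct aligned segment touched by the warp. The function name `count_transactions` and the base address `0x1000` are hypothetical and chosen purely for illustration.

```python
def count_transactions(addresses, segment_bytes, access_bytes=4):
    """Count the distinct aligned segments touched by one warp's accesses.

    Simplified model (an assumption): each distinct segment of
    `segment_bytes` bytes costs exactly one memory transaction.
    """
    segments = set()
    for addr in addresses:
        first = addr // segment_bytes
        last = (addr + access_bytes - 1) // segment_bytes
        segments.update(range(first, last + 1))
    return len(segments)


WARP_SIZE = 32
BASE = 0x1000  # hypothetical 128-byte-aligned base address

# Each lane reads one 4-byte word; consecutive lanes read consecutive words.
aligned = [BASE + 4 * lane for lane in range(WARP_SIZE)]
# Same pattern shifted by one word, so it straddles a 128-byte boundary.
shifted = [BASE + 4 + 4 * lane for lane in range(WARP_SIZE)]

for name, addrs in (("aligned", aligned), ("shifted", shifted)):
    for seg in (128, 32):
        n = count_transactions(addrs, seg)
        print(f"{name}: {n} x {seg}B = {n * seg} bytes moved")
```

Under this model the aligned pattern moves 128 bytes either way (1x128B or 4x32B), while the shifted pattern moves 256 bytes with 128-byte segments and 160 bytes with 32-byte segments. It says nothing about whether four 32B transactions take longer than one 128B transaction, which is the part I am unsure about.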

Updated: 2022-04-05 08:04


