GPUs deliver high performance for algorithms that demand intensive computation. On the latest Kepler GPU architecture, a developer can configure the L1 cache memory with different cache sizes and set associativities per Streaming Multiprocessor (SM). These cache parameters can affect the performance of computation-intensive algorithms. In this paper, we evaluate how every possible configuration of L1 cache size and associativity influences the performance of the dense matrix-matrix multiplication algorithm across various problem sizes. The results show that the L1 cache configuration has only a small impact on the overall performance of the algorithm.
Cache Memory, SIMD, GPGPU.