Matrix multiplication is a compute-, memory-, and cache-intensive algorithm. It performs O(N³) operations, requires storing O(N²) elements, and accesses each element O(N) times, where N is the matrix size. Implementations of cache-intensive algorithms can achieve speedups through cache memory behavior if the algorithm frequently reuses data. When the working set exceeds the cache size, an already stored block is replaced; a cache miss occurs when the data of a replaced block is needed again. Several cache replacement policies have been proposed to speed up different program executions. In this paper we analyze and compare the two most widely implemented cache replacement policies, First-In-First-Out (FIFO) and Least-Recently-Used (LRU). The experimental results identify the optimal choice for the sequential and parallel dense matrix multiplication algorithms. Since the number of operations does not depend on the cache replacement policy, we define and determine the average number of memory cycles per instruction the algorithm incurs, as this is what most strongly affects performance.
Keywords: FIFO, HPC, LRU, Performance, Speedup
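As a minimal illustration of the comparison (a sketch, not the authors' simulator), the C program below replays the load trace of a small ijk dense matrix multiplication through a toy fully-associative cache under both FIFO and LRU replacement and counts the misses each policy produces. The matrix size N, the cache line size, the cache capacity, and the restriction to loads of A and B only are all illustrative assumptions.

```c
/* Toy miss-count comparison of FIFO vs. LRU replacement on the memory
 * access stream of a dense ijk matrix multiplication. All sizes below
 * are assumed example values, not the configuration used in the paper. */
#include <stdio.h>
#include <string.h>

#define N      64   /* assumed matrix size                      */
#define ELEM    8   /* bytes per double-precision element       */
#define LINE   64   /* assumed cache line size in bytes         */
#define NLINES 64   /* assumed cache capacity in lines (4 KiB)  */

typedef struct {
    long tag[NLINES];   /* block address stored in each line                 */
    long meta[NLINES];  /* FIFO: insertion time; LRU: time of last use       */
    long clock;         /* logical time, advanced on every access            */
    long misses;
    int  is_lru;        /* 1 = LRU policy, 0 = FIFO policy                   */
} Cache;

static void cache_init(Cache *c, int is_lru) {
    memset(c, 0, sizeof *c);
    for (int i = 0; i < NLINES; i++) c->tag[i] = -1; /* all lines empty */
    c->is_lru = is_lru;
}

static void cache_access(Cache *c, long addr) {
    long block = addr / LINE;
    c->clock++;
    for (int i = 0; i < NLINES; i++)
        if (c->tag[i] == block) {
            /* Hit: LRU refreshes the line's timestamp, FIFO does not. */
            if (c->is_lru) c->meta[i] = c->clock;
            return;
        }
    /* Miss: evict the line with the smallest metadata value, i.e. the
     * oldest insertion (FIFO) or the least recently used line (LRU). */
    int victim = 0;
    for (int i = 1; i < NLINES; i++)
        if (c->meta[i] < c->meta[victim]) victim = i;
    c->tag[victim]  = block;
    c->meta[victim] = c->clock;
    c->misses++;
}

int main(void) {
    Cache fifo, lru;
    cache_init(&fifo, 0);
    cache_init(&lru, 1);

    /* Replay the loads of the classic triple loop C = A * B, using
     * synthetic addresses: A occupies bytes [0, N*N*ELEM) and B the
     * following N*N*ELEM bytes. Stores to C are omitted for brevity. */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            for (int k = 0; k < N; k++) {
                long a = ((long)i * N + k) * ELEM;               /* A[i][k] */
                long b = ((long)N * N + (long)k * N + j) * ELEM; /* B[k][j] */
                cache_access(&fifo, a); cache_access(&fifo, b);
                cache_access(&lru,  a); cache_access(&lru,  b);
            }

    printf("FIFO misses: %ld\n", fifo.misses);
    printf("LRU  misses: %ld\n", lru.misses);
    return 0;
}
```

Because the column-wise accesses to B stride through memory, a single column already spans many cache lines, so the two policies evict different lines and produce different miss counts on the same O(N³) access stream; this is the effect the paper quantifies through the average memory cycles per instruction.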