By Akhil Bagaria on 10 November 2013

CASE STUDY: PENTIUM 4 CACHE ORGANIZATION

  • All Pentium processors have two on-chip L1 caches – one for data and the other for instructions.
  • The Pentium 4 also has an L2 cache that feeds both of the L1 caches.
  • These are the components of the processor core for the Pentium 4:

    • Fetch/decode unit: Fetches program instructions from the L2 cache, decodes them into simpler micro-operations, and stores the results in the L1 instruction cache.
    • Out-of-order execution logic: This logic schedules the micro-operations held in the L1 instruction cache for execution. Because it schedules them on the basis of data dependencies and resource availability, the order of execution can differ from the order in which they were loaded into the L1 cache.
    • Execution units: These units execute the micro-operations, fetching the required data from the L1 data cache and temporarily storing the results in registers.
    • Memory subsystem: This subsystem consists of the L2 cache and the system bus; the L2 cache is consulted on an L1 miss, and the system bus is used when the L2 cache also misses.
  • Unlike most other processors, the Pentium 4's instruction cache sits between the instruction decode logic and the execution core. The reasoning behind this is:

    • The Pentium processor decodes Pentium machine instructions into simple RISC-like instructions called micro-operations.
    • These micro-operations are fixed length and simple to process, which enables the use of pipelining and scheduling techniques that enhance performance.
    • However, decoding the Pentium instructions into RISC-type micro-operations is itself cumbersome, because the Pentium instructions are neither fixed length nor simple.
    • It has been found that performance is enhanced if this decoding is done independently of the scheduling and the pipelining logic.
  • The data cache in the processor employs a write back policy but can be configured to use a write through policy.

  • The operating mode of the L1 data cache is controlled by 2 bits in one of the processor's control registers (the CD and NW bits in CR0).

  • The INVD Pentium instruction flushes the internal cache and invalidates the external cache.

  • The WBINVD instruction writes back and invalidates the internal cache, and then writes back and invalidates the external cache.

Key Elements of Cache Design:

  1. Cache Size:

We want the cache to be small enough that the overall average cost per bit is close to that of main memory alone, yet large enough that the average access time is close to that of the cache alone. Larger caches also tend to be slightly slower, because more gates are required to address the cache.

  2. Mapping Function:

Since the number of lines in the cache is smaller than the number of blocks in main memory, an algorithm is needed for mapping memory blocks to cache lines. The choice of mapping function dictates how the cache is organized. There are 3 techniques to achieve such a mapping:

a) Direct Mapping: Each block in memory is mapped to a single line in the cache.

The mapping is:

    i = j modulo m

    where i = cache line number
          j = main memory block number
          m = number of lines in the cache

Each block in memory gets sequentially mapped to a line in the cache. And when you get to a block in memory such that there are no more lines in the cache to map to, then you go back to the first line in the cache. This is achieved by the modulo function. This means that multiple blocks can get mapped to the same line in the cache. So how are they differentiated? By tagging the data that you map!
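To make the address breakdown concrete, here is a minimal C sketch of direct mapping. The block size and number of lines are assumed purely for illustration; they are not the Pentium 4's actual parameters.

    #include <stdio.h>
    #include <stdint.h>

    /* Assumed geometry, for illustration only. */
    #define BLOCK_SIZE 16    /* bytes per block        */
    #define NUM_LINES  128   /* lines in the cache (m) */

    int main(void) {
        uint32_t address = 0x0001A2C4;           /* example byte address */
        uint32_t block   = address / BLOCK_SIZE; /* block number j       */
        uint32_t line    = block % NUM_LINES;    /* i = j modulo m       */
        uint32_t tag     = block / NUM_LINES;    /* distinguishes blocks sharing a line */

        printf("block=%u line=%u tag=%u\n",
               (unsigned)block, (unsigned)line, (unsigned)tag);
        return 0;
    }

Every block whose number gives the same remainder modulo the number of lines lands in the same cache line; the tag (the remaining high-order bits of the block number) is what tells them apart.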

b) Associative Mapping: Each memory block is mapped to any line in the cache. The unique address of each block of memory forms the tag in each line of cache. Thus to see if a particular block is in the cache, you would have to search the tag field of all the lines in the cache until you hit.
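That search can be sketched in C as below; the line structure and cache size are assumed for illustration, and a real cache compares all tags in parallel in hardware rather than looping over them.

    #include <stdint.h>
    #include <stdbool.h>

    #define NUM_LINES 128   /* assumed cache size, for illustration */

    struct line { bool valid; uint32_t tag; /* data omitted */ };

    /* Returns the index of the matching line, or -1 on a miss. */
    int associative_lookup(const struct line cache[NUM_LINES], uint32_t block) {
        for (int i = 0; i < NUM_LINES; i++)
            if (cache[i].valid && cache[i].tag == block)
                return i;   /* hit: the tag field equals the block number */
        return -1;          /* miss: every line was searched              */
    }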

c) Set Associative Mapping: This mapping algorithm is a compromise between direct and associative mapping. The cache is divided into sets: each memory block maps to exactly one set (in the direct-mapping style, using the block number modulo the number of sets), but within that set the block may be placed in any line (in the associative style), so only the tags of that one set need to be searched.
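A short C sketch of the two-step placement, assuming for illustration a cache of 128 lines organised as 32 sets of 4 ways (4-way set associative):

    #include <stdint.h>
    #include <stdbool.h>

    /* Assumed geometry for illustration: 128 lines = 32 sets x 4 ways. */
    #define NUM_SETS 32
    #define WAYS     4

    struct line { bool valid; uint32_t tag; /* data omitted */ };

    /* Step 1 (direct-mapping style): a block maps to exactly one set. */
    static uint32_t set_index(uint32_t block) {
        return block % NUM_SETS;
    }

    /* Step 2 (associative style): within that set, only WAYS tags are searched. */
    static int set_lookup(const struct line cache[NUM_SETS][WAYS], uint32_t block) {
        uint32_t s   = set_index(block);
        uint32_t tag = block / NUM_SETS;   /* the remaining bits form the tag */
        for (int w = 0; w < WAYS; w++)
            if (cache[s][w].valid && cache[s][w].tag == tag)
                return w;                  /* hit in way w of set s */
        return -1;                         /* miss */
    }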

  3. Replacement Algorithms: When a new block is brought into the cache, one of the existing blocks must be replaced. For direct mapping, there is only one possible line for each block, so no choice is possible. For the other two techniques, a replacement algorithm is needed. The most effective replacement algorithm is LRU, or Least Recently Used: we replace the block in the set that has gone the longest without being referenced. This algorithm usually gives the highest cache hit rate.
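A minimal sketch of LRU victim selection for one set, assuming each line keeps a count of when it was last referenced; real hardware usually approximates this with a few status bits per set rather than full timestamps.

    #include <stdint.h>

    #define WAYS 4   /* assumed associativity, for illustration */

    struct way { uint32_t tag; uint64_t last_used; /* time of last reference */ };

    /* Return the way that has gone the longest without being referenced. */
    static int lru_victim(const struct way set[WAYS]) {
        int victim = 0;
        for (int w = 1; w < WAYS; w++)
            if (set[w].last_used < set[victim].last_used)
                victim = w;   /* older last reference, better victim */
        return victim;        /* this line is replaced on a miss     */
    }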

  4. Write Policy: Before a cache line can be replaced, we need to know whether it has been altered in the cache but not yet in main memory; if it has, the corresponding memory block must be written back before the line is overwritten. If main memory is allowed to fall out of step with the cache, other agents that access memory directly (such as I/O modules) can read stale data. Using write through, all write operations are made to main memory as well as to the cache, so memory is always valid. Using write back, a write updates only the cache line and sets its update (dirty) bit; when a block is replaced, it is written back to memory if and only if its update bit is set. The disadvantage is that portions of main memory may be temporarily invalid, so I/O accesses have to go via the cache, which makes the circuitry more complicated.
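The two policies can be contrasted in a short C sketch; the cache_line structure and the write_to_memory helper are assumptions made only for illustration.

    #include <stdint.h>
    #include <stdbool.h>

    struct cache_line { uint32_t data; bool dirty; /* the update bit */ };

    /* Stand-in for a write over the system bus (assumed helper). */
    static void write_to_memory(uint32_t addr, uint32_t value) {
        (void)addr; (void)value;
    }

    /* Write through: cache and main memory are updated together,
     * so memory always stays valid. */
    static void write_through(struct cache_line *line, uint32_t addr, uint32_t value) {
        line->data = value;
        write_to_memory(addr, value);
    }

    /* Write back: only the cache is updated and the update (dirty) bit is set. */
    static void write_back(struct cache_line *line, uint32_t value) {
        line->data = value;
        line->dirty = true;
    }

    /* On replacement, the block goes back to memory only if its update bit is set. */
    static void evict(struct cache_line *line, uint32_t addr) {
        if (line->dirty)
            write_to_memory(addr, line->data);
        line->dirty = false;
    }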

  5. Line Size/Block Size: When a block is retrieved from memory and placed into the cache, not only the desired word but some adjacent words in the block are retrieved as well. Because of the principle of locality, which says that data in the vicinity of recently referenced data has a higher chance of being referenced soon, increasing the block size initially increases the cache hit rate. But as the block grows larger still, each additional word is farther from the requested word and less likely to be needed soon, and fewer blocks fit in the cache, so the hit rate eventually begins to fall.

  6. Number of Caches: An on-chip cache reduces the processor's external bus activity and therefore reduces execution time and improves overall system performance. When requested data is found in the internal on-chip cache, no access to the external bus is needed. The short data paths internal to the processor are much quicker than the external buses, and every access they satisfy also leaves the external bus free for other traffic. A multi-level cache is usually more desirable: the on-chip cache is known as the Level 1 (L1) cache and the external cache is called the L2 cache. Going to the L2 cache in the event of an L1 miss is still much faster than going to main memory.

  7. Unified versus Split Caches: A single unified cache has the advantage of a higher hit rate than separate instruction and data caches of the same total size, because it automatically balances its contents between instructions and data: if an execution pattern involves many more instruction fetches than data fetches, the cache fills up with instructions accordingly. However, to support pipelined and parallel execution, modern processors use split caches, which eliminate contention between the instruction fetch/decode unit and the execution unit.

My Questions:

  • Fully understand why we need Split Caches as Unified Caches seem more efficient.
  • In the Out of order instruction execution logic, is it not a problem that the order is all mixed up?

