Anna University. Memory Hierarchy Design. The average memory access time, which includes the cost of cache misses, predicts processor performance. The first miss-penalty reduction technique is to add another level of cache between the original cache and memory. The first-level cache can be small enough to match the clock cycle time of the fast CPU, and the second-level cache can be large enough to capture many accesses that would otherwise go to main memory, thereby lessening the effective miss penalty.

Author:Duzshura Yozahn
Language:English (Spanish)
Published (Last):6 September 2012

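The average memory access time for a two-level cache is the first-level hit time plus the first-level miss rate times the first-level miss penalty, where the first-level miss penalty is in turn the second-level hit time plus the second-level miss rate times the second-level miss penalty. Two definitions avoid ambiguity about which miss rate is meant.

Local miss rate: The number of misses in a cache divided by the total number of memory accesses to this cache. As you would expect, for the first-level cache it is equal to Miss rate_L1 and for the second-level cache it is Miss rate_L2.

Global miss rate: The number of misses in the cache divided by the total number of memory accesses generated by the CPU. For the first-level cache this is still Miss rate_L1, but for the second-level cache it is Miss rate_L1 × Miss rate_L2.

The local miss rate is large for second-level caches because the first-level cache skims the cream of the memory accesses.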
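The definitions above can be checked with a small calculation. The following is a minimal sketch; the access counts and cycle times are illustrative assumptions, not measurements from the source, and the function names are hypothetical.

```python
def miss_rates(cpu_accesses, l1_misses, l2_misses):
    """Return (L1 miss rate, L2 local miss rate, L2 global miss rate)."""
    l1_rate = l1_misses / cpu_accesses      # L1 sees every CPU access
    l2_local = l2_misses / l1_misses        # L2 sees only the L1 misses
    l2_global = l2_misses / cpu_accesses    # fraction of CPU accesses that reach memory
    return l1_rate, l2_local, l2_global


def two_level_amat(hit_l1, miss_rate_l1, hit_l2, local_miss_rate_l2, penalty_l2):
    """Average memory access time, in clock cycles, for a two-level cache."""
    miss_penalty_l1 = hit_l2 + local_miss_rate_l2 * penalty_l2
    return hit_l1 + miss_rate_l1 * miss_penalty_l1


# Assumed workload: 1000 CPU accesses, 40 miss in L1, 20 of those also miss in L2.
l1_rate, l2_local, l2_global = miss_rates(1000, 40, 20)
# Assumed timings: 1-cycle L1 hit, 10-cycle L2 hit, 100-cycle L2 miss penalty.
amat = two_level_amat(1, l1_rate, 10, l2_local, 100)
print(l1_rate, l2_local, l2_global, round(amat, 2))
```

With these assumed numbers the second-level local miss rate is 50% while its global miss rate is only 2%, which illustrates why the local figure looks alarming and the global figure is the one that reflects traffic to memory.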

This is why the global miss rate is the more useful measure: it indicates what fraction of the memory accesses that leave the CPU go all the way to memory. Here is a place where the misses per instruction metric shines. Instead of confusion about local or global miss rates, we just expand memory stalls per instruction to add the impact of a second-level cache.
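One common way to expand memory stalls per instruction for two levels is to charge every first-level miss the second-level hit time and every second-level miss the second-level miss penalty. The sketch below assumes that form; the miss rates, the 1.5 memory references per instruction, and the cycle counts are illustrative assumptions.

```python
def stalls_per_instruction(l1_misses_per_instr, hit_l2, l2_misses_per_instr, penalty_l2):
    """Average memory stall cycles per instruction for a two-level cache."""
    return l1_misses_per_instr * hit_l2 + l2_misses_per_instr * penalty_l2


refs_per_instr = 1.5                  # assumed memory references per instruction
l1_mpi = 0.04 * refs_per_instr        # assumed 4% L1 miss rate
l2_mpi = 0.02 * refs_per_instr        # assumed 2% L2 global miss rate
stalls = stalls_per_instruction(l1_mpi, 10, l2_mpi, 100)
print(round(stalls, 2))
```

Because both terms are expressed per instruction, no decision between local and global miss rate is needed once the misses have been counted.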

The foremost difference between the two levels is that the speed of the first-level cache affects the clock rate of the CPU, while the speed of the second-level cache only affects the miss penalty of the first-level cache. The initial decision is the size of a second-level cache. Since everything in the first-level cache is likely to be in the second-level cache, the second-level cache should be much bigger than the first.

If second-level caches are just a little bigger, the local miss rate will be high.

Multilevel caches require extra hardware to reduce miss penalty, but the second technique, critical word first and early restart, does not. It is based on the observation that the CPU normally needs just one word of the block at a time.

Here are two specific strategies:

Critical word first: Request the missed word first from memory and send it to the CPU as soon as it arrives; let the CPU continue execution while filling the rest of the words in the block. Critical-word-first fetch is also called wrapped fetch and requested word first.

Early restart: Fetch the words in normal order, but as soon as the requested word of the block arrives, send it to the CPU and let the CPU continue execution.
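The two strategies differ only in the order in which words arrive. The sketch below compares them with a cache that waits for the whole block, under the simplifying assumptions that memory delivers one word per cycle after a fixed latency and that the block size, requested word, and latency are made-up values.

```python
def fetch_order(block_words, requested, critical_word_first):
    """Order in which the words of a block arrive from memory."""
    if critical_word_first:
        # Wrapped fetch: start at the requested word and wrap around the block.
        return [(requested + i) % block_words for i in range(block_words)]
    return list(range(block_words))


def cpu_wait(block_words, requested, latency, strategy):
    """Cycles until the CPU can resume after a miss.

    strategy is "blocking", "early_restart", or "critical_word_first".
    """
    order = fetch_order(block_words, requested, strategy == "critical_word_first")
    if strategy == "blocking":
        return latency + block_words              # wait for the whole block
    return latency + order.index(requested) + 1   # resume when the word arrives


# Assumed: 8-word block, CPU wants word 5, 10-cycle memory latency.
waits = {s: cpu_wait(8, 5, 10, s)
         for s in ("blocking", "early_restart", "critical_word_first")}
print(waits)
```

Early restart helps most when the requested word is near the start of the block, while critical word first gives the same short wait regardless of position.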

Generally these techniques only benefit designs with large cache blocks, since little time is saved when blocks are small. A further limit is that, given spatial locality, there is a better than random chance that the next miss is to the remainder of the block. In such cases, the effective miss penalty is the time from the miss until the second piece arrives.

The third technique gives priority to read misses over writes: it serves reads before earlier writes have been completed.

We start by looking at the complexities of a write buffer. With a write-through cache, the most important improvement is a write buffer of the proper size.

Write buffers, however, do complicate memory accesses in that they might hold the updated value of a location needed on a read miss. The simplest way out of this is for the read miss to wait until the write buffer is empty. The alternative is to check the contents of the write buffer on a read miss, and if there are no conflicts and the memory system is available, let the read miss continue.
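The second alternative can be sketched as follows. This is a minimal model under stated assumptions: the class and function names are hypothetical, the buffer holds one word per entry, and on a conflict the sketch forwards the buffered value to the read, which is only one possible way of resolving the conflict.

```python
from collections import OrderedDict


class WriteBuffer:
    """FIFO buffer of pending stores between a write-through cache and memory."""

    def __init__(self, capacity=4):
        self.capacity = capacity
        self.pending = OrderedDict()    # address -> value, oldest entry first

    def write(self, address, value):
        """Queue a store; return False if the buffer is full and the CPU must stall."""
        if address not in self.pending and len(self.pending) >= self.capacity:
            return False
        self.pending[address] = value
        return True

    def drain_one(self, memory):
        """Retire the oldest pending store to memory, if there is one."""
        if self.pending:
            address, value = self.pending.popitem(last=False)
            memory[address] = value


def read_miss(address, buffer, memory):
    """Serve a read miss without waiting for the write buffer to empty."""
    if address in buffer.pending:
        return buffer.pending[address]  # conflict: the newest value is in the buffer
    return memory.get(address, 0)       # no conflict: the read bypasses queued writes


memory = {512: 1, 1024: 7}
buffer = WriteBuffer()
buffer.write(512, 99)                   # store still waiting in the buffer
print(read_miss(1024, buffer, memory), read_miss(512, buffer, memory))
```

Without the address check, the read of location 512 would return the stale value still held in memory.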

Virtually all desktop and server processors use the latter approach, giving reads priority over writes. The cost of writes by the processor in a write-back cache can also be reduced. Suppose a read miss will replace a dirty memory block. Instead of writing the dirty block to memory, and then reading memory, we could copy the dirty block to a buffer, then read memory, and then write memory.

This way the CPU read, for which the processor is probably waiting, will finish sooner. Similar to the situation above, if a read miss occurs, the processor can either stall until the buffer is empty or check the addresses of the words in the buffer for conflicts.

The fourth technique, the merging write buffer, also involves write buffers, this time improving their efficiency. Write-through caches rely on write buffers, as all stores must be sent to the next lower level of the hierarchy. As mentioned above, even write-back caches use a simple buffer when a block is replaced. If the write buffer is empty, the data and the full address are written in the buffer, and the write is finished from the CPU's perspective; the CPU continues working while the write buffer prepares to write the word to memory.

If the buffer contains other modified blocks, the addresses can be checked to see if the address of this new data matches the address of a valid write buffer entry.

If so, the new data are combined with that entry, a technique called write merging. If the buffer is full and there is no address match, the cache and CPU must wait until the buffer has an empty entry. This optimization uses the memory more efficiently since multiword writes are usually faster than writes performed one word at a time. The optimization also reduces stalls due to the write buffer being full. Figure 5. Assume we had four entries in the write buffer, and each entry could hold four 64-bit words.

Without this optimization, four stores to sequential addresses would fill the buffer at one word per entry, even though these four words, when merged, fit exactly within a single entry of the write buffer. With write merging, the four writes are merged into a single buffer entry; without it, the buffer is full even though three-fourths of each entry is wasted. The buffer has four entries, and each entry holds four 64-bit words. In the figure, the address for each entry is on the left, with valid bits (V) indicating whether or not the next sequential eight bytes are occupied in this entry.
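The four-store example can be reproduced with a short simulation. This is a sketch under assumptions: entries are aligned to 32-byte boundaries, the store addresses are made up, and the function name is hypothetical.

```python
WORD_BYTES = 8                            # one 64-bit word
WORDS_PER_ENTRY = 4
ENTRY_BYTES = WORD_BYTES * WORDS_PER_ENTRY


def buffer_stores(addresses, merging, entries=4):
    """Return the write-buffer contents after the given one-word stores.

    Each entry is (base_address, set_of_valid_word_slots).
    Raises RuntimeError when a store finds the buffer full.
    """
    buffer = []
    for addr in addresses:
        # With merging, an entry covers an aligned group of four words.
        base = addr - addr % ENTRY_BYTES if merging else addr
        slot = (addr - base) // WORD_BYTES
        for entry_base, valid in buffer:
            if merging and entry_base == base:
                valid.add(slot)           # address match: merge into this entry
                break
        else:
            if len(buffer) == entries:
                raise RuntimeError("write buffer full: CPU must stall")
            buffer.append((base, {slot}))
    return buffer


stores = [128, 136, 144, 152]             # four sequential 8-byte stores
print(buffer_stores(stores, merging=False))
print(buffer_stores(stores, merging=True))
```

Without merging the four stores occupy all four entries with one valid word each, so a fifth store would stall; with merging they share one entry and three entries stay free.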

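Without write merging, the words to the right in the upper drawing would only be used for instructions that wrote multiple words at the same time.

The fifth technique is the victim cache. One approach to lowering miss penalty is to remember what was discarded in case it is needed again. Since the discarded data has already been fetched, it can be used again at small cost. A victim cache is a small, fully associative cache that holds only blocks discarded from the cache because of a miss (the victims); it is checked on a miss before going to the next lower level of memory. If the data is found there, the victim block and cache block are swapped. The AMD Athlon has a victim cache with eight entries.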

Jouppi [] found that victim caches of one to five entries are effective at reducing misses, especially for small, direct-mapped data caches. Depending on the program, a four-entry victim cache might remove one quarter of the misses in a 4-KB direct-mapped data cache.
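A minimal sketch of a direct-mapped cache backed by a victim cache follows. The class name, the FIFO replacement of victims, and the access pattern are assumptions chosen for illustration; the model tracks only block addresses, not data.

```python
from collections import OrderedDict


class DirectMappedWithVictim:
    """Direct-mapped cache with a small fully associative victim cache."""

    def __init__(self, num_lines, victim_entries):
        self.num_lines = num_lines
        self.lines = [None] * num_lines     # block address held by each line
        self.victims = OrderedDict()        # discarded blocks, oldest first
        self.victim_entries = victim_entries

    def access(self, block):
        """Return "hit", "victim_hit", or "miss" for a block address."""
        index = block % self.num_lines
        if self.lines[index] == block:
            return "hit"
        evicted = self.lines[index]
        if block in self.victims:
            del self.victims[block]         # swap: the victim returns to the cache
            result = "victim_hit"
        else:
            result = "miss"                 # fetch from the next lower level
        self.lines[index] = block
        if evicted is not None:
            self.victims[evicted] = True    # remember what was discarded
            if len(self.victims) > self.victim_entries:
                self.victims.popitem(last=False)
        return result


# Blocks 0 and 4 map to the same line of a 4-line cache and keep evicting each other.
with_victim = DirectMappedWithVictim(num_lines=4, victim_entries=1)
print([with_victim.access(b) for b in (0, 4, 0, 4)])
```

In this pattern every access after the first two is served by swapping with the victim cache instead of going to the next level of memory.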

The classical approach to improving cache behavior is to reduce miss rates, and there are five techniques to reduce miss rate. It helps to start with a model that sorts all misses into three simple categories:

Compulsory: The very first access to a block cannot be in the cache, so the block must be brought into the cache. These are also called cold start misses or first reference misses.

Capacity: If the cache cannot contain all the blocks needed during execution of a program, capacity misses (in addition to compulsory misses) will occur because of blocks being discarded and later retrieved.

Conflict: If the block placement strategy is set associative or direct mapped, conflict misses (in addition to compulsory and capacity misses) will occur because a block can be discarded and later retrieved if too many blocks map to its set. These misses are also called collision misses or interference misses. The idea is that hits in a fully associative cache which become misses in an N-way set associative cache are due to more than N requests on some popular sets.
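The three categories can be measured by running the same trace through the cache under study and through a fully associative cache of the same size. The sketch below assumes LRU replacement and a trace of block addresses; the function name and traces are hypothetical.

```python
from collections import OrderedDict


def classify_misses(trace, num_blocks, ways):
    """Count compulsory, capacity, and conflict misses for a block-address trace.

    The cache holds num_blocks blocks, is `ways`-way set associative, and uses LRU.
    """
    num_sets = num_blocks // ways
    sets = [OrderedDict() for _ in range(num_sets)]   # the cache being studied
    fully = OrderedDict()                             # same size, fully associative
    seen = set()                                      # stands in for an infinite cache
    counts = {"compulsory": 0, "capacity": 0, "conflict": 0}

    def touch(lru, block, capacity):
        """Perform one LRU access and return True on a hit."""
        hit = block in lru
        if hit:
            lru.move_to_end(block)
        else:
            lru[block] = True
            if len(lru) > capacity:
                lru.popitem(last=False)
        return hit

    for block in trace:
        hit_fully = touch(fully, block, num_blocks)
        hit_real = touch(sets[block % num_sets], block, ways)
        if not hit_real:
            if block not in seen:
                counts["compulsory"] += 1     # first reference to this block
            elif not hit_fully:
                counts["capacity"] += 1       # even full associativity would miss
            else:
                counts["conflict"] += 1       # only the set mapping caused the miss
        seen.add(block)
    return counts


print(classify_misses([0, 4, 0, 4], num_blocks=4, ways=1))
```

Raising `ways` to 4 for the same trace removes the conflict misses, which is the effect the model is meant to expose.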

The simplest way to reduce miss rate is to increase the block size. Larger block sizes will reduce compulsory misses. This reduction occurs because the principle of locality has two components: temporal locality and spatial locality. Larger blocks take advantage of spatial locality. At the same time, larger blocks increase the miss penalty.

Since they reduce the number of blocks in the cache, larger blocks may increase conflict misses and even capacity misses if the cache is small. Clearly, there is little reason to increase the block size to such a size that it increases the miss rate.

There is also no benefit to reducing miss rate if it increases the average memory access time: the increase in miss penalty may outweigh the decrease in miss rate.

The obvious way to reduce capacity misses is to increase the capacity of the cache. The obvious drawback is longer hit time and higher cost. This technique has been especially popular in off-chip caches: second- or third-level caches have grown as large as the main memories of earlier desktop computers.
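The block-size trade-off described earlier can be made concrete by computing the average memory access time for several block sizes. Everything numeric in this sketch is an assumption for illustration: the miss rates are made-up placeholders rather than measurements, and the memory is assumed to take 80 cycles of latency and then deliver 16 bytes every 2 cycles.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time in clock cycles."""
    return hit_time + miss_rate * miss_penalty


def miss_penalty(block_bytes, latency=80, bytes_per_transfer=16, cycles_per_transfer=2):
    """Cycles to fetch one block: fixed latency plus transfer time."""
    return latency + (block_bytes // bytes_per_transfer) * cycles_per_transfer


# Hypothetical miss rates by block size (bytes) for one small cache.
ILLUSTRATIVE_MISS_RATES = {16: 0.086, 32: 0.072, 64: 0.070, 128: 0.078, 256: 0.095}

results = {b: amat(1, r, miss_penalty(b)) for b, r in ILLUSTRATIVE_MISS_RATES.items()}
best_block = min(results, key=results.get)
lowest_miss_rate_block = min(ILLUSTRATIVE_MISS_RATES, key=ILLUSTRATIVE_MISS_RATES.get)
print(best_block, lowest_miss_rate_block)
```

Under these assumptions the block size with the lowest miss rate is not the one with the lowest average access time, because its longer transfer raises the miss penalty.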

Generally the miss rate improves with higher associativity. Two general rules of thumb can be drawn. The first is that eight-way set associative is, for practical purposes, as effective in reducing misses for caches of these sizes as fully associative. You can see the difference by comparing the 8-way entries to the capacity miss, since capacity misses are calculated using a fully associative cache.

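The second observation, called the 2:1 cache rule of thumb, is that a direct-mapped cache of size N has about the same miss rate as a two-way set-associative cache of size N/2. This held only for caches below a certain size.

In way prediction, extra bits are kept in the cache to predict the way, or block within the set, of the next cache access. This prediction means the multiplexer is set early to select the desired block, and only a single tag comparison is performed that clock cycle.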

A miss results in checking the other ways for matches in subsequent clock cycles. The Alpha uses way prediction in its instruction cache.

Added to each block of the instruction cache is a way predictor bit. The bit is used to select which of the two ways to try on the next cache access. If the predictor is correct, the instruction cache latency is one clock cycle.

If not, it tries the other way, changes the way predictor, and has a latency of three clock cycles. In addition to improving performance, way prediction can reduce power for embedded applications. By only supplying power to the half of the tags that are expected to be used, the MIPS R series lowers power consumption with the same benefits.
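The one-cycle and three-cycle latencies can be modeled with a small two-way cache. This sketch simplifies the scheme described above by keeping one predictor bit per set rather than per block, and it does not model the miss penalty; the class name is hypothetical.

```python
class WayPredictedCache:
    """Two-way set-associative cache with one predictor bit per set."""

    FAST, SLOW = 1, 3   # lookup latency in cycles: prediction right / wrong

    def __init__(self, num_sets):
        self.num_sets = num_sets
        self.tags = [[None, None] for _ in range(num_sets)]
        self.predicted_way = [0] * num_sets

    def access(self, block):
        """Return (hit, lookup_latency) for a block address."""
        index, tag = block % self.num_sets, block // self.num_sets
        guess = self.predicted_way[index]
        if self.tags[index][guess] == tag:
            return True, self.FAST              # only one tag was compared
        other = 1 - guess
        hit = self.tags[index][other] == tag
        if not hit:
            self.tags[index][other] = tag       # miss: refill the way not predicted
        self.predicted_way[index] = other       # retrain the predictor
        return hit, self.SLOW


cache = WayPredictedCache(num_sets=1)
print([cache.access(b) for b in (0, 0, 1, 0, 0)])
```

Repeated accesses to the same block settle into one-cycle hits, while switching between two blocks in the same set costs a slow hit each time the predictor is wrong.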

A related approach is called pseudo-associative or column associative. Accesses proceed just as in the direct-mapped cache for a hit. On a miss, however, before going to the next lower level of the memory hierarchy, a second cache entry is checked to see if it matches there.

Pseudo-associative caches then have one fast and one slow hit time—corresponding to a regular hit and a pseudo hit—in addition to the miss penalty. One danger would be if many fast hit times of the direct-mapped cache became slow hit times in the pseudo-associative cache.
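A pseudo-associative lookup can be sketched as a direct-mapped cache with one alternate location per block. Choosing the alternate line by flipping the most significant index bit, and swapping blocks on a slow hit so the next access is fast, are assumptions of this sketch rather than details given above.

```python
class PseudoAssociativeCache:
    """Direct-mapped cache that checks one alternate line before missing."""

    def __init__(self, num_lines):
        self.num_lines = num_lines          # assumed to be a power of two
        self.lines = [None] * num_lines     # block address held by each line

    def access(self, block):
        """Return "fast_hit", "slow_hit", or "miss" for a block address."""
        primary = block % self.num_lines
        alternate = primary ^ (self.num_lines >> 1)   # flip the top index bit
        if self.lines[primary] == block:
            return "fast_hit"
        if self.lines[alternate] == block:
            # Swap so the block just used is found quickly next time.
            self.lines[alternate] = self.lines[primary]
            self.lines[primary] = block
            return "slow_hit"
        # Miss: the new block takes the primary line, the old one moves aside.
        self.lines[alternate] = self.lines[primary]
        self.lines[primary] = block
        return "miss"


cache = PseudoAssociativeCache(num_lines=4)
print([cache.access(b) for b in (0, 4, 0, 0, 4)])
```

Blocks 0 and 4 would evict each other on every access in a plain direct-mapped cache; here they coexist, at the price of a slow hit whenever the access pattern alternates.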



