
NEC SX-6

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

The SX-6 is a supercomputer in the NEC SX series, built by NEC Corporation and introduced in 2001; in the United States it was sold under license by Cray Inc. Each SX-6 single-node system contains up to eight vector processors, which share up to 64 GB of computer memory. The SX-6 processor is a single-chip implementation containing a vector processor unit and a scalar processor, fabricated in a 0.15 μm CMOS process with copper interconnects, whereas the SX-5 was a multi-chip implementation. The Earth Simulator is based on the SX-6 architecture.


The vector processor is made up of eight vector pipeline units, each with seventy-two 256-word vector registers. The vector unit performs add/shift, multiply, divide and logical operations. The scalar unit is 64 bits wide and contains a 64 KB cache. The scalar unit can decode, issue and complete four instructions per clock cycle. Branch prediction and speculative execution are supported. A multi-node system

A multi-core processor), in which case the copy in the cache may become out-of-date or stale. Alternatively, when a CPU in a multiprocessor system updates data in the cache, copies of data in caches associated with other CPUs become stale. Communication protocols between the cache managers that keep the data consistent are known as cache coherence protocols. Cache performance measurement has become important in recent times where

A page mode, where words of a page (256, 512, or 1024 words) can be read sequentially with a significantly shorter access time (typically approximately 30 ns). The page is selected by setting the upper address lines and then words are sequentially read by stepping through the lower address lines. With the introduction of the FinFET transistor implementation of SRAM cells, they started to suffer from increasing inefficiencies in cell sizes. Over

A 96 KiB L1 instruction cache (and 128 KiB L1 data cache), and Intel Ice Lake-based processors from 2018, having 48 KiB L1 data cache and 48 KiB L1 instruction cache. In 2020, some Intel Atom CPUs (with up to 24 cores) have (multiples of) 4.5 MiB and 15 MiB cache sizes. Data is transferred between memory and cache in blocks of fixed size, called cache lines or cache blocks. When

A DRAM, the bit line is connected to storage capacitors and charge sharing causes the bit line to swing upwards or downwards. The symmetric structure of SRAMs also allows for differential signaling, which makes small voltage swings more easily detectable. Another difference with DRAM that contributes to making SRAM faster is that commercial chips accept all address bits at a time. By comparison, commodity DRAMs have

A cache line is copied from memory into the cache, a cache entry is created. The cache entry will include the copied data as well as the requested memory location (called a tag). When the processor needs to read or write a location in memory, it first checks for a corresponding entry in the cache. The cache checks for the contents of the requested memory location in any cache lines that might contain that address. If
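
As a rough illustration of the lookup just described, here is a minimal sketch (not from the article; the line size, set count and data structure are assumptions) of checking a set of (tag, data) entries for a hit:

    BLOCK_SIZE = 64          # bytes per cache line (assumed)
    NUM_SETS = 128           # number of sets (assumed)

    # cache[set_index] is a list of entries; each entry is (tag, data_block)
    cache = [[] for _ in range(NUM_SETS)]

    def lookup(address):
        """Return (hit, data_block) for a byte address."""
        block_number = address // BLOCK_SIZE
        set_index = block_number % NUM_SETS
        tag = block_number // NUM_SETS
        for entry_tag, data in cache[set_index]:
            if entry_tag == tag:      # matching tag: cache hit
                return True, data
        return False, None            # no matching tag: cache miss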

A capacity of 2^11 = 2,048 = 2k words) and an 8-bit word, so they are referred to as 2k × 8 SRAM. The dimensions of an SRAM cell on an IC are determined by the minimum feature size of the process used to make the IC. An SRAM cell has three states: standby (idle), reading, and writing. SRAM operating in read and write modes should have readability and write stability, respectively. The three different states work as follows: If

A common virtual address space. A program executes by calculating, comparing, reading and writing to addresses of its virtual address space, rather than addresses of physical address space, making programs simpler and thus easier to write. Virtual memory requires the processor to translate virtual addresses generated by the program into physical addresses in main memory. The portion of the processor that does this translation

A configuration that became known as the Farber-Schlig cell. That year they submitted an invention disclosure, but it was initially rejected. In 1965, Benjamin Agusta and his team at IBM created a 16-bit silicon memory chip based on the Farber-Schlig cell, with 84 transistors, 64 resistors, and 4 diodes. In April 1969, Intel introduced its first product, the Intel 3101, an SRAM memory chip intended to replace bulky magnetic-core memory modules; its capacity

A density and cost advantage over true SRAM, and without the access complexity of DRAM. In the 1990s, asynchronous SRAM was employed where fast access time was needed. Asynchronous SRAM was used as main memory for small cache-less embedded processors used in everything from industrial electronics and measurement systems to hard disks and networking equipment, among many other applications. Nowadays, synchronous SRAM (e.g. DDR SRAM)

A direct-mapped cache, closer to the miss rate of a fully associative cache. Compared with a direct-mapped cache, a set associative cache has a reduced number of bits for its cache set index that maps to a cache set, where multiple ways or blocks stay, such as 2 blocks for a 2-way set associative cache and 4 blocks for a 4-way set associative cache. Compared with a direct-mapped cache, the unused cache index bits become


A location in the main memory, the processor checks whether the data from that location is already in the cache. If so, the processor will read from or write to the cache instead of the much slower main memory. Many modern desktop, server, and industrial CPUs have at least three independent levels of caches (L1, L2 and L3) and different types of caches: Early examples of CPU caches include

A mapping table held in core memory before every programmed access to main memory. With no caches, and with the mapping table memory running at the same speed as main memory, this effectively cut the speed of memory access in half. Two early machines that used a page table in main memory for mapping, the IBM System/360 Model 67 and the GE 645, both had a small associative memory as a cache for accesses to

A more complex process is used in practice: The read cycle is started by precharging both bit lines (BL and its complement) to a high (logic 1) voltage. Then asserting the word line WL enables both the access transistors M5 and M6, which causes the voltage on one of the bit lines to drop slightly. The two bit lines will then have a small voltage difference between them. A sense amplifier will sense which line has

A part of the tag bits. For example, a 2-way set associative cache contributes 1 bit to the tag and a 4-way set associative cache contributes 2 bits to the tag. The basic idea of the multicolumn cache is to use the set index to map to a cache set as a conventional set associative cache does, and to use the added tag bits to index a way in the set. For example, in a 4-way set associative cache, the two bits are used to index way 00, way 01, way 10, and way 11, respectively. This double cache indexing
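
A hedged sketch of the double indexing described above, for a hypothetical 4-way set associative cache: the set index selects the set as usual, and the two low tag bits pick the "major" way within it (all constants here are illustrative assumptions, not taken from any real design):

    BLOCK_SIZE = 64   # bytes per line (assumed)
    NUM_SETS = 32     # sets in the 4-way cache (assumed)
    WAYS = 4

    def major_location(address):
        """Return (set_index, way) of the major location for a byte address."""
        block = address // BLOCK_SIZE
        set_index = block % NUM_SETS      # conventional set index
        tag = block // NUM_SETS
        way = tag % WAYS                  # low two tag bits index way 0..3
        return set_index, way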

Is a failed attempt to read or write a piece of data in the cache, which results in a main memory access with much longer latency. There are three kinds of cache misses: instruction read miss, data read miss, and data write miss. Cache read misses from an instruction cache generally cause the largest delay, because the processor, or at least the thread of execution, has to wait (stall) until

Is a type of random-access memory (RAM) that uses latching circuitry (flip-flop) to store each bit. SRAM is volatile memory; data is lost when power is removed. The static qualifier differentiates SRAM from dynamic random-access memory (DRAM): Semiconductor bipolar SRAM was invented in 1963 by Robert Norman at Fairchild Semiconductor. Metal–oxide–semiconductor SRAM (MOS-SRAM)

Is also embedded in practically all modern appliances, toys, etc. that implement an electronic user interface. SRAM in its dual-ported form is sometimes used for real-time digital signal processing circuits. SRAM is also used in personal computers, workstations, routers and peripheral equipment: CPU register files, internal CPU caches, internal GPU caches and external burst mode SRAM caches, hard disk buffers, router buffers, etc. LCD screens and printers also normally employ SRAM to hold

Is available for a multinode installation. The default batch processing system is NQSII, but open source batch systems such as Sun Grid Engine are also supported.

CPU cache

A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from

Is called a "major location mapping", and its latency is equivalent to a direct-mapped access. Extensive experiments in multicolumn cache design show that the hit ratio to major locations is as high as 90%. If cache mapping conflicts with a cache block in the major location, the existing cache block will be moved to another cache way in the same set, which is called the "selected location". Because

Is called a stall. As CPUs become faster compared to main memory, stalls due to cache misses displace more potential computation; modern CPUs can execute hundreds of instructions in the time taken to fetch a single cache line from main memory. Various techniques have been employed to keep the CPU busy during this time, including out-of-order execution in which the CPU attempts to execute independent instructions after


Is configured by interconnecting up to 128 single-node systems via a high-speed, low-latency IXS (Internode Crossbar Switch). The peak performance of the SX-6 series vector processors is 8 GFLOPS. Thus a single-node system provides a peak performance of 64 GFLOPS, while a multi-node system provides up to 8 TFLOPS of peak floating-point performance. The SX-6 uses SUPER-UX, a Unix-like operating system developed by NEC. A SAN-based global file system (NEC's GFS)

Is crucial to CPU performance, and so most modern level-1 caches are virtually indexed, which at least allows the MMU's TLB lookup to proceed in parallel with fetching the data from the cache RAM. But virtual indexing is not the best choice for all cache levels. The cost of dealing with virtual aliases grows with cache size, and as a result most level-2 and larger caches are physically indexed. Caches have historically used both virtual and physical addresses for

Is equal to the number of cache blocks divided by the number of ways of associativity, which leads to 128 / 4 = 32 sets, and hence 2^5 = 32 different indices. There are 2^6 = 64 possible offsets. Since the CPU address is 32 bits wide, this implies 32 − 5 − 6 = 21 bits for the tag field. The original Pentium 4 processor also had an eight-way set associative L2 integrated cache 256 KiB in size, with 128-byte cache blocks. This implies 32 − 8 − 7 = 17 bits for
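
The bit-width arithmetic above can be reproduced with a few lines of Python (the helper itself is only an illustration; the cache parameters are the ones quoted in the text):

    from math import log2

    def cache_bits(cache_bytes, block_bytes, ways, address_bits=32):
        blocks = cache_bytes // block_bytes
        sets = blocks // ways
        offset_bits = int(log2(block_bytes))
        index_bits = int(log2(sets))
        tag_bits = address_bits - index_bits - offset_bits
        return blocks, sets, offset_bits, index_bits, tag_bits

    # Pentium 4 L1 data cache: 8 KiB, 64-byte blocks, four-way set associative
    print(cache_bits(8 * 1024, 64, 4))      # (128, 32, 6, 5, 21) -> 21 tag bits
    # Pentium 4 L2 cache: 256 KiB, 128-byte blocks, eight-way set associative
    print(cache_bits(256 * 1024, 128, 8))   # (2048, 256, 7, 8, 17) -> 17 tag bits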

Is extra latency from computing the hash function. Additionally, when it comes time to load a new line and evict an old line, it may be difficult to determine which existing line was least recently used, because the new line conflicts with data at different indexes in each way; LRU tracking for non-skewed caches is usually done on a per-set basis. Nevertheless, skewed-associative caches have major advantages over conventional set-associative ones. A true set-associative cache tests all

Is free to choose any entry in the cache to hold the copy, the cache is called fully associative. At the other extreme, if each entry in the main memory can go in just one place in the cache, the cache is direct-mapped. Many caches implement a compromise in which each entry in the main memory can go to any one of N places in the cache, and are described as N-way set associative. For example,

Is generally dynamic random-access memory (DRAM) on a separate die or chip, rather than static random-access memory (SRAM). An exception to this is when eDRAM is used for all levels of cache, down to L1. Historically L1 was also on a separate die; however, bigger die sizes have allowed integration of it as well as other cache levels, with the possible exception of the last level. Each extra level of cache tends to be bigger and optimized differently. Caches (like RAM historically) have generally been sized in powers of two: 2, 4, 8, 16, etc. KiB; when up to MiB sizes (i.e. for larger non-L1), very early on

Is increased static power due to the constant current flow through one of the pull-down transistors (M1 or M2). This is sometimes used to implement more than one (read and/or write) port, which may be useful in certain types of video memory and register files implemented with multi-ported SRAM circuitry. Generally, the fewer transistors needed per cell, the smaller each cell can be. Since

Is known as the memory management unit (MMU). The fast path through the MMU can perform those translations stored in the translation lookaside buffer (TLB), which is a cache of mappings from the operating system's page table, segment table, or both. For the purposes of the present discussion, there are three important features of address translation: One early virtual memory system, the IBM M44/44X, required an access to

Is less dense and more expensive than DRAM and also has a higher power consumption during read or write access. The power consumption of SRAM varies widely depending on how frequently it is accessed. Many categories of industrial and scientific subsystems, automotive electronics, and similar embedded systems, contain SRAM which, in this context, may be referred to as ESRAM. Some amount (kilobytes or less)

Is mainly used for CPU cache, small on-chip memory, FIFOs or other small buffers. A typical SRAM cell is made up of six MOSFETs, and is often called a 6T SRAM cell. Each bit in the cell is stored on four transistors (M1, M2, M3, M4) that form two cross-coupled inverters. This storage cell has two stable states which are used to denote 0 and 1. Two additional access transistors serve to control


Is no perfect method to choose among the variety of replacement policies available. One popular replacement policy, least-recently used (LRU), replaces the least recently accessed entry. Marking some memory ranges as non-cacheable can improve performance by avoiding caching of memory regions that are rarely re-accessed. This avoids the overhead of loading something into the cache without having any reuse. Cache entries may also be disabled or locked depending on
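
As a minimal sketch of LRU replacement within a single cache set (illustrative only; real hardware uses cheaper approximations), assuming Python's OrderedDict as the recency queue:

    from collections import OrderedDict

    class LRUSet:
        """One cache set with least-recently-used replacement."""
        def __init__(self, ways):
            self.ways = ways
            self.entries = OrderedDict()        # tag -> data, oldest first

        def access(self, tag, load_block):
            if tag in self.entries:             # hit: mark as most recently used
                self.entries.move_to_end(tag)
                return self.entries[tag]
            if len(self.entries) >= self.ways:  # full set: evict the LRU entry
                self.entries.popitem(last=False)
            self.entries[tag] = load_block(tag) # fill from (simulated) main memory
            return self.entries[tag]

Calling access(tag, loader) repeatedly keeps the most recently touched tags resident and evicts the oldest entry when the set overflows.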

Is only slightly overridden by the write process, the gate voltage of the opposite transistor pair (M1 and M2) is also changed. This means that the M1 and M2 transistors can be overridden more easily, and so on. Thus, cross-coupled inverters magnify the writing process. RAM with an access time of 70 ns will output valid data within 70 ns from the time that the address lines are valid. Some SRAM cells have

Is rather employed similarly to synchronous DRAM – just as DDR SDRAM is used rather than asynchronous DRAM. A synchronous memory interface is much faster, as access time can be significantly reduced by employing a pipelined architecture. Furthermore, as DRAM is much cheaper than SRAM, SRAM is often replaced by DRAM, especially when a large volume of data is required. SRAM memory is, however, much faster for random (not block/burst) access. Therefore, SRAM memory

Is written back to the main memory only when that data is evicted from the cache. For this reason, a read miss in a write-back cache may sometimes require two memory accesses to service: one to first write the dirty location to main memory, and then another to read the new location from memory. Also, a write to a main memory location that is not yet mapped in a write-back cache may evict an already dirty location, thereby freeing that cache space for

The Atlas 2 and the IBM System/360 Model 85 in the 1960s. The first CPUs that used a cache had only one level of cache; unlike later level 1 cache, it was not split into L1d (for data) and L1i (for instructions). Split L1 cache started in 1976 with the IBM 801 CPU, became mainstream in the late 1980s, and in 1997 entered the embedded CPU market with the ARMv5TE. In 2015, even sub-dollar SoCs split

The main memory. A cache is a smaller, faster memory, located closer to a processor core, which stores copies of the data from frequently used main memory locations. Most CPUs have a hierarchy of multiple cache levels (L1, L2, often L3, and rarely even L4), with different instruction-specific and data-specific caches at level 1. The cache memory is typically implemented with static random-access memory (SRAM), in modern CPUs by far

The skewed cache, where the index for way 0 is direct, as above, but the index for way 1 is formed with a hash function. A good hash function has the property that addresses which conflict with the direct mapping tend not to conflict when mapped with the hash function, and so it is less likely that a program will suffer from an unexpectedly large number of conflict misses due to a pathological access pattern. The downside
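
A hedged sketch of the skewed indexing idea: way 0 is indexed directly, while way 1 folds tag bits into the index with XOR, so blocks that conflict in way 0 usually do not conflict in way 1. The hash shown is an assumption for illustration, not the function used in any particular processor:

    NUM_SETS = 64   # illustrative

    def index_way0(block_number):
        return block_number % NUM_SETS                      # direct mapping

    def index_way1(block_number):
        folded = block_number ^ (block_number // NUM_SETS)  # XOR-fold tag bits into index
        return folded % NUM_SETS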

The L1 cache. They also have L2 caches and, for larger processors, L3 caches as well. The L2 cache is usually not split, and acts as a common repository for the already split L1 cache. Every core of a multi-core processor has a dedicated L1 cache, which is usually not shared between the cores. The L2 cache, and higher-level caches, may be shared between the cores. L4 cache is currently uncommon, and

The access to a storage cell during read and write operations. 6T SRAM is the most common kind of SRAM. In addition to 6T SRAM, other kinds of SRAM use 4, 5, 7, 8, 9, 10 (4T, 5T, 7T, 8T, 9T, 10T SRAM), or more transistors per bit. Four-transistor SRAM is quite common in stand-alone SRAM devices (as opposed to SRAM used for CPU caches), implemented in special processes with an extra layer of polysilicon, allowing for very high-resistance pull-up resistors. The principal drawback of using 4T SRAM

The address multiplexed in two halves, i.e. higher bits followed by lower bits, over the same package pins in order to keep their size and cost down. The size of an SRAM with m address lines and n data lines is 2^m words, or 2^m × n bits. The most common word size is 8 bits, meaning that a single byte can be read or written to each of the 2^m different words within the SRAM chip. Several common SRAM chips have 11 address lines (thus
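
The capacity arithmetic above, written out (purely illustrative): an SRAM with m address lines and n data lines stores 2^m words of n bits each.

    def sram_capacity(address_lines, data_lines):
        words = 2 ** address_lines
        return words, words * data_lines       # (words, total bits)

    print(sram_capacity(11, 8))   # (2048, 16384): a "2k x 8" chip, i.e. 2 KiB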


The advantages of a direct-mapped cache is that it allows simple and fast speculation. Once the address has been computed, the one cache index which might have a copy of that location in memory is known. That cache entry can be read, and the processor can continue to work with that data before it finishes checking that the tag actually matches the requested address. The idea of having the processor use

The associativity of their caches in low-power states, which acts as a power-saving measure. In order of worse but simple to better but complex: In this cache organization, each location in the main memory can go in only one entry in the cache. Therefore, a direct-mapped cache can also be called a "one-way set associative" cache. It does not have a placement policy as such, since there is no choice of which cache entry's contents to evict. This means that if two locations map to

The cache do not have to include that part of the main memory address which is implied by the cache memory's index. Since the cache tags have fewer bits, they require fewer transistors, take less space on the processor circuit board or on the microprocessor chip, and can be read and compared faster. Also, the LRU algorithm is especially simple, since only one bit needs to be stored for each pair. One of

The cache performance, reducing the miss rate becomes one of the necessary steps among other steps. Decreasing the access time to the cache also gives a boost to its performance and helps with optimization. The time taken to fetch one cache line from memory (read latency due to a cache miss) matters because the CPU will run out of work while waiting for the cache line. When a CPU reaches this state, it

The cache tags, although virtual tagging is now uncommon. If the TLB lookup can finish before the cache RAM lookup, then the physical address is available in time for tag compare, and there is no need for virtual tagging. Large caches, then, tend to be physically tagged, and only small, very low latency caches are virtually tagged. In recent general-purpose CPUs, virtual tagging has been superseded by vhints, as described below.

Static random-access memory

Static random-access memory (static RAM or SRAM)

The cache. (The tag, flag and error correction code bits are not included in the size, although they do affect the physical area of a cache.) An effective memory address which goes along with the cache line (memory block) is split (MSB to LSB) into the tag, the index and the block offset. The index describes which cache set the data has been put in. The index length is ⌈log2(s)⌉ bits for s cache sets. The block offset specifies

The cached data before the tag match completes can be applied to associative caches as well. A subset of the tag, called a hint, can be used to pick just one of the possible cache entries mapping to the requested address. The entry selected by the hint can then be used in parallel with checking the full tag. The hint technique works best when used in the context of address translation, as explained below. Other schemes have been suggested, such as

The cell should be connected to the bit lines: BL and its complement. They are used to transfer data for both read and write operations. Although it is not strictly necessary to have two bit lines, both the signal and its inverse are typically provided in order to improve noise margins and speed. During read accesses, the bit lines are actively driven high and low by the inverters in the SRAM cell. This improves SRAM bandwidth compared to DRAMs – in

The cell's temperature rises. The cell power drain occurs in both active and idle states, thus wasting useful energy without any useful work done. Even though in the last 20 years the issue was partially addressed by the Data Retention Voltage technique (DRV), with reduction rates ranging from 5 to 10, the decrease in node size caused reduction rates to fall to about 2. With these two issues it became more challenging to develop energy-efficient and dense SRAM memories, prompting the semiconductor industry to look for alternatives such as STT-MRAM and F-RAM. In 2019

The contents of the cache. To make room for the new entry on a cache miss, the cache may have to evict one of the existing entries. The heuristic it uses to choose the entry to evict is called the replacement policy. The fundamental problem with any replacement policy is that it must predict which existing cache entry is least likely to be used in the future. Predicting the future is difficult, so there


The context. If data is written to the cache, at some point it must also be written to main memory; the timing of this write is known as the write policy. In a write-through cache, every write to the cache causes a write to main memory. Alternatively, in a write-back or copy-back cache, writes are not immediately mirrored to the main memory, and the cache instead tracks which locations have been written over, marking them as dirty. The data in these locations
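
A minimal sketch (names and structures assumed, not from the article) contrasting the two write policies just described: write-through pushes every store to main memory immediately, while write-back only marks the line dirty and defers the memory update until the line is evicted.

    def write_through(line, memory, addr, offset, value):
        line["data"][offset] = value
        memory[addr] = value          # every write is mirrored to main memory

    def write_back(line, offset, value):
        line["data"][offset] = value
        line["dirty"] = True          # main memory is updated only on eviction

    def evict(line, memory, base_addr):
        if line.get("dirty"):         # a dirty line must be written back first
            for i, byte in enumerate(line["data"]):
                memory[base_addr + i] = byte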

The cost of processing a silicon wafer is relatively fixed, using smaller cells and so packing more bits on one wafer reduces the cost per bit of memory. Memory cells that use fewer than four transistors are possible; however, such 3T or 1T cells are DRAM, not SRAM (even the so-called 1T-SRAM). Access to the cell is enabled by the word line (WL in figure) which controls the two access transistors M5 and M6 which, in turn, control whether

The current set (the set has been retrieved by index) to see if this set contains the requested address. If it does, a cache hit occurs. The tag length in bits is as follows: Some authors refer to the block offset as simply the "offset" or the "displacement". The original Pentium 4 processor had a four-way set associative L1 data cache of 8 KiB in size, with 64-byte cache blocks. Hence, there are 8 KiB / 64 = 128 cache blocks. The number of sets

The data when the power supply is lost, ensuring preservation of critical information. nvSRAMs are used in a wide range of situations – networking, aerospace, and medical, among many others – where the preservation of data is critical and where batteries are impractical. Pseudostatic RAM (PSRAM) is DRAM combined with a self-refresh circuit. It appears externally as slower SRAM, albeit with

The desired data within the stored data block within the cache row. Typically the effective address is in bytes, so the block offset length is ⌈log2(b)⌉ bits, where b is the number of bytes per data block. The tag contains the most significant bits of the address, which are checked against all rows in
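
Putting the two lengths together, a small helper (generic and illustrative, using the ceil(log2(...)) expressions above) splits a byte address into its tag, index and block offset fields:

    from math import ceil, log2

    def split_address(addr, num_sets, block_bytes, address_bits=32):
        offset_bits = ceil(log2(block_bytes))     # block offset length
        index_bits = ceil(log2(num_sets))         # index length
        offset = addr & ((1 << offset_bits) - 1)
        index = (addr >> offset_bits) & ((1 << index_bits) - 1)
        tag = addr >> (offset_bits + index_bits)
        return tag, index, offset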

The ease of interfacing. It is much easier to work with than DRAM as there are no refresh cycles and the address and data buses are often directly accessible. In addition to buses and power connections, SRAM usually requires only three controls: Chip Enable (CE), Write Enable (WE) and Output Enable (OE). In synchronous SRAM, Clock (CLK) is also included. Non-volatile SRAM (nvSRAM) has standard SRAM functionality, but it saves

The execution of subsequent instructions; the processor can continue until the queue is full. For a detailed introduction to the types of misses, see cache performance measurement and metric. Most general purpose CPUs implement some form of virtual memory. To summarize, either each program running on the machine sees its own simplified address space, which contains code and data for that program only, or all programs run in

The following structure: The data block (cache line) contains the actual data fetched from the main memory. The tag contains (part of) the address of the actual data fetched from the main memory. The flag bits are discussed below. The "size" of the cache is the amount of main memory data it can hold. This size can be calculated as the number of bytes stored in each data block times the number of blocks stored in

The higher voltage and thus determine whether there was 1 or 0 stored. The higher the sensitivity of the sense amplifier, the faster the read operation. As the NMOS is more powerful, the pull-down is easier. Therefore, bit lines are traditionally precharged to high voltage. Many researchers are also trying to precharge at a slightly low voltage to reduce the power consumption. The write cycle begins by applying

The image displayed (or to be printed). LCDs can have SRAM in their LCD controllers. SRAM was used for the main memory of many early personal computers such as the ZX80, TRS-80 Model 100, and VIC-20. Some early memory cards in the late 1980s to early 1990s used SRAM as a storage medium, which required a lithium battery to keep the contents of the SRAM. SRAM may be integrated on chip for: Hobbyists, specifically home-built processor enthusiasts, often prefer SRAM due to


The in-memory page table. Both machines predated the first machine with a cache for main memory, the IBM System/360 Model 85, so the first hardware cache used in a computer system was not a data or instruction cache, but rather a TLB. Caches can be divided into four types, based on whether the index or tag correspond to physical or virtual addresses: The speed of this recurrence (the load latency)

The instruction is fetched from main memory. Cache read misses from a data cache usually cause a smaller delay, because instructions not dependent on the cache read can be issued and continue execution until the data is returned from main memory, and the dependent instructions can resume execution. Cache write misses to a data cache generally cause the shortest delay, because the write can be queued and there are few limitations on

The instruction that is waiting for the cache miss data. Another technology, used by many processors, is simultaneous multithreading (SMT), which allows an alternate thread to use the CPU core while the first thread waits for required CPU resources to become available. The placement policy decides where in the cache a copy of a particular entry of main memory will go. If the placement policy

The largest part of them by chip area, but SRAM is not always used for all levels (of I- or D-cache), or even any level; sometimes later levels, or all levels, are implemented with eDRAM. Other types of caches exist (that are not counted towards the "cache size" of the most important caches mentioned above), such as the translation lookaside buffer (TLB) which is part of the memory management unit (MMU) which most CPUs have. When trying to read from or write to

the last 30 years (from 1987 to 2017), with steadily decreasing transistor size (node size), the footprint-shrinking of the SRAM cell topology itself slowed down, making it harder to pack the cells more densely. Besides issues with size, a significant challenge of modern SRAM cells is static current leakage. The current that flows from the positive supply (Vdd) through the cell to ground increases exponentially when

The level-1 data cache in an AMD Athlon is two-way set associative, which means that any particular location in main memory can be cached in either of two locations in the level-1 data cache. Choosing the right value of associativity involves a trade-off. If there are ten places to which the placement policy could have mapped a memory location, then to check if that location is in the cache, ten cache entries must be searched. Checking more places takes more power and chip area, and potentially more time. On

The local cache are now stale and should be marked invalid. A data cache typically requires two flag bits per cache line – a valid bit and a dirty bit. Having a dirty bit set indicates that the associated cache line has been changed since it was read from main memory ("dirty"), meaning that the processor has written data to that line and the new value has not propagated all the way to main memory. A cache miss

The main memory can be cached in either of two locations in the cache, one logical question is: which one of the two? The simplest and most commonly used scheme, shown in the right-hand diagram above, is to use the least significant bits of the memory location's index as the index for the cache memory, and to have two entries for each index. One benefit of this scheme is that the tags stored in

The major location in a cache block. Multicolumn cache retains a high hit ratio due to its high associativity, and has a low latency comparable to a direct-mapped cache due to its high percentage of hits in major locations. The concepts of major locations and selected locations in multicolumn cache have been used in several cache designs: the ARM Cortex R chip, Intel's way-predicting cache memory, IBM's reconfigurable multi-way associative cache memory and Oracle's dynamic cache replacement way selection based on address tag bits. Cache row entries usually have

The new memory location. There are intermediate policies as well. The cache may be write-through, but the writes may be held in a store data queue temporarily, usually so multiple stores can be processed together (which can reduce bus turnarounds and improve bus utilization). Cached data from the main memory may be changed by other entities (e.g., peripherals using direct memory access (DMA) or another core in

The newly indexed cache block is a most recently used (MRU) block, it is placed in the major location in multicolumn cache with a consideration of temporal locality. Since multicolumn cache is designed for a cache with a high associativity, the number of ways in each set is high; thus, it is easy to find a selected location in the set. A selected-location index is maintained by additional hardware for

The other hand, caches with more associativity suffer fewer misses (see conflict misses), so that the CPU wastes less time reading from the slow main memory. The general guideline is that doubling the associativity, from direct mapped to two-way, or from two-way to four-way, has about the same effect on raising the hit rate as doubling the cache size. However, increasing associativity more than four does not improve hit rate as much, and is generally done for other reasons (see virtual aliasing). Some CPUs can dynamically reduce

The pattern broke down, to allow for larger caches without being forced into the doubling-in-size paradigm, with e.g. the Intel Core 2 Duo with 3 MiB L2 cache in April 2008. This happened much later for L1 caches, as their size is generally still a small number of KiB. The IBM zEC12 from 2012 is an exception, however, gaining an unusually large 96 KiB L1 data cache for its time, and e.g. the IBM z13 having

The possible ways simultaneously, using something like a content-addressable memory. A pseudo-associative cache tests each possible way one at a time. A hash-rehash cache and a column-associative cache are examples of a pseudo-associative cache. In the common case of finding a hit in the first way tested, a pseudo-associative cache is as fast as a direct-mapped cache, but it has a much lower conflict miss rate than

The processor finds that the memory location is in the cache, a cache hit has occurred. However, if the processor does not find the memory location in the cache, a cache miss has occurred. In the case of a cache hit, the processor immediately reads or writes the data in the cache line. For a cache miss, the cache allocates a new entry and copies data from main memory, then the request is fulfilled from

The relatively weak transistors in the cell itself so they can easily override the previous state of the cross-coupled inverters. In practice, access NMOS transistors M5 and M6 have to be stronger than either the bottom NMOS (M1, M3) or the top PMOS (M2, M4) transistors. This is easily obtained, as PMOS transistors are much weaker than NMOS transistors of the same size. Consequently, when one transistor pair (e.g. M3 and M4)

The same entry, they may continually knock each other out. Although simpler, a direct-mapped cache needs to be much larger than an associative one to give comparable performance, and it is more unpredictable. Let x be the block number in the cache, y the block number in memory, and n the number of blocks in the cache; then the mapping is done with the equation x = y mod n. If each location in
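
Concretely (a toy example, not from the article): with n = 8 cache blocks, memory blocks whose numbers differ by a multiple of 8 map to the same cache block and keep evicting each other.

    n = 8                        # number of cache blocks (illustrative)
    for y in (3, 11, 19):        # memory block numbers 8 apart
        print(y, "->", y % n)    # all three map to cache block 3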

The speed gap between memory performance and processor performance is increasing exponentially. The cache was introduced to reduce this speed gap. Thus knowing how well the cache is able to bridge the gap in the speed of processor and memory becomes important, especially in high-performance systems. The cache hit rate and the cache miss rate play an important role in determining this performance. To improve

The tag field. An instruction cache requires only one flag bit per cache row entry: a valid bit. The valid bit indicates whether or not a cache block has been loaded with valid data. On power-up, the hardware sets all the valid bits in all the caches to "invalid". Some systems also set a valid bit to "invalid" at other times, such as when multi-master bus snooping hardware in the cache of one processor hears an address broadcast from some other processor, and realizes that certain data blocks in

The value to be written to the bit lines. To write a 0, a 0 is applied to the bit lines, i.e. setting the complementary bit line to 1 and BL to 0. This is similar to applying a reset pulse to an SR-latch, which causes the flip-flop to change state. A 1 is written by inverting the values of the bit lines. WL is then asserted and the value that is to be stored is latched in. This works because the bit line input-drivers are designed to be much stronger than

The word line is not asserted, the access transistors M5 and M6 disconnect the cell from the bit lines. The two cross-coupled inverters formed by M1–M4 will continue to reinforce each other as long as they are connected to the supply. In theory, reading only requires asserting the word line WL and reading the SRAM cell state by a single access transistor and bit line, e.g. M6, BL. However, bit lines are relatively long and have large parasitic capacitance. To speed up reading,

Was 64 bits (in the first versions, only 63 bits were usable due to a bug) and was based on bipolar junction transistors. It was designed by using rubylith. Though it can be characterized as volatile memory, SRAM exhibits data remanence. SRAM offers a simple data access model and does not require a refresh circuit. Performance and reliability are good and power consumption is low when idle. Since SRAM requires more transistors per bit to implement, it

Was invented in 1964 by John Schmidt at Fairchild Semiconductor. It was a 64-bit MOS p-channel SRAM. SRAM was the main driver behind any new CMOS-based technology fabrication process since the 1960s, when CMOS was invented. In 1964, Arnold Farber and Eugene Schlig, working for IBM, created a hard-wired memory cell, using a transistor gate and tunnel diode latch. They replaced the latch with two transistors and two resistors,
