AMD K6 - Misplaced Pages

The K6 microprocessor was launched by AMD in 1997. The main advantage of this particular microprocessor is that it was designed to fit into existing desktop designs for Pentium -branded CPUs . It was marketed as a product that could perform as well as its Intel Pentium II equivalent but at a significantly lower price. The K6 had a considerable impact on the PC market and presented Intel with serious competition.

#64935

44-580: The AMD K6 is a superscalar P5 Pentium -class microprocessor , manufactured by AMD , which superseded the K5 . The AMD K6 is based on the Nx686 microprocessor that NexGen was designing when it was acquired by AMD. Despite the name implying a design evolving from the K5 , it is in fact a totally different design that was created by the NexGen team, including chief processor architect Greg Favor, and adapted after

88-566: A floating point unit capable of executing one multiply–add operation per cycle. AMD was designing a superscalar version until late 1995, when AMD dropped the development of the 29k because the design team was transferred to support the PC ( x86 ) side of the business. What remained of AMD's embedded business was realigned towards the embedded 186 family of 80186 derivatives. By then the majority of AMD's resources were concentrated on their high-performance x86 processors for desktop PCs, using many of

132-413: A scalar processor , which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at

176-621: A 0.7-micron process. A superscalar version of 29K was being designed, but canceled in favor of x86. It was codenamed Jaguar , and was described in November 1994 and August 1995. It was an advanced design, capable of four-way dispatch into six reservation stations and speculative out-of-order execution of instructions, with four-way retire. The register file permitted four reads and two writes at once. The caches for instructions and data were 8 KB each. Loads from cache could bypass stores . It had no on-chip FPU due to cost reasons and

220-614: A degree of flexibility in the choice of memory technologies used to retain program instructions and data. The 29k saw some use as a computational accelerator or coprocessor, particularly on the Macintosh and IBM PC-compatible platforms. For instance, Yarc Systems Corporation produced 29k-based "RISC coprocessor" cards for Macintosh II and PC AT systems, alongside other "CISC coprocessor" cards featuring Motorola 68020 and 68030 processors, and "parallel coprocessor" cards featuring T800 transputer processors. Its NuSuper (originally named

264-475: A given clock rate . Each execution unit is not a separate processor (or a core if the processor is a multi-core processor ), but an execution resource within a single CPU such as an arithmetic logic unit . While a superscalar CPU is typically also pipelined , superscalar and pipelining execution are considered different performance enhancement techniques. The former (superscalar) executes multiple instructions in parallel by using multiple execution units, whereas

308-669: A given CPU): Seymour Cray 's CDC 6600 from 1964 is often mentioned as the first superscalar design. The 1967 IBM System/360 Model 91 was another superscalar mainframe. The Intel i960 CA (1989), the AMD 29000 -series 29050 (1990), and the Motorola MC88110 (1991), microprocessors were the first commercial single-chip superscalar microprocessors. RISC microprocessors like these were the first to have superscalar execution, because RISC architectures free transistors and die area which can be used to include multiple execution units and

352-462: A hard-wired window. In the original Berkeley design, SPARC, and i960, the windows were fixed in size. A routine using only one local variable would still use up eight registers on the SPARC, wasting this expensive resource. It was here that the 29000 differed from these earlier designs, using a variable window size. In this example only two registers would be used, one for the local variable, another for

396-427: A single processor. Thus a multicore CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector capability. AMD 29000 The AMD Am29000 , commonly shortened to 29k , is a family of 32-bit RISC microprocessors and microcontrollers developed and fabricated by Advanced Micro Devices (AMD). Based on

440-520: A stack, loading local data into a set of registers during a call, and marking them "dead" when the procedure returns. Values being returned from the routines would be placed in the "global page", the top eight registers in the SPARC (for instance). The competing early RISC design from Stanford University , the Stanford MIPS , also looked at this concept but decided that improved compilers could make more efficient use of general purpose registers than

484-436: A superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching each to one of the several execution units contained inside a single CPU. Therefore, a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread. Most modern superscalar CPUs also have logic to reorder

SECTION 10

#1732772328065

528-426: Is no assurance otherwise and failure to detect a dependency would produce incorrect results. No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of execution units (e.g. ALUs), the burden of checking instruction dependencies grows rapidly, as does

572-454: Is removed and delegated to the compiler . Explicitly parallel instruction computing (EPIC) is like VLIW with extra cache prefetching instructions. Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar processors. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. The fact that they are independent means that we know that

616-480: Is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple execution units within each CPU thus multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as

660-644: The ALU , integer multiplier , integer shifter, FPU , etc. There may be multiple versions of each execution unit to enable the execution of many instructions in parallel. This differs from a multi-core processor that concurrently processes instructions from multiple threads, one thread per processing unit (called "core"). It also differs from a pipelined processor , where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion. The various alternative techniques are not mutually exclusive—they can be (and frequently are) combined in

704-492: The Am29050 had also become available, without on-chip cache but featuring a floating-point unit with fully pipelined multiply–accumulate operations , a larger 1 KB Branch Target Cache with a claimed 80% hit rate, and better-pipelined load operations sped up by a 4-entry TLB -like Physical Address Cache. Though it is not a superscalar processor, it permits a floating-point operation and an integer operation to complete at

748-797: The McCray ) and AT-Super cards, employing the Am29000 CPU and Am29027 floating-point accelerator, were followed by the MacRageous , upgrading the CPU to the Am29050. Such accelerator cards offered performance several times that of the Macintosh II itself and benchmarked competitively with RISC workstations such as the DECstation 3100. Multiple cards could also be fitted to a system. However,

792-459: The return address . It also added more registers, including the same 128 registers for the procedure stack, but adding another 64 for global access. In comparison, the SPARC had 128 registers in total, and the global set was a standard window of eight. This change resulted in much better register use in the 29000 under a wide variety of workloads. The 29000 also extended the register window stack with an in-memory (and in theory, in-cache) stack. When

836-556: The 29000 has a single branch delay slot . The buffer mitigated this by storing four or two instructions from the target address of the branch, which could be run instantly while the fetch buffer was re-filled with new instructions from memory. Support for virtual address translation followed a similar approach to that of the MIPS architecture. A 64-entry translation lookaside buffer (TLB) retained mappings from virtual to physical addresses, and upon an untranslated address being encountered,

880-411: The 29000 was used in a variety of products such as X terminals, laser printer controller cards, graphics accelerator cards, optical character recognition solutions, and network bridges. The memory architecture of the 29000 was a particular attraction for product designers, allowing them to forego external cache memory and to employ dynamic RAM directly while maintaining acceptable performance, permitting

924-603: The AMD purchase. The K6 processor included a feedback dynamic instruction reordering mechanism, MMX instructions, and a floating-point unit (FPU). It was also made pin-compatible with Intel's Pentium, enabling it to be used in the widely available " Socket 7 "-based motherboards. Like the AMD K5 , Nx586, and Nx686 before it, the K6 translated x86 instructions on the fly into dynamic buffered sequences of micro-operations . A later variation of

SECTION 20

#1732772328065

968-559: The K6 CPU, K6-2 , added floating-point -based SIMD instructions, called 3DNow! . The K6 was originally launched in April 1997, running at speeds of 166 and 200 MHz. It was followed by a 233 MHz version later in 1997. Initially, the AMD K6 processors used a Pentium II-based performance rating (PR2) to designate their speed. The PR2 rating was dropped because the rated frequency of

1012-512: The K6 design was released in May 1998, running at 300 MHz. The K6 line was updated with SIMD instructions (Branded as AMD 3DNow! ) to create the K6-2 line of microprocessors. Superscalar processor A superscalar processor (or multiple-issue processor ) is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to

1056-402: The complexity of register renaming circuitry to mitigate some dependencies. Collectively the power consumption , complexity and gate delay costs limit the achievable superscalar speedup. However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus

1100-409: The conditions to be easily saved at the expense of complicating some code. A Branch Target Cache (512 bytes on the 29000 and 1024 bytes on the 29050) stored sets of 4 or 2 sequential instructions found at the branch target address, reducing the instruction fetch latency during taken branches—the 29000 did not include any branch prediction system so there was a delay if a branch was taken. It means

1144-481: The cost of a Macintosh II system combined with such a card approached that of established RISC workstations running Unix. The AT-Super was priced at around $ 4,600 and was reported as running Unix, competing with similar products employing Intel's i860 processor. One notable product utilising the 29k was Apple's Macintosh Display Card 8·24 GC for its Macintosh IIfx , featuring a 30 MHz Am29000 processor, 64 KB static RAM cache, and 2 MB of video RAM, with

1188-419: The degree of intrinsic parallelism in the code stream forms a second limitation. Collectively, these limits drive investigation into alternative architectural changes such as very long instruction word (VLIW), explicitly parallel instruction computing (EPIC), simultaneous multithreading (SMT), and multi-core computing . With VLIW, the burdensome task of dependency checking by hardware logic at run time

1232-546: The ideas and individual parts of the 29k designs to produce the AMD K5 . The 29k evolved from the same Berkeley RISC design that also led to the Sun SPARC , Intel i960 , ARM and RISC-V . One design element used in some of the Berkeley RISC-derived designs is the concept of register windows , a technique used to speed up procedure calls significantly. The idea is to use a large set of registers as

1276-448: The instruction of one thread can be executed out of order and/or in parallel with the instruction of a different one. Also, one independent thread will not produce a pipeline bubble in the code stream of a different one, for example, due to a branch. Superscalar processors differ from multi-core processors in that the several execution units are not entire processors. A single processor is composed of finer-grained execution units such as

1320-452: The instructions to try to avoid pipeline stalls and increase parallel execution. Available performance improvement from superscalar techniques is limited by three key areas: Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of

1364-441: The latter (pipeline) executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases. In the "Simple superscalar pipeline" figure, fetching two instructions at the same time is superscaling, and fetching the next two before the first pair has been written back is pipelining. The superscalar technique is traditionally associated with several identifying characteristics (within

AMD K6 - Misplaced Pages Continue

1408-463: The more rigid methods used in the simpler P5 Pentium ; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86 . The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy

1452-753: The number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU , a later design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle . But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined , multiprocessor or multi-core architectures also achieve that, but with different methods. In

1496-468: The other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units. Although the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there

1540-535: The predecode information held of the cached instructions. AMD claimed that the superscalar 29K would have only a slightly lower performance than the K5, but much lower cost due to the size difference. The Honeywell 29KII is a CPU based on the AMD 29050, and it was extensively used in real-time avionics. Positioned as a product for "medium- to high-performance embedded applications" with potential for use in Unix workstations,

1584-458: The processor was the same as the real frequency. The release of the 266 MHz version of this chip was not until the second quarter of 1998, when AMD was able to move to the 0.25-micrometre manufacturing process. The lower voltage and higher multiplier of the K6-266 meant that it was not fully compatible with some Socket 7 motherboards, similar to the later K6-2 processors. The final iteration of

1628-493: The resulting TLB "miss" would cause the processor to trap to a software routine responsible for providing any appropriate mapping to physical memory. In contrast to the MIPS approach which employed a random register to select the TLB entry to be replaced upon a TLB miss event, the 29000 provided a dedicated lru (least recently used) register. Some products in the 29000 family provided only 16 TLB entries to be able to dedicate part of

1672-594: The same cycle. The integer and floating-point sides each have an own write port to the registers. It contained 428,000 transistors on a 1-micron process with a 0.8-micron effective channel length and was available at 20, 25, 33, and 40 MHz. Later the Am29040 was released at 33, 40, and 50 MHz, being like the Am29030 except for featuring a 4 KB data cache, a multiplication unit, and a few other enhancements. The 119 mm Am29040 contained 1.2 million transistors on

1716-538: The seminal Berkeley RISC , the 29k added a number of significant improvements. They were, for a time, the most popular RISC chips on the market, widely used in laser printers from a variety of manufacturers. Developed since 1984–1985, announced in March 1987 and released in May 1988, the initial Am29000 was followed by several versions, ending with the Am29040 in 1995. The 29050 was notable for being early to feature

1760-711: The silicon to peripherals. To compensate, the maximum page size employed by a mapping was increased from 8 KB to 16 MB. The first Am29000 was released in 1988, including a built-in MMU but floating point support was offloaded to the Am29027 FPU . Units with failed MMU or Branch Target Cache were sold as the Am29005 . In 1991 the line was extended with the Am29030 and Am29035 , which included an 8 KB or 4 KB of instruction cache, respectively. By then

1804-438: The target market. It was expected to attain 100 MHz frequency on a 0.4-micron process. AMD used the unreleased 29K microarchitecture as the basis of the K5 series of x86 -compatible processors. The ALUs were carried over, as was the re-order buffer with a slight modification. The FPU was taken from the 29050, but extended to 80 bits precision . The K5 translated the x86 instructions into "RISC-OPs" upon decoding, aided by

AMD K6 - Misplaced Pages Continue

1848-477: The traditional uniformity of the instruction set favors superscalar dispatch (this was why RISC designs were faster than CISC designs through the 1980s and into the 1990s, and it's far more complicated to do multiple dispatch when instructions have variable bit length). Except for CPUs used in low-power applications, embedded systems , and battery -powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The P5 Pentium

1892-489: The window filled the calls would be pushed off the end of the register stack into memory, restored as required when the routine returned. Generally, the 29000's register usage was considerably more advanced than competing designs based on the Berkeley concepts. Another difference with the Berkeley design is that the 29000 included no special-purpose condition code register. Any register could be used for this purpose, allowing

1936-459: Was the first superscalar x86 processor; the Nx586 , P6 Pentium Pro and AMD K5 were among the first designs which decode x86 -instructions asynchronously into dynamic microcode -like micro-op sequences prior to actual execution on a superscalar microarchitecture ; this opened up for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to

#64935