Pentium Pro - Misplaced Pages

A floating-point unit ( FPU ), numeric processing unit ( NPU ), colloquially math coprocessor , is a part of a computer system specially designed to carry out operations on floating-point numbers. Typical operations are addition , subtraction , multiplication , division , and square root . Some FPUs can also perform various transcendental functions such as exponential or trigonometric calculations, but the accuracy can be low, so some systems prefer to compute these functions in software.

#790209

103-447: The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It introduced the P6 microarchitecture (sometimes termed i686) and was originally intended to replace the original Pentium in a full range of applications. Later, it was reduced to a more narrow role as a server and high-end desktop processor. The Pentium Pro

206-618: A 64 KB (one segment) stack in memory supported by computer hardware . Only words (two bytes) can be pushed to the stack. The stack grows toward numerically lower addresses, with SS:SP pointing to the most recently pushed item. There are 256 interrupts , which can be invoked by both hardware and software. The interrupts can cascade, using the stack to store the return address . The original Intel 8086 and 8088 have fourteen 16- bit registers. Four of them (AX, BX, CX, DX) are general-purpose registers (GPRs), although each may have an additional purpose; for example, only CX can be used as

309-579: A backward compatible version of this functionality on the same microprocessor as the main processor. In addition to this, modern x86 designs also contain a SIMD -unit (see SSE below) where instructions can work in parallel on (one or two) 128-bit words, each containing two or four floating-point numbers (each 64 or 32 bits wide respectively), or alternatively, 2, 4, 8 or 16 integers (each 64, 32, 16 or 8 bits wide respectively). The presence of wide SIMD registers means that existing x86 processors can load or store up to 128 bits of memory data in

412-592: A coprocessor rather than as an integrated unit (but now in addition to the CPU, e.g. GPUs – that are coprocessors not always built into the CPU ;– have FPUs as a rule, while first generations of GPUs did not). This could be a single integrated circuit , an entire circuit board or a cabinet. Where floating-point calculation hardware has not been provided, floating-point calculations are done in software, which takes more processor time, but avoids

515-476: A mobile version of the original Pentium Pro due to power draw and heat concerns. At least one vendor sold a portable computer with a Pentium Pro (Imperial Computer's 6200TLP). In Intel's "Family/Model/Stepping" scheme, the Pentium Pro is family 6, model 1, and its Intel Product code is 80521. The process used to fabricate the Pentium Pro processor die and its separate cache memory die changed, leading to

618-449: A central cache. However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard

721-437: A combination of processes used in the same package: The Pentium Pro (up to 512 KB cache) is packaged in a ceramic multi-chip module (MCM). The MCM contains two underside cavities in which the microprocessor die and its companion cache die reside. The dies are bonded to a heat slug, whose exposed top helps the heat from the dies to be transferred more directly to cooling apparatus such as a heat sink. The dies are connected to

824-403: A compatible design) and the scalability of x86 chips in the form of modern multi-core CPUs, is underlining x86 as an example of how continuous refinement of established industry standards can resist the competition from completely new architectures. The table below lists processor models and model series implementing various architectures in the x86 family, in chronological order. Each line item

927-539: A counter with the loop instruction. Each can be accessed as two separate bytes (thus BX's high byte can be accessed as BH and low byte as BL). Two pointer registers have special roles: SP (stack pointer) points to the "top" of the stack , and BP (base pointer) is often used to point at some other place in the stack, typically above the local variables (see frame pointer ). The registers SI, DI, BX and BP are address registers , and may also be used for array indexing. One of four possible 'segment registers' (CS, DS, SS and ES)

1030-509: A few number of chipsets supported these slotkets, and so did not see widespread use. The Intel 440FX chipset explicitly supported both Pentium Pro and Pentium II processors, however the Intel 440BX and later Slot 1 chipsets did not explicitly support the Pentium Pro. Slotkets eventually saw renewed popularity in the form of Socket 370 to Slot 1 adapters, when Intel introduced Socket 370 Celeron and Pentium III processors in

1133-463: A finite number of operations it can support – for example, no FPUs directly support arbitrary-precision arithmetic . When a CPU is executing a program that calls for a floating-point operation that is not directly supported by the hardware, the CPU uses a series of simpler floating-point operations. In systems without any floating-point hardware, the CPU emulates it using a series of simpler fixed-point arithmetic operations that run on

SECTION 10

#1732765052791

1236-889: A gate array to interface the ARM2 processor with the WE32206 to support the additional ARM floating-point instructions. Acorn later offered the FPA10 coprocessor, developed by ARM, for various machines fitted with the ARM3 processor. Coprocessors were available for the Motorola 68000 family , the 68881 and 68882 . These were common in Motorola 68020 / 68030 -based workstations , like the Sun-3 series. They were also commonly added to higher-end models of Apple Macintosh and Commodore Amiga series, but unlike IBM PC-compatible systems, sockets for adding

1339-537: A latency of three and five cycles, respectively. Division and square-root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18-36 and 29-69 cycles, respectively. The smallest number is for single precision (32-bit) floating-point numbers and the largest for extended precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies, preventing them from executing only when

1442-403: A load unit, store address unit, and a store data unit. One of the integer units shares the same ports as the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops per a cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has

1545-417: A machine for the masses due to poor 16-bit support for Windows 95 and many other 16-bit and mixed 16/32-bit operating systems, it did see significant successes in the file server space due to its advanced, integrated bus design, introducing many advanced features that had formerly only been available in the pricey workstation segment into the commodity marketplace. X86 x86 (also known as 80x86 or

1648-476: A major change to the architecture referred to as X86S (formerly known as X86-S). The S in X86S stands for "simplification", which aims to remove support for legacy execution modes and instructions. A processor implementing this proposal would start execution directly in long mode and would only support 64-bit operating systems. 32-bit code would only be supported for user applications running in ring 3, and would use

1751-547: A memory location. However, this memory operand may also be the destination (or a combined source and destination), while the other operand, the source, can be either register or immediate. Among other factors, this contributes to a code size that rivals eight-bit machines and enables efficient use of instruction cache memory. The relatively small number of general registers (also inherited from its 8-bit ancestors) has made register-relative addressing (using small immediate offsets) an important method of accessing operands, especially on

1854-560: A more complex micro-op which fits the execution model better and thus can be executed faster or with fewer machine resources involved. Another way to try to improve performance is to cache the decoded micro-operations, so the processor can directly access the decoded micro-operations from a special cache, instead of decoding them again. Intel followed this approach with the Execution Trace Cache feature in their NetBurst microarchitecture (for Pentium 4 processors) and later in

1957-400: A performance boost by allowing the avoidance of costly jump and branch instructions. In eg CMOVxx destreg1, source_operand2 the first operand is the destination register, the second the source register or memory location. The second operand unfortunately can not be an immediate (in-line constant) value and such a constant would have to be placed in a register first. The predicate code xx can take

2060-436: A plain 16-bit address. The term "x86" came into being because the names of several successors to Intel's 8086 processor end in "86", including the 80186 , 80286 , 80386 and 80486 . Colloquially, their names were "186", "286", "386" and "486". The term is not synonymous with IBM PC compatibility , as this implies a multitude of other computer hardware . Embedded systems and general-purpose computers used x86 chips before

2163-496: A response to the successful 8080-compatible Zilog Z80 , the x86 line soon grew in features and processing power. Today, x86 is ubiquitous in both stationary and portable personal computers, and is also used in midrange computers , workstations , servers, and most new supercomputer clusters of the TOP500 list. A large amount of software , including a large list of x86 operating systems are using x86-based hardware. Modern x86

SECTION 20

#1732765052791

2266-670: A single instruction and also perform bitwise operations (although not integer arithmetic ) on full 128-bits quantities in parallel. Intel's Sandy Bridge processors added the Advanced Vector Extensions (AVX) instructions, widening the SIMD registers to 256 bits. The Intel Initial Many Core Instructions implemented by the Knights Corner Xeon Phi processors, and the AVX-512 instructions implemented by

2369-432: A special FPU named FlexFPU, which uses simultaneous multithreading . Each physical integer core, two per module, is single-threaded, in contrast with Intel's Hyperthreading , where two virtual simultaneous threads share the resources of a single physical core. Some floating-point hardware only supports the simplest operations: addition, subtraction, and multiplication. But even the most complex floating-point hardware has

2472-511: A usable upgrade for quad-processor systems. These specially packaged Pentium II OverDrive processors were also used to upgrade the ASCI Red supercomputer in 1999. This makes the ASCI Red supercomputer, the first computer to reach the one teraFLOPS performance mark with dual Pentium Pro processors in 1996, to now become the first computer overall to exceed the two teraFLOPS performance mark with

2575-462: A wider 36-bit address bus , usable by Physical Address Extension (PAE), allowing it to access up to 64 GB of memory. The Pentium Pro has an 8 KB instruction cache , from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders . There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts

2678-470: Is allowed for almost all instructions. The largest native size for integer arithmetic and memory addresses (or offsets ) is 16, 32 or 64 bits depending on architecture generation (newer processors include direct support for smaller integers as well). Multiple scalar values can be handled simultaneously via the SIMD unit present in later generations, as described below. Immediate addressing offsets and immediate data may be expressed as 8-bit quantities for

2781-421: Is an example of MLP, Memory Level Parallelism .) These properties combined to produce an L2 cache that was immensely faster than the motherboard-based caches of older processors. This cache alone gave the CPU an advantage in input/output performance over older x86 CPUs. In multiprocessor configurations, Pentium Pro's integrated cache skyrocketed performance in comparison to architectures which had each CPU sharing

2884-552: Is available, the CORDIC methods are most commonly used for transcendental function evaluation. In most modern computer architectures, there is some division of floating-point operations from integer operations. This division varies significantly by architecture; some have dedicated floating-point registers, while some, like Intel x86 , go as far as independent clocking schemes. CORDIC routines have been implemented in Intel x87 coprocessors ( 8087 , 80287, 80387 ) up to

2987-691: Is characterized by significantly improved or commercially successful processor microarchitecture designs. At various times, companies such as IBM , VIA , NEC , AMD , TI , STM , Fujitsu , OKI , Siemens , Cyrix , Intersil , C&T , NexGen , UMC , and DM&P started to design or manufacture x86 processors (CPUs) intended for personal computers and embedded systems. Other companies that designed or manufactured x86 or x87 processors include ITT Corporation , National Semiconductor , ULSI System Technology, and Weitek . Such x86 implementations were seldom simple copies but often employed different internal microarchitectures and different solutions at

3090-415: Is considered to be minor and occurs under such special circumstances that very few, if any, software programs are affected. The Pentium Pro P6 microarchitecture was used in one form or another by Intel for more than a decade. The pipeline would scale from its initial 150 MHz start, all the way up to 1.4 GHz with the "Tualatin" Pentium III . The design's various traits would continue after that in

3193-463: Is one of the two modes only available in long mode . The addressing modes were not dramatically changed from 32-bit mode, except that addressing was extended to 64 bits, virtual addresses are now sign extended to 64 bits (in order to disallow mode bits in virtual addresses), and other selector details were dramatically reduced. In addition, an addressing mode was added to allow memory references relative to RIP (the instruction pointer ), to ease

Pentium Pro - Misplaced Pages Continue

3296-587: Is relatively uncommon in embedded systems , however, and small low power applications (using tiny batteries), and low-cost microprocessor markets, such as home appliances and toys, lack significant x86 presence. Simple 8- and 16-bit based architectures are common here, as well as simpler RISC architectures like RISC-V , although the x86-compatible VIA C7 , VIA Nano , AMD 's Geode , Athlon Neo and Intel Atom are examples of 32- and 64-bit designs used in some relatively low-power and low-cost segments. There have been several attempts, including by Intel, to end

3399-491: Is used to form a memory address. In the original 8086 / 8088 / 80186 / 80188 every address was built from a segment register and one of the general purpose registers. For example ds:si is the notation for an address formed as [16 * ds + si] to allow 20-bit addressing rather than 16 bits, although this changed in later processors. At that time only certain combinations were supported. The FLAGS register contains flags such as carry flag , overflow flag and zero flag . Finally,

3502-407: The fstsw instruction, and it is common to simply use some of its bits for branching by copying it into the normal FLAGS. In the Intel 80286 , to support protected mode , three special registers hold descriptor table addresses (GDTR, LDTR, IDTR ), and a fourth task register (TR) is used for task switching. The 80287 is the floating-point coprocessor for the 80286 and has the same registers as

3605-525: The 6x86 was significantly faster than the Pentium on integer code. AMD later managed to grow into a serious contender with the K6 set of processors, which gave way to the very successful Athlon and Opteron . There were also other contenders, such as Centaur Technology (formerly IDT ), Rise Technology , and Transmeta . VIA Technologies ' energy efficient C3 and C7 processors, which were designed by

3708-497: The 80287 , and 80386/80386SX -based machines – for the 80387 and 80387SX respectively, although early ones were socketed for the 80287, since the 80387 did not exist yet. Other companies manufactured co-processors for the Intel x86 series. These included Cyrix and Weitek . Acorn Computers opted for the WE32206 to offer single , double and extended precision to its ARM powered Archimedes range, introducing

3811-496: The 80486 microprocessor series, as well as in the Motorola 68881 and 68882 for some kinds of floating-point instructions, mainly as a way to reduce the gate counts (and complexity) of the FPU subsystem. Floating-point operations are often pipelined . In earlier superscalar architectures without general out-of-order execution , floating-point operations were sometimes pipelined separately from integer operations. The modular architecture of Bulldozer microarchitecture uses

3914-496: The 80486 and all subsequent x86 models, the floating-point processing unit (FPU) is integrated on-chip. The Pentium MMX added eight 64-bit MMX integer vector registers (MM0 to MM7, which share lower bits with the 80-bit-wide FPU stack). With the Pentium III , Intel added a 32-bit Streaming SIMD Extensions (SSE) control/status register (MXCSR) and eight 128-bit SSE floating-point registers (XMM0 to XMM7). Starting with

4017-418: The 8086 family ) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the 8086 microprocessor and its 8-bit-external-bus variant, the 8088 . The 8086 was introduced in 1978 as a fully 16-bit extension of 8-bit Intel's 8080 microprocessor, with memory segmentation as a solution for addressing more memory than can be covered by

4120-406: The 8088 and 80286 were still in common use, the term x86 usually represented any 8086-compatible CPU. Today, however, x86 usually implies binary compatibility with the 32-bit instruction set of the 80386 . This is due to the fact that this instruction set has become something of a lowest common denominator for many modern operating systems and also probably because the term became common after

4223-573: The AMD Opteron processor, the x86 architecture extended the 32-bit registers into 64-bit registers in a way similar to how the 16 to 32-bit extension took place. An R -prefix (for "register") identifies the 64-bit registers (RAX, RBX, RCX, RDX, RSI, RDI, RBP, RSP, RFLAGS, RIP), and eight additional 64-bit general registers (R8–R15) were also introduced in the creation of x86-64 . Also, eight more SSE vector registers (XMM8–XMM15) were added. However, these extensions are only usable in 64-bit mode, which

Pentium Pro - Misplaced Pages Continue

4326-653: The Centaur company, were sold for many years following their release in 2005. Centaur's 2008 design, the VIA Nano , was their first processor with superscalar and speculative execution . It was introduced at about the same time (in 2008) as Intel introduced the Intel Atom , its first "in-order" processor after the P5 Pentium . Many additions and extensions have been added to the original x86 instruction set over

4429-624: The IBM 704 had floating-point arithmetic as a standard feature, one of its major improvements over its predecessor the IBM 701 . This was carried forward to its successors the 709, 7090, and 7094. In 1963, Digital announced the PDP-6 , which had floating point as a standard feature. In 1963, the GE-235 featured an "Auxiliary Arithmetic Unit" for floating point and double-precision calculations. Historically, some systems implemented floating point with

4532-461: The machine code format was expanded. To provide backward compatibility, segments with executable code can be marked as containing either 16-bit or 32-bit instructions. Special prefixes allow inclusion of 32-bit instructions in a 16-bit segment or vice versa. The 80386 had an optional floating-point coprocessor, the 80387 ; it had eight 80-bit wide registers: st(0) to st(7), like the 8087 and 80287. The 80386 could also use an 80287 coprocessor. With

4635-535: The write combining features of the CPU. Memory type range registers (MTRRs) are set automatically by Windows video drivers starting from 1997, and from there the improved cache/memory subsystem and FPU performance caused it to outclass the Pentium clock-for-clock in the emerging 3D games of the mid–to–late 1990s, particularly when using Windows NT 4.0 . However, its lack of MMX implementation reduces performance in multimedia applications that made use of those instructions. Likely Pentium Pro's most noticeable addition

4738-471: The 8087 with the same data formats. With the advent of the 32-bit 80386 processor, the 16-bit general-purpose registers, base registers, index registers, instruction pointer, and FLAGS register , but not the segment registers, were expanded to 32 bits. The nomenclature represented this by prefixing an " E " (for "extended") to the register names in x86 assembly language . Thus, the AX register corresponds to

4841-585: The Decoded Stream Buffer (for Core-branded processors since Sandy Bridge). Transmeta used a completely different method in their Crusoe x86 compatible CPUs. They used just-in-time translation to convert x86 instructions to the CPU's native VLIW instruction set. Transmeta argued that their approach allows for more power efficient designs since the CPU can forgo the complicated decode step of more traditional x86 implementations. Addressing modes for 16-bit processor modes can be summarized by

4944-407: The Intel iAPX 432 and the lead architect of the i686 chip, the Pentium Pro. He was no doubt intimately familiar with all this history. The Pentium Pro was designed to include the 4-way SMP split-transaction cache-coherent bus as a mandatory feature of every chip produced. This also served to deny competition access to the socket to produce cloned processors. While the Pentium Pro was not successful as

5047-877: The Knights Landing Xeon Phi processors and by Skylake-X processors, use 512-bit wide SIMD registers. During execution , current x86 processors employ a few extra decoding steps to split most instructions into smaller pieces called micro-operations. These are then handed to a control unit that buffers and schedules them in compliance with x86-semantics so that they can be executed, partly in parallel, by one of several (more or less specialized) execution units . These modern x86 designs are thus pipelined , superscalar , and also capable of out of order and speculative execution (via branch prediction , register renaming , and memory dependence prediction ), which means they may execute multiple (partial or complete) x86 instructions simultaneously, and not necessarily in

5150-540: The PC-compatible market started , some of them before the IBM PC (1981) debut. As of June 2022 , most desktop and laptop computers sold are based on the x86 architecture family, while mobile categories such as smartphones or tablets are dominated by ARM . At the high end, x86 continues to dominate computation-intensive workstation and cloud computing segments. In the 1980s and early 1990s, when

5253-407: The Pentium Pro bus was influenced by Futurebus , the Intel iAPX 432 bus, and elements of the Intel i960 bus. Futurebus had been intended as an advanced bus to replace VMEbus used with the Motorola 68000 from the late 1970s, but it stagnated in standardization committee for more than a decade if you count all the twists and turns. Intel's iAPX 432 initiative was also a commercial failure, but in

SECTION 50

#1732765052791

5356-564: The Pentium Pro's P6 microarchitecture , a fully 32-bit operating system is needed, such as Windows NT , Linux , Unix , or OS/2 . The performance issues on legacy code were later partly mitigated by Intel with the Pentium II. Compared to RISC microprocessors, the Pentium Pro, when introduced, slightly outperformed the fastest RISC microprocessors on integer performance when running the SPECint95 benchmark, but floating-point performance

5459-486: The Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are reduced instruction set computer (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on

5562-776: The Pentium's P5 microarchitecture. It has a decoupled, 14-stage superpipelined architecture which used an instruction pool. The Pentium Pro ( P6 ) implemented many radical architectural differences mirroring other contemporary x86 designs such as the NexGen Nx586 and Cyrix 6x86 . The Pentium Pro pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one execution unit at once. The Pentium Pro thus featured out-of-order execution , including speculative execution via register renaming . It also had

5665-434: The advanced but delayed 5k86 ( K5 ), which, internally, was closely based on AMD's earlier 29K RISC design; similar to NexGen 's Nx586 , it used a strategy such that dedicated pipeline stages decode x86 instructions into uniform and easily handled micro-operations , a method that has remained the basis for most x86 designs to this day. Some early versions of these microprocessors had heat dissipation problems. The 6x86

5768-415: The art, had been planned for 2021; as of March 2022 the release had not taken place, however. The instruction set architecture has twice been extended to a larger word size. In 1985, Intel released the 32-bit 80386 (later known as i386) which gradually replaced the earlier 16-bit chips in computers (although typically not in embedded systems ) during the following years; this extended programming model

5871-473: The corresponding YMM register. Floating-point unit In general-purpose computer architectures , one or more FPUs may be integrated as execution units within the central processing unit ; however, many embedded processors do not have hardware support for floating-point operations (while they increasingly have them as standard). When a CPU is executing a program that calls for a floating-point operation, there are three ways to carry it out: In 1954,

5974-434: The cost of the extra hardware. For a particular computer architecture, the floating-point unit instructions may be emulated by a library of software functions; this may permit the same object code to run on systems with or without floating-point hardware. Emulation can be implemented on any of several levels: in the CPU as microcode , as an operating system function, or in user-space code. When only integer functionality

6077-616: The derivative core called " Banias " in Pentium M and Intel Core ( Yonah ), which itself would evolve into the Core microarchitecture ( Core 2 processor) in 2006 and onward. The Pentium Pro (P6) introduced new instructions into the Intel range; the CMOVxx (‘conditional move’) instructions can move a value that is either the contents of a register or memory location into another register or not, according to some predicate logical condition xx on

6180-515: The electronic and physical levels. Quite naturally, early compatible microprocessors were 16-bit, while 32-bit designs were developed much later. For the personal computer market, real quantities started to appear around 1990 with i386 and i486 compatible processors, often named similarly to Intel's original chips. After the fully pipelined i486 , in 1993 Intel introduced the Pentium brand name (which, unlike numbers, could be trademarked ) for their new set of superscalar x86 designs. With

6283-427: The entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost. All versions of the chip were expensive, those with 1024 KB being particularly so, since it required two 512 KB cache dies as well as the processor die. Pentium Pro clock speeds were 150, 166, 180 or 200 MHz with a 60 or 66 MHz external bus clock. A prototype 133 MHz Pentium Pro

SECTION 60

#1732765052791

6386-510: The execution of those instructions. In the 1980s, it was common in IBM PC /compatible microcomputers for the FPU to be entirely separate from the CPU , and typically sold as an optional add-on. It would only be purchased if needed to speed up or enable math-intensive programs. The IBM PC, XT , and most compatibles based on the 8088 or 8086 had a socket for the optional 8087 coprocessor. The AT and 80286 -based systems were generally socketed for

6489-401: The execution units with the decode steps opens up possibilities for more analysis of the (buffered) code stream, and therefore permits detection of operations that can be performed in parallel, simultaneously feeding more than one execution unit. The latest processors also do the opposite when appropriate; they combine certain x86 sequences (such as a compare followed by a conditional jump) into

6592-549: The first two actively produce modern 64-bit designs, leading to what has been called a "duopoly" of Intel and AMD in x86 processors. However, in 2014 the Shanghai-based Chinese company Zhaoxin , a joint venture between a Chinese company and VIA Technologies, began designing VIA based x86 processors for desktops and laptops. The release of its newest "7" family of x86 processors (e.g. KX-7000), which are not quite as fast as AMD or Intel chips but are still state of

6695-452: The flags register, xx being a flags predicate code as given in the condition for conditional jump instructions. So for example CMOVNE moves a specified value into a register or not depending on whether the NE (not-equal) condition is true in the flags register ie Z flag = 0. This allows the evaluation of if-then-else operations and for example the ? : operation in C. These instructions give

6798-528: The formula: Addressing modes for 32-bit x86 processor modes can be summarized by the formula: Addressing modes for the 64-bit processor mode can be summarized by the formula: Instruction relative addressing in 64-bit code (RIP + displacement, where RIP is the instruction pointer register ) simplifies the implementation of position-independent code (as used in shared libraries in some operating systems). The 8086 had 64 KB of eight-bit (or alternatively 32 K-word of 16-bit ) I/O space, and

6901-399: The frequently occurring cases or contexts where a −128..127 range is enough. Typical instructions are therefore 2 or 3 bytes in length (although some are much longer, and some are single-byte). To further conserve encoding space, most registers are expressed in opcodes using three or four bits, the latter via an opcode prefix in 64-bit mode, while at most one operand to an instruction can be

7004-405: The full complement of functions such as a barrel shifter , multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses. The FPU executes floating-point operations. Addition and multiplication are pipelined and have

7107-502: The full range of values as allowed in conditional branches. A second development was the documentation of the UD2 illegal instruction. This op code is reserved and guaranteed to cause an illegal instruction exception on the P6 and all later processors. This allows developers to easily crash the current program in a future-proof fashion when a bug is detected by software. Despite being advanced for

7210-501: The implementation of position-independent code , used in shared libraries in some operating systems. SIMD registers XMM0–XMM15 (XMM0–XMM31 when AVX-512 is supported). SIMD registers YMM0–YMM15 (YMM0–YMM31 when AVX-512 is supported). Lower half of each of the YMM registers maps onto the corresponding XMM register. SIMD registers ZMM0–ZMM31. Lower half of each of the ZMM registers maps onto

7313-408: The instruction pointer (IP) points to the next instruction that will be fetched from memory and then executed; this register cannot be directly accessed (read or written) by a program. The Intel 80186 and 80188 are essentially an upgraded 8086 or 8088 CPU, respectively, with on-chip peripherals added, and they have the same CPU registers as the 8086 and 8088 (in addition to interface registers for

7416-459: The integer arithmetic logic unit . The software that lists the necessary series of operations to emulate floating-point operations is often packaged in a floating-point library . In some cases, FPUs may be specialized, and divided between simpler floating-point operations (mainly addition and multiplication) and more complicated operations, like division. In some cases, only the simple operations may be implemented in hardware or microcode , while

7519-441: The introduction of the 80386 in 1985. A few years after the introduction of the 8086 and 8088, Intel added some complexity to its naming scheme and terminology as the "iAPX" of the ambitious but ill-fated Intel iAPX 432 processor was tried on the more successful 8086 family of chips, applied as a kind of system-level prefix. An 8086 system, including coprocessors such as 8087 and 8089 , and simpler Intel-specific system chips,

7622-477: The late 1990s. These form of slotkets allowed for lower costs for computer builders, especially with dual processor machines, and gave Slot 1 motherboards the ability to continue receiving CPU upgrades beyond the then-currently available Slot 1 CPUs. The Pentium Pro used GTL+ signaling in its front-side bus. The Pentium Pro could be used by itself on up to four-way designs. Eight-way Pentium Pro computers were also built, but these used multiple buses. The design of

7725-447: The lower 16 bits of the new 32-bit EAX register, SI corresponds to the lower 16 bits of ESI, and so on. The general-purpose registers, base registers, and index registers can all be used as the base in addressing modes, and all of those registers except for the stack pointer can be used as the index in addressing modes. Two new segment registers (FS and GS) were added. With a greater number of registers, instructions and operands,

7828-404: The main system bus with the CPU, the Pentium Pro's cache had its own back-side bus (called dual independent bus by Intel). Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to 4), reducing cache-miss penalties. (This

7931-604: The market dominance of the "inelegant" x86 architecture designed directly from the first simple 8-bit microprocessors. Examples of this are the iAPX 432 (a project originally named the Intel 8800 ), the Intel 960 , Intel 860 and the Intel/Hewlett-Packard Itanium architecture. However, the continuous refinement of x86 microarchitectures , circuitry and semiconductor manufacturing would make it hard to replace x86 in many segments. AMD's 64-bit extension of x86 (which Intel eventually responded to with

8034-455: The memory (e.g., add this register to this location in the memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more micro-ops than four are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro

8137-922: The more complex operations are implemented as software. In some current architectures, the FPU functionality is combined with SIMD units to perform SIMD computation; an example of this is the augmentation of the x87 instructions set with SSE instruction set in the x86-64 architecture used in newer Intel and AMD processors. Several models of the PDP-11 , such as the PDP-11/45, PDP-11/34a, PDP-11/44, and PDP-11/70, supported an add-on floating-point unit to support floating-point instructions. The PDP-11/60, MicroPDP-11/23 and several VAX models could execute floating-point instructions without an add-on FPU (the MicroPDP-11/23 required an add-on microcode option), and offered add-on accelerators to further speed

8240-473: The name EM64T and finally using Intel 64. Microsoft and Sun Microsystems / Oracle also use term "x64", while many Linux distributions , and the BSDs also use the "amd64" term. Microsoft Windows, for example, designates its 32-bit versions as "x86" and 64-bit versions as "x64", while installation files of 64-bit Windows versions are required to be placed into a directory called "AMD64". In 2023, Intel proposed

8343-474: The package using conventional wire bonding. The cavities are capped with a ceramic plate. The Pentium Pro with 1 MB of cache uses a plastic MCM. Instead of two cavities, there is only one, in which the three dies reside, bonded to the package instead of a heat slug. The cavities are filled in with epoxy. The MCM has 387 pins, of which approximately half are arranged in a pin grid array (PGA) and half in an interstitial pin grid array (IPGA). The packaging

8446-454: The peripherals). The 8086, 8088, 80186, and 80188 can use an optional floating-point coprocessor, the 8087 . The 8087 appears to the programmer as part of the CPU and adds eight 80-bit wide registers, st(0) to st(7), each of which can hold numeric data in one of seven formats: 32-, 64-, or 80-bit floating point, 16-, 32-, or 64-bit (binary) integer, and 80-bit packed decimal integer. It also has its own 16-bit status register accessible through

8549-405: The process they did learn how to build a split-transaction bus to support a cacheless multiprocessor system. The i960 had further developed the split-transaction iAPX 432 bus to include a cache coherency protocol, ending up with a feature set highly reminiscent of the original Futurebus ambitions. The lead architect of i960 was superscalarity specialist Fred Pollack who was also the lead engineer of

8652-500: The result has to be stored in the ROB. After the microprocessor was released, a bug was discovered in the floating point unit , commonly called the "Pentium Pro and Pentium II FPU bug" and by Intel as the "flag erratum". The bug occurs under some circumstances during floating point-to-integer conversion when the floating point number will not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour. The bug

8755-418: The same order as given in the instruction stream. Some Intel CPUs ( Xeon Foster MP , some Pentium 4 , and some Nehalem and later Intel Core processors) and AMD CPUs (starting from Zen ) are also capable of simultaneous multithreading with two threads per core ( Xeon Phi has four threads per core). Some Intel CPUs support transactional memory ( TSX ). When introduced, in the mid-1990s, this method

8858-443: The same simplified segmentation as long mode. The x86 architecture is a variable instruction length, primarily " CISC " design with emphasis on backward compatibility . The instruction set is not typical CISC, however, but basically an extended version of the simple eight-bit 8008 and 8080 architectures. Byte-addressing is enabled and words are stored in memory with little-endian byte order. Memory access to unaligned addresses

8961-454: The stack. Much work has therefore been invested in making such accesses as fast as register accesses—i.e., a one cycle instruction throughput, in most circumstances where the accessed data is available in the top-level cache. A dedicated floating-point processor with 80-bit internal registers, the 8087 , was developed for the original 8086 . This microprocessor subsequently developed into the extended 80387 , and later processors incorporated

9064-418: The time of the Pentium Pro's release were 16-bit DOS , and mixed 16/32-bit Windows 3.1x and Windows 95 (although the latter requires a 32-bit 80386 CPU as a minimum, much of its code is still 16-bit for performance reasons, such as the 16-bit Windows USER dynamic link library , user.exe ). This, along with the high cost of Pentium Pro systems, led to tepid sales among PC buyers at the time. To fully use

9167-427: The time, the Pentium Pro's out-of-order register renaming architecture had trouble running 16-bit code and mixed code ( 8-bit with 16-bit (8/16), or 16-bit with 32-bit (16/32), as using partial registers cause frequent pipeline flushing. Specific use of partial registers was then a common performance optimization, as it incurred no performance penalty on pre-P6 Intel processors; also, the dominant operating systems at

9270-588: The upgrade to dual Pentium II OverDrive processors in 1999. ASCI Red continued to use dual Pentium II OverDrive processors for the remainder of its lifespan before being decommissioned in 2006. As Slot 1 motherboards became prevalent, several manufacturers released slotket (or slocket) adapters, such as the Tyan M2020, Asus C-P6S1, Tekram P6SL1, and the Abit KP6. These sockets allowed Pentium Pro processors to be used with Slot 1 motherboards. However, only

9373-490: The x86 naming scheme now legally cleared, other x86 vendors had to choose different names for their x86-compatible products, and initially some chose to continue with variations of the numbering scheme: IBM partnered with Cyrix to produce the 5x86 and then the very efficient 6x86 (M1) and 6x86 MX ( MII ) lines of Cyrix designs, which were the first x86 microprocessors implementing register renaming to enable speculative execution . AMD meanwhile designed and manufactured

9476-484: The years, almost consistently with full backward compatibility . The architecture family has been implemented in processors from Intel, Cyrix , AMD , VIA Technologies and many other companies; there are also open implementations, such as the Zet SoC platform (currently inactive). Nevertheless, of those, only Intel, AMD, VIA Technologies, and DM&P Electronics hold x86 architectural licenses, and from these, only

9579-537: Was also affected by a few minor compatibility problems, the Nx586 lacked a floating-point unit (FPU) and (the then crucial) pin-compatibility, while the K5 had somewhat disappointing performance when it was (eventually) introduced. Customer ignorance of alternatives to the Pentium series further contributed to these designs being comparatively unsuccessful, despite the fact that the K5 had very good Pentium compatibility and

9682-401: Was also used in supercomputers , most notably ASCI Red , which used two Pentium Pro CPUs on each computing node and was the first computer to reach over one teraFLOPS in 1996, holding the number one spot in the TOP500 list from 1997 to 2000. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors , respectively, the Pentium Pro contained 5.5 million transistors. It

9785-500: Was capable of both dual- and quad-processor configurations and only came in one form factor, the relatively large rectangular Socket 8 . The Pentium Pro was succeeded by the Pentium II Xeon in 1998. The lead architect of Pentium Pro was Fred Pollack who was specialized in superscalarity and had also worked as the lead engineer of the Intel iAPX 432 . The Pentium Pro incorporated a new microarchitecture , different from

9888-562: Was designed for Socket 8 . In 1998, the 300/333 MHz Pentium II OverDrive processor for Socket 8 was released. Based on some of the technology used in the Deschutes Pentium II Xeon , it featured double L1 and 512 KB of full-speed L2 cache with MMX capabilities, and was produced by Intel as a drop-in upgrade option for owners of Pentium Pro systems. However, it only supported two-way glueless multiprocessing, not four-way or higher, which did not make it

9991-488: Was developed in its earliest stages of development but was never released. Some users chose to overclock their Pentium Pro chips, with the 200 MHz version often being run at 233 MHz, the 180 MHz version often being run at 200 MHz, and the 150 MHz version often being run at 166 MHz. The chip was popular in symmetric multiprocessing configurations, with dual and quad SMP server and workstation setups being commonplace. Intel skipped out on providing

10094-468: Was its on-package L2 cache , which ranged from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not feasibly allow a large L2 cache to be integrated into the processor core. Intel instead placed the L2 die(s) separately in the package which still allowed it to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared

10197-403: Was originally referred to as the i386 architecture (like its first implementation) but Intel later dubbed it IA-32 when introducing its (unrelated) IA-64 architecture. In 1999–2003, AMD extended this 32-bit architecture to 64 bits and referred to it as x86-64 in early documents and later as AMD64 . Intel soon adopted AMD's architectural extensions under the name IA-32e, later using

10300-909: Was significantly lower, half that of some RISC microprocessors. The Pentium Pro's integer performance lead disappeared rapidly, first overtaken by the MIPS Technologies R10000 in January 1996, and then by Digital Equipment Corporation 's EV56 variant of the Alpha 21164 . Reviewers quickly noted the very slow writes to video memory as the weak spot of the P6 platform, with performance here being as low as 10% of an identically clocked Pentium system in benchmarks such as VIDSPEED. Methods to circumvent this included setting VESA drawing to system memory instead of video memory in games such as Quake , and later on utilities such as FASTVID emerged, which could double performance in certain games by enabling

10403-436: Was sometimes referred to as a "RISC core" or as "RISC translation", partly for marketing reasons, but also because these micro-operations share some properties with certain types of RISC instructions. However, traditional microcode (used since the 1950s) also inherently shares many of the same properties; the new method differs mainly in that the translation to micro-operations now occurs asynchronously. Not having to synchronize

10506-447: Was the first processor in the x86 family to support upgradeable microcode under BIOS and/or operating system (OS) control. Micro-ops exit the re-order buffer (ROB) and enter a reserve station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one floating-point unit (FPU),

10609-478: Was thereby described as an iAPX 86 system. There were also terms iRMX (for operating systems), iSBC (for single-board computers), and iSBX (for multimodule boards based on the 8086-architecture), all together under the heading Microsystem 80 . However, this naming scheme was quite temporary, lasting for a few years during the early 1980s. Although the 8086 was primarily developed for embedded systems and small multi-user or single-user computers, largely as

#790209