Misplaced Pages

HITAC S-3000

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license.

The HITAC S-3000 is a former family of vector supercomputers developed, manufactured and marketed by Hitachi. Announced in April 1992, the family succeeded the HITAC S-820. The S-3000 family comprised the low-end and mid-range S-3600 models and the high-end S-3800 models. Unlike Hitachi's previous generations of supercomputers, the S-3000 family was marketed outside Japan.

60-656: The S-3600 was an improved version of the S-820 implemented in more modern semiconductor technology. The S-3800 was a new design, differing significantly from the previous generations. It was a parallel vector processor and supported one to four vector processors. In 1994, the S-3000 family was complemented by an MPP machine that used superscalar microprocessors, the SR2001. Hitachi eventually discontinued development of vector supercomputers in favor of this approach. The S-3000 family

120-478: A mobile version of the original Pentium Pro due to power draw and heat concerns. At least one vendor sold a portable computer with a Pentium Pro (Imperial Computer's 6200TLP). In Intel's "Family/Model/Stepping" scheme, the Pentium Pro is family 6, model 1, and its Intel Product code is 80521. The process used to fabricate the Pentium Pro processor die and its separate cache memory die changed, leading to

180-449: A central cache. However, this far faster L2 cache did come with some complications. The Pentium Pro's "on-package cache" arrangement was unique. The processor and the cache were on separate dies in the same package and connected closely by a full-speed bus. The two or three dies had to be bonded together early in the production process, before testing was possible. This meant that a single, tiny flaw in either die made it necessary to discard

240-437: A combination of processes used in the same package: The Pentium Pro (up to 512 KB cache) is packaged in a ceramic multi-chip module (MCM). The MCM contains two underside cavities in which the microprocessor die and its companion cache die reside. The dies are bonded to a heat slug, whose exposed top helps the heat from the dies to be transferred more directly to cooling apparatus such as a heat sink. The dies are connected to

300-509: A small number of chipsets supported these slotkets, and so they did not see widespread use. The Intel 440FX chipset explicitly supported both Pentium Pro and Pentium II processors; however, the Intel 440BX and later Slot 1 chipsets did not explicitly support the Pentium Pro. Slotkets eventually saw renewed popularity in the form of Socket 370 to Slot 1 adapters, when Intel introduced Socket 370 Celeron and Pentium III processors in

360-669: A given CPU): Seymour Cray's CDC 6600 from 1964 is often mentioned as the first superscalar design. The 1967 IBM System/360 Model 91 was another superscalar mainframe. The Intel i960 CA (1989), the AMD 29000-series 29050 (1990), and the Motorola MC88110 (1991) were the first commercial single-chip superscalar microprocessors. RISC microprocessors like these were the first to have superscalar execution, because RISC architectures free transistors and die area which can be used to include multiple execution units and

420-537: A latency of three and five cycles, respectively. Division and square-root are not pipelined and are executed in separate units that share the FPU's ports. Division and square root have a latency of 18-36 and 29-69 cycles, respectively. The smallest number is for single precision (32-bit) floating-point numbers and the largest for extended precision (80-bit) numbers. Division and square root can operate simultaneously with adds and multiplies, preventing them from executing only when

480-404: A load unit, store address unit, and a store data unit. One of the integer units shares the same port as the FPU, and therefore the Pentium Pro can only dispatch one integer micro-op and one floating-point micro-op, or two integer micro-ops per cycle, in addition to micro-ops for the other three execution units. Of the two integer units, only the one that shares the path with the FPU on port 0 has
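The port-sharing constraint described here can be illustrated with a small model of a single dispatch cycle. This is only a sketch under stated assumptions: the micro-op kind names (`int_simple`, `int_complex`, and so on), the port labels for the load and store units, and the oldest-first greedy policy are hypothetical choices for illustration, not details taken from the article or from Intel documentation.

```python
# Hypothetical model of one dispatch cycle with port sharing.
# Port 0 serves the full-featured integer unit and the FPU; port 1 serves
# the simple integer unit. The labels for the three memory ports are
# placeholders, since the text does not number them.
ELIGIBLE_PORTS = {
    "int_simple": ["port1", "port0"],   # either integer unit can run it
    "int_complex": ["port0"],           # shift, multiply, divide, LEA
    "fp": ["port0"],                    # FPU shares port 0
    "load": ["load"],
    "store_addr": ["store_addr"],
    "store_data": ["store_data"],
}


def dispatch_one_cycle(ready):
    """Assign ready micro-ops (oldest first) to free ports for one cycle.

    Returns (busy, waiting): a port -> micro-op mapping and the list of
    micro-ops that found no free eligible port and must wait.
    """
    busy = {}
    waiting = []
    for uop in ready:
        for port in ELIGIBLE_PORTS[uop]:
            if port not in busy:
                busy[port] = uop
                break
        else:
            # No eligible port was free in this cycle.
            waiting.append(uop)
    return busy, waiting


busy, waiting = dispatch_one_cycle(["fp", "int_complex", "load"])
print(busy)     # {'port0': 'fp', 'load': 'load'}
print(waiting)  # ['int_complex']
```

In this model a floating-point micro-op and a complex integer micro-op compete for port 0, so only one of them is dispatched per cycle, while a simple integer micro-op can still use port 1.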

540-400: A performance boost by allowing the avoidance of costly jump and branch instructions. In, for example, CMOVxx destreg1, source_operand2, the first operand is the destination register and the second is the source register or memory location. The second operand cannot be an immediate (in-line constant) value; such a constant has to be placed in a register first. The predicate code xx can take

600-500: A single processor. Thus a multicore CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector capability. Pentium Pro The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It introduced the P6 microarchitecture (sometimes termed i686) and

660-436: A superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching each to one of the several execution units contained inside a single CPU. Therefore, a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread. Most modern superscalar CPUs also have logic to reorder

720-527: A unit of time) than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor (or a core if the processor is a multi-core processor), but an execution resource within a single CPU such as an arithmetic logic unit. While a superscalar CPU is typically also pipelined, superscalar and pipelining execution are considered different performance enhancement techniques. The former (superscalar) executes multiple instructions in parallel by using multiple execution units, whereas

780-511: A usable upgrade for quad-processor systems. These specially packaged Pentium II OverDrive processors were also used to upgrade the ASCI Red supercomputer in 1999. ASCI Red, which in 1996 had been the first computer to reach the one teraFLOPS performance mark with dual Pentium Pro processors, thereby also became the first computer overall to exceed the two teraFLOPS performance mark with

840-462: A wider 36-bit address bus, usable by Physical Address Extension (PAE), allowing it to access up to 64 GB of memory. The Pentium Pro has an 8 KB instruction cache, from which up to 16 bytes are fetched on each cycle and sent to the instruction decoders. There are three instruction decoders. The decoders are unequal in ability: only one can decode any x86 instruction, while the other two can only decode simple x86 instructions. This restricts
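The 64 GB figure follows directly from the width of the address bus. As a quick check, assuming byte-addressable memory and binary gigabytes (2^30 bytes), the arithmetic can be written out as below; the function name is just an illustrative choice.

```python
def addressable_bytes(address_bits: int) -> int:
    """Number of distinct byte addresses selectable with the given bus width."""
    return 2 ** address_bits


GIB = 2 ** 30  # one binary gigabyte

print(addressable_bytes(32) // GIB)  # 4
print(addressable_bytes(36) // GIB)  # 64
```

The four extra address lines multiply the 4 GB reachable with 32 bits by 2^4 = 16.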

900-494: Is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput (the number of instructions that can be executed in

960-470: Is an example of MLP, Memory Level Parallelism.) These properties combined to produce an L2 cache that was far faster than the motherboard-based caches of older processors. This cache alone gave the CPU an advantage in input/output performance over older x86 CPUs. In multiprocessor configurations, Pentium Pro's integrated cache greatly increased performance in comparison to architectures which had each CPU sharing

1020-415: Is considered to be minor and occurs under such special circumstances that very few, if any, software programs are affected. The Pentium Pro P6 microarchitecture was used in one form or another by Intel for more than a decade. The pipeline would scale from its initial 150 MHz start, all the way up to 1.4 GHz with the "Tualatin" Pentium III. The design's various traits would continue after that in

1080-426: Is no assurance otherwise and failure to detect a dependency would produce incorrect results. No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of execution units (e.g. ALUs), the burden of checking instruction dependencies grows rapidly, as does

1140-454: Is removed and delegated to the compiler. Explicitly parallel instruction computing (EPIC) is like VLIW with extra cache prefetching instructions. Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar processors. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. The fact that they are independent means that we know that

1200-480: Is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple execution units within each CPU thus multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as

1260-644: The ALU, integer multiplier, integer shifter, FPU, etc. There may be multiple versions of each execution unit to enable the execution of many instructions in parallel. This differs from a multi-core processor that concurrently processes instructions from multiple threads, one thread per processing unit (called "core"). It also differs from a pipelined processor, where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion. The various alternative techniques are not mutually exclusive; they can be (and frequently are) combined in

1320-524: The NexGen Nx586 and Cyrix 6x86. The Pentium Pro pipeline had extra decode stages to dynamically translate IA-32 instructions into buffered micro-operation sequences which could then be analysed, reordered, and renamed in order to detect parallelizable operations that may be issued to more than one execution unit at once. The Pentium Pro thus featured out-of-order execution, including speculative execution via register renaming. It also had

1380-602: The TOP500 list from 1997 to 2000. While the Pentium and Pentium MMX had 3.1 and 4.5 million transistors, respectively, the Pentium Pro contained 5.5 million transistors. It was capable of both dual- and quad-processor configurations and only came in one form factor, the relatively large rectangular Socket 8. The Pentium Pro was succeeded by the Pentium II Xeon in 1998. The lead architect of Pentium Pro

1440-537: The write combining features of the CPU. Memory type range registers (MTRRs) were set automatically by Windows video drivers starting from 1997, and from then on the improved cache/memory subsystem and FPU performance caused it to outclass the Pentium clock-for-clock in the emerging 3D games of the mid-to-late 1990s, particularly when using Windows NT 4.0. However, its lack of an MMX implementation reduced performance in multimedia applications that made use of those instructions. Likely Pentium Pro's most noticeable addition

1500-461: The Intel iAPX 432 and the lead architect of the i686 chip, the Pentium Pro. He was no doubt intimately familiar with all this history. The Pentium Pro was designed to include the 4-way SMP split-transaction cache-coherent bus as a mandatory feature of every chip produced. This also served to deny competition access to the socket to produce cloned processors. While the Pentium Pro was not successful as

1560-464: The Pentium Pro bus was influenced by Futurebus, the Intel iAPX 432 bus, and elements of the Intel i960 bus. Futurebus had been intended as an advanced bus to replace VMEbus used with the Motorola 68000 from the late 1970s, but it stagnated in standardization committee for more than a decade if you count all the twists and turns. Intel's iAPX 432 initiative was also a commercial failure, but in

1620-566: The Pentium Pro's P6 microarchitecture, a fully 32-bit operating system is needed, such as Windows NT, Linux, Unix, or OS/2. The performance issues on legacy code were later partly mitigated by Intel with the Pentium II. Compared to RISC microprocessors, the Pentium Pro, when introduced, slightly outperformed the fastest RISC microprocessors on integer performance when running the SPECint95 benchmark, but floating-point performance

1680-486: The Pentium Pro's ability to decode multiple instructions simultaneously, limiting superscalar execution. x86 instructions are decoded into 118-bit micro-operations (micro-ops). The micro-ops are reduced instruction set computer (RISC)-like; that is, they encode an operation, two sources, and a destination. The general decoder can generate up to four micro-ops per cycle, whereas the simple decoders can generate one micro-op each per cycle. Thus, x86 instructions that operate on

1740-402: The complexity of register renaming circuitry to mitigate some dependencies. Collectively the power consumption, complexity and gate delay costs limit the achievable superscalar speedup. However, even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus

1800-419: The degree of intrinsic parallelism in the code stream forms a second limitation. Collectively, these limits drive investigation into alternative architectural changes such as very long instruction word (VLIW), explicitly parallel instruction computing (EPIC), simultaneous multithreading (SMT), and multi-core computing. With VLIW, the burdensome task of dependency checking by hardware logic at run time

1860-617: The derivative core called "Banias" in Pentium M and Intel Core (Yonah), which itself would evolve into the Core microarchitecture (Core 2 processor) in 2006 and onward. The Pentium Pro (P6) introduced new instructions into the Intel range; the CMOVxx (‘conditional move’) instructions can move a value that is either the contents of a register or memory location into another register or not, according to some predicate logical condition xx on

1920-427: The entire assembly, which was one of the reasons for the Pentium Pro's relatively low production yield and high cost. All versions of the chip were expensive, those with 1024 KB being particularly so, since it required two 512 KB cache dies as well as the processor die. Pentium Pro clock speeds were 150, 166, 180 or 200 MHz with a 60 or 66 MHz external bus clock. A prototype 133 MHz Pentium Pro

1980-452: The flags register, xx being a flags predicate code as given in the condition for conditional jump instructions. So, for example, CMOVNE moves a specified value into a register or not depending on whether the NE (not-equal) condition is true in the flags register, i.e. Z flag = 0. This allows the evaluation of if-then-else operations and, for example, the ? : operation in C. These instructions give
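The CMOVNE behaviour described here can be modelled in a few lines to show how a conditional move replaces a branch. This is a semantic sketch only, not real x86: the helper names `compare`, `cmovne` and `select_if_not_equal` are hypothetical, and the flags register is reduced to the zero flag alone.

```python
def compare(a, b):
    """Model CMP: set the zero flag (ZF) when the two operands are equal."""
    return {"ZF": 1 if a == b else 0}


def cmovne(flags, dest_value, source_value):
    """Model CMOVNE: return the destination register's value afterwards.

    NE holds when ZF = 0; in that case the source is copied into the
    destination, otherwise the destination keeps its old value.
    """
    return source_value if flags["ZF"] == 0 else dest_value


def select_if_not_equal(a, b, if_true, if_false):
    """Branch-free equivalent of the C expression (a != b) ? if_true : if_false."""
    result = if_false                        # MOV    result, if_false
    flags = compare(a, b)                    # CMP    a, b
    result = cmovne(flags, result, if_true)  # CMOVNE result, if_true
    return result


print(select_if_not_equal(1, 2, 10, 20))  # 10
print(select_if_not_equal(3, 3, 10, 20))  # 20
```

The instruction sequence in the comments always executes the same three steps regardless of the comparison outcome, which is what lets it stand in for a jump.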

2040-406: The full complement of functions such as a barrel shifter, multiplier, divider, and support for LEA instructions. The second integer unit, which is connected to port 1, does not have these facilities and is limited to simple operations such as add, subtract, and the calculation of branch target addresses. The FPU executes floating-point operations. Addition and multiplication are pipelined and have

2100-502: The full range of values as allowed in conditional branches. A second development was the documentation of the UD2 illegal instruction. This op code is reserved and guaranteed to cause an illegal instruction exception on the P6 and all later processors. This allows developers to easily crash the current program in a future-proof fashion when a bug is detected by software. Despite being advanced for

2160-448: The instruction of one thread can be executed out of order and/or in parallel with the instruction of a different one. Also, one independent thread will not produce a pipeline bubble in the code stream of a different one, for example, due to a branch. Superscalar processors differ from multi-core processors in that the several execution units are not entire processors. A single processor is composed of finer-grained execution units such as

2220-452: The instructions to try to avoid pipeline stalls and increase parallel execution. Available performance improvement from superscalar techniques is limited by three key areas: Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of

2280-479: The late 1990s. This form of slotket allowed for lower costs for computer builders, especially with dual processor machines, and gave Slot 1 motherboards the ability to continue receiving CPU upgrades beyond the then-currently available Slot 1 CPUs. The Pentium Pro used GTL+ signaling in its front-side bus. The Pentium Pro could be used by itself on up to four-way designs. Eight-way Pentium Pro computers were also built, but these used multiple buses. The design of

2340-441: The latter (pipeline) executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases. In the "Simple superscalar pipeline" figure, fetching two instructions at the same time is superscaling, and fetching the next two before the first pair has been written back is pipelining. The superscalar technique is traditionally associated with several identifying characteristics (within

2400-405: The main system bus with the CPU, the Pentium Pro's cache had its own back-side bus (called dual independent bus by Intel). Because of this, the CPU could read main memory and cache concurrently, greatly reducing a traditional bottleneck. The cache was also "non-blocking", meaning that the processor could issue more than one cache request at a time (up to 4), reducing cache-miss penalties. (This

2460-455: The memory (e.g., add this register to this location in the memory) can only be processed by the general decoder, as this operation requires a minimum of three micro-ops. Likewise, the simple decoders are limited to instructions that can be translated into one micro-op. Instructions that require more micro-ops than four are translated with the assistance of a sequencer, which generates the required micro-ops over multiple clock cycles. The Pentium Pro
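The decoder restrictions described here (one general decoder producing up to four micro-ops, two simple decoders producing one each, and a sequencer for longer instructions) can be sketched as a small cycle counter. This is an illustrative model only: it assumes strictly in-order decoding, and it assumes the sequencer emits four micro-ops per cycle while blocking the decoders, a rate the text does not specify. The function name is hypothetical.

```python
def decode_cycles(uop_counts):
    """Count decode cycles for instructions listed in program order.

    Each entry is the number of micro-ops one x86 instruction needs.
    """
    cycles = 0
    i = 0
    n = len(uop_counts)
    while i < n:
        cycles += 1
        first = uop_counts[i]
        i += 1
        if first > 4:
            # Sequencer path (assumed: 4 micro-ops per cycle, nothing
            # else decodes meanwhile); total is ceil(first / 4) cycles.
            cycles += (first + 3) // 4 - 1
            continue
        # The general decoder handled the first instruction. The two
        # simple decoders can take following one-micro-op instructions.
        for _ in range(2):
            if i < n and uop_counts[i] == 1:
                i += 1
            else:
                break
    return cycles


print(decode_cycles([1, 1, 1]))  # 1
print(decode_cycles([1, 3, 1]))  # 2
print(decode_cycles([3, 3, 3]))  # 3
```

Under these assumptions, a run of register-to-memory instructions (three micro-ops each) decodes at one instruction per cycle, while simple instructions can decode three at a time.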

2520-463: The more rigid methods used in the simpler P5 Pentium ; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86 . The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy

2580-753: The number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a later design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods. In

2640-468: The other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units. Although the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there
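The dependency check on the two instruction pairs above can be sketched as follows. This is a simplified model, assuming each instruction writes one destination and reads a set of sources; the names `Instr`, `instr`, `hazards` and `can_issue_together` are hypothetical, and the check is deliberately conservative in treating every hazard as blocking, whereas real hardware can remove some of them through register renaming.

```python
from typing import FrozenSet, NamedTuple, Set


class Instr(NamedTuple):
    dest: str                 # variable or register written
    sources: FrozenSet[str]   # variables or registers read


def instr(dest: str, *sources: str) -> Instr:
    return Instr(dest, frozenset(sources))


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the hazards that arise if `second` runs alongside `first`."""
    found = set()
    if first.dest in second.sources:
        found.add("RAW")  # second reads what first writes
    if second.dest in first.sources:
        found.add("WAR")  # second overwrites what first still reads
    if first.dest == second.dest:
        found.add("WAW")  # both write the same location
    return found


def can_issue_together(first: Instr, second: Instr) -> bool:
    # Conservative rule: any hazard at all prevents parallel issue.
    return not hazards(first, second)


# a = b + c; d = e + f  -> independent
print(hazards(instr("a", "b", "c"), instr("d", "e", "f")))  # set()
# a = b + c; b = e + f  -> second writes b, which first reads
print(hazards(instr("a", "b", "c"), instr("b", "e", "f")))  # {'WAR'}
```

The second pair shows why the outcome depends on completion order: if `b = e + f` finished before `a = b + c` had read `b`, the first instruction would see the wrong value.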

2700-474: The package using conventional wire bonding. The cavities are capped with a ceramic plate. The Pentium Pro with 1 MB of cache uses a plastic MCM. Instead of two cavities, there is only one, in which the three dies reside, bonded to the package instead of a heat slug. The cavities are filled in with epoxy. The MCM has 387 pins, of which approximately half are arranged in a pin grid array (PGA) and half in an interstitial pin grid array (IPGA). The packaging

2760-406: The process they did learn how to build a split-transaction bus to support a cacheless multiprocessor system. The i960 had further developed the split-transaction iAPX 432 bus to include a cache coherency protocol, ending up with a feature set highly reminiscent of the original Futurebus ambitions. The lead architect of i960 was superscalarity specialist Fred Pollack who was also the lead engineer of

2820-500: The result has to be stored in the ROB. After the microprocessor was released, a bug was discovered in the floating point unit, commonly called the "Pentium Pro and Pentium II FPU bug" and by Intel as the "flag erratum". The bug occurs under some circumstances during floating point-to-integer conversion when the floating point number will not fit into the smaller integer format, causing the FPU to deviate from its documented behaviour. The bug

2880-418: The time of the Pentium Pro's release were 16-bit DOS, and mixed 16/32-bit Windows 3.1x and Windows 95 (although the latter requires a 32-bit 80386 CPU as a minimum, much of its code is still 16-bit for performance reasons, such as the 16-bit Windows USER dynamic link library, user.exe). This, along with the high cost of Pentium Pro systems, led to tepid sales among PC buyers at the time. To fully use

2940-428: The time, the Pentium Pro's out-of-order register renaming architecture had trouble running 16-bit code and mixed code (8-bit with 16-bit (8/16), or 16-bit with 32-bit (16/32)), as using partial registers causes frequent pipeline flushing. Specific use of partial registers was then a common performance optimization, as it incurred no performance penalty on pre-P6 Intel processors; also, the dominant operating systems at

3000-477: The traditional uniformity of the instruction set favors superscalar dispatch (this was why RISC designs were faster than CISC designs through the 1980s and into the 1990s, and it is far more complicated to do multiple dispatch when instructions have variable bit length). Except for CPUs used in low-power applications, embedded systems, and battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The P5 Pentium

3060-588: The upgrade to dual Pentium II OverDrive processors in 1999. ASCI Red continued to use dual Pentium II OverDrive processors for the remainder of its lifespan before being decommissioned in 2006. As Slot 1 motherboards became prevalent, several manufacturers released slotket (or slocket) adapters, such as the Tyan M2020, Asus C-P6S1, Tekram P6SL1, and the Abit KP6. These sockets allowed Pentium Pro processors to be used with Slot 1 motherboards. However, only

3120-450: Was Fred Pollack, who specialized in superscalarity and had also worked as the lead engineer of the Intel iAPX 432. The Pentium Pro incorporated a new microarchitecture, different from the Pentium's P5 microarchitecture. It has a decoupled, 14-stage superpipelined architecture which used an instruction pool. The Pentium Pro (P6) implemented many radical architectural differences mirroring other contemporary x86 designs such as

3180-562: Was designed for Socket 8. In 1998, the 300/333 MHz Pentium II OverDrive processor for Socket 8 was released. Based on some of the technology used in the Deschutes Pentium II Xeon, it featured double L1 and 512 KB of full-speed L2 cache with MMX capabilities, and was produced by Intel as a drop-in upgrade option for owners of Pentium Pro systems. However, it only supported two-way glueless multiprocessing, not four-way or higher, which did not make it

3240-488: Was built in the earliest stages of development but was never released. Some users chose to overclock their Pentium Pro chips, with the 200 MHz version often being run at 233 MHz, the 180 MHz version often being run at 200 MHz, and the 150 MHz version often being run at 166 MHz. The chip was popular in symmetric multiprocessing configurations, with dual and quad SMP server and workstation setups being commonplace. Intel decided against providing

3300-468: Was its on-package L2 cache , which ranged from 256 KB at introduction to 1 MB in 1997. At the time, manufacturing technology did not feasibly allow a large L2 cache to be integrated into the processor core. Intel instead placed the L2 die(s) separately in the package which still allowed it to run at the same clock speed as the CPU core. Additionally, unlike most motherboard-based cache schemes that shared

3360-406: Was originally intended to replace the original Pentium in a full range of applications. Later, it was reduced to a narrower role as a server and high-end desktop processor. The Pentium Pro was also used in supercomputers, most notably ASCI Red, which used two Pentium Pro CPUs on each computing node and was the first computer to reach over one teraFLOPS in 1996, holding the number one spot in

3420-549: Was replaced in 2000 by the SR8000, making it the last vector supercomputer from Hitachi. The CPU architecture of HITACHI S-3800 Series was based on IBM System/370, and compatible with Hitachi's mainframe systems. It supported two operating systems: OSF/1 Unix and Hitachi's own VOS3 (a fork of IBM MVS). This supercomputer-related article is a stub. You can help Misplaced Pages by expanding it. Superscalar A superscalar processor (or multiple-issue processor)

3480-913: Was significantly lower, half that of some RISC microprocessors. The Pentium Pro's integer performance lead disappeared rapidly, first overtaken by the MIPS Technologies R10000 in January 1996, and then by Digital Equipment Corporation's EV56 variant of the Alpha 21164. Reviewers quickly noted the very slow writes to video memory as the weak spot of the P6 platform, with performance here being as low as 10% of an identically clocked Pentium system in benchmarks such as VIDSPEED. Methods to circumvent this included setting VESA drawing to system memory instead of video memory in games such as Quake, and later on utilities such as FASTVID emerged, which could double performance in certain games by enabling

3540-448: Was the first processor in the x86 family to support upgradeable microcode under BIOS and/or operating system (OS) control. Micro-ops exit the re-order buffer (ROB) and enter a reservation station (RS), where they await dispatch to the execution units. In each clock cycle, up to five micro-ops can be dispatched to five execution units. The Pentium Pro has a total of six execution units: two integer units, one floating-point unit (FPU),

3600-459: Was the first superscalar x86 processor; the Nx586, P6 Pentium Pro and AMD K5 were among the first designs which decode x86 instructions asynchronously into dynamic microcode-like micro-op sequences prior to actual execution on a superscalar microarchitecture; this opened up for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to