HAL SPARC64 - Misplaced Pages

SPARC64 is a microprocessor developed by HAL Computer Systems and fabricated by Fujitsu. It was the first microprocessor to implement the SPARC V9 instruction set architecture (ISA). SPARC64 was HAL's first microprocessor and the first in the SPARC64 brand. It operates at 101 and 118 MHz. The SPARC64 was used exclusively by Fujitsu in their systems; the first systems, the Fujitsu HALstation Model 330 and Model 350 workstations, were formally announced in September 1995 and introduced in October 1995, two years late. It was succeeded by the SPARC64 II (previously known as the SPARC64+) in 1996.

37-472: The SPARC64 is a superscalar microprocessor that issues four instructions per cycle and executes them out of order. It is a multichip design, consisting of seven dies: a CPU die, MMU die, four CACHE dies and a CLOCK die. The CPU die contains the majority of logic, all of the execution units and a level 0 (L0) instruction cache. The execution units consist of two integer units, address units, floating-point units (FPUs) and memory units. The FPU hardware consists of

74-472: A fused multiply-add (FMA) unit and a divide unit. However, the FMA instructions are truly fused (that is, performed with a single rounding) only as of the SPARC64 VI. The FMA unit is pipelined and has a four-cycle latency and a one-cycle throughput. The divide unit is not pipelined and has significantly longer latencies. The L0 instruction cache has a capacity of 4 KB, is direct-mapped and has a one-cycle latency. The CPU die
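As a rough illustration of what a four-cycle latency with one-cycle throughput implies, the sketch below counts cycles for a run of FMA operations under an idealized model. The function names are hypothetical, and the model assumes that independent operations issue back to back with no other stalls; it is not taken from HAL documentation.

```python
def pipelined_cycles(n_ops, latency=4, issue_interval=1):
    """Cycles to finish n independent operations on a pipelined unit.

    Assumption: a new operation can start every `issue_interval` cycles,
    and each result appears `latency` cycles after its operation starts.
    """
    if n_ops == 0:
        return 0
    return latency + (n_ops - 1) * issue_interval


def dependent_chain_cycles(n_ops, latency=4):
    """Cycles when every operation needs the previous result.

    The pipeline cannot overlap dependent operations, so each one
    pays the full latency.
    """
    return n_ops * latency


independent = pipelined_cycles(8)       # 4 + 7 * 1 = 11 cycles
chained = dependent_chain_cycles(8)     # 8 * 4 = 32 cycles
print(independent, chained)
```

Under these assumptions, eight independent operations finish in 11 cycles while a fully dependent chain of eight takes 32, which is why pipelining pays off only when the instruction stream supplies independent work.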

111-402: A capacity of 128 KB. The latency for both caches is three cycles, and the caches are four-way set associative. The data cache is protected by error correcting code (ECC) and parity. It uses a 128-byte line size. Each CACHE die implements 64 KB of the cache and a portion of the cache tags. The cache die contains 4.3 million transistors, has dimensions of 14.0 mm by 10.11 mm for
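The figures quoted here (128 KB capacity, four-way set associativity, 128-byte lines) determine the number of sets in the cache. The following is a minimal sketch of that arithmetic; the function name is hypothetical, and it assumes power-of-two sizes and that the 128-byte line size applies to the cache whose capacity is used.

```python
def cache_geometry(capacity_bytes, ways, line_bytes):
    """Derive line count, set count and address field widths.

    Assumes capacity, associativity and line size are powers of two.
    """
    lines = capacity_bytes // line_bytes
    sets = lines // ways
    offset_bits = line_bytes.bit_length() - 1   # log2(line size)
    index_bits = sets.bit_length() - 1          # log2(number of sets)
    return {
        "lines": lines,
        "sets": sets,
        "offset_bits": offset_bits,
        "index_bits": index_bits,
    }


geom = cache_geometry(128 * 1024, ways=4, line_bytes=128)
print(geom)
```

With these inputs the cache holds 1,024 lines arranged as 256 sets of four ways, so an address would use 7 offset bits and 8 index bits under the stated assumptions.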

148-655: A dedicated 128-bit data bus that operates at the same clock frequency as the microprocessor or at half of it. The L2 cache is inclusive, that is, it is a superset of the L1 caches. Both the L1 and L2 caches have their data protected by ECC and their tags by parity. The SPARC64 II's proprietary system interface was replaced by one compatible with the Ultra Port Architecture. This enabled the SPARC64 III to use chipsets from Sun Microelectronics. The system bus operates at half,

185-516: A derivative of Solaris developed by HAL that supported the SPARC64. The L1 caches were halved in capacity to 64 KB from 128 KB to reduce die area (the reason why only two of the four CACHE dies were integrated from the SPARC64 II). The associated performance loss was mitigated by the provision of a large external L2 cache with a capacity of 1 to 16 MB. The L2 cache is accessed with

222-400: A die area of 142 mm². It has 1,854 solder bumps, of which 446 are signals and 1,408 are power. The SPARC64 consisted of 21.9 million transistors. It was fabricated by Fujitsu in their CS-55 process, a 0.40 μm, four-layer metal complementary metal–oxide–semiconductor (CMOS) process. The seven dies are packaged in a rectangular ceramic multi-chip module (MCM), connected to

259-669: A given CPU): Seymour Cray's CDC 6600 from 1964 is often mentioned as the first superscalar design. The 1967 IBM System/360 Model 91 was another superscalar mainframe. The Intel i960CA (1989), the AMD 29000-series 29050 (1990), and the Motorola MC88110 (1991) were the first commercial single-chip superscalar microprocessors. RISC microprocessors like these were the first to have superscalar execution, because RISC architectures free transistors and die area which can be used to include multiple execution units and

296-492: A second FPU which could execute add and subtract instructions. An FPU with less functionality was added instead of a duplicate of the first to save die area; the second FPU is half the size of the first. It has a three-cycle latency for all instructions. The complex SPARC64 II memory management unit (MMU) was replaced with a simpler one that is compatible with the Solaris operating system. Previously, SPARC64 systems ran SPARC64/OS,

333-436: A superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching each to one of the several execution units contained inside a single CPU. Therefore, a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread. Most modern superscalar CPUs also have logic to reorder

370-481: A third, quarter or fifth the frequency of the microprocessor, up to a maximum of 150 MHz. It contained 17.6 million transistors, of which 6 million are for logic and 11.6 million are contained in the caches and TLBs. The die has an area of 210 mm². It was fabricated by Fujitsu in their CS-70 process, a 0.24 μm, five-layer metal CMOS process. It is packaged in a 957-pad flip-chip land grid array (LGA) package with dimensions of 42.5 mm by 42.5 mm. Of
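The system bus divisors described for this processor (half, a third, a quarter or a fifth of the core clock, capped at 150 MHz) can be tabulated with a short sketch. The function name is hypothetical, and applying the divisors to the 225, 250 and 275 MHz clock grades is an assumption made for illustration; which ratio a given system actually used is not stated here.

```python
MAX_BUS_MHZ = 150.0        # upper limit quoted for the system bus
DIVISORS = (2, 3, 4, 5)    # half, third, quarter, fifth of the core clock


def bus_options(cpu_mhz):
    """Return {divisor: bus_mhz} for every divisor within the 150 MHz cap."""
    return {d: cpu_mhz / d for d in DIVISORS if cpu_mhz / d <= MAX_BUS_MHZ}


for cpu in (225, 250, 275):
    print(cpu, bus_options(cpu))
```

At these three clock grades every divisor stays under the cap (the fastest case, 275 MHz divided by two, gives 137.5 MHz), so the limit would only exclude the half-rate setting for core clocks above 300 MHz.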

407-527: A unit of time) than would otherwise be possible at a given clock rate. Each execution unit is not a separate processor (or a core if the processor is a multi-core processor), but an execution resource within a single CPU such as an arithmetic logic unit. While a superscalar CPU is typically also pipelined, superscalar and pipelining execution are considered different performance enhancement techniques. The former (superscalar) executes multiple instructions in parallel by using multiple execution units, whereas

444-494: Is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput (the number of instructions that can be executed in

481-610: Is connected to the CACHE and MMU dies by ten 64-bit buses. Four address buses carrying virtual addresses lead out to each cache die. Two data buses write data from the register file to the two CACHE dies that implement the data cache. Four buses, one from each CACHE die, deliver data or instructions to the CPU. The CPU die contains 2.7 million transistors, has dimensions of 17.53 mm by 16.92 mm for an area of 297 mm² and has 817 signal bumps and 1,695 power bumps. The MMU die contains

518-426: Is no assurance otherwise and failure to detect a dependency would produce incorrect results. No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of execution units (e.g. ALUs), the burden of checking instruction dependencies grows rapidly, as does

555-454: Is removed and delegated to the compiler. Explicitly parallel instruction computing (EPIC) is like VLIW with extra cache prefetching instructions. Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar processors. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. The fact that they are independent means that we know that

592-480: Is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple execution units within each CPU thus multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as

629-644: The ALU, integer multiplier, integer shifter, FPU, etc. There may be multiple versions of each execution unit to enable the execution of many instructions in parallel. This differs from a multi-core processor that concurrently processes instructions from multiple threads, one thread per processing unit (called "core"). It also differs from a pipelined processor, where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion. The various alternative techniques are not mutually exclusive—they can be (and frequently are) combined in

666-480: The memory management unit, cache controller and the external interfaces. The SPARC64 has separate interfaces for memory and input/output (I/O). The bus used to access the memory is 128 bits wide. The system interface is the HAL I/O (HIO) bus, a 64-bit asynchronous bus. The MMU has a die area of 163 mm². Four dies implement the level 1 (L1) instruction and data caches, which require two dies each. Both caches have

703-518: The 957 pads, 552 are for signals and 405 are for power and ground. The internal voltage is 2.5 V and the I/O voltage is 3.3 V. Peak power consumption is 60 W at 275 MHz. The Ultra Port Architecture (UPA) signals are compatible with 3.3 V low-voltage transistor-transistor logic (LVTTL) levels, with the exception of the differential clock signals, which are compatible with 3.3 V pseudo emitter-coupled logic (PECL) levels. The second and third SPARC64 GPs are fourth-generation SPARC64 microprocessors. The second SPARC64 GP

740-402: The complexity of register renaming circuitry to mitigate some dependencies. Collectively the power consumption, complexity and gate delay costs limit the achievable superscalar speedup. However, even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus

777-419: The degree of intrinsic parallelism in the code stream forms a second limitation. Collectively, these limits drive investigation into alternative architectural changes such as very long instruction word (VLIW), explicitly parallel instruction computing (EPIC), simultaneous multithreading (SMT), and multi-core computing. With VLIW, the burdensome task of dependency checking by hardware logic at run time

814-576: The dies, with the CPU die measuring 202 mm², the MMU die 103 mm², and the CACHE die 84 mm². The SPARC64 GP is a series of related microprocessors developed by HAL and Fujitsu used in the Fujitsu GP7000F and PrimePower servers. The first SPARC64 GP was a further development of the SPARC64 II. It was a third-generation SPARC64 microprocessor and was known as the SPARC64 III before it

851-448: The instruction of one thread can be executed out of order and/or in parallel with the instruction of a different one. Also, one independent thread will not produce a pipeline bubble in the code stream of a different one, for example, due to a branch. Superscalar processors differ from multi-core processors in that the several execution units are not entire processors. A single processor is composed of finer-grained execution units such as

888-452: The instructions to try to avoid pipeline stalls and increase parallel execution. Available performance improvement from superscalar techniques is limited by three key areas: Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of

925-441: The latter (pipeline) executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases. In the "Simple superscalar pipeline" figure, fetching two instructions at the same time is superscaling, and fetching the next two before the first pair has been written back is pipelining. The superscalar technique is traditionally associated with several identifying characteristics (within

962-463: The more rigid methods used in the simpler P5 Pentium; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86. The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy

999-753: The number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU, a later design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle. But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined, multiprocessor or multi-core architectures also achieve that, but with different methods. In

1036-468: The other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units. Although the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there
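The two instruction pairs in this passage can be checked mechanically. The sketch below is a simplified model of the pairwise dependency check a dispatcher has to perform; the `Instr` type and `hazards` function are hypothetical names introduced for illustration, and real hardware compares register numbers rather than variable names.

```python
from typing import NamedTuple, Set, Tuple


class Instr(NamedTuple):
    dest: str               # name written by the instruction
    srcs: Tuple[str, ...]   # names read by the instruction


def hazards(first: Instr, second: Instr) -> Set[str]:
    """Return the dependencies that stop `second` running alongside `first`."""
    found = set()
    if first.dest in second.srcs:
        found.add("RAW")    # second reads what first writes
    if second.dest in first.srcs:
        found.add("WAR")    # second overwrites what first still reads
    if first.dest == second.dest:
        found.add("WAW")    # both write the same destination
    return found


# a = b + c; d = e + f  -> no shared names, safe to run in parallel
independent = hazards(Instr("a", ("b", "c")), Instr("d", ("e", "f")))

# a = b + c; b = e + f  -> the second overwrites b, which the first reads
conflicting = hazards(Instr("a", ("b", "c")), Instr("b", ("e", "f")))

print(independent, conflicting)
```

The first pair reports no hazards, while the second reports a write-after-read conflict on `b`: the result is only correct if the first instruction reads `b` before the second one overwrites it, which matches the text's remark that it depends on completion order.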

1073-446: The second in terms of microarchitecture. It operated at 600 to 810 MHz. First versions were introduced in 2001. The 700, 788 and 810 MHz versions were introduced on 17 July 2002. It was fabricated by Fujitsu in their 0.15 μm CS85 process with six levels of copper interconnect. It used a 1.5 V internal power supply and a 1.8 or 2.5 V power supply for I/O. Superscalar A superscalar processor (or multiple-issue processor)

1110-477: The traditional uniformity of the instruction set favors superscalar dispatch (this was why RISC designs were faster than CISC designs through the 1980s and into the 1990s, and it's far more complicated to do multiple dispatch when instructions have variable bit length). Except for CPUs used in low-power applications, embedded systems, and battery-powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The P5 Pentium

1147-672: The underside of the MCM with solder bumps. The MCM has 565 pins, of which 286 are signal pins and 218 are power pins, organized as a pin grid array (PGA). The MCM has wide buses which connect the seven dies. The SPARC64 II (SPARC64+) was a further development of the SPARC64. It is a second-generation SPARC64 microprocessor. It operated at 141 and 161 MHz. It was used by Fujitsu in their HALstation Model 375 (141 MHz) and Model 385 (161 MHz) workstations, which were introduced in November 1996 and December 1996, respectively. The SPARC64 II

1184-560: Was a further development of the first and it operated at 400 to 563 MHz. The first versions, operating at 400 and 450 MHz, were introduced on 1 August 2000. It had larger L1 instruction and data caches, doubled in capacity to 128 KB each; better branch prediction as the result of a larger BHT consisting of 16,384 entries; support for the Visual Instruction Set (VIS); and an L2 cache built from double data rate (DDR) SRAM. It contained 30 million transistors and

1221-404: Was fabricated by Fujitsu in their CS80 process, a 0.18 μm CMOS process with six levels of copper interconnect. It used a 1.8 V internal power supply and a 2.5 or 3.3 V power supply for I/O. It was packaged in a 1,206-contact ball grid array (BGA) measuring 37.5 mm by 37.5 mm. Of the 1,206 contacts, 552 are signals and 405 are power or ground. The third SPARC64 GP was identical to

1258-488: Was introduced in March 1999. It was a single-die implementation of the SPARC64 II that integrated, with modifications, the CPU die and two of the four CACHE dies. Numerous modifications and improvements were made to the microarchitecture, such as the replacement of the MMU and a new system interface using the Ultra Port Architecture. It had improved branch prediction, an extra pipeline stage to improve clock frequencies and

1295-582: Was introduced. The SPARC64 GP operated at clock frequencies of 225, 250 and 275 MHz. It was the first microprocessor from HAL to support multiprocessing. The main competitors were the HP PA-8500, IBM POWER3 and Sun UltraSPARC II. The SPARC64 GP was taped out in July 1997. It was announced on 11 April 1998, and the 225 and 250 MHz versions were introduced in December 1998. A 275 MHz version

1332-401: Was succeeded by the SPARC64 III in 1998. The SPARC64 II has higher performance due to higher clock frequencies enabled by the new process and circuit tweaks; and a higher instructions per cycle (IPC) count due to the following microarchitecture improvements: It was fabricated by Fujitsu in their CS-60 process, a 0.35 μm, five-layer metal CMOS process. The new process reduced the area of

1369-459: Was the first superscalar x86 processor; the Nx586, P6 Pentium Pro and AMD K5 were among the first designs which decode x86 instructions asynchronously into dynamic microcode-like micro-op sequences prior to actual execution on a superscalar microarchitecture; this opened the way for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to
