A superscalar processor (or multiple-issue processor ) is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor , which can execute at most one single instruction per clock cycle, a superscalar processor can execute more than one instruction during a clock cycle by simultaneously dispatching multiple instructions to different execution units on the processor. It therefore allows more throughput (the number of instructions that can be executed in a unit of time) than would otherwise be possible at a given clock rate . Each execution unit is not a separate processor (or a core if the processor is a multi-core processor ), but an execution resource within a single CPU such as an arithmetic logic unit .
53-616: POWER7 is a family of superscalar multi-core microprocessors based on the Power ISA 2.06 instruction set architecture released in 2010 that succeeded the POWER6 and POWER6+ . POWER7 was developed by IBM at several sites including IBM's Rochester, MN ; Austin, TX; Essex Junction, VT ; T. J. Watson Research Center , NY; Bromont, QC and IBM Deutschland Research & Development GmbH, Böblingen , Germany laboratories. IBM announced servers based on POWER7 on 8 February 2010. IBM won
106-635: A $ 244 million DARPA contract in November 2006 to develop a petascale supercomputer architecture before the end of 2010 in the HPCS project. The contract also states that the architecture shall be available commercially. IBM's proposal, PERCS (Productive, Easy-to-use, Reliable Computer System), which won them the contract, is based on the POWER7 processor, AIX operating system and General Parallel File System . One feature that IBM and DARPA collaborated on
159-477: A finite digit binary fraction. For example, decimal 0.1 cannot be represented in binary exactly, only approximated. Therefore: Since IEEE 754 binary32 format requires real values to be represented in ( 1. x 1 x 2 . . . x 23 ) 2 × 2 e {\displaystyle (1.x_{1}x_{2}...x_{23})_{2}\times 2^{e}} format (see Normalized number , Denormalized number ), 1100.011
212-477: A given 32-bit binary32 data with a given sign , biased exponent e (the 8-bit unsigned integer), and a 23-bit fraction is which yields In this example: thus: Note: The single-precision binary floating-point exponent is encoded using an offset-binary representation, with the zero offset being 127; also known as exponent bias in the IEEE 754 standard. Thus, in order to get the true exponent as defined by
265-669: A given CPU): Seymour Cray 's CDC 6600 from 1964 is often mentioned as the first superscalar design. The 1967 IBM System/360 Model 91 was another superscalar mainframe. The Intel i960 CA (1989), the AMD 29000 -series 29050 (1990), and the Motorola MC88110 (1991), microprocessors were the first commercial single-chip superscalar microprocessors. RISC microprocessors like these were the first to have superscalar execution, because RISC architectures free transistors and die area which can be used to include multiple execution units and
318-586: A set of queues. Up to eight instructions per cycle can be issued to the Instruction Execution units. This gives the following theoretical single precision (SP) performance figures (based on a 4.14 GHz 8 core implementation): 4 64-bit SIMD units per core, and a 128-bit SIMD VMX unit per core, can do 12 Multiply-Adds per cycle, giving 24 SP FP ops per cycle. At 4.14 GHz, that gives 4.14 billion * 24 = 99.36 SP GFLOPS, and at 8 cores, 794.88 SP GFLOPS. Peak double precision (DP) performance
371-422: A single processor. Thus a multicore CPU is possible where each core is an independent processor containing multiple parallel pipelines, each pipeline being superscalar. Some processors also include vector capability. Single precision Single-precision floating-point format (sometimes called FP32 or float32 ) is a computer number format , usually occupying 32 bits in computer memory ; it represents
424-436: A superscalar CPU the dispatcher reads instructions from memory and decides which ones can be run in parallel, dispatching each to one of the several execution units contained inside a single CPU. Therefore, a superscalar processor can be envisioned as having multiple parallel pipelines, each of which is processing instructions simultaneously from a single instruction thread. Most modern superscalar CPUs also have logic to reorder
477-493: A value of 0.375. We saw that 0.375 = ( 0.011 ) 2 = ( 1.1 ) 2 × 2 − 2 {\displaystyle 0.375={(0.011)_{2}}={(1.1)_{2}}\times 2^{-2}} Hence after determining a representation of 0.375 as ( 1.1 ) 2 × 2 − 2 {\displaystyle {(1.1)_{2}}\times 2^{-2}} we can proceed as above: From these we can form
530-421: A value of 127 represents the actual exponent zero. Exponents range from −126 to +127 (thus 1 to 254 in the exponent field), because the biased exponent values 0 (all 0s) and 255 (all 1s) are reserved for special numbers ( subnormal numbers , signed zeros , infinities , and NaNs ). The true significand of normal numbers includes 23 fraction bits to the right of the binary point and an implicit leading bit (to
583-451: A whole number −149 ≤ n ≤ 127, can be converted exactly into an IEEE 754 single-precision floating-point value. In the IEEE 754 standard , the 32-bit base-2 format is officially referred to as binary32 ; it was called single in IEEE 754-1985 . IEEE 754 specifies additional floating-point types, such as 64-bit base-2 double precision and, more recently, base-10 representations. One of
SECTION 10
#1732773159134636-484: A wide dynamic range of numeric values by using a floating radix point . A floating-point variable can represent a wider range of numbers than a fixed-point variable of the same bit width at the cost of precision. A signed 32-bit integer variable has a maximum value of 2 − 1 = 2,147,483,647, whereas an IEEE 754 32-bit base-2 floating-point variable has a maximum value of (2 − 2 ) × 2 ≈ 3.4028235 × 10 . All integers with seven or fewer decimal digits, and any 2 for
689-438: Is 2 − 149 ≈ 1.4 × 10 − 45 {\displaystyle 2^{-149}\approx 1.4\times 10^{-45}} . In general, refer to the IEEE 754 standard itself for the strict conversion (including the rounding behaviour) of a real number into its equivalent binary32 format. Here we can show how to convert a base-10 real number into an IEEE 754 binary32 format using
742-411: Is capable of four-way simultaneous multithreading (SMT). The POWER7 has approximately 1.2 billion transistors and is 567 mm large fabricated on a 45 nm process. A notable difference from POWER6 is that the POWER7 executes instructions out-of-order instead of in-order. Despite the decrease in maximum frequency compared to POWER6 (4.25 GHz vs 5.0 GHz), each core has higher performance than
795-468: Is important for workloads which require the fastest sequential performance at the cost of reduced parallel performance. TurboCore mode can reduce "software costs in half for those applications that are licensed per core, while increasing per core performance from that software." The new IBM Power 780 scalable, high-end servers featuring the new TurboCore workload optimizing mode and delivering up to double performance per core of POWER6 based systems. Each core
848-706: Is indicative of major architectural differences between the two chips / mainboards / memory systems etc.: they were designed with different workloads in mind. However, overall, in a very broad sense, one can say that the floating-point performance of the POWER7 is similar to that of the Haswell ;i7. IBM introduced the POWER7+ processor at the Hot Chips 24 conference in August 2012. It is an updated version with higher speeds, more cache and integrated accelerators. It
901-650: Is manufactured on a 32 nm fabrication process. The first boxes to ship with the POWER7+ processors were IBM Power 770 and 780 servers. The chips have up to 80 MB of L3 cache (10 MB/core), improved clock speeds (up to 4.4 GHz) and 20 LPARs per core. As of October 2011, the range of POWER7-based systems including IBM Power Systems "Express" models (710, 720, 730, 740 and 750), Enterprise models (770, 780 and 795) and High Performance computing models (755 and 775). Enterprise models differ in having Capacity on Demand capabilities. Maximum specifications are shown in
954-440: Is modifying the addressing and page table hardware to support global shared memory space for POWER7 clusters. This enables research scientists to program a cluster as if it were a single system, without using message passing. From a productivity standpoint, this is essential since some scientists are not conversant with MPI or other parallel programming techniques used in clusters. The POWER7 superscalar multi-core architecture
1007-426: Is no assurance otherwise and failure to detect a dependency would produce incorrect results. No matter how advanced the semiconductor process or how fast the switching speed, this places a practical limit on how many instructions can be simultaneously dispatched. While process advances will allow ever greater numbers of execution units (e.g. ALUs), the burden of checking instruction dependencies grows rapidly, as does
1060-454: Is removed and delegated to the compiler . Explicitly parallel instruction computing (EPIC) is like VLIW with extra cache prefetching instructions. Simultaneous multithreading (SMT) is a technique for improving the overall efficiency of superscalar processors. SMT permits multiple independent threads of execution to better utilize the resources provided by modern processor architectures. The fact that they are independent means that we know that
1113-441: Is roughly half of peak SP performance. For comparison, Intel's 2013 Haswell architecture CPUs can do 16 DP FLOPs or 32 SP FLOPs per cycle (8/16 DP/SP fused multiply-add spread across 2× 256-bit AVX2 FP vector units). At 3.4 GHz (i7-4770) this translates into 108.8 SP GFLOPS per core and 435.2 SP GFLOPS peak performance across the 4-core chip, giving roughly similar levels of performance per core, without taking into account
SECTION 20
#17327731591341166-438: Is shifted to the right by 3 digits to become ( 1.100011 ) 2 × 2 3 {\displaystyle (1.100011)_{2}\times 2^{3}} Finally we can see that: ( 12.375 ) 10 = ( 1.100011 ) 2 × 2 3 {\displaystyle (12.375)_{10}=(1.100011)_{2}\times 2^{3}} From which we deduce: From these we can form
1219-709: Is termed REAL in Fortran ; SINGLE-FLOAT in Common Lisp ; float in C , C++ , C# and Java ; Float in Haskell and Swift ; and Single in Object Pascal ( Delphi ), Visual Basic , and MATLAB . However, float in Python , Ruby , PHP , and OCaml and single in versions of Octave before 3.2 refer to double-precision numbers. In most implementations of PostScript , and some embedded systems ,
1272-480: Is the difference between scalar and vector arithmetic. A superscalar processor is a mixture of the two. Each instruction processes one data item, but there are multiple execution units within each CPU thus multiple instructions can be processing separate data items concurrently. Superscalar CPU design emphasizes improving the instruction dispatcher accuracy and allowing it to keep the multiple execution units in use at all times. This has become increasingly important as
1325-426: Is the sign bit, x is the exponent, and m is the significand. These examples are given in bit representation , in hexadecimal and binary , of the floating-point value. This includes the sign, (biased) exponent, and significand. By default, 1/3 rounds up, instead of down like double precision , because of the even number of bits in the significand. The bits of 1/3 beyond the rounding point are 1010... which
1378-644: The ALU , integer multiplier , integer shifter, FPU , etc. There may be multiple versions of each execution unit to enable the execution of many instructions in parallel. This differs from a multi-core processor that concurrently processes instructions from multiple threads, one thread per processing unit (called "core"). It also differs from a pipelined processor , where the multiple instructions can concurrently be in various stages of execution, assembly-line fashion. The various alternative techniques are not mutually exclusive—they can be (and frequently are) combined in
1431-513: The IBM POWER 7 processor has up to eight cores, and four threads per core, for a total capacity of 32 simultaneous threads. IBM stated at ISCA 29 that peak performance was achieved by high frequency designs with 10–20 FO4 delays per pipeline stage at the cost of power efficiency. However, the POWER6 binary floating-point unit achieves a "6-cycle, 13- FO4 pipeline". Therefore, the pipeline for
1484-434: The POWER6, while each processor has up to 4 times the number of cores. POWER7 has these specifications: The technical specification further specifies: Each POWER7 processor core implements aggressive out-of-order (OoO) instruction execution to drive high efficiency in the use of available execution paths. The POWER7 processor has an Instruction Sequence Unit that is capable of dispatching up to six instructions per cycle to
1537-605: The POWER7 CPU has been changed again, just as it was for the POWER5 and POWER6 designs. In some respects, this rework is similar to Intel's turn in 2005 that left the P4 7th-generation x86 microarchitecture. The POWER7 is available with 4, 6, or 8 physical cores per microchip, in a 1 to 32-way design, with up to 1024 SMTs and a slightly different microarchitecture and interfaces for supporting extended/Sub-Specifications in reference to
1590-578: The Power ISA and/or different system architectures. For example, in the Supercomputing (HPC) System Power 775 it is packaged as a 32-way quad-chip-module (QCM) with 256 physical cores and 1024 SMTs. There is also a special TurboCore mode that can turn off half of the cores from an eight-core processor, but those 4 cores have access to all the memory controllers and L3 cache at increased clock speeds. This makes each core's performance higher which
1643-402: The complexity of register renaming circuitry to mitigate some dependencies. Collectively the power consumption , complexity and gate delay costs limit the achievable superscalar speedup. However even given infinitely fast dependency checking logic on an otherwise conventional superscalar CPU, if the instruction stream itself has many dependencies, this would also limit the possible speedup. Thus
POWER7 - Misplaced Pages Continue
1696-419: The degree of intrinsic parallelism in the code stream forms a second limitation. Collectively, these limits drive investigation into alternative architectural changes such as very long instruction word (VLIW), explicitly parallel instruction computing (EPIC), simultaneous multithreading (SMT), and multi-core computing . With VLIW, the burdensome task of dependency checking by hardware logic at run time
1749-540: The effects or benefits of Intel's Turbo Boost technology. This theoretical peak performance comparison holds in practice too, with the POWER7 and the i7-4770 obtaining similar scores in the SPEC CPU2006 floating point benchmarks (single-threaded): 71.5 for POWER7 versus 74.0 for i7-4770. Notice that the POWER7 chip significantly outperformed (2×–5×) the i7 in some benchmarks (bwaves, cactusADM, lbm) while also being significantly slower (2x-3x) in most others. This
1802-454: The first programming languages to provide single- and double-precision floating-point data types was Fortran . Before the widespread adoption of IEEE 754-1985, the representation and properties of floating-point data types depended on the computer manufacturer and computer model, and upon decisions made by programming-language designers. E.g., GW-BASIC 's single-precision data type was the 32-bit MBF floating-point format. Single precision
1855-637: The following outline: Conversion of the fractional part: Consider 0.375, the fractional part of 12.375. To convert it into a binary fraction, multiply the fraction by 2, take the integer part and repeat with the new fraction by 2 until a fraction of zero is found or until the precision limit is reached which is 23 fraction digits for IEEE 754 binary32 format. We see that ( 0.375 ) 10 {\displaystyle (0.375)_{10}} can be exactly represented in binary as ( 0.011 ) 2 {\displaystyle (0.011)_{2}} . Not all decimal fractions can be represented in
1908-418: The implicit 24th bit), bit 23 to bit 0, represents a value, starting at 1 and halves for each bit, as follows: The significand in this example has three bits set: bit 23, bit 22, and bit 19. We can now decode the significand by adding the values represented by these bits. Then we need to multiply with the base, 2, to the power of the exponent, to get the final result: Thus This is equivalent to: where s
1961-448: The instruction of one thread can be executed out of order and/or in parallel with the instruction of a different one. Also, one independent thread will not produce a pipeline bubble in the code stream of a different one, for example, due to a branch. Superscalar processors differ from multi-core processors in that the several execution units are not entire processors. A single processor is composed of finer-grained execution units such as
2014-452: The instructions to try to avoid pipeline stalls and increase parallel execution. Available performance improvement from superscalar techniques is limited by three key areas: Existing binary executable programs have varying degrees of intrinsic parallelism. In some cases instructions are not dependent on each other and can be executed simultaneously. In other cases they are inter-dependent: one instruction impacts either resources or results of
2067-441: The latter (pipeline) executes multiple instructions in the same execution unit in parallel by dividing the execution unit into different phases. In the "Simple superscalar pipeline" figure, fetching two instructions at the same time is superscaling, and fetching the next two before the first pair has been written back is pipelining. The superscalar technique is traditionally associated with several identifying characteristics (within
2120-513: The left of the binary point) with value 1. Subnormal numbers and zeros (which are the floating-point numbers smaller in magnitude than the least positive normal number) are represented with the biased exponent value 0, giving the implicit leading bit the value 0. Thus only 23 fraction bits of the significand appear in the memory format, but the total precision is 24 bits (equivalent to log 10 (2 ) ≈ 7.225 decimal digits). The bits are laid out as follows: [REDACTED] The real value assumed by
2173-463: The more rigid methods used in the simpler P5 Pentium ; it also simplified speculative execution and allowed higher clock frequencies compared to designs such as the advanced Cyrix 6x86 . The simplest processors are scalar processors. Each instruction executed by a scalar processor typically manipulates one or two data items at a time. By contrast, each instruction executed by a vector processor operates simultaneously on many data items. An analogy
POWER7 - Misplaced Pages Continue
2226-753: The number of units has increased. While early superscalar CPUs would have two ALUs and a single FPU , a later design such as the PowerPC 970 includes four ALUs, two FPUs, and two SIMD units. If the dispatcher is ineffective at keeping all of these units fed with instructions, the performance of the system will be no better than that of a simpler, cheaper design. A superscalar processor usually sustains an execution rate in excess of one instruction per machine cycle . But merely processing multiple instructions concurrently does not make an architecture superscalar, since pipelined , multiprocessor or multi-core architectures also achieve that, but with different methods. In
2279-425: The offset-binary representation, the offset of 127 has to be subtracted from the stored exponent. The stored exponents 00 H and FF H are interpreted specially. The minimum positive normal value is 2 − 126 ≈ 1.18 × 10 − 38 {\displaystyle 2^{-126}\approx 1.18\times 10^{-38}} and the minimum positive (subnormal) value
2332-402: The only supported precision is single. The IEEE 754 standard specifies a binary32 as having: This gives from 6 to 9 significant decimal digits precision. If a decimal string with at most 6 significant digits is converted to the IEEE 754 single-precision format, giving a normal number , and then converted back to a decimal string with the same number of digits, the final result should match
2385-415: The original string. If an IEEE 754 single-precision number is converted to a decimal string with at least 9 significant digits, and then converted back to single-precision representation, the final result must match the original number. The sign bit determines the sign of the number, which is the sign of the significand as well. The exponent field is an 8-bit unsigned integer from 0 to 255, in biased form :
2438-468: The other. The instructions a = b + c; d = e + f can be run in parallel because none of the results depend on other calculations. However, the instructions a = b + c; b = e + f might not be runnable in parallel, depending on the order in which the instructions complete while they move through the units. Although the instruction stream may contain no inter-instruction dependencies, a superscalar CPU must nonetheless check for that possibility, since there
2491-800: The resulting 32-bit IEEE 754 binary32 format representation of 12.375: Note: consider converting 68.123 into IEEE 754 binary32 format: Using the above procedure you expect to get ( 42883EF9 ) 16 {\displaystyle ({\text{42883EF9}})_{16}} with the last 4 bits being 1001. However, due to the default rounding behaviour of IEEE 754 format, what you get is ( 42883EFA ) 16 {\displaystyle ({\text{42883EFA}})_{16}} , whose last 4 bits are 1010. Example 1: Consider decimal 1. We can see that: ( 1 ) 10 = ( 1.0 ) 2 × 2 0 {\displaystyle (1)_{10}=(1.0)_{2}\times 2^{0}} From which we deduce: From these we can form
2544-424: The resulting 32-bit IEEE 754 binary32 format representation of real number 0.375: If the binary32 value, 41C80000 in this example, is in hexadecimal we first convert it to binary: then we break it down into three parts: sign bit, exponent, and significand. We then add the implicit 24th bit to the significand: and decode the exponent value by subtracting 127: Each of the 24 bits of the significand (including
2597-473: The resulting 32-bit IEEE 754 binary32 format representation of real number 1: Example 2: Consider a value 0.25. We can see that: ( 0.25 ) 10 = ( 1.0 ) 2 × 2 − 2 {\displaystyle (0.25)_{10}=(1.0)_{2}\times 2^{-2}} From which we deduce: From these we can form the resulting 32-bit IEEE 754 binary32 format representation of real number 0.25: Example 3: Consider
2650-472: The table below. IBM also offers 5 POWER7 based BladeCenters . Specifications are shown in the table below. The following are supercomputer projects that use the POWER7 processor: Superscalar While a superscalar CPU is typically also pipelined , superscalar and pipelining execution are considered different performance enhancement techniques. The former (superscalar) executes multiple instructions in parallel by using multiple execution units, whereas
2703-477: The traditional uniformity of the instruction set favors superscalar dispatch (this was why RISC designs were faster than CISC designs through the 1980s and into the 1990s, and it's far more complicated to do multiple dispatch when instructions have variable bit length). Except for CPUs used in low-power applications, embedded systems , and battery -powered devices, essentially all general-purpose CPUs developed since about 1998 are superscalar. The P5 Pentium
SECTION 50
#17327731591342756-416: Was a substantial evolution from the POWER6 design, focusing more on power efficiency through multiple cores and simultaneous multithreading (SMT). The POWER6 architecture was built from the ground up to maximize processor frequency at the cost of power efficiency. It achieved a remarkable 5 GHz. While the POWER6 features a dual-core processor, each capable of two-way simultaneous multithreading (SMT),
2809-459: Was the first superscalar x86 processor; the Nx586 , P6 Pentium Pro and AMD K5 were among the first designs which decode x86 -instructions asynchronously into dynamic microcode -like micro-op sequences prior to actual execution on a superscalar microarchitecture ; this opened up for dynamic scheduling of buffered partial instructions and enabled more parallelism to be extracted compared to
#133866