The Power Processing Element (PPE) comprises a Power Processing Unit (PPU) and a 512 KB L2 cache. In most instances the PPU is used in a PPE. The PPU is a 64-bit, dual-threaded, in-order PowerPC 2.02 microprocessor core designed by IBM primarily for use in the PlayStation 3 and Xbox 360 game consoles, but it has also found applications in high-performance computing, in supercomputers such as the record-setting IBM Roadrunner.
The PPU is used as a main CPU core in three different processor designs. The PPU is an in-order processor, but it has some unique traits that allow it to achieve some of the benefits of out-of-order execution without expensive reordering hardware. On an L1 cache miss, it can continue executing past the miss, stopping only when an instruction actually depends on the pending load. It can send up to 8 load instructions to the L2 cache out of order. It has an instruction delay pipe, a side path that allows it to execute instructions that would normally cause pipeline stalls without holding up the rest of the pipeline. The instruction delay pipe is used for out-of-order loads and stores: cache misses are put there while the pipeline moves on.
The PPE has a 23-stage general pipeline, with an additional 11 stages possible for microcode and an additional 4 stages possible for branch prediction.
The PPU runs two hardware threads simultaneously. The main registers for code execution are duplicated, as are the exception and interrupt-handling registers and several essential arrays and queues. The threads can generate exceptions simultaneously and perform branch prediction on their individual branch histories. The execution engine and caches are not duplicated, though, so it is still a single-core design.
Its 64-bit double-precision floating-point unit and 128-bit VMX unit (using the AltiVec instruction set) can perform a theoretical 12 floating-point operations per cycle, as its floating-point unit can do floating-point multiply-adds, and come no smaller than 64 bits. At 3.2 billion clock cycles per second, that gives 3.2 × 12 = 38.4 billion floating-point operations per second.
The PPU is enhanced in the PowerXCell 8i processor to perform single-cycle double-precision floating-point operations, tailored for high-performance computing in supercomputers. The VMX unit in the XCPU in the Xbox 360 is enhanced with 128 registers and is not entirely compatible with regular AltiVec.
Bubble (computing)
In the design of pipelined computer processors, a pipeline stall is a delay in execution of an instruction in order to resolve a hazard.
In a standard five-stage pipeline, during the decoding stage, the control unit will determine whether the decoded instruction reads from a register to which the currently executed instruction writes. If this condition holds, the control unit will stall the instruction by one clock cycle. It also stalls the instruction in the fetch stage, to prevent the instruction in that stage from being overwritten by the next instruction in the program.
In a von Neumann architecture that uses the program counter (PC) register to determine the instruction currently being fetched, the value in the PC register and the instruction in the fetch stage are preserved while an instruction in the decoding stage is stalled, so that no new instructions are fetched. The values are preserved until the instruction causing the conflict has passed through the execution stage. Such an event is often called a bubble, by analogy with an air bubble in a fluid pipe.
In some architectures, the execution stage of the pipeline must always be performing an action at every cycle. In that case, the bubble is implemented by feeding NOP ("no operation") instructions to the execution stage until the bubble is flushed past it.
Consider two executions of the same four instructions through a 4-stage pipeline. In one of them, for whatever reason, a delay in fetching the purple instruction in cycle #2 creates a bubble that delays all instructions after it as well.
The below example shows a bubble being inserted into a classic RISC pipeline with five stages (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back). In this example, data available after the MEM stage (4th stage) of
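The stall rule for a five-stage pipeline can be illustrated with a small Python simulation. This is a minimal sketch under simplifying assumptions, not a model of any real processor: the `Instr` class, the `simulate` function and the register names are hypothetical, and the hazard check only compares the instruction in ID against the one currently in EX, stalling for a single cycle as the text describes. While the stall is active, the IF and ID stages and the program counter are held, and a NOP is fed into EX as the bubble.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

STAGES = ("IF", "ID", "EX", "MEM", "WB")


@dataclass(frozen=True)
class Instr:
    """Hypothetical instruction: a name, one destination, some sources."""
    name: str
    dest: Optional[str] = None
    srcs: Tuple[str, ...] = ()


NOP = Instr("nop")  # the bubble fed into EX during a stall


def simulate(program):
    """Return one {stage: instruction name or None} snapshot per cycle."""
    pipe = {stage: None for stage in STAGES}
    pc = 0  # index of the next instruction to fetch
    trace = []
    while True:
        decoding, executing = pipe["ID"], pipe["EX"]
        # Hazard: the instruction in ID reads a register that the
        # instruction currently in EX writes.
        stall = (
            decoding is not None
            and executing is not None
            and executing.dest is not None
            and executing.dest in decoding.srcs
        )
        nxt = dict(pipe)
        nxt["WB"] = pipe["MEM"]
        nxt["MEM"] = pipe["EX"]
        if stall:
            # IF, ID and the PC are preserved; EX receives a NOP.
            nxt["EX"] = NOP
        else:
            nxt["EX"] = pipe["ID"]
            nxt["ID"] = pipe["IF"]
            if pc < len(program):
                nxt["IF"] = program[pc]
                pc += 1
            else:
                nxt["IF"] = None
        pipe = nxt
        if all(instr is None for instr in pipe.values()):
            break  # pipeline fully drained
        trace.append(
            {s: (i.name if i is not None else None) for s, i in pipe.items()}
        )
    return trace


program = [
    Instr("add", dest="r1", srcs=("r2", "r3")),
    Instr("sub", dest="r4", srcs=("r1", "r5")),  # reads r1 written by add
    Instr("or", dest="r6", srcs=("r7", "r8")),
]
for cycle, snapshot in enumerate(simulate(program), start=1):
    print(cycle, snapshot)
```

In this run the dependent `sub` waits one extra cycle in ID while a `nop` occupies EX, so the three instructions take 8 cycles to drain instead of the 7 that an independent sequence would need.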