SPARC T series - Misplaced Pages

The SPARC T-series family of RISC processors and server computers, based on the SPARC V9 architecture, was originally developed by Sun Microsystems, and later by Oracle Corporation after its acquisition of Sun. Its distinguishing feature from earlier SPARC iterations is the introduction of chip multithreading (CMT) technology, a multithreading, multicore design intended to drive greater processor utilization at lower power consumption.

The first generation T-series processor, the UltraSPARC T1, and servers based on it, were announced in December 2005. As later generations were introduced, the term "T series" was used to refer to the entire family of processors. Sun Microsystems' Sun Fire and SPARC Enterprise product lines were based on early generations of CMT technology. The UltraSPARC T1 based Sun Fire T2000 and T1000 servers were launched in December 2005 and early 2006, respectively. They were later rebranded to match

A 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a reordered state at a time. Pentium Pro (1995) introduced a unified reservation station, which at the 20 micro-OP capacity permitted very flexible reordering, backed by a 40-entry reorder buffer. Loads can be reordered ahead of both loads and stores. The practically attainable per-cycle rate of execution rose further as full out-of-order execution

210-405: A 1D vector for hazard avoidance. This new paradigm breaks up the processing of instructions into these steps: The key concept of out-of-order processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the processor avoids the stall that occurs in step 2 of the in-order processor when the instruction

A bit later Arm's A9 succeeded A8. For low-end x86 personal computers, the in-order Bonnell microarchitecture in early Intel Atom processors was first challenged by AMD's Bobcat microarchitecture, and in 2013 was succeeded by the out-of-order Silvermont microarchitecture. Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution

A blade server in the same Sun Blade 6000 form factor. On September 26, 2011, Oracle announced a range of SPARC T4-based servers. These systems use the same chassis as the earlier T3-based systems. Their main features are very similar, with the exception of: On March 26, 2013, Oracle announced refreshed SPARC servers based on the new SPARC T5 microprocessor, which the company claims is "the world's fastest". In

420-399: A computer program and achieve high-performance by exploiting the fine-grain parallelism between the two. In doing so, it effectively hides all memory latency from the processor's perspective. A larger buffer can, in theory, increase throughput. However, if the processor has a branch misprediction then the entire buffer may need to be flushed, wasting a lot of clock cycles and reducing

A family of systems built on the 32-core, 256-thread SPARC M8 microprocessor at 5.0 GHz. It also included the second generation of Data Analytics Accelerator (DAX) engines. SPARC T-series servers can be partitioned using Oracle's Logical Domains technology. Additional virtualization is provided by Oracle Solaris Zones (aka Solaris Containers) to create isolated virtual servers within a single operating system instance. Logical Domains and Solaris Zones can be used together to increase server utilization. UltraSPARC T1 The UltraSPARC T1 (codenamed "Niagara")

560-452: A large number of separate threads. One of the limitations of the T1 design is that a single floating point unit (FPU) is shared between all 8 cores, making the T1 unsuitable for applications performing a lot of floating point mathematics. However, since the processor's intended markets do not typically make much use of floating-point operations, Sun did not expect this to be a problem. Sun provides

A major research area in computer architecture in the 1970s and early 1980s. The first machine to use out-of-order execution was the CDC 6600 (1964), designed by James E. Thornton, which uses a scoreboard to avoid conflicts. It permits an instruction to execute if its source operand (read) registers aren't to be written to by any unexecuted earlier instruction (true dependency) and the destination (write) register not be

A range of open source applications, including MySQL, PHP, gzip, and ImageMagick. Proper optimization for CoolThreads systems can result in significant gains: when the Sun Studio compiler is used with the recommended optimization settings, MySQL performance improves by 268% compared to using just the -O3 flag. The "CoolThreads(TM)" architecture, beginning with the UltraSPARC T1 (with its positive and negative aspects),

A register used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an execution unit on false dependencies (write after write (WAW) and write after read (WAR) conflicts, respectively termed first-order conflict and third-order conflict by Thornton, who termed true dependencies (read after write (RAW)) as second-order conflict) because each address has only

A replacement for the UltraSPARC T1 or T2, but was canceled in the timeframe of Oracle's acquisition of Sun. Formerly known by the codename Niagara 2, the follow-on to the UltraSPARC T1, the T2 provides eight cores. Unlike in the T1, each core supports 8 threads and has its own FPU and its own enhanced cryptographic unit, and 10 Gigabit Ethernet network controllers are embedded in the CPU. In February 2007, Sun announced at its annual analyst summit that its third-generation simultaneous multithreading design, code-named Victoria Falls,

A similar way to high-end Sun SMP systems. Thus, several cores can be partitioned for running a single process or a group of processes and/or threads, while the other cores deal with the rest of the processes on the system. Afara Websystems pioneered a radical thread-heavy SPARC design. The company was purchased by Sun, and the intellectual property became the foundation of the CoolThreads line of processors, starting with

A single location referable by it. The WAW is worse than WAR for the 6600, because when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and they cannot receive any further instructions until the WAW-causing instruction's destination register has been written to by the earlier instruction. About two years later,

1050-483: A tool for analysing an application's level of parallelism and use of floating point instructions to determine if it is suitable for use on a T1 or T2 platform. In addition to web and application tier processing, the UltraSPARC T1 may be well suited for smaller database applications which have a large user count. One customer has published results showing that a MySQL application running on an UltraSPARC T1 server ran 13.5 times faster than on an AMD Opteron server. T1

A two-entry reservation station permitting the newer entry to execute before the older. The reorder buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the reordering of loads and stores upon cache misses. HAL SPARC64 (1995) exceeded the reordering capacity of the ES/9000 model 900 by having three 8-entry reservation stations for integer, floating-point, and address generation unit, and

Is 3 MB and there is no L3 cache. The T1 processor can be found in the following products from Sun and Fujitsu Computer Systems: The UltraSPARC T1 microprocessor is unique in its strengths and weaknesses, and as such is targeted at specific markets. Rather than being used for high-end number-crunching and ultra-high performance applications, the chip is targeted at network-facing high-demand servers, such as high-traffic web servers, and mid-tier Java, ERP, and CRM application servers, which often utilize

Is a multithreading, multicore CPU released by Sun Microsystems in 2005. Designed to lower the energy consumption of server computers, the CPU typically uses 72 W of power at 1.4 GHz. The T1 is a new-from-the-ground-up SPARC microprocessor implementation that conforms to the UltraSPARC Architecture 2005 specification and executes the full SPARC V9 instruction set. Sun has produced two previous multicore processors (UltraSPARC IV and IV+), but UltraSPARC T1

Is also a BluePrints article on using the Cryptographic Accelerator Units on the T1 and T2 processors. A wide range of applications were optimized on the CoolThreads platform, including Symantec Brightmail AntiSpam, Oracle's Siebel applications, and the Sun Java System Web Proxy Server. Sun also documented its experience in moving its own online store onto a T2000 server cluster, and published two articles on web consolidation on CoolThreads using Solaris Containers. Sun had an application performance tuning page for

Is not completely ready to be processed due to missing data. Out-of-order processors fill these slots in time with other instructions that are ready, then reorder the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as program order; in the processor they are handled in data order,

Is still prevalent in microcontrollers and embedded systems, as well as in phone-class cores such as Arm's A55 and A510 in big.LITTLE configurations. Out-of-order execution is more sophisticated relative to the baseline of in-order execution. In pipelined in-order execution processors, the execution of instructions overlaps in pipelined fashion, with each requiring multiple clock cycles to complete. The consequence

Is that results from a previous instruction will lag behind where they may be needed in the next. In-order execution still has to keep track of these dependencies. Its approach, however, is quite unsophisticated: stall, every time. Out-of-order uses much more sophisticated data tracking techniques, as described below. In earlier processors, the processing of instructions is performed in an instruction cycle normally consisting of

Is the first SPARC processor that supports the Hyper-Privileged execution mode. The SPARC Hypervisor runs in this mode, and it can partition a T1 system into 32 Logical Domains, each of which can run an operating system instance. Currently, Solaris, Linux, NetBSD and OpenBSD are supported. Traditionally, commercial software suites such as Oracle Database charge their customers based on

Is turned into a normal register r_n only when all the earlier instructions addressing r_n have been executed, but until then r_n is given for earlier instructions and alt-r_n for later ones addressing r_n. In the Model 91 the register renaming is implemented by a bypass termed Common Data Bus (CDB) and memory source operand buffers, leaving the physical architectural registers unused for many cycles as

The Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%. Important academic research in this subject was led by Yale Patt with his HPSm simulator. In the 1980s many early RISC microprocessors, like the Motorola 88100, had out-of-order writeback to the registers, resulting in imprecise exceptions. Instructions started execution in order, but some (e.g. floating-point) took more cycles to complete execution. However

The GNU General Public License via the OpenSPARC project. The published information includes: Out-of-order execution In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by

The IBM System/360 Model 91 (1966) introduced register renaming with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register r_n can be executed before an earlier instruction using the register r_n is executed, by actually writing into an alternative (renamed) register alt-r_n, which

The 377mm² die." The T4 CPU was released in late 2011. The T4 CPU dropped from 16 cores (on the T3) back to 8 cores (as used on the T1, T2, and T2+). The new T4 core design (named "S3") features improved per-thread performance, due to the introduction of out-of-order execution, as well as improved performance for single-threaded programs. In 2010, Larry Ellison announced that Oracle would offer Oracle Linux on

The 6600. The Model 91 is also capable of reordering loads and stores to execute before the preceding loads and stores, unlike the 6600, which only has a limited ability to move loads past loads, and stores past stores, but not loads past stores and stores past loads. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point calculations. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution could be applied generally and made practical outside supercomputers. To have precise exceptions,

The SPARC Enterprise T5140 and T5240. In October 2008, Sun released the 4-way UltraSPARC T2 Plus SPARC Enterprise T5440 server. In October 2006, Sun disclosed that Niagara 3 would be built with a 45 nm process. The Register reported in June 2008 that the microprocessor would have 16 cores, incorrectly suggesting each core would have 16 threads. During the Hot Chips 21 conference Sun revealed

2170-452: The T1 is 30 PVUs (each T2 core is 50 PVUs, and T3 is 70 PVUs) instead of the default value of 100 PVUs per core. The T1 only offered a single floating-point unit to be shared by the 8 cores, limiting usage in HPC environments. This weakness was mitigated with the follow-on UltraSPARC T2 processor, which included 8 floating point units, as well as other additional features. Furthermore, the T1

The T1. The UltraSPARC T1 was designed from scratch as a multi-threaded, special-purpose processor, and thus introduced a whole new architecture for obtaining performance. Rather than try to make each core as intelligent and optimized as possible, Sun's goal was to run as many concurrent threads as possible, and maximize utilization of each core's pipeline. The T1's cores are less complex than those of competing processors in order to allow 8 cores to fit on

2310-704: The T5 range of servers, the single socket rackmount server design was deprecated, while a new eight-socket rackmount server was introduced. On October 26, 2015, Oracle announced a family of systems built on the 32-core, 256-thread SPARC M7 microprocessor. Unlike prior generations, both T- and M-series systems were introduced using the same processor. The M7 included the first generation of the Data Analytics Accelerator (DAX) engines. DAX engines offloaded in-memory query processing and performed real-time data decompression. On September 18, 2017, Oracle announced

The UltraSPARC T1 is more powerful than the circa 2001, single-core, single-threaded UltraSPARC III, and in a chip-to-chip comparison, significantly outperforms other processors on multithreaded integer workloads. The UltraSPARC T1 contains 279 million transistors and has an area of 378 mm². It was fabricated by Texas Instruments (TI) in their 90 nm complementary metal–oxide–semiconductor (CMOS) process with nine levels of copper interconnect. Each core has a 16 KB L1 instruction cache and an 8 KB L1 data cache. L2 cache

The UltraSPARC platform, and the port was scheduled to be available in the T4 and T5 timeframe. John Fowler, Oracle's Executive Vice President of Systems, said at OpenWorld 2014 that Linux will be able to run on SPARC at some point. The new T5 CPU features 128 threads over 16 cores and is manufactured with a 28 nanometer technology. On March 21, 2006, Sun made the UltraSPARC T1 processor design available under

The availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently. Out-of-order execution is a restricted form of dataflow architecture, which was

The case of a cache miss, loads and stores could be reordered. Only the link and count registers could be renamed. In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first x86 processor capable of out-of-order execution and featured a reordering distance of up to 14 micro-operations. The PowerPC 603 renamed both

2660-588: The chip has a total of 16 cores and 128 threads. According to the ISSCC 2010 presentation: "A 16-core SPARC SoC processor enables up to 512 threads in a 4-way glueless system to maximize throughput. The 6 MB L2 cache of 461 GB/s and the 308-pin SerDes I/O of 2.4 Tb/s support the required bandwidth. Six clock and four voltage domains, as well as power management and circuit techniques, optimize performance, power, variability and yield trade-offs across

The competition. The first superscalar single-chip processors (Intel i960CA in 1989) used simple scoreboard scheduling like the CDC 6600 had a quarter of a century earlier. In 1992–1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers. The Motorola 88110 (1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all

The dual integer unit (each cycle, from the six instructions up to two can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The reordering distance is up to 32 instructions. The A19 of Unisys' A-series of mainframes was also released in 1991 and was claimed to have out-of-order execution, and one analyst called the A19's technology three to five years ahead of

The earlier in-order processors, these stages operated in a fairly lock-step, pipelined fashion. The instructions of the program may not be run in the originally specified order, as long as the end result is correct. It separates the fetch and decode stages from the execute stage in a pipelined processor by using a buffer. The buffer's purpose is to partition the memory access and execute functions in

The effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favour a multi-threaded design approach. Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well. Control intensive code includes such things as nested branches that occur frequently in operating system kernels. Decoupled architectures play an important role in scheduling in very long instruction word (VLIW) architectures. To avoid false operand dependencies, which would decrease

3010-476: The floating-point instructions is still very limited; due to POWER1's inability to reorder floating-point arithmetic instructions (results became available in-order), their destination registers aren't renamed. POWER1 also doesn't have reservation stations needed for out-of-order use of the same execution unit. The next year IBM's ES/9000 model 900 had register renaming added for the general-purpose registers. It also has reservation stations with six entries for

The floating-point pipeline, allowing inter-pipeline reordering. The ZS-1 was also capable of executing loads ahead of preceding stores. In his 1984 paper he opined that enforcing the precise exceptions only on the integer/memory pipeline should be sufficient for many use cases, as it even permits virtual memory. Each pipeline had an instruction buffer to decouple it from the instruction decoder, to prevent

3150-400: The following steps: Often, an in-order processor has a bit vector recording which registers will be written to by a pipeline. If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses
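The bit-vector check described above can be illustrated with a short sketch. This is a minimal model, not any particular processor's logic: the register count and the names `must_stall`, `issue`, and `retire` are hypothetical and chosen only for illustration.

```python
NUM_REGS = 32  # hypothetical register count, for illustration only


def must_stall(pending_writes: int, src_regs: list[int]) -> bool:
    """Return True if any source register still has a write in flight."""
    mask = 0
    for reg in src_regs:
        mask |= 1 << reg
    return (pending_writes & mask) != 0


def issue(pending_writes: int, dest_reg: int) -> int:
    """Mark dest_reg as having a pending write when an instruction issues."""
    return pending_writes | (1 << dest_reg)


def retire(pending_writes: int, dest_reg: int) -> int:
    """Clear the pending bit once the result has been written back."""
    return pending_writes & ~(1 << dest_reg)


pending = issue(0, 3)                 # an earlier instruction will write r3
print(must_stall(pending, [3, 4]))    # reads r3, so it has to wait
print(must_stall(pending, [1, 2]))    # independent, so it can proceed
```

A single integer serves as the vector here, so the hazard check is one AND against a mask of the source registers.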

3220-508: The frequency when instructions could be issued out of order, a technique called register renaming is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time. The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires
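The renaming scheme described above, with more physical registers than architectural ones, can be sketched as follows. This is a simplified illustration under stated assumptions: the class `Renamer`, its sizes, and its free-list policy are hypothetical, and the recycling of physical registers at retirement is deliberately left out.

```python
class Renamer:
    """Toy register renamer; names and sizes are hypothetical."""

    def __init__(self, num_arch: int, num_phys: int) -> None:
        if num_phys <= num_arch:
            raise ValueError("need more physical than architectural registers")
        # Initially architectural register i lives in physical register i.
        self.table = list(range(num_arch))
        self.free = list(range(num_arch, num_phys))

    def rename(self, dest: int, srcs: list[int]) -> tuple[int, list[int]]:
        """Map sources to their current versions, then give dest a new one."""
        phys_srcs = [self.table[s] for s in srcs]
        if not self.free:
            raise RuntimeError("no free physical register; a real core would stall")
        phys_dest = self.free.pop(0)
        self.table[dest] = phys_dest
        return phys_dest, phys_srcs


r = Renamer(num_arch=4, num_phys=8)
first = r.rename(dest=1, srcs=[2, 3])   # r1 = r2 op r3
second = r.rename(dest=1, srcs=[1])     # r1 = op r1, a WAW on r1 in program order
print(first, second)
```

Because the two writes to architectural register 1 land in different physical registers, the write-after-write conflict between them disappears, while the second instruction still reads the value produced by the first.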

The general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry reorder buffer lets no more than four instructions overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store. PowerPC 604 (1995) was the first single-chip processor with execution unit-level reordering, as three out of its six units each had

The issue of cache misses by multithreading. Each core is a barrel processor, meaning it switches between available threads each cycle. When a long-latency event occurs, such as a cache miss, the thread is taken out of rotation while the data is fetched into cache in the background. Once the long-latency event completes, the thread is made available for execution again. Sharing of the pipeline by multiple threads may make each thread slower, but
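The thread rotation described above can be modelled with a small simulation. This is a sketch under simplifying assumptions, not a description of the T1's actual pipeline: the function `simulate`, the encoding of each instruction as a miss latency (0 for an ordinary single-cycle instruction), and the strict round-robin policy are all hypothetical.

```python
def simulate(programs: list[list[int]], max_cycles: int) -> list:
    """Return, per cycle, the index of the thread that issued (None if idle).

    Each program is a list of latencies: 0 is an ordinary instruction,
    n > 0 is a long-latency event that parks the thread for n extra cycles.
    """
    n = len(programs)
    pc = [0] * n          # next instruction index per thread
    ready_at = [0] * n    # first cycle at which each thread may issue again
    last = -1             # thread that issued most recently
    trace = []
    for cycle in range(max_cycles):
        chosen = None
        # Round-robin search starting after the last thread that issued.
        for step in range(1, n + 1):
            t = (last + step) % n
            if pc[t] < len(programs[t]) and ready_at[t] <= cycle:
                chosen = t
                break
        trace.append(chosen)
        if chosen is None:
            continue
        latency = programs[chosen][pc[chosen]]
        pc[chosen] += 1
        if latency > 0:
            # Take the thread out of rotation until the event completes.
            ready_at[chosen] = cycle + 1 + latency
        last = chosen
    return trace


# Thread 0 misses on its second instruction; thread 1 keeps the pipeline busy.
print(simulate([[0, 3, 0], [0, 0, 0]], 7))
# prints [0, 1, 0, 1, 1, None, 0]
```

In the trace, thread 1 issues in consecutive cycles while thread 0 waits on its miss, and the pipeline only goes idle once thread 1 has run out of work.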

The latter had an out-of-order floating-point unit. The other high-end in-order processors fell far behind, namely Sun's UltraSPARC III / IV, and IBM's mainframes which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the SPARC T series and Xeon Phi changed to out-of-order execution in 2011 and 2016 respectively. Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First, Qualcomm's Scorpion (reordering distance of 32) shipped in Snapdragon, and

The massive amount of thread-level parallelism (TLP) available on the CoolThreads platform can require different application development techniques than for traditional server platforms. Using TLP in applications is key to getting good performance. Sun has published a number of Sun BluePrints to assist application programmers in developing and deploying software on T1 or T2-based CoolThreads servers. The main article, Tuning Applications on UltraSPARC T1 Chip Multithreading Systems, addresses issues for general application programmers. There

The memory, so during the time an in-order processor spends waiting for data to arrive, it could have theoretically processed a large number of instructions. One of the differences created by the new paradigm is the creation of queues that allow the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was decoupled architecture. In

3640-651: The name of the UltraSPARC T2 and T2 Plus based Sun SPARC Enterprise T5**0 servers. In September 2010, Oracle announced a range of SPARC T3 processor based servers. These are branded as the "SPARC T3" series, the "SPARC Enterprise" brand being dropped. The SPARC T3-series servers include the T3-1B, a blade server module that fits into the Sun Blade 6000 system. All other T3 based servers are rack mounted systems. Subsequent T-series server generations also include

The number of processors the software runs on. In early 2006, Oracle changed the licensing model by introducing the processor factor. With a processor factor of 0.25 for the T1, an 8-core T2000 requires only a 2-CPU license. The "Oracle Processor Core Factor Table" has since been updated regularly as new CPUs came to market. In Q3 2006, IBM introduced the concept of Value Unit (VU) pricing. Each core of
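The processor-factor arithmetic above can be written out as a small calculation. This is an illustrative sketch only: the function name `licenses_required` is hypothetical, and rounding fractional results up to a whole license is an assumption that the text does not state.

```python
import math
from fractions import Fraction


def licenses_required(cores: int, core_factor: str) -> int:
    """Cores times the per-core factor, rounded up (rounding is an assumption)."""
    # Fraction keeps the decimal factor exact and avoids float rounding.
    return math.ceil(cores * Fraction(core_factor))


# The example from the text: an 8-core T2000 at a factor of 0.25.
print(licenses_required(8, "0.25"))  # prints 2
```

Passing the factor as a string keeps values such as 0.25 exact, so the multiplication cannot drift before the rounding step.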

The oldest state of registers addressed by any unexecuted instruction is found on the CDB. Another advantage the Model 91 has over the 6600 is the ability to execute instructions out-of-order in the same execution unit, not just between the units like the 6600. This is accomplished by reservation stations, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of

The order in which the data becomes available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output. The benefit of out-of-order processing grows as the instruction pipeline deepens and the speed difference between main memory (or cache memory) and the processor widens. On modern machines, the processor runs many times faster than

The overall throughput (and utilization) of each core is much higher. It also means that the impact of cache misses is greatly reduced, and the T1 can maintain high throughput with a smaller amount of cache. The cache no longer needs to be large enough to hold all or most of the "working set", just the recent cache misses of each thread. Benchmarks demonstrate this approach has worked very well on commercial (integer), multithreaded workloads such as Java application servers, Enterprise Resource Planning (ERP) application servers, email (such as Lotus Domino) servers, and web servers. These benchmarks suggest each core in

The pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance. The PowerPC 601 (1993) was an evolution of the RISC Single Chip, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched instruction queue, the lowest four entries of which were scanned for dispatchability. In

The processor, but older applications burdened with single thread bottlenecks occasionally exhibited poor overall performance. Single-threaded application weakness was mitigated with the follow-on SPARC T4 processor. The T4 core count was reduced to 8 (from 16 on the T3), the cores were made more complex, and the clock rate was nearly doubled, all contributing to faster single thread performance (300% to 500% increase over previous generations). Additional effort

The proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by James E. Smith and Andrew R. Pleszkun. The CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an invisible exchange package, so that it can resume at

The same die. The cores do not feature out-of-order execution, or a sizable amount of cache. Single-thread processors depend heavily on large caches for their performance because cache misses result in a wait while the data is fetched from main memory. By making the cache larger, the probability of a cache miss is reduced, but the impact of a miss is still the same. The T1 cores largely side-step

The same state of execution. However, to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions. Through simulation, Smith determined that adding a reorder buffer (or history buffer or equivalent) to

4340-467: The single-cycle execution of the most basic instructions greatly reduced the scope of the problem compared to the CDC 6600. Smith also researched how to make different execution units operate more independently of each other and of the memory, front-end, and branching. He implemented those ideas in the Astronautics ZS-1 (1988), featuring a decoupling of the integer/load/store pipeline from

4410-549: The stalling of the front end. To further decouple the memory access from execution, each of the two pipelines was associated with two addressable queues that effectively performed limited register renaming. A similar decoupled architecture had been used a bit earlier in the Culler 7. The ZS-1's ISA, like IBM's subsequent POWER, aided the early execution of branches. With the POWER1 (1990), IBM returned to out-of-order execution. It

Was taped out in October 2006. A two-socket server (2 RU) will have 128 threads, 16 cores, and a 65× performance improvement over UltraSPARC III. At the Hot Chips 19 conference, Sun announced that Victoria Falls will be in two-way and four-way servers. Thus, a single 4-way SMP server will support 256 concurrent hardware threads. In April 2008, Sun released 2-way UltraSPARC T2 Plus servers,

Was certainly influential in the concurrent and future designs of SPARC processors. The original UltraSPARC T1 was designed for single CPU systems only and is not capable of SMP. "Rock" was a more ambitious project, intended to support multiple-chip server architectures, targeting traditional data-facing workloads such as databases. It was seen as more a follow-on to Sun's SMP processors such as UltraSPARC IV, rather than

Was further adopted by SGI/MIPS (R10000) and HP PA-RISC (PA-8000) in 1996. The same year Cyrix 6x86 and AMD K5 brought advanced reordering techniques into mainstream personal computers. Since DEC Alpha gained out-of-order execution in 1998 (Alpha 21264), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/Intel Itanium 2 and IBM POWER6, though

Was its first microprocessor that is both multicore and multithreaded. Security was built-in from the very first release on silicon, with hardware cryptographic units in the T1, unlike general purpose processors from competing vendors of the time. The processor is available with four, six or eight CPU cores, each core able to handle four threads concurrently. Thus, the processor is capable of processing up to 32 threads concurrently. The UltraSPARC T1 can be partitioned in

4760-462: Was made to add the "critical thread API", where the operating system would detect a bottleneck and would temporarily allocate the resources of an entire core, instead of 1 (of 8) threads, to the targeted application processes exhibiting single threaded CPU bound behavior. This allowed the T4 to uniquely mitigate single threaded bottlenecks, while not having to compromise in the overall architecture to achieve massive multi-threaded throughput. Leveraging

Was only available in uniprocessor systems, limiting vertical scalability in large enterprise environments. This weakness was mitigated with the follow-on UltraSPARC T2 Plus, as well as the next generation SPARC T3 and SPARC T4. The UltraSPARC T2+, SPARC T3, and SPARC T4 all offer single, dual, and quad socket configurations. The T1 had outstanding throughput with massive numbers of threads supported by

4900-502: Was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a physical register file (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named program counter stack by IBM) to undo changes to count, link, and condition registers. The reordering capability of even
