The Cray Y-MP was a supercomputer sold by Cray Research from 1988, and the successor to the company's X-MP. The Y-MP retained software compatibility with the X-MP, but extended the address registers from 24 to 32 bits. High-density VLSI ECL technology was used and a new liquid-cooling system was devised. The Y-MP ran the Cray UNICOS operating system.
97-486: The Y-MP could be equipped with two, four or eight vector processors, with two functional units each and a clock cycle time of 6 ns (167 MHz). Peak performance was thus 333 megaflops per processor. Main memory comprised 128, 256 or 512 MB of SRAM. The original Y-MP (otherwise known as the Y-MP Model D) was housed in a chassis similar to the horseshoe-shaped X-MP, but with an extra rectangular cabinet added in
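The per-processor peak figure follows from the clock period and the number of functional units. A minimal sketch of that arithmetic, assuming each functional unit delivers one floating-point result per clock cycle (the function name peak_mflops is chosen here purely for illustration):

```c
/* Peak floating-point rate in MFLOPS, derived from the clock period.
   Assumption: every functional unit completes one result per cycle. */
double peak_mflops(double cycle_ns, int functional_units)
{
    double clock_mhz = 1000.0 / cycle_ns;   /* 6 ns gives about 166.7 MHz */
    return clock_mhz * functional_units;    /* 2 units give about 333 MFLOPS */
}
```

Calling peak_mflops(6.0, 2) reproduces the roughly 333 megaflops quoted for one Y-MP processor.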
194-687: A Lustre file system. In 2011, Cray launched the OpenACC parallel programming standard organization. In 2019, Cray announced that it was deprecating OpenACC and would support OpenMP. However, in 2022, the Cray Fortran compiler still supported OpenACC, in part due to its usage in the ICON climate simulation code. In April 2012, Cray announced the sale of its interconnect hardware development program and related intellectual property to Intel for $140 million. On November 9, 2012, Cray announced
291-625: A LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design. Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs
388-554: A batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining . The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era. Other examples followed. Control Data Corporation tried to re-enter the high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave
485-540: A co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions. Modern graphics processing units ( GPUs ) include an array of shader pipelines which may be driven by compute kernels , and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in Flynn's 1972 paper the key distinguishing factor of SIMT-based GPUs
582-468: A greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as the vector length. The self-repeating instructions are found in early vector computers like the STAR-100, where
679-421: A high performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus completed far faster overall, the limiting factor being
776-468: A lab in his hometown of Chippewa Falls, Wisconsin , about 85 miles to the east. Cray had a string of successes at CDC, including the CDC 6600 and CDC 7600 . When CDC ran into financial difficulties in the late 1960s, development funds for Cray's follow-on CDC 8600 became scarce. When he was told the project would have to be put "on hold" in 1972, Cray left to form his own company, Cray Research, Inc. Copying
873-551: A maximum measured performance of 5.9 teraflops, being the 29th fastest supercomputer in the world. Since then the X1 has been superseded by the X1E, with faster dual-core processors. On October 4, 2004, the company announced the Cray XD1 range of entry-level supercomputers which use dual-core 64-bit Advanced Micro Devices Opteron central processing units running Linux . This system
970-506: A pair of PowerPC 405 processors which can add to the already considerable power of a single node. The Cray XD1, although moderately successful, was eventually discontinued. In 2004, Cray completed the Red Storm system for Sandia National Laboratories . Red Storm was to become the jumping-off point for a string of successful products that eventually revitalized Cray in supercomputing. Red Storm had processors clustered in 96 unit cabinets,
1067-484: A peak performance of 133 megaflops) and 32 MB to 1 GB of DRAM were available. The Y-MP EL was later developed into the Cray EL90 series (EL92, EL94 and EL98). The Y-MP EL came in a cabinet much smaller than the traditional room-filling Cray, measuring 2010×1270×810 mm (height × width × depth) and weighing 635 kg, and could be powered from regular mains power. In the 1992 film Sneakers, whose story
1164-428: A pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with a corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR-100 was otherwise slower than CDC's own supercomputers like
1261-482: A pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads". This example starts with an algorithm ("IAXPY"), shown first in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate
1358-438: A register (SWAR) Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators . Vector machines appeared in the early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in
1455-812: A separate line of computers, originally with lead designer Steve Chen and the Cray X-MP . After Chen's departure, the Cray Y-MP , Cray C90 and Cray T90 were developed on the original Cray-1 architecture but achieved much greater performance via multiple additional processors, faster clocks, and wider vector pipes. The uncertainty of the Cray-2 project gave rise to a number of Cray-object-code compatible "Crayette" firms: Scientific Computer Systems (SCS), American Supercomputer, Supertek , and perhaps one other firm. These firms did not intend to compete against Cray and therefore attempted less expensive, slower CMOS versions of
1552-596: A single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set , fed in the form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted by the University of Illinois at Urbana–Champaign as the ILLIAC IV . Their version of
1649-428: A special instruction, the significance compared to Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity,
1746-422: A theoretical maximum of 300 cabinets in a machine, and a design speed of 41.5 teraflops. Red Storm also included an innovative new design for network interconnects, which was dubbed SeaStar and destined to be the centerpiece of succeeding innovations by Cray. The Cray XT3 massively parallel supercomputer became a commercialized version of Red Storm, similar in many respects to the earlier T3E architecture, but, like
1843-565: A true global-address space and represented a return to the T3E feature set that had been so successful with Cray Research. This product was a successful follow-on to the XT3, XT4 and XT5 products. The first multi-cabinet XE6 system was shipped in July 2010. The next generation Cascade systems were designed to make use of future multicore and/or manycore processors from vendors such as Intel and Nvidia. Cascade
1940-541: A vector processor. Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as
2037-406: Is unable by design to cope with iteration and reduction. This is illustrated further with examples, below. Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput and having less latency than SIMD, through vector chaining . Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing
2134-611: Is significantly more complex and involved than "Packed SIMD", which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture. Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as
2231-521: Is single-issue and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the Cray-1 , the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to vector chaining , is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to
2328-441: Is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors . This is in contrast to scalar processors , whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SIMD within
2425-416: Is assumed that both x and y are properly aligned here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can. If it does not, a "splat" (broadcast) must be used, to copy
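The fixed-width batching described here can be modelled in portable C. The following is only a sketch under the stated assumption that n is a multiple of 4; it emulates a 4-lane SIMD unit with small arrays rather than using any real SIMD intrinsics, and the names iaxpy_simd4 and LANES are hypothetical. Plain C arrays carry no alignment requirement, so the alignment assumption from the text is not modelled.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed SIMD width, fixed by the (hypothetical) ISA */

/* IAXPY in packed-SIMD style: every step works on exactly LANES elements. */
void iaxpy_simd4(size_t n, int32_t a, const int32_t x[], int32_t y[])
{
    assert(n % LANES == 0);              /* no tail handling in pure SIMD */

    int32_t va[LANES];
    for (int l = 0; l < LANES; l++)      /* "splat": broadcast scalar a */
        va[l] = a;

    for (size_t i = 0; i < n; i += LANES) {
        int32_t vx[LANES], vy[LANES];
        for (int l = 0; l < LANES; l++) vx[l] = x[i + l];          /* load x */
        for (int l = 0; l < LANES; l++) vy[l] = y[i + l];          /* load y */
        for (int l = 0; l < LANES; l++) vy[l] += va[l] * vx[l];    /* mul, add */
        for (int l = 0; l < LANES; l++) y[i + l] = vy[l];          /* store */
    }
}
```

If n were not a multiple of 4, this version would need either a scalar clean-up loop or a mask for the final batch, which is exactly the setup code the text mentions.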
2522-415: Is centered around extremely high-level cryptography , two lead characters have an important discussion while sitting on a Cray Y-MP. In an episode of the television dramedy Northern Exposure titled "Nothing's Perfect", a character expresses her excitement at having finally gained access to a "CRAY Y-MP3" supercomputer. Vector processor In computing , a vector processor or array processor
2619-505: Is comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2 and AVX-512, almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable length vectors. Examples below help explain these categorical distinctions. SIMD, because it uses fixed-width batch processing,
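Per-element predication can be illustrated with a small model in C. This is a sketch only: it emulates a 4-lane unit with a boolean mask per lane and does not correspond to the actual SVE2 or AVX-512 instruction encodings; iaxpy_pred4 and LANES are names invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define LANES 4  /* assumed fixed hardware width */

/* IAXPY with per-lane predicate masks: the ALU width stays fixed, but
   lanes past the end of the data are switched off, so any n works. */
void iaxpy_pred4(size_t n, int32_t a, const int32_t x[], int32_t y[])
{
    for (size_t i = 0; i < n; i += LANES) {
        bool mask[LANES];
        for (int l = 0; l < LANES; l++)
            mask[l] = (i + l) < n;       /* active only while in range */

        for (int l = 0; l < LANES; l++)
            if (mask[l])                 /* masked-off lanes touch nothing */
                y[i + l] = a * x[i + l] + y[i + l];
    }
}
```

With n = 6 the second batch runs with two lanes active and two inactive, which gives the appearance of a variable-length vector on fixed-width hardware.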
2716-489: Is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Random-access memory § Memory wall . In order to reduce
2813-466: Is not common to later designs, and is often referred to under a separate category, massively parallel computing. Around this time Flynn categorized this type of processing as an early form of single instruction, multiple threads (SIMT). International Computers Limited sought to avoid many of the difficulties with the ILLIAC concept with its own Distributed Array Processor (DAP) design, categorising
2910-456: Is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin. A vector processor, by contrast, even if it
3007-458: Is that it has a single instruction decoder-broadcaster but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This
Cray Y-MP - Misplaced Pages Continue
3104-480: Is that vector processors, inherently by definition and design, have always been variable-length since their inception. Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA has: Predicated SIMD (part of Flynn's taxonomy ) which
3201-496: Is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication ( MMX , SSE , AltiVec ) categorically do not. Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and
3298-402: Is very costly in hardware (256 bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a multi-issue execution model , the consequences are that the operations now take longer to complete. If multi-issue
3395-411: The CDC 7600 , but at data-related tasks they could keep up while being much smaller and less expensive. However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. The vector technique was first fully exploited in 1976 by the famous Cray-1 . Instead of leaving
3492-645: The Cray CX1 system, launched in September the same year. This was a deskside blade server system, comprising up to 16 dual- or quad-core Intel Xeon processors, with either Microsoft Windows HPC Server 2008 or Red Hat Enterprise Linux installed. By 2009, the largest computer system Cray had delivered was the Cray XT5 system at National Center for Computational Sciences at Oak Ridge National Laboratories . This system, with over 224,000 processing cores,
3589-633: The Cray S-MP , later replacing it with the Cray CS6400 . In spite of these machines being some of the most powerful available when applied to appropriate workloads, Cray was never very successful in this market, possibly due to it being so foreign to its existing market niche. CCC was building the Cray-3/SSS when it went into Chapter 11 bankruptcy in March 1995. In February 1996, Cray Research
3686-702: The Cray SV1 , was launched. This was a clustered SMP vector processor architecture, developed from J90 technology. On March 2, 2000, Cray was sold to Tera Computer Company , which was renamed Cray Inc. After the Tera merger, the Tera MTA system was relaunched as the Cray MTA-2 . This was not a commercial success and shipped to only two customers. Cray Inc. also unsuccessfully badged the NEC SX-6 supercomputer as
3783-733: The Cray XK7 which supports the Nvidia Kepler GPGPU and announced that the ORNL Jaguar system would be upgraded to an XK7 (renamed Titan ) and capable of over 20 petaflops. Titan was the world's fastest supercomputer as measured by the LINPACK benchmark until the introduction of the Tianhe-2 in 2013, which is substantially faster. In 2011 Cray also announced it had been awarded the $ 188 million Blue Waters contract with
3880-712: The Cray-2 , though it ended up being only marginally faster than the Cray X-MP , developed by another team at the company. Cray soon left the CEO position to become an independent contractor. He started a new Very Large Scale Integration technology lab for the Cray-2 in Boulder, Colorado , Cray Laboratories , in 1979, which closed in 1982; undaunted, Cray later headed a similar spin-off in 1989, Cray Computer Corporation (CCC) in Colorado Springs, Colorado , where he worked on
3977-674: The Cray-3 project—the first attempt at major use of gallium arsenide (GaAs) semiconductors in computing. However, the changing political climate (collapse of the Warsaw Pact and the end of the Cold War ) resulted in poor sales prospects. Ultimately, only one Cray-3 was delivered, and a number of follow-on designs were never completed. The company filed for bankruptcy in 1995. CCC's remains then became Cray's final corporation, SRC Computers, Inc . Cray Research continued development along
4074-668: The United States Department of Energy 's fastest-computer-in-the-world project to build a 50 tera Flops machine for the Oak Ridge National Laboratory . Cray was sued in 2002 by Isothermal Systems Research for patent infringement. The suit claimed that Cray used ISR's patented technology in the development of the Cray X1. The lawsuit was settled in 2003. As of November 2004, the Cray X1 had
4171-502: The University of Illinois at Urbana–Champaign , after IBM had pulled out of the delivery. This system was delivered in 2012 and was the largest system to date, in terms of cabinets and general-purpose x86 processors, that Cray had ever delivered. In November 2011, the Cray Sonexion 1300 Data Storage System was introduced and signaled Cray's entry into the high performance storage business. This product used modular technology and
4268-585: The Videocore IV ISA for a REP field, but unlike the STAR-100 which uses memory for its repeats, the Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small range of power of two or sourced from one of the scalar registers. The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with
4365-541: The price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s. Vector processing development began in the early 1960s at the Westinghouse Electric Corporation in their Solomon project. Solomon's goal was to dramatically increase math performance by using a large number of simple coprocessors under the control of a single master Central processing unit (CPU). The CPU fed
4462-411: The 1980s high performance market. At first, Cray Research denigrated such approaches by complaining that developing software to effectively use the machines was difficult – a true complaint in the era of the ILLIAC IV , but becoming less so each day. Cray eventually realized that the approach was likely the only way forward and started a five-year project to capture the lead in this area: the plan's result
4559-427: The CPU, in the fashion of an assembly line , so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency , but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time. Vector processors take this concept one step further. Instead of pipelining just
4656-664: The CPU, this would look something like this: But to a vector processor, this task looks considerably different: Note the complete lack of looping in the instructions, because it is the hardware which has performed 10 sequential operations: effectively the loop count is on an explicit per-instruction basis. Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL): There are several savings inherent in this approach. Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages. But more than that,
4753-665: The Cray SX-6 and acquired exclusive rights to sell the SX-6 in the US, Canada, and Mexico. In 2002, Cray Inc. announced its first new model, the Cray X1 combined architecture vector processor / massively parallel supercomputer. Previously known as the SV2 , the X1 is the result of the earlier SN2 concept originated during the SGI years. In May 2004, Cray was announced to be one of the partners in
4850-742: The Cray XD1, which required a dedicated socket for the FPGA coprocessor. On November 13, 2006, Cray announced a new system, the Cray XMT , based on the MTA series of machines. This system combined multi-threaded processors, as used on the original Tera systems, and the SeaStar2 interconnect used by the XT4. By reusing ASICs , boards, cabinets, and system software used by the comparatively higher volume XT4 product,
4947-543: The I/O throughput. The Y-shaped chassis was dropped in favor of one or two rectangular cabinets (each with a separate connected cabinet containing the liquid-cooling system), depending on configuration. Maximum RAM was increased to 2 GB and up to eight IOSs were possible. Model E variants included the Y-MP ;2E , Y-MP 4E , Y-MP 8E and Y-MP 8I , the latter being a single-cabinet ( I for Integrated ) version of
5044-669: the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray-1. A computer for operations with functions was presented and developed by Kartsev in 1967. The first vector supercomputers were the Control Data Corporation STAR-100 and Texas Instruments Advanced Scientific Computer (ASC), which were introduced in 1974 and 1972, respectively. The basic ASC (i.e., "one pipe") ALU used
5141-644: The S-1 as the Cray XMS , but the machine proved problematic; meanwhile, the not-yet-completed S-2, a Y-MP clone, was later offered as the Cray Y-MP (later becoming the Cray EL90 ) which started to sell in reasonable numbers in 1991–92—to mostly smaller companies, notably in the oil exploration business. This line evolved into the Cray J90 and eventually the Cray SV1 in 1998. In December 1991, Cray purchased some of
5238-407: The STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Two times the latency is also needed due to the extra requirement of memory access. A modern packed SIMD architecture, known by many names (listed in Flynn's taxonomy ), can do most of the operation in batches. The code is mostly similar to the scalar version. It
5335-794: The X-MP with the release of the COS operating system (SCS) and the CFT Fortran compiler; they also considered the Cray Time Sharing System operating system, developed at United States Department of Energy national laboratories ( LANL / LLNL ), before joining the broader trend toward adoption of Unixes . Today, Cray OS is a specialized version of SUSE Linux Enterprise Server . A series of massively parallel computers from Thinking Machines Corporation , Kendall Square Research , Intel , nCUBE , MasPar and Meiko Scientific took over
5432-671: The XD1, using AMD Opteron processors. On August 8, 2005, Peter Ungaro was appointed CEO. Ungaro had joined Cray in August 2003 as Vice President of Sales and Marketing and had been made Cray's President in March 2005. Introduced in 2006, the Cray XT4 added support for DDR2 memory, newer dual-core and future quad-core Opteron processors and utilized a second generation SeaStar2 communication coprocessor. It also included an option for FPGA chips to be plugged directly into processor sockets, unlike
5529-504: The above action would be described in a single instruction (somewhat like vadd c, a, b, $ 10 ). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too. Broadcom included space in all vector operations of
5626-521: The acquisition of Appro International, Inc. , a California-based privately held developer of advanced scalable supercomputing solutions. As of 2012 the #3 provider on the Top100 supercomputer list, Appro builds some of the world's most advanced high performance computing (HPC) cluster systems. In 2012, Cray also opened a subsidiary in China. On September 25, 2019, Hewlett Packard Enterprise (HPE) acquired
5723-787: The addition of SIMD cannot, by itself, qualify a processor as an actual vector processor , because SIMD is fixed-length , and vectors are variable-length . The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing. Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing). The Fujitsu FR-V VLIW/vector processor combines both technologies. SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these
5820-415: The amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left
5917-556: The assets of Floating Point Systems , another minisuper vendor that had moved into the file server market with its SPARC -based Model 500 line. These symmetric multiprocessing machines scaled up to 64 processors and ran a modified version of the Solaris operating system from Sun Microsystems . Cray set up Cray Research Superservers, Inc. (later the Cray Business Systems Division ) to sell this system as
6014-593: the company for $1.3 billion. In October 2020, HPE was awarded the contract to build the pre-exascale EuroHPC computer LUMI, in Kajaani, Finland. The contract, worth €144.5 million, is for an HPE Cray EX system, with a theoretical maximum performance of 550 petaflops. Once fully operational, LUMI will become one of the fastest supercomputers in the world. On June 28, 2022, the US National Oceanic and Atmospheric Administration (NOAA) inaugurated
6111-490: The company was founded by computer designer Seymour Cray as Cray Research, Inc., and it continues to manufacture parts in Chippewa Falls, Wisconsin , where Cray was born and raised. After being acquired by Silicon Graphics in 1996, the modern company was formed after being purchased in 2000 by Tera Computer Company , which adopted the name Cray Inc. In 2019, the company was acquired by Hewlett Packard Enterprise for $ 1.3 billion. In 1950, Seymour Cray began working in
6208-464: The computing field when he joined Engineering Research Associates (ERA) in Saint Paul, Minnesota . There, he helped to create the ERA 1103 . ERA eventually became part of UNIVAC , and began to be phased out. In 1960, he left the company, a few years after former ERA employees set up Control Data Corporation (CDC). He initially worked out of the CDC headquarters in Minneapolis, but grew upset by constant interruptions by managers. He eventually set up
6305-496: The cost of making the very specialized MTA system could be reduced. A second generation of the XMT is scheduled for release in 2011, with the first system ordered by the Swiss National Supercomputing Center (CSCS). In 2006, Cray announced a vision of products dubbed Adaptive Supercomputing . The first generation of such systems, dubbed the Rainier Project , used a common interconnect network (SeaStar2), programming environment, cabinet design, and I/O subsystem. These systems included
6402-409: The data in memory like the STAR-100 and ASC, the Cray design had eight vector registers , which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to the next operation, the Cray design would load a smaller section of
6499-431: The decoding of the more common instructions such as normal adding. ( This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 vector instructions even with the advanced features. ) Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers , as
6596-431: The design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept was sound, and, when used on data-intensive applications, such as computational fluid dynamics , the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element
6693-482: The difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in C : In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability. The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop: The STAR-like code remains concise, but because
6790-559: The early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC. As of 2016, most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition,
6887-418: The entire group of results. In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be, in theory at least, encoded directly into the instruction. However, in an efficient implementation, things are rarely that simple. The data is rarely sent in raw form, and
6984-509: The existing XT4 and the XMT. The second generation, launched as the XT5h, allowed a system to combine compute elements of various types into a common system, sharing infrastructure. The XT5h combined Opteron, vector, multithreaded, and FPGA compute processors in a single system. In April 2008, Cray and Intel announced they would collaborate on future supercomputer systems. This partnership produced
7081-447: The instruction itself that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time. To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To
7178-416: The instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of
7275-423: The late 1980s and early 1990s, which out-competed low-end Cray machines in the market. The Convex Computer series, as well as a number of small-scale parallel machines from companies like Pyramid Technology and Alliant Computer Systems, were particularly popular. One such vendor was Supertek, whose S-1 machine was an air-cooled CMOS implementation of the X-MP processor. Cray purchased Supertek in 1990 and sold
7372-467: The memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in GPUs, which face exactly the same issue. Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using
7469-450: The middle (containing the CPU boards), thus forming a "Y" shape in plan view. The system could be configured with one or two Model D IOSs (Input/Output Subsystems) and an optional Solid State Disk (SSD) of 256 MB to 4 GB capacity. The Y-MP had a measured performance of 2.144 GFLOPS and a peak performance of 2.667 GFLOPS in both 1988 and 1989. The Model D Y-MP was superseded in 1990 by the Y-MP Model E, which replaced IOS Model D with IOS Model E, providing twice
7566-671: The model name was abbreviated to the Cray M90 series. The Y-MP C90 series is described separately. In 1992, Cray launched the cheaper Y-MP EL (Entry Level) model. This was a reimplementation of the Y-MP architecture in CMOS technology, based on the S-2 design acquired by Cray from Supertek Computers in 1990. The EL was an air-cooled system with a completely different VMEbus-based IOS. EL configurations with up to four processors (each with
7663-562: The nation’s newest weather and climate supercomputers, two HPE Cray supercomputers installed and operated by General Dynamics (GDIT). Each supercomputer operates at 12.1 petaflops. On November 18, 2024, the US National Nuclear Security Administration (NNSA) unveiled an HPE Cray supercomputer for use in nuclear weapons analysis and inertial confinement fusion design. The supercomputer is housed at Lawrence Livermore National Laboratory (LLNL), and
7760-536: The normal scalar pipeline. Modern vector processors (such as the SX-Aurora TSUBASA) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and those same masks can be used to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (AVX-512, ARM SVE2) are capable of this kind of selective, per-element ("predicated") processing, and it
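The per-element masking described above can be modelled in plain C. The sketch below is a scalar model under stated assumptions, not AVX-512 or SVE2 code: the function name `masked_add` and the choice of a 64-bit mask with one bit per lane are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a predicated vector add (hypothetical helper).
 * Lane i is computed and stored only when bit i of mask is set;
 * masked-off lanes leave dst[i] unchanged. Supports up to 64 lanes. */
void masked_add(size_t lanes, uint64_t mask,
                const int32_t *a, const int32_t *b, int32_t *dst)
{
    assert(lanes <= 64);
    for (size_t i = 0; i < lanes; i++) {
        if ((mask >> i) & 1u) {
            dst[i] = a[i] + b[i];  /* active lane */
        }
        /* inactive lane: no load-modify-store of dst[i] */
    }
}
```

With four lanes and `mask = 0x5`, only lanes 0 and 2 are updated. In hardware the loop body would run across all lanes at once, with the mask gating both the ALU elements and the memory writes.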
7857-427: The number of units sold was small compared to ordinary mainframes. This perception extended to countries as well: to boost the perception of exclusivity, Cray Research's marketing department had promotional neckties made with a mosaic of tiny national flags illustrating the "club of Cray-operating countries". New vendors introduced small supercomputers, known as minisupercomputers (as opposed to superminis) during
7954-462: The performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as
8051-546: The pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units. In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do
8148-524: The previous arrangement, Cray kept the research and development facilities in Chippewa Falls, and put the business headquarters in Minneapolis. The company's first product, the Cray-1 supercomputer, was a major success because it was significantly faster than all other computers at the time. The first system was sold within a month for $8.8 million. Seymour Cray continued working, this time on
8245-508: The scalar argument across a SIMD register: Cray Cray Inc., a subsidiary of Hewlett Packard Enterprise, is an American supercomputer manufacturer headquartered in Seattle, Washington. It also manufactures systems for data storage and analytics. Several Cray supercomputer systems are listed in the TOP500, which ranks the most powerful supercomputers in the world. In 1972,
8342-461: The supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV, the efficiency of vector ISAs brings other benefits which are compelling even for embedded use cases. The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For
8439-436: The supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC)) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers. Throughout, Cray continued to be
8536-414: The time required to fetch the data from memory. Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down
8633-518: The two-cabinet 8E. The 2E and 4E were later available with optional secondary air cooling. The Y-MP M90 was a large-memory variant of the Y-MP Model E introduced in 1992. This replaced the SRAM of the Y-MP with up to 32 GB of slower, but physically smaller DRAM devices. The Y-MP M90 was also available in variants with up to two, four or eight processors (M92, M94 and M98 respectively). Later,
8730-450: The vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction was implemented in different hardware than multiplication. This allowed
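The register-based approach described above, where a section of the vector is loaded, several operations are applied to it, and only then is it written back, can be sketched in C. This is an illustrative model only: the helper name `axpy_strip_mined` and the constant `VREG_LEN` are assumptions, and the local arrays merely stand in for 64-word vector registers.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define VREG_LEN 64  /* assumed register length: 64 words, as in the Cray design */

/* Hypothetical helper: computes y[i] = a * x[i] + y[i] one
 * register-sized section at a time. */
void axpy_strip_mined(size_t n, int64_t a, const int64_t *x, int64_t *y)
{
    int64_t vx[VREG_LEN], vy[VREG_LEN];  /* stand-ins for two vector registers */

    assert(n == 0 || (x != NULL && y != NULL));
    for (size_t base = 0; base < n; base += VREG_LEN) {
        size_t len = (n - base < VREG_LEN) ? (n - base) : VREG_LEN;

        /* Vector load: bring one section of x and y into the registers. */
        for (size_t i = 0; i < len; i++) {
            vx[i] = x[base + i];
            vy[i] = y[base + i];
        }
        /* Vector multiply, register to register. */
        for (size_t i = 0; i < len; i++) {
            vx[i] *= a;
        }
        /* Vector add, register to register; no memory traffic here. */
        for (size_t i = 0; i < len; i++) {
            vy[i] += vx[i];
        }
        /* Vector store: write the finished section back once. */
        for (size_t i = 0; i < len; i++) {
            y[base + i] = vy[i];
        }
    }
}
```

Each inner loop corresponds to one vector instruction. A memory-to-memory design in the STAR-100 style would instead run the multiply over the whole array in memory before starting the add, so the intermediate values would make a full round trip through main memory.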
8827-542: Was acquired by Silicon Graphics (SGI) for $ 740 million. In May 1996, SGI sold the Superservers business to Sun. Sun then turned the UltraSPARC-based Starfire project then under development into the extremely successful Sun Enterprise 10000 range of servers. SGI used several Cray technologies in its attempt to move from the graphics workstation market into supercomputing. Key among these
8924-630: Was dubbed Jaguar and was the fastest computer in the world as measured by the LINPACK benchmark at the speed of 1.75 petaflops until being surpassed by the Tianhe-1A in October 2010. It was the first system to exceed a sustained performance of 1 petaflops on a 64-bit scientific application. In May 2010, the Cray XE6 supercomputer was announced. The Cray XE6 system had at its core the new Gemini system interconnect. This new interconnect included
9021-624: Was originally intended to unify all high-end/supercomputer product lines including the T90 into a single architecture. This goal was never achieved before SGI divested itself of the Cray business, and the SN2 name was later associated with the SN-IA or SGI Altix 3000 architecture. In October 1996, founder Seymour Cray died as a result of a traffic accident. In 1998, under SGI ownership, one new Cray model line,
9118-518: Was previously known as the OctigaBay 12K before Cray's acquisition of that company. The XD1 provided one Xilinx Virtex II Pro field-programmable gate array (FPGA) with each node of four Opteron processors. The FPGAs could be configured to embody various digital hardware designs and could augment the processing or input/output capabilities of the Opteron processors. Furthermore, each FPGA contains
9215-771: Was scheduled to be introduced in early 2013 and designed to use the next-generation network chip and follow-on to Gemini, code named Aries. In early 2010, Cray also introduced the Cray CX1000, a rack-mounted system with a choice of compute-based, GPU-based, or SMP-based chassis. The CX1 and CX1000 product lines were sold until late 2011. In 2011, Cray announced the Cray XK6 hybrid supercomputer. The Cray XK6 system, capable of scaling to 500,000 processors and 50 petaflops of peak performance, combines Cray's Gemini interconnect, AMD's multi-core scalar processors, and Nvidia's Tesla GPGPU processors. In October 2012 Cray announced
9312-499: Was the Digital Equipment Corporation Alpha-based Cray T3D and Cray T3E series, which left Cray as the only remaining supercomputer vendor in the market besides NEC's SX architecture by 2000. Most sites with a Cray installation were considered members of the "exclusive club" of Cray operators. Cray computers were considered quite prestigious because Crays were extremely expensive machines, and
9409-596: Was the use of the Cray-developed HIPPI computer bus and details of the interconnects used in the T3 series. SGI's long-term strategy was to merge its high-end server line with Cray's product lines in two phases, code-named SN1 and SN2 (SN standing for "Scalable Node"). The SN1 was intended to replace the T3E and SGI Origin 2000 systems and later became the SN-MIPS or SGI Origin 3000 architecture. The SN2