The CDC Cyber range of mainframe -class supercomputers were the primary products of Control Data Corporation (CDC) during the 1970s and 1980s. In their day, they were the computer architecture of choice for scientific and mathematically intensive computing. They were used for modeling fluid flow, material science stress analysis, electrochemical machining analysis, probabilistic analysis, energy and academic computing, radiation shielding modeling, and other applications. The lineup also included the Cyber 18 and Cyber 1000 minicomputers . Like their predecessor, the CDC 6600 , they were unusual in using the ones' complement binary representation.
91-467: The Cyber line included five different series of computers: Primarily aimed at large office applications instead of the traditional supercomputer tasks, some of the Cyber machines nevertheless included basic vector instructions for added performance in traditional CDC roles. The Cyber 70 and 170 architectures were successors to the earlier CDC 6600 and CDC 7600 series and therefore shared almost all of
182-625: A LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design. Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs
273-466: A certain class of math tasks. The STAR's vector pipeline is a memory to memory pipe, which supports vector lengths of up to 65,536 elements. The latencies of the vector pipeline are very long, so peak speed is approached only when very long vectors are used. The scalar processor was deliberately simplified to provide room for the vector processor and is relatively slow in comparison to the CDC 7600 . As such,
364-540: A co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions. Modern graphics processing units ( GPUs ) include an array of shader pipelines which may be driven by compute kernels , and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in Flynn's 1972 paper the key distinguishing factor of SIMT-based GPUs
455-513: A corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR-100 was otherwise slower than CDC's own supercomputers like the CDC 7600 , but at data-related tasks they could keep up while being much smaller and less expensive. However the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. The vector technique
546-544: A direct memory interconnect architecture (MIA), this was available on NOS 2.2 for the Cyber 170/835, 845, 855 and 180/990 models. Physically, each Cyberplus processor unit was of typical mainframe module size, similar to the Cyber 180 systems, with the exact width dependent on whether the optional FPU was installed, and weighed approximately 1 tonne . Some sites using the Cyberplus were the University of Georgia and
637-468: A greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector Length. The self-repeating instructions are found in early vector computers like the STAR-100, where
728-530: A hardware network for conditional microinstruction execution , with four mask registers and a condition-hold register; three bits in the microinstruction format select among nearly 50 conditions for determining execution, including result sign and overflow, I/O conditions, and loop control. At least 21 Cyberplus multiprocessor installations were operational in 1986. These parallel processing systems include from 1 to 256 Cyberplus processors providing 250 MFLOPS each, which are connected to an existing Cyber system via
819-421: A high performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus completed far faster overall, the limiting factor being
910-458: A high-level Pascal -like language, was developed for the project with the intent that all languages and the operating system (IPLOS) were going to be written in SWL. SWL was later renamed PASCAL-X and eventually became Cybil . The joint venture was abandoned in 1976, with CDC continuing system development and renaming the Cyber 80 as Cyber 180. The first machines of the series were announced in 1982 and
1001-482: A pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads". This example starts with an algorithm ("IAXPY"), first show it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate
SECTION 10
#17327911592771092-683: A single card). The XN20 was in pre-production stage when the Communication Systems Division was shut down in 1992. Jack Ralph was the chief architect of the Cyber 1000-2, XN-10 and XN-20 systems. Dan Nay was the chief engineer of the XN-20. Vector processor In computing , a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors . This
1183-428: A special instruction, the significance compared to Videocore IV (and, crucially as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity,
1274-541: A vector processor. Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM 2 module within a card that physically resembles a graphics coprocessor, but instead of serving as
1365-406: Is unable by design to cope with iteration and reduction. This is illustrated further with examples, below. Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput and having less latency than SIMD, through vector chaining . Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing
1456-611: Is significantly more complex and involved than "Packed SIMD" , which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture. Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as
1547-521: Is single-issue and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the Cray-1 , the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to vector chaining , is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to
1638-642: Is a 16-bit minicomputer which was a successor to the CDC 1700 minicomputer. It was mostly used in real-time environments. One noteworthy application is as the basis of the 2550—a communications processor used by CDC 6000 series and Cyber 70/Cyber 170 mainframes. The 2550 was a product of CDC's Communications Systems Division, in Santa Ana, California (STAOPS). STAOPS also produced another communication processor (CP), used in networks hosted by IBM mainframes. This M1000 CP, later renamed C1000, came from an acquisition of Marshall MDM Communications. A three-board set
1729-525: Is a late-1970s refresh of the Cyber-170 line. The central processor (CPU) and central memory (CM) operated in units of 60-bit words . In CDC lingo, the term "byte" referred to 12-bit entities (which coincided with the word size used by the peripheral processors). Characters were six bits, operation codes were six bits, and central memory addresses were 18 bits. Central processor instructions were either 15 bits or 30 bits. The 18-bit addressing inherent to
1820-416: Is assumed that both x and y are properly aligned here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can. If it does not, a "splat" (broadcast) must be used, to copy
1911-505: Is comprehensive individual element-level predicate masks on every vector instruction as is now available in ARM SVE2. And AVX-512 , almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable length vectors. Examples below help explain these categorical distinctions. SIMD, because it uses fixed-width batch processing,
SECTION 20
#17327911592772002-521: Is in contrast to scalar processors , whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SIMD within a register (SWAR) Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators . Vector machines appeared in
2093-448: Is inherent in the design. As with predecessor systems, the Cyber 170 series has eight 18-bit address registers (A0 through A7), eight 18-bit index registers (B0 through B7), and eight 60-bit operand registers (X0 through X7). Seven of the A registers are tied to their corresponding X register. Setting A1 through A5 reads that address and fetches it into the corresponding X1 through X5 register. Likewise, setting register A6 or A7 writes
2184-489: Is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Random-access memory § Memory wall . In order to reduce
2275-404: Is needed. By interleaving independent instructions between the memory fetch instruction and the instructions manipulating the fetched operand, the time occupied by the memory fetch can be used for other computation. With this technique, coupled with the handcrafting of tight loops that fit within the instruction stack, a skilled Cyber assembly programmer can write extremely efficient code that makes
2366-456: Is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin. A vector processor, by contrast, even if it
2457-458: Is that it has a single instruction decoder-broadcaster but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own Load/Store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other they do so with completely different data from completely different memory locations. This
2548-480: Is that vector processors, inherently by definition and design, have always been variable-length since their inception. Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA has: Predicated SIMD (part of Flynn's taxonomy ) which
2639-496: Is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication ( MMX , SSE , AltiVec ) categorically do not. Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised Instruction Unit. The "vector registers" are very wide and
2730-402: Is very costly in hardware (256 bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a multi-issue execution model , the consequences are that the operations now take longer to complete. If multi-issue
2821-539: The Control Data Corporation STAR-100 and Texas Instruments Advanced Scientific Computer (ASC), which were introduced in 1974 and 1972, respectively. The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with
CDC Cyber - Misplaced Pages Continue
2912-585: The Videocore IV ISA for a REP field, but unlike the STAR-100 which uses memory for its repeats, the Videocore IV repeats are on all operations including arithmetic vector operations. The repeat length can be a small range of power of two or sourced from one of the scalar registers. The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with
3003-483: The 180-mode exchange package, there is a virtual machine identifier (VMID) that determines whether the 8/16/64-bit two's complement 180 instruction set or the 12/60-bit ones' complement 170 instruction set is executed. There were three true 180s in the initial lineup, codenamed P1, P2, P3. P2 and P3 were larger water-cooled designs. The P2 was designed in Mississauga , Ontario , by the same team that later designed
3094-456: The 72 character string starting at word 1000 character 3 to location 2000 character 9". The CMU hardware is not included in the higher-end Cyber CPUs, because hand coded loops could run as fast or faster than the CMU instructions. Later systems typically run CDC's NOS (Network Operating System). Version 1 of NOS continued to be updated until about 1981; NOS version 2 was released early 1982, with
3185-656: The 800-series' true capabilities to its customers, and the true 180s were relabeled as the 180/825 (P1), 180/835 (P2), and 180/855 (P3). At some point, the model 815 was introduced with the delayed microcode and the faster microcode was restored to the model 825. Eventually the THETA was released as the Cyber 990 . In 1974, CDC introduced the STAR architecture. The STAR is an entirely new 64-bit design with virtual memory and vector processing instructions added for high performance on
3276-591: The Advanced Systems Laboratory, a joint CDC/NCR development venture started in 1973 and located in Escondido, California. The machine family was originally called Integrated Product Line (IPL) and was intended to be a virtual memory replacement for the NCR 6150 and CDC Cyber 70 product lines. The IPL system was also called the Cyber 80 in development documents. The Software Writer's Language (SWL),
3367-454: The CDC 6500. As was the case with the CDC 6200, CDC also offered a Cyber-72. The Cyber-72 had identical hardware to a Cyber-73, but added additional clock cycles to each instruction to slow it down. This allowed CDC to offer a lower performance version at a lower price point without the need to develop new hardware. It could also be delivered with dual CPUs. The Cyber 74 was an updated version of
3458-544: The CDC 6600. The Cyber 76 was essentially a renamed CDC 7600 . Neither the Cyber-74 nor the Cyber-76 had CMU instructions. The Cyber-170 series represented CDCs move from discrete electronic components and core memory to integrated circuits and semiconductor memory . The 172, 173, and 174 use integrated circuits and semiconductor memory whereas the 175 uses high-speed discrete transistors. The Cyber-170/700 series
3549-427: The CPU, in the fashion of an assembly line , so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency , but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time. Vector processors take this concept one step further. Instead of pipelining just
3640-664: The CPU, this would look something like this: But to a vector processor, this task looks considerably different: Note the complete lack of looping in the instructions, because it is the hardware which has performed 10 sequential operations: effectively the loop count is on an explicit per-instruction basis. Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL): There are several savings inherent in this approach. Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below) which brings even more advantages. But more than that,
3731-463: The Cyber 170 series imposed a limit of 262,144 (256K) words of main memory, which is semiconductor memory in this series. The central processor has no I/O instructions, relying upon the peripheral processor (PP) units to do I/O. A Cyber 170-series system consists of one or two CPUs that run at either 25 or 40 MHz, and is equipped with 10, 14, 17, or 20 peripheral processors (PP), and up to 24 high-performance channels for high-speed I/O . Due to
CDC Cyber - Misplaced Pages Continue
3822-601: The Cyber 18, known as the MP32, that was 32-bit instead of 16-bit was created for the National Security Agency for crypto-analysis work. The MP32 had the Fortran math runtime library package built into its microcode. The Soviet Union tried to buy several of these systems and they were being built when the U.S. Government cancelled the order. The parts for the MP32 were absorbed into the Cyber 18 production. One of
3913-531: The Gesellschaft für Trendanalysen (GfTA) ( Association for Trend Analyses ) in Germany. A fully configured 256 processor Cyberplus system would have a theoretical performance of 64 GFLOPS, and weigh around 256 tonnes. A nine-unit system was reputedly capable of performing comparative analysis (including pre-processing convolutions) on 1 megapixel images at a rate of one image pair per second. The Cyber 18
4004-513: The P2 and P3. The peripheral processors in the true 180s are always 16-bit machines with the sign bit determining whether a 16/64 bit or 12/60 bit PP instruction is being executed. The single word I/O instructions in the PPs are always 16-bit instructions, so at deadstart the PPs can set up the proper environment to run both EI plus NOS and the customer's existing 170-mode software. To hide this process from
4095-602: The STAR's vector pipeline. Best estimates claim that two Cyber 203s were delivered or upgraded from STAR-100s. In 1980, the successor to the Cyber 203, the Cyber 205 was announced. The UK Meteorological Office at Bracknell , England was the first customer and they received their Cyber 205 in 1981. The Cyber 205 replaces the STAR vector pipeline with redesigned vector pipelines: both scalar and vector units utilized ECL gate array ICs and are cooled with Freon . Cyber 205 systems were available with two or four vector pipelines, with
4186-407: The STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Two times the latency is also needed due to the extra requirement of memory access. A modern packed SIMD architecture, known by many names (listed in Flynn's taxonomy ), can do most of the operation in batches. The code is mostly similar to the scalar version. It
4277-504: The above action would be described in a single instruction (somewhat like vadd c, a, b, $ 10 ). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too. Broadcom included space in all vector operations of
4368-787: The addition of SIMD cannot, by itself, qualify a processor as an actual vector processor , because SIMD is fixed-length , and vectors are variable-length . The difference is illustrated below with examples, showing and comparing the three categories: Pure SIMD, Predicated SIMD, and Pure Vector Processing. Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing). The Fujitsu FR-V VLIW/vector processor combines both technologies. SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these
4459-415: The amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left
4550-475: The corresponding X6 or X7 register to central memory at the address written to the A register. A0 is effectively a scratch register. The higher-end CPUs consisted of multiple functional units (e.g., shift, increment, floating add) which allowed some degree of parallel execution of instructions. This parallelism allows assembly programmers to minimize the effects of the system's slow memory fetch time by pre-fetching data from central memory well before that data
4641-413: The customer, earlier in the 1980s CDC had ceased distribution of the source code for its Deadstart Diagnostic Sequence (DDS) package and turned it into the proprietary Common Tests & Initialization (CTI) package. The initial 170/800 lineup was: 170/825 (P1), 170/835 (P2), 170/855 (P3), 170/865 and 170/875. The 825 was released initially after some delay loops had been added to its microcode; it seemed
SECTION 50
#17327911592774732-431: The decoding of the more common instructions such as normal adding. ( This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 vector instructions even with the advanced features. ) Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers , as
4823-464: The design folks in Toronto had done a little too well and it was too close to the P2 in performance. The 865 and 875 models were revamped 170/760 heads (one or two processors with 6600/7600-style parallel functional units) with larger memories. The 865 used normal 170 memory; the 875 took its faster main processor memory from the Cyber 205 line. A year or two after the initial release, CDC announced
4914-482: The difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in C : In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability. The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop: The STAR-like code remains concise, but because
5005-543: The difficulties with the ILLIAC concept with its own Distributed Array Processor (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray 1. A computer for operations with functions was presented and developed by Kartsev in 1967. The first vector supercomputers are
5096-414: The earlier architecture's characteristics. The Cyber-70 series is a minor upgrade from the earlier systems. The Cyber-73 was largely the same hardware as the CDC 6400 - with the addition of a Compare and Move Unit (CMU). The CMU instructions speeded up comparison and moving of non-word aligned 6-bit character data. The Cyber-73 could be configured with either one or two CPUs. The dual CPU version replaced
5187-507: The early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s. Vector processing development began in the early 1960s at the Westinghouse Electric Corporation in their Solomon project. Solomon's goal
5278-635: The early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC . As of 2016 most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition,
5369-418: The entire group of results. In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be—in theory at least—encoded directly into the instruction. However, in efficient implementation things are rarely that simple. The data is rarely sent in raw form, and
5460-488: The final version of 2.8.7 PSR 871, delivered in December 1997, which continues to have minor unofficial bug fixes, Y2K mitigation, etc in support of DtCyber. Besides NOS, the only other operating systems commonly used on the 170 series was NOS/BE or its predecessor SCOPE , a product of CDC's Sunnyvale division. These operating systems provide time-sharing of batch and interactive applications. The predecessor to NOS
5551-463: The form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted by the University of Illinois at Urbana–Champaign as the ILLIAC IV . Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept
SECTION 60
#17327911592775642-571: The four-pipe version theoretically delivering 400 64-bit MFLOPs and 800 32-bit MFLOPs. These speeds are rarely seen in practice other than by handcrafted assembly language . The ECL gate array ICs contain 168 logic gates each, with the clock tree networks being tuned by hand-crafted coax length adjustment. The instruction set would be considered V- CISC (very complex instruction set) among modern processors. Many specialized operations facilitate hardware searches, matrix mathematics, and special instructions that enable decryption. The original Cyber 205
5733-441: The hardware. The older 60-bit operating systems, NOS and NOS/BE , could run in a special address space for compatibility with the older systems. The true 180-mode machines are microcoded processors that can support both instruction sets simultaneously. Their hardware is completely different from the earlier 6000/70/170 machines. The small 170-mode exchange package was mapped into the much larger 180-mode exchange package; within
5824-554: The high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies ( Fujitsu , Hitachi and Nippon Electric Corporation (NEC) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon -based Floating Point Systems (FPS) built add-on array processors for minicomputers , later building their own minisupercomputers . Throughout, Cray continued to be
5915-447: The instruction itself that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time. To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To
6006-416: The instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of
6097-467: The memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in GPUs , which face exactly the same issue. Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using
6188-628: The mid-to-late 1980s for data communications. In the late 1980s the XN10 was released with an improved processor (a direct memory access instruction was added) as well as a size reduction from two cabinets to one. The XN20 was an improved version of the XN10 with a much smaller footprint. The Line Termination Sub-System was redesigned to use the improved Z180 microprocessor (the Buffer Controller card, Programmable Line Controller card and two Communication Line Interface cards were incorporated on to
6279-492: The most of the power of the hardware. The peripheral processor subsystem uses a technique known as barrel and slot to share the execution unit; each PP had its own memory and registers, but the processor (the slot) itself executed one instruction from each PP in turn (the barrel). This is a crude form of hardware multiprogramming . The peripheral processors have 4096 bytes of 12-bit memory words and an 18-bit accumulator register. Each PP has access to all I/O channels and all of
6370-497: The next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions, for example, addition/subtraction
6461-536: The normal scalar pipeline. Modern vector processors (such as the SX-Aurora TSUBASA ) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and use those same masks to selectively disable processing element of SIMD ALUs. Some processors with SIMD ( AVX-512 , ARM SVE2 ) are capable of this kind of selective, per-element ( "predicated" ) processing, and it
6552-498: The original STAR proved to be a great disappointment when it was released (see Amdahl's Law ). Best estimates claim that three STAR-100 systems were delivered. It appeared that all of the problems in the STAR were solvable. In the late 1970s, CDC addressed some of these issues with the Cyber 203 . The new name kept with their new branding, and perhaps to distance itself from the STAR's failure. The Cyber 203 contains redesigned scalar processing and loosely coupled I/O design, but retains
6643-514: The performance leader, continually beating the competition with a series of machines that led to the Cray-2 , Cray X-MP and Cray Y-MP . Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as
6734-546: The pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units. In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do
6825-407: The primary control program is a 180-mode program known as Environmental Interface (EI). The 170 operating system (NOS) used a single, large, fixed page within the main memory. There were a few clues that an alert user could pick up on, such as the "building page tables" message that flashed on the operator's console at startup and deadstart panels with 16 (instead of 12) toggle switches per PP word on
6916-454: The product announcement for the NOS/VE operating system occurred in 1983. As the computing world standardized to an eight-bit byte size, CDC customers started pushing for the Cyber machines to do the same. The result was a new series of systems that could operate in both 60- and 64-bit modes. The 64-bit operating system was called NOS/VE , and supported the virtual memory capabilities of
7007-607: The relatively slow memory reference times of the CPU (in some models, memory reference instructions were slower than floating-point divides), the higher-end CPUs (e.g., Cyber-74, Cyber-76, Cyber-175, and Cyber-176) are equipped with eight or twelve words of high-speed memory used as an instruction cache. Any loop that fit into the cache (which is usually called in-stack ) runs very fast, without referencing main memory for instruction fetch. The lower-end models do not contain an instruction stack. However, since up to four instructions are packed into each 60-bit word, some degree of prefetching
7098-418: The rest of the 15- and 30-bit instructions, these are 60-bit instructions (three actually use all 60 bits, the other use 30 bits, but its alignment requires 60 bits to be used). The instructions are: move a short string, move a long string, compare strings, and compare a collated string. They operate on six-bit fields (numbered 1 through 10) in central memory. For example, a single instruction can specify "move
7189-705: The smaller P1, and the P3 was designed in Arden Hills, Minnesota . The P1 was a novel air-cooled, 60-board cabinet designed by a group in Mississauga; the P1 ran on 60 Hz current (no motor-generator sets needed). A fourth high-end 180 model 990 (codenamed THETA) was also under development in Arden Hills. The 180s were initially marketed as 170/8xx machines with no mention of the new 8/64-bit system inside. However,
7280-461: The supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV the efficiency of vector ISAs brings other benefits which are compelling even for Embedded use-cases. The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For
7371-630: The system's central memory (CM) in addition to the PP's own memory. The PP instruction set lacks, for example, extensive arithmetic capabilities and does not run user code; the peripheral processor subsystem's purpose is to process I/O and thereby free the more powerful central processor unit(s) to running user computations. A feature of the lower Cyber CPUs is the Compare Move Unit (CMU). It provides four additional instructions intended to aid text processing applications. In an unusual departure from
7462-414: The time required to fetch the data from memory. Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower—i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down
7553-759: The uses of the Cyber 18 was monitoring the Alaskan Pipeline. The M1000 / C1000, later renamed Cyber 1000, was used as a message store and forward system used by the Federal Reserve System. A version of the Cyber 1000 with its hard drive removed was used by Bell Telephone. This was a RISC processor ( reduced instruction set computer ). An improved version known as the Cyber 1000-2 with the Line Termination Sub-System added 256 Zilog Z80 microprocessors . The Bell Operating Companies purchased large numbers of these systems in
7644-438: Was Kronos which was in common use up until 1975 or so. Due to the strong dependency of developed applications on the particular installation's character set, many installations chose to run the older operating systems rather than convert their applications. Other installations would patch newer versions of the operating system to use the older character set to maintain application compatibility. Cyber 180 development began in
7735-760: Was added to the Cyber 18 to create the 2550. The Cyber 18 was generally programmed in Pascal and assembly language ; FORTRAN , BASIC , and RPG II were also available. Operating systems included RTOS (Real-Time Operating System), MSOS 5 (Mass Storage Operating System), and TIMESHARE 3 ( time-sharing system). "Cyber 18-17" was just a new name for the System 17, based on the 1784 processor. Other Cyber 18s (Cyber 18-05, 18-10, 18-20, and 18-30) had microprogrammable processors with up to 128K words of memory, four additional general registers, and an enhanced instruction set. The Cyber 18-30 had dual processors. A special version of
7826-418: Was first fully exploited in 1976 by the famous Cray-1 . Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight vector registers , which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to
7917-508: Was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining . The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era. Other examples followed. Control Data Corporation tried to re-enter
8008-555: Was renamed to Cyber 205 Series 400 in 1983 when the Cyber 205 Series 600 was introduced. The Series 600 differs in memory technology and packaging but is otherwise the same. A single four-pipe Cyber 205 was installed. All other sites appear to be two-pipe installations with final count to be determined. The Cyber 205 architecture evolved into the ETA10 as the design team spun off into ETA Systems in September 1983. A final development
8099-514: Was sound, and, when used on data-intensive applications, such as computational fluid dynamics , the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing. Around this time Flynn categorized this type of processing as an early form of single instruction, multiple threads (SIMT). International Computers Limited sought to avoid many of
8190-661: Was the Cyber 250, which was scheduled for release in 1987 priced at $ 20 million; it was later renamed the ETA30 after ETA Systems was absorbed back into CDC. Each Cyberplus (aka Advanced Flexible Processor, AFP) is a 16-bit processor with optional 64-bit floating point capabilities and has 256 K or 512 K words of 64-bit memory. The AFP was the successor to the Flexible Processor (FP), whose design development started in 1972 under black-project circumstances targeted at processing radar and photo image data. The FP control unit had
8281-471: Was to dramatically increase math performance by using a large number of simple coprocessors under the control of a single master Central processing unit (CPU). The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set , fed in
#276723