Multimedia Acceleration eXtensions

Article snapshot taken from Wikipedia, available under the Creative Commons Attribution-ShareAlike license.

The Multimedia Acceleration eXtensions or MAX are instruction set extensions to the Hewlett-Packard PA-RISC instruction set architecture (ISA). MAX was developed to improve the performance of multimedia applications that were becoming more prevalent during the 1990s.

MAX instructions operate on 32- or 64-bit SIMD data types consisting of multiple 16-bit integers packed in general-purpose registers. The available functionality includes additions, subtractions and shifts. The first version, MAX-1, was for the 32-bit PA-RISC 1.1 ISA. The second version, MAX-2, was for the 64-bit PA-RISC 2.0 ISA. The approach is notable because the set of instructions

A LOAD, ADD, MULTIPLY and STORE sequence. If the SIMD width is 4, then the SIMD processor must LOAD four elements entirely before it can move on to the ADDs, must complete all the ADDs before it can move on to the MULTIPLYs, and likewise must complete all of the MULTIPLYs before it can start the STOREs. This is by definition and by design. Having to perform 4-wide simultaneous 64-bit LOADs and 64-bit STOREs

A SIMD processor somewhere in its architecture. The PlayStation 2 was unusual in that one of its vector-float units could function as an autonomous DSP executing its own instruction stream, or as a coprocessor driven by ordinary CPU instructions. 3D graphics applications tend to lend themselves well to SIMD processing as they rely heavily on operations with 4-dimensional vectors. Microsoft's Direct3D 9.0 now chooses at runtime processor-specific implementations of its own math operations, including

A SIMD processor there are two improvements to this process. For one the data is understood to be in blocks, and a number of values can be loaded all at once. Instead of a series of instructions saying "retrieve this pixel, now retrieve the next pixel", a SIMD processor will have a single instruction that effectively says "retrieve n pixels" (where n is a number that varies from design to design). For

A co-processor, it is the main computer with the PC-compatible computer into which it is plugged serving support functions. Modern graphics processing units (GPUs) include an array of shader pipelines which may be driven by compute kernels, and can be considered vector processors (using a similar strategy for hiding memory latencies). As shown in Flynn's 1972 paper the key distinguishing factor of SIMT-based GPUs

A common operation in many multimedia applications. One example would be changing the brightness of an image. Each pixel of an image consists of three values for the brightness of the red (R), green (G) and blue (B) portions of the color. To change the brightness, the R, G and B values are read from memory, a value is added to (or subtracted from) them, and the resulting values are written back out to memory. Audio DSPs would likewise, for volume control, multiply both Left and Right channels simultaneously. With
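To make the operation concrete, here is a minimal scalar sketch of the brightness adjustment just described, assuming 8-bit channels packed as an interleaved RGB byte array (the function name and layout are illustrative, not from the original article). The clamping mirrors the saturating arithmetic that multimedia SIMD instructions typically provide in hardware:

```c
#include <stddef.h>
#include <stdint.h>

/* Add `delta` to every R, G and B byte of an interleaved RGB image,
   clamping to 0..255 (saturating rather than modular arithmetic). */
void brighten(uint8_t *rgb, size_t nbytes, int delta)
{
    for (size_t i = 0; i < nbytes; i++) {
        int v = rgb[i] + delta;
        if (v < 0)   v = 0;
        if (v > 255) v = 255;
        rgb[i] = (uint8_t)v;
    }
}
```

A SIMD processor can perform this add-and-clamp on 8 or 16 bytes per instruction instead of one byte at a time.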

A corresponding 2X or 4X performance gain. Memory bandwidth was sufficient to support these expanded modes. The STAR-100 was otherwise slower than CDC's own supercomputers like the CDC 7600, but at data-related tasks they could keep up while being much smaller and less expensive. However, the machine also took considerable time decoding the vector instructions and getting ready to run the process, so it required very specific data sets to work on before it actually sped anything up. The vector technique

A few "vector registers" that use the same interfaces across all CPUs with this instruction set. The hardware handles all alignment issues and "strip-mining" of loops. Machines with different vector sizes would be able to run the same code. LLVM calls this vector type "vscale". An order of magnitude increase in code size is not uncommon, when compared to equivalent scalar or equivalent vector code, and an order of magnitude or greater effectiveness (work done per instruction)
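As a concrete illustration of this style, here is a hedged sketch of a strip-mined loop using the RISC-V vector (RVV) C intrinsics, assuming a toolchain implementing the v1.0 intrinsics API; the hardware, not the programmer, decides how many elements fit in each pass:

```c
#include <stddef.h>
#include <stdint.h>
#include <riscv_vector.h>  /* RVV v1.0 C intrinsics */

/* y[i] = a * x[i] + y[i], strip-mined: vsetvl returns how many
   elements this implementation will process per iteration, so the
   same binary runs on machines with different vector lengths. */
void iaxpy_rvv(size_t n, int32_t a, const int32_t *x, int32_t *y)
{
    while (n > 0) {
        size_t vl = __riscv_vsetvl_e32m8(n);           /* elements this pass */
        vint32m8_t vx = __riscv_vle32_v_i32m8(x, vl);  /* load vl elements   */
        vint32m8_t vy = __riscv_vle32_v_i32m8(y, vl);
        vy = __riscv_vmacc_vx_i32m8(vy, a, vx, vl);    /* vy += a * vx       */
        __riscv_vse32_v_i32m8(y, vy, vl);
        x += vl; y += vl; n -= vl;
    }
}
```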

A greater quantity of numbers in the vector register, it becomes unfeasible for the computer to have a register that large. As a result, the vector processor either gains the ability to perform loops itself, or exposes some sort of vector control (status) register to the programmer, usually known as a vector length. The self-repeating instructions are found in early vector computers like the STAR-100, where

A high performance vector processor may have multiple functional units adding those numbers in parallel. The checking of dependencies between those numbers is not required as a vector instruction specifies multiple independent operations. This simplifies the control logic required, and can further improve performance by avoiding stalls. The math operations thus complete far faster overall, the limiting factor being

A pipelined loop over 16 units for a hybrid approach. The Broadcom Videocore IV is also capable of this hybrid approach: nominally stating that its SIMD QPU Engine supports 16-long FP array operations in its instructions, it actually does them 4 at a time, as (another) form of "threads". This example starts with an algorithm ("IAXPY"), first showing it in scalar instructions, then SIMD, then predicated SIMD, and finally vector instructions. This incrementally helps illustrate

A powerful tool in real-time video processing applications like conversion between various video standards and frame rates (NTSC to/from PAL, NTSC to/from HDTV formats, etc.), deinterlacing, image noise reduction, adaptive video compression, and image enhancement. A more ubiquitous application for SIMD is found in video games: nearly every modern video game console since 1998 has incorporated

A safe fallback mechanism on unsupported CPUs to simple loops. Instead of providing a SIMD datatype, compilers can also be hinted to auto-vectorize some loops, potentially taking some assertions about the lack of data dependency. This is not as flexible as manipulating SIMD variables directly, but is easier to use. OpenMP 4.0+ has a #pragma omp simd hint; an example follows below. This OpenMP interface has replaced
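A minimal sketch of the hint, assuming a compiler with OpenMP SIMD support (e.g. GCC or Clang with -fopenmp-simd); the pragma asserts that the iterations are independent, so the compiler is free to vectorize them:

```c
#include <stddef.h>

/* The programmer promises there is no loop-carried dependency,
   so the compiler may emit SIMD instructions for this loop. */
void scale(size_t n, float a, float *x)
{
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        x[i] *= a;
}
```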

A single instruction without any overhead. This is similar to C and C++ intrinsics. Benchmarks for 4×4 matrix multiplication, 3D vertex transformation, and Mandelbrot set visualization show near 400% speedup compared to scalar code written in Dart. McCutchan's work on Dart, now called SIMD.js, has been adopted by ECMAScript, and Intel announced at IDF 2013 that they were implementing McCutchan's specification for both V8 and SpiderMonkey. However, by 2017 SIMD.js had been taken out of

A special instruction, the significance compared to Videocore IV (and, crucially, as will be shown below, SIMD as well) being that the repeat length does not have to be part of the instruction encoding. This way, significantly more work can be done in each batch; the instruction encoding is much more elegant and compact as well. The only drawback is that in order to take full advantage of this extra batch processing capacity,

A successful model for the multimedia instructions of other CPU designs. The set is also small because the CPU already included powerful shift and bit-manipulation instructions: "shift pair", which shifts a pair of registers; "extract" and "deposit" of bit fields; and all the common bit-wise logical operations (and, or, exclusive-or, etc.). This set of multimedia instructions has proven its performance as well. In 1996

A time, using a hypercube-connected network or processor-dedicated RAM to find its operands. Supercomputing moved away from the SIMD approach when inexpensive scalar MIMD approaches based on commodity processors such as the Intel i860 XP became more powerful, and interest in SIMD waned. The current era of SIMD processors grew out of the desktop-computer market rather than the supercomputer market. As desktop processors became powerful enough to support real-time gaming and audio/video processing during

A variety of reasons, this can take much less time than retrieving each pixel individually, as with a traditional CPU design. Another advantage is that the instruction operates on all loaded data in a single operation. In other words, if the SIMD system works by loading up eight data points at once, the add operation being applied to the data will happen to all eight values at the same time. This parallelism

A vector processor. Although vector supercomputers resembling the Cray-1 are less popular these days, NEC has continued to make this type of computer up to the present day with their SX series of computers. Most recently, the SX-Aurora TSUBASA places the processor and either 24 or 48 gigabytes of memory on an HBM2 module within a card that physically resembles a graphics coprocessor, but instead of serving as

A wide set of nonstandard extensions, including Cilk's #pragma simd, GCC's #pragma GCC ivdep, and many more. Consumer software is typically expected to work on a range of CPUs covering multiple generations, which could limit the programmer's ability to use new SIMD instructions to improve the computational performance of a program. The solution is to include multiple versions of the same code that uses either older or newer SIMD technologies, and pick one that best fits

Is unable by design to cope with iteration and reduction. This is illustrated further with examples, below. Additionally, vector processors can be more resource-efficient by using slower hardware and saving power, but still achieving throughput and having less latency than SIMD, through vector chaining. Consider both a SIMD processor and a vector processor working on 4 64-bit elements, doing

Is significantly more complex and involved than "packed SIMD", which is strictly limited to execution of parallel pipelined arithmetic operations only. Although the exact internal details of today's commercial GPUs are proprietary secrets, the MIAOW team was able to piece together anecdotal information sufficient to implement a subset of the AMDGPU architecture. Several modern CPU architectures are being designed as vector processors. The RISC-V vector extension follows similar principles as

Is single-issue and uses no SIMD ALUs, only having 1-wide 64-bit LOAD, 1-wide 64-bit STORE (and, as in the Cray-1, the ability to run MULTIPLY simultaneously with ADD), may complete the four operations faster than a SIMD processor with 1-wide LOAD, 1-wide STORE, and 2-wide SIMD. This more efficient resource utilization, due to vector chaining, is a key advantage and difference compared to SIMD. SIMD, by design and definition, cannot perform chaining except to

Is achievable with vector ISAs. ARM's Scalable Vector Extension takes another approach, known in Flynn's taxonomy as "associative processing", more commonly known today as "predicated" (masked) SIMD. This approach is not as compact as vector processing but is still far better than non-predicated SIMD. Detailed comparative examples are given in the vector processing page. Small-scale (64 or 128 bits) SIMD became popular on general-purpose CPUs in

Is assumed that both x and y are properly aligned here (only start on a multiple of 16) and that n is a multiple of 4, as otherwise some setup code would be needed to calculate a mask or to run a scalar version. It can also be assumed, for simplicity, that the SIMD instructions have an option to automatically repeat scalar operands, like ARM NEON can. If it does not, a "splat" (broadcast) must be used, to copy
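A hedged sketch of such a SIMD loop using Intel SSE intrinsics (assuming SSE4.1 for the packed 32-bit multiply); since SSE has no auto-repeating scalar operand, the scalar a is first "splat" into all four lanes:

```c
#include <stddef.h>
#include <stdint.h>
#include <smmintrin.h>  /* SSE4.1: _mm_mullo_epi32 */

/* y[i] = a * x[i] + y[i], four elements per iteration.  Assumes x and
   y are 16-byte aligned and n is a multiple of 4, as in the text. */
void iaxpy_sse(size_t n, int32_t a, const int32_t *x, int32_t *y)
{
    __m128i va = _mm_set1_epi32(a);  /* broadcast ("splat") a to all lanes */
    for (size_t i = 0; i < n; i += 4) {
        __m128i vx = _mm_load_si128((const __m128i *)&x[i]);
        __m128i vy = _mm_load_si128((const __m128i *)&y[i]);
        vy = _mm_add_epi32(_mm_mullo_epi32(va, vx), vy);
        _mm_store_si128((__m128i *)&y[i], vy);
    }
}
```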

Is common for publishers of the SIMD instruction sets to make their own C/C++ language extensions with intrinsic functions or special datatypes (with operator overloading) guaranteeing the generation of vector code. Intel, AltiVec, and ARM NEON provide extensions widely adopted by the compilers targeting their CPUs. (More complex operations are the task of vector math libraries.) The GNU C Compiler takes

Is comprehensive individual element-level predicate masks on every vector instruction, as is now available in ARM SVE2. AVX-512 almost qualifies as a vector processor. Predicated SIMD uses fixed-width SIMD ALUs but allows locally controlled (predicated) activation of units to provide the appearance of variable-length vectors. Examples below help explain these categorical distinctions. SIMD, because it uses fixed-width batch processing,

Is easier to achieve as only compiler switches need to be changed. Glibc supports LMV and this functionality is adopted by the Intel-backed Clear Linux project. In 2013 John McCutchan announced that he had created a high-performance interface to SIMD instruction sets for the Dart programming language, bringing the benefits of SIMD to web programs for the first time. The interface consists of two types. Instances of these types are immutable and in optimized code are mapped directly to SIMD registers. Operations expressed in Dart typically are compiled into

Is headed by computer architect Bill Dally. Their Storm-1 processor (2007) contains 80 SIMD cores controlled by a MIPS CPU.

Vector processor

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This

Is heavily SIMD based. Philips, now NXP, developed several SIMD processors named Xetal. The Xetal has 320 16-bit processor elements especially designed for vision tasks. Intel's AVX-512 SIMD instructions process 512 bits of data at once. SIMD instructions are widely used to process 3D graphics, although modern graphics cards with embedded SIMD have largely taken over this task from the CPU. Some systems also include permute functions that re-pack elements inside vectors, making them particularly useful for data processing and compression. They are also used in cryptography. The trend of general-purpose computing on GPUs (GPGPU) may lead to wider use of SIMD in

Is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SIMD within a register (SWAR) arithmetic units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators. Vector machines appeared in

Is instead "pointed to" by passing in an address to a memory location that holds the data. Decoding this address and getting the data out of the memory takes some time, during which the CPU traditionally would sit idle waiting for the requested data to show up. As CPU speeds have increased, this memory latency has historically become a large impediment to performance; see Random-access memory § Memory wall. In order to reduce

Is much smaller than in other multimedia CPUs, and also more general-purpose. The small set and simplicity of the instructions reduce the recurring costs of the electronics, as well as the costs and difficulty of the design. The general-purpose nature of the instructions increases their overall value. These instructions require only small changes to a CPU's arithmetic logic unit. A similar design approach promises to be

Is not possible, then the operations take even longer because the LD may not be issued (started) at the same time as the first ADDs, and so on. If there are only 4-wide 64-bit SIMD ALUs, the completion time is even worse: only when all four LOADs have completed may the SIMD operations start, and only when all ALU operations have completed may the STOREs begin. A vector processor, by contrast, even if it

Is separate from the parallelism provided by a superscalar processor; the eight values are processed in parallel even on a non-superscalar processor, and a superscalar processor may be able to perform multiple SIMD operations in parallel. To remedy problems 1 and 5, RISC-V's vector extension uses an alternative approach: instead of exposing the sub-register-level details to the programmer, the instruction set abstracts them out as

Is that it has a single instruction decoder-broadcaster but that the cores receiving and executing that same instruction are otherwise reasonably normal: their own ALUs, their own register files, their own load/store units and their own independent L1 data caches. Thus although all cores simultaneously execute the exact same instruction in lock-step with each other, they do so with completely different data from completely different memory locations. This

Is that vector processors, inherently by definition and design, have always been variable-length since their inception. Whereas pure (fixed-width, no predication) SIMD is often mistakenly claimed to be "vector" (because SIMD processes data which happens to be vectors), through close analysis and comparison of historic and modern ISAs, actual vector ISAs may be observed to have the following features that no SIMD ISA has: Predicated SIMD (part of Flynn's taxonomy) which

Is these which somewhat deserve the nomenclature "vector processor" or at least deserve the claim of being capable of "vector processing". SIMD processors without per-element predication (MMX, SSE, AltiVec) categorically do not. Modern GPUs, which have many small compute units each with their own independent SIMD ALUs, use Single Instruction Multiple Threads (SIMT). SIMT units run from a shared single broadcast synchronised instruction unit. The "vector registers" are very wide and

Is very costly in hardware (256-bit data paths to memory). Having 4x 64-bit ALUs, especially MULTIPLY, likewise. To avoid these high costs, a SIMD processor would have to have 1-wide 64-bit LOAD, 1-wide 64-bit STORE, and only 2-wide 64-bit ALUs. As shown in the diagram, which assumes a multi-issue execution model, the consequences are that the operations now take longer to complete. If multi-issue

The Control Data Corporation STAR-100 and Texas Instruments Advanced Scientific Computer (ASC), which were introduced in 1974 and 1972, respectively. The basic ASC (i.e., "one pipe") ALU used a pipeline architecture that supported both scalar and vector computations, with peak performance reaching approximately 20 MFLOPS, readily achieved when processing long vectors. Expanded ALU configurations supported "two pipes" or "four pipes" with

The Cray-1 (1977). The first era of modern SIMD computers was characterized by massively parallel processing-style supercomputers such as the Thinking Machines CM-1 and CM-2. These computers had many limited-functionality processors that would work in parallel. For example, each of 65,536 single-bit processors in a Thinking Machines CM-2 would execute the same instruction at the same time, allowing it, for instance, to logically combine 65,536 pairs of bits at

The FPU and MMX registers. Compilers also often lacked support, requiring programmers to resort to assembly language coding. SIMD on x86 had a slow start. The introduction of 3DNow! by AMD and SSE by Intel confused matters somewhat, but today the system seems to have settled down (after AMD adopted SSE) and newer compilers should result in more SIMD-enabled software. Intel and AMD now both provide optimized math libraries that use SIMD instructions, and open source alternatives like libSIMD, SIMDx86 and SLEEF have started to appear (see also libm). Apple Computer had somewhat more success, even though they entered

The Videocore IV ISA for a REP field, but unlike the STAR-100, which uses memory for its repeats, the Videocore IV repeats are on all operations, including arithmetic vector operations. The repeat length can be a small range of powers of two or sourced from one of the scalar registers. The Cray-1 introduced the idea of using processor registers to hold vector data in batches. The batch lengths (vector length, VL) could be dynamically set with

The 1990s, demand grew for this particular type of computing power, and microprocessor vendors turned to SIMD to meet the demand. Hewlett-Packard introduced MAX instructions into PA-RISC 1.1 desktops in 1994 to accelerate MPEG decoding. Sun Microsystems introduced SIMD integer instructions in its "VIS" instruction set extensions in 1995, in its UltraSPARC I microprocessor. MIPS followed suit with their similar MDMX system. The first widely deployed desktop SIMD

The 64-bit "MAX-2" instructions enabled real-time performance of MPEG-1 and MPEG-2 video while increasing the area of a RISC CPU by only 0.2%. MAX-1 was first implemented with the PA-7100LC in 1994. It is usually credited as the first SIMD extension to an ISA. The second version, MAX-2, was for the 64-bit PA-RISC 2.0 ISA. It was first implemented in the PA-8000 microprocessor released in 1996. The basic approach to

The CPU, in the fashion of an assembly line, so the address decoder is constantly in use. Any particular instruction takes the same amount of time to complete, a time known as the latency, but the CPU can process an entire batch of operations, in an overlapping fashion, much faster and more efficiently than if it did so one at a time. Vector processors take this concept one step further. Instead of pipelining just

The CPU, this would look something like this: But to a vector processor, this task looks considerably different: Note the complete lack of looping in the instructions, because it is the hardware which has performed 10 sequential operations: effectively the loop count is on an explicit per-instruction basis. Cray-style vector ISAs take this a step further and provide a global "count" register, called vector length (VL): There are several savings inherent in this approach. Additionally, in more modern vector processor ISAs, "Fail on First" or "Fault First" has been introduced (see below), which brings even more advantages. But more than that,

The ECMAScript standard queue in favor of pursuing a similar interface in WebAssembly. As of August 2020, the WebAssembly interface remains unfinished, but its portable 128-bit SIMD feature has already seen some use in many engines. Emscripten, Mozilla's C/C++-to-JavaScript compiler, with extensions can enable compilation of C++ programs that make use of SIMD intrinsics or GCC-style vector code to

The GCC extension. LLVM's libcxx seems to implement it. For GCC and libstdc++, a wrapper library that builds on top of the GCC extension is available. Microsoft added SIMD to .NET in RyuJIT. The System.Numerics.Vector package, available on NuGet, implements SIMD datatypes. Java also has a new proposed API for SIMD instructions, available in OpenJDK 17 in an incubator module. It also has

The SIMD API of JavaScript, resulting in equivalent speedups compared to scalar code. It also supports (and now prefers) the WebAssembly 128-bit SIMD proposal. It has generally proven difficult to find sustainable commercial applications for SIMD-only processors. One that has had some measure of success is the GAPP, which was developed by Lockheed Martin and taken to the commercial sector by their spin-off Teranex. The GAPP's recent incarnations have become

The SIMD market later than the rest. AltiVec offered a rich system that could be programmed using increasingly sophisticated compilers from Motorola, IBM and GNU, so assembly language programming was rarely needed. Additionally, many of the systems that would benefit from SIMD were supplied by Apple itself, for example iTunes and QuickTime. However, in 2006, Apple computers moved to Intel x86 processors. Apple's APIs and development tools (Xcode) were modified to support SSE2 and SSE3 as well as AltiVec. Apple

The STAR-100's vectorisation was by design based around memory accesses, an extra slot of memory is now required to process the information. Twice the latency is also incurred due to the extra requirement of memory access. A modern packed SIMD architecture, known by many names (listed in Flynn's taxonomy), can do most of the operation in batches. The code is mostly similar to the scalar version. It

The above action would be described in a single instruction (somewhat like vadd c, a, b, $10). They are also found in the x86 architecture as the REP prefix. However, only very simple calculations can be done effectively in hardware this way without a very large cost increase. Since all operands have to be in memory for the STAR-100 architecture, the latency caused by access became huge too. Broadcom included space in all vector operations of

The addition of SIMD cannot, by itself, qualify a processor as an actual vector processor, because SIMD is fixed-length and vectors are variable-length. The difference is illustrated below with examples, showing and comparing the three categories: pure SIMD, predicated SIMD, and pure vector processing. Other CPU designs include some multiple instructions for vector processing on multiple (vectorized) data sets, typically known as MIMD (Multiple Instruction, Multiple Data) and realized with VLIW (Very Long Instruction Word) and EPIC (Explicitly Parallel Instruction Computing). The Fujitsu FR-V VLIW/vector processor combines both technologies. SIMD instruction sets lack crucial features when compared to vector instruction sets. The most important of these

The amount of time consumed by these steps, most modern CPUs use a technique known as instruction pipelining in which the instructions pass through several sub-units in turn. The first sub-unit reads the address and decodes it, the next "fetches" the values at those addresses, and the next does the math itself. With pipelining the "trick" is to start decoding the next instruction even before the first has left

The arithmetic in MAX-2 is to "interrupt the carries" between the 16-bit subwords, and choose between modular arithmetic, signed and unsigned saturation. This requires only small changes to the arithmetic logic unit. MAX-2 instructions are register-to-register instructions that operate on multiple integers in 64-bit quantities. All have a one-cycle latency in the PA-8000 microprocessor and its derivatives. Memory accesses are via
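The subword behavior can be modeled in portable C. The following is a conceptual sketch of a MAX-2-style parallel halfword add with signed saturation: four independent 16-bit adds inside one 64-bit register, with carries "interrupted" at each 16-bit boundary. It models the semantics only, not HP's hardware implementation:

```c
#include <stdint.h>

/* Four 16-bit signed saturating adds packed in one 64-bit word. */
static uint64_t parallel_hadd_ss(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        int32_t sum = (int16_t)(a >> (16 * i)) + (int16_t)(b >> (16 * i));
        if (sum >  32767) sum =  32767;  /* saturate instead of wrapping */
        if (sum < -32768) sum = -32768;
        r |= (uint64_t)(uint16_t)sum << (16 * i);
    }
    return r;
}
```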

The code to "clone" functions, while ICC does so automatically (under the command-line option /Qax). The Rust programming language also supports FMV. The setup is similar to GCC and Clang in that the code defines what instruction sets to compile for, but cloning is manually done via inlining. As using FMV requires code modification on GCC and Clang, vendors more commonly use library multi-versioning: this
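A minimal sketch of compiler-driven FMV with the target_clones attribute (supported by GCC 6+ and recent Clang on x86; the function body is illustrative): the compiler emits one clone per listed ISA plus a default, and a resolver picks the best match at load time:

```c
#include <stddef.h>

/* One source function, three generated versions; the dynamic loader
   selects the AVX2 or SSE4.2 clone if the CPU supports it. */
__attribute__((target_clones("avx2", "sse4.2", "default")))
void saxpy(size_t n, float a, const float *x, float *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] += a * x[i];
}
```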

The decoding of the more common instructions such as normal adding. (This can be somewhat mitigated by keeping the entire ISA to RISC principles: RVV only adds around 190 vector instructions even with the advanced features.) Vector processors were traditionally designed to work best only when there are large amounts of data to be worked on. For this reason, these sorts of CPUs were found primarily in supercomputers, as

The difference between a traditional vector processor and a modern SIMD one. The example starts with a 32-bit integer variant of the "DAXPY" function, in C: In each iteration, every element of y has an element of x multiplied by a and added to it. The program is expressed in scalar linear form for readability. The scalar version of this would load one of each of x and y, process one calculation, store one result, and loop: The STAR-like code remains concise, but because
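The C listing itself did not survive in this snapshot; a minimal sketch of the scalar form being described (y[i] = a * x[i] + y[i]) would be:

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar IAXPY: one load of x, one load of y, one multiply-add and
   one store per loop iteration. */
void iaxpy(size_t n, int32_t a, const int32_t *x, int32_t *y)
{
    for (size_t i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```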

The difficulties with the ILLIAC concept with its own Distributed Array Processor (DAP) design, categorising the ILLIAC and DAP as cellular array processors that potentially offered substantial performance benefits over conventional vector processor designs such as the CDC STAR-100 and Cray-1. A computer for operations with functions was presented and developed by Kartsev in 1967. The first vector supercomputers are

The early 1970s and dominated supercomputer design through the 1970s into the 1990s, notably the various Cray platforms. The rapid fall in the price-to-performance ratio of conventional microprocessor designs led to a decline in vector supercomputers during the 1990s. Vector processing development began in the early 1960s at the Westinghouse Electric Corporation in their Solomon project. Solomon's goal

The early 1990s and continued through 1997 and later with Motion Video Instructions (MVI) for Alpha. SIMD instructions can be found, to one degree or another, on most CPUs, including IBM's AltiVec and SPE for PowerPC, HP's PA-RISC Multimedia Acceleration eXtensions (MAX), Intel's MMX and iwMMXt, SSE, SSE2, SSE3, SSSE3 and SSE4.x, AMD's 3DNow!, ARC's ARC Video subsystem, SPARC's VIS and VIS2, Sun's MAJC, ARM's Neon technology, MIPS' MDMX (MaDMaX) and MIPS-3D. The SPU instruction set of the Cell processor, co-developed by IBM, Sony and Toshiba,

The early vector processors, and is being implemented in commercial products such as the Andes Technology AX45MPV. There are also several open source vector processor architectures being developed, including ForwardCom and Libre-SOC. As of 2016 most commodity CPUs implement architectures that feature fixed-length SIMD instructions. On first inspection these can be considered a form of vector processing because they operate on multiple (vectorized, explicit length) data sets, and borrow features from vector processors. However, by definition,

The entire group of results. In general terms, CPUs are able to manipulate one or two pieces of data at a time. For instance, most CPUs have an instruction that essentially says "add A to B and put the result in C". The data for A, B and C could be, in theory at least, encoded directly into the instruction. However, in an efficient implementation things are rarely that simple. The data is rarely sent in raw form, and

The extensions a step further by abstracting them into a universal interface that can be used on any platform, by providing a way of defining SIMD datatypes. The LLVM Clang compiler also implements the feature, with an analogous interface defined in the IR. Rust's packed_simd crate (and the experimental std::simd) uses this interface, and so does Swift 2.0+. C++ has an experimental interface, std::experimental::simd, that works similarly to
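A minimal sketch of the GCC vector extension the paragraph refers to (also accepted by Clang); the vector_size attribute defines a portable SIMD datatype on which ordinary operators generate vector code:

```c
/* Four 32-bit ints in one 128-bit vector type. */
typedef int v4si __attribute__((vector_size(16)));

v4si iaxpy4(v4si x, v4si y, int a)
{
    v4si va = {a, a, a, a};  /* splat the scalar by hand */
    return va * x + y;       /* element-wise multiply-add */
}
```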

The form of an array. In 1962, Westinghouse cancelled the project, but the effort was restarted by the University of Illinois at Urbana–Champaign as the ILLIAC IV. Their version of the design originally called for a 1 GFLOPS machine with 256 ALUs, but, when it was finally delivered in 1972, it had only 64 ALUs and could reach only 100 to 150 MFLOPS. Nevertheless, it showed that the basic concept

The future. Adoption of SIMD systems in personal computer software was at first slow, due to a number of problems. One was that many of the early SIMD instruction sets tended to slow overall performance of the system due to the re-use of existing floating point registers. Other systems, like MMX and 3DNow!, offered support for data types that were not interesting to a wide audience and had expensive context switching instructions to switch between using

The ground up with no separate scalar registers. ZiiLABS produced a SIMD-type processor for use on mobile devices, such as media players and mobile phones. Larger-scale commercial SIMD processors are available from ClearSpeed Technology, Ltd. and Stream Processors, Inc. ClearSpeed's CSX600 (2004) has 96 cores, each with two double-precision floating point units, while the CSX700 (2008) has 192. Stream Processors

The high-end market again with its ETA-10 machine, but it sold poorly and they took that as an opportunity to leave the supercomputing field entirely. In the early and mid-1980s Japanese companies (Fujitsu, Hitachi and Nippon Electric Corporation (NEC)) introduced register-based vector machines similar to the Cray-1, typically being slightly faster and much smaller. Oregon-based Floating Point Systems (FPS) built add-on array processors for minicomputers, later building their own minisupercomputers. Throughout, Cray continued to be

The instruction itself that the instruction will operate again on another item of data, at an address one increment larger than the last. This allows for significant savings in decoding time. To illustrate what a difference this can make, consider the simple task of adding two groups of 10 numbers together. In a normal programming language one would write a "loop" that picked up each of the pairs of numbers in turn, and then added them. To
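In C, the scalar loop the text describes might look like this (a sketch; names are illustrative). A vector instruction would instead express the whole batch of ten additions at once:

```c
/* Ten separate fetch-add-store steps, one pair per iteration. */
void add_groups(const int a[10], const int b[10], int c[10])
{
    for (int i = 0; i < 10; i++)
        c[i] = a[i] + b[i];
}
```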

The instructions, they also pipeline the data itself. The processor is fed instructions that say not just to add A to B, but to add all of the numbers "from here to here" to all of the numbers "from there to there". Instead of constantly having to decode instructions and then fetch the data needed to complete them, the processor reads a single instruction from memory, and it is simply implied in the definition of

The memory load and store speed correspondingly had to increase as well. This is sometimes claimed to be a disadvantage of Cray-style vector processors: in reality it is part of achieving high performance throughput, as seen in GPUs, which face exactly the same issue. Modern SIMD computers claim to improve on early Cray by directly using multiple ALUs, for a higher degree of parallelism compared to only using

The newer architectures are then considered "short-vector" architectures, as earlier SIMD and vector supercomputers had vector lengths from 64 to 64,000. A modern supercomputer is almost always a cluster of MIMD computers, each of which implements (short-vector) SIMD instructions. An application that may take advantage of SIMD is one where the same value is being added to (or subtracted from) a large number of data points,

The next operation, the Cray design would load a smaller section of the vector into registers and then apply as many operations as it could to that data, thereby avoiding many of the much slower memory access operations. The Cray design used pipeline parallelism to implement vector instructions rather than multiple ALUs. In addition, the design had completely separate pipelines for different instructions; for example, addition/subtraction

The normal scalar pipeline. Modern vector processors (such as the SX-Aurora TSUBASA) combine both, by issuing multiple data to multiple internal pipelined SIMD ALUs, the number issued being dynamically chosen by the vector program at runtime. Masks can be used to selectively load and store data in memory locations, and use those same masks to selectively disable processing elements of SIMD ALUs. Some processors with SIMD (AVX-512, ARM SVE2) are capable of this kind of selective, per-element ("predicated") processing, and it
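A hedged sketch of such per-element predication using the AVX-512 intrinsics (assuming an AVX-512F-capable compiler and CPU; the wrapper function is illustrative): lanes whose mask bit is clear keep their previous value instead of being overwritten:

```c
#include <immintrin.h>

/* acc[i] = (bit i of keep set) ? acc[i] + v[i] : acc[i] */
__m512i add_where(__m512i acc, __m512i v, __mmask16 keep)
{
    return _mm512_mask_add_epi32(acc, keep, acc, v);
}
```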

The performance leader, continually beating the competition with a series of machines that led to the Cray-2, Cray X-MP and Cray Y-MP. Since then, the supercomputer market has focused much more on massively parallel processing rather than better implementations of vector processors. However, recognising the benefits of vector processing, IBM developed Virtual Vector Architecture for use in supercomputers coupling several scalar processors to act as

The performance of multimedia use. SIMD has three different subcategories in Flynn's 1972 taxonomy, one of which is SIMT. SIMT should not be confused with software threads or hardware threads, both of which are task time-sharing (time-slicing). SIMT is true simultaneous parallel hardware-level execution. Modern graphics processing units (GPUs) are often wide SIMD implementations. The first use of SIMD instructions

The pipelines tend to be long. The "threading" part of SIMT involves the way data is handled independently on each of the compute units. In addition, GPUs such as the Broadcom Videocore IV and other external vector processors like the NEC SX-Aurora TSUBASA may use fewer vector units than the width implies: instead of having 64 units for a 64-number-wide register, the hardware might instead do

The same operation on multiple data points simultaneously. Such machines exploit data level parallelism, but not concurrency: there are simultaneous (parallel) computations, but each unit performs the exact same instruction at any given moment (just with different data). SIMD is particularly applicable to common tasks such as adjusting the contrast in a digital image or adjusting the volume of digital audio. Most modern CPU designs include SIMD instructions to improve

The standard 64-bit loads and stores. The "MIX" and "PERMH" instructions are a notable innovation because they permute words in the register set without accessing memory. This can substantially speed many operations.

SIMD

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal (part of the hardware design) and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform

The supercomputers themselves were, in general, found in places such as weather prediction centers and physics labs, where huge amounts of data are "crunched". However, as shown above and demonstrated by RISC-V RVV, the efficiency of vector ISAs brings other benefits which are compelling even for embedded use-cases. The vector pseudocode example above comes with a big assumption that the vector computer can process more than ten numbers in one batch. For

The time required to fetch the data from memory. Not all problems can be attacked with this sort of solution. Including these types of instructions necessarily adds complexity to the core CPU. That complexity typically makes other instructions run slower, i.e., whenever it is not adding up many numbers in a row. The more complex instructions also add to the complexity of the decoders, which might slow down

The use of SIMD-capable instructions. A later processor that used vector processing is the Cell processor used in the PlayStation 3, which was developed by IBM in cooperation with Toshiba and Sony. It uses a number of SIMD processors (a NUMA architecture, each with independent local store and controlled by a general purpose CPU) and is geared towards the huge datasets required by 3D and video processing applications. It differs from traditional ISAs by being SIMD from

The user's CPU at run-time (dynamic dispatch). There are two main camps of solutions: FMV, manually coded in assembly language, is quite commonly used in a number of performance-critical libraries such as glibc and libjpeg-turbo. Intel C++ Compiler, GNU Compiler Collection since GCC 6, and Clang since clang 7 allow for a simplified approach, with the compiler taking care of function duplication and selection. GCC and Clang require explicit target_clones labels in

Was first fully exploited in 1976 by the famous Cray-1. Instead of leaving the data in memory like the STAR-100 and ASC, the Cray design had eight vector registers, which held sixty-four 64-bit words each. The vector instructions were applied between registers, which is much faster than talking to main memory. Whereas the STAR-100 would apply a single operation across a long vector in memory and then move on to

Was implemented in different hardware than multiplication. This allowed a batch of vector instructions to be pipelined into each of the ALU subunits, a technique they called vector chaining. The Cray-1 normally had a performance of about 80 MFLOPS, but with up to three chains running it could peak at 240 MFLOPS and averaged around 150 – far faster than any machine of the era. Other examples followed. Control Data Corporation tried to re-enter

Was in the ILLIAC IV, which was completed in 1972. SIMD was the basis for vector supercomputers of the early 1970s such as the CDC Star-100 and the Texas Instruments ASC, which could operate on a "vector" of data with a single instruction. Vector processing was especially popularized by Cray in the 1970s and 1980s. Vector processing architectures are now considered separate from SIMD computers: Duncan's taxonomy includes them whereas Flynn's taxonomy does not, due to Flynn's work (1966, 1972) pre-dating

Was sound, and, when used on data-intensive applications, such as computational fluid dynamics, the ILLIAC was the fastest machine in the world. The ILLIAC approach of using separate ALUs for each data element is not common to later designs, and is often referred to under a separate category, massively parallel computing. Around this time Flynn categorized this type of processing as an early form of single instruction, multiple threads (SIMT). International Computers Limited sought to avoid many of

Was the dominant purchaser of PowerPC chips from IBM and Freescale Semiconductor. Even though Apple has stopped using PowerPC processors in their products, further development of AltiVec is continued in several PowerPC and Power ISA designs from Freescale and IBM. SIMD within a register, or SWAR, is a range of techniques and tricks used for performing SIMD in general-purpose registers on hardware that does not provide any direct support for SIMD instructions. This can be used to exploit parallelism in certain algorithms even on hardware that does not support SIMD directly. It

Was to dramatically increase math performance by using a large number of simple coprocessors under the control of a single master central processing unit (CPU). The CPU fed a single common instruction to all of the arithmetic logic units (ALUs), one per cycle, but with a different data point for each one to work on. This allowed the Solomon machine to apply a single algorithm to a large data set, fed in

Was with Intel's MMX extensions to the x86 architecture in 1996. This sparked the introduction of the much more powerful AltiVec system in the Motorola PowerPC and IBM's POWER systems. Intel responded in 1999 by introducing the all-new SSE system. Since then, there have been several extensions to the SIMD instruction sets for both architectures. Advanced Vector Extensions (AVX, AVX2 and AVX-512) are developed by Intel. AMD supports AVX, AVX2, and AVX-512 in their current products. All of these developments have been oriented toward support for real-time graphics, and are therefore oriented toward processing in two, three, or four dimensions, usually with vector lengths of between two and sixteen words, depending on data type and architecture. When new SIMD architectures need to be distinguished from older ones,
