
Single program, multiple data

Article snapshot taken from Wikipedia under the Creative Commons Attribution-ShareAlike license. Give it a read and then ask your questions in the chat. We can research this topic together.

In computing, single program, multiple data (SPMD) is a term that has been used to refer to computational models for exploiting parallelism whereby multiple processors cooperate in the execution of a program in order to obtain results faster.


The term SPMD was introduced in 1983 and was used to denote two different computational models: The (IBM) SPMD is the most common style of parallel programming and can be considered a subcategory of MIMD in that it refers to MIMD execution of a given (“single”) program. It is also a prerequisite for research concepts such as active messages and distributed shared memory. In SPMD parallel execution, multiple autonomous processors simultaneously execute

A function (called omp_get_thread_num()). The thread ID is an integer, and the primary thread has an ID of 0. After the execution of the parallelized code, the threads join back into the primary thread, which continues onward to the end of the program. By default, each thread executes the parallelized section of code independently. Work-sharing constructs can be used to divide a task among
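A minimal sketch of how a thread can query its ID with omp_get_thread_num() (assuming a C compiler with OpenMP support, for example gcc -fopenmp; the printed format is illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    /* fork a team of threads; each thread executes this block */
    #pragma omp parallel
    {
        int tid = omp_get_thread_num();       /* 0 for the primary thread */
        int nthreads = omp_get_num_threads(); /* size of the team */
        printf("Thread %d of %d\n", tid, nthreads);
    }
    /* the threads join back into the primary thread here */
    return 0;
}
```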

A header file labelled omp.h in C/C++. The OpenMP Architecture Review Board (ARB) published its first API specifications, OpenMP for Fortran 1.0, in October 1997. In October the following year they released the C/C++ standard. 2000 saw version 2.0 of the Fortran specifications, with version 2.0 of the C/C++ specifications being released in 2002. Version 2.5 is a combined C/C++/Fortran specification that

A portable, scalable model that gives programmers a simple and flexible interface for developing parallel applications for platforms ranging from the standard desktop computer to the supercomputer. An application built with the hybrid model of parallel programming can run on a computer cluster using both OpenMP and Message Passing Interface (MPI), such that OpenMP is used for parallelism within

A (multi-core) node while MPI is used for parallelism between nodes. There have also been efforts to run OpenMP on software distributed shared memory systems, to translate OpenMP into MPI and to extend OpenMP for non-shared memory systems. OpenMP is an implementation of multithreading, a method of parallelizing whereby a primary thread (a series of instructions executed consecutively) forks
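A hedged sketch of this hybrid model, with MPI providing parallelism between nodes and OpenMP providing parallelism within a node (the summation is a made-up workload; an MPI library and an OpenMP-capable C compiler are assumed):

```c
#include <stdio.h>
#include <mpi.h>
#include <omp.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);                   /* typically one MPI process per node */

    int rank, size;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);     /* parallelism between nodes via MPI */
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    double local_sum = 0.0;
    /* parallelism within the (multi-core) node via OpenMP threads */
    #pragma omp parallel for reduction(+:local_sum)
    for (int i = rank; i < 1000000; i += size)
        local_sum += 1.0 / (i + 1);

    double total = 0.0;
    MPI_Reduce(&local_sum, &total, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);
    if (rank == 0)
        printf("sum = %f\n", total);

    MPI_Finalize();
    return 0;
}
```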

A collection of interconnected, independent computers, called nodes. For parallel execution, each node starts its own program and communicates with other nodes by sending and receiving messages, calling send/receive routines for that purpose. Other parallelization directives such as barrier synchronization may also be implemented by messages. The messages can be sent by a number of communication mechanisms, such as TCP/IP over Ethernet, or specialized high-speed interconnects such as Myrinet and Supercomputer Interconnect. For distributed memory environments, serial sections of
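A minimal send/receive sketch of this style, using MPI as the messaging layer (the tag, message content, and two-node setup are illustrative):

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* each node runs its own copy of the program */

    if (rank == 0) {
        int data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);             /* send to node 1 */
    } else if (rank == 1) {
        int data;
        MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("node 1 received %d\n", data);
    }

    MPI_Barrier(MPI_COMM_WORLD);   /* barrier synchronization, also implemented by messages */
    MPI_Finalize();
    return 0;
}
```

The program needs at least two processes, for example `mpirun -np 2 ./a.out`.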

A computer with two cores, and thus two threads: However, the output may also be garbled because of the race condition caused by the two threads sharing the standard output. Whether printf is atomic depends on the underlying implementation, unlike C++11's std::cout, which is thread-safe by default. Used to specify how to assign independent work to one or all of the threads. Example: initialize

A parallel task on different data. A typical example is the parallel DO loop, where different processors work on separate parts of the arrays involved in the loop. At the end of the loop, execution is synchronized (with soft- or hard-barriers), and processors (processes) continue to the next available section of the program to execute. The (IBM) SPMD has been implemented in the current standard interface for shared memory multiprocessing, OpenMP, which uses multithreading, usually implemented by lightweight processes, called threads. Current computers allow exploiting many parallel modes at
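A short C/OpenMP sketch of this parallel-loop pattern (the arrays and scaling operation are illustrative); each thread works on its own slice of the iterations, and all threads synchronize at the loop's implicit barrier before execution continues:

```c
#define N 1000000
static double a[N], b[N];

int main(void) {
    /* iterations are divided among the threads; each works on its own part of the arrays */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];
    /* implicit barrier here: all threads synchronize before the program continues */
    return 0;
}
```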

A program parallelized using OpenMP on an N-processor platform. However, this seldom occurs for these reasons: Some vendors recommend setting the processor affinity on OpenMP threads to associate them with particular processor cores. This minimizes thread migration and context-switching cost among cores. It also improves the data locality and reduces the cache-coherency traffic among the cores (or processors). A variety of benchmarks have been developed to demonstrate

A proposal in 2007, taking inspiration from task parallelism features in Cilk, X10 and Chapel. Version 3.0 was released in May 2008. Included in the new features in 3.0 is the concept of tasks and the task construct, significantly broadening the scope of OpenMP beyond the parallel loop constructs that made up most of OpenMP 2.0. Version 4.0 of the specification was released in July 2013. It adds or improves

A set of compiler directives, library routines, and environment variables that influence run-time behavior. OpenMP is managed by the nonprofit technology consortium OpenMP Architecture Review Board (or OpenMP ARB), jointly defined by a broad swath of leading computer hardware and software vendors, including Arm, AMD, IBM, Intel, Cray, HP, Fujitsu, Nvidia, NEC, Red Hat, Texas Instruments, and Oracle Corporation. OpenMP uses



A specified number of sub-threads and the system divides a task among them. The threads then run concurrently, with the runtime environment allocating threads to different processors. The section of code that is meant to run in parallel is marked accordingly, with a compiler directive that will cause the threads to form before the section is executed. Each thread has an ID attached to it which can be obtained using

A version of i that runs from 0 to 49999 while the second gets a version running from 50000 to 99999. Variant directives are one of the major features introduced in the OpenMP 5.0 specification to facilitate programmers to improve performance portability. They enable adaptation of OpenMP pragmas and user code at compile time. The specification defines traits to describe active OpenMP constructs, execution devices, and functionality provided by an implementation, context selectors based on
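A sketch of such an array-initialization loop in C (the 100000-element array mirrors the split described above; the exact assignment of iterations to threads is decided by the OpenMP runtime):

```c
int main(void) {
    static float a[100000];

    /* the loop iterations are split among the worker threads;
     * each thread gets its own private copy of the loop variable i */
    #pragma omp parallel for
    for (int i = 0; i < 100000; i++)
        a[i] = 2.0f * i;

    return 0;
}
```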

Is also more general than just a “data-parallel” computational model and can encompass fork&join (as a subcategory implementation). The original context of the (IBM) SPMD was the RP3 computer (the 512-processor IBM Research Parallel Processor Prototype), which supported general purpose computing, with both distributed and (logically) shared memory. The (IBM) SPMD model was implemented by Darema and IBM colleagues into

Is that Active Messages are actually a lower-level mechanism that can be used to implement data-parallel or message-passing models efficiently. The basic idea is that each message has a header containing the address or index of a userspace handler to be executed upon message arrival, with the contents of the message passed as an argument to the handler. Early active message systems passed the actual remote code address across

Is used to specify the number of threads for an application. OpenMP has been implemented in many commercial compilers. For instance, Visual C++ 2005, 2008, 2010, 2012 and 2013 support it (OpenMP 2.0, in Professional, Team System, Premium and Ultimate editions), as well as Intel Parallel Studio for various processors. Oracle Solaris Studio compilers and tools support the latest OpenMP specifications with productivity enhancements for Solaris OS (UltraSPARC and x86/x64) and Linux platforms. The Fortran, C and C++ compilers from The Portland Group also support OpenMP 2.5. GCC has also supported OpenMP since version 4.2. Compilers with an implementation of OpenMP 3.0: Several compilers support OpenMP 3.1: Compilers supporting OpenMP 4.0: Several compilers supporting OpenMP 4.5: Partial support for OpenMP 5.0: Auto-parallelizing compilers that generate source code annotated with OpenMP directives: Several profilers and debuggers expressly support OpenMP: Pros: Cons: One might expect to get an N-times speedup when running

The EPEX (Environment for Parallel Execution), one of the first prototype programming environments. The effectiveness of the (IBM) SPMD was demonstrated for a wide class of applications, and in 1988 was implemented in the IBM FORTRAN, the first vendor product in parallel programming; and in MPI (1991 and on), OpenMP (1997 and on), and other environments which have adopted and cite the (IBM) SPMD computational model. By

The OpenMP directive. The different types of clauses are: Used to modify/check the number of threads, detect whether the execution context is in a parallel region, query how many processors are in the current system, set/unset locks, timing functions, etc. A method to alter the execution features of OpenMP applications. Used to control loop iteration scheduling, the default number of threads, etc. For example, OMP_NUM_THREADS
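A hedged sketch exercising a few of these runtime routines (all are standard OpenMP library calls; the printed strings are illustrative):

```c
#include <stdio.h>
#include <omp.h>

int main(void) {
    omp_set_num_threads(4);                       /* request a team of 4 threads */
    printf("processors available: %d\n", omp_get_num_procs());
    printf("inside a parallel region? %d\n", omp_in_parallel());  /* 0 here */

    double t0 = omp_get_wtime();                  /* wall-clock timing routine */
    #pragma omp parallel
    {
        /* parallel work would go here */
    }
    printf("elapsed: %f s\n", omp_get_wtime() - t0);
    return 0;
}
```

Launching the same binary with the OMP_NUM_THREADS environment variable set changes the default team size without recompiling; a call to omp_set_num_threads() in the code takes precedence over it.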

The SwarmESB project. The basic model of the active messages is extended with new concepts, and JavaScript is used to express the code of the active messages.

OpenMP

OpenMP (Open Multi-Processing) is an application programming interface (API) that supports multi-platform shared-memory multiprocessing programming in C, C++, and Fortran, on many platforms, instruction-set architectures and operating systems, including Solaris, AIX, FreeBSD, HP-UX, Linux, macOS, and Windows. It consists of

The constructs for thread creation, workload distribution (work sharing), data-environment management, thread synchronization, user-level runtime routines and environment variables. In C/C++, OpenMP uses #pragmas. The OpenMP-specific pragmas are listed below. The pragma omp parallel is used to fork additional threads to carry out the work enclosed in the construct in parallel. The original thread will be denoted as master thread with thread ID 0. Example (C program): Display "Hello, world." using multiple threads. Use flag -fopenmp to compile using GCC: Output on
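The example program itself did not survive this snapshot; a minimal version consistent with the description might look as follows (compiled with `gcc -fopenmp hello.c -o hello`):

```c
#include <stdio.h>

int main(void) {
    /* fork a team of threads; each prints the message, then the team joins */
    #pragma omp parallel
    printf("Hello, world.\n");
    return 0;
}
```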

The current parallel computing standards. The (IBM) SPMD programming model assumes a multiplicity of processors which operate cooperatively, all executing the same program but able to take different paths through the program based on parallelization directives embedded in the program; and specifically, as stated: “all processes participating in the parallel computation are created at the beginning of



The execution and remain in existence until the end”, (the processors/processes) “execute different instructions and act on different data”, “the job (work) to be done by each process is allocated dynamically”, that is, the processes “self-schedule themselves to execute different instructions and act on different data”, thus self-assigning themselves to cooperate in the execution of serial and parallel tasks (as well as replicated tasks) in

The final directive from variants and context. Since OpenMP is a shared-memory programming model, most variables in OpenMP code are visible to all threads by default. But sometimes private variables are necessary to avoid race conditions, and there is a need to pass values between the sequential part and the parallel region (the code block executed in parallel), so data environment management is introduced as data-sharing attribute clauses by appending them to
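A short sketch of such clauses in C (the variable names are illustrative): n is shared by default, tmp gets one private copy per thread, and sum is combined safely with a reduction clause.

```c
#include <stdio.h>

int main(void) {
    int n = 100;
    int sum = 0;      /* written by every thread: combined with a reduction */
    int tmp;          /* scratch value: one private copy per thread */

    #pragma omp parallel for private(tmp) reduction(+:sum)
    for (int i = 0; i < n; i++) {
        tmp = i * i;
        sum += tmp;
    }

    printf("sum of squares = %d\n", sum);
    return 0;
}
```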

The following features: support for accelerators; atomics; error handling; thread affinity; tasking extensions; user-defined reduction; SIMD support; Fortran 2003 support. The current version is 5.2, released in November 2021. Version 6.0 was released in November 2024. Note that not all compilers (and OSes) support the full set of features for the latest version(s). The core elements of OpenMP are

The largest clusters on the TeraGrid, as well as present GPU-based supercomputers. On a shared memory machine (a computer with several interconnected CPUs that access the same memory space), the sharing can be implemented in the context of either physically shared memory or logically shared (but physically distributed) memory; in addition to the shared memory, the CPUs in the computer system can also include local (or private) memory. For either of these contexts, synchronization can be enabled with hardware-enabled primitives (such as compare-and-swap or fetch-and-add). For machines that do not have such hardware support, locks can be used and data can be “exchanged” across processors (or, more generally, processes or threads) by depositing
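A hedged sketch of how a fetch-and-add primitive can support the self-scheduling described in this model, using C11 atomics rather than any machine-specific instruction (the task counter and limit are illustrative):

```c
#include <stdatomic.h>

#define NTASKS 1000

/* shared counter living in the common memory space */
static atomic_int next_task = 0;

/* Each cooperating process/thread calls this to self-schedule its next piece
 * of work: fetch-and-add hands out a unique task index without a lock. */
int grab_task(void) {
    int t = atomic_fetch_add(&next_task, 1);
    return (t < NTASKS) ? t : -1;   /* -1: no work left */
}
```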

The late 1980s, there were many distributed computers with proprietary message passing libraries. The first SPMD standard was PVM. The current de facto standard is MPI. The Cray parallel directives were a direct predecessor of OpenMP.

Active message

An active message (in computing) is a messaging object capable of performing processing on its own. It is a lightweight messaging protocol used to optimize network communications with an emphasis on reducing latency by removing software overheads associated with buffering and providing applications with direct user-level access to

The local address of a handler function; in these systems the sender of an active message provides an index into the remote handler table, and upon arrival of the active message the table is used to map this index to the handler address that is invoked to handle the message. Other variations of active messages carry the actual code itself, not a pointer to the code. The message typically carries some data. On arrival at

The network hardware. This contrasts with traditional computer-based messaging systems in which messages are passive entities with no processing power. Active messages are a communications primitive for exploiting the full performance and flexibility of modern computer interconnects. They are often classified as one of the three main types of distributed memory programming, the other two being data parallel and message passing. The view

The network; however, this approach required the initiator to know the address of the remote handler function when composing a message, which can be quite limiting even within the context of an SPMD programming model (and generally relies upon address space uniformity, which is absent in many modern systems). Newer active message interfaces require the client to register a table with the software at initialization time that maps an integer index to
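An illustrative sketch of this index-to-handler dispatch; the struct layout and function names are invented and do not correspond to any particular active-message library:

```c
#include <stdio.h>

/* An active message carries a handler index plus a payload. */
typedef struct {
    int handler_index;   /* index into the receiver's handler table */
    int payload;         /* message data passed to the handler */
} active_msg;

typedef void (*am_handler)(int payload);

static void handle_increment(int payload) { printf("increment by %d\n", payload); }
static void handle_print(int payload)     { printf("value = %d\n", payload); }

/* Table registered at initialization time; sender and receiver agree on the indices. */
static am_handler handler_table[] = { handle_increment, handle_print };

/* Called on the receiving side when a message arrives from the network. */
static void on_arrival(const active_msg *m) {
    handler_table[m->handler_index](m->payload);
}

int main(void) {
    active_msg m = { 1, 42 };   /* "send" a message invoking handler index 1 */
    on_arrival(&m);
    return 0;
}
```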

The program can be implemented by identical computation of the serial section on all nodes rather than computing the result on one node and sending it to the others, if that improves performance by reducing communication overhead. Nowadays, the programmer is isolated from the details of the message passing by standard interfaces, such as PVM and MPI. Distributed memory is the programming style used on parallel supercomputers, from homegrown Beowulf clusters to

The program. The notion of a process was used as a generalization of the term processor in the sense that multiple processes can execute on a processor (for example, to exploit larger degrees of parallelism for more efficiency and load balancing). The (IBM) SPMD model was proposed by Darema as an approach different from and more efficient than the fork-and-join that was pursued by all others in the community at that time; it


The programmer with a common memory space and the possibility to parallelize execution. With the (IBM) SPMD model the cooperating processors (or processes) take different paths through the program, using parallel directives (parallelization and synchronization directives, which can utilize compare-and-swap and fetch-and-add operations on shared memory synchronization variables), and perform operations on data in

The receiving end, more data is acquired, and the computation in the active message is performed, making use of data in the message as well as data in the receiving node. This form of active messaging is not restricted to SPMD, although originator and receiver must share some notions as to what data can be accessed at the receiving node. A higher-level implementation of active messages is also named Swarm communication in

The same (parallel) task (“same program”) is executed on different (SIMD) processors (“operating in lock-step mode”), acting on a part (“slice”) of the data vector. Specifically, in their 1985 paper (and similarly in) it is stated: “we consider the SPMD (Single Program, Multiple Data) operating mode. This mode allows simultaneous execution of the same task (one per processor) but prevents data exchange between processors. Data exchanges are only performed under SIMD mode by means of vector assignments. We assume synchronizations are summed-up to switchings (sic) between SIMD and SPMD operatings (sic) modes using global fork-join primitives”. Starting around

The same program at independent points, rather than in the lockstep that SIMD or SIMT imposes on different data. With SPMD, tasks can be executed on general-purpose CPUs. In SIMD the same operation (instruction) is applied on multiple data to manipulate data streams (a version of SIMD is vector processing, where the data are organized as vectors). Another class of processors, GPUs, encompasses processing of multiple SIMD streams. Note that SPMD and SIMD are not mutually exclusive; SPMD parallel execution can include SIMD, or vector, or GPU sub-processing. SPMD has been used for parallel programming of both message passing and shared-memory machine architectures. On distributed memory computer architectures, SPMD implementations usually employ message passing programming. A distributed memory computer consists of

The same time for maximum combined effect. A distributed memory program using MPI may run on a collection of nodes. Each node may be a shared memory computer and execute in parallel on multiple CPUs using OpenMP. Within each CPU, SIMD vector instructions (usually generated automatically by the compiler) and superscalar instruction execution (usually handled transparently by the CPU itself), such as pipelining and

The same timeframe (in late 1983 – early 1984), the SPMD term was proposed by Frederica Darema (at IBM at that time, and part of the RP3 group) to define a different SPMD computational model that she proposed, as a programming model which in the intervening years has been applied to a wide range of general-purpose high-performance computers (including RP3, the 512-processor IBM Research Parallel Processor Prototype) and has led to

The sharable data in a shared memory area. When the hardware does not support shared memory, packing the data as a “message” is often the most efficient way to program (logically) shared memory computers with a large number of processors, where the physical memory is local to processors and accessing memory of another processor takes longer. SPMD on a shared memory machine can be implemented by standard processes (heavyweight) or threads (lightweight). Shared memory multiprocessing (both symmetric multiprocessing, SMP, and non-uniform memory access, NUMA) presents

The shared memory (“shared data”); the processors (or processes) can also have access to and perform operations on data in their local memory (“private data”). In contrast, with fork-and-join approaches, the program starts executing on one processor and the execution splits in a parallel region, which is started when parallel directives are encountered; in a parallel region, the processors execute

The threads so that each thread executes its allocated part of the code. Both task parallelism and data parallelism can be achieved using OpenMP in this way. The runtime environment allocates threads to processors depending on usage, machine load and other factors. The runtime environment can assign the number of threads based on environment variables, or the code can do so using functions. The OpenMP functions are included in

The traits and user-defined conditions, and metadirective and declare variant directives for users to program the same code region with variant directives. The mechanism provided by the two variant directives for selecting variants is more convenient to use than C/C++ preprocessing, since it directly supports variant selection in OpenMP and allows an OpenMP compiler to analyze and determine
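A hedged sketch of a declare variant directive, assuming an OpenMP 5.0-capable compiler (the function names and the existence of a tuned variant are invented for illustration):

```c
/* A hand-tuned variant intended for calls made from inside a parallel region
 * (the function and its tuning are hypothetical). */
void axpy_parallel(int n, float a, const float *x, float *y);

/* Ask the compiler to substitute axpy_parallel when axpy is called from
 * within a parallel construct; otherwise the base definition below is used. */
#pragma omp declare variant(axpy_parallel) match(construct={parallel})
void axpy(int n, float a, const float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] += a * x[i];
}
```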


The use of multiple parallel functional units, are used for maximum single-CPU speed. The acronym SPMD for “Single-Program Multiple-Data” has been used to describe two different computational models for exploiting parallel computing, and this is due to both terms being natural extensions of Flynn’s taxonomy. The two respective groups of researchers were unaware of each other’s use of the term SPMD to independently describe different models of parallel programming. The term SPMD

The value of a large array in parallel, using each thread to do part of the work. This example is embarrassingly parallel, and depends only on the value of i. The OpenMP parallel for flag tells the OpenMP system to split this task among its working threads. The threads will each receive a unique and private version of the variable. For instance, with two worker threads, one thread might be handed

Was proposed first in 1983 by Michel Auguin (University of Nice Sophia-Antipolis) and François Larbey (Thomson/Sintra), in the context of the OPSILA parallel computer and of a fork-and-join and data-parallel computational model approach. This computer consisted of a master (controller processor) and SIMD processors (or vector processor mode as proposed by Flynn). In Auguin’s SPMD model,

Was released in 2005. Up to version 2.0, OpenMP primarily specified ways to parallelize highly regular loops, as they occur in matrix-oriented numerical programming, where the number of iterations of the loop is known at entry time. This was recognized as a limitation, and various task-parallel extensions were added to implementations. In 2005, an effort to standardize task parallelism was formed, which published
