NVM Express (NVMe) or Non-Volatile Memory Host Controller Interface Specification (NVMHCIS) is an open, logical-device interface specification for accessing a computer's non-volatile storage media, usually attached via the PCI Express bus. The initialism NVM stands for non-volatile memory, which is often NAND flash memory that comes in several physical form factors, including solid-state drives (SSDs), PCIe add-in cards, and M.2 cards, the successor to mSATA cards. NVM Express, as a logical-device interface, has been designed to capitalize on the low latency and internal parallelism of solid-state storage devices.
Architecturally, the logic for NVMe is stored within and executed by the NVMe controller chip, which is physically co-located with the storage media, usually an SSD. Version changes for NVMe, e.g., 1.3 to 1.4, are incorporated within the storage media and do not affect PCIe-compatible components such as motherboards and CPUs. By its design, NVM Express allows host hardware and software to fully exploit
A PCIe 2.0 or 3.0 interface. An HHHL NVMe solid-state drive card is easy to insert into a PCIe slot of a server. SATA Express allows the use of two PCI Express 2.0 or 3.0 lanes and two SATA 3.0 (6 Gbit/s) ports through the same host-side SATA Express connector (but not both at the same time). SATA Express supports NVMe as the logical device interface for attached PCI Express storage devices. It
A variable that is shared between them. Without synchronization, the instructions between the two threads may be interleaved in any order. For example, consider the program sketched below: if instruction 1B is executed between 1A and 3A, or if instruction 1A is executed between 1B and 3B, the program will produce incorrect data. This is known as a race condition. The programmer must use a lock to provide mutual exclusion. A lock
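The program itself did not survive extraction. The following is a minimal C sketch (pthreads, iteration counts, and variable names are assumed) in which each thread reads the shared variable V (instructions 1A/1B), increments the value (2A/2B), and writes it back (3A/3B):

    #include <pthread.h>
    #include <stdio.h>

    static int V = 0; /* shared variable V */

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            int tmp = V;   /* 1: read V          */
            tmp = tmp + 1; /* 2: add 1 to value  */
            V = tmp;       /* 3: write back to V */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        /* Expected 200000; usually prints less, because increments are
         * lost whenever the two read-modify-write sequences interleave
         * (e.g. 1B executes between 1A and 3A). */
        printf("V = %d\n", V);
        return 0;
    }

Compiled with -pthread, the lost updates demonstrate the race condition described above.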
A vector of buffers. Scatter/gather refers to the process of gathering data from, or scattering data into, the given set of buffers. Vectored I/O can operate synchronously or asynchronously. The main reasons for using vectored I/O are efficiency and convenience. Vectored I/O has several potential uses: Standards bodies document the applicable functions readv and writev in POSIX 1003.1-2001 and
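As a minimal sketch of the gather side (not from the original text; the buffer contents and file descriptor are arbitrary), a single writev call writes two separate buffers as one sequential stream:

    #include <string.h>
    #include <sys/uio.h>
    #include <unistd.h>

    int main(void) {
        char hdr[]  = "Hello, ";
        char body[] = "vectored I/O!\n";

        /* Each iovec names one buffer; writev() writes them in order,
         * as if they had been concatenated into a single buffer. */
        struct iovec iov[2];
        iov[0].iov_base = hdr;  iov[0].iov_len = strlen(hdr);
        iov[1].iov_base = body; iov[1].iov_len = strlen(body);

        ssize_t n = writev(STDOUT_FILENO, iov, 2);
        return n < 0 ? 1 : 0;
    }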
A combination of parallelism and concurrency characteristics. Parallel computers can be roughly classified according to the level at which the hardware supports parallelism, with multi-core and multi-processor computers having multiple processing elements within a single machine, while clusters, MPPs, and grids use multiple computers to work on the same task. Specialized parallel computer architectures are sometimes used alongside traditional processors, for accelerating specific tasks. In some cases parallelism
A long chain of dependent calculations; there are usually opportunities to execute independent calculations in parallel. Let P_i and P_j be two program segments. Bernstein's conditions describe when the two are independent and can be executed in parallel. For P_i, let I_i be all of the input variables and O_i the output variables, and likewise for P_j. P_i and P_j are independent if they satisfy the three conditions given below. Violation of
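The conditions themselves did not survive extraction; in their standard form, the three Bernstein conditions are:

    \[ I_j \cap O_i = \varnothing, \qquad I_i \cap O_j = \varnothing, \qquad O_i \cap O_j = \varnothing \]

That is, neither segment reads anything the other writes, and the two segments write to disjoint locations.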
A more realistic assessment of the parallel performance. Understanding data dependencies is fundamental in implementing parallel algorithms. No program can run more quickly than the longest chain of dependent calculations (known as the critical path), since calculations that depend upon prior calculations in the chain must be executed in order. However, most algorithms do not consist of just
A multi-core processor can issue multiple instructions per clock cycle from multiple instruction streams. IBM's Cell microprocessor, designed for use in the Sony PlayStation 3, is a prominent multi-core processor. Each core in a multi-core processor can potentially be superscalar as well—that is, on every clock cycle, each core can issue multiple instructions from one thread. Simultaneous multithreading (of which Intel's Hyper-Threading
A new standard for accessing non-volatile memory emerged at the Intel Developer Forum 2007, when NVMHCI was shown as the host-side protocol of a proposed architectural design that had the Open NAND Flash Interface Working Group (ONFI) on the memory (flash) chips side. An NVMHCI working group led by Intel was formed that year. The NVMHCI 1.0 specification was completed in April 2008 and released on Intel's web site. Technical work on NVMe began in
A node), or n-dimensional mesh. Parallel computers based on interconnected networks need to have some kind of routing to enable the passing of messages between nodes that are not directly connected. The medium used for communication between the processors is likely to be hierarchical in large multiprocessor machines. Parallel computers can be roughly classified according to the level at which
A pipelined processor is a RISC processor, with five stages: instruction fetch (IF), instruction decode (ID), execute (EX), memory access (MEM), and register write back (WB). The Pentium 4 processor had a 35-stage pipeline. Most modern processors also have multiple execution units. They usually combine this feature with pipelining and thus can issue more than one instruction per clock cycle (IPC > 1). These processors are known as superscalar processors. Superscalar processors differ from multi-core processors in that
A request and data transfer, where data speeds are much slower than RAM speeds, and where disk rotation and seek time give rise to further optimization requirements. NVM Express devices are chiefly available in the form of standard-sized PCI Express expansion cards and as 2.5-inch form-factor devices that provide a four-lane PCI Express interface through the U.2 connector (formerly known as SFF-8639). Storage devices using SATA Express and
A result, shared memory computer architectures do not scale as well as distributed memory systems do. Processor–processor and processor–memory communication can be implemented in hardware in several ways, including via shared (either multiported or multiplexed) memory, a crossbar switch, a shared bus or an interconnect network of a myriad of topologies including star, ring, tree, hypercube, fat hypercube (a hypercube with more than one processor at
A single address space), or distributed memory (in which each processing element has its own local address space). Distributed memory refers to the fact that the memory is logically distributed, but often implies that it is physically distributed as well. Distributed shared memory and memory virtualization combine the two approaches, where the processing element has its own local memory and access to
A sufficient amount of memory bandwidth exists. A distributed computer (also known as a distributed memory multiprocessor) is a distributed memory computer system in which the processing elements are connected by a network. Distributed computers are highly scalable. The terms "concurrent computing", "parallel computing", and "distributed computing" have a lot of overlap, and no clear distinction exists between them. The same system may be characterized both as "parallel" and "distributed";
A time from multiple threads. A symmetric multiprocessor (SMP) is a computer system with multiple identical processors that share memory and connect via a bus. Bus contention prevents bus architectures from scaling. As a result, SMPs generally do not comprise more than 32 processors. Because of the small size of the processors and the significant reduction in the requirements for bus bandwidth achieved by large caches, such symmetric multiprocessors are extremely cost-effective, provided that
Is a programming language construct that allows one thread to take control of a variable and prevent other threads from reading or writing it, until that variable is unlocked. The thread holding the lock is free to execute its critical section (the section of a program that requires exclusive access to some variable), and to unlock the data when it is finished. Therefore, to guarantee correct program execution,
Is achieved by balancing enhancements to both parallelizable and non-parallelizable components of a task. Furthermore, it reveals that increasing the number of processors yields diminishing returns, with negligible speedup gains beyond a certain point. Amdahl's law has limitations, including its assumption of a fixed workload, its neglect of inter-process communication and synchronization overheads, and its focus primarily on the computational aspect while ignoring extrinsic factors such as data persistence, I/O operations, and memory access overheads. Gustafson's law and the Universal Scalability Law give
Is constructed and implemented as a serial stream of instructions. These instructions are executed on a central processing unit on one computer. Only one instruction may execute at a time—after that instruction is finished, the next one is executed. Parallel computing, on the other hand, uses multiple processing elements simultaneously to solve a problem. This is accomplished by breaking the problem into independent parts so that each processing element can execute its part of
Is directed by a thirteen-member board of directors selected from the Promoter Group, which includes Cisco, Dell, EMC, HGST, Intel, Micron, Microsoft, NetApp, Oracle, PMC, Samsung, SanDisk and Seagate. In September 2016, the CompactFlash Association announced that it would be releasing a new memory card specification, CFexpress, which uses NVMe. The NVMe Host Memory Buffer (HMB) feature was added in version 1.2 of
Is electrically compatible with MultiLink SAS, so a backplane can support both at the same time. U.2, formerly known as SFF-8639, uses the same physical port as SATA Express but allows up to four PCI Express lanes. Available servers can combine up to 48 U.2 NVMe solid-state drives. U.3 (SFF-TA-1001) is built on the U.2 spec and uses the same SFF-8639 connector. Unlike in U.2, a single "tri-mode" (PCIe/SATA/SAS) backplane receptacle can handle all three types of connections;
Is equivalent to an entirely sequential program. The single-instruction-multiple-data (SIMD) classification is analogous to doing the same operation repeatedly over a large data set. This is commonly done in signal processing applications. Multiple-instruction-single-data (MISD) is a rarely used classification. While computer architectures to deal with this were devised (such as systolic arrays), few applications that fit this class materialized. Multiple-instruction-multiple-data (MIMD) programs are by far
Is expected that future revisions will significantly enhance namespace management. Because of its feature focus, NVMe 1.1 was initially called "Enterprise NVMHCI". An update to the base NVMe specification, called version 1.0e, was released in January 2013. In June 2011, a Promoter Group led by seven companies was formed. The first commercially available NVMe chipsets were released by Integrated Device Technology (89HF16P04AG3 and 89HF32P08AG3) in August 2012. The first NVMe drive, Samsung's XS1715 enterprise drive,
Is known as a burst buffer, which is typically built from arrays of non-volatile memory physically distributed across multiple I/O nodes. Computer architectures in which each element of main memory can be accessed with equal latency and bandwidth are known as uniform memory access (UMA) systems. Typically, that can be achieved only by a shared memory system, in which the memory is not physically distributed. A system that does not have this property
Is known as a non-uniform memory access (NUMA) architecture. Distributed memory systems have non-uniform memory access. Computer systems make use of caches—small and fast memories located close to the processor which store temporary copies of memory values (nearby in both the physical and logical sense). Parallel computer systems have difficulties with caches that may store the same value in more than one location, with
Is necessary, such as semaphores, barriers or some other synchronization method. Subtasks in a parallel program are often called threads. Some parallel computer architectures use smaller, lightweight versions of threads known as fibers, while others use bigger versions known as processes. However, "threads" is generally accepted as a generic term for subtasks. Threads will often need synchronized access to an object or other resource, for example when they must update
Is similar to how USB mass storage devices are built to follow the USB mass-storage device class specification and work with all computers, with no per-device drivers needed. NVM Express devices are also used as the building block of the burst buffer storage in many leading supercomputers, such as the Fugaku, Summit, and Sierra supercomputers. The first details of
Is the best known) was an early form of pseudo-multi-coreism. A processor capable of concurrent multithreading includes multiple execution units in the same processing unit—that is, it has a superscalar architecture—and can issue multiple instructions per clock cycle from multiple threads. Temporal multithreading, on the other hand, includes a single execution unit in the same processing unit and can issue one instruction at
Is the characteristic of a parallel program that "entirely different calculations can be performed on either the same or different sets of data". This contrasts with data parallelism, where the same calculation is performed on the same or different sets of data. Task parallelism involves the decomposition of a task into sub-tasks and then allocating each sub-task to a processor for execution. The processors would then execute these sub-tasks concurrently and often cooperatively. Task parallelism does not usually scale with
Is the computing unit of the processor, and in multi-core processors each core is independent and can access the same memory concurrently. Multi-core processors have brought parallel computing to desktop computers. Thus parallelization of serial programs has become a mainstream programming task. In 2012 quad-core processors became standard for desktop computers, while servers had 10+ core processors. From Moore's law it can be predicted that
Is transparent to the programmer, such as in bit-level or instruction-level parallelism, but explicitly parallel algorithms, particularly those that use concurrency, are more difficult to write than sequential ones, because concurrency introduces several new classes of potential software bugs, of which race conditions are the most common. Communication and synchronization between the different subtasks are typically some of
The M.2 specification which support NVM Express as the logical-device interface are a popular use-case for NVMe and have become the dominant form of solid-state storage for servers, desktops, and laptops alike. Specifications for NVMe released to date include: Historically, most SSDs used buses such as SATA, SAS, or Fibre Channel for interfacing with the rest of a computer system. Since SSDs became available in mass markets, SATA has become
The Single UNIX Specification version 2. The Windows API has analogous functions ReadFileScatter and WriteFileGather; however, unlike the POSIX functions, they require the alignment of each buffer on a memory page. Winsock provides separate WSASend and WSARecv functions without this requirement. While working directly with a vector of buffers can be significantly harder than working with
The speedup from parallelization would be linear—doubling the number of processing elements should halve the runtime, and doubling it a second time should again halve the runtime. However, very few parallel algorithms achieve optimal speedup. Most of them have a near-linear speedup for small numbers of processing elements, which flattens out into a constant value for large numbers of processing elements. The maximum potential speedup of an overall system can be calculated by Amdahl's law. Amdahl's law indicates that optimal performance improvement
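The formula is not quoted in the surviving text; in its common form, with p the parallelizable fraction of the program and N the number of processing elements, Amdahl's law gives the speedup

    \[ S(N) = \frac{1}{(1 - p) + \frac{p}{N}}, \qquad \lim_{N \to \infty} S(N) = \frac{1}{1 - p} \]

For example, a program whose parallelizable fraction is 95% can never run more than 20 times faster (1/0.05), no matter how many processing elements are added.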
The 1970s until about 1986, speed-up in computer architecture was driven by doubling computer word size—the amount of information the processor can manipulate per cycle. Increasing the word size reduces the number of instructions the processor must execute to perform an operation on variables whose sizes are greater than the length of the word. For example, where an 8-bit processor must add two 16-bit integers,
The Intel SSD data center family that interfaces with the host through the PCI Express bus, which includes the DC P3700 series, the DC P3600 series, and the DC P3500 series. As of November 2014, NVMe drives are commercially available. In March 2014, the group incorporated to become NVM Express, Inc., which as of November 2014 consists of more than 65 companies from across the industry. NVM Express specifications are owned and maintained by NVM Express, Inc., which also promotes industry awareness of NVM Express as an industry-wide standard. NVM Express, Inc.
The M.2 connector are PCI Express 3.0 or higher (up to four lanes). NVM Express over Fabrics (NVMe-oF) is the concept of using a transport protocol over a network to connect remote NVMe devices, contrary to regular NVMe where physical NVMe devices are connected to a PCIe bus either directly or over a PCIe switch. In August 2017, a standard for using NVMe over Fibre Channel (FC)
The NVMe and AHCI logical-device interfaces. The nvmecontrol tool is used to control an NVMe disk from the command line on FreeBSD; it was added in FreeBSD 9.2. The nvme-cli package provides the corresponding NVM Express user-space tooling for Linux.
Parallel computing
Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at
The NVMe specification. HMB allows SSDs to utilize the host's DRAM, which can improve I/O performance for DRAM-less SSDs; for example, the SSD controller can use the HMB to cache the FTL table. NVMe 2.0 added the optional Zoned Namespaces (ZNS) and Key-Value (KV) features, and support for rotating media such as hard drives. ZNS and KV allow data to be mapped directly to its physical location in flash memory, giving direct access to data on an SSD, and can also decrease the write amplification of flash media. There are many form factors of NVMe solid-state drive, such as AIC, U.2, U.3, and M.2. Almost all early NVMe solid-state drives were HHHL (half height, half length) or FHHL (full height, half length) add-in cards (AIC), with
The above program can be rewritten to use locks, as in the sketch below: one thread will successfully lock variable V, while the other thread will be locked out—unable to proceed until V is unlocked again. This guarantees correct execution of the program. Locks may be necessary to ensure correct program execution when threads must serialize access to resources, but their use can greatly slow a program and may affect its reliability. Locking multiple variables using non-atomic locks introduces
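The rewritten program did not survive extraction either; here is a minimal C sketch with a pthreads mutex standing in for the lock (names assumed, matching the race-condition sketch earlier):

    #include <pthread.h>

    static int V = 0;
    static pthread_mutex_t V_lock = PTHREAD_MUTEX_INITIALIZER;

    static void *increment(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; i++) {
            pthread_mutex_lock(&V_lock);   /* lock variable V      */
            V = V + 1;                     /* critical section     */
            pthread_mutex_unlock(&V_lock); /* unlock when finished */
        }
        return NULL;
    }

    int main(void) {
        pthread_t a, b;
        pthread_create(&a, NULL, increment, NULL);
        pthread_create(&b, NULL, increment, NULL);
        pthread_join(a, NULL);
        pthread_join(b, NULL);
        return V == 200000 ? 0 : 1; /* now deterministic */
    }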
The algorithm simultaneously with the others. The processing elements can be diverse and include resources such as a single computer with multiple processors, several networked computers, specialized hardware, or any combination of the above. Historically, parallel computing was used for scientific computing and the simulation of scientific problems, particularly in the natural and engineering sciences, such as meteorology. This led to
The amount of power used in a processor. Increasing processor power consumption led ultimately to Intel's May 8, 2004 cancellation of its Tejas and Jayhawk processors, which is generally cited as the end of frequency scaling as the dominant computer architecture paradigm. To deal with the problem of power consumption and overheating, the major central processing unit (CPU or processor) manufacturers started to produce power-efficient processors with multiple cores. The core
The available cores. However, for a serial software program to take full advantage of the multi-core architecture, the programmer needs to restructure and parallelize the code. A speed-up of application software runtime will no longer be achieved through frequency scaling; instead, programmers will need to parallelize their software code to take advantage of the increasing computing power of multicore architectures.
Main article: Amdahl's law
Optimally,
The average time it takes to execute an instruction. An increase in frequency thus decreases runtime for all compute-bound programs. However, power consumption P by a chip is given by the equation P = C × V² × F, where C is the capacitance being switched per clock cycle (proportional to the number of transistors whose inputs change), V is voltage, and F is the processor frequency (cycles per second). Increases in frequency increase
The benefit of wide software compatibility, but has the downside of not delivering optimal performance when used with SSDs connected via the PCI Express bus. As a logical-device interface, AHCI was developed when the purpose of a host bus adapter (HBA) in a system was to connect the CPU/memory subsystem with a much slower storage subsystem based on rotating magnetic media. As a result, AHCI introduces certain inefficiencies when used with SSD devices, which behave much more like RAM than like spinning media. The NVMe device interface has been designed from
The controller automatically detects the type of connection used. This is unlike U.2, where users need to use separate controllers for SATA/SAS and NVMe. U.3 devices are required to be backwards-compatible with U.2 hosts, and U.2 devices can be used with U.3 hosts. M.2, formerly known as the Next Generation Form Factor (NGFF), is a specification for internally mounted expansion cards that is widely used for NVMe solid-state drives. Interfaces provided through
The costs associated with merging data from multiple processes. Specifically, inter-process communication and synchronization can lead to overheads that are substantially higher—often by two or more orders of magnitude—compared to processing the same data on a single thread. Therefore, the overall improvement should be carefully evaluated. From the advent of very-large-scale integration (VLSI) computer-chip fabrication technology in
The design of parallel hardware and software, as well as high performance computing. Frequency scaling was the dominant reason for improvements in computer performance from the mid-1980s until 2004. The runtime of a program is equal to the number of instructions multiplied by the average time per instruction. Maintaining everything else constant, increasing the clock frequency decreases
The dominant paradigm in computer architecture, mainly in the form of multi-core processors. In computer science, parallelism and concurrency are two different things: a parallel program uses multiple CPU cores, each core performing a task independently. On the other hand, concurrency enables a program to deal with multiple tasks even on a single CPU core; the core switches between tasks (i.e. threads) without necessarily completing each one. A program can have both, neither or
The easiest to parallelize. Michael J. Flynn created one of the earliest classification systems for parallel (and sequential) computers and programs, now known as Flynn's taxonomy. Flynn classified programs and computers by whether they were operating using a single set or multiple sets of instructions, and whether or not those instructions were using a single set or multiple sets of data. The single-instruction-single-data (SISD) classification
The first condition introduces a flow dependency, corresponding to the first segment producing a result used by the second segment. The second condition represents an anti-dependency, when the second segment produces a variable needed by the first segment. The third and final condition represents an output dependency: when two segments write to the same location, the result comes from the logically last executed segment. Consider
The following functions, which demonstrate several kinds of dependencies (a reconstructed sketch appears below): in the first function, instruction 3 cannot be executed before (or even in parallel with) instruction 2, because instruction 3 uses a result from instruction 2. It violates condition 1, and thus introduces a flow dependency. In the second function, there are no dependencies between the instructions, so they can all be run in parallel. Bernstein's conditions do not allow memory to be shared between different processes. For that, some means of enforcing an ordering between accesses
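A hypothetical reconstruction of such functions in C (names and bodies assumed):

    /* Flow dependency: the second statement (instruction 3) consumes the
     * result of the first (instruction 2), violating condition 1, so the
     * two must run in order. */
    void dep(int a, int b, int *out) {
        int c = a * b;  /* instruction 2: produces c */
        *out = c + 1;   /* instruction 3: consumes c */
    }

    /* No dependencies: the statements read and write disjoint variables,
     * so they could all execute in parallel. */
    void no_dep(int a, int b, int *x, int *y, int *z) {
        *x = a + b;
        *y = a - b;
        *z = a * b;
    }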
The greatest obstacles to getting optimal parallel program performance. A theoretical upper bound on the speed-up of a single program as a result of parallelization is given by Amdahl's law, which states that it is limited by the fraction of time for which the parallelization can be utilised. Traditionally, computer software has been written for serial computation. To solve a problem, an algorithm
The ground up, capitalizing on the lower latency and parallelism of PCI Express SSDs, and complementing the parallelism of contemporary CPUs, platforms and applications. At a high level, the basic advantages of NVMe over AHCI relate to its ability to exploit parallelism in host hardware and software, manifested by the differences in command queue depths, efficiency of interrupt processing, the number of uncacheable register accesses, etc., resulting in various performance improvements. The table below summarizes high-level differences between
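The table itself did not survive extraction. The figures below are reproduced from the commonly published AHCI-versus-NVMe comparison and should be treated as approximate rather than authoritative:

    Maximum queue depth:        AHCI: 1 command queue, 32 commands per queue; NVMe: 65,535 queues, 65,536 commands per queue
    Uncacheable register reads: AHCI: 6 per non-queued command, 9 per queued command; NVMe: 2 per command
    Interrupts:                 AHCI: a single interrupt, no steering; NVMe: up to 2,048 MSI-X interrupts with steering
    Parallelism and threads:    AHCI: a synchronization lock is required to issue a command; NVMe: no locking, per-core queues
    4 KB command efficiency:    AHCI: command parameters require two serialized host DRAM fetches; NVMe: parameters arrive in one 64-byte fetch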
The hardware supports parallelism. This classification is broadly analogous to the distance between basic computing nodes. These are not mutually exclusive; for example, clusters of symmetric multiprocessors are relatively common. A multi-core processor is a processor that includes multiple processing units (called "cores") on the same chip. This processor differs from a superscalar processor, which includes multiple execution units and can issue multiple instructions per clock cycle from one instruction stream (thread); in contrast,
The introduction of 32-bit processors, which were a standard in general-purpose computing for two decades. Not until the early 2000s, with the advent of x86-64 architectures, did 64-bit processors become commonplace. A computer program is, in essence, a stream of instructions executed by a processor. Without instruction-level parallelism, a processor can issue less than one instruction per clock cycle (IPC < 1). These processors are known as subscalar processors. These instructions can be re-ordered and combined into groups which are then executed in parallel without changing
The levels of parallelism possible in modern SSDs. As a result, NVM Express reduces I/O overhead and brings various performance improvements relative to previous logical-device interfaces, including multiple long command queues and reduced latency. Previous interface protocols like AHCI were developed for use with far slower hard disk drives (HDDs), where a very lengthy delay (relative to CPU operations) exists between
The maximum throughput of SATA. High-end SSDs had been made using the PCI Express bus before NVMe, but they used non-standard interfaces or emulated a hardware RAID controller. By standardizing the interface of SSDs, operating systems need only one common device driver to work with all SSDs adhering to the specification. It also means that each SSD manufacturer does not have to design specific interface drivers. This
The memory on non-local processors. Accesses to local memory are typically faster than accesses to non-local memory. On supercomputers, the distributed shared memory space can be implemented using a programming model such as PGAS. This model allows processes on one compute node to transparently access the remote memory of another compute node. All compute nodes are also connected to an external shared memory system via a high-speed interconnect such as InfiniBand; this external shared memory system
The most common type of parallel programs. According to David A. Patterson and John L. Hennessy, "Some machines are hybrids of these categories, of course, but this classic model has survived because it is simple, easy to understand, and gives a good first approximation. It is also—perhaps because of its understandability—the most widely used scheme." Parallel computing can incur significant overhead in practice, primarily due to
The most typical way for connecting SSDs in personal computers; however, SATA was designed primarily for interfacing with mechanical hard disk drives (HDDs), and it became increasingly inadequate for SSDs, which improved in speed over time. For example, within about five years of mass market mainstream adoption (2005–2010) many SSDs were already held back by the comparatively slow data rates available for hard drives—unlike hard disk drives, some SSDs are limited by
The number of cores per processor will double every 18–24 months. This could mean that after 2020 a typical processor will have dozens or hundreds of cores; however, in reality the standard is somewhere in the region of 4 to 16 cores, with some designs having a mix of performance and efficiency cores (such as ARM's big.LITTLE design) due to thermal and design constraints. An operating system can ensure that different tasks and user programs are run in parallel on
The overhead from resource contention or communication dominates the time spent on other computation, further parallelization (that is, splitting the workload over even more threads) increases rather than decreases the amount of time required to finish. This problem, known as parallel slowdown, can be improved in some cases by software analysis and redesign. Applications are often classified according to how often their subtasks need to synchronize or communicate with each other. An application exhibits fine-grained parallelism if its subtasks must communicate many times per second; it exhibits coarse-grained parallelism if they do not communicate many times per second, and it exhibits embarrassing parallelism if they rarely or never have to communicate. Embarrassingly parallel applications are considered
The possibility of incorrect program execution. These computers require a cache coherency system, which keeps track of cached values and strategically purges them, thus ensuring correct program execution. Bus snooping is one of the most common methods for keeping track of which values are being accessed (and thus should be purged). Designing large, high-performance cache coherence systems is a very difficult problem in computer architecture. As
The possibility of program deadlock. An atomic lock locks multiple variables all at once. If it cannot lock all of them, it does not lock any of them. If two threads each need to lock the same two variables using non-atomic locks, it is possible that one thread will lock one of them and the second thread will lock the second variable. In such a case, neither thread can complete, and deadlock results. Many parallel programs require that their subtasks act in synchrony. This requires
The processor must first add the 8 lower-order bits from each integer using the standard addition instruction, then add the 8 higher-order bits using an add-with-carry instruction and the carry bit from the lower-order addition; thus, an 8-bit processor requires two instructions to complete a single operation, where a 16-bit processor would be able to complete the operation with a single instruction. Historically, 4-bit microprocessors were replaced with 8-bit, then 16-bit, then 32-bit microprocessors. This trend generally came to an end with
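As an illustrative sketch (not from the original), the two-instruction sequence can be modeled in C with explicit 8-bit operations and carry propagation:

    #include <stdint.h>
    #include <stdio.h>

    /* Add two 16-bit integers the way an 8-bit ALU must: a plain ADD on
     * the low-order bytes, then an ADD-with-carry on the high-order bytes. */
    static uint16_t add16_on_8bit(uint16_t a, uint16_t b) {
        uint8_t a_lo  = (uint8_t)a, b_lo = (uint8_t)b;
        uint8_t lo    = a_lo + b_lo;  /* ADD: low bytes, may wrap around  */
        uint8_t carry = lo < a_lo;    /* carry out of the low-byte add    */
        uint8_t hi    = (uint8_t)(a >> 8) + (uint8_t)(b >> 8) + carry; /* ADC */
        return (uint16_t)((hi << 8) | lo);
    }

    int main(void) {
        printf("%u\n", add16_on_8bit(0x01FF, 0x0001)); /* prints 512 */
        return 0;
    }

A 16-bit processor performs the same addition in a single instruction.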
The processors in a typical distributed system run concurrently in parallel.
Scatter-gather
In computing, vectored I/O, also known as scatter/gather I/O, is a method of input and output by which a single procedure call sequentially reads data from multiple buffers and writes it to a single data stream (gather), or reads data from a data stream and writes it to multiple buffers (scatter), as defined in
The result of the program. This is known as instruction-level parallelism. Advances in instruction-level parallelism dominated computer architecture from the mid-1980s until the mid-1990s. All modern processors have multi-stage instruction pipelines. Each stage in the pipeline corresponds to a different action the processor performs on that instruction in that stage; a processor with an N-stage pipeline can have up to N different instructions at different stages of completion and thus can issue one instruction per clock cycle (IPC = 1). These processors are known as scalar processors. The canonical example of
The same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption (and consequently heat generation) by computers has become a concern in recent years, parallel computing has become
The second half of 2009. The NVMe specifications were developed by the NVM Express Workgroup, which consists of more than 90 companies; Amber Huffman of Intel was the working group's chair. Version 1.0 of the specification was released on 1 March 2011, while version 1.1 of the specification was released on 11 October 2012. Major features added in version 1.1 are multi-path I/O (with namespace sharing) and arbitrary-length scatter-gather I/O. It
The several execution units are not entire processors (i.e. processing units). Instructions can be grouped together only if there is no data dependency between them. Scoreboarding and the Tomasulo algorithm (which is similar to scoreboarding but makes use of register renaming) are two of the most common techniques for implementing out-of-order execution and instruction-level parallelism. Task parallelism
The size of a problem. Superword level parallelism is a vectorization technique based on loop unrolling and basic block vectorization. It is distinct from loop vectorization algorithms in that it can exploit parallelism of inline code, such as manipulating coordinates, color channels or in loops unrolled by hand. Main memory in a parallel computer is either shared memory (shared between all processing elements in
The use of a barrier. Barriers are typically implemented using a lock or a semaphore. One class of algorithms, known as lock-free and wait-free algorithms, altogether avoids the use of locks and barriers. However, this approach is generally difficult to implement and requires correctly designed data structures. Not all parallelization results in speed-up. Generally, as a task is split up into more and more threads, those threads spend an ever-increasing portion of their time communicating with each other or waiting on each other for access to resources. Once
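A minimal C sketch of a barrier (POSIX pthread_barrier assumed; the two-phase scenario is hypothetical): no thread proceeds to the second phase until every thread has finished the first.

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4
    static pthread_barrier_t barrier;

    static void *worker(void *arg) {
        long id = (long)arg;
        printf("thread %ld finished phase 1\n", id);
        /* Block until all NTHREADS threads have arrived here. */
        pthread_barrier_wait(&barrier);
        printf("thread %ld starting phase 2\n", id);
        return NULL;
    }

    int main(void) {
        pthread_t t[NTHREADS];
        pthread_barrier_init(&barrier, NULL, NTHREADS);
        for (long i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        pthread_barrier_destroy(&barrier);
        return 0;
    }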
Was announced in July 2013; according to Samsung, this drive supported 3 GB/s read speeds, six times faster than their previous enterprise offerings. The LSI SandForce SF3700 controller family, released in November 2013, also supports NVMe. A Kingston HyperX "prosumer" product using this controller was showcased at the Consumer Electronics Show 2014 and promised similar performance. In June 2014, Intel announced their first NVM Express products,
Was submitted by the standards organization InterNational Committee for Information Technology Standards (INCITS), and this combination is often referred to as FC-NVMe or sometimes NVMe/FC. As of May 2021, supported NVMe transport protocols are: The standard for NVMe over Fabrics was published by NVM Express, Inc. in 2016. The following software implements the NVMe-oF protocol: The Advanced Host Controller Interface (AHCI) has