The IBM XIV Storage System was a line of cabinet-size disk storage servers . The system is a collection of modules , each of which is an independent computer with its own memory , interconnections, disk drives, and other subcomponents, laid out in a grid and connected together in parallel using either InfiniBand (third generation systems) or Ethernet (second generation systems) connections. Each module has an x86 CPU and runs a software platform consisting largely of a modified Linux kernel and other open source software .
59-432: Traditional storage systems distribute a volume across a subset of disk drives in a clustered fashion. The XIV storage system distributes volumes across all modules in 1 MiB chunks ( partitions ) so that all of the modules' resources are used evenly. For robustness, each logical partition is stored in at least two copies on separate modules, so that if a part of a disk drive, an entire disk drive, or an entire module fails,
118-400: A challenge. In a heterogeneous CPU-GPU cluster with a complex application environment, the performance of each job depends on the characteristics of the underlying cluster. Therefore, mapping tasks onto CPU cores and GPU devices provides significant challenges. This is an area of ongoing research; algorithms that combine and extend MapReduce and Hadoop have been proposed and studied. When
177-643: A cluster requires parallel language primitives and suitable tools such as those discussed by the High Performance Debugging Forum (HPDF) which resulted in the HPD specifications. Tools such as TotalView were then developed to debug parallel implementations on computer clusters which use Message Passing Interface (MPI) or Parallel Virtual Machine (PVM) for message passing. The University of California, Berkeley Network of Workstations (NOW) system gathers cluster data and stores them in
236-570: A cluster was the Burroughs B5700 in the mid-1960s. This allowed up to four computers, each with either one or two processors, to be tightly coupled to a common disk storage subsystem in order to distribute the workload. Unlike standard multiprocessor systems, each computer could be restarted without disrupting overall operation. The first commercial loosely coupled clustering product was Datapoint Corporation's "Attached Resource Computer" (ARC) system, developed in 1977, and using ARCnet as
295-488: A database, while a system such as PARMON, developed in India, allows visually observing and managing large clusters. Application checkpointing can be used to restore a given state of the system when a node fails during a long multi-node computation. This is essential in large clusters, given that as the number of nodes increases, so does the likelihood of node failure under heavy computational loads. Checkpointing can restore
354-409: A dedicated network, is densely located, and probably has homogeneous nodes. The other extreme is where a computer job uses one or few nodes, and needs little or no inter-node communication, approaching grid computing . In a Beowulf cluster , the application programs never see the computational nodes (also called slave computers) but only interact with the "Master" which is a specific computer handling
413-664: A group is creating a make-work system to justify extra funding, rather than providing a low-cost system which meets the basic needs, regardless of the use of COTS products. Applying the lessons of processor obsolescence learned during the Lockheed Martin F-22 Raptor , the Lockheed Martin F-35 Lightning II planned for processor upgrades during development, and switched to the more widely supported C++ programming language. They have also moved from ASICs to FPGAs . This moves more of
472-609: A handful of nodes to some of the fastest supercomputers in the world such as IBM's Sequoia . Prior to the advent of clusters, single-unit fault tolerant mainframes with modular redundancy were employed; but the lower upfront cost of clusters, and increased speed of network fabric has favoured the adoption of clusters. In contrast to high-reliability mainframes, clusters are cheaper to scale out, but also have increased complexity in error handling, as in clusters error modes are not opaque to running programs. The desire to get more computing power and better reliability by orchestrating
531-504: A high-availability approach, etc. " Load-balancing " clusters are configurations in which cluster-nodes share computational workload to provide better overall performance. For example, a web server cluster may assign different queries to different nodes, so the overall response time will be optimized. However, approaches to load-balancing may significantly differ among applications, e.g. a high-performance cluster used for scientific computations would balance load with different algorithms from
590-685: A large and shared file server that stores global persistent data, accessed by the slaves as needed. A special purpose 144-node DEGIMA cluster is tuned to running astrophysical N-body simulations using the Multiple-Walk parallel tree code, rather than general purpose scientific computations. Due to the increasing computing power of each generation of game consoles , a novel use has emerged where they are repurposed into High-performance computing (HPC) clusters. Some examples of game console clusters are Sony PlayStation clusters and Microsoft Xbox clusters. Another example of consumer game product
649-402: A large number of computers clustered together, this lends itself to the use of distributed file systems and RAID , both of which can increase the reliability and speed of a cluster. One of the issues in designing a cluster is how tightly coupled the individual nodes may be. For instance, a single computer job may require frequent communication among nodes: this implies that the cluster shares
SECTION 10
#1732801395404708-840: A major threat. Gartner predicts that "enterprise IT supply chains will be targeted and compromised, forcing changes in the structure of the IT marketplace and how IT will be managed moving forward". Also, the SANS Institute published a survey of 700 IT and security professionals in December 2012 that found that only 14% of companies perform security reviews on every commercial application brought in house, and over half of other companies do not perform security assessments. Instead companies either rely on vendor reputation (25%) and legal liability agreements (14%) or they have no policies for dealing with COTS at all and therefore have limited visibility into
767-455: A node in a cluster fails, strategies such as " fencing " may be employed to keep the rest of the system operational. Fencing is the process of isolating a node or protecting shared resources when a node appears to be malfunctioning. There are two classes of fencing methods; one disables a node itself, and the other disallows access to resources such as shared disks. The STONITH method stands for "Shoot The Other Node In The Head", meaning that
826-428: A number of low-cost commercial off-the-shelf computers has given rise to a variety of architectures and configurations. The computer clustering approach usually (but not always) connects a number of readily available computing nodes (e.g. personal computers used as servers) via a fast local area network . The activities of the computing nodes are orchestrated by "clustering middleware", a software layer that sits atop
885-551: A reduction in initial cost and development time over an increase in software component-integration work, dependency on the vendor , security issues and incompatibilities from future changes. COTS software and services are built and delivered usually from a third party vendor. COTS can be purchased, leased or even licensed to the general public. COTS can be obtained and operated at a lower cost over in-house development, and provide increased reliability and quality over custom-built software as these are developed by specialists within
944-455: A replacement system. Such obsolescence problems have led to government-industry partnerships, where various businesses agree to stabilize some product versions for government use and plan some future features, in those product lines, as a joint effort. Hence, some partnerships have led to complaints of favoritism, to avoiding competitive procurement practices, and to claims of the use of sole-source agreements where not actually needed. There
1003-427: A run-time environment for message-passing, task and resource management, and fault notification. PVM can be used by user programs written in C, C++, or Fortran, etc. MPI emerged in the early 1990s out of discussions among 40 organizations. The initial effort was supported by ARPA and National Science Foundation . Rather than starting anew, the design of MPI drew on various features available in commercial systems of
1062-440: A simple two-node system which just connects two personal computers, or may be a very fast supercomputer . A basic approach to building a cluster is that of a Beowulf cluster which may be built with a few personal computers to produce a cost-effective alternative to traditional high-performance computing . An early project that showed the viability of the concept was the 133-node Stone Soupercomputer . The developers used Linux ,
1121-451: A single computer, while typically being much more cost-effective than single computers of comparable speed or availability. Computer clusters emerged as a result of the convergence of a number of computing trends including the availability of low-cost microprocessors, high-speed networks, and software for high-performance distributed computing . They have a wide range of applicability and deployment, ranging from small business clusters with
1180-467: A small number of users need to take advantage of the parallel processing capabilities of the cluster and partition "the same computation" among several nodes. Automatic parallelization of programs remains a technical challenge, but parallel programming models can be used to effectuate a higher degree of parallelism via the simultaneous execution of separate portions of a program on different processors. Developing and debugging parallel programs on
1239-565: A web-server cluster which may just use a simple round-robin method by assigning each new request to a different node. Computer clusters are used for computation-intensive purposes, rather than handling IO-oriented operations such as web service or databases. For instance, a computer cluster might support computational simulations of vehicle crashes or weather. Very tightly coupled computer clusters are designed for work that may approach " supercomputing ". " High-availability clusters " (also known as failover clusters, or HA clusters) improve
SECTION 20
#17328013954041298-571: Is cloud computing . The components of a cluster are usually connected to each other through fast local area networks , with each node (computer used as a server) running its own instance of an operating system . In most circumstances, all of the nodes use the same hardware and the same operating system, although in some setups (e.g. using Open Source Cluster Application Resources (OSCAR)), different operating systems can be used on each computer, or different hardware. Clusters are usually deployed to improve performance and availability over that of
1357-423: Is a COTS software provider. Goods and construction materials may qualify as COTS but bulk cargo does not. Services associated with the commercial items may also qualify as COTS, including installation services, training services, and cloud services. COTS purchases are alternatives to custom software or one-off developments – government-funded developments or otherwise. Although COTS products can be used out of
1416-428: Is a software package that can be installed on operating systems including Microsoft Windows , Linux and Mac OS. An XIV Mobile Dashboard available for Android or iOS . The IBM XIV Storage System was developed in 2002 by an Israeli start-up company funded and headed by engineer and businessman Moshe Yanai . They delivered their first system to a customer in 2005. Their product was called Nextra. In December 2007,
1475-420: Is also the danger of pre-purchasing a multi-decade supply of replacement parts (and materials) which would become obsolete within 10 years. All these considerations lead to compare a simple solution (such as "paper & pencil") to avoid overly complex solutions creating a " Rube Goldberg " system of creeping featurism , where a simple solution would have sufficed instead. Such comparisons also consider whether
1534-513: Is integrated or networked with other software products to create a new composite application or a system of systems. The composite application can inherit risks from its COTS components. The US Department of Homeland Security has sponsored efforts to manage supply chain cyber security issues related to the use of COTS. However, software industry observers such as Gartner and the SANS Institute indicate that supply chain disruption poses
1593-657: Is the Nvidia Tesla Personal Supercomputer workstation, which uses multiple graphics accelerator processor chips. Besides game consoles, high-end graphics cards too can be used instead. The use of graphics cards (or rather their GPU's) to do calculations for grid computing is vastly more economical than using CPU's, despite being less precise. However, when using double-precision values, they become as precise to work with as CPU's and are still much less costly (purchase cost). Computer clusters have historically run on separate physical computers with
1652-804: The Tandem NonStop (a 1976 high-availability commercial product) and the IBM S/390 Parallel Sysplex (circa 1994, primarily for business use). Within the same time frame, while computer clusters used parallelism outside the computer on a commodity network, supercomputers began to use them within the same computer. Following the success of the CDC 6600 in 1964, the Cray 1 was delivered in 1976, and introduced internal parallelism via vector processing . While early supercomputers excluded clusters and relied on shared memory , in time some of
1711-775: The Enabling Grids for E-sciencE (EGEE) project. slurm is also used to schedule and manage some of the largest supercomputer clusters (see top500 list). Although most computer clusters are permanent fixtures, attempts at flash mob computing have been made to build short-lived clusters for specific computations. However, larger-scale volunteer computing systems such as BOINC -based systems have had more followers. Basic concepts Distributed computing Specific systems Commercial off-the-shelf Commercial-off-the-shelf or commercially available off-the-shelf ( COTS ) products are packaged or canned (ready-made) hardware or software, which are adapted aftermarket to
1770-510: The Linux operating system. Clusters are primarily designed with performance in mind, but installations are based on many other factors. Fault tolerance ( the ability of a system to continue operating despite a malfunctioning node ) enables scalability , and in high-performance situations, allows for a low frequency of maintenance routines, resource consolidation (e.g., RAID ), and centralized management. Advantages include enabling data recovery in
1829-598: The Oracle Cluster File System . Two widely used approaches for communication between cluster nodes are MPI ( Message Passing Interface ) and PVM ( Parallel Virtual Machine ). PVM was developed at the Oak Ridge National Laboratory around 1989 before MPI was available. PVM must be directly installed on every cluster node and provides a set of software libraries that paint the node as a "parallel virtual machine". PVM provides
IBM XIV Storage System - Misplaced Pages Continue
1888-587: The Parallel Virtual Machine toolkit and the Message Passing Interface library to achieve high performance at a relatively low cost. Although a cluster may consist of just a few personal computers connected by a simple network, the cluster architecture may also be used to achieve very high levels of performance. The TOP500 organization's semiannual list of the 500 fastest supercomputers often includes many clusters, e.g.
1947-481: The kernel that provide for automatic process migration among homogeneous nodes. OpenSSI , openMosix and Kerrighed are single-system image implementations. Microsoft Windows computer cluster Server 2003 based on the Windows Server platform provides pieces for high-performance computing like the job scheduler, MSMPI library and management tools. gLite is a set of middleware technologies created by
2006-561: The 1980s, so were supercomputers . One of the elements that distinguished the three classes at that time was that the early supercomputers relied on shared memory . Clusters do not typically use physically shared memory, while many supercomputer architectures have also abandoned it. However, the use of a clustered file system is essential in modern computer clusters. Examples include the IBM General Parallel File System , Microsoft's Cluster Shared Volumes or
2065-512: The COTS product. The use of COTS has been mandated across many government and business programs, as such products may offer significant savings in procurement, development, and maintenance. Motivations for using COTS components include hopes for reduction system whole of life costs. In the 1990s, many regarded COTS as extremely effective in reducing the time and cost of software development . COTS software came with many not-so-obvious tradeoffs –
2124-451: The GNBD server. Load balancing clusters such as web servers use cluster architectures to support a large number of users and typically each user request is routed to a specific node, achieving task parallelism without multi-node cooperation, given that the main goal of the system is providing rapid user access to shared data. However, "computer clusters" which perform complex computations for
2183-745: The IBM Corporation acquired XIV, renaming the product the IBM XIV Storage System. The first IBM version of the product was launched publicly on September 8, 2008. Unofficially within IBM this product is called Generation 2 of the XIV. The differences between Gen1 and Gen2 were not architectural, they were mainly physical. New disks were introduced, new controllers, new interconnects, improved management, additional software functions. In September 2011, IBM announced larger disk drives, changing
2242-438: The availability of the cluster approach. They operate by having redundant nodes , which are then used to provide service when system components fail. HA cluster implementations attempt to use redundancy of cluster components to eliminate single points of failure . There are commercial implementations of High-Availability clusters for many operating systems. The Linux-HA project is one commonly used free software HA package for
2301-458: The box, in practice the COTS product must be configured to achieve the needs of the business and integrated to existing organizational systems. Extending the functionality of COTS products via custom development is also an option, however this decision should be carefully considered due to the long term support and maintenance implications. Such customized functionality is not supported by the COTS vendor, so brings its own sets of issues when upgrading
2360-488: The challenges in the use of a computer cluster is the cost of administrating it which can at times be as high as the cost of administrating N independent machines, if the cluster has N nodes. In some cases this provides an advantage to shared memory architectures with lower administration costs. This has also made virtual machines popular, due to the ease of administration. When a large multi-user cluster needs to access very large amounts of data, task scheduling becomes
2419-522: The cluster interface. Clustering per se did not really take off until Digital Equipment Corporation released their VAXcluster product in 1984 for the VMS operating system. The ARC and VAXcluster products not only supported parallel computing , but also shared file systems and peripheral devices. The idea was to provide the advantages of parallel processing, while maintaining data reliability and uniqueness. Two other noteworthy early commercial clusters were
IBM XIV Storage System - Misplaced Pages Continue
2478-409: The cluster. This property of computer clusters can allow for larger computational loads to be executed by a larger number of lower performing computers. When adding a new node to a cluster, reliability increases because the entire cluster does not need to be taken down. A single node can be taken down for maintenance, while the rest of the cluster takes on the load of that individual node. If you have
2537-691: The data is still available. One can increase the system storage capacity by adding additional modules. When one adds a module, the system automatically redistributes previously stored data to make optimal use of its I/O capacity. Depending on the model and disk type chosen when the machine is ordered, one system can be configured for storage capacity from 27 TB to 324 TB. The XIV software features include remote mirroring, thin provisioning , quality of service controls, LDAP authentication support, VMware support, differential, writable snapshots, online volume migration between two XIV systems and encryption protecting data at rest . The IBM XIV management GUI
2596-480: The device itself if the steps are not taken to ensure fair and safe standards are complied with. The standard IEC 62304:2006 "Medical device software – Software life cycle processes" outlines specific practices to ensure that SOUP components support the safety requirements for the device being developed. In the case where the software components are COTS, DHS best practices for COTS software risk review can be applied. Simply being COTS software does not necessarily imply
2655-415: The event of a disaster and providing parallel data processing and high processing capacity. In terms of scalability, clusters provide this in their ability to add nodes horizontally. This means that more computers may be added to the cluster, to improve its performance, redundancy and fault tolerance. This can be an inexpensive solution for a higher performing cluster compared to scaling up a single node in
2714-432: The fastest supercomputers (e.g. the K computer ) relied on cluster architectures. Computer clusters may be configured for different purposes ranging from general purpose business needs such as web-service support, to computation-intensive scientific calculations. In either case, the cluster may use a high-availability approach. Note that the attributes described below are not exclusive and a "computer cluster" may also use
2773-500: The industry and are validated by various independent organizations, often over an extended period of time. According to the United States Department of Homeland Security , software security is a serious risk of using COTS software. If the COTS software contains severe security vulnerabilities it can introduce significant risk into an organization's software supply chain . The risks are compounded when COTS software
2832-464: The inter-connectivity layer to use InfiniBand rather than Ethernet. In 2012-2013 IBM added the support of SSD devices and 10GbE host connectivity. Computer cluster A computer cluster is a set of computers that work together so that they can be viewed as a single system. Unlike grid computers , computer clusters have each node set to perform the same task, controlled and scheduled by software. The newest manifestation of cluster computing
2891-617: The lack of a fault history or transparent software development process. For well documented COTS software a distinction as clear SOUP is made, meaning that it may be used in medical devices. A striking example of product obsolescence are PlayStation 3 clusters , which used Linux to operate. Sony disabled the use of Linux on the PS3 in April 2010, leaving no means to procure functioning Linux replacement units . In general, COTS product obsolescence can require customized support or development of
2950-562: The needs of the purchasing organization, rather than the commissioning of custom-made, or bespoke , solutions. A related term, Mil-COTS , refers to COTS products for use by the U.S. military. In the context of the U.S. government , the Federal Acquisition Regulation (FAR) has defined "COTS" as a formal term for commercial items, including services, available in the commercial marketplace that can be bought and used under government contract. For example, Microsoft
3009-447: The nodes and allows the users to treat the cluster as by and large one cohesive computing unit, e.g. via a single system image concept. Computer clustering relies on a centralized management approach which makes the nodes available as orchestrated shared servers. It is distinct from other approaches such as peer-to-peer or grid computing which also use many nodes, but with a far more distributed nature . A computer cluster may be
SECTION 50
#17328013954043068-445: The risks introduced into their software supply chain by COTS. In the medical device industry, COTS software can sometimes be identified as SOUP ( software of unknown pedigree or software of unknown provenance), i.e., software that has not been developed with a known software development process or methodology, which precludes its use in medical devices. In this industry, faults in software components could become system failures in
3127-442: The same operating system . With the advent of virtualization , the cluster nodes may run on separate physical computers with different operating systems which are painted above with a virtual layer to look similar. The cluster may also be virtualized on various configurations as maintenance takes place; an example implementation is Xen as the virtualization manager with Linux-HA . As the computer clusters were appearing during
3186-465: The scheduling and management of the slaves. In a typical implementation the Master has two network interfaces, one that communicates with the private Beowulf network for the slaves, the other for the general purpose network of the organization. The slave computers typically have their own version of the same operating system, and local memory and disk space. However, the private slave network may also have
3245-475: The suspected node is disabled or powered off. For instance, power fencing uses a power controller to turn off an inoperable node. The resources fencing approach disallows access to resources without powering off the node. This may include persistent reservation fencing via the SCSI3 , fibre channel fencing to disable the fibre channel port, or global network block device (GNBD) fencing to disable access to
3304-446: The system to a stable state so that processing can resume without needing to recompute results. The Linux world supports various cluster software; for application clustering, there is distcc , and MPICH . Linux Virtual Server , Linux-HA – director-based clusters that allow incoming requests for services to be distributed across multiple cluster nodes. MOSIX , LinuxPMI , Kerrighed , OpenSSI are full-blown clusters integrated into
3363-456: The time. The MPI specifications then gave rise to specific implementations. MPI implementations typically use TCP/IP and socket connections. MPI is now a widely available communications model that enables parallel programs to be written in languages such as C , Fortran , Python , etc. Thus, unlike PVM which provides a concrete implementation, MPI is a specification which has been implemented in systems such as MPICH and Open MPI . One of
3422-433: The world's fastest machine in 2011 was the K computer which has a distributed memory , cluster architecture. Greg Pfister has stated that clusters were not invented by any specific vendor but by customers who could not fit all their work on one computer, or needed a backup. Pfister estimates the date as some time in the 1960s. The formal engineering basis of cluster computing as a means of doing parallel work of any sort
3481-445: Was arguably invented by Gene Amdahl of IBM , who in 1967 published what has come to be regarded as the seminal paper on parallel processing: Amdahl's Law . The history of early computer clusters is more or less directly tied to the history of early networks, as one of the primary motivations for the development of a network was to link computing resources, creating a de facto computer cluster. The first production system designed as
#403596