
Availability


In reliability engineering, the term availability has several related meanings, described below.


Normally, high availability systems might be specified as 99.98%, 99.999% or 99.9996%. The simplest representation of availability ($A$) is a ratio of the expected value of the uptime of a system to the aggregate of the expected values of up and down time (which together form the "total amount of time" $C$ of the observation window):

$$A = \frac{E[\text{uptime}]}{E[\text{uptime}]+E[\text{downtime}]} = \frac{E[\text{uptime}]}{C}\;.$$

Another equation for availability ($A$) is a ratio of the Mean Time To Failure (MTTF) and Mean Time Between Failure (MTBF), treating MTBF as the full failure-to-failure cycle (uptime plus mean downtime), or

$$A = \frac{\text{MTTF}}{\text{MTBF}} = \frac{\text{MTTF}}{\text{MTTF}+\text{MDT}}\;.$$

If we define
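
A quick numerical sketch of these two formulations (the helper names and the example figures are ours, not from the article):

```python
# Availability from expected up/down time, and from MTTF with mean down time (MDT).
def availability_from_times(uptime_hours: float, downtime_hours: float) -> float:
    """A = E[uptime] / (E[uptime] + E[downtime])."""
    return uptime_hours / (uptime_hours + downtime_hours)

def availability_from_mttf(mttf_hours: float, mdt_hours: float) -> float:
    """A = MTTF / MTBF, taking MTBF = MTTF + MDT."""
    return mttf_hours / (mttf_hours + mdt_hours)

# Illustrative numbers: 8,750 h up and 10 h down over one observation window.
print(availability_from_times(8750, 10))                    # ~0.99886
print(availability_from_mttf(mttf_hours=1000, mdt_hours=2)) # ~0.998
```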

A system administrator, but its services do not appear "up" to the end user or customer. The subject of the terms is thus important here: whether the focus of a discussion is the server hardware, server OS, functional service, software service/process, or similar, it is only if there is a single, consistent subject of the discussion that the words uptime and availability can be used synonymously. A simple mnemonic rule states that 5 nines allows approximately 5 minutes of downtime per year. Variants can be derived by multiplying or dividing by 10: 4 nines
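
The mnemonic can be checked directly: for $n$ nines the unavailability is $10^{-n}$, so the allowed downtime is about $365.25 \times 24 \times 60 \times 10^{-n}$ minutes per year. A minimal sketch (the function name is ours):

```python
# Downtime per year implied by an "n nines" availability figure.
MINUTES_PER_YEAR = 365.25 * 24 * 60

def downtime_minutes_per_year(nines: int) -> float:
    unavailability = 10.0 ** (-nines)
    return MINUTES_PER_YEAR * unavailability

for n in (3, 4, 5, 6):
    print(n, "nines ->", round(downtime_minutes_per_year(n), 2), "min/year")
# 5 nines -> ~5.26 min/year (the "about 5 minutes" rule);
# each additional nine divides the allowance by 10.
```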

A constant failure rate $\lambda$ implies that $T$ has an exponential distribution with parameter $\lambda$. Since the MTBF is the expected value of $T$, it is given by the reciprocal of the failure rate of the system,

$$\text{MTBF} = \frac{1}{\lambda}\;.$$

Once

A dangerous condition. It can be calculated as follows:

$$\text{MTTF}_{d} = \frac{B_{10d}}{0.1\, n_{\text{op}}}\;,$$

where $B_{10}$ is the number of operations that a device will operate before 10% of a sample of those devices would fail, and $n_{\text{op}}$ is the number of operations. $B_{10d}$ is the same calculation, but where 10% of the sample would fail to danger. $n_{\text{op}}$ is the number of operations/cycles in one year. In fact, an MTBF counting only failures with at least some systems still operating that have not yet failed underestimates
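
A sketch of that calculation, assuming the usual reading that dividing B10d by 10% of the yearly cycle count yields MTTFd in years (names and figures are illustrative):

```python
# MTTFd from B10d and the yearly number of operations, as described above.
def mttfd_years(b10d_cycles: float, n_op_per_year: float) -> float:
    """b10d_cycles: operations until 10% of a sample fails dangerously.
    n_op_per_year: operations (cycles) performed in one year."""
    return b10d_cycles / (0.1 * n_op_per_year)

# Illustrative: a component rated B10d = 2,000,000 cycles, cycled 100,000 times/year.
print(mttfd_years(2_000_000, 100_000))  # 200 years
```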

A fault indicator activates. Failure is only significant if this occurs during a mission-critical period. Modeling and simulation is used to evaluate the theoretical reliability for large systems. The outcome of this kind of model is used to evaluate different design options. A model of the entire system is created, and the model is stressed by removing components. Redundancy simulation involves

A given point in time when used in an actual or realistic operating and support environment. It includes logistics time, ready time, and waiting or administrative downtime, and both preventive and corrective maintenance downtime. This value is equal to the mean time between failure (MTBF) divided by the mean time between failure plus the mean downtime (MDT). This measure extends the definition of availability to elements controlled by

A half nines", but this is incorrect: a 5 is only a factor of 2, while a 9 is a factor of 10, so a 5 is 0.3 nines (per the formula below: $\log_{10}2\approx 0.3$): 99.95% availability is 3.3 nines, not 3.5 nines. More simply, going from 99.9% availability to 99.95% availability is a factor of 2 (0.1% to 0.05% unavailability), but going from 99.95% to 99.99% availability

A hardware or software failure or environmental anomaly. Examples of unscheduled downtime events include power outages, failed CPU or RAM components (or possibly other failed hardware components), an over-temperature related shutdown, logically or physically severed network connections, security breaches, or various application, middleware, and operating system failures. If users can be warned away from scheduled downtimes, then

A mechanical or electronic system during normal system operation. MTBF can be calculated as the arithmetic mean (average) time between failures of a system. The term is used for repairable systems, while mean time to failure (MTTF) denotes the expected time to failure for a non-repairable system. The definition of MTBF depends on the definition of what is considered a failure. For complex, repairable systems, failures are considered to be those out of design conditions which place

A more proactive maintenance approach. This synergy allows for the identification of patterns and potential failures before they occur, enabling preventive maintenance and reducing unplanned downtime. As a result, MTBF becomes a key performance indicator (KPI) within TPM, guiding decisions on maintenance schedules, spare parts inventory, and ultimately, optimizing the lifespan and efficiency of machinery. This strategic use of MTBF within TPM frameworks enhances overall production efficiency, reduces costs associated with breakdowns, and contributes to

A product's MTBF according to various methods and standards (MIL-HDBK-217F, Telcordia SR332, Siemens SN 29500, FIDES, UTE 80-810 (RDF2000), etc.). The MIL-HDBK-217 reliability calculator manual in combination with RelCalc software (or other comparable tool) enables MTBF reliability rates to be predicted based on design. A concept which is closely related to MTBF, and is important in the computations involving MTBF,



A quantitative identity between working and failed units. Since MTBF can be expressed as "average life (expectancy)", many engineers assume that 50% of items will have failed by time t = MTBF. This inaccuracy can lead to bad design decisions. Furthermore, probabilistic failure prediction based on MTBF implies the total absence of systematic failures (i.e., a constant failure rate with only intrinsic, random failures), which

A series component is composed of components A, B and C. Then the following formula applies:

Availability of series component = (availability of component A) × (availability of component B) × (availability of component C)

Therefore, the combined availability of multiple components in a series is always lower than the availability of the individual components. On the other hand, the following formula applies to parallel components:

Availability of parallel components = 1 − (1 − availability of component A) × (1 − availability of component B) × (1 − availability of component C)

As a corollary, if you have N parallel components each having availability X, then:

Availability of parallel components = 1 − (1 − X)^N

Using parallel components can exponentially increase
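
A short sketch of these two combination rules (function names and figures are ours):

```python
# Combined availability of components in series and in parallel.
from math import prod

def series_availability(availabilities):
    """Every component must be up, so the individual availabilities multiply."""
    return prod(availabilities)

def parallel_availability(availabilities):
    """The system is down only if every component is down at the same time."""
    return 1 - prod(1 - a for a in availabilities)

components = [0.99, 0.995, 0.98]          # illustrative figures
print(series_availability(components))    # ~0.9653, lower than any single component
print(parallel_availability(components))  # ~0.999999, higher than any single component
print(parallel_availability([0.5] * 10))  # ~0.9990234, the 1 - (1 - X)**N case
```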

A series system with replacement and repair, Iyer [1992] for imperfect repair models, Murdock [1995] for age replacement preventive maintenance models, Nachlas [1998, 1989] for preventive maintenance models, and Wang and Pham [1996] for imperfect maintenance models. A very comprehensive recent book is by Trivedi and Bobbio [2017]. Availability factor is used extensively in power plant engineering. For example,

A system or a functional failure condition within a system, taking into account many factors. Furthermore, these methods are capable of identifying the most critical items and the failure modes or events that impact availability.

Availability, inherent (Ai) The probability that an item will operate satisfactorily at a given point in time when used under stated conditions in an ideal support environment. It excludes logistics time, waiting or administrative downtime, and preventive maintenance downtime. It includes corrective maintenance downtime. Inherent availability

A system out of two serial components can be calculated as

$$\text{mdt}(c_1;c_2) = \frac{\text{mtbf}(c_1)\times\text{mdt}(c_2)+\text{mtbf}(c_2)\times\text{mdt}(c_1)}{\text{mtbf}(c_1)+\text{mtbf}(c_2)}\;,$$

and for a system out of two parallel components MDT can be calculated as

$$\text{mdt}(c_1\parallel c_2) = \frac{\text{mdt}(c_1)\times\text{mdt}(c_2)}{\text{mdt}(c_1)+\text{mdt}(c_2)}\;.$$

Through successive application of these four formulae, the MTBF and MDT of any network of repairable components can be computed, provided that the MTBF and MDT is known for each component. In a special but all-important case of several serial components, the MTBF calculation can be easily generalised into

$$\text{mtbf}(c_1;\dots;c_n) = \left(\sum_{k=1}^{n}\frac{1}{\text{mtbf}(c_k)}\right)^{-1},$$

which can be shown by induction, and likewise

$$\text{mdt}(c_1\parallel\dots\parallel c_n) = \left(\sum_{k=1}^{n}\frac{1}{\text{mdt}(c_k)}\right)^{-1},$$

since
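
A sketch of the two-component network rules, under the article's assumption that MDTs are small compared to MTBFs (the dataclass, names and figures are ours):

```python
# MTBF/MDT of two-component networks of repairable components,
# following the series/parallel formulae above (MDT << MTBF assumed).
from dataclasses import dataclass

@dataclass
class Repairable:
    mtbf: float  # mean time between failures, hours
    mdt: float   # mean down time, hours

def series(a: Repairable, b: Repairable) -> Repairable:
    mtbf = 1.0 / (1.0 / a.mtbf + 1.0 / b.mtbf)
    mdt = (a.mtbf * b.mdt + b.mtbf * a.mdt) / (a.mtbf + b.mtbf)
    return Repairable(mtbf, mdt)

def parallel(a: Repairable, b: Repairable) -> Repairable:
    mtbf = (a.mtbf * b.mtbf) / (a.mdt + b.mdt)
    mdt = (a.mdt * b.mdt) / (a.mdt + b.mdt)
    return Repairable(mtbf, mdt)

c1 = Repairable(mtbf=10_000, mdt=8)   # illustrative figures
c2 = Repairable(mtbf=20_000, mdt=12)
print(series(c1, c2))    # MTBF ~6,667 h, MDT ~9.3 h
print(parallel(c1, c2))  # MTBF 10,000,000 h, MDT 4.8 h
```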

A true availability measure is holistic. Availability must be measured to be determined, ideally with comprehensive monitoring tools ("instrumentation") that are themselves highly available. If there is a lack of instrumentation, systems supporting high volume transaction processing throughout the day and night, such as credit card processing systems or telephone switches, are often inherently better monitored, at least by

A voting scheme. This is used with complex computing systems that are linked. Internet routing is derived from early work by Birman and Joseph in this area. Active redundancy may introduce more complex failure modes into a system, such as continuous system reconfiguration due to faulty voting logic. Zero downtime system design means that modeling and simulation indicates mean time between failures significantly exceeds

is mean time to recovery (MTTR). Recovery time could be infinite with certain system designs and failures, i.e. full recovery is impossible. One such example is a fire or flood that destroys a data center and its systems when there is no secondary disaster recovery data center. Another related concept is data availability, that is, the degree to which databases and other information storage systems faithfully record and report system transactions. Information management often focuses separately on data availability, or Recovery Point Objective, in order to determine acceptable (or actual) data loss with various failure events. Some users can tolerate application service interruptions but cannot tolerate data loss. A service level agreement ("SLA") formalizes an organization's availability objectives and requirements. High availability

is 50 minutes and 3 nines is 500 minutes. In the opposite direction, 6 nines is 0.5 minutes (30 sec) and 7 nines is 3 seconds. Another memory trick to calculate the allowed downtime duration for an "$n$-nines" availability percentage is to use the formula $8.64\times 10^{4-n}$ seconds per day. For example, 90% ("one nine") yields
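
The per-day rule follows because a day has 86,400 seconds, so an unavailability of $10^{-n}$ allows $86400\times 10^{-n} = 8.64\times 10^{4-n}$ seconds of downtime per day. A quick check (names are ours):

```python
# Allowed downtime per day for an "n nines" availability figure.
SECONDS_PER_DAY = 86_400

def downtime_seconds_per_day(nines: int) -> float:
    return SECONDS_PER_DAY * 10.0 ** (-nines)   # equals 8.64e4 * 10**-n

for n in (1, 3, 4, 5):
    print(n, "nines ->", downtime_seconds_per_day(n), "s/day")
# 1 nine -> 8640 s/day, 4 nines -> 8.64 s/day, 5 nines -> 0.864 s/day
```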

is a factor of 5 (0.05% to 0.01% unavailability), over twice as much. A formulation of the class of 9s $c$ based on a system's unavailability $x$ would be

$$c = \left\lfloor -\log_{10} x \right\rfloor$$

(cf. Floor and ceiling functions). A similar measurement is sometimes used to describe the purity of substances. In general, the number of nines



is a result of maintenance that is disruptive to system operation and usually cannot be avoided with a currently installed system design. Scheduled downtime events might include patches to system software that require a reboot or system configuration changes that only take effect upon a reboot. In general, scheduled downtime is usually the result of some logical, management-initiated event. Unscheduled downtime events typically arise from some physical event, such as

is contingent on the various mechanisms for downtime, such as the inherent availability, achieved availability, and operational availability (Blanchard [1998]; Lie, Hwang, and Tillman [1977]). Mi [1998] gives some comparison results of availability considering inherent availability. Availability considered in maintenance modeling can be found in Barlow and Proschan [1975] for replacement models, Fawzi and Hawkes [1991] for an R-out-of-N system with spares and repairs, Fawzi and Hawkes [1990] for

is generally derived from analysis of an engineering design: it is based on quantities under control of the designer.

Availability, achieved (Aa) The probability that an item will operate satisfactorily at a given point in time when used under stated conditions in an ideal support environment (i.e., that personnel, tools, spares, etc. are instantaneously available). It excludes logistics time and waiting or administrative downtime. It includes active preventive and corrective maintenance downtime.

Availability, operational (Ao) The probability that an item will operate satisfactorily at

is not an actual goal, but rather a sarcastic reference to something totally failing to meet any reasonable target. Availability measurement is subject to some degree of interpretation. A system that has been up for 365 days in a non-leap year might have been eclipsed by a network failure that lasted for 9 hours during a peak usage period; the user community will see the system as unavailable, whereas

is not considered to be a failure unless the resulting performance decline exceeds the specification limits for the entire system. Active redundancy is used in complex systems to achieve high availability with no performance decline. Multiple items of the same kind are incorporated into a design that includes a method to detect failure and automatically reconfigure the system to bypass failed items using

is not easy to verify. Assuming no systematic errors, the probability that the system survives a duration $T$ is calculated as $\exp(-T/\text{MTBF})$. Hence the probability that a system fails during a duration $T$ is given by $1-\exp(-T/\text{MTBF})$. MTBF value prediction is an important element in the development of products. Reliability engineers and design engineers often use reliability software to calculate
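
A small sketch of these two expressions (names and figures are ours); note that at $T=\text{MTBF}$ the survival probability is $1/e\approx 37\%$, as discussed elsewhere in the article:

```python
# Survival and failure probabilities over a duration T, assuming a constant failure rate.
import math

def prob_survives(t_hours: float, mtbf_hours: float) -> float:
    return math.exp(-t_hours / mtbf_hours)

def prob_fails(t_hours: float, mtbf_hours: float) -> float:
    return 1.0 - prob_survives(t_hours, mtbf_hours)

mtbf = 50_000  # illustrative MTBF in hours
print(prob_survives(8_760, mtbf))  # chance of surviving one year of operation (~0.839)
print(prob_fails(8_760, mtbf))     # ~0.161
print(prob_survives(mtbf, mtbf))   # ~0.368, i.e. 1/e at T = MTBF
```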

is not often used by a network engineer when modeling and measuring availability because it is hard to apply in formulas. More often, the unavailability expressed as a probability (like 0.00001), or a downtime per year, is quoted. Availability specified as a number of nines is often seen in marketing documents. The use of the "nines" has been called into question, since it does not appropriately reflect that

is one of the primary requirements of the control systems in unmanned vehicles and autonomous maritime vessels. If the controlling system becomes unavailable, the Ground Combat Vehicle (GCV) or ASW Continuous Trail Unmanned Vessel (ACTUV) would be lost. On one hand, adding more components to an overall system design can undermine efforts to achieve high availability because complex systems inherently have more potential failure points and are more difficult to implement correctly. While some analysts would put forth

is particularly significant in the context of total productive maintenance (TPM), a comprehensive maintenance strategy aimed at maximizing equipment effectiveness. MTBF provides a quantitative measure of the time elapsed between failures of a system during normal operation, offering insights into the reliability and performance of manufacturing equipment. By integrating MTBF with TPM principles, manufacturers can achieve

is represented by

$$A = \lim_{t\to\infty} A(t)\;.$$

Limiting average availability is also defined on an interval $[0,c]$ as

$$A_{\infty} = \lim_{c\to\infty}\frac{1}{c}\int_{0}^{c} A(t)\,dt\;.$$

Availability is the probability that an item will be in an operable and committable state at the start of a mission when the mission is called for at a random time, and is generally defined as uptime divided by total time (uptime plus downtime). Let's say


is the mean down time (MDT). MDT can be defined as the mean time for which the system is down after a failure. Usually, MDT is considered different from MTTR (Mean Time To Repair); in particular, MDT usually includes organizational and logistical factors (such as business days or waiting for components to arrive) while MTTR is usually understood as more narrow and more technical. MTBF serves as a crucial metric for managing machinery and equipment reliability. Its application

is the network in which the components are arranged in parallel, and $\text{PF}(c,t)$ is the probability of failure of component $c$ during "vulnerability window" $t$. Intuitively, both these formulae can be explained from the point of view of failure probabilities. First of all, let's note that

is the network in which the components are arranged in series. For the network containing parallel repairable components, to find out the MTBF of the whole system, in addition to component MTBFs, it is also necessary to know their respective MDTs. Then, assuming that MDTs are negligible compared to MTBFs (which usually holds in practice), the MTBF for the parallel system consisting of two parallel repairable components can be written as follows:

$$\begin{aligned}{\text{mtbf}}(c_{1}\parallel c_{2})&={\frac {1}{{\frac {1}{{\text{mtbf}}(c_{1})}}\times {\text{PF}}(c_{2},{\text{mdt}}(c_{1}))+{\frac {1}{{\text{mtbf}}(c_{2})}}\times {\text{PF}}(c_{1},{\text{mdt}}(c_{2}))}}\\[1em]&={\frac {1}{{\frac {1}{{\text{mtbf}}(c_{1})}}\times {\frac {{\text{mdt}}(c_{1})}{{\text{mtbf}}(c_{2})}}+{\frac {1}{{\text{mtbf}}(c_{2})}}\times {\frac {{\text{mdt}}(c_{2})}{{\text{mtbf}}(c_{1})}}}}\\[1em]&={\frac {{\text{mtbf}}(c_{1})\times {\text{mtbf}}(c_{2})}{{\text{mdt}}(c_{1})+{\text{mdt}}(c_{2})}}\;,\end{aligned}$$

where $c_{1}\parallel c_{2}$
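
A sketch of the "vulnerability window" argument and the resulting closed form (function names and figures are ours):

```python
# Parallel MTBF of two repairable components: the system fails only if one
# component fails and the other fails while the first is still being repaired.
def parallel_mtbf(mtbf1: float, mdt1: float, mtbf2: float, mdt2: float) -> float:
    # PF(c, t) ~= t / mtbf(c) for windows t much shorter than the MTBF.
    system_failure_rate = (1.0 / mtbf1) * (mdt1 / mtbf2) + (1.0 / mtbf2) * (mdt2 / mtbf1)
    return 1.0 / system_failure_rate  # equals mtbf1 * mtbf2 / (mdt1 + mdt2)

# Illustrative: two servers with 5,000 h and 8,000 h MTBF, 10 h and 6 h MDT.
print(parallel_mtbf(5_000, 10, 8_000, 6))  # 2,500,000 h
print(5_000 * 8_000 / (10 + 6))            # same closed form: 2,500,000 h
```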

is used to achieve high availability by including enough excess capacity in the design to accommodate a performance decline. The simplest example is a boat with two separate engines driving two separate propellers. The boat continues toward its destination despite failure of a single engine or propeller. A more complex example is multiple redundant power generation facilities within a large system involving electric power transmission. Malfunction of single components

is well established in the literature of stochastic modeling and optimal maintenance. Barlow and Proschan [1975] define availability of a repairable system as "the probability that the system is operating at a specified time t." Blanchard [1998] gives a qualitative definition of availability as "a measure of the degree of a system which is in the operable and committable state at the start of mission when

the North American Electric Reliability Corporation implemented the Generating Availability Data System in 1982.

High availability

High availability (HA) is a characteristic of a system that aims to ensure an agreed level of operational performance, usually uptime, for a higher than normal period. There is now more dependence on these systems as a result of modernization. For instance, in order to carry out their regular daily tasks, hospitals and data centers need their systems to be highly available. Availability refers to

the "up time". The difference ("down time" minus "up time") is the amount of time it was operating between these two events. By referring to the figure above, the MTBF of a component is the sum of the lengths of the operational periods divided by the number of observed failures:

$$\text{MTBF} = \frac{\sum{(\text{down time} - \text{up time})}}{\text{number of failures}}\;.$$

In a similar manner, mean down time (MDT) can be defined as

$$\text{MDT} = \frac{\sum{(\text{up time} - \text{down time})}}{\text{number of failures}}\;.$$

The MTBF is the expected value of the random variable $T$ indicating
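
A minimal sketch of these definitions, computing MTBF and MDT from a recorded sequence of up/down timestamps (the log and names are illustrative):

```python
# MTBF and MDT from observed alternating up/down timestamps (hours).
# Each pair (went_up, went_down) is one operational period followed by a repair.
observations = [   # illustrative log for one repairable component
    (0.0, 120.0),     # up at t=0, failed at t=120
    (128.0, 260.0),   # repaired at t=128, failed again at t=260
    (266.0, 400.0),
]

uptimes = [down - up for up, down in observations]
downtimes = [observations[i + 1][0] - observations[i][1]
             for i in range(len(observations) - 1)]

mtbf = sum(uptimes) / len(uptimes)      # mean length of the operational periods
mdt = sum(downtimes) / len(downtimes)   # mean length of the repair periods
print(round(mtbf, 2), mdt)  # 128.67 h between failures, 7.0 h of downtime on average
```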

the MTBF assumes that the system is working within its "useful life period", which is characterized by a relatively constant failure rate (the middle part of the "bathtub curve") when only random failures are occurring. In other words, it is assumed that the system has survived initial setup stresses and has not yet approached its expected end of life, both of which often increase the failure rate. Assuming

the MTBF by failing to include in the computations the partial lifetimes of the systems that have not yet failed. With such lifetimes, all we know is that the time to failure exceeds the time they've been running. This is called censoring. In fact, with a parametric model of the lifetime, the likelihood for the experience on any given day is as follows:

$$L=\prod_{i} f(\tau_i)^{\sigma_i}\, R(\tau_i)^{1-\sigma_i}\;,$$

where $\tau_i$ is the observed running time of unit $i$, and $\sigma_i$ equals 1 if unit $i$ failed and 0 if it is still running (censored). For a constant exponential distribution,

the MTBF of a system is known, and assuming a constant failure rate, the probability that any one particular system will be operational for a given duration can be inferred from the reliability function of the exponential distribution, $R_{T}(t)=e^{-\lambda t}$. In particular,


the N-x criteria. N represents the total number of components in the system. x is the number of components used to stress the system. N-1 means the model is stressed by evaluating performance with all possible combinations where one component is faulted. N-2 means the model is stressed by evaluating performance with all possible combinations where two components are faulted simultaneously. A survey among academic availability experts in 2010 ranked reasons for unavailability of enterprise IT systems. All reasons refer to not following best practice in each of
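
A small sketch of N-x contingency enumeration: every combination of x faulted components is removed in turn and the remaining system is re-evaluated. The model here is deliberately trivial (a capacity threshold) and all names and figures are ours:

```python
# N-x contingency screening: fault every combination of x components and
# re-evaluate a toy system model on what remains.
from itertools import combinations

capacities = {"gen_a": 40, "gen_b": 40, "gen_c": 30, "gen_d": 30}  # illustrative
DEMAND = 70  # the system "survives" a contingency if remaining capacity >= demand

def n_minus_x_failures(x: int):
    failing_combinations = []
    for faulted in combinations(capacities, x):
        remaining = sum(cap for name, cap in capacities.items() if name not in faulted)
        if remaining < DEMAND:
            failing_combinations.append(faulted)
    return failing_combinations

print(n_minus_x_failures(1))  # [] -> the design tolerates any single fault (N-1)
print(n_minus_x_failures(2))  # [('gen_a', 'gen_b')] -> one double fault violates N-2
```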

the ability of the user community to obtain a service or good, access the system, whether to submit new work, update or alter existing work, or collect the results of previous work. If a user cannot access the system, it is – from the user's point of view – unavailable. Generally, the term downtime is used to refer to periods when a system is unavailable. High availability is a property of network resilience,

the ability to "provide and maintain an acceptable level of service in the face of faults and challenges to normal operation." Threats and challenges for services can range from simple misconfiguration over large scale natural disasters to targeted attacks. As such, network resilience touches a very wide range of topics. In order to increase the resilience of a given communication network, the probable challenges and risks have to be identified and appropriate resilience metrics have to be defined for

the allowed downtime is $8.64\times 10^{-1}$ seconds per day. Percentages of a particular order of magnitude are sometimes referred to by the number of nines or "class of nines" in the digits. For example, electricity that is delivered without interruptions (blackouts, brownouts or surges) 99.999% of

the availability of the overall system. For example, if each of your hosts has only 50% availability, by using 10 hosts in parallel you can achieve 99.9023% availability. Note that redundancy doesn't always lead to higher availability. In fact, redundancy increases complexity, which in turn reduces availability. According to Marc Brooker, certain conditions must be met to take advantage of redundancy. Reliability Block Diagrams or Fault Tree Analysis are developed to calculate availability of

the average of the three failure times, which is 116.667 hours. If the systems were non-repairable, then their MTTF would be 116.667 hours. In general, MTBF is the "up-time" between two failure states of a repairable system during operation, as outlined here:

[Figure: alternating operational ("up") and repair ("down") periods of a repairable system]

For each observation, the "down time" is the instantaneous time it went down, which is after (i.e. greater than) the moment it went up,
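
The arithmetic from that three-system example, as a one-liner (illustrative):

```python
# MTBF of the three-system example: the mean of the observed times to failure.
failure_times_hours = [100, 120, 130]
mtbf = sum(failure_times_hours) / len(failure_times_hours)
print(mtbf)  # 116.666... hours
```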

the components. With parallel components the situation is a bit more complicated: the whole system will fail if and only if, after one of the components fails, the other component fails while the first component is being repaired; this is where MDT comes into play: the faster the first component is repaired, the less is the "vulnerability window" for the other component to fail. Using similar logic, MDT for

the continuous improvement of manufacturing processes. Two components $c_1, c_2$ (for instance hard drives, servers, etc.) may be arranged in a network, in series or in parallel. The terminology is here used by close analogy to electrical circuits, but has a slightly different meaning. We say that the two components are in series if

the distinction is useful. But if the requirement is for true high availability, then downtime is downtime whether or not it is scheduled. Many computing sites exclude scheduled downtime from availability calculations, assuming that it has little or no impact upon the computing user community. By doing this, they can claim to have phenomenally high availability, which might give the illusion of continuous availability. Systems that exhibit truly continuous availability are comparatively rare and higher priced, and most have carefully implemented specialty designs that eliminate any single point of failure and allow online hardware, network, operating system, middleware, and application upgrades, patches, and replacements. For certain systems, scheduled downtime does not matter, for example, system downtime at an office building after everybody has gone home for

the exponent $4-1=3$, and therefore the allowed downtime is $8.64\times 10^{3}$ seconds per day. Also, 99.999% ("five nines") gives the exponent $4-5=-1$, and therefore



the failure of either causes the failure of the network, and that they are in parallel if only the failure of both causes the network to fail. The MTBF of the resulting two-component network with repairable components can be computed according to the following formulae, assuming that the MTBF of both individual components is known:

$$\text{mtbf}(c_1;c_2)=\left(\frac{1}{\text{mtbf}(c_1)}+\frac{1}{\text{mtbf}(c_2)}\right)^{-1}=\frac{\text{mtbf}(c_1)\times\text{mtbf}(c_2)}{\text{mtbf}(c_1)+\text{mtbf}(c_2)}\;,$$

where $c_1;c_2$

the failure of the FM radio does not prevent the primary operation of the vehicle. It is recommended to use Mean time to failure (MTTF) instead of MTBF in cases where a system is replaced after a failure ("non-repairable system"), since MTBF denotes time between failures in a system which can be repaired. MTTFd is an extension of MTTF, and is only concerned about failures which would result in

the following areas (in order of importance). A book on the factors themselves was published in 2003. In a 1998 report from IBM Global Services, unavailable systems were estimated to have cost American businesses $4.54 billion in 1996, due to lost productivity and revenues.

Mean time between failures

Mean time between failures (MTBF) is the predicted elapsed time between inherent failures of

the formula for the mdt of two components in parallel is identical to that of the mtbf for two components in series. There are many variations of MTBF, such as mean time between system aborts (MTBSA), mean time between critical failures (MTBCF) or mean time between unscheduled removal (MTBUR). Such nomenclature is used when it is desirable to differentiate among types of failures, such as critical and non-critical failures. For example, in an automobile,

the hazard, $\lambda$, is constant. In this case, the MTBF is

$$\text{MTBF}=\frac{1}{\hat{\lambda}}\;,$$

where $\hat{\lambda}$ is the maximum likelihood estimate of $\lambda$, maximizing the likelihood given above, and $k=\sum\sigma_{i}$ is the number of observed failures.
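
A sketch of this censored estimate, assuming the standard result that for exponential lifetimes $\hat{\lambda}$ is the number of observed failures divided by the total accumulated operating time, including the still-running (censored) units (the data are illustrative):

```python
# MLE of MTBF with censored data: units still running contribute their partial
# lifetimes to the denominator but no failure to the numerator.
records = [  # (hours observed, failed?) -- illustrative fleet data
    (1200.0, True),
    (800.0, True),
    (1500.0, False),   # still running: censored observation
    (2000.0, False),
    (950.0, True),
]

total_time = sum(t for t, _ in records)
k = sum(1 for _, failed in records if failed)   # k = sum of the failure indicators
lambda_hat = k / total_time                     # failures per hour
print(1.0 / lambda_hat)  # MTBF ~2150 h; ignoring the censored units would give ~983 h
```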

the impact of unavailability varies with its time of occurrence. For large amounts of 9s, the "unavailability" index (measure of downtime rather than uptime) is easier to handle. For example, this is why an "unavailability" rather than availability metric is used in hard disk or data link bit error rates. Sometimes the humorous term "nine fives" (55.5555555%) is used to contrast with "five nines" (99.999%), though this

the logisticians and mission planners, such as quantity and proximity of spares, tools and manpower to the hardware item. Refer to Systems engineering for more details. If we are using equipment which has a mean time to failure (MTTF) of 81.5 years and mean time to repair (MTTR) of 1 hour: Outage due to equipment in hours per year = 1/rate = 1/MTTF = 0.01235 hours per year. Availability
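
The availability implied by those figures, sketched directly (the unit conversions are ours):

```python
# Availability for MTTF = 81.5 years and MTTR = 1 hour.
HOURS_PER_YEAR = 365.25 * 24

mttf_hours = 81.5 * HOURS_PER_YEAR   # ~714,000 h
mttr_hours = 1.0

availability = mttf_hours / (mttf_hours + mttr_hours)
outage_hours_per_year = mttr_hours / 81.5   # one 1-hour outage every 81.5 years
print(availability)            # ~0.9999986, i.e. better than "five nines"
print(outage_hours_per_year)   # ~0.0123 hours per year, consistent with the figure above
```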

the longer a system is likely to work before failing. Mean time between failures (MTBF) describes the expected time between two failures for a repairable system. For example, three identical systems starting to function properly at time 0 are working until all of them fail. The first system fails after 100 hours, the second after 120 hours and the third after 130 hours. The MTBF of the systems is

the mission is called for at an unknown random point in time." This definition comes from the MIL-STD-721. Lie, Hwang, and Tillman [1977] developed a complete survey along with a systematic classification of availability. Availability measures are classified by either the time interval of interest or the mechanisms for the system downtime. If the time interval of interest is the primary concern, we consider instantaneous, limiting, average, and limiting average availability. The aforementioned definitions are developed in Barlow and Proschan [1975], Lie, Hwang, and Tillman [1977], and Nachlas [1998]. The second primary classification for availability

the night. Availability is usually expressed as a percentage of uptime in a given year. The following table shows the downtime that will be allowed for a particular percentage of availability, presuming that the system is required to operate continuously. Service level agreements often refer to monthly downtime or availability in order to calculate service credits to match monthly billing cycles. The following table shows



the period of time between planned maintenance, upgrade events, or system lifetime. Zero downtime involves massive redundancy, which is needed for some types of aircraft and for most kinds of communications satellites. The Global Positioning System is an example of a zero downtime system. Fault instrumentation can be used in systems with limited redundancy to achieve high availability. Maintenance actions occur during brief periods of downtime only after

the probability of a system failing within a certain timeframe is the inverse of its MTBF. Then, when considering series of components, failure of any component leads to the failure of the whole system, so (assuming that failure probabilities are small, which is usually the case) the probability of the failure of the whole system within a given interval can be approximated as a sum of failure probabilities of

the probability that a particular system will survive to its MTBF is $1/e$, or about 37% (i.e., it will fail earlier with probability 63%). The MTBF value can be used as a system reliability parameter or to compare different systems or designs. This value should only be understood conditionally as the "mean lifetime" (an average value), and not as

the provisioning of services over the network, instead of the services of the network itself. This may require coordinated response from both the network and from the services running on top of the network. These services include several categories. Resilience and survivability are interchangeably used according to the specific context of a given study. There are three principles of systems design in reliability engineering that can help achieve high availability. A distinction can be made between scheduled and unscheduled downtime. Typically, scheduled downtime

the reason for this being that the most common cause for outages is human error. On the other hand, redundancy is used to create systems with high levels of availability (e.g. popular e-commerce websites). In this case it is required to have high levels of failure detectability and avoidance of common cause failures. If redundant parts are used in parallel and have independent failure (e.g. by not being within

the same data center), they can exponentially increase the availability and make the overall system highly available. If you have N parallel components, each having availability X, then you can use the following formula: Availability of parallel components = 1 − (1 − X)^N. So, for example, if each of your components has only 50% availability, by using 10 components in parallel you can achieve 99.9023% availability. Two kinds of redundancy are passive redundancy and active redundancy. Passive redundancy

the service to be protected. The importance of network resilience is continuously increasing, as communication networks are becoming a fundamental component in the operation of critical infrastructures. Consequently, recent efforts focus on interpreting and improving network and computing resilience with applications to critical infrastructures. As an example, one can consider as a resilience objective

the status function $X(t)$ as

$$X(t)={\begin{cases}1,&{\text{if the system functions at time }}t,\\0,&{\text{otherwise,}}\end{cases}}$$

therefore, the availability $A(t)$ at time $t>0$ is represented by

$$A(t)=\Pr[X(t)=1]=E[X(t)]\;.$$

Average availability must be defined on an interval of the real line. If we consider an arbitrary constant $c>0$, then average availability is represented as

$$A_{c}={\frac {1}{c}}\int _{0}^{c}A(t)\,dt\;.$$

Limiting (or steady-state) availability

the system administrator will claim 100% uptime. However, given the true definition of availability, the system will be approximately 99.9% available, or three nines (8751 hours of available time out of 8760 hours per non-leap year). Also, systems experiencing performance problems are often deemed partially or entirely unavailable by users, even when the systems are continuing to function. Similarly, unavailability of select application functions might go unnoticed by administrators yet be devastating to users –

the system out of service and into a state for repair. Failures which occur that can be left or maintained in an unrepaired condition, and do not place the system out of service, are not considered failures under this definition. In addition, units that are taken down for routine scheduled maintenance or inventory control are not considered within the definition of failure. The higher the MTBF,

the theory that the most highly available systems adhere to a simple architecture (a single, high-quality, multi-purpose physical system with comprehensive internal hardware redundancy), this architecture suffers from the requirement that the entire system must be brought down for patching and operating system upgrades. More advanced system designs allow for systems to be patched and upgraded without compromising service availability (see load balancing and failover). High availability requires less human intervention to restore operation in complex systems;

the time until failure. Thus, it can be written as

$$\text{MTBF}=E[T]=\int_{0}^{\infty}t\,f_{T}(t)\,dt\;,$$

where $f_{T}(t)$ is the probability density function of $T$. Equivalently, the MTBF can be expressed in terms of the reliability function $R_{T}(t)$ as

$$\text{MTBF}=\int_{0}^{\infty}R_{T}(t)\,dt\;.$$

The MTBF and $T$ have units of time (e.g., hours). Any practically-relevant calculation of

the time would have 5 nines reliability, or class five. In particular, the term is used in connection with mainframes or enterprise computing, often as part of a service-level agreement. Similarly, percentages ending in a 5 have conventional names, traditionally the number of nines, then "five", so 99.95% is "three nines five", abbreviated 3N5. This is casually referred to as "three and

the translation from a given availability percentage to the corresponding amount of time a system would be unavailable. The terms uptime and availability are often used interchangeably but do not always refer to the same thing. For example, a system can be "up" with its services not "available" in the case of a network outage. Or a system undergoing software maintenance can be "available" to be worked on by

the users themselves, than systems which experience periodic lulls in demand. An alternative metric is mean time between failures (MTBF). Recovery time (or estimated time of repair (ETR), also known as recovery time objective (RTO)) is closely related to availability, that is, the total time required for a planned outage or the time required to fully recover from an unplanned outage. Another metric
