A few weeks ago, Stratus hosted a webinar with Light Reading titled “Achieving Instantaneous Fault Tolerance for Any Application on Commodity Hardware,” aimed at Telcos and Communications Application Providers. I was pleasantly surprised at the turnout. We had hundreds of people interested in this topic, and here is a brief overview of what we discussed.
Communications networks have always needed high availability and resiliency. As more networking applications such as SDN Controllers and virtualized functions are being deployed on commodity servers rather than proprietary, purpose-built hardware, the need for software-based resiliency and fault tolerance has never been greater. A reliable network depends on its ability to quickly and reliably connect endpoints, transfer data and maintain quality of service (QoS). If the network goes down, even just for a few seconds, many people can be affected. System failure may not only result in loss of revenue for the provider, but it can seriously damage its reputation and/or trigger penalty payments.
Unplanned server and data center outages are expensive, and the cost of downtime is rising. The average cost of unplanned downtime is $7,900 to $11,000 per minute, depending on which study you believe. With average data center downtime of 90 minutes per year, this translates to a cost of about $1M per year, per data center.
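As a quick sanity check of that claim, here is the arithmetic spelled out (figures taken from the post; the script is just illustration):

```python
# Annual downtime cost per data center, using the post's figures.
cost_low, cost_high = 7_900, 11_000   # $ per minute of unplanned downtime
avg_downtime_min = 90                 # average minutes of downtime per year

annual_low = avg_downtime_min * cost_low
annual_high = avg_downtime_min * cost_high
print(annual_low, annual_high)        # 711000 990000 -> "about $1M per year"
```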
A Highly Available (HA) network is one that ensures the network and its services are always on and always accessible (service accessibility), and that active sessions are maintained without disruption (service continuity). Five nines (99.999%) availability is the minimum benchmark, meaning that on average the service is never down for more than about five minutes in a one-year period. While a typical HA of five nines or even six nines (99.9999%) sounds impressive, for maintained QoS it may not be good enough! Let’s look at an example. Consider an application with six nines of availability. At this level of HA, the application will not be down for more than 31.5 seconds a year, which may seem impressive. However, if the application were to fail once a week for just a second and could not return to its original state after a failure, active sessions would likely be disrupted or degraded. So technically a service may still be up, maintaining its HA metrics, but if active customer sessions experience connection disruption or degradation in the form of reconnects, lower throughput, higher latency or reduced functionality, it will likely violate the Service Level Agreement (SLA) and result in significant customer dissatisfaction and penalty consequences for the service provider.
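The downtime budget implied by each "nines" level is simple arithmetic. A minimal sketch (assuming a 365-day year):

```python
# Downtime budget implied by an availability percentage.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes in a 365-day year

def downtime_seconds_per_year(availability_pct):
    """Seconds of allowed downtime per year at a given availability %."""
    return MINUTES_PER_YEAR * 60 * (1 - availability_pct / 100)

print(downtime_seconds_per_year(99.999) / 60)  # five nines: ~5.26 minutes/year
print(downtime_seconds_per_year(99.9999))      # six nines: ~31.5 seconds/year
```

Note that one one-second blip per week (52 seconds/year) already exceeds the six-nines budget, which is the point of the example above.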
So what Telcos and Communications Providers need is more than just five nines or even six nines of availability – they need resilient platforms that can manage faults intelligently and continue service without disruption or degradation in performance, functionality or latency, while maintaining the minimum acceptable levels of service defined in the SLA. And since not all applications require the same levels of resiliency, it is important to manage the Resiliency SLA based on the different types of applications and their requirements. This is the difference between traditional HA solutions and resilient fault-tolerant solutions like everRun from Stratus Technologies.
everRun is a Software Defined Availability (SDA) infrastructure that moves fault management and automatic failover from the applications to software infrastructure. This provides fully automated and complete fault tolerance for all applications, which includes fault detection, localization, isolation, service restoration, redundancy restoration, and, if desired, state replication – all without requiring application code change and with dynamic levels of resiliency. This means any application can be instantaneously deployed with high resiliency, multiple levels of state protection and ultra-fast service restoration speed – on commercial off-the-shelf (COTS) hardware in any network, without the complexity, time consuming effort and risk associated with modifying and testing every application. This is why everRun is ideal for communications applications that include video monitoring, network management, signaling gateways, firewalls, network controllers and more.
In the webinar, we discussed the differences between standard HA systems and resilient platforms like everRun, options for deploying resiliency (in the apps versus the software infrastructure), a brief overview of everRun, customer use cases, and examples of how everRun is used in the communications space for telco networks and converging industries. To watch the webinar and learn more, please click here.
Lots of good statistics about the causes and costs of downtime, and next steps companies can use to understand their risk and potential downtime-related costs, so they can procure additional funds to protect against future availability issues.
Let’s take a quick look at the high level findings published in the Infographic.
91% still experience downtime
33% of all downtime is caused by IT equipment failure
IT equipment failure is the most expensive cause of outages (23% of cost), nearly twice as high as every other cause except cyber crime (21%).
Average length of downtime is still over 86 minutes
Average cost of downtime has increased 54% to $8,023 per minute.
Based on these statistics, 30% (33% of 91%) of all data centers will have downtime related to IT equipment failure. Assuming they have only one incident of the average length, they would incur $689,978 (86 min × $8,023/min) in downtime-related costs.
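That back-of-the-envelope math can be checked directly (figures from the infographic as quoted above):

```python
# Back-of-the-envelope downtime exposure from the infographic figures.
pct_with_downtime = 0.91   # 91% still experience downtime
pct_it_failure = 0.33      # 33% of downtime from IT equipment failure
avg_minutes = 86           # average outage length, minutes
cost_per_minute = 8023     # average cost of downtime, $ per minute

share_affected = pct_with_downtime * pct_it_failure
cost_per_incident = avg_minutes * cost_per_minute

print(round(share_affected, 2))   # 0.3 -> ~30% of data centers
print(cost_per_incident)          # 689978
```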
Stratus can address that 33%, the most costly category of downtime, with our fault-tolerant hardware and software solutions.
52% believe the outages could have been prevented. This makes sense, because 48% of downtime is caused by accidents and human error; only training, personnel changes or outsourcing can improve that cause of downtime.
70% believe cloud is equal to or better than their existing availability. That’s if you don’t look too closely at the SLA details (e.g., excluding “emergency maintenance,” or downtime only counting toward the SLA if over xx minutes per incident). Certainly most cloud providers can provide better than the 99.98% [(525,600-86)/525,600] availability these data centers are currently averaging (assuming only one incident of average length). But remember, all SLAs are limited to the cost of the service, which I assume is far less than the almost $700k in downtime-related costs most in the survey have realized.
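The 99.98% figure comes straight from the bracketed formula; spelled out:

```python
# Availability implied by one 86-minute outage in a 525,600-minute year.
MINUTES_PER_YEAR = 525_600
outage_minutes = 86

availability = (MINUTES_PER_YEAR - outage_minutes) / MINUTES_PER_YEAR
print(f"{availability:.2%}")   # 99.98%
```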
Cloud solutions are constantly improving; but we continue to hear from our customers that availability still has a long way to go, especially when it comes to stateful legacy workloads that don’t have availability built into the application like native cloud apps. Of course, this is something that we at Stratus are working on.
I say look into availability options and invest up front in the best availability you can afford. It might not pay dividends immediately, but an ounce of prevention is worth a pound of cure: $50k spent on availability might be worth $700k in downtime-related costs, not to mention headaches and a tarnished reputation.
On Tuesday, March 13th, Boston experienced a large power outage due to a transformer fire. NStar crews arrived on the scene en masse in a heroic effort to contain the fire and get Back Bay, Fenway and South Boston residents and businesses back online within a matter of days.
The rancor of citizens and public officials, it seems, was not over the outage itself, or even NStar’s response effort to fix the damage. NStar created its own PR problem when it repeatedly set and missed impossible deadlines.
In an NECN interview, Mayor Menino said, “NStar was responsive to a point, but sometimes they overpromised.”
The 115,000-volt transformer fire occurred at 6:30 p.m. on March 13. NStar responded quickly, reporting via their Twitter account, @NSTAR_NEWS, that they were “assessing the situation and will begin power restoration as soon as possible.”
At 5:02 a.m. Wednesday, March 14, they claimed via Twitter to have restored power to 8,000 customers and would restore power throughout the day and into the evening for the remaining 13,000. That tweet, widely reported by Boston news stations, set the standard that power would be completely restored by the end of Wednesday. When residents and shopkeepers awoke Thursday without power, they started to get angry.
When power restoration did not happen Wednesday, NStar promised citizens via news conferences that they would restore power during the Wednesday evening commute.
That, too, did not happen for some 12,000 Back Bay, Kenmore Square and Fenway residents.
Later Wednesday, at 5:59 p.m., the City of Boston tweeted via @NotifyBoston that “NSTAR reports power back to Back Bay/Kenmore restored by 7 p.m. Power to Pru/Copley area around 4 a.m.”
Ironically, Boston resident Marcela Garcia retweeted them, qualifying “FOR SHO???”
I was on Spiceworks today and ran into this conversation about 100% uptime. I had a few thoughts but am interested in what others had to say, as well. Share them below, or on Spiceworks!
Most SLAs will claim 100% uptime (which most of you know is unattainable) with the provisions that “an outage doesn’t count if it is under 10 minutes” or caused by certain factors, or a host of other excuses.
Uptime, in the above context, has two components: reliability and availability. Availability refers to the amount of time the server is working, and reliability refers to the number of times the server fails.
To put it in a simpler context, imagine we are in a boat. Availability refers to the percentage of time in a given time period that we are out of the water, and reliability refers to the number of times we get wet in that same time period.
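As a rough sketch of the distinction, here is how you might compute both from an outage log. The outage durations below are hypothetical, and "reliability" is reduced to a simple incident count plus mean time between failures (MTBF):

```python
# Availability vs. reliability from a (hypothetical) one-year outage log.
PERIOD_MINUTES = 525_600            # one year, in minutes
outages_minutes = [12, 3, 45, 7]    # hypothetical incident durations

downtime = sum(outages_minutes)
availability = (PERIOD_MINUTES - downtime) / PERIOD_MINUTES  # time out of the water
failures = len(outages_minutes)                              # times we got wet
mtbf_days = (PERIOD_MINUTES - downtime) / failures / (24 * 60)

print(f"availability: {availability:.4%}")        # high availability...
print(f"failures: {failures}, MTBF: {mtbf_days:.0f} days")  # ...yet four dunkings
```

Two systems can share the same availability number while one fails far more often; the incident count is what the analogy's "getting wet" captures.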
There are three typical solutions for business critical applications: clusters, fault-tolerant servers, or the cloud.
Microsoft clusters, which only work for cluster-aware applications, operate as a team of servers. When one server fails, the next server takes over the application; however, whatever transaction was happening at the time of the fault is lost.
Fault-tolerant servers work in tandem: two servers are doing all the work all of the time, at the same time. If one fails, the application is still running and the users never know a fault has occurred. (Incidentally, with our Stratus servers, when a fault occurs or is about to occur, the server will call home to our service center for pro-active maintenance.)
This can be hard to imagine, so here is an analogy. For clusters, imagine a dance team enters a competition. They start the music and a dancer starts her number, but falls and breaks her ankle. A new dancer takes her place, the music is restarted, and the dancing continues.
For fault-tolerant servers, imagine the Rockettes. If one Rockette falls offstage, kicking and dancing is still happening.
On to the “cloud” option. Clouds, like Rackspace, Amazon Cloud, or even many parts of the Google brand, sound like a great plan. But clouds, despite their name, do not run on rainbows and unicorn dust. Their data and applications live on physical servers, which are vulnerable to faults.
Just as an aside, a private cloud is another great option: hosting your own cloud on a high availability solution like a fault tolerant server or a cluster.
Reading the fine print in SLAs is crucial. SLAs should be meaningful, and impose damages on the company if they are broken. To give some perspective: if our ftServer customers incur ANY downtime at all, for any reason, no matter how small, we pay $50,000. Again I say, responsible, customer-oriented companies have wiggle-proof SLAs.
An interview with Laura DiDio of ITIC on the cost of downtime.
I recently came across a fascinating website, www.Panopta.com, that reports on which online retail sites are down and for how long. Seventy-five retailers’ sites have gone dark since Black Friday, according to this website. The winner – or loser as the case may be – was a major North American company, logging more than 10 hours off the grid. I don’t know for sure, but I’m guessing it lost millions of dollars in holiday gift buying. To its credit, the site also lists retailer sites that have been up 24/7.
The website belongs to a company that sells infrastructure monitoring. Every minute or so, it checks in on a client’s site to see if it’s up and operating. If not, alerts go out so people can start fixing things. This is symptomatic of what’s wrong in a majority of data centers today. They focus on recovery from failure, not failure prevention. They probably don’t know what’s broken or how to fix it immediately, extending recovery time even more.
Doesn’t it make better sense to monitor and diagnose in real time, anticipate potential failures, and head them off before a crash? Remote monitoring and management should be more than a passive exercise. It should be, and can be, eyes and ears that not only notify but enable pro-active issue remediation without downtime or data loss. After all, we want to make darn sure Aunt Millie gets that strawberry huller, mushroom brush, and melon baller in her Christmas stocking, don’t we?
PCI stands for the Payment Card Industry, referring to debit, credit, prepaid, ATM and POS cards and related businesses. The governing body is the PCI Security Standards Council (PCI SSC), which is responsible for the implementation, dissemination, development, enhancement and storage of security standards for securing account data.
The PCI SSC has defined security standards outlined in the PCI DSS – the Payment Card Industry Data Security Standard. PCI dictates regulations that organizations (retail, commercial or otherwise) must follow when storing, processing or transmitting their customers’ credit card data. The PCI standard dictates 12 requirements for security best practices as they relate to customer credit card data. Of note: the PCI standard does not dictate how you implement PCI compliance, only that you comply with the 12 requirements.
Do you have more questions? Feel free to leave them in the comments, or tweet me at @Stratus4Uptime.
Presented by Sue Nemetz, Director, Services Business Development.