Stratus Blog

Showing archives for category SLA

Achieving Instantaneous Fault Tolerance for Any Application on Commodity Hardware

3.8.2016 High Availability, SLA, Telco

A few weeks ago, Stratus hosted a Webinar with Light Reading titled “Achieving Instantaneous Fault Tolerance for Any Application on Commodity Hardware,” aimed at Telcos and Communications Application Providers. I was pleasantly surprised at the turnout: hundreds of people were interested in this topic. Here is a brief overview of what we discussed.

Communications networks have always needed high availability and resiliency. As more networking applications such as SDN Controllers and virtualized functions are deployed on commodity servers rather than proprietary purpose-built hardware, the need for software-based resiliency and fault tolerance has never been greater. A reliable network depends on its ability to quickly and reliably connect end-points, transfer data and maintain quality of service (QoS). If the network goes down, even just for a few seconds, many people can be affected. System failure may not only result in loss of revenue for the provider; it can also seriously damage the provider’s reputation and trigger penalty payments.

Unplanned server and data center outages are expensive, and the cost of downtime is rising. The average cost of unplanned downtime is $7,900 to $11,000 per minute, depending on which study you believe. With average data center downtime of 90 minutes per year, this translates to a cost of roughly $700,000 to $1M per year, per data center.
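To make the arithmetic explicit, here is a minimal sketch that multiplies the per-minute figures quoted above by the 90 minutes of yearly downtime. The function name is only illustrative; the inputs are the numbers from this post.

```python
def annual_downtime_cost(cost_per_minute: float, minutes_per_year: float) -> float:
    """Yearly cost of downtime in dollars: cost per minute times minutes down."""
    return cost_per_minute * minutes_per_year


# Figures quoted in the post: $7,900 to $11,000 per minute, 90 minutes per year.
low = annual_downtime_cost(7_900, 90)
high = annual_downtime_cost(11_000, 90)
print(f"${low:,.0f} to ${high:,.0f} per year, per data center")
```

The upper end of that range is what lands near the $1M mark.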

A Highly Available (HA) network is one that ensures the network and its services are always on and always accessible (service accessibility), and that active sessions are always maintained without disruption (service continuity). Five nines (99.999%) availability is the minimum benchmark, meaning that the service is down for no more than about five minutes in a one-year period. While a typical HA of five nines (99.999%) or even six nines (99.9999%) sounds impressive, it may not be good enough for maintained QoS.

Let’s look at an example. Consider an application that has six nines (99.9999%) of availability. At this level of HA, the application will not be down for more than 31.5 seconds a year, which may seem impressive. However, if the application were to fail once every two weeks for just a second (26 seconds a year, still inside the six-nines budget) and was not capable of returning to its original state after a failure, active sessions would likely be disrupted or degraded. So technically, a service may still be up (maintaining its HA metrics), but if active customer sessions are experiencing disruption or degradation in the form of reconnecting, less throughput, higher latency or less functionality, it will likely violate the Service Level Agreement (SLA) and result in significant customer dissatisfaction and penalty consequences for the service provider.
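For readers who want to check the "nines" figures, here is a small sketch that converts a number of nines into a yearly downtime budget. It assumes a 365-day year, and the function name is only illustrative.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 (leap years ignored)


def downtime_budget_seconds(nines: int) -> float:
    """Maximum yearly downtime, in seconds, for an availability of N nines."""
    unavailability = 10 ** -nines  # five nines -> 0.00001
    return SECONDS_PER_YEAR * unavailability


for nines in (5, 6):
    budget = downtime_budget_seconds(nines)
    print(f"{nines} nines: {budget:,.1f} seconds/year ({budget / 60:.2f} minutes)")
```

Five nines works out to a little over five minutes a year, and six nines to about 31.5 seconds. Note that the budget says nothing about how many separate failures occurred or whether sessions survived them.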

So what Telcos and Communications Providers need is more than just five nines or even six nines of availability. They need resilient platforms that can intelligently manage faults and continue service without disruption or degradation in performance, functionality and latency, maintaining the minimum acceptable levels of service defined in the SLA. And since not all applications require the same level of resiliency, it is important to manage resiliency SLAs based on the different types of applications and their requirements. This is the difference between traditional HA solutions and resilient fault-tolerant solutions like everRun from Stratus Technologies.

everRun is a Software Defined Availability (SDA) infrastructure that moves fault management and automatic failover from the applications to the software infrastructure. This provides fully automated and complete fault tolerance for all applications, which includes fault detection, localization, isolation, service restoration, redundancy restoration, and, if desired, state replication, all without requiring application code changes and with dynamic levels of resiliency. This means any application can be instantaneously deployed with high resiliency, multiple levels of state protection and ultra-fast service restoration speed on commercial off-the-shelf (COTS) hardware in any network, without the complexity, time-consuming effort and risk associated with modifying and testing every application. This is why everRun is ideal for communications applications that include video monitoring, network management, signaling gateways, firewalls, network controllers and more.

In the Webinar, we discussed the differences between standard HA systems and resilient platforms like everRun, options for deploying resiliency (in the apps versus the software infrastructure), a brief overview of everRun, customer use cases and examples of how everRun is used in the communications space for telco networks and converging industries.

How Downtime Impacts the Bottom Line 2014

9.26.2014 Cost of Downtime, High Availability, SLA, Technology, uptime

Rackspace® - How Downtime Impacts The Bottom Line 2014 [Infographic]

The infographic offers lots of good statistics about the causes and costs of downtime, along with next steps companies can take to understand their risk and potential costs, so they can secure additional funds to protect against future availability issues.

Let’s take a quick look at the high level findings published in the Infographic.

91% still experience downtime

33% of all downtime is caused by IT equipment failure

IT equipment failure is the most expensive type of outage (23%), twice as high as every other cause except cyber crime (21%).

Average length of downtime is still over 86 minutes

Average cost of downtime has increased 54% to $8,023 per minute.

Based on these statistics, 30% (33% of 91%) of all data centers will have downtime related to IT equipment failure.  Assuming they only have one incident of the average length, they would incur $689,978 (86 x $8,023) in downtime related costs.
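The two figures above can be reproduced with a few lines of arithmetic. This is just a sketch using the infographic's numbers as quoted in this post; the function names are illustrative.

```python
def affected_share(share_with_downtime: float, share_from_equipment: float) -> float:
    """Fraction of all data centers with downtime caused by IT equipment failure."""
    return share_with_downtime * share_from_equipment


def incident_cost(minutes_down: int, cost_per_minute: int) -> int:
    """Cost in dollars of a single outage of the given length."""
    return minutes_down * cost_per_minute


share = affected_share(0.91, 0.33)  # 91% have downtime; 33% of that is equipment failure
cost = incident_cost(86, 8_023)     # one incident of average length at average cost
print(f"{share:.0%} of data centers, ${cost:,} per incident")
```

The estimate assumes exactly one incident of average length per year, so a data center with repeated failures would see a multiple of this figure.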

Stratus can address 33% of the most costly downtime with our fault-tolerant hardware and software solutions.

52% believe the outages could have been prevented. This makes sense, because 48% are caused by accident and human error. Only training, personnel changes or outsourcing can improve that cause of downtime.

70% believe cloud is equal to or better than their existing availability. That’s if you don’t look too closely at the SLA details (e.g. excluding “emergency maintenance,” or downtime only counting toward the SLA if over xx min per incident). Certainly most cloud providers can provide better than the 99.98% [(525,600-86)/525,600] availability these data centers are currently averaging (assuming only one incident of average length). But remember, all SLAs are limited to the cost of the service, which I assume is far less than the almost $700k downtime-related cost most in the survey have realized.
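The 99.98% figure comes straight from the bracketed formula: minutes in a year, minus minutes of downtime, divided by minutes in a year. A minimal sketch, assuming a 365-day year (the function name is illustrative):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600


def availability_percent(downtime_minutes: float) -> float:
    """Availability over one year, as a percentage, given total minutes of downtime."""
    return (MINUTES_PER_YEAR - downtime_minutes) / MINUTES_PER_YEAR * 100


# One incident of the average length reported in the infographic (86 minutes).
print(f"{availability_percent(86):.2f}%")
```

Swapping in a second or third incident of the same length shows how quickly the percentage slips.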

Cloud solutions are constantly improving; but we continue to hear from our customers that availability still has a long way to go, especially when it comes to stateful legacy workloads that don’t have availability built into the application like native cloud apps. Of course, this is something that we at Stratus are working on.

I say look into availability options and invest upfront in the best availability you can afford. It might not pay dividends right away, but an ounce of prevention is worth a pound of cure: $50k spent on availability might save $700k in related costs, not to mention headaches and a tarnished reputation.

Service Level Agreements and Outages

3.16.2012 Disaster Recovery, Fault Tolerance, High Availability, SLA

On Tuesday, March 13th, Boston experienced a large power outage due to a transformer fire. NStar crews arrived on the scene en masse in a heroic effort to contain the fire and get the Back Bay, Fenway and South Boston residents and businesses back online within a matter of days.

The rancor of citizens and public officials, it seems, was not with the outage itself, or even with NStar’s effort to fix the damage. NStar created its own PR problem when it repeatedly set and missed impossible deadlines.

In an NECN interview, Mayor Menino said, “NStar was responsive to a point, but sometimes they overpromised.”

The 115,000-volt transformer fire occurred at 6:30 p.m. on March 13. NStar responded quickly, reporting that they were “assessing the situation and will begin power restoration as soon as possible,” via their Twitter account, @NSTAR_NEWS.

At 5:02 a.m. Wednesday, March 14, they claimed via Twitter to have restored power to 8,000 customers and would restore power throughout the day and into the evening for the remaining 13,000. That tweet, widely reported by Boston news stations, set the standard that power would be completely restored by the end of Wednesday. When residents and shopkeepers awoke Thursday without power, they started to get angry.

When power restoration did not happen Wednesday, NStar promised citizens via news conferences that they would restore power during the Wednesday evening commute.

That, too, did not happen for some 12,000 Back Bay, Kenmore Square and Fenway residents.

Later Wednesday, at 5:59 p.m., the City of Boston tweeted via @NotifyBoston that “NSTAR reports power back to Back Bay/Kenmore restored by 7 p.m. Power to Pru/Copley area around 4 a.m.”

Ironically, Boston resident Marcela Garcia retweeted them, qualifying “FOR SHO???”


What does 100% uptime mean, and how does it pertain to SLAs?

1.16.2012 Fault Tolerance, High Availability, Mission Critical, SLA, Technology

I was on Spiceworks today and ran into this conversation about 100% uptime. I had a few thoughts but am interested in what others have to say, as well. Share them below, or on Spiceworks!

Most SLAs will claim 100% uptime (which most of you know is unattainable) with the provisions that “an outage doesn’t count if it is under 10 minutes” or caused by certain factors, or a host of other excuses.

Uptime, in the above context then, has two components: reliability and availability. Availability refers to the amount of time the server is working, and the reliability refers to the number of times the server fails.

To put it in a simpler context, imagine we are in a boat. Availability refers to the percentage of time in a given time period that we are out of the water, and reliability refers to the number of times we get wet in that same time period.
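To see why the two measures are worth tracking separately, here is a small sketch that computes both from a list of outage durations. The outage logs are made-up illustrations, not measurements, and the function names are hypothetical.

```python
def availability(outage_minutes: list[int], period_minutes: int) -> float:
    """Fraction of the period the server was working (time out of the water)."""
    return (period_minutes - sum(outage_minutes)) / period_minutes


def failure_count(outage_minutes: list[int]) -> int:
    """Number of separate failures in the period (times we got wet)."""
    return len(outage_minutes)


MONTH = 30 * 24 * 60  # minutes in a 30-day month

# Two hypothetical servers with the same total downtime.
one_long_outage = [60]        # a single hour-long failure
many_short_outages = [1] * 60  # sixty one-minute failures

for name, log in (("A", one_long_outage), ("B", many_short_outages)):
    print(f"Server {name}: {availability(log, MONTH):.3%} available, "
          f"{failure_count(log)} failure(s)")
```

Both servers report identical availability, yet the users of server B were interrupted sixty times.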

There are three typical solutions for business critical applications: clusters, fault-tolerant servers, or the cloud.

Microsoft clusters, which only work for cluster-aware applications, work as a team of servers. When one server fails, the next server takes over the application; however, whatever transaction was happening at the time of the fault is lost.

Fault-tolerant servers work in tandem: two servers are doing all the work all of the time, at the same time. If one fails, the application is still running and the users never know a fault has occurred. (Incidentally, with our Stratus servers, when a fault occurs or is about to occur, the server will call home to our service center for pro-active maintenance.)

This can be hard to imagine, so here is an analogy. For clusters, imagine a dance team enters a competition. They start the music and a dancer starts her number, but falls and breaks her ankle. A new dancer takes her place, the music is restarted, and the dancing continues.

For fault-tolerant servers, imagine the Rockettes. If one Rockette falls offstage, kicking and dancing is still happening.

On to the “cloud” option. Clouds, like Rackspace, Amazon Cloud, or even many parts of the Google brand, sound like a great plan. But clouds, despite their name, do not run on rainbows and unicorn dust. Their data and applications live on a physical server which is vulnerable to faults.

Just as an aside, a private cloud is another great option: hosting your own cloud on a high availability solution like a fault tolerant server or a cluster.

Reading the fine print in SLAs is crucial. SLAs should be meaningful, and should impose real penalties on the provider if they are broken. To give some perspective, if our ftServer customers incur ANY downtime at all for any reason, no matter how small, we pay $50,000. Again I say, responsible, customer-oriented companies have wiggle-proof SLAs.

ITIC’s Laura DiDio at #Cloudtalk on the Cost of Downtime

12.14.2011 Cloud, Cost of Downtime, Failure, High Availability, Mission Critical, SLA, Technology, uptime

An interview with Laura DiDio of ITIC on the cost of downtime.

Finger Pointing and Problem Solving

12.5.2011 Disaster Recovery, Failure, High Availability, Mission Critical, SLA, Technology

I recently came across a fascinating website that reports on which online retail sites are down and for how long. Seventy-five retailers’ sites have gone dark since Black Friday, according to this website. The winner, or loser as the case may be, was a major North American company, logging more than 10 hours off the grid. I don’t know for sure, but I’m guessing it lost millions of dollars in holiday gift buying. To its credit, the site also lists retailer sites that have been up 24/7.

The website belongs to a company that sells infrastructure monitoring. Every minute or so, it checks in on a client’s site to see if it’s up and operating. If not, alerts go out so people can start fixing things. This is symptomatic of what’s wrong in a majority of data centers today. They focus on recovery from failure, not failure prevention. They probably don’t know what’s broken or how to fix it immediately, extending recovery time even more.

Doesn’t it make better sense to monitor and diagnose in real time, anticipate potential failures, and head them off before a crash? Remote monitoring and management should be more than a passive exercise. It should be, and can be, eyes and ears that not only notify but enable pro-active issue remediation without downtime or data loss. After all, we want to make darn sure Aunt Millie gets that strawberry huller, mushroom brush, and melon baller in her Christmas stocking, don’t we?

What is PCI Compliance?

10.20.2011 High Availability, Mission Critical, SLA, Technology

PCI stands for the Payment Card Industry, referring to debit, credit, prepaid, ATM and POS cards and related businesses. The governing body is the PCI Security Standards Council (PCI SSC), which is responsible for the implementation, dissemination, development, enhancement and storage of security standards for securing account data.

The PCI SSC has defined security standards in the PCI DSS (Payment Card Industry Data Security Standard). PCI DSS dictates the rules that organizations (retail, commercial or otherwise) must follow when storing, processing or transmitting their customers’ credit card data. The standard sets out 12 requirements for security best practices as they relate to customer credit card data. Of note: the PCI standard does not dictate how you implement PCI compliance, only that you must comply with the 12 requirements.

Do you have more questions?  Feel free to leave them in the comments, or tweet me at @Stratus4Uptime.

ftServer Service Program Update Video

9.19.2011 Fault Tolerance, MS SQL, SLA, technical webinar, vmware

With the recent announcement of the new ftServices portfolio, Stratus delivers the world’s most aggressive response, resolution and uptime commitments for business-critical Windows, Linux and VMware operating environments. Stratus is the IT industry’s only vendor with proactive service programs providing root-cause problem analysis for platforms and operating systems. Our global 24×7 network of customer assistance centers not only provides immediate response to any customer support request, but also offers in-depth engineering-level response for the most critical problems in less than 15 minutes.
Presented by Sue Nemetz, Director, Services Business Development.

A Tale of a Print Server Meltdown

7.15.2011 Cost of Downtime, High Availability, Mission Critical, SLA, Technology

So, you head to work one morning, say hello to the security guard on your way in, and you fire up the coffee machine, your desktop, and the printer.

Or you try to fire up the printer.

The print server is taking the day off. Can a print server crash put a stop to an entire day’s work? Absolutely. Try creating orders, invoices, receipts, shipping labels and reports without one. If you can’t take an order, your business is dead in the water.

What about the e-mail server? If that goes down, virtually all communication screeches to a halt. Customers can’t contact you with questions, problems, or new orders. You cannot receive email notifications from any other applications you are running, like Sims (?) for example. So now you not only are incommunicado and helpless to fix the problem, you are unprotected, as well.

SMBs in the current market still do not seem to realize that their applications are business critical.  Patient management, order management, print servers and Microsoft Exchange are all essential to daily success, but one outage can put a complete stop to business.

When you stop to think about which of your applications are business critical, take time to consider why you have each of them in the first place. Sometimes, the simplest applications are the most important.

SLAs should have teeth

6.30.2011 Cost of Downtime, Failure, Fault Tolerance, High Availability, SLA, Technology, Virtualization

Uptime SLAs are not being written to protect the customer. They are designed to shield the service provider and avoid responsibility. It should be different.

I was just reading some stories of system downtime on the LinkedIn conversation “Skip Continuous Availability and High Availability. Have You Been a Victim of No Availability?” It struck me that service providers are failing to write SLAs that respond to the real costs of downtime.

And that’s critical. The adage is that one annoyed customer tells nine. So shouldn’t our “service level agreements” be service oriented? But it’s not happening that way in today’s cloud.

• One major hosting provider “guarantees” 100% uptime — but in the contract they disregard any outage of less than 30 minutes. And if such an outage occurs, they only rebate 5% of the monthly fee for the affected server. That hardly compensates the customer for the loss of business and productivity that may have occurred in the 30-minute outage — and does nothing to address any number of 29-minute outages. What kind of SLA is this, really?

• A SaaS provider’s enterprise-level SLA: “A service outage is covered by the SLA when service is completely unavailable or inaccessible for customer’s use for 10 or more consecutive minutes.” So a nine-minute outage means nothing. Their listed “key industries” include media companies, such as Universal Music Group, who would lose at least 630 files in a 9-minute outage, according to my quick calculations. I am not going to try to estimate the costs of an outage for their next key industry: medical.

• A giant web services provider guarantees only 99.5% uptime (which allows just under 44 hours of downtime a year!). But they claim that their well-documented 5-day outage that ended on April 21st never violated any SLAs. I know of a remote heart-monitoring company that was unable to read electrocardiogram results for an entire day; I doubt they are happy with their SLAs today.
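The fine print in the examples above is easy to model. The sketch below shows how a minimum-duration clause hides short outages from the SLA tally, and how many hours a 99.5% guarantee actually allows. The outage log is invented for illustration and the function names are hypothetical; only the 30-minute threshold and the 99.5% figure come from the examples.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760


def allowed_downtime_hours(uptime_percent: float) -> float:
    """Hours of downtime per year that still satisfy an uptime guarantee."""
    return HOURS_PER_YEAR * (100 - uptime_percent) / 100


def sla_counted_minutes(outage_minutes: list[int], threshold_minutes: int) -> int:
    """Downtime the SLA recognizes: only outages at or above the minimum duration."""
    return sum(m for m in outage_minutes if m >= threshold_minutes)


# Hypothetical month of outages, in minutes, under a 30-minute threshold.
outages = [29, 29, 9, 45]
actual = sum(outages)
counted = sla_counted_minutes(outages, threshold_minutes=30)
print(f"Customer experienced {actual} minutes down; the SLA counts {counted}.")
print(f"A 99.5% guarantee allows {allowed_downtime_hours(99.5):.1f} hours a year.")
```

In this made-up month the customer was down for 112 minutes, but only the single 45-minute outage would qualify for any rebate.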

SLAs should have teeth. They should be written with the customer’s needs and expectations in mind. Compensation for violated SLAs should bite and draw blood: it should more than make up for the pain of the customer’s downtime.

There should be no amount of downtime too small to care about. Customers should be clear on how much downtime they can expect, since they are the ones who feel the pain of business mistakes. The customer should feel compensated and, most important, feel that they can continue to trust your company.