Stratus Blog

Showing archives for category Mission Critical

Exchanges, Fault Tolerance and Determinism

6.9.2016 | Fault Tolerance, Financial, Mission Critical

The proliferation of online trading platforms and services coupled with demand for high speed trading has put pressure on stock exchanges to seek high performance capabilities while maintaining system stability.

But how are stock exchanges responding to this need?  Some have gone down the path of creating “home grown” software based uptime solutions to ensure the availability of their most critical applications, while others have opted to deploy fault-tolerant hardware solutions. The amount of code exchanges write and maintain to ensure high availability can be formidable.

Stock exchanges that have placed their bets on “home grown” solutions have been burdened with the cost, complexity and effort of modifying their proven applications, writing and maintaining new code, and giving up CPU cycles to that code. Despite this, they still cannot guarantee the consistent, deterministic performance that their “zero tolerance” for downtime demands, nor the 99.999% availability that a true fault-tolerant hardware solution provides. This “zero tolerance” for downtime applies not only to trading but also to clearing and risk management applications that, while not dependent on low latency, are still critical to the business. In many instances, any amount of unplanned downtime of these critical applications could put a stock exchange’s reputation on the line or, worse, the credibility of an entire nation’s financial system.

Those stock exchanges that have put their money on fault-tolerant server systems are reaping the benefits. From an operational and regulatory perspective, there is nothing better. The systems run reliably without unplanned downtime or outages, and there is no need to write availability code to support the applications: fault tolerance is handled at the server level, resulting in even more reliability than the software equivalent, less burden on the CPU and a significant reduction in Opex.

Over the last thirty years, Stratus has mastered fault-tolerant server solutions to guarantee the highest levels of uptime through its unique lock-step technology. As the fault-tolerant technology of choice for many stock exchanges, the Stratus solution has evolved over the years to deliver today’s high performance capabilities through standard x86 architectures that meet the load requirements of high-speed, highly regulated financial transactions. Through a recent partnership with Solarflare, a best-of-breed low-latency network card vendor, Stratus has further enhanced its solution to deliver low-latency and low-jitter kernel bypass. The combination of fault tolerance and low latency is bringing tremendous value to exchanges and capital markets, meeting their uptime requirements without the need for any software modifications.

More on jitter and determinism in the next post.

Three Steps for Moving Business Critical Apps to the Cloud

4.30.2015 | Cloud, Mission Critical

The trend toward cloud-based applications and services is well underway as enterprises see the advantages in cost, efficiency, and agility. Largely absent from this march to the cloud have been mission-critical applications, which remain locked within legacy systems in the data center. Understandably, IT leaders want to make sure they can meet the security and availability requirements of mission-critical apps in a cloud environment before making that leap.

But as cloud technologies mature, this is starting to change. New approaches are emerging that offer the potential to meet the demands of business-critical applications in private cloud environments. At the heart of these new approaches lies a new mindset. IT leaders need to adopt an application-centric approach to cloud services rather than an infrastructure-centric approach. That means building cloud environments that go beyond “commodity services” and deliver robust capabilities to meet the needs of mission-critical apps. I believe there are three steps to achieving this successfully.

Step 1: Rethink your approach to availability

It goes without saying that availability is non-negotiable for business-critical apps. And until recently, “cloud” and “high availability” were terms not normally used together. That’s because traditional hardware-based fault tolerance approaches don’t lend themselves to the elastic, virtualized nature of cloud computing environments. That’s where software-defined availability (SDA) comes to the rescue. With the new generation of SDA solutions, failure recovery is abstracted from the application, enabling mainframe-like availability levels in a cloud environment running on low-cost commodity hardware.

This abstract approach also means you can achieve business-critical availability without completely re-engineering the application. In essence, you are deploying availability as a cloud service—dramatically reducing cost, complexity and risk.

Step 2: Focus on orchestration

Orchestration means making sure every bit of data moving around in the cloud ends up exactly where it’s supposed to be, when it’s supposed to be there. This requires sophisticated solutions with the intelligence to dedicate the right resources when and where they are needed.

The fact is, many applications are only “mission-critical” at certain times. For example, you might have a financial application that has high availability requirements during specific times in the accounting cycle. Today’s advanced cloud orchestration solutions allocate the appropriate resources to support this availability requirement—dynamically and automatically. When this level of availability is no longer required, resources are redeployed seamlessly. The result: availability when you need it and optimized computing resources at all times.
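The idea of time-bound criticality can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CriticalWindow` type, the tier names and the quarter-close dates are all hypothetical, and a real orchestration product would drive this from its own policy engine rather than a hard-coded list.

```python
from dataclasses import dataclass
from datetime import date
from typing import Sequence


@dataclass(frozen=True)
class CriticalWindow:
    """A hypothetical date range (inclusive) when an app needs high availability."""
    start: date
    end: date


def required_tier(today: date, windows: Sequence[CriticalWindow]) -> str:
    """Return the availability tier (made-up names) an app should run at today."""
    for window in windows:
        if window.start <= today <= window.end:
            return "high-availability"
    return "standard"


# Assumed example: a finance app that is critical around quarter close.
quarter_close = [CriticalWindow(date(2015, 3, 28), date(2015, 4, 3))]

print(required_tier(date(2015, 3, 30), quarter_close))  # inside the window
print(required_tier(date(2015, 4, 15), quarter_close))  # outside the window
```

An orchestrator following this pattern would re-evaluate the tier on a schedule and move resources only when the answer changes.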

Step 3: Leverage open source technologies

What’s the point of embracing the flexibility and cost efficiencies of the cloud if you’re going to lock yourself in with expensive, proprietary technologies? Taking advantage of open source technologies and architectures like OpenStack, Linux, and KVM (Kernel-based Virtual Machine) enables you to avoid costly license fees while allowing the flexibility and interoperability to create cloud environments using best-of-breed solutions, like the SDA and orchestration solutions discussed above.

The open source cloud ecosystem is growing and maturing rapidly, and fostering tremendous innovation as it goes. I believe building on this evolving open source foundation will pay huge dividends in agility down the road.

There you have it: Three critical keys for moving business-critical apps to the cloud. Embracing these crucial success factors, and the innovative technologies behind them, just might be the bridge to the cloud your IT organization has been looking for.


The Journey to High Availability: Discover Where You Really Are and Why You Should Care

3.18.2014 | Cost of Downtime, High Availability, Mission Critical, Virtualization

What if you had the choice of having your applications available 99% of the time versus 99.9995% of the time – would you really experience a difference?

What is your typical morning like?  Perhaps it begins with breakfast, followed by the morning news and a 30-minute workout.  But not everything goes as planned.  Sometimes the cereal you were hoping for or the web site you frequent for current events aren’t available.  For these daily decisions, the answers are easy – eat something else or try another URL.  Honestly, if your favorite cereal was only available 90% of the time, you’d be fine.

A typical day at the office generally starts out with a similar pattern: turn on the computer, log on and begin using the applications essential to your job and your company’s success. Most days go as planned. However, what happens when your routine goes awry? What’s the effect on your company’s productivity when the applications you and your colleagues depend upon go down and everything comes to a screeching halt? What if the application is outward facing and affects customers trying to do business with you? What if it happens at a peak time? These are all questions someone considered when deciding what type of availability solution the application required (or at least you hope they did). The operational effects matter as much as the potential costs, but that is a story for another post.

This Availability Journey Infographic does a great job of representing almost every factor you should consider and classifies the probable solutions by their average yearly downtime. This average is translated into a “Downtime Index Multiplier” that can be used to help calculate your company’s “Yearly Downtime Risk”. The Downtime Index Multiplier is shown at each stage in the infographic. It is derived from the average yearly downtime for the given solution, with hours, minutes and seconds converted into decimal hours for multiplication. So, a solution with 99% availability has about 87 hours and 36 minutes of yearly downtime, which converts to a Downtime Index Multiplier of 87.6 (87 + 36/60). You use this multiplier to calculate your yearly downtime risk for the solution as shown on the Availability Journey Infographic. For example, if you calculated your application’s hourly cost of downtime at only $10,000, your yearly downtime risk at a 99% availability rate would be $876,000 ($10,000 x 87.6). In comparison, a 99.9995% solution has only 2 minutes and 38 seconds of yearly downtime, or an index of only 0.04. Using the example above, the yearly downtime risk would be $400 ($10,000 x 0.04). Thus, if your application’s downtime costs were only $10,000 an hour, the difference in yearly risk between the lowest and highest availability solutions would be $875,600.
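For readers who want to plug in their own numbers, the calculation can be sketched in Python as below. The function names are illustrative, not taken from the infographic, and the sketch assumes a 365-day year (8,760 hours) and an index rounded to two decimal places, which is what reproduces the 87.6 and 0.04 figures quoted in this post.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours; leap years are ignored (assumption)


def downtime_index(availability_pct: float) -> float:
    """Expected yearly downtime, in decimal hours, at a given availability."""
    return HOURS_PER_YEAR * (1 - availability_pct / 100)


def yearly_downtime_risk(hourly_cost: float, availability_pct: float,
                         digits: int = 2) -> float:
    """Hourly cost of downtime times the rounded Downtime Index Multiplier."""
    index = round(downtime_index(availability_pct), digits)
    return round(hourly_cost * index, 2)


low = yearly_downtime_risk(10_000, 99)        # index rounds to 87.6
high = yearly_downtime_risk(10_000, 99.9995)  # index rounds to 0.04

print(f"99% availability:      ${low:,.0f}")
print(f"99.9995% availability: ${high:,.0f}")
print(f"Difference:            ${low - high:,.0f}")
```

Note that the 0.04 index is a rounded value: the unrounded figure is 0.0438 hours, which would put the yearly risk closer to $438 than $400.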

Today’s top-of-the-line availability solutions are not the purpose-built, exorbitantly priced mainframes of yesterday. They are industry-standard plug-and-play solutions that fit into almost any infrastructure, including virtualized and cloud. One thing I can guarantee: unless you’re a credit card company, fault-tolerant hardware or software will cost you only a fraction of the $876K at risk in the example above. Then again, if you were, you’d already be utilizing fault tolerance because your risk is probably in the billions, even without considering the hidden costs of downtime like damaged reputation, regulatory impact and lost customers.

So, what is the cost of downtime and availability goal for your company’s applications in this always-on world? Well, 67% of best-in-class organizations use fault-tolerant servers for high availability. Be careful if you’re part of the 66% who still rely on traditional backup for availability, because you are taking a huge gamble.

Super Bowl Shines Worldwide Spotlight on Downtime

2.5.2013 | Cost of Downtime, Failure, High Availability, Mission Critical

Over 100 million people worldwide tuned in to watch Super Bowl XLVII. It could therefore be argued that the blackout that interrupted the game was the most viewed and most infamous power outage ever to wreak havoc on the grandest of scales.

It just goes to show, downtime happens.

We can’t really say for sure how or what occurred. Early speculation placed the blame on Beyoncé’s lights-out performance, but a manager at the Superdome, site of the game, said it was not the halftime show and that a local energy company was reporting trouble with one of the two main lines that deliver power to the stadium from a local substation.

It could have been a software glitch, or a hardware problem that sacked power to the stadium for 33 minutes and left the NFL with a black eye.  But the downtime incident powered a social media surge, as hundreds of thousands of people began Tweeting about the #poweroutage.

Which brings us to Twitter itself. Having suffered its own downtime nightmare back on January 31, Twitter was able to handle the blitz of people tweeting about the Super Bowl’s misfortune. Twitter announced it processed just over 24 million tweets during the game, with the mini missives coming in at a rate of 231,500 a minute during the power outage.

Downtime appears in many different forms and at many different times, across all industries and business landscapes. The Twitter downtime occurrence was much different from the one the NFL witnessed, but both incidents took their toll financially and in terms of a hit to brand reputation.

Within the enterprise there is an acceptable level of downtime that occurs each year. On average, businesses suffer between three and five hours of downtime per year, far too much in our humble opinion, at an average cost of $138,888 per hour. While that’s a staggering figure, the damage to the brand can be even more catastrophic.
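To put those averages side by side, here is a quick back-of-the-envelope sketch in Python. It uses only the two figures quoted above (three to five hours per year, $138,888 per hour); the function name is illustrative, and brand damage is deliberately left out because no figure is given for it.

```python
AVERAGE_HOURLY_COST_USD = 138_888  # average cost per hour of downtime cited above


def annual_downtime_cost(hours_per_year: int,
                         hourly_cost: int = AVERAGE_HOURLY_COST_USD) -> int:
    """Direct yearly cost of downtime: hours lost times cost per hour."""
    return hours_per_year * hourly_cost


low_estimate = annual_downtime_cost(3)   # three hours of downtime per year
high_estimate = annual_downtime_cost(5)  # five hours of downtime per year

print(f"${low_estimate:,} to ${high_estimate:,} per year")
# prints "$416,664 to $694,440 per year"
```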

Let’s get back to the Super Bowl and the power outage. The City of New Orleans, which hosted the game, is already worried it’ll lose out on hosting future games because of what happened. That’s a city known for its ability to show its visitors a good time, but those businesses that depend on major events like the Super Bowl to draw in tourism dollars could suffer from that 33-minute absence of electricity.

Again, downtime comes in many forms depending on the industry and the ramifications have the potential to throw their victims for a significant loss. It’s like that old saying that you need to expect the unexpected. When the unexpected does arrive you have to be prepared to come back from that downtime swiftly and with as little disruption to your business as possible. With the right technology and the right best practices in place, you can minimize the damage and decrease the chance of downtime seriously hampering your ability to do business.

Can You Hold On a Minute?

1.29.2013 | Cloud, Cost of Downtime, High Availability, Mission Critical

Have you ever thought about what a minute of your time is worth?

Let’s say you get paid $60 an hour – then one minute is worth $1. If you are reading this, then, my bet is you are probably willing to spend $1 waiting for an answer. Chances are you will wait much longer, especially if it’s on someone else’s dime. But, how long is too long to wait?

If you run a 911 response center (emergency phone service in the USA), then one minute of downtime is measured not in dollars but in lives. If you are the IT manager of a financial company, how many credit card transactions could you lose in one minute? One hundred? One thousand? Maybe many more.

In both these and many other commercial examples, the cost of downtime is both known and quantifiable. Businesses not only perform risk assessments on downtime but also make business decisions to avoid it. In 2011, eWeek reported a business could lose an average of about $5,000 per minute in an outage. As they say, “at that rate, $300,000 per hour is not something to dismiss lightly.” Given that, I think we can all agree that for critical business applications, uptime is pretty important to many businesses and now, to me too.

This week, I start a new job as chief marketing officer at Stratus Technologies, one of the world’s leaders in ensuring uptime for your applications. You will find our software and servers behind many things you use day-to-day and you would be pretty upset if they didn’t work. Examples would be supporting credit card transactions and 911 services. What makes this role interesting is not just these types of services, but also how our solutions apply to others. Let me give you an example.

I was sitting in the Austin airport waiting to board the first of two flights that would take me to Boston, my new home.  I wanted to let my friends on Facebook know that I had started my journey so I thought I would check-in on Foursquare – which automatically updates Facebook. Foursquare is down.

I wait until Dallas (I am changing through DFW – one of the downsides to Austin) and Foursquare is still down. When I arrive in Boston, hours later, Foursquare is up, so I check in. Of course, I could have given up on Foursquare and just checked-in on Facebook. In the cloud, there are often alternative ways of doing things.

This may seem like a trivial example, especially compared to a 911 service, but if you are Foursquare and in search of a business model, I suspect this is not good news. As that social site looks to monetize its platform, my guess is it will use ads. I need to be on the Foursquare service to see the ads. Another outage like this and I will not be on the service. The reality is it may not have been the site’s fault; it may be the service provider’s fault, but as a user, I don’t care.

Just a few weeks ago, in the CIO section of the Wall Street Journal, they reported “Netflix Amazon Outage Shows ‘Any Company Can Fail’.” Forrester Research analyst Rachel Dines is quoted as saying, “It’s all about timing. This was a big deal because it was one of the worst possible times it could happen as families gathered during Christmas to watch movies.” OK, so families could have talked to each other, but you get the point and there are plenty of other alternatives to Netflix.

What excites me about Stratus Technologies is not just how our technologies apply to established commercial businesses but also how they apply to these new cloud-based services. I have no doubt that as cloud applications become more important in our lives, Stratus Technologies will have a critical role to play in making them available all the time.

For now, I have a lot to learn about the business and I look forward to blogging about it as I go.

White Paper Provides Practical Advice for Migrating PSAPs to Next Generation 9-1-1 Technology

1.22.2013 | High Availability, Mission Critical, uptime, Virtualization

Everybody is texting these days. Teenagers, soccer moms, business people, and even grandparents have jumped on the bandwagon, sending more text messages, photos and videos than ever before. In fact, according to a 2011 Pew Internet survey, “Americans and Text Messaging,” 73 percent of cell phone users text, and nearly one-third of them would rather text than talk. With texting on the rise, it’s inevitable that 9-1-1 technology must evolve to meet the needs of today’s mobile citizens. That’s what Next Generation 9-1-1 (NG9-1-1) is all about.

NG9-1-1 is a national initiative that aims to update and improve emergency communications services.  The end goal is to upgrade the country’s 9-1-1 infrastructure so that the public can not only call, but also transmit text, video, photos, and more to a Public Safety Answering Point (PSAP). In turn, the PSAP will be able to process the data, transmit it as necessary, and get it out to first responders. Unlike today’s system, the new infrastructure will also support the transmission of calls and information across county and state lines. These enhanced capabilities will be instrumental in increasing public safety by helping law enforcement, firefighters, EMTs, and other first responders get better information about the situations they face in the field.

While migrating your PSAP to NG9-1-1 may seem overwhelming at first, proper planning can help ensure a smooth and manageable transition. Give careful upfront consideration to all your technology needs: ESInet, CTI software, CAD systems, mobile data networks, TDD software, and more. Think about how you will fund your NG9-1-1 system. Explore potential liability issues. Create a public education plan. And figure out the best way to protect your NG9-1-1 solution against downtime that could lead to tragic consequences. Looking for practical advice on how to successfully move your PSAP to NG9-1-1? Download our informative white paper, “What You Need to Know About Migrating to Next Generation 9-1-1 Technology,” to learn more.

New Report Discusses Downtime Protection Options for Virtualized Applications

12.20.2012 | Cost of Downtime, Fault Tolerance, High Availability, Mission Critical, uptime, Virtualization, vmware

It’s no secret that system downtime is bad for business. For one thing, it’s expensive. According to a 2012 Aberdeen Group report, the average cost of an hour of downtime is now $138,888 USD, up more than 30% from 2010. Given these rising costs, it’s no wonder that ensuring high availability of business-critical applications is becoming a top priority for companies of all sizes.

When it comes to choosing the right downtime protection, there are a couple of important things to keep in mind. First, deployment of applications on hypervisor software for server virtualization is increasing at a steady pace and is expected to continue until almost all applications are implemented on virtualized servers. As a result, you need to make sure that your downtime protection is able to support virtualized as well as non-virtualized applications. Second, with IT spending and headcount on the decline, downtime protection should be easy to install and maintain since there are fewer IT resources available to manage the assets.

Available downtime protection options range from adding no additional protection other than that offered by general-purpose servers to deploying applications on fault-tolerant hardware. Which option you choose will depend on the type of application in question. If the application is mission-critical, then you’ll need higher levels of protection. A strong segment of companies are choosing to protect each of their mission critical applications with fault-tolerant servers because they provide the highest availability, require no specialized IT skills, and are now priced within reach of even small to mid-size companies.  Looking for guidance in choosing the right downtime protection for your “can’t fail” applications? Download the Aberdeen Group report to learn more.

Marathon Technologies is now Stratus Technologies

9.26.2012 | Cloud, Disaster Recovery, Fault Tolerance, High Availability, Mission Critical, Technology, uptime

If you are an IT decision maker looking for application high availability and business continuity, Stratus’s acquisition of Marathon Technologies is relevant to you.

Stratus, the company known for products and services that keep mission-critical applications up and running all the time, announced on Monday the acquisition of Marathon Technologies. Marathon’s specialty is software-based solutions for high availability, fault tolerance and disaster recovery. Its everRun MX is the world’s first software-based, fault-tolerant solution that supports multi-core/multi-processor Microsoft applications. The addition of the Marathon everRun® product line further solidifies Stratus’s position as the leading provider of availability solutions.

We welcome Marathon’s customers, channel partners and employees to the Stratus community. Stratus is the leader in high availability and fault tolerant solutions for both software and hardware whether in a physical or virtualized cloud environment.

You can read our recent announcement here.

Stratus Technologies’ Avance 3.0 Detects and Prevents Downtime

7.25.2012 | High Availability, Mission Critical, Technology, Virtualization

Stratus Technologies’ high-availability (HA) Stratus Avance Software 3.0 now includes support for Intel Xeon E5 “Sandy Bridge” processor-powered servers. This is an exciting development for companies that use Intel Xeon E5 servers manufactured by HP, IBM, Dell and Intel, as Avance software proactively detects and prevents downtime and ensures that their applications run without interruption.

Avance software is the only HA solution that automatically detects, isolates and handles faults, keeping applications running despite system interruptions. Avance software constantly monitors system heartbeat and the health of drives, fans, power supplies and other system components to predict faults and performance degradation. Its system management dashboard gives the IT administrator detailed configurations and alert information as well as guidance on resolving issues.
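As a rough illustration of what heartbeat monitoring means in general, the Python sketch below flags components that have not reported within a timeout. It is a generic pattern only: the `HeartbeatMonitor` class, the component names and the five-second timeout are hypothetical and do not describe how Avance is actually implemented.

```python
from typing import Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat time per component (illustrative only)."""

    def __init__(self, timeout_s: float) -> None:
        self.timeout_s = timeout_s
        self._last_seen: Dict[str, float] = {}

    def beat(self, component: str, now: float) -> None:
        """Record that a component reported in at time `now` (in seconds)."""
        self._last_seen[component] = now

    def stale(self, now: float) -> List[str]:
        """Return components whose last heartbeat is older than the timeout."""
        return sorted(
            name for name, seen in self._last_seen.items()
            if now - seen > self.timeout_s
        )


monitor = HeartbeatMonitor(timeout_s=5.0)
monitor.beat("fan-1", now=0.0)
monitor.beat("psu-1", now=0.0)
monitor.beat("psu-1", now=4.0)   # psu-1 keeps reporting; fan-1 goes quiet

print(monitor.stale(now=6.0))    # fan-1 has been silent for 6 s, over the 5 s limit
```

Timestamps are passed in explicitly so the example is deterministic; a real monitor would read a clock and raise alerts instead of printing.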


This next generation of Avance can support up to 24 virtual machines (VM) on a single licensed HA server platform, a 50 percent increase over its predecessor. The improvement in VM density can lower operational, maintenance and management costs by enabling more consolidation of servers and applications.

This is also an important development for our channel partners. Frank Vincentelli, chief technology officer at Integrated IT Systems, a computing services firm and Stratus4Uptime channel partner, feels the development is critical to maintaining healthy relationships with customers. “Adding IT talent is expensive and application downtime is disruptive, and our clients very much want to minimize both,” he said.

Avance 3.0 software is an organization’s best protection against unplanned downtime and data loss for their most important applications. Avance proves that HA can be simple to achieve and manage, and affordable to implement in organizations of any size.  No other HA solution offers more.

Some additional features of the newest release include:

  • Support for Intel Modular Server (IMS) systems, built on Intel Multi-Flex technology, to create a HA blade platform between two IMS chassis
  • Faster snapshotting to reduce VM back-up time and downtime exposure
  • System manageability enhancements that improve the user experience

Complete product details are available here.


Customer Spotlight: Protecting Emergency Response Applications at Hartsfield-Jackson International Airport

7.18.2012 | Fault Tolerance, High Availability, Mission Critical, uptime

For this month’s customer spotlight, we’re taking a look at how downtime can affect travel and transportation. Located in Atlanta, Georgia, Hartsfield-Jackson International Airport is the world’s busiest airport, with nearly 90 million passengers traveling through the airport annually.

Who doesn’t love a good summer vacation at a faraway destination? But before you can really start enjoying your trip, you have to get there. It sounds basic (pack, fly, enjoy) but nothing’s ever that easy. The more traveled fliers know to expect long lines at check-in and maybe even delays on the runway. But one thing they aren’t expecting is getting caught in an emergency situation. Unfortunately, these things happen, and the last thing you want is to need emergency services that are delayed because the control center is down.

It’s safe to say that no airport can tolerate system downtime, but what about the airport that has more travelers going through than any other airport on the planet? Stratus customer Hartsfield-Jackson Airport has over 90 million travelers pass through its doors each year, and it recognized the critical need to be prepared to handle all types of emergencies. Therefore, in 2010 the airport added a new Centralized Command and Control Center (C4), providing a single point for managing incidents across its entire campus and enabling computer-aided dispatch of police and fire units stationed at the airport. With the system being such a focal point in airport operations, security and safety, uptime was crucial. Although first responders could communicate via radio and phone, an application or server outage would undoubtedly hinder fast communication and potentially put lives on the line.

The airport decided to take advantage of Stratus’ fault-tolerant ftServer system, which lets the system seamlessly ride through issues that would cause a conventional server to crash. Additionally, unlike a server cluster used for redundancy, the ftServer system doesn’t require IT administrators to use special software scripts and systems management procedures to ensure uptime. Paired with the proven reliability and efficiency of RESPONSE CAD software, Stratus’ 24/7 uptime assurance and proactive monitoring give dispatchers the stability they need for continuous real-time access to vital information and communications.

In five years, Hartsfield-Jackson hasn’t experienced any unplanned downtime. Now that’s something to add to the bragging rights beyond being the busiest airport in the world!

Learn More: Protecting Emergency Response Applications at Hartsfield-Jackson International Airport.
