Stratus Blog

Showing archives for category Disaster Recovery

Is Your Disaster Recovery Solution Sufficient for Your Critical Applications?

11.15.2016 | Disaster Recovery

Availability Demands of Our Always-on World

The digitalization of our world and the globalization of our economy have transformed the business environment in which we all operate. To compete, your business needs to operate 24 hours a day, 7 days a week, 365 days a year. This means your IT systems must run 24/7/365 to support your always-on business.

Always-on has become a global requirement that touches every part of our lives. It applies to your critical business applications, because your business can’t run without them. In manufacturing environments, it’s about maintaining productivity and reducing waste. Retailers need to ensure transaction processing systems are up and running to meet sales targets. In building security, premises and individuals need to be protected from internal and external threats. In public safety, lives are on the line. In financial services, the impact of system downtime is huge when you’re managing thousands of transactions per second. And in healthcare, the accessibility of patient records and compliance are crucial. You get the idea. None of these organizations can afford for their applications to be down. And as companies’ dependence on IT systems continues to increase, the cost of downtime continues to rise.

It’s About More Than Just Protecting Against System Failures

Availability protection, however, isn’t limited to threats against servers, storage systems, virtual machines or applications. Unplanned downtime can result from localized power failures, building-wide problems or even the complete loss of a site or a facility. Such disasters, whether natural or caused by human error, can result in the total loss of a physical data center, potentially leaving your business unable to function for days or even weeks. In regulated industries, a site-wide problem can lead to data loss that risks compliance, adding significantly to your downtime costs. That’s why businesses in regulated industries like pharmaceuticals, manufacturing and financial services need protection solutions that ensure that all their data is safely replicated and remains available at all times.

Traditional Approaches to Site Protection

When protecting against localized failures by using geographic separation, if a disaster strikes in one location, the goal is that your applications and data are immediately available, up-to-date, and fully operational at another location. Disaster Recovery (DR) solutions enable a business or operation to switch over to a remote location for continuation of vital technology infrastructure and systems following a natural or human-induced disaster. There are a few things to be aware of regarding DR solutions.

  • Failover may not be automatic, and may require human action.
  • DR implementations have Recovery Time Objectives (RTOs) – the maximum amount of time that a system or application can be down after a failure or disaster – and Recovery Point Objectives (RPOs) – the target maximum time period for which data might be lost.
  • Data is not typically backed up continuously, but instead asynchronously, based on a schedule. This means that when the DR site is turned up, operations resume from the point of the last data backup. So if, for example, your backup runs every 6 hours, the maximum period of data loss could be 6 hours.

While traditional DR solutions can provide long distance geographic separation for protection, that protection does incur a period of downtime, and some level of data loss.
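The relationship between a backup schedule and data loss described above can be worked through in a few lines. This is a minimal sketch, not part of any Stratus product: the function names `data_lost` and `schedule_meets_rpo` are hypothetical, and it assumes backups complete instantly at fixed intervals starting from time zero.

```python
from datetime import timedelta


def data_lost(failure_offset: timedelta, backup_interval: timedelta) -> timedelta:
    """Span of data lost if the site fails `failure_offset` after the first
    backup, assuming a backup completes every `backup_interval`."""
    if backup_interval <= timedelta(0):
        raise ValueError("backup_interval must be positive")
    # Everything written since the most recent backup is lost.
    return failure_offset % backup_interval


def schedule_meets_rpo(backup_interval: timedelta, rpo: timedelta) -> bool:
    """Worst-case loss approaches one full interval, so the interval
    itself must not exceed the RPO."""
    return backup_interval <= rpo


interval = timedelta(hours=6)
# Failure 5 minutes after a backup: little is lost.
loss_early = data_lost(timedelta(hours=6, minutes=5), interval)
# Failure 1 minute before the next backup: almost 6 hours are lost.
loss_late = data_lost(timedelta(hours=11, minutes=59), interval)
```

With a 6-hour schedule, a 1-hour RPO cannot be met no matter when the failure happens to occur, which is why the schedule, not luck, determines the RPO you can promise.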

Metro-Wide Availability Protection Prevents Downtime and Data Loss

The needs of our increasingly always-on world are driving the race to zero for RTO and RPO. This demands something more than traditional DR can offer.

An alternative to traditional DR solutions is the use of synchronous replication between geographically separated sites. The network requirements for synchronous replication mean that these solutions are best suited to geographic separation within a metropolitan area. Such Metro-wide Availability Protection solutions can defend your critical business applications against localized power failures, building-wide problems or physical machine failures without downtime or data loss.

Unlike DR solutions that rely on asynchronous replication, and which must therefore focus on recovery from downtime, Metro-wide Availability Protection with synchronous replication can provide zero downtime for your applications during outages. In the event of a physical machine or site failure, a Metro-wide Availability Protection solution can automatically detect the failure and keep virtual machines running with no downtime. The difference between preventing downtime and merely helping you recover from it has a big impact on an organization’s revenues, costs, customer satisfaction and efficiency rates.

Metro-wide Availability Protection, with synchronous replication, provides geographic separation protection within a metropolitan area, without downtime or data loss in the event of a localized failure or disaster.
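To make the difference between the two replication styles concrete, here is a toy model. It is a sketch under stated assumptions, not a description of how any Stratus product is implemented: the class `ReplicatedStore` and its methods are hypothetical, sites are modeled as in-memory lists, and network latency is ignored.

```python
class ReplicatedStore:
    """Toy model of a primary site replicating writes to a secondary site."""

    def __init__(self, synchronous: bool):
        self.synchronous = synchronous
        self.primary = []    # writes persisted at the primary site
        self.secondary = []  # writes persisted at the secondary site
        self.pending = []    # acknowledged writes not yet shipped (async only)

    def write(self, record: str) -> None:
        """Returns once the write is acknowledged to the application."""
        self.primary.append(record)
        if self.synchronous:
            # Synchronous: the secondary holds the record before we acknowledge.
            self.secondary.append(record)
        else:
            # Asynchronous: acknowledge now, ship to the secondary later.
            self.pending.append(record)

    def ship_pending(self) -> None:
        """Scheduled replication run (only meaningful in async mode)."""
        self.secondary.extend(self.pending)
        self.pending.clear()

    def lost_if_primary_fails(self) -> list:
        """Acknowledged writes that do not yet exist at the secondary site."""
        return list(self.pending)


sync_store = ReplicatedStore(synchronous=True)
async_store = ReplicatedStore(synchronous=False)
for store in (sync_store, async_store):
    store.write("order-1")
    store.write("order-2")
async_store.ship_pending()        # scheduled replication catches up
for store in (sync_store, async_store):
    store.write("order-3")        # written after the last replication run

sync_lost = sync_store.lost_if_primary_fails()
async_lost = async_store.lost_if_primary_fails()
```

In this model the synchronous store never has anything to lose when the primary site fails, while the asynchronous store loses whatever was written since the last replication run. The price of the synchronous path is that every write waits on the second site, which is why the text above ties it to metro-area distances.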

A Powerful Addition to Your Availability Toolkit

Metro-Wide Availability Protection is a powerful addition to your solutions for always-on systems and applications. Unlike typical disaster recovery solutions, which are reactive and rely on backup and restore, Metro-Wide Availability Protection uses synchronous data replication between locations in a metro area to allow continuous operation in the event of a site failure. This safeguards your business from major downtime due to potentially catastrophic events such as flooding and power outages.

Click Here to learn about Stratus everRun’s SplitSite Metro-Wide Availability Protection solution

New Research on Business Continuity and Disaster Recovery from Aberdeen Group

6.18.2013 | Disaster Recovery

In today’s global online economy, an organization’s IT end-users and customers demand 24/7 access to applications. As a result, even the briefest period of downtime can have serious bottom-line impact. In fact, a recent survey found that the average cost per hour of downtime was over $163,000*, an amount that can quickly send a company into financial peril. Given these high stakes, it’s not surprising that business continuity and disaster recovery are becoming top priorities for organizations of all types and sizes.

But when it comes to implementing business continuity and disaster recovery (BC/DR) plans, what’s the best approach to take? A new Aberdeen Group report focuses on the actions, capabilities, and technology enablers that best-in-class organizations have adopted to ensure continuous business operations, even in the event of disaster. Results of a recent Aberdeen survey show that best-in-class companies document their BC/DR requirements and procedures, measure results, and educate staff on implementing documented processes. In addition, they tend to use server virtualization and fault-tolerant servers as technology enablers to maintain business-critical services.

If your organization is thinking about developing, upgrading, or changing your BC/DR plans, there are upfront steps you should take before investing valuable budget and resources. Download the new Aberdeen Group report, “Business Continuity and Disaster Recovery: Don’t Go it Alone,” to understand what best-in-class companies are doing to maximize availability of critical applications and to get practical advice for implementing successful BC/DR processes and solutions within your organization.

*Source: Aberdeen Group, May 2013

Marathon Technologies is now Stratus Technologies

9.26.2012 | Cloud, Disaster Recovery, Fault Tolerance, High Availability, Mission Critical, Technology, uptime

If you are an IT decision maker looking for application high availability and business continuity, Stratus’s acquisition of Marathon Technologies is relevant to you.

Stratus, the company known for products and services that keep mission-critical applications up and running all the time, announced on Monday the acquisition of Marathon Technologies. Marathon’s specialty is software-based solutions for high availability, fault tolerance and disaster recovery. Its everRun MX is the world’s first software-based, fault-tolerant solution that supports multi-core/multi-processor Microsoft applications. The addition of the Marathon everRun® product line further solidifies Stratus’s position as the leading provider of availability solutions.

We welcome Marathon’s customers, channel partners and employees to the Stratus community. Stratus is the leader in high availability and fault tolerant solutions for both software and hardware whether in a physical or virtualized cloud environment.

You can read our recent announcement here.

Service Level Agreements and Outages

3.16.2012 | Disaster Recovery, Fault Tolerance, High Availability, SLA

On Tuesday, March 13th, Boston experienced a large power outage due to a transformer fire. NStar crews arrived at the scene en masse in a heroic effort to contain the fire and get the Back Bay, Fenway and South Boston residents and businesses back online within a matter of days.

The rancor of citizens and public officials, it seems, was not directed at the outage itself, or even at NStar’s effort to fix the damage. NStar created its own PR problem when it repeatedly set impossible deadlines and then missed them.

In an NECN interview, Mayor Menino said, “NStar was responsive to a point, but sometimes they overpromised.”

The 115,000-volt transformer fire occurred at 6:30 p.m. on March 13. NStar responded quickly, reporting that they were “assessing the situation and will begin power restoration as soon as possible,” via their Twitter account, @NSTAR_NEWS.

At 5:02 a.m. Wednesday, March 14, they claimed via Twitter to have restored power to 8,000 customers and said they would restore power throughout the day and into the evening for the remaining 13,000. That tweet, widely reported by Boston news stations, set the expectation that power would be completely restored by the end of Wednesday. When residents and shopkeepers awoke Thursday without power, they started to get angry.

When power restoration did not happen Wednesday, NStar promised citizens via news conferences that they would restore power during the Wednesday evening commute.

That, too, did not happen for some 12,000 Back Bay, Kenmore Square and Fenway residents.

Later Wednesday, at 5:59 p.m., the City of Boston tweeted via @NotifyBoston that “NSTAR reports power back to Back Bay/Kenmore restored by 7 p.m. Power to Pru/Copley area around 4 a.m.”

Ironically, Boston resident Marcela Garcia retweeted them, qualifying “FOR SHO???”

Read More

Keeping Computer Aided Dispatch Software Up and Running: St. Charles County Case Study

3.9.2012 | Disaster Recovery, Failure, Fault Tolerance, High Availability, Mission Critical, PSAP

Anyone in the public safety sector will tell you that the key to a safe neighborhood and a successful first-response system is teamwork. Everyone is essential. For example, even in one small car fire, the person who calls 9-1-1; the dispatch operator who answers the phone and sends the proper emergency personnel; the fire engine driver who navigates the truck safely through crowded streets; the firefighters who extinguish the blaze; the policemen who keep onlookers at a safe distance; the emergency medical technicians and paramedics who triage patients and get them to the hospital; and then the nurses, doctors and technicians in the hospital who treat victims, are all critical to keeping the public safe.

The same is true of the equipment. Every piece of the line is essential. The phone lines connect the 9-1-1 caller to the dispatcher and then the dispatcher to the fire station. All of the firefighters’ gear must work, along with the truck, the hydrant, and the hoses. The ambulance crew, similarly, must be fully equipped and transportable. There is little room for error when lives and property are at stake.

About that equipment. First responder organizations rely on top-of-the-line tools. Have you ever seen a firefighter haul out a green garden hose, struggling to untangle the kinks, in an effort to put out a fire? Have you ever seen a policeman take control of a robbery situation using a squirt gun? Have you ever seen a lifeguard swim to a victim and, instead of tossing them a buoy, fit them with floaties? No, and you won’t. Ever.

In public safety, there is no substituting the right tools to get the job done. Every piece is essential, and it must work exactly as designed, every single time.

Or, in the case of the server that supports the public safety applications, every single second.

St. Charles County Department of Dispatch and Alarm is a great example of a department that looked beyond the fire trucks, police cars and ambulances to find vulnerabilities that could possibly hurt public safety performance and put their citizens in danger. They implemented a highly reliable computer-aided dispatch (CAD) system built on Stratus® ftServer® systems and TriTech Software Systems’ VisiCAD™ software to ensure uninterrupted performance of their dispatch software. Each year, 40,000 service calls come through dispatch, and every single one could be life-saving. TriTech’s VisiCAD software is flexible enough to service their 16 ambulances and 34 fire stations, encompassing a total of 120 mobile units. VisiCAD is dynamic enough to locate the closest response team to the accident, while monitoring backup vehicles should they be needed.

The ftServers, which run the computer-aided dispatch system and store all of the electronic information about the calls, give the systems unparalleled uptime. St. Charles County IT Manager Travis Hill said they have been running their original ftServer system for more than nine years without any server downtime. That means nine years of proactive protection for the citizens of St. Charles County.

To find out more about why St. Charles County specifically chose VisiCAD software on ftServers, click here to read the case study.

Preventing Public Safety Outages

3.7.2012 | Cost of Downtime, Disaster Recovery, Failure, Fault Tolerance, High Availability, Mission Critical, PSAP

Saturday’s 911-system outage in the District of Columbia highlights the necessity for fault tolerant systems running mission-critical applications. Due to a PEPCO power outage to the call site on Martin Luther King Jr. Avenue, citizens could not reach EMS personnel from 1:53 to 2:16 p.m. Although traditional and social media channels did their best to get the word out about alternate numbers, all 617,996 citizens of the District were put at risk. Perhaps nothing is more critical to a city than public safety systems like EMS, Fire and Police response.

@AriAnkhNeferet from Twitter said it best, “Someone please explain to me how it’s possible that 911 is experiencing a power outage?! Come on DC. we have to do better.”

She is right – the most mission critical systems and applications shouldn’t be subject to outages, power or otherwise. Backup systems, fault tolerant servers, and disaster recovery solutions are all possible ways to make your EMS system safer for the community. Servers wired for two distinct power sources that come from separate power grids, like our ftServers, are an easy way to guard against power outages. Live data replication and split-site capabilities, two features of our Avance high availability software, are two other ways to ensure your systems are protected.

Besides power failures, memory failures, disk drive failures and countless other technical problems can crash servers, and they do so much more often. Saturday’s power outage demonstrates what could happen if a public safety system goes down for any number of reasons, and reinforces that steps need to be taken to protect systems from these more frequent occurrences.

When lives are at stake, you cannot be too careful. However, @AriAnkhNeferet’s tweet shows that something else is at stake: reputation. What happens if the public loses trust in the EMS system to respond? A large metro area can get 30,000 9-1-1 calls per day. That would mean the 20+ minute outage could have affected 400+ 9-1-1 calls, leaving citizens stranded and the city’s first line of defense helpless to respond.
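The 400+ estimate can be checked with simple arithmetic. The sketch below uses the 30,000 calls per day quoted for a large metro area and the 1:53 to 2:16 p.m. outage window reported above; it assumes calls arrive evenly throughout the day, which real call volumes do not, and the function name is hypothetical.

```python
from datetime import datetime

CALLS_PER_DAY = 30_000  # figure quoted above for a large metro area


def calls_during_outage(start: str, end: str,
                        calls_per_day: int = CALLS_PER_DAY) -> float:
    """Estimate calls arriving during an outage on a single day,
    assuming calls are spread evenly over 24 hours."""
    fmt = "%H:%M"
    outage = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    outage_minutes = outage.total_seconds() / 60
    calls_per_minute = calls_per_day / (24 * 60)
    return calls_per_minute * outage_minutes


# 1:53 p.m. to 2:16 p.m. is a 23-minute window.
affected = calls_during_outage("13:53", "14:16")
```

At roughly 21 calls per minute, a 23-minute window works out to about 479 calls, consistent with the "400+" figure.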

If you run life-saving systems, it might be best to run through some worst-case scenarios on your existing architecture. What happens when a power failure occurs in your call center? What happens when a server has a hardware failure? What is your disaster recovery plan in the case of an earthquake, fire, or flood? Are there dedicated resources available 24 hours a day in the case of a failure?

To learn how Stratus can help you with these and other public safety technology issues, click here to download more information.

Keeping Electronic Health Record (EHR) Applications Available at Alice Peck Day Memorial Hospital

3.1.2012 | Disaster Recovery, EMR, Failure, Healthcare, High Availability

Today, your Danskos are going to power over the linoleum floors, moving from patient room to patient room. In a sea of charts, beeping machines, gurneys and meal carts, you know that one small misstep can set back your whole day.

It isn’t a large leap, then, to understand that even a small amount of downtime for a hospital’s electronic health record (EHR) system can bring the entire hospital – staff, patients, and machines alike – to a standstill.

Ten years ago, when doctors and nurses used paper charts, the risk of inaccessible data was low, as was the level of efficiency. Aside from the occasional misfile or lost folder, patient medical histories were never completely unavailable. Electronic medical records have done wonders to streamline accessibility to patient information, but they also created vulnerability and a single point of failure in the server.

Click here to learn how Alice Peck Day Memorial Hospital prevents downtime.

The HITECH (Health Information Technology for Economic and Clinical Health) Act, however, demands “meaningful use” of technology in healthcare environments, with a $2 billion incentive behind it. Designed to make the exchange of healthcare information between healthcare professionals easier and more accurate while improving the level of care patients receive, the bill strongly encourages healthcare practices to adopt EMR.

Once the tedious process of data entry and document scanning is complete, medical practices can reap the rewards of a paperless system, but that efficiency comes with a catch: If the EHR system goes down, medical records are as good as gone. As a result, protecting servers and applications from downtime becomes paramount.

Alice Peck Day Memorial Hospital, a 25-bed hospital in the Northeast, implemented virtualization technology with high availability software to address concerns over medical records accessibility. To see their prescription for success, read the case study.

Protecting VMware vCenter Server from Downtime

2.8.2012 | Disaster Recovery, High Availability, Mission Critical, Technology, uptime, Virtualization, vmware

The most business-critical application in your data center could very well be VMware’s vCenter Server.

As you continue to add to the number of virtual machines in your environment, keeping vCenter Server up and running becomes increasingly important. Think about it: If vCenter goes down, IT managers are unable to control their VMs, and tools for site recovery, operations, business management, chargeback processes and more are unavailable.

To increase availability for vCenter, VMware offers a self-described “stopgap” solution known as Heartbeat, a complex-to-configure failover system that requires duplicate servers, duplicate software licenses (for both vCenter and Windows) and a brand-new interface to learn.

For those of you scoring at home, that is twice the license costs, twice the servers, and twice the management headaches.

Our crack engineering team, who often have the same pain points as you and every other IT director around the globe, created a better solution: simple, lower-cost uptime for vCenter that proactively prevents failures instead of reactively managing failovers. All this without any extra licenses, workarounds or custom scripting.

The new Stratus Uptime Appliance for VMware vCenter Server is a plug-and-play solution that provides greater than 99.999% availability. It’s a single server solution that runs one copy of vCenter on one copy of Windows. One bullet-proof server, only one software license, and no complexity. (And, as an added bonus, it costs thousands of dollars less than similarly configured Heartbeat solutions.)

Is your vCenter a risky, single point of failure?
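For readers who want to translate the 99.999% figure quoted above into time, the conversion is a one-line calculation. This is a generic availability calculation, not a Stratus specification; the function name is hypothetical and a 365-day year is assumed.

```python
def max_downtime_minutes_per_year(availability_percent: float,
                                  days_per_year: float = 365.0) -> float:
    """Downtime budget, in minutes per year, implied by an availability percentage."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * days_per_year * 24 * 60


five_nines = max_downtime_minutes_per_year(99.999)   # "five nines"
three_nines = max_downtime_minutes_per_year(99.9)    # for comparison
```

Five nines leaves a budget of about 5.26 minutes of downtime per year, compared with roughly 8.8 hours at 99.9%.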

Uncomplicate your uptime and check out the new Stratus Uptime Appliance for VMware vCenter. Have a question? Ask it in the comments, or send us a tweet @Stratus4uptime.

Finger Pointing and Problem Solving

12.5.2011 | Disaster Recovery, Failure, High Availability, Mission Critical, SLA, Technology

I recently came across a fascinating website that reports on which online retail sites are down and for how long. Seventy-five retailers’ sites have gone dark since Black Friday, according to this website. The winner – or loser as the case may be – was a major North American company, logging more than 10 hours off the grid. I don’t know for sure, but I’m guessing it lost millions of dollars in holiday gift buying. To its credit, the site also lists retailer sites that have been up 24/7.

The website belongs to a company that sells infrastructure monitoring. Every minute or so, it checks in on a client’s site to see if it’s up and operating. If not, alerts go out so people can start fixing things. This is symptomatic of what’s wrong in a majority of data centers today. They focus on recovery from failure, not failure prevention. They probably don’t know what’s broken or how to fix it immediately, extending recovery time even more.
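The check-then-alert pattern described above can be sketched in a few lines. This is an illustrative model only, not the monitoring vendor’s implementation: `poll_once`, `run_checks` and the site name `shop.example` are hypothetical, and the availability check is passed in as a function so the example runs without any network access.

```python
from typing import Callable, List


def poll_once(site: str, is_up: Callable[[str], bool], alerts: List[str]) -> bool:
    """Run one availability check and record an alert if the site is down."""
    up = is_up(site)
    if not up:
        alerts.append(f"ALERT: {site} is down")
    return up


def run_checks(site: str, is_up: Callable[[str], bool], rounds: int) -> List[str]:
    """Poll `rounds` times and return the alerts that were raised."""
    alerts: List[str] = []
    for _ in range(rounds):
        poll_once(site, is_up, alerts)
        # A real monitor would wait for the polling interval here
        # (about a minute, per the description above).
    return alerts


# Simulated check results: up, down, down, up.
states = iter([True, False, False, True])
alerts = run_checks("shop.example", lambda _site: next(states), rounds=4)
```

Note what the loop cannot do: every alert is raised only after a check has already failed, so the site is down before anyone is told. That is the reactive posture the post is criticizing.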

Doesn’t it make better sense to monitor and diagnose in real time, anticipate potential failures, and head them off before a crash? Remote monitoring and management should be more than a passive exercise. It should be, and can be, eyes and ears that not only notify but enable proactive issue remediation without downtime or data loss. After all, we want to make darn sure Aunt Millie gets that strawberry huller, mushroom brush, and melon baller in her Christmas stocking, don’t we?

What to learn from the Amazon Downtime Debacle

5.2.2011 | Cloud, Disaster Recovery, Failure, High Availability, Technology, uptime, Virtualization

“Good enough” service is only acceptable until something bad happens, like the Amazon downtime debacle. Then everyone demands compensation for business interruption and lost revenue. A couple of industry experts in a Network World article went so far as to say that customers actually shared in the blame for their losses, stating they should have anticipated failure and made plans to minimize impact. Silly customers. They thought they were doing that when they contracted for cloud services.

Cloud service providers are not experts in uptime assurance. They know it, which is why their availability SLAs are empty promises. Uptime assurance is sophisticated technology. Running virtualization software on armies of servers doesn’t magically create it. Purpose-built software and hardware, proactive availability monitoring and management, and best-practices oversight do. These solutions – industry-standard with compelling ROI – are available today. But until cloud customers demand SLAs with real teeth, and service providers own up to their responsibility to protect their customers’ interests, good enough will remain the unacceptable standard in the cloud.
