Our first blog of this series outlined the various challenges posed to DevOps and IT leaders resulting from the COVID-19 pandemic. In this blog we take a closer look at designing for resilience using a Cloud-N disaster recovery and mitigation strategy.
Disaster Recovery (DR) and mitigation have long been just a box-checking exercise in IT. You think about it once a year, paying homage to the “N+1” gods – N+1 meaning, “I’ll buy an extra one just in case.” Keep an extra server, load balancer, or application delivery controller mothballed for the unexpected incident when something breaks in AWS or a key piece of your infrastructure fails. So you pay the N+1 tax: emergency servers on reserve, backup sites, data backups, reserved instances – but you hold your nose.
Then along comes the COVID-19 crisis – the mother of all DR planning scenarios. All of a sudden, the N+1 tax looks like a true bargain. Overnight the terrain of the Internet changes, with massive shifts in traffic patterns hammering online games, shopping, and video streaming services. With most employees working from home or remotely, every IT organization needs to boost VPN capacity and plan for an extended reality of zero-trust networking and authentication.
N+1 and 2N Redundancy
In redundancy planning for Disaster Recovery, the term “N” represents the number of infrastructure components that you need to maintain in an active state. It can be a number of ADCs, servers, routers, switches, clusters, or entire data centers. In your application infrastructure, any components that lack backup or disaster duplication capability are represented simply as N. For the sake of argument, let’s say N represents four Application Delivery Controllers (ADCs).
An N+1 architecture means that for any number of infrastructure components (N), there is one additional component on standby providing passive redundant capability in the event that one of the components in N fails. So in our ADC configuration where N represents four ADCs, N+1 would mean there is one ADC providing passive redundant capability to four active ADCs. An N+1 strategy is sufficient for infrastructure where only one component in N is likely to fail at one time. However, if multiple simultaneous component failures are likely and downtime is unacceptable, then N+1 is insufficient.
A 2N architecture means a fully redundant system with two independent mirrored sets of components that are not connected in any way and do not share dependencies. So in our ADC configuration where N represents four ADCs, 2N would mean there is an additional four mirrored ADCs providing redundancy. In this configuration, if one component or even all components in N fail, the mirrored component(s) should be sufficient to accommodate a full application and capacity load. Companies can run and deploy 2N live in production as two separate systems (connected only at the control plane) or they can have one of the Ns on passive standby to which they can failover as an entire replacement system if the active N fails.
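The arithmetic behind these two models can be made concrete with a small sketch. The numbers below reuse the four-ADC example from above; the function itself is illustrative, not part of any real product or SDK.

```python
# A minimal sketch comparing N+1 and 2N redundancy for a pool of
# four ADCs (the hypothetical example used in this post).

def surviving_capacity(active: int, standby: int, failures: int) -> int:
    """Components still serving traffic after `failures` active units fail.

    Failed units are replaced from standby until standby is exhausted.
    """
    replaced = min(failures, standby)
    return active - failures + replaced

N = 4  # four active ADCs

# N+1: one standby unit absorbs a single failure, but not two
print(surviving_capacity(N, standby=1, failures=1))  # 4 -> full capacity
print(surviving_capacity(N, standby=1, failures=2))  # 3 -> degraded

# 2N: a fully mirrored set survives even total failure of the active set
print(surviving_capacity(N, standby=N, failures=N))  # 4 -> full capacity
```

The takeaway matches the prose: N+1 tolerates exactly one concurrent failure at full capacity, while 2N tolerates the loss of an entire active set.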
Historically, IT implemented these DR strategies without ephemeral, cloud-based infrastructure, relying instead on legacy methods: deploying backup tin boxes and renting fractional capacity ahead of actual demand.
With Cloud Native and DevOps, Disaster Recovery Is Different
Cloud Native and modern DevOps render much of this thinking obsolete. Modern DR strategies can be deployed in near real time; they still require detailed planning, but at a much lower cost. There are three core principles:
- Avoid fixed assets and spending. Instead, focus on flexibility and rely on cloud-based assets.
- Distribute all critical systems and data globally on redundant providers to maintain resilience.
- Design for maximum observability of systems to make it easy to troubleshoot in a crisis and quickly respond by adding or shifting resources.
With smart Cloud Native infrastructure components, an organization can quickly spin up a fully redundant infrastructure or any critical infrastructure and application component, in real-time. When the organization has deployed its infrastructure and applications in containers, and is using a true control plane/data plane deployment pattern, it is relatively easy to create and manage a 2N or even 3N or 4N capability. This can be achieved by leveraging different public computing clouds or across hybrid instances of on-premises and cloud infrastructure.
We call this new disaster recovery mindset “Cloud-N”. Cloud-N makes it far easier to:
- Avoid fixed assets and spending by shifting to a cloud-based deployment strategy that can be dusted off and implemented in an instant.
- Distribute critical systems and data virtually across providers and geographic locations, boosting resilience and reducing critical dependencies and single points of failure.
- Observe all system behavior by making observability of disaster recovery elements a simple extension of production system observability capabilities with the same instrumentation, telemetry, notifications and dashboards.
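The Cloud-N principles above amount to treating the Disaster Recovery plan as data: one logical system definition that can be replayed against any provider. Here is a minimal sketch of that idea; the provider names, regions, and the `Deployment` structure are illustrative assumptions, not a real SDK.

```python
# Hypothetical sketch: a Cloud-N plan expressed as data, so the same
# definition can be deployed to (or redeployed on) any provider.

from dataclasses import dataclass

@dataclass
class Deployment:
    component: str   # e.g. "adc", "api", "db-replica"
    provider: str    # e.g. "aws", "gcp", "azure"
    region: str
    replicas: int

# One logical system, mirrored across two independent providers (a 2N pattern)
cloud_n_plan = [
    Deployment("adc", "aws", "us-east-1", 4),
    Deployment("adc", "gcp", "us-central1", 4),
]

def providers_for(component: str, plan: list) -> set:
    """Which providers host a given component."""
    return {d.provider for d in plan if d.component == component}

# Resilience check: every critical component should span >= 2 providers,
# eliminating a single-provider point of failure.
assert len(providers_for("adc", cloud_n_plan)) >= 2
```

Because the plan is plain data in version control, observability and redeployment of the DR footprint use the same tooling as production, as the third principle suggests.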
How Cloud-N For Disaster Recovery Works With DevOps Methodology
If you follow DevOps, you probably recognize that the three characteristics above map directly to DevOps methodologies. At the most basic level, DevOps is about managing infrastructure – and everything built upon it – (almost) the same way as software code. Continuous Integration / Continuous Deployment (CI/CD) systems in DevOps mirror code deployment pipelines in GitHub: in CI/CD pipelines, DevOps teams manage scripts or code-like constructs that generate infrastructure deployments and manage virtual (and often cloud-based) infrastructure.
CI/CD and DevOps are becoming ever more important for infrastructure deployment as Docker containers become more widely used, bringing with them “manifests” that deploy not only code and applications but also the required virtual compute and networking configurations. At the control plane of this new era we have Kubernetes, now the de facto standard for orchestrating and managing containers. By extension, Kubernetes becomes the default orchestration and management engine for Cloud Native applications and infrastructure.
Through this Cloud Native lens, the difference between Disaster Recovery and DevOps quickly blurs. IT teams have always managed Disaster Recovery – but it has largely operated as a separate “break glass” folder of instructions and procedures. With DevOps and Cloud-N, Disaster Recovery is merely an extension of standard operating procedures. Infrastructure such as ADCs, servers and networking equipment is simply redeployed in another geography, another cloud or in another fashion.
Hybrid Cloud-N + N+1 = Cover Your Disaster Recovery Bases
The reality is that few organizations are entirely Cloud Native. Similarly, few have zero cloud deployments. Today a hybrid existence is the norm.
To design for this hybrid reality, CTOs and IT leaders need to adopt a modular approach to Disaster Recovery. Where possible, a Cloud-N approach can and should be applied, incorporated as just another flavor of standard DevOps procedures. Where physical or hard-to-shift virtual infrastructure is deployed, a more traditional N+1 standby capability is required.
Over time, CTOs and CIOs should consider shifting to a Cloud-N Disaster Recovery strategy because it will afford them faster and easier recovery, at a lower cost, and with less interruption for users. This shift will likely mirror ongoing deployment patterns that are moving applications away from legacy architectures and onto virtual and cloud-based infrastructures.
Viewing Disaster Recovery as a Function of ADCs
A good example of this shift in both application deployment and Disaster Recovery to virtual, observable and flexible infrastructure is ADCs. If a Disaster Recovery plan requires a large physical ADC to back up one or more deployed ADCs, an IT or networking team will have to budget six or seven figures for the initial purchase and then carry that cost on the books while only rarely using the asset. Should the organization need to further increase N+1 capacity as it grows, it might take weeks or months to install a second physical ADC. Once the new system is running, setting up telemetry and observability can take additional days or weeks.
For monolithic and large ADC deployments in virtual environments – a basic “lift and shift” of a physical system into the cloud – getting that system up and running could take hours. Doing so in multiple clouds while maintaining application delivery integrity in production is harder still (for the same reason that legacy ADC providers cannot offer multi-cloud deployment), as is creating a flexible and distributed system. Observability with these large VMs is easier than with physical ADCs, but it remains difficult because monolithic systems have APIs bolted on rather than baked in as design conventions, as you see in Cloud Native.
When an application and infrastructure are designed using Cloud Native principles, then Cloud-N becomes easy. The system is distributed by design and linked by APIs for telemetry and communication. Standard DevOps tools (Jenkins, GitLab, Chef, Puppet, etc.) and observation tools (Prometheus, OpenTracing) can be used to deploy container-based ADCs that are already designed as lightweight data plane additions to extend application agility and delivery performance. Under this methodology, a company can quickly replace lost capacity by spinning-up new ADCs or moving application delivery from one public cloud to another, or from one data center to another within a cloud.
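The “replace lost capacity by shifting delivery to another cloud” step can be sketched in a few lines. This is an illustrative failover policy under assumed pool names – it is not Nova’s actual API or any vendor’s implementation.

```python
# Illustrative sketch: fail application delivery over to a surviving
# ADC pool when the preferred pool's health checks go red.

def pick_healthy_target(pools: dict, preferred: str) -> str:
    """Return the preferred pool if healthy, else any healthy fallback.

    `pools` maps a pool name (e.g. a per-cloud ADC deployment) to the
    latest health-check result.
    """
    if pools.get(preferred):
        return preferred
    for name, healthy in pools.items():
        if healthy:
            return name
    raise RuntimeError("no healthy ADC pool available")

# Hypothetical pools in two clouds; the AWS pool has just failed checks.
pools = {"aws-us-east": False, "gcp-us-central": True}
print(pick_healthy_target(pools, preferred="aws-us-east"))
# -> gcp-us-central: traffic shifts to the surviving cloud
```

In a real deployment, the health inputs would come from the same Prometheus-style telemetry used in production, and the “shift” would be a DNS, anycast, or control-plane routing change rather than a return value.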
Introducing the Nova ADC
The rise of DevOps practices like microservices, containerization and multi-cloud deployments has brought significant scale, but these environments lack the capabilities of a robust Layer 7 ADC. Traditional ADCs don’t “fit” anymore – they retain business models based on hardware, static configuration and no horizontal scalability. The DevOps teams of the world have been left with dull, stock solutions for application delivery.
Enter Nova: a dynamic self-scaling and self-healing ADC designed using Cloud Native principles to fit the DevOps methodology, CI/CD workflows and the multi-cloud reality of modern application delivery.
Nova is capable of infinite scale, intelligent automation, easy integration via REST API, and centralized control and observability. Nova’s pricing model makes dynamic scalability cost-effective: pay per Node per hour, and scale-in and scale-out as you need.
Our customers – from large banks to giant online game companies – are moving towards Cloud Native ADCs like Nova to enjoy all these benefits both in production and in the unfortunate event that bad things happen.