Best Practices for Disaster-Resilient Application Infrastructure

by Craig Risi on Tips and Tricks • January 18, 2022

Websites and applications should be designed to be functional, high-performance, and resilient to common and uncommon disasters. Despite your best plans, it’s inevitable that events will conspire to interrupt the smooth operation of your systems. For businesses that depend on software and web services, disasters can have a massive financial impact, making it critical for IT teams to do everything possible to avoid or mitigate them.

The way your applications respond to disaster incidents is a good indication of the overall quality of your architecture and code. Today, we will share some tips that should help you and your team build an application infrastructure that is resilient against even the worst disasters.

What Disasters Do You Need to Prepare For?

Before discussing some of the development and infrastructure best practices, we must understand the types of critical incidents that often bring systems down—even systems that seem to be robust. 

Software Failure

As much as we try to create error-free applications, the complexity of modern software, and the many platforms, operating systems, versions, and updates released on a daily basis, mean that a software failure is probably going to happen at some point. While most defects that slip through are relatively minor, some can cause wholesale crashes under certain conditions, and how quickly an application and its development team can recover from these major issues is a reflection of how well the application is designed.

Performance Failure

This is similar to software failure, except that it occurs when excessive load or performance constraints on either an application or its infrastructure cause it to fall over. Unlike functional failures, performance issues can be more complicated to rectify, though they are often easier to mitigate if you test for them early.

Server Failure

As with any form of hardware, servers are susceptible to failure, whether through a loss of power or connectivity or the failure of a physical component on the server itself. When a server goes down, all software processes running on it inevitably die with it. And even though most companies split their applications across multiple servers, it is possible, albeit rare, for an entire data center to go down when connectivity issues strike. You therefore want to ensure that your applications are resilient to failure in these circumstances.

Security Failure

Sadly, we live in a world full of malicious actors who look for loopholes in applications and their infrastructure so they can bring them down, hold your intellectual property or critical data to ransom, or find some other way of committing fraud and profiting from it.

This is a very high-level look at the types of disasters software engineers need to prepare for; the list could easily be expanded to include countless other issues. However exhaustive such a list might be, teams need to ensure that they do everything they can to produce applications that remain resilient in unpredictable environments.

Resiliency Solutions

The goal is not to focus on problems but rather on solutions, so below are a few best practices that teams can follow to ensure that their application infrastructure is best positioned to survive disasters. 

1. Redundancy

In any software system, redundancy needs to be built in to ensure that if one server, data center, or application stops working as expected, the system can still operate through an alternative available component. Redundancy is one of the key design principles behind cloud architecture, where applications can sit in multiple servers and regions so that they remain available when other parts fail.

However, this applies not just to your overall host infrastructure but to every aspect of your system, including databases, load balancers, and the services and containers that make it up. Businesses often introduce a form of redundancy through multi-server availability across different locations, but then leave parts of their applications tied to a single location or unable to work unless every system is fully operational. As a result, the software still doesn't work as expected despite the redundancy.

All services should be able to run and scale independently of others and switch between multiple copies of these services in different locations as needed. 
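
To make this concrete, here is a minimal sketch in Python (standard library only) of client-side failover between redundant copies of a service. The replica URLs are hypothetical, and in practice this switching is usually handled by a load balancer or service mesh rather than in application code; the point is simply that a request should be able to succeed as long as at least one copy is healthy.

```python
import urllib.request

# Hypothetical redundant copies of the same service in different regions.
SERVICE_REPLICAS = [
    "https://eu-west.example.com/api/orders",
    "https://us-east.example.com/api/orders",
]

def fetch_with_failover(urls, timeout=2):
    """Try each replica in turn and return the first successful response."""
    last_error = None
    for url in urls:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as resp:
                return resp.read()
        except OSError as exc:  # covers connection errors, timeouts, and HTTP errors
            last_error = exc    # this replica is unavailable; try the next one
    raise RuntimeError(f"All replicas failed; last error: {last_error}")
```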

2. Avoid Single Points of Failure

Ensuring that you have redundancy in place for all of your hardware and applications is critical, but the design of your system needs to guarantee that you do not become too reliant on any given aspect of the system. While certain aspects of a system (such as security authorization) are certainly more critical than others, you don’t want to run into the problem where large parts of your system are no longer operational simply because one part of your application is down. 
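
As a hypothetical illustration of avoiding that kind of coupling, the sketch below lets a non-critical dependency fail without taking the whole request down. The `catalog` and `recommender` objects are stand-ins for whatever services your own system depends on; the point is that only truly critical dependencies should be able to block a response.

```python
def get_product_page(product_id, catalog, recommender):
    """Build a product page that degrades gracefully when the recommendation
    service (a non-critical dependency) is unavailable."""
    product = catalog.get(product_id)  # critical path: let failures here surface

    try:
        recommendations = recommender.similar_to(product_id)
    except Exception:
        # A recommender outage should not make the whole product page fail.
        recommendations = []

    return {"product": product, "recommendations": recommendations}
```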

3. Use Distributed Software Architectures and Containerization

Many of the above suggestions won’t work well if your application is large and requires complex configuration across multiple servers. You could still set up redundancy across multiple locations and switch between them, but if the problem is one of load or scale, doing so becomes difficult.

As such, companies need to work on breaking the different components of their systems into smaller independent parts that can preferably be containerized so that replication, duplication, and scaling can happen far more easily and speedily. 
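
A rough sketch of what one of these small, independent parts might look like is below: a stateless Python service (standard library only) exposing a health endpoint that an orchestrator such as Kubernetes could probe, so that any number of identical replicas can be started, replaced, or scaled freely. The port and path are arbitrary choices for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        # A liveness endpoint lets the orchestrator detect and replace
        # unhealthy replicas automatically.
        if self.path == "/healthz":
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        else:
            self.send_response(404)
            self.end_headers()

if __name__ == "__main__":
    # The service keeps no local state, so identical copies can run
    # behind a load balancer and be scaled up or down at will.
    HTTPServer(("0.0.0.0", 8080), Handler).serve_forever()
```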

4. Consider Multi-Tenancy

No matter how well you set up your on-premises servers or cloud environments for redundancy, there is still a chance that a data center, or your connection to it, goes down and leaves you with a system that, while perhaps still available through your redundancy measures, underperforms for what you need.

Multi-tenancy, while perhaps more complicated to configure, can help a lot here, as your infrastructure is split across multiple providers or environments, each of which should be able to scale to full production needs should that ever be required.

5. Test Early and Test Always

Most software outages are a result of faults in the software—bugs, performance issues, ineffective design, or other architectural flaws. All of these things, though, can be tested for and mitigated through effective test design and automation. The more you prioritize dealing with issues that can cause failure and test to ensure that your system can mitigate these (by, for example, simulating failures), the more comfortable you will be knowing that your system can handle these failures should they ever occur in production.  
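
For example, a test can simulate an outage in a dependency and assert that the system degrades gracefully. The sketch below (Python, `unittest`) is hypothetical: `get_price` and its cache fallback stand in for whatever failure-handling behavior your own system is supposed to have.

```python
import unittest
from unittest.mock import Mock

def get_price(product_id, pricing_service, cache):
    """Return the live price, falling back to a cached value when the
    pricing service is unreachable (illustrative behavior under test)."""
    try:
        return pricing_service.lookup(product_id)
    except ConnectionError:
        return cache.get(product_id)

class PricingOutageTest(unittest.TestCase):
    def test_falls_back_to_cache_when_pricing_service_is_down(self):
        pricing_service = Mock()
        pricing_service.lookup.side_effect = ConnectionError("simulated outage")
        cache = {"sku-42": 9.99}

        # Simulate the failure and check the system handles it as designed.
        self.assertEqual(get_price("sku-42", pricing_service, cache), 9.99)

if __name__ == "__main__":
    unittest.main()
```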

6. Write Clean Code and Handle Your Exceptions

As previously mentioned, the majority of system failures occur as a result of either poorly architected hosting infrastructure or underlying software issues. A key way to ensure system stability is to follow best coding practices, ensuring that the code is kept clean. At its heart, clean code is code that is easy to understand and change. This matters because if the code ever needs to be updated or fixed, your development team should be able to do so quickly and efficiently.

When code is too complex, it becomes difficult to troubleshoot and identify where issues are, which can add a lot of unnecessary time to your development and maintenance cycles.

Writing clean code will improve your ability not only to produce new features but also to fix the code when things go down or when new security or critical compliance features need to be added quickly. Even if the existing code did its job reliably, code that is hard to understand makes it far easier to introduce errors the moment it is updated or changed.

Along with clean code, your team’s development standards also need to make it clear how exceptions are handled so that when errors of any kind do occur in the code, they don’t bring down the system as a result. 
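
As a rough illustration of that principle, the sketch below shows a worker loop that logs and retries failures with a simple backoff instead of letting a single bad item or transient error crash the whole process. The queue and handler are hypothetical placeholders for your own components.

```python
import logging
import time

logger = logging.getLogger("worker")

def run_worker(queue, handler, max_retries=3):
    """Process items without letting one failure bring the worker down."""
    while True:
        item = queue.get()
        for attempt in range(1, max_retries + 1):
            try:
                handler(item)
                break
            except Exception:
                logger.exception("Failed to process %r (attempt %d)", item, attempt)
                if attempt == max_retries:
                    # Give up on this item, but keep the worker alive.
                    logger.error("Dropping %r after %d attempts", item, max_retries)
                else:
                    time.sleep(2 ** attempt)  # simple exponential backoff
```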

7. Release Small and Release Regularly

A common misconception in software development is that the more often you release into production, the more you expose yourself to risk. The opposite is actually true: there is far more risk in releasing big updates a few times a year than in updating your services daily. In big releases there is more room for things to go wrong and less clarity about the cause of any particular issue. Deploying small changes regularly reduces the risk of something going wrong with any one release, and it makes the troubled code easy to roll back or fix, because you know immediately where the change was.

8. Make Code Easy to Roll Back and Consider Using Feature Flags

Along with the above, it’s important that all code that is deployed is clearly version controlled so that teams can roll back or remove particular services should they ever need to due to an introduced error.

However, releasing regularly can be difficult, as bigger features take time to develop, and the team may want to keep their branch moving forward with their latest working changes even though other dependencies are not ready yet. Feature flags make this possible by hiding unfinished features in production while development continues, removing the need for the big releases you are trying to avoid.
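
A feature flag can be as simple as a conditional around the new code path. The sketch below is a minimal, hypothetical example; real systems typically read flags from a dedicated flag service or configuration store rather than environment variables, and the function names are illustrative only.

```python
import os

def is_enabled(flag_name, default=False):
    """Read a feature flag from the environment (illustrative only)."""
    value = os.getenv(f"FEATURE_{flag_name.upper()}", str(default))
    return value.lower() in ("1", "true")

def checkout(cart):
    # The new payment flow can be deployed to production long before it is
    # switched on, so releases stay small and easy to roll back.
    if is_enabled("new_payment_flow"):
        return new_payment_flow(cart)
    return legacy_payment_flow(cart)

def new_payment_flow(cart):
    ...  # hypothetical new implementation, hidden behind the flag

def legacy_payment_flow(cart):
    ...  # existing behavior, still the default
```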

9. Use ADCs to Manage Your Load and Scale

Every system works best when there is some form of gate or boundary to protect any areas that are exposed so that everything within can operate safely without concern. Application delivery controllers (ADCs), which combine load balancing with a web application firewall (WAF) and monitoring, are vital tools for ensuring that the endpoints of every system are both safe and highly available to your clients.

ADCs, such as Snapt Aria and Nova, provide incredible endpoint protection that prevents all manner of attacks, includes proactive security scanning, and can adjust the load to ensure your system scales correctly and maintains high performance. Built-in global server load balancing (GSLB) can redistribute traffic to other clouds or locations in the event of low performance, a loss of availability, or other high-impact failures.

Having these powerful tools in place gives you confidence in knowing that your system has the right guards in place to protect you from external threats and maintain continuity when disaster strikes. This allows you to focus on developing highly resilient applications that meet the needs of your customers.


With software becoming increasingly complex as we develop new tools and ways of working, the chance for things to go wrong only increases. And so, we need to build our software systems to mitigate these concerns and ensure that we can keep delivering value to our clients for as long as possible.