Our first blog of this series outlined the various challenges posed to DevOps and IT leaders resulting from the COVID-19 pandemic. In this blog we take a closer look at why businesses need to design applications and IT infrastructure for dynamic scalability and how this can be achieved.
Every developer and CIO understands they need to be able to quickly scale their cloud infrastructure to meet unexpected spikes in usage caused by unforeseen events. Most are also thinking about how to solve this problem. In fact, a key consideration of the Cloud Native design ethos of loosely coupled services on immutable infrastructure is to simplify scaling.
That said, the rapid changes in buyer behavior as a result of COVID-19 have demonstrated that scalability is now table stakes for running an online business or service. With the majority of the global population having been restricted to their homes, internet traffic spiked roughly 40% worldwide over the span of two short weeks. Increased e-commerce activity, increases in online gaming and streaming, and the shift to working from home all contributed to the spike. But for businesses, the spike meant a new capacity normal for their infrastructure requirements.
The shifts happened all at once and caused cascading application delivery failures across nearly all affected industries. But wait – wasn’t cloud computing and flexible infrastructure supposed to give companies better tools to weather exactly these types of extreme circumstances? The answer is yes, but doing so on a fast timetable remains exceptionally difficult.
Most cloud architects design applications with load balancing and vertical scaling baked in. But these structures are no longer sufficient. Load balancing breaks down in a multi-cloud world where an application might be split across multiple services in multiple clouds. Hybrid infrastructure likewise handles unforeseen spikes poorly: it remains dependent on physical hardware, and failover to the cloud is rarely without a hiccup if the application's demands change on more than a single dimension. Designing for cloud native helps more, because containers decouple software from hardware, but load balancing across clouds remains challenging. Load balancing within a single large public cloud, meanwhile, presents a major business risk: the cloud might have problems, or pricing could make operations prohibitive.
This is where a new generation of application delivery controllers (ADCs) comes into play: ADCs designed to be effortlessly multi-cloud, lightweight and rapidly scalable. To understand why, let’s look at some of the failure modes from the recent COVID-19 crisis.
The Problem of Unpredictable Scaling Events
CCInsight data shows that the U.S. is seeing immense growth in retail e-commerce revenue for the year. The graph below shows the growth in online revenue from January to June 2020.
For an omni-channel retailer (as the majority are these days), these usage spikes require fast reallocation of resources to accommodate higher levels of traffic.
Supermarkets saw some of the largest increases in website traffic and e-commerce transactions. By April 2020, global supermarket web traffic had seen a 189% increase when compared with earlier months. This resulted not only in slower application performance but also in rising infrastructure costs for supermarkets as they struggled to meet demand.
The gaming industry, which is no stranger to massive spikes in usage, saw a surge during the early stages of the pandemic. NetScout recorded a 295% increase in South American traffic and a 100% increase in U.S. traffic from February to April 2020. Many game aficionados noticed higher-than-usual load and refresh times on popular multi-player titles during peak hours.
% Increase in Gaming Traffic (a US ISP)
Video conferencing is a particularly bandwidth-intensive application due to the need to maintain multiple connections and constantly refresh. From December 2019 to May 2020, the conferencing platform Zoom experienced a massive increase in users from 10 million to 300 million – a 30-fold jump, or a 2,900% increase. While this kind of sudden increase in customers would seem like any business owner's dream, Zoom was not ready for this kind of expansion. Zoom users experienced significant outages on Sunday, 17 May 2020, disrupting virtual church services and even the UK government's daily COVID-19 briefing.
Microsoft saw a similar spike in demand for Microsoft Teams, their competing product. According to Aternity’s Remote Workforce Report, Teams has recently overtaken Zoom’s growth rate: from February 17 to June 14, Teams usage grew 894% from its base usage, followed by Zoom at 677%.
How Cascading Problems Impacted Microsoft Teams
Teams is hosted on Microsoft’s Azure platform and is built on a microservices architecture; however, Microsoft still ran into problems with the rapid increase in demand, stating that “As the surge began, it became clear that our previous forecasting models were quickly becoming obsolete...”. Teams experienced multiple outages when demand spiked rapidly between 16 and 20 March 2020 and again in June 2020.
Microsoft is among the world’s most advanced technology companies and its Azure cloud computing platform is one of the three largest public clouds. It also runs the world’s largest SaaS product – Office 365. But the COVID-19 crisis created a cascade of compounding events that combined to push the Teams infrastructure past its limits.
The COVID-19 crisis disrupted server production in China, slowing Microsoft’s hardware expansion plans for its cloud data centers; the rise of working-from-home drove huge growth in bandwidth-hogging VPN connections to Office 365, further impacting the Azure cloud; demand for Microsoft’s own Xbox gaming services grew quickly during the pandemic; and hackers targeted a series of Distributed Denial of Service attacks at Azure and its customers, further taxing bandwidth and capacity.
The combination of all of these factors, all driven by an unforeseen event, created an impact far greater collectively than any one of these problems might have caused alone. The result was a mad scramble by Microsoft to increase capacity in its data centers and to boost the capacity of its networks and cloud computing capabilities by any means necessary. To conserve compute capacity and bandwidth, Microsoft turned off a number of nonessential features in Teams, such as showing whether a person is composing a reply to a chat. Naturally, Amazon turned Microsoft’s Azure and Teams challenges into a marketing plug for the Amazon cloud.
The Problem of Predictable Scaling Events
These challenges were caused by unplanned events, but even predictable high-traffic events can cause scalability issues. These kinds of shifts in buyer behavior have been magnified by COVID-19, but they have been occurring for years.
Video Game Releases
A great example of this is video game launches, typified by the outages resulting from the Diablo III launch of 2012. Gamers around the globe had been patiently anticipating the launch of a game that had been in the making for ten years. Thousands of launch events were underway worldwide and just as everyone was ready to go … (Error 37) The servers are busy at this time. Please try again later. Blizzard’s Josh Mosqueria responded to the incident by saying, “It really caught everybody by surprise. It’s kind of funny to say. You have such an anticipated game—how can it catch anybody by surprise? But I remember being in the meetings leading up to that, people saying, ‘Are we really ready for this? OK, let’s double the predictions, let’s triple the predictions.’ And even those ended up being super conservative.”
Gaming companies are no strangers to the challenges of traffic spikes, with new game releases taking up their fair share of internet traffic. The difficulty in preparing for these launches is the sheer volume of traffic, which has been known to exceed the wildest of predictions, making these events virtually impossible to plan for adequately.
So clearly it doesn’t take a global pandemic to prompt rapid changes in buyer behavior and in the scale of demand. How can you design your business and architect systems to cope with the problem of scale?
How to Design for Scalability
While predictable and unpredictable changes in buyer behavior cause huge challenges for many businesses, those businesses that have prepared in advance by designing for scalability can convert these changes into massive opportunities.
One reason why businesses have not been able or willing to adequately prepare for these events is the cost involved. Providing sufficient capacity to meet peak demand (when peak demand might be 3,000% higher than normal) can be prohibitively expensive, and it is an inefficient allocation of resources when that peak demand might last for only ten days out of the year. Many traditional technology solutions, including traditional load balancers or ADCs, do not support flexible on-demand scaling and intelligent automation, and therefore require businesses to provision permanently for the moment of peak demand. An additional challenge is licensing: even if a load balancer or ADC technically supports flexible scale, its licensing structure might not, which makes scaling costly and discourages good planning for scaling events.
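To make the cost argument concrete, here is a minimal back-of-the-envelope sketch in Python. The instance counts and the hourly price are hypothetical figures chosen only for illustration, and the model assumes capacity can be added and released instantly; it is not a pricing model for any particular cloud or ADC.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, ignoring leap years


def static_cost(peak_instances, hourly_rate_cents):
    """Yearly cost (in cents) of running peak capacity all year round."""
    return peak_instances * hourly_rate_cents * HOURS_PER_YEAR


def elastic_cost(base_instances, peak_instances, peak_days, hourly_rate_cents):
    """Yearly cost (in cents) of running base capacity normally and
    peak capacity only during the peak days."""
    peak_hours = peak_days * 24
    normal_hours = HOURS_PER_YEAR - peak_hours
    instance_hours = base_instances * normal_hours + peak_instances * peak_hours
    return instance_hours * hourly_rate_cents


# Hypothetical workload: 10 instances normally, a peak 3,000% higher
# (31 times the baseline, so 310 instances) lasting 10 days, at an
# assumed price of 10 cents per instance-hour.
base = 10
peak = base * (1 + 3000 // 100)
always_on = static_cost(peak, 10)
on_demand = elastic_cost(base, peak, 10, 10)
print(f"provisioned for peak: ${always_on / 100:,.2f}")
print(f"scaled on demand:     ${on_demand / 100:,.2f}")
```

Under these assumed numbers, provisioning permanently for the peak costs roughly 17 times as much per year as scaling on demand. Real savings depend on pricing, licensing terms and how quickly capacity can actually be added.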
As stated in our introductory blog post, the key lesson from these unpredictable or even predictable shifts is that systems need to be capable of handling rapid growth or rapid shrinkage. System architecture needs to account for scalability and flexibility to cope with these kinds of strains. Static servers and fixed software licenses no longer serve the needs of businesses dealing with unpredictable changes in demand and traffic.
Designed specifically for developers, Nova – Snapt's cloud-native ADC – incorporates the following key features, which we believe are essential for surviving these kinds of changes in consumer behavior today and in the future:
- Real-time scalability with scale up or down in 3 minutes or less
- Flexible licenses to make it less costly to scale during outlier events
- Multi-cloud functionality and application portability
While these features are necessary capabilities for solving the problem of scalability, they only scratch the surface of the culture and business shifts that are required to enable rapid scaling for web scale applications. To start with, the enterprise must be willing to put in place (and technically capable of operating) a multi-cloud infrastructure with intelligent automation that can orchestrate new instances of any type to alleviate heavy application loads. The tip of the spear is the ADC. But the entire architecture, including CDNs, databases, and service meshes, must be able to scale as it approaches peak capacity. This architecture requires embracing more flexible notions of infrastructure, where most computing, storage, and networking capacity is in the cloud but can be linked to physical infrastructure running containers: a hybrid multi-cloud architectural paradigm.
Under this type of modern cloud-native architecture, when demand spikes and customer experiences are at risk, the orchestrated system can autonomously decide to spin up thousands of new instances of the stressed application (or stressed services within the application) in two or even three public clouds. The expansion can be directed to the data centers geographically closest to the heaviest demand to put the application as close as possible to end users and minimize round trips. This is a fundamentally different paradigm, one that is impossible with hardware-based ADCs, which require weeks or months for installation. It even breaks down for most cloud-based ADCs: their spin-up times are too slow to keep pace with rapidly spiking demand, and their licensing structures are too restrictive to allow rapid scaling up and down on any size of cloud instance.
Snapt designed Nova, our cloud native ADC, to address real-time scaling with self-scaling ADC deployments capable of massive scale and automation. Nova works just as well for a small app with a few microservices as for a massive webscale app with millions of users and hundreds of compute nodes. You can download and run the community edition of Nova for free, or contact us to learn more about using Nova to tackle the largest enterprise scaling events.
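As an illustration of the kind of decision such an orchestrated system makes, the Python sketch below sizes a deployment with a proportional scaling rule (the same shape of rule documented for the Kubernetes Horizontal Pod Autoscaler) and then splits the additional instances across regions in proportion to demand. It is a simplified sketch, not a description of how Nova or any specific orchestrator works; the function names, region names and load figures are all hypothetical.

```python
def desired_instances(current, observed_load, target_load, max_instances):
    """Proportional scaling rule: ceil(current * observed / target),
    clamped to the range [1, max_instances].

    Loads are integers in the same unit, for example requests per
    second per instance. Integer arithmetic avoids rounding surprises.
    """
    if current <= 0 or target_load <= 0:
        raise ValueError("current and target_load must be positive")
    wanted = (current * observed_load + target_load - 1) // target_load
    return max(1, min(wanted, max_instances))


def allocate_by_demand(extra, demand_by_region):
    """Split `extra` new instances across regions in proportion to demand,
    handing out leftover instances to the largest remainders first."""
    total = sum(demand_by_region.values())
    if total <= 0:
        return {region: 0 for region in demand_by_region}
    allocation, remainders = {}, {}
    for region, demand in demand_by_region.items():
        allocation[region], remainders[region] = divmod(extra * demand, total)
    leftover = extra - sum(allocation.values())
    by_remainder = sorted(remainders, key=remainders.get, reverse=True)
    for region in by_remainder[:leftover]:
        allocation[region] += 1
    return allocation


# Hypothetical spike: 10 instances each seeing 1,020 requests/s
# against a target of 600 requests/s per instance.
current = 10
wanted = desired_instances(current, 1020, 600, max_instances=500)
extra = wanted - current
# Hypothetical share of demand per region, in percent.
plan = allocate_by_demand(extra, {"us-east": 50, "eu-west": 30, "sa-east": 20})
print(wanted, plan)
```

The upper bound on instances matters in practice: without it, a faulty metric or an attack could trigger unbounded (and unboundedly expensive) expansion.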
Want to find out more about how to achieve real-time scaling? Download our white paper "Real-Time Scaling in Cloud Native Application Delivery is Hard. Here’s How To Do It."