Building a scalable application that has high availability is not easy. Problems can crop up in unexpected ways that cause your application to stop working and stop serving your customers’ needs.
No one can anticipate where problems will come from and no amount of testing will identify and correct all issues. Some issues end up being systemic problems that require the correlation of multiple systems in order for the problems to occur. Some are more basic, but are simply missed or not anticipated.
Links and More Information
The following are links mentioned in this episode, and links to related information:
- Modern Digital Applications Website (https://mdacast.com)
- Lee Atchison Articles and Presentations (https://leeatchison.com)
- Architecting for Scale, published by O’Reilly Media (https://architectingforscale.com)
Application availability is critical to all modern digital applications. But how do you avoid availability problems? You can do so by avoiding those traps that cause poor availability.
There are five main causes of poor availability that impact modern digital applications.
Poor Availability Cause Number 1
Often, the main driver of application failure is success. The more successful your company is, the more traffic your application will receive. The more traffic it receives, the more likely you will run out of some vital resource that your application requires.
Typically, resource exhaustion doesn’t happen all at once. Running low on a critical resource can cause your application to begin to slow down, backlogging requests. Backlogged requests generate more traffic, and ultimately a domino effect drives your application to fail.
But even if it doesn’t fail completely, it can slow down enough that your customers leave. Shopping carts are abandoned, purchases are left uncompleted. Potential customers go elsewhere to find what they are looking for.
Increase the number of users using your system, or increase the amount of data those users consume in your system, and your application may fall victim to resource exhaustion. Resource exhaustion can result in a slower, unresponsive application.
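The domino effect described above can be illustrated with a toy simulation. This is only a minimal sketch under stated assumptions: the function name, the per-step capacity and arrival numbers, and the idea that half of all backlogged requests are retried each step are all hypothetical values chosen for illustration, not measurements from any real system.

```python
def simulate_backlog(capacity, arrivals, retry_fraction=0.5, steps=4):
    """Return the backlog size after each time step.

    capacity: requests the service can complete per step (hypothetical).
    arrivals: new requests arriving per step (hypothetical).
    retry_fraction: assumed share of backlogged requests that impatient
        clients send again, adding duplicate traffic.
    """
    backlog = 0
    history = []
    for _ in range(steps):
        # Backlogged requests generate more traffic in the form of retries.
        retries = int(backlog * retry_fraction)
        demand = backlog + arrivals + retries
        # Anything beyond capacity carries over to the next step.
        backlog = max(0, demand - capacity)
        history.append(backlog)
    return history


# Slightly under capacity: no backlog ever forms.
healthy = simulate_backlog(capacity=100, arrivals=90)

# Only 10% over capacity: the backlog grows faster every step.
overloaded = simulate_backlog(capacity=100, arrivals=110)
```

In this sketch a modest 10% overload does not produce a steady backlog of 10 requests per step. Because the backlog itself adds traffic, the queue grows faster each step, which is the slowdown-then-failure pattern described above.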
Poor Availability Cause Number 2
When traffic increases, sometimes assumptions you’ve made in your code about how your application can scale are proven to be incorrect. You need to make adjustments and optimizations on the fly to work around those assumptions and keep your system performant. You need to change your assumptions about what is critical and what is not.
The realization that you need to make these changes usually comes at an inopportune time: when your application is experiencing high traffic and its shortcomings start becoming exposed. This means you need a quick fix to keep things operating.
Quick fixes can be dangerous. You don’t have time to architect, design, prioritize, and schedule the work. You can’t think things through to make sure this change is the right long-term change. You need to make changes now to keep your application afloat.
These changes, implemented quickly and at the last minute with little or no forethought or planning, are a common cause of problems. Untested or barely tested fixes, hastily thought-through fixes, and bad deployments caused by skipping important steps can all introduce defects into your production environment. The fact that you need to make changes to maintain availability will itself threaten your availability.
Poor Availability Cause Number 3
When an application becomes popular, your business needs usually demand that your application expand and add additional features and capabilities. Success drives larger and more complex needs.
These increased needs make your application more complicated and require more developers to manage all of the moving parts. Whether these additional developers are working on new features, updated features, bug fixes, or other general maintenance, the more individuals who are working on the application, and the more moving parts that exist, the greater the chance of a problem occurring that brings your application down.
The more your application is enhanced, the more likely it is that an availability problem will occur.
Poor Availability Cause Number 4
Highly successful applications usually aren’t islands unto themselves. They often interact with other applications, either applications that are part of your application suite or third-party applications. Third-party applications can be provided by vendors or partners. They can be external SaaS services. Or, they can be integrations with customer systems. The more dependencies you have, the more exposed you are to problems introduced by those external systems.
Your availability will ultimately become tied to the availability and quality of those external applications. The more dependencies you have, the more fragile your application becomes.
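A rough calculation shows how quickly dependencies erode availability. The sketch below assumes that every dependency is a hard dependency (if it is down, your application is down) and that failures are independent of one another. Under those assumptions the combined availability is the product of the individual availabilities. The function name and the 99.9% figures are hypothetical, chosen only for illustration.

```python
import math

HOURS_PER_YEAR = 24 * 365


def composite_availability(own, dependencies):
    """Availability of an application that needs every dependency to be up.

    Assumes independent failures and hard (serial) dependencies, so the
    individual availabilities multiply.
    """
    return own * math.prod(dependencies)


def expected_downtime_hours(availability):
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


# Hypothetical figures: the application alone is 99.9% available.
alone = composite_availability(0.999, [])

# The same application with five hard dependencies, each also 99.9%.
with_deps = composite_availability(0.999, [0.999] * 5)

print(f"alone:     {alone:.4%}, ~{expected_downtime_hours(alone):.1f} h/year down")
print(f"with deps: {with_deps:.4%}, ~{expected_downtime_hours(with_deps):.1f} h/year down")
```

With these assumed numbers, five dependencies take the application from roughly 9 hours of expected downtime a year to more than 50, even though no single component got worse. Real systems differ, since failures are often correlated and some dependencies can be made optional, but the direction of the effect is the same.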
Poor Availability Cause Number 5
As your application grows in complexity, the amount of technical debt your application has naturally increases. Technical debt is the accumulation of desired software changes and pending bug fixes that typically build up over time as an application grows and matures. Technical debt, as it builds up, increases the likelihood of a problem occurring.
The more technical debt you have, the greater the likelihood of an availability problem.
All fast-growing applications have one or more of these problems, and each one increases the risk of an availability issue. Availability problems can begin occurring in applications that previously performed flawlessly. They can quietly creep up on you, or they may start suddenly without warning.
But most applications, growing or not, will eventually have availability problems.
Availability problems cost you money, they cost your customers money, and they cost you your customers’ trust and loyalty. Your company cannot survive for long if you constantly have availability problems.
Focusing on these five causes will go a long way toward improving the availability of your applications and systems.
Tech Tapas — Database backup test failure
I want to tell you a story. You tell me if this is ok or not.
This is from a conversation I heard at a company I was working with.
The conversation was a message from one engineer to their peers. They were trying to update them on the situation of a production database. The message went like this:
“We were wondering how changing a setting on our MySQL database might impact our performance…”
“…but we were worried that the change might cause our production database to fail.”
“Since we didn’t want to bring down production, we decided to make the change to the replica database instead…the backup database…”
“After all, it wasn’t being used for anything at the moment.”
Of course, you can imagine what happened next, and you would be right.
The production database had a hardware failure, and the system automatically tried to switch over to use the replica database.
But the replica database was in an inconsistent state due to the experimentation that was going on with it. As such, the replica database was not able to take on the job as the new master…it quickly became overwhelmed…and then it failed as well.
Both the original master and the replica failed. The replica, whose sole purpose for existence was to take over in case the master failed, wasn’t able to do so because it was being tinkered with by other engineers.
Those other engineers didn’t understand that just because the replica wasn’t actively servicing production traffic doesn’t mean it wasn’t being used. Its entire job was to sit in wait to take over if necessary. By experimenting on that replica database, they were inadvertently impacting production. They were introducing risk into the production system, risk that wasn’t appropriate. Risk that could, and in this case did, cause serious problems.
This, by the way, is a true story. But it is also not an uncommon story. I hear about similar sorts of problems in many engineering conversations and many operations management conversations. Not having a clear understanding or appreciation of how certain actions impact the risk management plans for a production system can be disastrous. This is why active and continuous risk management planning is critical for production networks to stay operational.