Both developers and system admins regularly focus on strategies for building reliable infrastructure that minimizes downtime. The primary reason is that so many companies now rely on Internet-based services, making downtime financially damaging.
Users expect a stable and reliable service, so interruptions not only decrease customer satisfaction but increase support requests.
In this article I’m going to talk about three areas that are particularly sensitive when it comes to downtime, and offer some improvements that will push you towards 99.9999% uptime.
1. Monitoring and Alerts
Properly monitoring your infrastructure is the first step towards being proactive about problems, and it’s the most efficient way of discovering issues before they affect your customers.
This includes aggregating and retaining a record of statistics such as application performance metrics and system resource utilization. Alerting then builds on metric collection by evaluating rules against the current metrics. In other words, it looks for anything out of the ordinary.
Monitoring is often implemented by running an agent on each host that gathers metrics and reports back to a central server. The metrics are stored in a database and are available for searching, alerting, and graphing.
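As a rough illustration, the collection side of such an agent can be sketched in a few lines of Python. The gateway URL and metric names here are hypothetical, and the actual network transport is omitted:

```python
import json
import os
import socket
import time

def collect_metrics():
    """Gather a few basic system metrics for this host."""
    load1, load5, load15 = os.getloadavg()  # Unix-only
    return {
        "host": socket.gethostname(),
        "timestamp": time.time(),
        "load_1m": load1,
        "load_5m": load5,
        "load_15m": load15,
    }

def report(metrics, gateway="http://metrics.example.com/ingest"):
    """Serialize metrics for shipping to the central server.

    A real agent would POST the payload to the gateway here, e.g. with
    urllib.request; this sketch just returns the serialized payload.
    """
    return json.dumps(metrics)

print(report(collect_metrics()))
```

A real agent would run this on a timer (or as a daemon) and add whatever application-level metrics matter to you.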
Several monitoring tools can do this for you, including:

- Graphite provides an API that is supported by dozens of applications and programming languages. Metrics are pushed to, stored in, and graphed by the central Graphite installation.
- Prometheus pulls data from a variety of official and community-supported clients. It has a built-in alerting system, is highly scalable, and comes with client libraries for several programming languages.
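To make the idea of evaluating rules against current metrics concrete, here is a minimal sketch of an alert evaluator. The metric names and thresholds are made up; real systems like Prometheus use a much richer rule language:

```python
import operator

# Each rule: (metric name, comparison, threshold). These are invented examples.
RULES = [
    ("cpu_percent", operator.gt, 90.0),
    ("disk_free_gb", operator.lt, 5.0),
    ("http_error_rate", operator.gt, 0.05),
]

def evaluate(metrics, rules=RULES):
    """Return the rules that are currently firing for a metrics snapshot."""
    firing = []
    for name, compare, threshold in rules:
        value = metrics.get(name)
        if value is not None and compare(value, threshold):
            firing.append((name, value, threshold))
    return firing

snapshot = {"cpu_percent": 97.0, "disk_free_gb": 42.0, "http_error_rate": 0.01}
print(evaluate(snapshot))  # only the CPU rule fires
```

Whatever fires would then be routed to a notification channel (email, pager, chat) by the alerting component.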
2. Software Deployment Improvement
Software deployment strategy is an area that many people overlook, but it has a huge impact on your downtime.
A deployment process that is very complex, or that requires a number of manual steps, causes the production environment to lag behind the development environment. Each release then bundles a much larger set of changes, which naturally carries a much higher risk of problems arising. This in turn leads to more bugs, which slow down development and can potentially make services unavailable.
To combat this, you need some up-front planning. If you already have this issue, set aside some time to smooth out the problems and start afresh before moving on.
Finding a strategy that lets you automate code integration, testing, and deployment gives you the best chance of keeping your production environment in sync with your development environment.
A good place to start automating deployments is to make sure that you’re following best practices with regards to continuous integration and delivery (CI/CD) and testing the software. These best practices include:
Maintaining a Single Repository
Maintaining a single repository ensures that every person on the development team works on the same code, and can test their changes easily.
Automating Testing and Build Processes
Automating your builds and testing is essential. It makes it simple to deploy to an environment that resembles production, which is particularly helpful when debugging platform-specific issues.
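A CI pipeline is, at its core, an ordered list of steps that aborts on the first failure. The sketch below illustrates that control flow; the step names are invented, and in practice each step would shell out to your real lint, test, and build commands:

```python
def run_pipeline(steps):
    """Run named steps in order; stop at the first failure.

    Each step is a (name, callable) pair; the callable returns True on success.
    """
    for name, step in steps:
        print(f"running {name}...")
        if not step():
            print(f"{name} failed; aborting pipeline")
            return False
    print("pipeline succeeded")
    return True

# Hypothetical steps; in a real pipeline these would invoke your tooling,
# e.g. via subprocess calls to your linter, test runner, and build system.
steps = [
    ("lint", lambda: True),
    ("unit tests", lambda: True),
    ("build", lambda: True),
]
run_pipeline(steps)
```

Hosted CI services and tools like Jenkins implement exactly this fail-fast behaviour for you, so you rarely write it by hand; the value is in understanding why a broken step must block the deploy.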
3. Implementing High Availability
The last strategy for minimizing downtime is to apply high-availability concepts to your infrastructure. These are the principles used to design resilient, redundant systems.
- The system must be able to detect its own health. When a component fails, the system needs to know precisely where the failure occurred.
- The system must be able to redirect traffic. This is essential for minimizing downtime because it ensures traffic is moved away from failed servers quickly, with minimal interruption.
- Eliminate single points of failure. This means running several redundant servers. Moving from a single server to multiple web servers behind a load balancer is one way to upgrade to a highly available infrastructure. The load balancer performs regular health checks on the web servers and routes traffic away from those that are failing. (It also enables more seamless code deployments.)
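The health-check-and-route behaviour of a load balancer can be sketched as follows. Real balancers such as HAProxy or Nginx handle this for you; the backend addresses below are invented and the health check is simulated:

```python
import itertools

class Backend:
    def __init__(self, address):
        self.address = address
        self.healthy = True

    def health_check(self):
        # A real check would probe the backend, e.g. an HTTP GET to a
        # health endpoint, and set self.healthy from the response.
        return self.healthy

class LoadBalancer:
    def __init__(self, backends):
        self.backends = backends
        self._cycle = itertools.cycle(backends)

    def pick(self):
        """Round-robin over backends, skipping any that fail their health check."""
        for _ in range(len(self.backends)):
            backend = next(self._cycle)
            if backend.health_check():
                return backend
        raise RuntimeError("no healthy backends available")

web1, web2 = Backend("10.0.0.1:80"), Backend("10.0.0.2:80")
lb = LoadBalancer([web1, web2])
web2.healthy = False          # simulate a failed health check
print(lb.pick().address)      # traffic now only reaches web1
```

The same skipping behaviour is what makes zero-downtime deploys possible: drain one backend, deploy to it, let it pass health checks again, and move on to the next.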
Database replication is another way to add resilience and redundancy. Different database systems offer different replication configurations, but group replication is particularly interesting because it allows both read and write operations on a redundant cluster of servers. Failed servers can be detected, and traffic routed around them to avoid downtime.
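One common way to take advantage of replicas is read/write splitting: writes go to the primary, while reads are spread over the replicas. Here is a hypothetical sketch of that routing decision; the server names are invented, and in practice a database driver or proxy would do this for you:

```python
import itertools

WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "CREATE", "ALTER", "DROP")

class ReplicatedPool:
    """Route SQL statements to a primary or to read replicas."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)

    def route(self, sql):
        verb = sql.lstrip().split(None, 1)[0].upper()
        if verb in WRITE_VERBS:
            return self.primary          # writes must hit the primary
        return next(self._replicas)      # reads round-robin over replicas

pool = ReplicatedPool("db-primary", ["db-replica-1", "db-replica-2"])
print(pool.route("INSERT INTO users VALUES (1)"))  # db-primary
print(pool.route("SELECT * FROM users"))           # db-replica-1
```

Group replication goes further, allowing writes on multiple nodes, but the routing idea is the same: send each query to a server that can safely handle it, and drop failed servers out of the rotation.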
In this article we’ve covered three areas where process and infrastructure improvements will lead to less downtime. That means happier clients and, ultimately, more revenue.
Investigating the changes you can make to reduce downtime is one of the best investments you can make in software; start by improving deployment, monitoring your metrics, and ensuring high infrastructure availability.