
Fault tolerance A to Z: What you need to know before migrating to the cloud

Sanctions and the loss of Western service providers are far from the only threats to companies’ cloud infrastructure and data. Clouds are targeted by hacker attacks as often as a business’s own IT ecosystem. Other risks include technical problems on the provider’s side or in data networks. How can a company protect its cloud services? What are the current information security trends in the cloud services market?


Shared responsibility model

The cloud service provider (or cloud app vendor) is always responsible for making their solution fault tolerant. However, this does not relieve the customer of the obligation to do their share of work. This concept is known as shared responsibility, where both the service provider and the customer are simultaneously responsible for the system’s fault tolerance, within the limits that are determined by the specific cloud computing service.

For example, an IaaS provider is responsible for securing its data centers and underlying infrastructure. Everything from the virtual machine’s operating system upward is the responsibility of the cloud customer: securing data and applications, and managing operating system security settings, access control, and user rights.

The customer’s responsibilities start at this level and extend to setting up and updating apps and services, managing access rights and user authentication, making regular backups, and developing contingency plans in case of incidents.

Cloud risks

The risks of using cloud services vary depending on how they are employed.

For instance, if non-critical systems like a file server with archived documents are hosted in the cloud, their unavailability for several days is unlikely to cause major issues. However, if a sales website becomes non-operational, it could lead to substantial financial losses, making this a critical risk.

To assess the risks associated with cloud services, it’s essential to evaluate the requirements for each service using two metrics, RTO and RPO. RPO (Recovery Point Objective) is the maximum acceptable amount of data loss, usually expressed as the time between the last recovery point and the failure, while RTO (Recovery Time Objective) is the maximum acceptable duration of service downtime. These metrics help determine acceptable levels of risk and guide business continuity planning.
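As a rough illustration, the check below compares a service’s backup interval and estimated restore time against its RPO and RTO targets. The `ServiceTargets` class, the function name, and all of the numbers are hypothetical; they only mirror the file server and sales website examples above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass
class ServiceTargets:
    """Hypothetical per-service recovery targets and current capabilities."""
    name: str
    rpo: timedelta              # maximum acceptable data loss, expressed as time
    rto: timedelta              # maximum acceptable downtime
    backup_interval: timedelta  # time between recovery points
    restore_time: timedelta     # estimated time to bring the service back


def unmet_targets(svc: ServiceTargets) -> list[str]:
    """Return the targets that the current setup cannot meet."""
    problems = []
    # Worst case, the failure happens just before the next recovery point,
    # so up to one full backup interval of data is lost.
    if svc.backup_interval > svc.rpo:
        problems.append("RPO")
    if svc.restore_time > svc.rto:
        problems.append("RTO")
    return problems


# Illustrative values only.
archive = ServiceTargets("file archive", rpo=timedelta(days=1), rto=timedelta(days=3),
                         backup_interval=timedelta(days=1), restore_time=timedelta(hours=8))
shop = ServiceTargets("sales website", rpo=timedelta(minutes=15), rto=timedelta(hours=1),
                      backup_interval=timedelta(days=1), restore_time=timedelta(hours=8))

for svc in (archive, shop):
    print(svc.name, unmet_targets(svc) or "targets met")
```

With these assumed figures, a daily backup is enough for the archive but misses both targets for the sales website, which would need either a larger budget or relaxed targets.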

This creates a classic “decision triangle” of risk, budget, and recovery targets. There is no definitive answer to how much risk a loss of cloud connectivity carries: the level of risk depends on the company’s budget and how much it can invest to meet the desired RTO/RPO targets. The RTO/RPO requirements can be adjusted up or down by balancing these factors accordingly.

Key aspects of cloud resiliency

The benefits of cloud infrastructure are well-established. Cloud solutions eliminate the need for enterprises to invest in and maintain their own hardware infrastructure, reducing capital expenditures. Additionally, cloud services can be scaled up or down as needed, providing flexibility. In contrast, relying on in-house infrastructure requires additional capital investments to achieve the same scalability.

However, moving to cloud solutions doesn’t eliminate the need to ensure system fault tolerance and secure data storage. Opting for cloud infrastructure requires the implementation of specific measures due to the differences between cloud and local environments.

The importance of data backups is well understood today, and while backups can be stored in the cloud, it’s advisable to store them with a different provider to enhance security. Using multi-cloud solutions, where data copies are hosted by multiple providers, is an even better strategy. This preserves access to backups even if the primary cloud and one backup cloud become unavailable at the same time.

Using backup tools and regularly updating backup copies doesn’t eliminate the need to maintain robust information security. For cloud systems, this means continuous monitoring, often utilizing specialized software tools, including those provided by cloud service providers. Another critical security measure is implementing a well-structured access rights management system and enforcing strict protection for user accounts, particularly those with access to cloud resources.

In addition, the company must create and adopt an action plan for handling system failures. This plan should outline the steps and procedures for addressing system unavailability, with clear deadlines for implementing measures based on the service levels specified in the provider’s SLA. The plan should not only be updated regularly but also tested through periodic drills and exercises to evaluate and strengthen the system’s fault tolerance.
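When setting those deadlines, it can help to translate an SLA availability percentage into the downtime it actually permits. The sketch below does that arithmetic; the availability tiers and the 30-day period are illustrative assumptions, not figures from any particular provider’s SLA.

```python
from datetime import timedelta


def allowed_downtime(availability_percent: float, period: timedelta) -> timedelta:
    """Downtime permitted over `period` at the given availability level."""
    return period * (1 - availability_percent / 100)


month = timedelta(days=30)  # assumed reporting period
for level in (99.0, 99.9, 99.95):  # assumed SLA tiers
    minutes = allowed_downtime(level, month).total_seconds() / 60
    print(f"{level}% -> {minutes:.1f} minutes of downtime per 30 days")
```

At 99.9% availability, for instance, the allowance over 30 days is about 43 minutes, so a recovery procedure that takes several hours would not fit within it.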

Backup basics

Today, backup is an essential procedure for protecting data. However, it is important to know the difference between the various backup methods. When choosing one, consider the business requirements for RTO and RPO.

Backups are typically run daily: a full copy of the data is saved and then extended with increments, that is, only the data that is new or changed. This simple and cost-effective method reliably protects data against ransomware. Additionally, the provider can organize backup procedures without accessing the data itself. Keep in mind, though, that recovery takes a while, and backups need to be tested regularly to make sure they are reliable.

To achieve a minimal RPO, on the order of 4 hours, replication is used. It lets you restart the system almost immediately or roll back to the previous version if an update fails. This service is also organized at the provider level. However, with a short RPO few recovery points are retained, which reduces protection against ransomware, because errors or malicious files may spread to every copy.
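The trade-off between the two methods can be sketched by asking which recovery point is still clean after an infection. This is a simplified model under stated assumptions: daily backups are kept for a week, the replica holds only its most recent state, and the function name and timestamps are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Optional


def latest_clean_point(recovery_points: list[datetime],
                       infected_at: datetime) -> Optional[datetime]:
    """Most recent recovery point taken before the infection, if any."""
    clean = [p for p in recovery_points if p < infected_at]
    return max(clean) if clean else None


now = datetime(2024, 1, 10, 12, 0)      # fixed time for reproducibility
infected_at = now - timedelta(hours=2)  # ransomware went unnoticed for 2 hours

# Assumption: daily backups retained for a week.
daily_backups = [now - timedelta(days=d) for d in range(1, 8)]
# Assumption: the replica holds only the latest state, synced 5 minutes ago.
replica = [now - timedelta(minutes=5)]

print("daily backups:", latest_clean_point(daily_backups, infected_at))
print("replica only: ", latest_clean_point(replica, infected_at))
```

In this model the daily backups still offer a clean copy from the previous day, while the replica has already absorbed the infected state and offers none.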

The most demanding clients can benefit from geo-redundancy at the application level. The application replicates data so that two separate copies are stored in different systems. This method makes the risk of data loss almost non-existent: if one copy fails, the second one remains operational. It is costly, however, because it involves duplicating the IT infrastructure.

Testing capacities 

Cloud testing deserves a separate mention, as this task is not obvious to many clients and requires technical effort and expertise.

First of all, cloud solutions require periodic stress tests, in which peak loads are generated to verify how well the system handles increased traffic. Stress testing also shows how applications and infrastructure behave under load.
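A minimal sketch of such a test, using only the Python standard library, might look like the following. Here `handle_request` is a hypothetical stand-in for a call to the real system, and the request count and concurrency level are arbitrary assumptions; a production test would normally use a dedicated load-testing tool.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def handle_request(i: int) -> int:
    """Hypothetical stand-in for the system under test; replace with a real call."""
    time.sleep(0.001)
    return i


def timed_call(i: int) -> tuple[bool, float]:
    """Run one request and return (succeeded, elapsed seconds)."""
    start = time.perf_counter()
    try:
        handle_request(i)
        ok = True
    except Exception:
        ok = False
    return ok, time.perf_counter() - start


def run_load(total_requests: int, concurrency: int) -> dict:
    """Fire requests from a pool of worker threads and summarize the results."""
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))
    latencies = sorted(elapsed for _, elapsed in results)
    errors = sum(1 for ok, _ in results if not ok)
    # Simple approximation of the 95th percentile latency.
    p95 = latencies[int(0.95 * (len(latencies) - 1))]
    return {"requests": len(results), "errors": errors, "p95_seconds": p95}


report = run_load(total_requests=200, concurrency=20)
print(report)
```

Raising `concurrency` step by step and watching the error count and latency is one simple way to find the point at which the system starts to degrade.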

Another type of cloud testing is failure simulation, known as chaos engineering. This method involves deliberately disabling part of the system or its services to identify potential failures. The outcome of the simulation shows how resilient the system is to unexpected issues.
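The idea can be shown with a toy model: disable one randomly chosen replica and check whether the service still answers. The `Replica` and `Service` classes and the failover logic are hypothetical simplifications, not the API of any chaos engineering tool.

```python
import random


class Replica:
    """Toy model of one service instance."""

    def __init__(self, name: str):
        self.name = name
        self.up = True

    def handle(self, request: str) -> str:
        if not self.up:
            raise ConnectionError(f"{self.name} is down")
        return f"{self.name} served {request}"


class Service:
    """Routes a request to the first replica that responds."""

    def __init__(self, replicas: list[Replica]):
        self.replicas = replicas

    def handle(self, request: str) -> str:
        for replica in self.replicas:
            try:
                return replica.handle(request)
            except ConnectionError:
                continue  # fail over to the next replica
        raise RuntimeError("no replica available")


def chaos_experiment(service: Service, rng: random.Random) -> bool:
    """Disable one random replica and report whether the service survives."""
    victim = rng.choice(service.replicas)
    victim.up = False
    try:
        service.handle("health-check")
        return True
    except RuntimeError:
        return False
    finally:
        victim.up = True  # always restore the replica after the experiment


rng = random.Random(0)
redundant = Service([Replica("a"), Replica("b")])
single = Service([Replica("a")])
print("redundant survives:", chaos_experiment(redundant, rng))
print("single survives:   ", chaos_experiment(single, rng))
```

The experiment exposes the single-replica setup as a single point of failure, which is exactly the kind of finding a failure simulation is meant to produce.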

Disaster recovery plans should also be tested, which is done through recovery exercises. The exercise program should also cover any DRaaS solutions in use, to confirm that they work as intended.

Along with testing recovery plans, a security audit should be conducted on a regular basis, covering vulnerability checks and compliance with security standards. The audit should also verify that software versions are up to date.

Finally, backup copies also need to be checked on a regular basis: they should periodically be used to restore data even when no incident has occurred.
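One simple way to automate such a check is to restore the backup to a scratch location and compare its checksum with the one recorded when the backup was taken. The sketch below assumes file-based backups and uses a plain file copy as a stand-in for the real restore procedure; the file names and helper functions are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Compute the SHA-256 checksum of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_matches(backup: Path, restore_dir: Path, expected_hash: str) -> bool:
    """Restore the backup to a scratch directory and verify its checksum."""
    restored = restore_dir / backup.name
    shutil.copy2(backup, restored)  # stand-in for the real restore procedure
    return sha256_of(restored) == expected_hash


with tempfile.TemporaryDirectory() as tmp_name:
    tmp = Path(tmp_name)
    original = tmp / "data.db"
    original.write_bytes(b"customer records")
    expected = sha256_of(original)  # checksum recorded at backup time

    backup = tmp / "backup" / "data.db"
    backup.parent.mkdir()
    shutil.copy2(original, backup)

    restore_dir = tmp / "restore"
    restore_dir.mkdir()
    good = restore_matches(backup, restore_dir, expected)

    backup.write_bytes(b"corrupted")  # simulate silent corruption of the backup
    bad = restore_matches(backup, restore_dir, expected)

print("intact backup restores correctly:", good)
print("corrupted backup restores correctly:", bad)
```

A checksum match confirms only that the bytes survived; for databases and applications, a fuller drill would also start the restored system and confirm that it works.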

———————————————

As we can see, no universal solution exists. Still, a company can combine two backup approaches, such as backup copies and replication, and select the combination based on business requirements and its RTO/RPO targets. Additionally, remember that any backup option requires data to be encrypted, accessible only to a limited number of users, and stored in different locations with different providers.

Ensuring the fault tolerance of IT infrastructure and systems is a difficult and complicated task, and transitioning to cloud services is among the most effective ways to achieve it. However, subscribing to a cloud service does not mean that security risks can be ignored. While the methods for securing systems and data evolve gradually, the essential approach remains unchanged: the company must consistently prioritize the resilience of its digital assets.

By Dmitry Borodachev, Executive Director, DataRu Oblako
