
(no) Disaster Recovery


General Information

We strive to provide the best service possible and to avoid anything that could be called “a disaster”. Nevertheless, being realistic, there is no such thing as a fully disaster-proof design.

Therefore, we aim to provide a solution that can withstand a range of foreseeable disaster scenarios and other critical situations or failures, together with procedures that allow us to recover gracefully from such incidents.

All key service metrics are monitored for problems. These metrics include server load, storage capacity and service availability.

Hardware metrics, such as temperature and network state, are monitored as well. Having these metrics at hand allows us to assess the health of the environment and take appropriate action when needed.
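As a rough illustration of this kind of monitoring, the sketch below polls a few of the metrics mentioned above and prints a warning when a threshold is crossed. The thresholds, the polling interval and the status URL are hypothetical placeholders; this page does not describe the actual monitoring stack or its alerting rules.

```python
# Minimal sketch of a periodic health check, assuming hypothetical thresholds
# and a hypothetical status endpoint; the actual monitoring stack and its
# alerting rules are not described on this page.
import os
import shutil
import time
import urllib.request

LOAD_WARN = 8.0            # assumed 1-minute load average threshold
DISK_FREE_WARN = 0.15      # assumed minimum fraction of free storage
STATUS_URL = "https://example.invalid/health"  # placeholder availability endpoint


def check_once() -> list[str]:
    """Return warning messages for server load, storage capacity and availability."""
    warnings = []

    # Server load: 1-minute load average (POSIX systems).
    load1, _, _ = os.getloadavg()
    if load1 > LOAD_WARN:
        warnings.append(f"high load: {load1:.1f}")

    # Storage capacity: fraction of free space on the root filesystem.
    usage = shutil.disk_usage("/")
    if usage.free / usage.total < DISK_FREE_WARN:
        warnings.append(f"low disk space: {usage.free / usage.total:.0%} free")

    # Service availability: a simple HTTP status probe.
    try:
        with urllib.request.urlopen(STATUS_URL, timeout=5) as resp:
            if resp.status != 200:
                warnings.append(f"service returned HTTP {resp.status}")
    except OSError as exc:
        warnings.append(f"service unreachable: {exc}")

    return warnings


if __name__ == "__main__":
    while True:
        for message in check_once():
            print(time.strftime("%Y-%m-%dT%H:%M:%S"), message)
        time.sleep(60)  # assumed one-minute polling interval
```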



Disaster Recovery Site

In many cases, the failure resiliency of the primary data center alone is not enough. (Read more about data center resiliency in the (no) Data Center Information section.)

A solution called Twin Site is used to provide a redundant data center design. Each customer uses a single data center as their primary location. In case of a major failure, a second data center serves as that customer's disaster recovery (DR) site.

The DR site is built using the same architecture as the primary site. It contains the customer's application servers, configuration, a database clone and a replica of the file storage.

We can redirect traffic from the primary data center to the DR site with no or minimal data loss. The possible data loss is caused by the asynchronous nature of the DR site.

For performance reasons we do not maintain tight synchronization between the sites, so some data may not yet have been replicated at the moment the fallback is made.
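To make the data-loss window concrete, the sketch below models asynchronous replication as a write log that is shipped to the DR site periodically: anything written after the last shipment would be lost if the primary site failed at that moment. The log, the shipping job and the example entries are purely illustrative assumptions, not the actual replication mechanism.

```python
# Illustrative model of asynchronous replication, assuming a simple append-only
# write log that is shipped to the DR site at intervals. This is not the actual
# replication mechanism; it only demonstrates why a data-loss window exists.
from dataclasses import dataclass, field


@dataclass
class Site:
    name: str
    log: list[str] = field(default_factory=list)


@dataclass
class AsyncReplicator:
    primary: Site
    dr_site: Site
    shipped_upto: int = 0          # index of the last log entry copied to the DR site

    def write(self, entry: str) -> None:
        """All writes land on the primary site immediately."""
        self.primary.log.append(entry)

    def ship(self) -> None:
        """Periodic job: copy any new entries to the DR site."""
        self.dr_site.log.extend(self.primary.log[self.shipped_upto:])
        self.shipped_upto = len(self.primary.log)

    def at_risk(self) -> list[str]:
        """Entries that would be lost if the primary site failed right now."""
        return self.primary.log[self.shipped_upto:]


if __name__ == "__main__":
    repl = AsyncReplicator(Site("primary"), Site("dr"))
    repl.write("order #1")
    repl.ship()                    # replicated: safe on both sites
    repl.write("order #2")         # written after the last shipment
    print(repl.at_risk())          # -> ['order #2'] would be lost on fallback
```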

If the primary data center is not operational, the decision to fall back to the DR site is made by the support engineers on a per-incident basis, depending on the circumstances.

Backups

For more information about the backup schedule and retention, please refer to (no) Data Backups.



Failure and Disaster Recovery

The overview below lists possible failures and the measures taken to resolve them. For each incident you will find a description, the recovery procedure, the potential service impact, the potential data loss, and the worst impact level.

Incident: Hard drive failure within acceptable limits; other hardware failure within acceptable resiliency limits
Description: The storage system is designed to withstand the loss of one or more (a certain number of) drives. This incident covers failures that stay within the accepted number of simultaneously failed storage components (HDD, SSD, controller, etc.). All other hardware elements, such as servers, NICs, PSUs and firewalls, are redundant (N+1) as well.
Recovery procedure: Failed drives or other hardware must be replaced by service personnel as soon as possible.
Potential service impact: No impact on service. The service state is considered degraded or critical, but functional, until the drives are replaced and the rebuilding process is finished. In some cases the storage system can heal itself by re-balancing the stored data and return to a fully safe state; the failed components can then be replaced at the earliest convenience.
Potential data loss: No data loss. Customer data is not affected by failures of the storage system or other hardware within acceptable limits. Storage nodes in the main data center are synchronized in real time; as long as at least one node survives, the data is fully safe.
Worst impact level: None

Incident: Multiple hard drive or storage component failures outside the allowed resiliency limits
Description: In the unlikely event of multiple drive failures, the affected data center might become inoperable. User data is stored on at least two independent nodes.
Recovery procedure: The data center breakdown must be analyzed. If the data center is operational, the provider's equipment must be replaced and the most current data must be restored from offsite backups. When the main data center is back in an operational state, customer traffic can be redirected from the backup to the primary data center.
Potential service impact: The service must fall back to the backup (secondary) data center.
Potential data loss: Possible marginal loss of data. The secondary data center is not synchronized in real time, so some operations performed minutes before the catastrophic failure might not have been replicated to the backup location. The maximum allowed data-loss time frame is considered to be 24 hours; in practice, data is synchronized throughout the day.
Worst impact level: Low

Incident: Server failure (processing node)
Description: The service architecture allows for a single node failure. Each customer service resides on at least two processing nodes.
Recovery procedure: The failed node / server must be replaced with new equipment.
Potential service impact: A slight reduction in service performance might be observed in some circumstances.
Potential data loss: No data loss. Customer data is not affected by the failure of a single node / server, because the secondary processing node can access the fully synchronized storage system. Background tasks such as imports might be interrupted and should be restarted.
Worst impact level: Low

Incident: Network failure
Description: Network connections are fully redundant.
Recovery procedure: Failed network equipment is replaced either by the data center provider or by the service provider.
Potential service impact: In case of a catastrophic failure, traffic must be redirected to the secondary backup (DR site).
Potential data loss: No data loss. If the failure cannot be fixed within a reasonable time frame and traffic is redirected to the backup site, a decision must be made about the master data (wait in read-only mode, or use the backup as the master).
Worst impact level: Low

Incident: Power failure
Description: Two independent power supply lines are used (A+B power). The data center is equipped with power-rectifying facilities, battery backup and a diesel generator for long-term power outages.
Recovery procedure: Power is provided by the data center provider.
Potential service impact: There should be no downtime due to power loss. In case of a catastrophic failure of the power systems, each storage system is equipped with its own battery-backed cache, which allows all data to be recovered when power is restored.
Potential data loss: No data loss.
Worst impact level: Low

Incident: Human error; malicious user; security breach
Description: Human error and malicious users are among the hardest incidents to protect against. Human errors include unintended deletion or modification of data. Data is stored both in the master data center and in an offsite backup location. The offsite backup location provides a 90-day rollback capability based on data snapshots, which protects against deleted data being synchronized to all data centers.
Recovery procedure: Backup data must be restored in place of the damaged or deleted data.
Potential data loss: Possible data loss. Data loss should stay within the last synchronization window, not greater than 24 hours. If a rollback to a specific date is ordered, changes made after that date will be lost, or can be made available as a recovery service (see the sketch after this overview).
Worst impact level: Medium

Incident: Main data center destruction, or backup data center destruction
Description: This is a broad but unlikely case that includes explosion, terrorist attack, fire, flood or another catastrophic event.
Recovery procedure: Customer traffic is directed to the backup data center. An assessment of the main data center's usability is performed. A new data center is selected, or the existing data center is rebuilt.
Potential service impact: The customer application runs on the backup data center.
Potential data loss: Possible marginal loss of data. The secondary data center is not synchronized in real time, so some operations performed minutes before the catastrophic failure might not have been replicated to the backup location. In case of backup data center destruction, the master data is not affected; backup restore points may be lost until new restore points are accumulated.
Worst impact level: Medium

Incident: Simultaneous destruction of multiple sites
Description: This scenario is theoretically possible, but would require the physical destruction of two independent data center locations - the main data center and the disaster recovery data center, located almost 1000 km apart.
Recovery procedure: An encrypted backup of the raw data (the core database and files) is made to a third location. The infrastructure and the derived data must be rebuilt before the service can be restored.
Potential service impact: Service downtime is expected until a new data center is established.
Potential data loss: Possible data loss. Data loss should stay within the last synchronization window, not greater than 24 hours.
Worst impact level: High
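As referenced in the Human Error entry above, the sketch below shows one way a rollback snapshot could be chosen within the 90-day window: pick the newest snapshot taken before the damaging change, accepting that later changes are lost unless recovered separately. Daily snapshots and the function names are assumptions made for the example; the real backup tooling and schedule are described in (no) Data Backups.

```python
# Minimal sketch of choosing a rollback snapshot after a human error, assuming
# daily snapshots and the 90-day retention window described above. The snapshot
# layout and function names are hypothetical placeholders.
from datetime import date, timedelta

RETENTION_DAYS = 90  # rollback capability stated for the offsite backups


def available_snapshots(today: date) -> list[date]:
    """Daily snapshot dates still inside the retention window (oldest first)."""
    return [today - timedelta(days=n) for n in range(RETENTION_DAYS, 0, -1)]


def snapshot_for_rollback(damage_detected: date, damage_occurred: date) -> date:
    """Pick the newest snapshot taken before the damaging change."""
    candidates = [s for s in available_snapshots(damage_detected)
                  if s < damage_occurred]
    if not candidates:
        raise ValueError("damage is older than the 90-day retention window")
    return max(candidates)


if __name__ == "__main__":
    # Example: data was deleted on 2024-05-10 and the error was noticed two days later.
    chosen = snapshot_for_rollback(date(2024, 5, 12), date(2024, 5, 10))
    print("restore from the snapshot taken on", chosen)
    # Changes made between that snapshot and the restore time would be lost,
    # matching the data-loss note in the Human Error entry above.
```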

