A 5-part blog series by Zensar Architecture Centre of Excellence
No one wants a disaster to happen, but establishing an effective disaster recovery (DR) plan and mechanisms for your organization will go a long way in helping you stay in business in the event of a disaster.
In the first part, we looked at some of the critical aspects to consider when ensuring business continuity in the IT landscape. In the second part, we looked at how high availability architectures help maintain business continuity. As discussed there, the natural extension of a high availability architecture is a DR setup, in which failover features restore the service automatically by transferring it to a secondary (DR) data center site.
In this part, let’s discuss disaster recovery in IT environments and how important it is in maintaining business continuity. A lot has been written and published on the subject of DR. I would like to summarize it in a simple graphic below, which depicts the basic steps and phases involved in establishing an effective disaster recovery mechanism.
The well-known types of disaster are natural and man-made, but every business is different, and so is the impact. Man-made disasters can often be avoided with better processes and planning; natural disasters, however, need to be dealt with through a DR plan to minimize the impact on business continuity.
Business impact assessment (BIA) and risk assessment play a crucial part in devising a DR strategy.
A disaster shows up as an outage of IT services. Its impact on IT infrastructure (compute, storage and network), data (especially loss of customer, transaction, revenue/billing and finance data) and applications needs to be assessed, then correlated with, and where possible quantified as, business impact such as share price drops, reduced customer satisfaction, revenue loss, and legal and compliance penalties and fines.
BIA therefore helps prioritize DR measures and supports a cost-benefit analysis of them, so that a realistic budget can be committed to DR.
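As a rough illustration of how BIA figures can feed a cost-benefit analysis, the sketch below ranks candidate DR measures by how much expected annual loss each one avoids compared with what it costs per year. The model (avoided loss = hourly business impact × outage hours saved per incident × incidents per year) is a deliberately simple assumption rather than a prescribed BIA method, and the class name, services and figures are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class DrMeasure:
    """A candidate DR measure for one IT service (hypothetical model and figures)."""
    service: str
    impact_per_hour: float        # business impact of an outage, per hour
    incidents_per_year: float     # assumed likelihood of a major incident per year
    outage_hours_without: float   # expected outage per incident without the measure
    outage_hours_with: float      # expected outage per incident with the measure
    annual_cost: float            # yearly cost of running the measure

    def avoided_loss(self) -> float:
        """Expected annual loss the measure avoids."""
        saved_hours = self.outage_hours_without - self.outage_hours_with
        return self.impact_per_hour * saved_hours * self.incidents_per_year

    def net_benefit(self) -> float:
        """Avoided loss minus cost; positive means the measure pays for itself."""
        return self.avoided_loss() - self.annual_cost


def prioritize(measures):
    """Keep only measures whose avoided loss exceeds their cost, best first."""
    worthwhile = [m for m in measures if m.net_benefit() > 0]
    return sorted(worthwhile, key=lambda m: m.net_benefit(), reverse=True)


candidates = [
    DrMeasure("billing", 50_000, 0.1, 72, 4, 150_000),
    DrMeasure("intranet", 500, 0.1, 72, 8, 20_000),
    DrMeasure("orders", 20_000, 0.1, 72, 2, 60_000),
]
for m in prioritize(candidates):
    print(f"{m.service}: net benefit {m.net_benefit():,.0f} per year")
```

In this made-up data the intranet measure drops out because it costs more than the loss it avoids, which is the kind of trade-off a BIA is meant to surface before the DR budget is committed.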
With the assessment done, it is time to plan and implement a DR strategy, which raises a whole host of questions: what to recover, how much, when and how.
The answers lie in a typical set of DR requirements, captured as:
- Recovery point objective (RPO) – RPO is the maximum targeted period for which data and/or transactions might be lost from an IT service due to a major incident. In effect, it means that data and/or transactions will be recovered up to a certain point in time before the incident. The RPO gives systems designers a limit to work to.
- Recovery time objective (RTO) – RTO is the targeted duration of time, and service level, within which a business process must be restored after a disaster (or disruption) to avoid unacceptable consequences associated with a break in business continuity. In practical IT service terms, it refers to how quickly you can switch from your production source machine to your target backup machine in the event of a disastrous incident.
- Recovery time actual (RTA) – RTA is the time actually recorded to restore the IT service to the agreed SLAs after an incident. It is compared with the RTO and is expected to meet it. Monitoring the difference between RTO and RTA helps refine the DR strategy and optimize the IT service to minimize the impact.
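To make the RTO/RTA comparison concrete, the sketch below computes an incident's RTA from two timestamps and reports whether the RTO was met and by what margin. It assumes RTA is measured from the start of the outage to the moment the service is back within its agreed SLA; the function names, timestamps and the 4-hour RTO are illustrative only.

```python
from datetime import datetime, timedelta


def recovery_time_actual(outage_start: datetime, service_restored: datetime) -> timedelta:
    """RTA: elapsed time from the start of the outage until the service met its SLA again."""
    if service_restored < outage_start:
        raise ValueError("service cannot be restored before the outage started")
    return service_restored - outage_start


def rto_gap(rto: timedelta, rta: timedelta) -> timedelta:
    """Positive result = margin left within the RTO; negative = RTO overrun."""
    return rto - rta


rto = timedelta(hours=4)  # illustrative target
rta = recovery_time_actual(datetime(2024, 3, 1, 2, 15), datetime(2024, 3, 1, 7, 0))
gap = rto_gap(rto, rta)
status = "met" if gap >= timedelta(0) else "missed"
print(f"RTA {rta}, RTO {rto}: {status} by {abs(gap)}")
```

Recording this gap for every incident builds up the history needed to see whether the RTO is realistic and where recovery should be sped up.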
Note the word “objective” in these DR requirements: the figures are not mandates but express the expected or intended targets within which full recovery is to be achieved.
For example, if your company’s storage and backup policy prescribes nightly backups (one every 24 hours) but the DR RPO is 6 hours, nightly backups alone will not meet the RPO: an incident can occur many hours after the last backup, and any data or transactions created in the meantime will not be in it.
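The backup example can be checked mechanically. The sketch below assumes simple periodic backups, each capturing data as of its start time: an incident just before the next backup completes then loses up to one backup interval plus the time that backup takes. Restore time and backup integrity are ignored, and the function names are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Maximum data loss with periodic backups that capture data as of their start."""
    return backup_interval + backup_duration


def meets_rpo(backup_interval: timedelta, rpo: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the worst-case loss window fits within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


rpo_hours = 6
rpo = timedelta(hours=rpo_hours)
for hours in (24, 12, 6, 4):
    verdict = "meets" if meets_rpo(timedelta(hours=hours), rpo) else "does not meet"
    print(f"Backup every {hours} h: {verdict} the {rpo_hours}-hour RPO")
```

Both a nightly (24-hour) and a 12-hour schedule fail a 6-hour RPO; meeting it requires more frequent backups or some form of continuous replication to the DR site.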
Another very important DR requirement is the data center site, i.e. the place where your IT infrastructure is housed. A primary site hosts the regular production (live) systems, and a secondary (DR) site hosts a redundant set of equipment. In the event of a major incident, the DR site takes over operations and replaces the primary service.
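One common way for a DR site to take over is a monitor that switches to the secondary site only after the primary has failed several consecutive health checks, so that a single transient blip does not trigger a failover. The sketch below assumes boolean health-check results, a hypothetical threshold of three failures and a manual failback policy; in practice this logic usually lives in a load balancer, DNS service or clustering product rather than in hand-written code.

```python
from dataclasses import dataclass


@dataclass
class FailoverMonitor:
    """Routes to the primary site until it fails `threshold` health checks in a row."""
    threshold: int = 3
    consecutive_failures: int = 0
    active_site: str = "primary"

    def record_check(self, primary_healthy: bool) -> str:
        """Record one health-check result and return the site that should serve traffic."""
        if self.active_site == "dr":
            # Once failed over, stay on DR: failback is assumed to be a planned, manual step.
            return self.active_site
        if primary_healthy:
            self.consecutive_failures = 0  # any success resets the count
        else:
            self.consecutive_failures += 1
            if self.consecutive_failures >= self.threshold:
                self.active_site = "dr"
        return self.active_site


monitor = FailoverMonitor()
checks = (True, False, True, False, False, False, True)
history = [monitor.record_check(ok) for ok in checks]
print(history)
```

The isolated failure in the second check is ignored, while three failures in a row move traffic to the DR site, which then keeps serving even after the primary reports healthy again.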
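Although there are multiple regulations and standards, such as ISO 22301 and BS 25999-2, a DR site is usually located at least 30 miles (50 km) and at most 100 miles (160 km) from the primary site. These distances are generally regarded as practical isolation: far enough apart that a single local event is unlikely to affect both sites, yet close enough for data replication and staff access to remain practical.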
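A proposed site pair can be checked against this 50-160 km rule of thumb by computing the great-circle distance between the two sites with the haversine formula, as in the sketch below. The coordinates and function names are placeholders, and straight-line distance is only a first approximation: shared flood plains, power grids and network routes matter as much as raw distance.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in km between two points given in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def check_site_separation(primary, dr, min_km=50.0, max_km=160.0):
    """Return (within_band, distance_km) for two (lat, lon) site locations."""
    distance = haversine_km(*primary, *dr)
    return min_km <= distance <= max_km, distance


primary_site = (51.50, -0.12)  # placeholder coordinates
dr_site = (52.20, 0.12)        # placeholder coordinates
ok, km = check_site_separation(primary_site, dr_site)
print(f"Sites are {km:.0f} km apart: {'within' if ok else 'outside'} the 50-160 km band")
```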
Once the requirements are collected and the DR strategy is implemented, it is equally important to communicate the strategy effectively to the right audience and stakeholders, and to train the users who may be affected by incidents and are expected to follow DR processes.
Last but not least, continuously monitoring incidents, tracking recoveries, maintaining a history and looking for ways to improve and optimize the DR strategy will ensure business continuity in the long run.
Zensar’s Architecture Centre of Excellence has a wealth of experience in IT infrastructure optimization and DR implementation, and offers consulting services in these areas. We are happy to help if your organization has such a need.
Vijaykumar Dixit – Vijay has over two decades of industry experience and over 15 years in IT. Vijay is a TOGAF 9.1 certified practitioner and an Oracle master certified Java EE architect. He holds a bachelor’s degree in Engineering and a PG diploma in advanced computing. He is currently part of Zensar’s Architecture Centre of Excellence.
His areas of expertise are enterprise architecture, solution design consulting, SaaS product development and cloud computing. He has helped clients define and set up technology roadmaps, establish architecture governance and best practices, and achieve increased return on IT spend by promoting service-oriented architectures, cloud migration, and application portfolio review and rationalization. He has been a key contributor to Java-based enterprise solutions development and also has a sound background in Microsoft .NET, open-source bespoke development and COTS-based integration. He has worked for blue-chip clients such as Boots PLC, TNT Logistics, Carphone Warehouse and Verizon, and government organizations such as ONS and NHS in the UK, and Liberty, Discovery etc. in South Africa.