Extend CI/CD with CR for Continuous App Resilience
This is a guest post written by Govind Rangasamy, CEO and Founder, Appranix.
The radical shift towards DevOps and the continuous-everything movement have changed how organizations develop and deploy software. As continuous integration and continuous delivery (CI/CD) processes and tools consolidate and standardize in the enterprise, a standardized DevOps model helps organizations deliver software functionality faster and at large scale. However, newer cyber threats, evolving regulatory requirements, and the need to protect brand reputation are putting tremendous pressure on IT leaders to effectively protect their customer and business-critical data.
Conceptually, the DevOps pipeline approach makes a lot of sense. In practice, however, Site Reliability Engineering (SRE) and Ops teams optimize systems for service reliability and robustness at the cost of delivering new features. The need for software reliability inherently decreases Continuous Delivery (CD) throughput. This conundrum is the biggest challenge for any organization adopting DevOps practices at a large scale today. By integrating and extending CI/CD with Continuous Resilience (CR) to protect against a multitude of software reliability disruptions, DevOps teams can confidently deploy new software without affecting the resiliency of their systems. In other words, Continuous Resilience is the radical new enabler that gives SREs and cloud operations teams the confidence to increase the speed of DevOps.
The Need for App Resilience and Compliance with DevOps Pipelines
- Instantly Check, Reject or Roll Back Bad Deployments for App Resiliency: Software deployments are complex and error-prone. Even when DevOps teams deploy ever smaller, more frequent incremental updates using CI/CD, changes to infrastructure, cloud configurations, and bad containers can disrupt the robustness of your software systems. The problem grows as more and more microservices are added to already complex distributed systems. By integrating Continuous Resilience (CR) into your DevOps pipeline, you can instruct a system like Appranix to be your co-pilot, protecting application environments with a continuous copy of cloud configurations, data and cloud services state without affecting your production environment. This allows recovery from disruptions to any point in time, giving a level of resiliency that was not possible before.
- Instant Rollback of Cloud Configurations and Services: Advanced CI/CD pipelines also deliver cloud infrastructure or Kubernetes infrastructure changes using infrastructure-as-code. Pipelines push whatever infrastructure-as-code they are given, whether it is good, tested code or bad code. SREs can’t always keep track of the infrastructure changes made by hundreds or thousands of deployments. However, they can always recover from bad cloud configurations that disrupt app resilience without a re-deployment. For instance, if security groups are deleted through a deployment, they can do granular recovery of security groups from Appranix. If load balancer configurations are broken, Appranix can instantly recover the load balancer from its app environment time machine.
- Instant Test Environments with Real-world Data to Avoid Resiliency Problems in Production: Lack of testing with real-world data introduces an enormous risk to the reliability of the systems. What if you could get instant test environments, with all dependent services and real-world data snapshots, for automated testing along your DevOps pipeline? You can achieve this today with Appranix platform services integrated with CI/CD. Moreover, you can test your software in another region of the cloud or even across another cloud provider.
- Fast Recovery from Cyber Attacks: Ever-increasing cyber attacks like ransomware allow rogue groups to take over business-critical systems. Some of the recent notable ones are the Maryland City Systems ransomware attack and Atlanta City’s ransomware attack, where the ransom demands were lower than the actual cost of recovering the software systems. What if you had a copy of an entire application environment safely stored in another region of the cloud? What if the entire process were completely and transparently automated without affecting production systems? You could recover from any such attack as quickly as possible without resorting to expensive, time-consuming efforts and, most importantly, protect your organization’s reputation with quickly restored services. Once you restore the last known application environment, you can deploy the latest code from your pipeline or Helm chart to get the systems up and running while you work out a permanent fix for the attack.
- Meet SOC/SOX Resiliency Compliance Demands: Most modern SaaS applications follow DevOps processes to keep up with the changes required. Service level requirements for multi-tenant SaaS applications are much higher than for traditional applications. If these SaaS applications aim to achieve SOC 2 Type II compliance, organizations need to prove they have reliable recovery capabilities in another region of the cloud. In other words, the need for SOC compliance automatically drives the need for high availability and resilience. If you integrate your CI/CD-deployed software updates with Appranix, you can automate resilience compliance easily.
- Protect against Cloud Provider Failures: Hyperscale cloud providers like AWS, Google Cloud and Microsoft Azure continuously work to improve their infrastructure and platform services. It is now possible to create more complex software systems with distributed architectures that allow easier updates and maintenance. However, even hyperscale cloud environments suffer massive disruptions due to configuration changes or capacity issues. Recent cloud outages, reported under headlines like "Google Cloud Broke the Internet" and "Azure Cloud Capacity Issues in the UK", highlight the necessity of creating resiliency at the application level, with copies of application environments that will always be ready if and when you need them.
- Be Prepared for Natural Disasters: Ever-increasing natural disasters create cloud service disruptions. Recent events, reported under headlines like "Lightning strike disrupts Azure", highlight the need for second-region protection. With Appranix, you can be well prepared to recover or redirect your application traffic to another region or another cloud provider.
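The granular recovery mentioned in the list above (for example, restoring only the security groups that a bad deployment deleted) comes down to comparing a saved copy of the configuration with the live one. The following is a minimal Python sketch of that idea, not Appranix's implementation. The resource names and the `plan_recovery` helper are hypothetical, and a configuration is modeled as a plain dictionary for illustration.

```python
from typing import Any, Dict, List, Tuple

# Hypothetical model: a configuration maps a resource name to its settings,
# e.g. {"sg-web": {"ingress": [443]}}.
Config = Dict[str, Dict[str, Any]]


def plan_recovery(snapshot: Config, live: Config) -> List[Tuple[str, str]]:
    """Return (action, resource) pairs needed to bring `live` back to `snapshot`."""
    plan = []
    for name, settings in sorted(snapshot.items()):
        if name not in live:
            plan.append(("recreate", name))  # resource was deleted
        elif live[name] != settings:
            plan.append(("restore", name))   # resource was modified
    return plan


if __name__ == "__main__":
    last_good = {
        "sg-web": {"ingress": [443]},
        "sg-db": {"ingress": [5432]},
        "lb-front": {"listeners": [443], "targets": ["web-1", "web-2"]},
    }
    after_bad_deploy = {
        "sg-web": {"ingress": [443]},
        "lb-front": {"listeners": [80], "targets": ["web-1"]},
    }
    for action, resource in plan_recovery(last_good, after_bad_deploy):
        print(action, resource)
```

In this example the plan touches only the changed load balancer and the deleted security group. The untouched `sg-web` group is left alone, which is what makes the recovery granular rather than a full re-deployment.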
Create an App Environment Time Machine to Recover from App Disruptions
As you have seen above, application disruptions are a normal part of the software development and operations process. The increasing complexity of software stacks and of the distributed architectures enabled by cloud platforms such as Kubernetes demands better resiliency practices. Application resiliency should be considered part of the software development process and should not be relegated to the end of the operations procedure.
When organizations integrate continuous resilience with CI/CD, overall application resilience increases dramatically. They can introduce an application environment time machine at the end of a CI/CD pipeline to take a continuous copy of cloud service configurations, application environment metadata, and data snapshots, providing multiple levels of resiliency.
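To make the time machine idea concrete, here is a small Python sketch under stated assumptions: each pipeline run captures a timestamped copy of the environment, and recovery returns the latest copy taken at or before a chosen point in time. The `EnvironmentTimeMachine` class and the dictionary layout are hypothetical illustrations, not the Appranix product's data model, and a real system would store snapshots durably rather than in memory.

```python
import bisect
import copy
from typing import Any, Dict, List, Optional


class EnvironmentTimeMachine:
    """Keeps timestamped copies of an environment description (hypothetical)."""

    def __init__(self) -> None:
        self._times: List[float] = []  # capture times, kept sorted
        self._snapshots: List[Dict[str, Any]] = []

    def capture(self, timestamp: float, environment: Dict[str, Any]) -> None:
        """Store a deep copy so later changes to `environment` do not alter history."""
        index = bisect.bisect_right(self._times, timestamp)
        self._times.insert(index, timestamp)
        self._snapshots.insert(index, copy.deepcopy(environment))

    def recover(self, timestamp: float) -> Optional[Dict[str, Any]]:
        """Return the latest snapshot taken at or before `timestamp`, if any."""
        index = bisect.bisect_right(self._times, timestamp)
        if index == 0:
            return None
        return copy.deepcopy(self._snapshots[index - 1])


if __name__ == "__main__":
    machine = EnvironmentTimeMachine()
    env = {
        "config": {"replicas": 3},
        "metadata": {"release": "v1"},
        "data_snapshot": "snap-001",
    }
    machine.capture(100.0, env)       # taken after the first deployment
    env["metadata"]["release"] = "v2"
    env["data_snapshot"] = "snap-002"
    machine.capture(200.0, env)       # taken after the second deployment
    print(machine.recover(150.0)["metadata"]["release"])
```

Asking for time 150.0 returns the copy captured at 100.0, so a deployment that went wrong at 200.0 can be rolled back to the state just before it.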
Integrate Appranix with the CI/CD Pipelines
It is very easy to integrate continuous application resilience into your CI/CD pipelines. After a one-time discovery of your production environment in Appranix, you can integrate your DevOps pipeline project by adding a custom script to the deployment pipeline.
The following example uses the CloudBees CodeShip CI/CD SaaS integrated with the Appranix platform to explain the process.
- If you don’t have one already, create a new project in CodeShip
- Log in to your code repository and connect it with your CodeShip project
- Select the appropriate CodeShip project type – Pro or Basic
- Configure deployment branches with a custom script that includes the Appranix code to start managing the App Environment Time Machine
- If you want to roll back, recover from application disruptions, or create an environment for testing, log in to Appranix, select a point on the timeline of the app environment time machine, and hit a button to recover the application(s)
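The custom script step above could look roughly like the following Python sketch, run as the last command of a deployment branch. It is an illustration only: the endpoint, the token, the `RESILIENCE_*` variable names, and the payload fields are all hypothetical and do not describe the actual Appranix API, and `CI_BRANCH` and `CI_COMMIT_ID` are assumed to be set by the CI system.

```python
import json
import os
import urllib.request
from typing import Dict, Mapping


def build_snapshot_request(env: Mapping[str, str]) -> Dict[str, str]:
    """Describe the deployment that just finished, using CI-provided variables."""
    return {
        "branch": env.get("CI_BRANCH", "unknown"),
        "commit": env.get("CI_COMMIT_ID", "unknown"),
        # Hypothetical variable naming the environment to protect.
        "environment": env.get("RESILIENCE_ENVIRONMENT", "production"),
    }


def notify(endpoint: str, token: str, payload: Dict[str, str]) -> int:
    """POST the payload as JSON to a hypothetical endpoint; return the HTTP status."""
    request = urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {token}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return response.status


if __name__ == "__main__":
    payload = build_snapshot_request(os.environ)
    endpoint = os.environ.get("RESILIENCE_ENDPOINT")  # hypothetical variable
    token = os.environ.get("RESILIENCE_TOKEN", "")    # hypothetical variable
    if endpoint:
        print("snapshot request status:", notify(endpoint, token, payload))
    else:
        print("RESILIENCE_ENDPOINT not set; would send:", json.dumps(payload))
```

Keeping the payload builder separate from the network call means the script can be dry-run locally without credentials. In a real integration, the vendor-supplied script and its documented parameters would replace this sketch.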
Achieve Multiple Levels of Resiliency
With an application environment time machine, organizations can achieve three levels of resiliency for instant creation, rollback, and recovery of app environments running on cloud platforms such as AWS, Google Cloud, Azure or VMware. Protection and recovery across a different provider are easier with container-based applications running on Kubernetes. Most organizations will be satisfied with Level 1 and Level 2 resiliency architectures. Level 3 is possible for container-based applications with lower data transport requirements.
By integrating application resilience as an extension of CI/CD, DevOps teams can drastically reduce risks to system robustness while deploying new software. They can address service reliability issues proactively as opposed to the legacy reactive operations model. If software reliability is continuously automated, organizations can achieve resilience that could decrease disruptions by almost 50-300%.