AWS : IAM User

  • Pipeline User
  • IAM User

Each AWS Account has its own AWS Identity & Access Management (IAM) Service.

If you know Azure
On Microsoft Azure, the closest equivalent of an AWS Account is a Subscription, with one difference: each AWS Account has its own IAM users, whereas Azure has a central identity service called Azure Active Directory (AAD).
Each of the services named above is a huge topic, but we won’t do a deep dive right now.

An AWS IAM user can be used in one of three ways:

  • Only for CLI purposes. This user can’t log in to the AWS Management Console (the AWS web portal).
  • Only for working with the AWS Management Console. This user can’t be used for the CLI.
  • For both purposes. This user can log in to the AWS Management Console and use the CLI.

Pipeline User

The first question is: why do we need a Pipeline User?

  • To automate deployment through a CI/CD pipeline and avoid manual, click-by-click deployment.
  • To grant the pipeline user only the specific permissions it needs and to audit this user’s logs.

This user can work with AWS services only via the CLI. Therefore it has an Access Key ID and a Secret Access Key.

If you know Azure
It’s used like a Service Principal, for which you have a client ID and a client secret.
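To make the idea of granting only specific permissions more concrete, here is a minimal sketch that builds a least-privilege IAM policy document for a pipeline user. The bucket name `example-deploy-bucket`, the statement ID, and the two S3 actions are assumptions chosen for illustration; a real pipeline will need its own set of actions and resources.

```python
import json


def build_pipeline_policy(bucket_name: str) -> dict:
    """Return an IAM policy document that only allows deploying to one S3 bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "DeployArtifactsOnly",  # hypothetical statement ID
                "Effect": "Allow",
                # Assumed actions: upload objects and list the bucket, nothing else.
                "Action": ["s3:PutObject", "s3:ListBucket"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}",
                    f"arn:aws:s3:::{bucket_name}/*",
                ],
            }
        ],
    }


if __name__ == "__main__":
    # Print the JSON document that could be attached to the pipeline user.
    print(json.dumps(build_pipeline_policy("example-deploy-bucket"), indent=2))
```

Keeping the policy this narrow also makes the audit trail easier to read, because every action the pipeline user performs should fall inside this short list.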

IAM User

Description and video are coming soon.

Multi-Cloud

Introduction

This document defines the different cloud classifications, with a focus on multi-cloud and hybrid cloud, and describes why organizations tend to adopt the cloud, and multi-cloud in particular. It also covers the challenges of multi-cloud at the management and technical levels and the reasons for them. The last part introduces some services that can help in multi-cloud solutions.

Cloud classifications

This document classifies clouds into the following categories. The focus is on multi-cloud.

Figure 1: Definitions of different types of clouds

In fact, most enterprise adopters of public cloud services use multiple providers. This is known as multi-cloud computing, a subset of the broader term hybrid-cloud computing (Gartner) [3]

Multi-cloud means, for example, that some resources are on Azure, some on AWS, and some on GCP; that some VMs run on AWS while Office 365 from Microsoft is also in use; or that deployments on several cloud providers are connected with each other via VPN. All of these are considered multi-cloud.

Organizations’ tendency for cloud

Almost all organizations have data and workloads that must be stored and hosted. They have two options: a private data center or the cloud.

If an organization decides on an on-premises data center, it has to pay upfront, which means capital expenditure (CapEx) plus a lot of software and hardware maintenance.

If it decides on the public cloud, it has only operational expenditure (OpEx), because the public cloud model is pay-as-you-go. Therefore, most organizations have decided to use the public cloud. Organizations always try to reduce expenditure and increase income; consequently, they are attracted to a cost-efficient infrastructure and tend to adopt multi-cloud. Of course, cost is not the only reason. More reasons are explained in the next section.

Organizations’ tendency for multi-cloud

There are many reasons to embrace a multi-cloud strategy. Some of them are listed here.

The most common reason is to reduce cloud computing costs by designing a cost-efficient infrastructure that uses the most cost-effective options from multiple cloud vendors.

| Azure | VM instances | Container clusters | Hosted Apps | Serverless functions |
| AWS | AWS EC2 | AWS EKS | AWS Elastic Beanstalk | AWS Lambda |
| Google | Google Compute Engine | Google Kubernetes Engine | Google App Engine | Google Cloud Functions |

The second common reason is that different cloud vendors offer different services, and some vendors offer specialized ones. Such a service might not be the most economical option, but it fulfills the requirements of the workload better and is not available from another vendor.

The third reason is to improve the reliability and availability of cloud-based workloads: when they are spread across multiple clouds, disruptions to those workloads are less likely.

The fourth reason applies to globally distributed enterprises and international companies. When they acquire offices or subsidiaries in different countries, or merge with other companies, their resources may end up in different clouds, for example because a particular cloud provider doesn’t have a data center in a given country.

The fifth reason is that organizations want to avoid cloud provider lock-in. If the cloud provider changes the price of the services used in your workload, the entire workload is impacted. The solution is to architect applications to be cloud-agnostic, so that they can run on any cloud. This does not mean that running on more clouds is cheaper or more efficient, because a workload can be optimized when a specific cloud is used, but it is better to have the option to move the workload.

With a single-cloud strategy you can also develop workloads that are able to move to another cloud without difficulty, but it is very easy to become deeply dependent on the cloud vendor’s tools and services and to encounter the following risks:

  • Migration is difficult and costly
  • Budget risk when the vendor raises the service costs

And the solution is:

  • Using a multi-cloud strategy
  • Using tools that are cloud-agnostic and can be used in any cloud

The result of using a multi-cloud strategy is:

  • Easier migration/swap of a particular workload to another cloud
  • No lock-in to a cloud vendor
  • Freedom to choose the most cost-efficient cloud provider for each service
  • Avoiding mirroring expenses

The reasons above shape an organization’s infrastructure and bring more benefit to projects, because a multi-cloud architecture becomes possible.

In general, multi-cloud architectures are more expensive to implement because of their complexity (several toolsets for cloud management or a cloud service broker, and each cloud provider has its own way of doing things). However, money can be saved through the ability to pick and choose cloud services from multiple cloud vendors: not only the best services, but also the most cost-efficient ones. This provides a strategic advantage.

Another mandatory point is to work out the business case to understand costs versus value. The organization needs some value advantage from going multi-cloud and has to understand how this value comes back into the organization.

| | Single public cloud | Two public clouds | Two public clouds & Private Cloud | |
|---|---|---|---|---|
| Initial costs | 500,000 $ | 750,000 $ | 1,000,000 $ | Getting things up and running; getting things scaled up; in the third case, also the software and hardware of the private part |
| Yearly costs | 100,000 $ | 125,000 $ | 300,000 $ | Pay as you go; maintenance |
| Value of choice | 0 $ | 200,000 $ | 250,000 $ | Value of moving information; how beneficial it can be |
| Value of agility | 500,000 $ | 800,000 $ | 900,000 $ | Ability to change things as the needs of the business change (speed of need) |

Value of choice: this is more of a business view and asks about the impact of this decision on the KPIs.

Value of agility: this is more of a technical view and asks how quickly we can react to business changes.

Therefore, we have to understand the business metrics in order to understand the business value, and only then decide on the best solution for the project.
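The cost and value figures from the table above can be compared with a few lines of arithmetic. The sketch below assumes a deliberately simple model that the table itself does not specify: both value figures are treated as one-time totals, and costs are the initial cost plus the yearly cost over a chosen horizon. The class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Scenario:
    name: str
    initial_cost: int
    yearly_cost: int
    value_of_choice: int
    value_of_agility: int

    def net_value(self, years: int) -> int:
        """Total value minus total cost over the horizon (assumed model)."""
        total_cost = self.initial_cost + self.yearly_cost * years
        total_value = self.value_of_choice + self.value_of_agility
        return total_value - total_cost


# Figures copied from the table above, in dollars.
SCENARIOS = [
    Scenario("Single public cloud", 500_000, 100_000, 0, 500_000),
    Scenario("Two public clouds", 750_000, 125_000, 200_000, 800_000),
    Scenario("Two public clouds & private cloud", 1_000_000, 300_000, 250_000, 900_000),
]

if __name__ == "__main__":
    for scenario in SCENARIOS:
        print(f"{scenario.name}: net value after 1 year = {scenario.net_value(1):,} $")
```

Under these assumptions, only the two-cloud scenario has a positive net value after one year. A different horizon or value model changes the ranking, which is why the business case has to be worked out per project.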

The business metrics, or KPIs (key performance indicators), always have to be considered. The KPIs have an impact on the value of choice.

  • Sales revenue
  • Net profit margin
  • Gross margin
  • Sales growth year-to-date
  • Cost of customer acquisition
  • Customer loyalty and retention
  • Net promoter score
  • Qualified leads per month
  • Lead-to-client conversion rate
  • Monthly website traffic
  • Met and overdue milestones
  • Employee happiness

To have a successful multi-cloud infrastructure and deployment, it’s important to have a configuration of services that is both compliant with the organization’s regulations and cost-efficient. Otherwise, deploying to production will be a big challenge.

Multi-cloud challenges and considerations

When an organization decides to adopt multi-cloud and use multi-cloud strategies, they have to prepare for the following items and have a strategy for them:

  • Integration
    • How do we share data between workloads running on multiple clouds?
  • Management
    • How do we manage resources from an abstraction layer without getting our hands dirty with the different cloud vendors’ command lines and tools?
    • How do we monitor resources?
    • Which cloud service brokers can be used?
  • Optimization
    • How should the services be configured to get a cost-efficient infrastructure?
  • Compliance
    • How do we keep the service configuration compliant with the organization’s regulatory guidelines?
  • Technical
    • Multi-cloud adds complexity and risk to the architecture, so how can it bring more value back into the organization?

How can we do each of them?

  • For integration
    • Managing all workloads from a central monitoring hub
      • Using third-party tools for management and monitoring, such as a universal control plane, which abstracts the workload from the underlying cloud where it is hosted. Crossplane and Kubernetes are tools that can be used for a multi-cloud architecture. The drawbacks of this approach are:
        • Workloads that cannot be containerized
        • Lack of knowledge and experience with Kubernetes
  • For Management/Monitoring
    • Universal Control Plane can be used
    • Third-Party solution
    • A custom solution can be developed (using clouds’ APIs) but this solution is less centralized. The API approach also demands more hands-on effort from IT personnel, both upfront and for maintenance.
    • Management console of each of the clouds (navigating between tools for different clouds).
  • For Optimization performance
  • For Compliance
    • Unifying all workloads within a common security and access-control framework
Figure 2: Multi-cloud toolset basic architecture for custom tools or third-party

The important point is

Cloud vendors don’t make it easy to integrate a workload running on one cloud with another workload hosted on a competitor’s cloud.

Most cross-cloud compatible tools provided by cloud vendors focus on importing workloads from another cloud rather than on supporting ongoing integration between workloads running across multiple clouds.

And finally, we have to pay for the services and tools of the third party.

Multi-cloud governance and security

Security is not governance, but the two have to be linked for multi-cloud. Governance is about putting limitations on the utilization of resources and services; in other words, governance is restrictions based on identity and policy. Security is about authenticating and authorizing the people and machines that use these resources; in other words, security is restrictions based on identity and access rights (Identity and Access Management is an important requirement for multi-cloud). [4]

The hierarchy of security and governance is as follows.

For a successful multi-cloud infrastructure, it’s necessary to have a good governance and security outline.

Resources

Leveraged resources, e.g., storage, compute, database, cloud service broker (CSB), etc. Governance tracks whether they are still in use or have been de-provisioned, how much they cost, and whether they follow the usage rules, e.g., only specific VM sizes are allowed.

Services

Keep track of services e.g. data transfer services.

Cost

Cost governance is about who is using what and when, and how much they should be charged. This is covered by the policies for the utilization of resources and services. It is needed for showback and chargeback (part of the reimbursement process). It can also be used to assess the health of the multi-cloud system and to put limits in place that keep project budgets under control. This is one of the challenges that enterprises encounter.

Governance requires a Cloud Management Platform (CMP). It provides a common interface for managing resources and services across different clouds, using a layer of abstraction to remove complexity.

The CMP also monitors the cost of provisioning and de-provisioning resources, as well as the usage rules for resources. The advantage of the abstraction layer is that it’s not necessary to be an expert on everything.
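A usage rule such as "only specific VM sizes are allowed" can be pictured as a small check that runs over an inventory collected from all clouds. The following is a minimal sketch, not how any particular CMP implements it: the allow-list, the `VmResource` type, and the inventory format are assumptions for illustration.

```python
from dataclasses import dataclass

# Hypothetical allow-list per cloud; each organization defines its own.
ALLOWED_VM_SIZES = {
    "azure": {"Standard_B2s", "Standard_D2s_v3"},
    "aws": {"t3.medium", "m5.large"},
    "gcp": {"e2-medium", "n2-standard-2"},
}


@dataclass(frozen=True)
class VmResource:
    cloud: str
    name: str
    size: str


def find_violations(resources: list[VmResource]) -> list[VmResource]:
    """Return the VMs whose size is not on the allow-list of their cloud."""
    return [
        vm for vm in resources
        # Unknown clouds have an empty allow-list, so all their VMs are flagged.
        if vm.size not in ALLOWED_VM_SIZES.get(vm.cloud, set())
    ]


if __name__ == "__main__":
    inventory = [
        VmResource("aws", "web-1", "t3.medium"),
        VmResource("azure", "db-1", "Standard_E64s_v3"),
    ]
    for vm in find_violations(inventory):
        print(f"{vm.cloud}/{vm.name}: size {vm.size} is not allowed")
```

The point of the abstraction is that the rule is written once and applied to every cloud, instead of being expressed separately in each vendor’s policy tooling.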

Multi cloud Requirements

Many multi-cloud architectures are similar to hybrid cloud architectures, and they have almost the same requirements and needs.

Figure 4: Multi and hybrid cloud expectations from a development perspective

Multi-cloud workloads categories

The common workloads that can use the multi-cloud strategy are as follows:

Deploying the same workload on two or more clouds simultaneously. For example, a business might store copies of the same data in both AWS S3 and Azure Storage. By spreading data across multiple clouds, that business gains greater availability and reliability (without paying the higher costs of mirroring within one cloud, which is expensive; multi-region, for example, is expensive in AWS).

Running multiple workloads at once, with some workloads running in one cloud and the others in another cloud (this approach provides cost efficiency and cloud agnosticism but doesn’t make individual workloads more reliable than using a single cloud)
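The first category above, keeping copies of the same data in two clouds, can be sketched as a thin wrapper that writes to every store and reads from the first one that answers. This is only an illustration: `InMemoryStore` is a hypothetical stand-in for real S3 or Azure Storage clients, and a real implementation would also have to deal with partial writes and consistency.

```python
from typing import Optional, Protocol


class ObjectStore(Protocol):
    def put(self, key: str, data: bytes) -> None: ...
    def get(self, key: str) -> bytes: ...


class InMemoryStore:
    """Stand-in for a real cloud storage client (hypothetical)."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.available = True
        self._objects: dict[str, bytes] = {}

    def put(self, key: str, data: bytes) -> None:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        self._objects[key] = data

    def get(self, key: str) -> bytes:
        if not self.available:
            raise ConnectionError(f"{self.name} is unavailable")
        return self._objects[key]


class ReplicatedStore:
    """Write every object to all stores; read from the first one that answers."""

    def __init__(self, stores: list[ObjectStore]) -> None:
        self._stores = stores

    def put(self, key: str, data: bytes) -> None:
        for store in self._stores:
            store.put(key, data)

    def get(self, key: str) -> bytes:
        last_error: Optional[Exception] = None
        for store in self._stores:
            try:
                return store.get(key)
            except ConnectionError as error:
                # Remember the failure and fall through to the next cloud.
                last_error = error
        raise ConnectionError("no store is available") from last_error


if __name__ == "__main__":
    s3, blob = InMemoryStore("aws-s3"), InMemoryStore("azure-storage")
    replicated = ReplicatedStore([s3, blob])
    replicated.put("report.csv", b"a,b,c")
    s3.available = False  # simulate an outage in one cloud
    print(replicated.get("report.csv"))
```

The availability gain comes from the read path: as long as one cloud still answers, the data stays reachable.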

It is best to keep multi-tier applications in the same cloud and region (then you can use the cloud provider’s backbone for internal traffic).

The same applies to multi-cloud architecture for hybrid deployments.

Whereas you can purchase dedicated bandwidth between on-premises and Azure (for example), you can’t easily do the same between public cloud providers.

The workload might have regulatory requirements. You might operate in a geography with a particular piece of legislation that specifies where data can go, or the security configuration might have to follow a set of standards.

Multi cloud for workloads with complex regulatory requirements

  • Each cloud has diverse ways of assessing compliance with regulatory standards.
  • While cloud providers themselves are compliant with standards, the configuration for your organization’s workload may not be.

What should the technical lead know before starting with multi-cloud

Compute offerings in the cloud lie along a spectrum from IaaS (where you manage your servers, storage, networking, firewalls, and security on the cloud) to PaaS (where you use platform-specific tools for scaling, versioning, and deployment). PaaS can help you get to production faster.

| Azure | VM instances | Container clusters | Hosted Apps | Serverless functions |
| AWS | AWS EC2 | AWS EKS | AWS Elastic Beanstalk | AWS Lambda |
| Google | Google Compute Engine | Google Kubernetes Engine | Google App Engine | Google Cloud Functions |

In the first column, we have more low-level access to the hardware, the underlying operating system, and the machine; virtual machines give us an abstraction over the hardware. At the right end are hosted apps and serverless functions, which mean fewer ops and less administrative overhead, and you don’t have to provision your own machines. We focus on the code and the platform takes care of the rest. However, this means you have less control and more platform lock-in.

As you can see in the table above, no matter which cloud provider you use, you have almost the same services.

If you want less administrative overhead and more platform support, and don’t want to worry about provisioning, then you have to use platform-specific tools. On the one hand, platform-specific tools offer convenience; on the other hand, they lock you into a particular platform, and the code that you write is not portable.

If you choose more control, you get less platform support and end up using open-source tools. This is a balance that you need to find.

The balance between embracing platform capabilities and enduring vendor lock-in: search for your own sweet spot.

The sweet spot that companies have found often involves the use of containers.

Containers offer the right trade-off between IaaS and PaaS offerings. Containers are just a unit of software, which basically package your application and all of its dependencies into an isolated unit. Containers are a key technology when you’re planning for a hybrid or multi-cloud.

A single container does not offer scalability, load balancing, fault tolerance, and all the other bells and whistles that you need when you’re building at scale. What you need is a cluster of containers. Once you have a cluster, you need an orchestrator, and that’s where Kubernetes comes in.

Kubernetes is an orchestration technology for containers that allows you to turn isolated containers running on different hardware into a cluster. Kubernetes embraces platform capabilities while maintaining the portability and flexibility of your code. The cool thing about Kubernetes is that all cloud platforms support it, no matter which one you’re on.

A successful multi-Cloud solution/deployment

Elements of successful multi-cloud deployments would be as follows:

  1. A consistent set of tools to manage workloads across clouds. Several tools for maintenance across clouds might not be a good idea: for example, if we have to use something like PowerShell to manage each cloud, we have to know the different command lines of Azure, AWS, and GCP, which is cumbersome. A good solution is to have only one tool for managing all VMs and to pay for this service. This expense buys efficient maintenance.
  2. A consistent way of monitoring the security of workloads across clouds.
  3. An easy way to manage and monitor costs for each cloud in the multi-cloud deployment.
  4. The ability to migrate workloads between clouds as necessary (to avoid the lock-in issue).

Multi cloud identity

Manage identity and access for cloud [3] admins, app developers, and users. For cloud-based solutions, identity and access management (IAM) must always be available.

References

[1] Why Organizations Choose a Multicloud Strategy, Goasduff, Laurence, Production date: 2019.05.07, Accessed date: 2020.06.23

[2] Multicloud strategies, Linthicum, David, Accessed date: 2020.07.03

[3] Public Cloud Inter-region Network Latency as Heat-maps, Agarwal, Sachin, Accessed date: 2020.07.09

[4] TECH INSIGHTS: THE IT TECH SHAPING TOMORROW, Tozzi, Christopher, Production date: 2019.10.21

[5] Microsoft’s New Azure Arc Services Can Run on ‘Any Infrastructure’, Sverdlik, Yevgeniy, Production Company: Datacenter Knowledge, Production date: 2019.11.04, Accessed date: 2020.06.06

[6] Governance guide for complex enterprises: Multicloud improvement, Production Company: Microsoft, Production date: 2019.09.17, Accessed date: 2020.06.21

[7] Multi-Cloud Governance: Agility, not Chaos in your Multi-Cloud, Production Company: Microsoft, Production date: 2019.01.21, Accessed date: 2020.07.15

[8] Making Sense of a Multi-Cloud API Approach, Anthony, Art, Production date: 2020.05.17, Accessed date: 2020.06.03

[9] Hybrid Cloud Infrastructure Foundations with Anthos, Production Company: Google, Production date: 2019.12.10, Accessed date: 2020.05.25

[10] 12 Business Metrics That Every Company Should Know, Karlson, Karola, Accessed date: 2020.07.03

[11] Multicloud identity and access management architecture, Production Company: IBM, Accessed date: 2020.07.03

Throttling Design Pattern

Also known as Rate Limiting. We place a throttle in front of the target service or process to control the rate of invocations or data flow into the target.

We can use cloud services to apply this design pattern. This is useful if we have an old system and don’t want to change its code.

Each cloud vendor offers a service that does the throttling for us.

Approach

  • Reject requests that arrive too frequently
  • Break up the logic into smaller steps (Pipes and Filters Design Pattern) and deploy them behind higher- and lower-priority queues.

Note: If you have to handle long-running tasks, use a queue or a batch job.
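When no managed service is available, rejecting overly frequent requests can be done with a small in-process throttle. The sketch below uses a token bucket, which is one common rate-limiting algorithm and not something this post prescribes. The class name and the injectable `clock` parameter are assumptions made for illustration and testability.

```python
import time
from typing import Callable


class TokenBucket:
    """Allow on average `rate` requests per second, with bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: int,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)  # start with a full bucket
        self._last_refill = clock()

    def allow(self) -> bool:
        """Return True if the request may pass, False if it must be rejected."""
        now = self._clock()
        elapsed = now - self._last_refill
        self._last_refill = now
        # Refill proportionally to the elapsed time, never above capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


if __name__ == "__main__":
    bucket = TokenBucket(rate=5.0, capacity=5)
    results = [bucket.allow() for _ in range(8)]
    print(results)  # the first five calls pass; later ones are likely rejected
```

A rejected request would typically be answered with an error such as HTTP 429 so that the caller can retry later; that mapping is left out of the sketch.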

Autoscaling & Throttling

They are used in combination and strongly affect the system architecture. Think about them in the early phase of the application design.

Security

Security in the “Bring Your Enterprise on Cloud” topic is a very large job, but implementing it is not impossible. This topic is based on the related links.

The conceptual check list for security is as follows

Enterprise Infrastructure Security

  1. Network security
  2. Data encryption
  3. Key and secret management
  4. Identity & Access Management
  5. Duty segregation
  6. Least Privileges
  7. Zero trust
  8. Defense in depth
  9. Platform policies
  10. Vulnerability check/management
  11. Compliance Monitoring

Enterprise Application Security

  1. Database
  2. Storage
  3. Container image registry
  4. Container service
  5. Kubernetes service
  6. Serverless functions
  7. App Service
  8. Queue services
  9. Event services
  10. Cache services
  11. Load balancers
  12. CDN services
  13. VMs
  14. VM Disks

Approach

These are the topics that must be considered in the “Bring Your Enterprise on Cloud” topic. In the following links I’ll provide an exact check list per cloud provider.

To make the job easier, it’s better to go through the conceptual check list in layers, as demonstrated in the sample below. This helps to do the job in an Agile way.

Layer 1: We explain what, for example, the network should look like.

Layer 2: We explain how we can have, for example, a resilient network (we decide which platform service or third-party service or tool can realize it).

Layer 3: We explain how we can have, for example, a highly available network (we decide which platform service or third-party service or tool can realize it).

Layer 4: We can add more layers if we need them.

Network

Resilient

High Available

Key/ Secret management

Resilient

High Available

Identity & Access Management

Resilient

High Available

Related links

Bring Your Enterprise on Cloud

We cannot generalize one way of migrating to the cloud for all companies and enterprises. But I have provided a check list of topics that can help you get a good start without wasting time starting from scratch.

Enterprise Infrastructure

  1. On-Prem <-> Cloud
    1. Azure
      1. VPN
      2. ExpressRoute
    2. AWS
  2. DNS
    1. Azure
      1. DNS private, public
    2. AWS
      1. Route 53 private, public
  3. Network
    1. Azure
      1. Vnet, Subnet, NSG, ASG, UDR
      2. Subnet Endpoint, Private Endpoint, Service Endpoint
    2. AWS
      1. VPC, Subnet, SecurityGroup, InternetGateway, NAT
      2. Subnet Endpoint, Service Endpoint
  4. Credential management
    1. Azure
      1. Key Vault
      2. Managed HSM or Dedicated HSM (FIPS 140-2 level 3)
    2. AWS
      1. Secrets Manager
      2. Certificate Manager
      3. CloudHSM [AWS DOC] (FIPS 140-2 level 3)
      4. Key Management Service (KMS)
  5. Backup & Restore
  6. Logging & Monitoring
    1. Azure
      1. Application Insights
      2. Monitor
    2. AWS
      1. CloudWatch
  7. Access Control (who has access to what)

Enterprise Application

  1. Storage
    1. Azure
      1. Storage
    2. AWS
      1. S3
  2. Serverless services
    1. Azure
      1. Function App
      2. Logic Apps
    2. AWS
      1. Lambda
  3. API/APP Gallery
    1. Azure
      1. API Management
    2. AWS
      1. API Gateway

Related links

Terraform : Cloud

Create organization and workspace in terraform cloud

  1. Sign up or sign in at this URL (https://app.terraform.io/signup/account)
  2. Skip all the questions
  3. Create an organization
  4. Create a workspace (by clicking on “create one now”)
  5. Select the type of the workspace (CLI-driven workflow)
  6. Give the workspace a name.
  7. Create the workspace.
  8. After the workspace is created, the following page appears.
  9. Set the Terraform version in workspace > Settings > General and save the settings.
  10. Change the execution mode to local (to run Terraform commands from the workstation with local variables).
  11. Pay attention: you see two settings on the page.
  12. To change the Plan & Billing, go to the Organization settings.

We can use remote state to avoid saving the Terraform state file locally and to keep the state safe.


Configure remote state

Related links

Clouds : Shared responsibility model

It doesn’t make a difference which cloud vendor you have chosen as your platform. All of them follow the shared responsibility model.

What does it mean?

It means that the cloud provider is responsible for the security of the cloud, and the cloud customer is responsible for security in the cloud.

| Azure | AWS | GCP | IBM |
|---|---|---|---|
| Shared responsibility model | Shared responsibility model | Shared responsibility model | Shared responsibility model |

[Source]

What is the customer responsible for?

  • Configuring access to the resources, e.g. servers
  • Hardening the operating system of the servers
  • Ensuring that disk volumes are encrypted
  • Determining the identity and access permissions for specific resources
  • …

Who should take care of security?

In companies that run services and applications on the cloud, the responsible teams must have enough knowledge about security on the cloud.

| Developers and Enterprise architects | Ensure the cloud services they use are designed and deployed with security. |
| DevOps and SRE Teams | Ensure security is introduced into the infrastructure build pipeline and the environments remain secure post-production. |
| InfoSec Team | Secure systems |

At which step of the project does security have to be applied?

AWS : Monitor, React, and Recover

Key concepts

  • Monitoring: understanding what is happening in your system.
  • Alerting: a CloudWatch component and the counterpart to monitoring; it allows the platform to let us know when something is wrong.
  • Recovering: identifying the cause of the issue and rectifying it.
  • Automating
  • Alert:
  • Simple Notification Service (SNS):
  • CloudTrail: by enabling CloudTrail on your AWS account, you ensure that you have the data necessary to look at the history of your AWS account and determine what happened and when.
  • Amazon Athena: lets you filter through large amounts of data with ease.
  • SSL certificate: Cryptographic certificate for encrypting traffic between two computers.
  • Source of truth: When data is stored in multiple places or ways, the “source of truth” is the one that is used when there is a discrepancy between the multiple sources.
  • Chaos Engineering: Intentionally causing issues in order to validate that a system can respond appropriately to problems.

Monitoring concept

Without monitoring, you are blind to what is happening in your systems. Without having knowledgeable folks alerted when things go wrong, you’re deaf to system failures. Creating systems that reach out to you and ask you for help when they need it, or better yet, let you know that they might need help soon, is critical to meeting your business goals and sleeping easier at night.

Once you have mastered monitoring and alerting, you can begin to think about how your systems can fix themselves. At least for routine problems, automation can be a fantastic tool for keeping your platform running seamlessly [Source].

Monitoring and responding are core to every vital system. When you architect a platform, you should think early in the design process about how you will know if something is wrong with that platform. There are many different kinds of monitoring that can be applied to many different facets of the system, and knowing which types to apply where can be the difference between success and failure.

CloudWatch

  • CloudWatch is the primary AWS service for monitoring
  • It has different pieces that work together
  • CloudWatch Metrics is the main repository of monitoring metrics, e.g. what the CPU utilization looks like on your RDS database, or how many messages are currently in SQS (Simple Queue Service)
  • We can create custom metrics
  • CloudWatch Logs is a service for storing and viewing text-based logs, e.g. from Lambda, API Gateway, …
  • CloudWatch Synthetics lets you create health checks for HTTP endpoints
  • CloudWatch Dashboard
  • CloudWatch Alarms

List of AWS services that push metrics into CloudWatch: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/aws-services-cloudwatch-metrics.html

Refer to the AWS : Serverless post to create a simple Lambda for testing CloudWatch.

How to use CloudWatch

This is the overview > metrics

You see the list of namespaces. Lambda is one of them.

CloudWatch Alert [Source]

  • CloudWatch doesn’t alert you
  • CloudWatch Alarms inform you
  • Proper alerting will help you keep tabs on your systems and will help you meet your SLAs
  • Alerting in ways that bring attention to important issues will keep everyone informed and prevent your customers from being the ones to inform you of problems
  • CloudWatch Alarms integrates with CloudWatch Metrics
  • Any metric in CloudWatch can be used as the basis for an alarm
  • These alarms are sent to SNS topics, and from there, you have a whole variety of options for distributing information such as email, text message, Lambda invocation or third party integration.
  • Alerting when problems occur is critical, but alerting when problems are about to occur is far better.
  • Understanding the design and architecture of your platform is key to being able to set thresholds correctly
  • You want to set your thresholds so that your systems are quiet when the load is within their capacity, but start speaking up when they head toward exceeding their capacity. You will need to determine how much advance warning you need to fix issues.

For utilization problems, always try to configure the alert so that you still have a weekend to solve the problem.
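The amount of advance warning can be turned into a threshold with simple arithmetic. The sketch below assumes linear growth of the monitored value, which is a simplification, and both function names are hypothetical. It computes the level at which an alarm should fire so that a chosen number of days remains before capacity is reached.

```python
def warning_threshold(capacity: float, growth_per_day: float, lead_days: float) -> float:
    """Return the level at which to alert so that `lead_days` remain (linear growth)."""
    if growth_per_day < 0 or lead_days < 0:
        raise ValueError("growth_per_day and lead_days must be non-negative")
    # A negative result would mean "alert immediately", so clamp at zero.
    return max(0.0, capacity - growth_per_day * lead_days)


def days_until_full(current: float, capacity: float, growth_per_day: float) -> float:
    """Estimate the days left before `current` reaches `capacity` (linear growth)."""
    if growth_per_day <= 0:
        return float("inf")  # no growth, so capacity is never reached
    return max(0.0, (capacity - current) / growth_per_day)


if __name__ == "__main__":
    # Example: a 100 GB disk growing 5 GB per day, with 3 days of lead time.
    print(warning_threshold(capacity=100.0, growth_per_day=5.0, lead_days=3.0))
    print(days_until_full(current=70.0, capacity=100.0, growth_per_day=5.0))
```

The resulting number would then be used as the threshold of the alarm on the corresponding metric.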

Example: create a Lambda function and set up an alert on the Lambda function’s invocations in CloudWatch Alarms to email you any time the Lambda is run.

Solution has been recorded in video

Recovering From Failure by using CloudTrail 

The key to recovering from failure is identifying the root cause as well as how and who/what triggered the incident.

We can log

  • management events (the first copy of management events is free of charge, but each extra copy costs $2.00 per 100,000 management events [Source])
  • data events (pay $0.10 per 100,000 data events)
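Using the prices quoted above, a rough monthly estimate is simple arithmetic. This is a sketch only: the prices are taken from this post and may have changed, and the function name and parameters are hypothetical.

```python
MANAGEMENT_PRICE_PER_100K = 2.00  # USD, for each additional copy (from the text above)
DATA_PRICE_PER_100K = 0.10        # USD (from the text above)


def cloudtrail_monthly_cost(management_events: int, extra_copies: int,
                            data_events: int) -> float:
    """Estimate the monthly cost; the first copy of management events is free."""
    management = management_events / 100_000 * MANAGEMENT_PRICE_PER_100K * extra_copies
    data = data_events / 100_000 * DATA_PRICE_PER_100K
    return round(management + data, 2)


if __name__ == "__main__":
    # One million management events with one extra copy, plus five million data events.
    print(cloudtrail_monthly_cost(1_000_000, extra_copies=1, data_events=5_000_000))
```

With no extra copies and no data events, the estimate is zero, which matches the free first copy of management events.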

You will be able to refer to this CloudTrail log for a complete history of the actions taken in your AWS account. You can also query these logs with Amazon Athena, which lets you filter through large amounts of data with ease.

Automating recovery

Automating service recovery and creating “self-healing” systems can take you to the next level of system architecture. Some solutions are quite simple. Using autoscaling within AWS, you can handle single instance/server failures without missing a beat. These solutions will automatically replace a failed server or will create or delete servers based on the demand at any given point in time.

Beyond the simple tasks, many types of failure can be automatically recovered from, but this can involve significant work. Many failure events can generate notifications, either directly from the service, or via an alarm generated out of CloudWatch. These events can have a Lambda function attached to them, and from there, you can do anything you need to in order to recover the system. Do be cautious with this type of automation where you are, in essence, turning over some control of the platform – to the platform. Just like with a business application, there can be defects. However, as with any software, proper and thorough testing can help ensure a high-quality product.

Some AWS services can autoscale, which helps with automated recovery.
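The path described above, from a CloudWatch alarm via SNS to a Lambda function, can be sketched as a handler that reads the alarm notification and decides what to do. The sketch assumes the usual SNS event shape, where the alarm arrives as a JSON string in `Records[].Sns.Message`. The alarm naming convention and the action names are hypothetical, and the actual recovery calls are left out.

```python
import json


def choose_recovery_action(alarm: dict) -> str:
    """Map a CloudWatch alarm notification to a (hypothetical) recovery action."""
    if alarm.get("NewStateValue") != "ALARM":
        return "none"
    # Hypothetical convention: the alarm name prefix encodes what to do.
    name = alarm.get("AlarmName", "")
    if name.startswith("queue-depth"):
        return "scale-out-workers"
    if name.startswith("instance-health"):
        return "replace-instance"
    return "notify-on-call"


def handler(event: dict, context: object) -> list[str]:
    """Lambda entry point for SNS-delivered CloudWatch alarm notifications."""
    actions = []
    for record in event.get("Records", []):
        # SNS wraps the alarm notification as a JSON string in Sns.Message.
        alarm = json.loads(record["Sns"]["Message"])
        actions.append(choose_recovery_action(alarm))
    return actions


if __name__ == "__main__":
    sample = {"Records": [{"Sns": {"Message": json.dumps(
        {"AlarmName": "queue-depth-orders", "NewStateValue": "ALARM"})}}]}
    print(handler(sample, None))
```

Keeping the decision logic in a plain function makes it easy to test thoroughly before the platform is allowed to act on it, which is the caution raised above.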

Chaos engineering

Chaos Engineering is the practice of intentionally breaking things in production. If your systems can handle these failures, why not allow or encourage these failures?

Set rational alerting levels for your system so that for foreseeable issues, you get alerted so that you can take care of issues before they become critical.

Edge cases [Source]

Many applications and services lend themselves to being monitored and maintained. When you run into an application that does not, it is no less important (it is likely more important) to monitor, alert on, and maintain it. You may find yourself needing to go to extremes in order to pull these systems into your monitoring framework, but if you do not, you risk letting faults go undetected. Ensuring coverage of all of the components of your platform, documenting the platform and training staff to understand it, and practicing what to do in the case of outages will help ensure the highest uptime for your company.



You owe your dreams your courage.

Koleka Putuma