Change as part of the platform’s DNA

by Philippe Tiede on Sep 6, 2022

Survival of the fittest

The market demands ever more speed in business and innovation. This inevitably leads to more complexity and dynamics in the operation of IT platforms. The traditional view of 24/7 availability is no longer suited to agile environments, and providing reliable 24/7 availability is increasingly becoming a challenge for platform teams. 24/7 availability must therefore be rethought: rather than treating potential incidents with aversion, you need to embrace disruptions and handle them as part of your DNA instead of avoiding them at all costs and thereby slowing down innovation.

How is the market currently changing, using e-commerce as an example?

In the area of e-commerce, direct-to-consumer (D2C) will continue to be a trend in 2022. The customer-centric D2C business enables new ways to better support customers through their purchase decision process. Retailers can increase customer value and differentiate themselves from the competition through a better customer experience. To do so, however, they need to react faster to market trends and take full advantage of the freedom for product innovation.

At the same time, more than ever, unavailable services mean real revenue loss for companies. For every hour of downtime, adidas, for example, loses an average of €450,000 in revenue. On Black Friday or during hyped sale events, the loss is significantly higher. Permanent, trouble-free availability of the underlying infrastructure (the platform) is therefore critical to a company's success.

Faster time-to-market and more frequent product innovations lead to more dynamics and complexity in the operation of IT platforms and affect the 24/7 availability of services. Platform teams must not only ensure continuous, disruption-free availability of their platform but also enable the speed of innovation the business needs.

Today, platform teams must handle speed and stability at the same time to enable business.

How can platform teams achieve speed and stability?

Over the past few years, Kubernetes has been gaining traction in platform operations. Through its architecture, Kubernetes provides exactly what teams need to support the business: speed, flexibility, and, most importantly, stability. However, Kubernetes alone is not sufficient out of the box to run applications in production. It needs to be configured, managed, extended, and refined constantly. Teams therefore have to pay attention to the right tool set as well as set up the right processes.

Done right, you have the opportunity to constantly improve the resilience and performance of your Kubernetes platform and to reduce its blast radius: as part of your preventive maintenance, during an incident, and in the incident aftermath.

All phases must be interconnected so that the experience and knowledge gained continuously feed back into the further development of your operational capabilities.

Preventive Maintenance:

  • Resilient Infrastructure: The cluster services for observability, security, developer experience, and connectivity must be carefully selected, revised, updated, and managed so that the components of the platform interact smoothly.

  • Infrastructure Testing: The infrastructure must be tested continuously and automatically during upgrades, new releases, or the introduction of new tools (see the smoke-test sketch after this list).

  • Automation: Manual activities must be reduced through automation. A high degree of automation of the infrastructure and the operating processes reduces the susceptibility to incidents and allows you to respond to them automatically.

  • Observability: Comprehensive monitoring must be set up to detect incidents at an early stage. This covers both the applications and the infrastructure. Thresholds based on empirical values can additionally help to identify the risk of potential incidents early.

  • Processes: Infrastructure operations are only as good as their processes. Operating processes must be rehearsed and automated as far as possible so that teams can act quickly, especially in the event of an incident. Processes such as build and version management, release and deployment management, and patch management of the infrastructure must be automated. Exercises such as chaos engineering also help to identify potential flaws.
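
To make infrastructure testing concrete, here is a minimal sketch of a post-upgrade smoke test written against the Kubernetes Go client (client-go). It is an illustration under stated assumptions, not a prescribed test suite: it only verifies that every node reports Ready, and the package and test names are ours.

```go
package smoke

import (
	"context"
	"testing"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

// TestAllNodesReady is a minimal post-upgrade smoke test: it fails if any
// node in the cluster does not report the Ready condition. A real suite
// would also probe cluster services, workloads, and connectivity.
func TestAllNodesReady(t *testing.T) {
	// Load the kubeconfig of the cluster under test (hypothetical setup;
	// a CI pipeline would typically inject credentials differently).
	config, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		t.Fatalf("loading kubeconfig: %v", err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		t.Fatalf("creating client: %v", err)
	}

	nodes, err := client.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		t.Fatalf("listing nodes: %v", err)
	}
	for _, node := range nodes.Items {
		for _, cond := range node.Status.Conditions {
			if cond.Type == corev1.NodeReady && cond.Status != corev1.ConditionTrue {
				t.Errorf("node %s is not Ready", node.Name)
			}
		}
	}
}
```

Wired into the upgrade pipeline and run with `go test`, checks like this turn infrastructure testing into an automated gate rather than a manual afterthought.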

During an Incident:

  • Alerting: Alerting must be used in a targeted manner so that the appropriate developer or engineer can take care of an incident within minutes. This means specific alerts that pinpoint issues precisely, but also the right volume of alerts to avoid alert spamming. Alerts should be reviewed continuously to improve the alerting system.

  • Swarming: Ticket systems and escalation levels should be avoided as much as possible. If, while firefighting, root causes cannot be precisely localized or corrected in a timely manner, representatives of different teams must come together as a swarm to find a solution together.

  • Access: Engineers need full access to the platform and infrastructure for troubleshooting, whether that is access to the relevant log files for analysis or the rights needed to restart faulty clusters. This helps engineers fix issues at the root and accelerates issue handling.

  • Hotfixes: Ops recipes and runbooks support the initial troubleshooting of common issues. Swarming additionally helps to find possible workarounds and hotfixes with the goal of making the service available again as soon as possible (a minimal example follows this list).
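
As a sketch of the kind of hotfix step a runbook might automate, the snippet below triggers a rolling restart of a deployment via client-go by patching the pod template's restartedAt annotation, which is the same mechanism `kubectl rollout restart` uses. The function name and error wrapping are illustrative assumptions, not taken from any particular runbook.

```go
package hotfix

import (
	"context"
	"fmt"
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
)

// RestartDeployment triggers a rolling restart of a deployment by patching
// its pod template with a fresh restartedAt annotation, the same mechanism
// that `kubectl rollout restart deployment/<name>` uses under the hood.
func RestartDeployment(ctx context.Context, client kubernetes.Interface, namespace, name string) error {
	patch := fmt.Sprintf(
		`{"spec":{"template":{"metadata":{"annotations":{"kubectl.kubernetes.io/restartedAt":%q}}}}}`,
		time.Now().Format(time.RFC3339),
	)
	if _, err := client.AppsV1().Deployments(namespace).Patch(
		ctx, name, types.StrategicMergePatchType, []byte(patch), metav1.PatchOptions{},
	); err != nil {
		return fmt.Errorf("restarting deployment %s/%s: %w", namespace, name, err)
	}
	return nil
}
```

Codifying such first-response steps keeps hotfixes auditable and repeatable, and it presupposes exactly the kind of platform access described above.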

Incident Aftermath:

  • Post-Mortems: Learning must be derived from every incident and fed back into the operational capabilities, regardless of whether the lesson is process-related, technological, or organizational.

  • Programmatic Changes: Changes must be derived systematically from issues and incidents in order to improve alerting, eliminate root causes, grow the team's knowledge, and expand the platform and its automation. In this way, sources of error can be reduced in the long term.

  • Iterating: Each failure must be treated as the starting point for a new iteration of the platform, and each fix must ship as part of such an iteration.

Platform teams need to work on their operational capabilities and continuously evolve the platform based on lessons learned, especially from incidents. Precisely because upgrades, deployments, and disruptions cause pain, you should put yourself in these situations more often, so that you get better and better as a team.

Disruptive changes that exhaust the platform's resilience will become increasingly common. Volatility will be part of operating a platform. Incidents will be part of operating a platform. Platform teams have to integrate this into their mindset. Fail fast, fail often. Automate everything. Cattle, not pets. Discover incidents early. Fix them as quickly as possible. Build your platform to be resilient enough that incidents have little impact.

Adapting to disruptive change has to become part of the platform team’s responsibility. Change has to be part of their DNA.

Why do platform teams struggle?

In the past, the operation of IT platforms was technologically, procedurally, and organizationally geared toward avoiding changes and thus potential incidents in operation. The motto was "never change a running system," and it manifested itself in the established thought patterns and processes: change advisory boards, release planning cycles, typical escalation instances, or outsourcing to third-party providers as insurance in an emergency. Every change to the system was plannable and predictable. Incidents could thus be avoided, and 24/7 availability was easily guaranteed.

However, faster time-to-market and an increased number of product innovations require new ways of thinking. Despite high investments in stable support structures, platform teams are increasingly experiencing incidents and outages of their services. Trying to maintain 24/7 availability, platform teams are quickly caught in a vicious cycle:

  • Technical Debt: Rapid implementation of services comes at the cost of technical debt. As a result, many services do not actually have the production maturity required for stable operation.

  • Islands of Automation: Rapid development in the cloud native space means that more and more islands of automation emerge within the company, resulting in a proliferation of tools. Platform teams do not have the resources or capacity to acquire the necessary know-how for this multitude of tools.

  • Unplanned Tasks: Constant firefighting permanently delays the implementation of development projects. The necessary further development of the platform and of the team's know-how is thus impeded. Technical debt grows, and knowledge build-up is neglected.

  • Understaffed Teams: Platform teams are chronically understaffed for the multitude of topics they own. The necessary talent can only be hired from the market slowly and at great expense. This reinforces the points above, and the vicious cycle starts all over again.

The consequence is that platform teams either deploy changes too slowly or deploy changes that lead to an unstable environment. In the long term, the business loses confidence in the stability of the platform, and the platform teams are perceived as blockers of innovation. Platform teams no longer feel prepared for upcoming business demands and struggle to provide both high availability and agility.

Given the amount of firefighting and the increased demands from the business, platform teams simply do not have enough capacity to adapt their platform and their processes accordingly.

Platform teams are caught in a never-ending hamster wheel. 

How can Giant Swarm support your platform team? 

Disclaimer: This is the part where we describe how Giant Swarm helps our customers. You can skip it if you don’t want to know what Giant Swarm does, but you might pick up an idea or two about how we support our customers and, most importantly, what drives us.

Our goal is the stable and flexible operation of production Kubernetes infrastructure that best drives the business and builds trust in the platform. We follow a holistic approach based on DevOps principles and focus on a high level of automation of infrastructure operations and, above all, close collaboration with our customers. We take over the 24/7 operation of several hundred Kubernetes clusters for our customers worldwide and can draw on many years of experience.

How we can support platform teams:

  • Solid Open Source Platform: We develop and manage our platform for different customers based on open source tools, which we select carefully with regard to current developments in the community and the needs of our customers. We ensure that the interplay of these tools enables high availability, scalability, and flexibility.

  • Processes in Place: We provide the automation for the necessary operating processes and can onboard the platform team accordingly. Especially in version and patch management, we make sure that our platform is always up-to-date to reduce potential sources of incidents.

  • Flexibility: We grant platform teams access rights to the platform depending on the maturity level and requirements. In principle, platform teams have the option of full access to all components. In most cases, we support changes to the platform so that the platform team has the necessary safety net to adapt the platform according to its use cases.

  • Hive Mind: Our engineers have many years of experience in the Kubernetes ecosystem, and we support many clusters across different customers. This experience continuously flows into the further development of our platform. This way, each customer benefits from the glitches and incidents encountered by all customers.

  • Extended SRE Team: We focus on direct communication. Each platform team receives a dedicated account engineer as a contact person and a Slack channel for questions and problems. This way, concerns are answered directly and in the shortest possible time.
     
  • No Ticket Handling: Since we maintain direct communication via Slack, we do not need a ticket system. This enables proactive reporting of potential issues or incidents by our engineers or direct alerting of our on-call engineers by the platform team. If root causes cannot be localized, swarming by our teams takes place to find a solution as quickly as possible. 

  • All-inclusive: We focus on close cooperation. Therefore, our support is not broken down into individual support contracts but is part of our overall package.

  • Shared Responsibility: We have a no-blame culture and do not draw hard lines of responsibility between our customers and us. We therefore support not only the part of the platform that we manage but also solutions that our customers are responsible for.

  • Coaching: Our goal is to develop platform teams further, so the exchange of knowledge is very important to us. We like to tackle new challenges and work out new solutions together.

We are a reliable partner for our customers in the operation of the Kubernetes platform. We go beyond the boundaries of classic support and accompany platform teams as an integral part of their cloud native journey. 

We support teams beyond traditional ticket handling. We evolve them. Together we are one team, one platform, and one SRE.

Conclusion

For a long time, speed and stability were considered incompatible in IT operations. Today, however, business requirements demand that services are always available despite a short time-to-market. Platform teams must embrace constant change as part of their DNA and adapt to this volatility. Yet many platform teams find themselves under a constant barrage of incidents and are therefore unable to adapt their platform technologically, procedurally, or organizationally. A reliable partner can help break this hamster wheel and free up the corresponding capacity. But only if this partnership is based on direct communication and collaboration can the cloud native journey be a success.