• Sep 6, 2022
Survival of the fittest
The market demands ever more speed in meeting business needs and delivering innovation. This inevitably leads to more complexity and dynamics in the operation of IT platforms. The traditional view of 24/7 availability is no longer suited to agile environments. As a result, providing reliable 24/7 availability is increasingly becoming a challenge for platform teams, and it must therefore be rethought. Instead of avoiding incidents at all costs, and thereby reducing the speed of innovation, you need to embrace disruptions and handle them as part of your DNA.
How is the market currently changing, using e-commerce as an example?
In the area of e-commerce, direct-to-consumer (D2C) will continue to be a trend in 2022. The customer-centric D2C business enables new measures to better accompany the customer's purchase decision processes. Retailers can increase customer value and differentiate themselves from the competition through a better customer experience. To do so, however, you need to react faster to market trends and take full advantage of the freedom for product innovation.
On the other hand, more than ever, unavailable services mean real revenue loss for companies. For every hour of downtime, adidas, for example, loses an average of €450,000 in revenue. On Black Friday or at hype sale events, the loss is significantly higher. Therefore, permanent and trouble-free availability of the underlying infrastructure (platform) is critical for the success of companies.
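To make that figure tangible, here is a rough back-of-the-envelope sketch in Python. It assumes the €450,000 hourly average quoted above applies evenly across the year, which it does not (peak events cost more), and the availability targets in the loop are purely illustrative assumptions, not figures from adidas or from us.

```python
# Rough estimate of yearly revenue loss for a given availability level.
# HOURLY_LOSS_EUR is the average figure quoted in the text; treating it as
# constant over the year is a simplifying assumption.
HOURLY_LOSS_EUR = 450_000
HOURS_PER_YEAR = 365 * 24


def downtime_hours_per_year(availability: float) -> float:
    """Expected hours of downtime per year for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * HOURS_PER_YEAR


def estimated_loss_eur(availability: float,
                       hourly_loss_eur: float = HOURLY_LOSS_EUR) -> float:
    """Estimated yearly revenue loss in euros under the assumptions above."""
    return downtime_hours_per_year(availability) * hourly_loss_eur


if __name__ == "__main__":
    # Hypothetical availability targets, chosen only for illustration.
    for target in (0.99, 0.999, 0.9999):
        hours = downtime_hours_per_year(target)
        loss = estimated_loss_eur(target)
        print(f"{target:.2%} available: {hours:.2f} h down, ~EUR {loss:,.0f} lost")
```

Even a small step in availability changes the estimate by an order of magnitude, which is why the platform's stability is a business topic and not only an operations topic.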
Faster time-to-market and more frequent product innovations lead to more dynamics and complexity in the operation of IT platforms and affect the 24/7 availability of services. Platform teams must ensure not only continuous and disruption-free availability of their platform, but also enable the speed of innovation for the business.
Today, platform teams must handle speed and stability at the same time to enable the business.
How can platform teams obtain speed and stability?
Over the past few years, Kubernetes has been gaining traction in platform operations. Through its architecture, Kubernetes provides exactly what teams need to support the business: speed, flexibility, and most importantly, stability. However, Kubernetes is not sufficient out of the box to run applications in production. It needs to be configured, managed, extended, and refined constantly. Therefore, teams have to pay attention to the right tool set as well as set up the right processes.
Done right, you have the opportunity to constantly improve the resilience and performance of your Kubernetes platform and to reduce its blast radius: as part of your preventive maintenance, during an incident, and in the incident aftermath.
All phases must be interconnected so that the experience and knowledge gained are continuously incorporated into the further development of the operational capabilities.
Preventive Maintenance:
In the event of an incident:
Incident Aftermath:
Platform teams need to work on their operational capabilities and continuously evolve the platform based on lessons learned, especially during incidents. The more pain upgrades, deployments, and disruptions cause, the more often you should put yourself in these situations so that you get better and better as a team.
Disruptive changes that exhaust the platform's resilience will become increasingly common. Volatility will be part of operating a platform. Incidents will be part of operating a platform. Platform teams have to integrate this into their mindset. Fail fast, fail often. Automate everything. Cattle, not pets. Discover incidents early. Fix them as quickly as possible. Build your platform resilient enough that incidents have little impact.
Adapting to disruptive change has to become part of the platform team’s responsibility. Change has to be part of their DNA.
Why do platform teams struggle?
In the past, the operation of IT platforms was technologically, procedurally, and organizationally geared to avoiding changes and thus potential incidents in operation. The motto was "never change a running system", which also manifested itself in established thought patterns and processes, be it change advisory boards, release planning cycles, typical escalation instances, or outsourcing to third-party providers as insurance in an emergency. Every change in the system was plannable and predictable. Incidents could thus be avoided, and 24/7 availability was easily guaranteed.
However, faster time-to-market and an increased number of product innovations require new ways of thinking. Despite high investments in stable support structures, platform teams are increasingly experiencing incidents and outages of their services. Trying to maintain 24/7 availability, platform teams are quickly caught in a vicious cycle:
The consequence is that platform teams are either too slow in deploying changes or deploying changes leads to an unstable environment. In the long term, the business loses confidence in the stability of the platform, and the platform teams are perceived as blockers for innovation. Platform teams do not feel prepared for upcoming business demands anymore and struggle to provide high availability and agility.
Because of the amount of firefighting and the increased demands from the business, platform teams do not have enough capacity left to adapt their platform and their processes accordingly.
Platform teams are caught in a never-ending hamster wheel.
How can Giant Swarm support your platform team?
Disclaimer: This is the part where we describe how Giant Swarm helps our customers. You can skip it if you don’t want to know what Giant Swarm does, but you might take away an idea or two about how we support our customers and, most importantly, what drives us.
Our goal is the stable and flexible operation of production Kubernetes infrastructure to best drive the business and build trust in the platform. We follow a holistic approach based on DevOps principles and focus on a high level of automation of infrastructure operations and, above all, close collaboration with our customers. We take over the 24/7 operation of several hundred Kubernetes clusters for our customers worldwide and can draw on many years of experience.
How we can support platform teams:
We are a reliable partner for our customers in the operation of the Kubernetes platform. We go beyond the boundaries of classic support and accompany platform teams as an integral part of their cloud native journey.
We support teams beyond traditional ticket handling. We evolve them. Together we are one team, one platform, and one SRE.
Conclusion
For a long time, speed and stability were considered incompatible in IT operations. However, business requirements demand that services are always available despite a short time to market. Platform teams must embrace constant change as part of their DNA and adapt to this volatility. Yet they find themselves under a constant barrage of incidents and are, therefore, unable to adapt their platform either technologically or in terms of processes and organization. A reliable partner can help them break out of this hamster wheel and free up the capacity they need. But the cloud native journey can only be a success if this partnership is based on direct communication and collaboration.