Platform reality has a surprising amount of detail

by Puja Abbassi on Jun 22, 2023


This article was originally published on DevOps.com on April 15, 2023.

Developer platforms and the platform teams that create them have been the center of attention for quite some time now. There have been many (wildly different) opinions about the subject, ranging from how bad the notion of a platform team is to how essential platforms are to make an organization’s DevOps journey to cloud native successful.

Since early this year, the CNCF TAG App Delivery team has been working on a whitepaper about platforms. It explains what platforms are, why they are needed and how platform teams might work to create them. Furthermore, there’s an abundance of end-user talks at conferences that show how platform teams and their platforms were essential in enabling developers at scale. Even on the KubeCon + CloudNativeCon keynote stage, we’ve seen success stories of platform teams like the one at Mercedes Benz.

Developer platforms should enable developers

There is no silver bullet to successful platform teams and platforms. However, it is important to understand that the main goal of a developer platform is to enable developers. It is essential for platform teams to focus on delivering that value, which is why we are seeing a trend to introduce actual product ownership and product management roles to platform teams.

The CNCF whitepaper states the three core jobs of a platform team as follows:

  1. Research platform user requirements and plan a feature roadmap.
  2. Market, evangelize and advocate for the platform’s proposed values.
  3. Manage and develop interfaces for using and observing capabilities and services including portals, APIs, documentation and templates and CLI tools.

I couldn’t agree more. But the reality of day-to-day business for many platform teams means they rarely get to focus on those three jobs.

Reality has detail

The situation reminded me of a blog post I read a few years ago, titled “Reality has a surprising amount of detail.” The author explained how it is easy for people to get stuck even when faced with a task that might look simple from the outside. They explain how a task as simple as building wooden stairs can, in its many subtasks, become quite complex. Furthermore, when faced with reality, even the tasks that seemed plannable get more complex and need real-life adjustments.

Most engineers have had analogous experiences in their day-to-day work. We can translate this notion into platform engineering quite well. Building a developer platform sounds straightforward at the outset. However, once we get into it, it becomes a rabbit hole of figuring out the CNCF landscape, adding company-specific glue, compliance, security, fixing bugs, getting deep into the open source projects and sometimes, faced with the reality of open source, even becoming contributors and maintainers of such projects.

Now, I’m not saying that it’s all bad. All these things can be great and rewarding in their own way. Especially with a community like the CNCF, participating and contributing to open source can be a very rewarding experience. You’ll make lots of friends, learn DevOps and development skills and might even further your career significantly.

Besides being a lot of work, this carries the risk that every platform team basically reinvents the wheel. This is because an organization’s requirements seem too large to rely on a single product (or even a suite of products). We end up with bespoke platforms that need a lot of attention, run by platform teams that don’t have enough time to focus on the higher-level jobs that they were tasked with.

Platform evolution

Over the years of talking to and working with platform teams, we have built a model of prototypical platform evolution and its risks at each stage. Usually, a platform team starts with the task of building a developer platform to enable the product teams to build great software and deliver value to the end users.

As they build the platform from the first Kubernetes clusters to observability tooling to security, the backlog for features constantly grows. Even with a growing platform team and sub-teams for certain technology areas, there’s a risk that some of these backlog items never get attention. Delayable, non-critical, daunting tasks (like upgrading to the next Kubernetes release) are especially likely to fall victim to this problem.

Next, when it comes to operations, the platform team needs to limit the possibilities and breadth of what they offer. Operations responsibility for functionality that is outside of those limits (for example, a specialized data store) goes back to the product teams themselves, which results in developer distraction.

Even with a wide coverage of the platform, Day 2 and operations, the next big task for the platform team is advocating for adoption. This is not just marketing for their platform, but also means the platform team has to build user-friendly interfaces and documentation. The risks and alternatives we often see here are that product teams build their own mini-platforms and systems that result in shadow IT. Another common risk is ending up with several competing platforms within big enterprises that are hard to integrate and centralize, which in turn leads to inefficiencies.

Once we get over the adoption hurdles, it is the platform team’s job to deliver value and high-level capabilities to the product teams’ developers. However, at this stage, often the platform teams have grown and might encompass several sub-teams for technology areas. Additionally, some areas like security might already be under central IT control. The risk we see here is that the capabilities offered are reduced to technology silos. They might cover all areas like observability, connectivity, security and release engineering, but to achieve real maturity, a platform team needs to offer capabilities that span several of those areas.

Many of the advanced capabilities that we see with more mature end users cover several or all technology areas. Think about a simple high-level capability, like progressive rollouts. It starts with release engineering capabilities to build and deploy the software. From there, we need connectivity capabilities to dynamically route traffic to the new service. Then we need observability capabilities to observe our new version and from there have automated feedback loops to roll back or go forward with the rollout to production. And if that wasn’t enough, as we’re a security-aware organization, we want the whole software supply chain that facilitates this to be secure and trusted.
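To make that feedback loop a bit more tangible, here is a minimal sketch in Python of how the pieces might fit together. It is not tied to any real tool: the two callables stand in for whatever connectivity and observability tooling a platform actually uses, and the traffic steps and error threshold are made-up values chosen purely for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class RolloutResult:
    promoted: bool       # True if the new version kept traffic after the last step
    final_weight: int    # percentage of traffic on the new version at the end
    history: List[int]   # every traffic weight that was applied, in order


def progressive_rollout(
    set_traffic_weight: Callable[[int], None],  # hypothetical hook into the connectivity layer
    error_rate: Callable[[], float],            # hypothetical hook into the observability layer
    steps: Sequence[int] = (5, 25, 50, 100),    # assumed traffic steps, in percent
    max_error_rate: float = 0.01,               # assumed threshold for rolling back
) -> RolloutResult:
    history: List[int] = []
    current = 0
    for weight in steps:
        # Shift more traffic to the new version.
        set_traffic_weight(weight)
        history.append(weight)
        current = weight
        # Observe the new version and decide whether to continue.
        if error_rate() > max_error_rate:
            # Automated feedback loop: send all traffic back to the old version.
            set_traffic_weight(0)
            history.append(0)
            return RolloutResult(promoted=False, final_weight=0, history=history)
    return RolloutResult(promoted=bool(history), final_weight=current, history=history)
```

Even in this toy version, the loop already depends on two separate technology areas. Building and deploying the artifact and securing the supply chain would sit before and around it.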

Question for the curious cloud native enthusiasts: How many tools do you need to integrate to achieve the above “simple” high-level capability in an automated way? And this is only one of many capabilities.

PaaS-like platforms without the downsides

So, while platforms and platform teams are definitely the way to go, we need to be aware of the amount of work and effort required. We need to be aware of the risks and strive towards building highly integrated platforms that deliver high-level capabilities to the product teams.

However, cloud native developer platforms, while having similar goals to the PaaS systems of the past, should not follow the same approach of full abstraction. If we just abstract away into bespoke systems that hide everything from the user, we go back to the world of PaaS where at some point the end-user developers hit limits and build workarounds around the platforms.

Additionally, if we build these bespoke internal platforms, we’ll be stuck with them and need to maintain and evolve everything by ourselves.

Or, to go back to the blog post I referenced earlier, “At every step and every level there’s an abundance of detail with material consequences.” And we have to remember what we care about. The way to avoid both of these downsides is to bet on open source and stand on the shoulders of giants.

By building our platforms on open source and using as many of the established or soon-to-be-established standards of our community as possible, we can avoid our platforms becoming bespoke. Recent developments like Cluster API, Gateway API and sigstore are some good examples.

Furthermore, we need to make our abstractions poke-able. Our platforms should come out of the box with user-friendly interfaces and sensible production defaults, so that new product teams can onboard quickly and get their jobs done without needing a deep understanding of the underlying technology. At the same time, we need to keep them open to mature users who know their way around and want to tweak them to their needs. We achieve this by giving mature users the option to access said open standards and APIs safely.
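One simple way to picture a poke-able abstraction is a set of production defaults that a team can override piece by piece without having to restate everything else. The Python sketch below is only an illustration of that idea. The setting names and values are invented for the example and do not come from any particular platform.

```python
from copy import deepcopy
from typing import Any, Dict, Optional

# Hypothetical production defaults a platform team might ship.
PLATFORM_DEFAULTS: Dict[str, Any] = {
    "replicas": 3,
    "resources": {"cpu": "500m", "memory": "512Mi"},
    "ingress": {"enabled": True, "tls": True},
}


def _merge(base: Dict[str, Any], extra: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of base with extra layered on top, merging nested dicts."""
    merged = deepcopy(base)
    for key, value in extra.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = _merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def render_config(overrides: Optional[Dict[str, Any]] = None) -> Dict[str, Any]:
    """New teams call this with no arguments; mature teams pass only what they change."""
    return _merge(PLATFORM_DEFAULTS, overrides or {})
```

A new team would call `render_config()` and get the defaults as they are, while a more experienced team could call `render_config({"resources": {"memory": "2Gi"}})` and change a single value without losing the rest.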

This is the strength that we gain from a community like the CNCF, so we should use it to build great platform products that stay true to the core of open source and strive to make our product teams as happy and successful as possible. And remember to be happy with ourselves as successful and appreciated platform teams.