Oct 1, 2018
So you’ve finally succeeded in convincing your organization to use Kubernetes and you’ve even gotten your first services into production. Congratulations!
You know uptime of your production workloads is of utmost importance so you set up your production cluster(s) to be as reliable as possible. You add all kinds of monitoring and alerting, so that if something breaks your SREs get notified and can fix it with the highest priority.
But this is expensive and you want to have staging and development clusters, too - maybe even some playgrounds. And as budgets are always tight, you start thinking…
What’s with DEV? Certainly can’t be as important as PROD, right? Wrong!
The main goal with all of these nice new buzzwordy technologies and processes was Developer Productivity. We want to empower developers and enable them to ship better software faster.
But if you put less importance on the reliability of your DEV clusters, you are basically saying “It’s ok to block my developers”, which indirectly translates to “It’s ok to pay good money for developers (internal and external) and let them sit around half a day without being able to work productively”.
Ah yes, the SAP DEV Cluster is also sooo important because of that many external and expensive consultants. Fix DEV first, than PROD which is earning all the money.
— Andreas Lehr (@shakalandy) September 13, 2018
Furthermore, no developer likes to hear that they are less important than your customers.
We consider our dev cluster a production environment, just for a different set of users (internal vs external).
— Greg Taylor (@gctaylor) September 18, 2018
Let’s look at some of the issues you could run into when putting less importance on DEV, and the impact they might have.
I did not make these up. We’ve witnessed all of them happen over the last 2+ years.
Your nicely built CI/CD pipeline is now spitting a mountain of errors. Almost all your developers are now blocked, as they can’t deploy and test anything they are building.
This is actually much more impactful in DEV than in production clusters, as in PROD your most important assets are your workloads, and those should still be running when the Kubernetes API is down. That is, if you did not build any strong dependencies on the API. You might not be able to deploy a new version, but your workloads are fine.
Some developers are now blocked from deploying their apps. And if they try (or the pipeline just pushes new versions), they might increase the resource pressure.
Pods start to get killed. Now your priority and QoS classes kick in - you did remember to set those, right? Or was that something that was not important in DEV? Hopefully, you have at least protected your Kubernetes components and critical addons. If not, you’ll see nodes going down, which again increases resource pressure. Thought DEV clusters could do with less buffer? Think again.
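To make the QoS part concrete, here is a minimal sketch of how Kubernetes derives a pod's QoS class from its resource requests and limits. The function name `qos_class` and the plain-dict pod spec are assumptions for illustration, not part of any Kubernetes client library. The sketch is simplified: it ignores init containers and compares quantities as plain strings, so "500m" and "0.5" would not be treated as equal.

```python
def qos_class(pod_spec: dict) -> str:
    """Return the QoS class a pod spec would roughly get (simplified sketch)."""
    any_set = False      # does any container set a request or limit?
    guaranteed = True    # do all containers have requests == limits for cpu and memory?

    for container in pod_spec.get("containers", []):
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # Requests that are not set explicitly default to the limits.
        requests = {**limits, **resources.get("requests", {})}

        if limits or requests:
            any_set = True
        for name in ("cpu", "memory"):
            if name not in limits or requests.get(name) != limits[name]:
                guaranteed = False

    if not any_set:
        return "BestEffort"   # first in line to be evicted under resource pressure
    return "Guaranteed" if guaranteed else "Burstable"


if __name__ == "__main__":
    pod = {"containers": [{"name": "app", "resources": {"requests": {"cpu": "100m"}}}]}
    print(qos_class(pod))
```

Pods without any requests or limits end up as BestEffort, which is often what happens in DEV when nobody bothers to set them.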
This sadly happens much more in DEV because of two things:
DEV
In most clusters, CNI and DNS are critical to your workloads. If you use an Ingress Controller to access them, then that also counts as critical. You’re really cutting edge and are already running a service mesh? Congratulations, you added another critical component (or rather a whole bunch of them - looking at you Istio).
Now if any of the above starts having issues (and they do partly depend on each other), you’ll start seeing workloads breaking left and right, or, in the case of the Ingress Controller, no longer being reachable from outside the cluster. This might sound small on the impact scale, but just looking at our past postmortems, I must say that the Ingress Controller (we run the community NGINX variant) has the biggest share of them.
A multitude of thinkable and unthinkable things can happen and lead to one of the scenarios above.
Most often we’ve seen issues arising because of misconfiguration of workloads. Maybe you’ve seen one of the below (the list is not exhaustive).
Sharing DEV with a lot of teams? Gave each team cluster-admin rights? You’re in for some fun. We’ve seen pretty much anything, from “small” edits to the Ingress Controller template file, to someone accidentally deleting the whole cluster.
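If you want to find out who actually holds cluster-admin in a shared cluster, a small sketch like the following can help. It assumes you have the ClusterRoleBinding objects as dicts, for example the `items` list from `kubectl get clusterrolebindings -o json`. The function name `cluster_admin_subjects` is made up for this example.

```python
def cluster_admin_subjects(bindings: list[dict]) -> list[str]:
    """List every subject bound to the cluster-admin ClusterRole."""
    found = set()
    for binding in bindings:
        if binding.get("roleRef", {}).get("name") != "cluster-admin":
            continue
        # "subjects" can be missing or null on a binding, so fall back to an empty list.
        for subject in binding.get("subjects") or []:
            found.add(f'{subject.get("kind")}/{subject.get("name")}')
    return sorted(found)


if __name__ == "__main__":
    # Hypothetical sample data, shaped like ClusterRoleBinding objects.
    sample = [
        {"roleRef": {"kind": "ClusterRole", "name": "cluster-admin"},
         "subjects": [{"kind": "Group", "name": "team-a"}]},
        {"roleRef": {"kind": "ClusterRole", "name": "view"},
         "subjects": [{"kind": "User", "name": "alice"}]},
    ]
    print(cluster_admin_subjects(sample))
```

A long list here is usually a hint that access in DEV was handed out more generously than anyone would accept in PROD.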
If it wasn’t clear from the above: DEV clusters are important!
Just consider this: If you use a cluster to work productively, then its reliability should be considered about as important as that of PROD.
DEV clusters usually need to be reliable at all times. Keeping them reliable only during business hours is tricky. First, you might have distributed teams and externals working at odd hours. Second, an issue that happens during off-hours might just get bigger and then take longer to fix once business hours start. The latter is one of the reasons why we always do 24/7 support, even if we could offer business-hours-only support at a cheaper price.
Some things you should consider (not only for DEV):
cluster-admin
credentials.DEV
.DEV
or spin up clusters for development by themselves.
Why don’t devs have the capability to rebuild dev 🤷♂️
— Chris Love (@chrislovecnm) September 14, 2018
If you really need to save money, you can experiment with downscaling in off-hours. If you’re really good at spinning up or rebuilding DEV, i.e. have it all automated from cluster creation to app deployments, then you could experiment with “throw-away-clusters”, i.e. clusters that get thrown away at the end of the day and are started anew shortly before business hours.
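As a rough sketch of the off-hours downscaling idea, the decision itself can be as simple as the function below. The name `desired_replicas`, the 07:00 to 19:00 window, and the Monday-to-Friday rule are all assumptions for illustration. A real setup would also need to account for time zones and for the distributed teams and externals mentioned above.

```python
from datetime import datetime, time


def desired_replicas(now: datetime, normal: int,
                     start: time = time(7, 0), end: time = time(19, 0)) -> int:
    """Return the normal replica count during business hours, otherwise 0."""
    is_weekday = now.weekday() < 5          # Monday is 0, Sunday is 6
    in_hours = start <= now.time() < end    # assumed business-hours window
    return normal if (is_weekday and in_hours) else 0


if __name__ == "__main__":
    # Ask what the replica count should be right now for a workload that normally runs 3.
    print(desired_replicas(datetime.now(), normal=3))
```

Whatever applies the result (a cron job, a controller, a pipeline step) should run shortly before the window opens, so nobody starts the day waiting for workloads to come back.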
Whatever you decide, please, please, please, do not block your developers. They will be much happier, and you will get better software, believe me.
P.S. Thanks to everyone responding and giving feedback on Twitter!
Image attribution:
Image created using https://xkcd-excuse.com/ by Mislav Cimperšak. Original image created by Randall Munroe from XKCD. Released under Creative Commons Attribution-NonCommercial 2.5 License.