Scaling for more users by allowing less

by Oliver Thylmann on Mar 27, 2015

We had a lot of deep discussions at the office over the last few days. Yes, we are creative thinkers, startup people to the core, always agile and flexible, but … we have mental models too, and they can get you stuck. We were in our tiny little box.


A little background is in order. A few weeks ago we started to invite more and more users onto the platform, and we have been getting great feedback. All of this is happening on a shared cluster, with a steadily rising number of containers. These include the user apps themselves as well as containers for storage and the ambassadors that connect them. We love it, and users seem to love it, too. With each passing day, we are getting closer to building the premium microservice infrastructure for others to build their architectures on, one step closer to allowing developers to not think about servers anymore.

Then we hit a wall. It seems that beyond a certain number of containers in one cluster, all tied together with fleet and etcd, the orchestration starts acting up and things fall apart. Some of our bigger customers are using a private cluster, so they are oblivious to this. We do want to invite more users to play with the platform though, especially knowing that many of those who do lead to bigger customers. But with performance severely degraded and maintenance work on that shared cluster skyrocketing, we decided to halt inviting people to the platform and fix the problem. Standard in-the-box thinking.

As mentioned above, we are working with CoreOS tools like fleet and etcd for parts of our platform, and while we know what to fix, we want to make sure that we are on the right track and that the roadmaps of CoreOS and ourselves are aligned. That takes more time. Still, we want people to play with our platform. The box is getting tighter.

After much back and forth and thoughts about futile, tiny, short-term fixes, we took a step back and looked at our core goals for the next months. These revolve around getting the right customers on board, building up the community and getting the shared cluster production ready. For the first two goals, we need people playing with the platform. That creates more ops work, which lets us invite fewer users and means we need more time to bring the cluster to production stability. A vicious circle in a box. But what we found outside of the box was interesting.

Playing with the platform does not mean running something on the platform. There are people running things in production on Giant Swarm, but much of what is taking up resources is people playing with architectures and running test or demo instances.

And most of those people do not stop their applications. Do they really need those apps running? They can restart them again almost instantly. Can we delete the applications? A simple swarm up with the original swarm.json will have apps up and running again. So, what if we put up a message that unused applications are killed once a week, applicable to all apps that have not had developer interactions within the last 7 days? After a bit of soul searching we found that:

  • It does not keep people from playing with the platform.
  • It reduces the developer time used up for operations.
  • It probably leads to more customers, as we can invite more people.

The power of the AND instead of the tyranny of the OR. It's no longer a choice between this or that; we can do everything we wanted with a simple change that most likely nobody will mind. And for those who do want their apps running no matter what, just put a domain other than gigantic.io into your swarm.json and we will not touch it. We might still contact you eventually to question you about pricing strategies.


Here is what you should know as a user about our cleanup actions:


Starting next Monday, we will be looking at apps that were started more than 7 days ago. If they have no custom domain name attached, we will delete them.
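
To make the rule concrete, here is a minimal sketch in Python of how such a check could look. It is not our actual cleanup job: the App record, its field names and the helper functions are hypothetical and only illustrate the two conditions, namely an app started more than 7 days ago and no domain outside gigantic.io.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=7)       # apps started longer ago than this are considered
PLATFORM_DOMAIN = "gigantic.io"   # domains under this one do not count as custom


@dataclass
class App:
    # Hypothetical record; the platform's real data model is not shown in this post.
    name: str
    started_at: datetime
    domains: list[str] = field(default_factory=list)


def has_custom_domain(app: App) -> bool:
    """Return True if at least one domain lies outside gigantic.io."""
    return any(
        d != PLATFORM_DOMAIN and not d.endswith("." + PLATFORM_DOMAIN)
        for d in app.domains
    )


def is_cleanup_candidate(app: App, now: datetime) -> bool:
    """Started more than 7 days ago and no custom domain attached."""
    return (now - app.started_at) > MAX_AGE and not has_custom_domain(app)
```

Under these assumptions, a demo app started eight days ago with only a gigantic.io domain would be a candidate, while an app of any age with its own domain, or any app started within the last week, would be left alone.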

In case one of your applications is affected, you can simply start your application again from your local swarm.json using the swarm up command.

We won't start deleting your apps before April 1 (no joke intended). This leaves ample time for anyone to point us to the gaping hole in our thinking and make us rethink the entire thing. Looking forward to hearing from you in any case. We love stories about what you do with the Giant.

Update: An older version of this post said we’d only stop your app, but after some discussion, and because stopped apps just kept hanging around, we decided to delete them instead. They can easily be started again.