Step by Step Towards Zero Downtime Deployment - Rolling Updates are Here

by Puja Abbassi on Jun 9, 2015


Update: With swarm CLI 0.18.0 we introduced different update strategies that you can choose from. The one described in this blog post is the default one-by-one strategy. There’s another strategy called hot-swap, which gives you zero downtime no matter how many instances your component has. We are continuously improving our update strategies.

Zero Downtime Deployment is one of the promises that microservices architectures and container technologies make. However, both only get you part of the way there; actually doing a rolling update without your application going down at all is still not trivial. Besides a lot of prerequisites, there is often quite some manual work involved. On Giant Swarm it’s now possible with a single command.

Until recently, we had two ways to do rolling updates:

The first option, which we used in our documentation deployment, was to have two or more components deployed with basically the same image, e.g. “content-master” and “content-slave”. In front of these two components you would put a proxy or load balancer to serve their content. A rolling update would then look like the following (staying with the above-mentioned example):

  1. Push a new image for your container.
  2. swarm update content-master and wait for it to be up again.
  3. swarm update content-slave and wait for it to be up again.

Now this is already very easy, but it only works in very few cases, such as our documentation, which is based on Hugo. This was and still is available to all our users.

The second option, which would work for many more use cases, was not available publicly and involved getting your hands dirty with fleet. This is the option we previously used for updating our website with zero downtime. Our website deployment involves a Flask app, which is scaled to at least three instances. Doing a rolling update would involve the following steps:

  1. SSH into the cluster that it runs on.
  2. Find the units corresponding to the single instances of the component.
  3. Stop one unit and wait for it to go down.
  4. Start the unit again and wait for it to come up.
  5. Repeat 3. and 4. for all units.

One problem with this was that it took quite some time and manual work to stop, wait, start, and wait again for each instance, and the effort only grew with the number of instances of a component. Another problem was that you needed access to the underlying cluster. First, that is risky, as you suddenly have access to a lot of things that can break your infrastructure. Second, it is not something we would want to bother our users with, as our goal is to abstract away said infrastructure (remember?).
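The manual procedure above boils down to a simple loop. The following is only a minimal sketch of that loop, not the code Giant Swarm runs: the function name, the callback parameters, and the unit names in the demo are all hypothetical, and the callbacks stand in for whatever you would actually use to stop, start, and check a unit.

```python
from typing import Callable, Iterable, List


def rolling_restart(
    units: Iterable[str],
    stop: Callable[[str], None],
    start: Callable[[str], None],
    wait_down: Callable[[str], None],
    wait_up: Callable[[str], None],
) -> List[str]:
    """Restart units strictly one at a time.

    All four callbacks are assumptions made for this sketch; they are
    placeholders for the real stop/start/wait operations on a unit.
    Returns the units in the order they were restarted.
    """
    restarted = []
    for unit in units:
        stop(unit)       # step 3: stop one unit ...
        wait_down(unit)  # ... and wait for it to go down
        start(unit)      # step 4: start the unit again ...
        wait_up(unit)    # ... and wait for it to come up
        restarted.append(unit)
    return restarted


if __name__ == "__main__":
    # Demo with recording callbacks instead of real cluster operations.
    log: List[str] = []
    rolling_restart(
        ["web-1", "web-2", "web-3"],  # hypothetical unit names
        stop=lambda u: log.append(f"stop {u}"),
        start=lambda u: log.append(f"start {u}"),
        wait_down=lambda u: log.append(f"wait until {u} is down"),
        wait_up=lambda u: log.append(f"wait until {u} is up"),
    )
    print("\n".join(log))
```

Because only one unit is down at any time, the remaining instances keep serving traffic, which is also why the total duration grows linearly with the number of instances.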

Thus, we are happy to announce that we have now automated the whole process and that rolling updates are available to all users of Giant Swarm, without them even having to update their client. It’s actually one single command and it’s already there in our CLI:

$ swarm update mycomponent

As simple as that! Do swarm update on any component that is scaled to more than one instance and you automagically get a rolling update. Ideally this deploys your updated image to the component with zero downtime. However, this is just a first step towards a more sophisticated implementation, so there might be short downtimes with only a few (2 or 3) instances and/or with images that take a long time to start.

“What if I have a component that has only one instance?”, you say? Here, swarm update will just do a regular update of that component to an updated image. However, if your component is stateless, you can do

$ swarm scaleup mycomponent 3
$ swarm update mycomponent
$ swarm scaledown mycomponent 3

and again you get zero downtime deployment (be sure to increase instances by a higher number if you fear your image will take more time starting up). Try it out!
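If you do this often, the three commands can be generated by a small helper. This is just a sketch under stated assumptions: the function name and its `extra_instances` parameter are made up for illustration, and it only builds the command lines shown above rather than calling the swarm CLI itself.

```python
from typing import List


def zero_downtime_update_commands(
    component: str, extra_instances: int = 3
) -> List[List[str]]:
    """Build the scale up / update / scale down sequence for one component.

    Hypothetical helper: it only assembles the three commands from this
    post. The same count is used for scaling up and down, so the component
    ends with as many instances as it started with.
    """
    if extra_instances < 1:
        raise ValueError("extra_instances must be at least 1")
    count = str(extra_instances)
    return [
        ["swarm", "scaleup", component, count],
        ["swarm", "update", component],
        ["swarm", "scaledown", component, count],
    ]


if __name__ == "__main__":
    # Print the commands instead of executing them.
    for command in zero_downtime_update_commands("mycomponent", 3):
        print("$ " + " ".join(command))
```

To actually run the sequence you could pass each list to `subprocess.run(command, check=True)`, which stops at the first failing step so that a failed update is not followed by a scale down.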

To be clear, this is an early implementation of the feature and there are still some use cases that it doesn’t support, but it’s our first step towards making zero downtime deployments easier for you. If you have special use cases or scenarios that you think we should consider in the future, please tell us about them in our Gitter group, on the IRC channel #giantswarm on freenode, or via email.