Oct 6, 2023
While the general trend of the last decade has been to move workloads to the cloud, along with major companies jumping on the bandwagon to offer their own managed services, the industry isn’t ready to relegate on-premises infrastructure to the closet just yet.
Many organizations keep running their own Software-Defined Datacentre (SDDC) and will do so for the foreseeable future, for reasons such as sovereignty, low-latency requirements, data privacy, or simply because they are cloud providers selling IaaS capacity on VMware Cloud Director. On top of that, companies that already have solid investments in their on-premises SDDC, along with the right professionals to run it, aren’t going to start over for the sake of moving to the cloud.
With that said, running an on-premises infrastructure isn’t a walk in the park, and Kubernetes is a whole different beast introduced to the zoo. That shouldn’t mean these customers have to lose their sanity to adopt the cloud native ecosystem, and this is where Giant Swarm jumps in by offering Kubernetes on VMware Cloud Director!
Let’s first take a look at today’s star of the show and how we got to writing this piece.
VMware Cloud Director (VCD) is a cloud provider solution (much like OpenStack) built on VMware’s SDDC stack, featuring components such as vSAN for storage, NSX-T for networking, NSX Advanced Load Balancer (ALB) for load balancing, and of course, vSphere/ESXi for compute. VCD is essentially a management layer added on top of these bricks that allows organizations to sell resources to internal and external customers in various fashions, such as Pay-As-You-Go or Reserved…
On top of these, VMware offers Kubernetes on VMware Cloud Director with Tanzu and the Container Service Extension (CSE) plugin for VCD. However, we found that customers exploring this route face the same challenges as those looking at managed Kubernetes offerings from major hyperscalers.
Giant Swarm plugs these holes by offering a full-featured, end-to-end managed cloud native platform for Kubernetes on VMware Cloud Director.
Cluster API Provider VMware Cloud Director (CAPVCD) is a provider for Cluster API (CAPI) to support deploying Kubernetes on VMware Cloud Director. This upstream Open Source project was initiated by VMware in mid-2021 in an effort to upgrade their CSE plugin in the future. Back then, the GitHub repo had only a dozen commits but the core features were there.
Fast forward a few months to the early days of 2022, when Giant Swarm started investigating this new provider. The objective was twofold: to identify how it could solve pain points for our valued VCD customers and to understand the specifics of their unique environments.
What makes working with on-premises infrastructure so fun (depending on who you ask) is the variety of environments you come across. No two customers have the same infra, and there are always a few exotic quirks that need to be accounted for.
During the exploration phase of our customers’ requirements, it became apparent that the project would require a few tweaks. Enter the shining attribute of CAPVCD’s Open Source nature, which allowed us to add capabilities to the product that wouldn’t have been possible otherwise. We contributed quite a bit to the project so that we could run Kubernetes on VMware Cloud Director with our customers’ environment requirements in mind.
A few months after starting the exploration phase, and in collaboration with the maintainers, we made our first few additions to the product.
Along with the ability to connect the nodes to several networks, we use postKubeadmCommands to easily configure static routes in our cluster chart.
In this excerpt of values.yaml, we connect three additional networks to each node and specify static routes to go along with them. Note that we call them additional networks because there’s already an existing field that sets the first network card where the default gateway will be located (.providerSpecific.ovdcNetwork).
connectivity:
  network:
    extraOvdcNetworks:
      - MY_network_x
      - MY_network_y
      - MY_network_z
    loadBalancers:
      vipSubnet: 10.205.9.254/24
    staticRoutes:
      # Routes out MY_network_x
      - destination: 10.30.0.0/16
        via: 10.80.80.1
      - destination: 10.90.0.0/16
        via: 10.80.80.1
      # Routes out MY_network_y
      - destination: 172.31.192.0/19
        via: 172.32.1.5
      - destination: 10.200.200.0/18
        via: 172.32.1.5
      # Routes out MY_network_z
      - destination: 10.0.0.0/8
        via: 172.32.85.254
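Under the hood, these values end up as node bootstrap commands. Here’s a simplified sketch of what the rendered output could look like on the control plane (the exact manifests produced by cluster-cloud-director may differ); each staticRoutes entry becomes an ip route add executed once kubeadm has finished:

# Simplified sketch only -- the manifests actually rendered by the cluster chart may differ
kind: KubeadmControlPlane
spec:
  kubeadmConfigSpec:
    postKubeadmCommands:
      # one "ip route add" per staticRoutes entry, via the gateway on the matching extra network
      - ip route add 10.30.0.0/16 via 10.80.80.1
      - ip route add 10.90.0.0/16 via 10.80.80.1
      - ip route add 172.31.192.0/19 via 172.32.1.5
      - ip route add 10.200.200.0/18 via 172.32.1.5
      - ip route add 10.0.0.0/8 via 172.32.85.254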
Many large organizations enforce naming conventions because of some obscure automation happening somewhere in the background that relies on them. If you’ve worked with Cluster API before, you might know that the names of the provisioned infrastructure objects look something like this: mycluster-worker-64b89864f9xg6h5x-rkh8b. However, this doesn’t cut it with existing naming conventions (unless you are veeeery lucky).
In the example below, the virtual machines backing the Kubernetes nodes will be named giantswarm-xxxxx.
providerSpecific:
  vmNamingTemplate: giantswarm-
Giant Swarm supports multiple Cluster API providers such as AWS, Azure, GCP, vSphere, and OpenStack. For us, deploying a Kubernetes cluster to VMware Cloud Director works the same way as on any of these providers, with the exception of provider-specific parameters, of course.
A cluster is made up of two different Giant Swarm apps, which we will explore below. If you want to know more about Giant Swarm apps, check out our App Platform.
Based on our cluster-cloud-director Helm chart, this app defines what the cluster should look like. The configuration is stored in a ConfigMap and lets you tune many parameters. While Giant Swarm sets sensible defaults for many of them, everything remains fully configurable.
The cluster app automatically installs Cilium as the Container Network Interface (CNI), the VCD Cloud Provider Interface (CPI), and CoreDNS using HelmReleases.
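For illustration, such a HelmRelease looks roughly like the one below; the chart version, repository name, and kubeconfig secret are placeholders rather than the exact values the cluster app renders:

apiVersion: helm.toolkit.fluxcd.io/v2beta1
kind: HelmRelease
metadata:
  name: gs-test-cilium
  namespace: org-giantswarm
spec:
  interval: 5m
  chart:
    spec:
      chart: cilium
      version: "x.y.z"                  # placeholder
      sourceRef:
        kind: HelmRepository
        name: giantswarm-catalog        # placeholder
  kubeConfig:
    secretRef:
      name: gs-test-kubeconfig          # placeholder: deploys into the workload cluster
  targetNamespace: kube-system
  values:
    # CNI configuration rendered from the cluster app's values
    ipam:
      mode: kubernetes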
Here’s an example of a minimalist cluster app and its configuration values file. In these values, you can see the various VCD-specific settings, such as the sizing policy to use, the template and catalog to deploy from, the OS disk size (which must be greater than the template’s), the load balancer gateway’s CIDR, the VCD endpoint, the OVDC (Organization Virtual Datacenter), the ovdcNetwork for the NIC where the default gateway will live, and so on…
---
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  name: gs-test
  namespace: org-giantswarm
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    app.kubernetes.io/name: cluster-cloud-director
spec:
  catalog: cluster
  extraConfigs:
  kubeConfig:
    inCluster: true
  name: cluster-cloud-director
  namespace: org-giantswarm
  userConfig:
    configMap:
      name: gs-test-user-values
      namespace: org-giantswarm
  version: 0.13.0
---
apiVersion: v1
kind: ConfigMap
metadata:
  labels:
    app-operator.giantswarm.io/watching: "true"
    cluster.x-k8s.io/cluster-name: gs-test
  name: gs-test-user-values
  namespace: org-giantswarm
data:
  values: |-
    baseDomain: "test.gigantic.io"
    controlPlane:
      catalog: giantswarm
      replicas: 1
      sizingPolicy: m1.xlarge
      template: ubuntu-2004-kube-v1.24.10
      diskSizeGB: 30
      oidc:
        issuerUrl: https://dex.gs-test.test.gigantic.io
        clientId: "dex-k8s-authenticator"
        usernameClaim: "email"
        groupsClaim: "groups"
    connectivity:
      network:
        loadBalancers:
          vipSubnet: 10.205.9.254/24
      proxy:
        enabled: true
    nodePools:
      worker:
        class: default
        replicas: 1
    providerSpecific:
      site: "https://cd.neoedge.cloud"
      org: "GIANT_SWARM"
      ovdc: "Org-GIANT-SWARM"
      ovdcNetwork: "LS-GIANT-SWARM"
      nodeClasses:
        default:
          catalog: giantswarm
          sizingPolicy: m1.2xlarge-new
          template: ubuntu-2004-kube-v1.24.10
          diskSizeGB: 60
      userContext:
        secretRef:
          secretName: vcd-credentials
      vmNamingTemplate: giantswarm-
    metadata:
      description: "Testing Cluster"
      organization: giantswarm
    internal:
      kubernetesVersion: v1.24.10+vmware.1
In order to enforce the installation of a set of apps in all the clusters of a specific provider, we leverage the concept of App-of-Apps to simplify lifecycle management. Essentially, we install an app in the management cluster, which in turn installs all the required apps in the workload cluster. We call it Default-Apps, and it is based on our default-apps-cloud-director Helm chart.
You can find the list of all the apps that will be installed in the cluster as part of the default apps in the values.yaml file, which is also where you would change the configuration of the apps themselves.
In the example below, we configure the default apps to set HTTP/HTTPS proxy variables and add a secret used by cert-manager, which contains AWS Route53 credentials to solve DNS01 challenges. These two are typically used in combination in private clusters.
apiVersion: application.giantswarm.io/v1alpha1
kind: App
metadata:
  labels:
    app-operator.giantswarm.io/version: 0.0.0
    app.kubernetes.io/name: default-apps-cloud-director
  name: gs-test-default-apps
  namespace: org-giantswarm
spec:
  catalog: cluster
  kubeConfig:
    inCluster: true
  name: default-apps-cloud-director
  namespace: org-giantswarm
  userConfig:
    configMap:
      name: gs-test-default-apps-user-values
      namespace: org-giantswarm
  version: 0.6.0
---
apiVersion: v1
data:
  values: |
    clusterName: gs-test
    organization: giantswarm
    managementCluster: glasgow
    userConfig:
      certManager:
        configMap:
          values: |
            controller:
              proxy:
                noProxy: 10.80.0.0/13,10.90.0.0/11,test.gigantic.io,cd.neoedge.cloud,svc,127.0.0.1,localhost
                http: http://10.100.100.254:3128
                https: http://10.100.100.254:3128
    apps:
      certManager:
        extraConfigs:
          - kind: secret
            name: gs-test-cert-manager-user-secrets
kind: ConfigMap
metadata:
  labels:
    app-operator.giantswarm.io/watching: "true"
    cluster.x-k8s.io/cluster-name: gs-test
  name: gs-test-default-apps-user-values
  namespace: org-giantswarm
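The extraConfigs entry above points at a plain Kubernetes Secret holding additional Helm values for the cert-manager app. As a purely hypothetical sketch (the key layout inside values depends on how the cert-manager app consumes these values, and the credentials shown are dummies), it could look something like this:

apiVersion: v1
kind: Secret
metadata:
  name: gs-test-cert-manager-user-secrets
  namespace: org-giantswarm
stringData:
  values: |
    # hypothetical layout: credentials consumed by the Route53 DNS01 solver
    dns01:
      route53:
        region: eu-central-1
        accessKeyID: AKIAXXXXXXXXXXXXXXXX
        secretAccessKey: "dummy-secret-access-key"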
Since we have a multitude of clusters to manage for our customers, Giant Swarm is committed to using and improving its GitOps framework, and it fits just right with Cluster API. We rely on the open source project Flux CD to control the desired state of clusters across our customers' fleets, all orchestrated from the simplicity of a few Git repositories.
Gone are the days of clicking through the VCD UI or issuing imperative commands to deploy or manage clusters. Pull Requests (PRs) are the new cool kids on the block, and the benefits are countless.
Our GitOps framework allows our customers to create Kubernetes clusters on VMware Cloud Director by committing a few files that can be copied from existing clusters. The structure offers a base layer, to which a number of overlays can be added to customize clusters and their respective apps independently from each other.
You can find out more about this by looking at our gitops-template repository, which contains the structure we use for both our customers and our own clusters.
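As a simplified illustration of that flow (the repository URL, branch, and path below are placeholders loosely following the gitops-template layout), Flux only needs a GitRepository pointing at the fleet repo and a Kustomization reconciling the cluster definitions from it:

apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: workload-clusters-fleet        # placeholder
  namespace: flux-system
spec:
  interval: 1m
  url: https://github.com/example-org/workload-clusters-fleet   # placeholder
  ref:
    branch: main
  secretRef:
    name: github-credentials           # placeholder
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: clusters
  namespace: flux-system
spec:
  interval: 1m
  prune: false
  sourceRef:
    kind: GitRepository
    name: workload-clusters-fleet
  path: ./management-clusters/glasgow   # placeholder path into the repo structure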
We advise all our customers, regardless of the infrastructure provider they use, to embrace GitOps for all these reasons, and so far, the feedback has been nothing but great. Through PR reviews, customers gain peace of mind while picking up valuable insights whenever collaboration is needed to fix or comment on something.
We’ve been iterating our CAPVCD integration for a while now and it’s been a success marked by fine-tuning the deployment processes, upgrade procedures, and feature additions. Our customers are happy with the result and keep helping us improve the product through technical feedback and feature requests, which we enthusiastically welcome.
While we already have a solid production-ready way to manage Kubernetes on VMware Cloud Director, our roadmap is packed with cool enhancements that will keep our customers and ourselves on that upward development trend we’ve been riding for the past year and a half.