Feb 1, 2021
At Giant Swarm, we live the DevOps life. This means we run what we build. The twist is that we manage it mostly for people in organizations outside of ours, though we pride ourselves on dogfooding too.
As the install base we are looking after grows and we manage more and more clusters, we are finding the value of silence, as in silencing alerts. Since it may come off as odd that a managed service company silences alerts, let me provide some context.
At the time of writing this post, Giant Swarm runs close to 200 clusters on 25 installations. These clusters are used by different customers for different purposes, so not all of them need to be monitored closely all the time. A silence would typically apply for a limited amount of time and to a specific use case. In general, we were looking for a systematic way to control silences and manage their expiration, as applicable to the use case. Using a Custom Resource (CR) and keeping things in GitHub helps us keep track.
Before developing something ourselves, we typically look upstream. Maybe someone in the community is trying to solve a similar need. A quick look at the feature requests on the Prometheus Operator repo shows a request for creating Alertmanager silences via CRD. The use cases given there are:
Let’s dive a little deeper into our use cases. Some general examples would be:
A hypothetical setup would look like the diagram below:
Some typical silences we would want to have:
Now, take the diagram above and multiply it by ~100 (since we are currently running ~200 clusters) and that’s a whole lot of moving parts to keep track of.
The current solution we have created to manage Alertmanager silences is giantswarm/silence-operator.
The silence-operator monitors the Kubernetes API server for changes to Silence objects and ensures that the current Alertmanager silences match these objects. The operator reconciles the Silence Custom Resource Definition (CRD).
An example Silence CR:
apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: test-silence1
spec:
  targetTags:
  - name: installation
    value: kind
  - name: provider
    value: local
  matchers:
  - name: cluster
    value: test
    isRegex: false
There is no expiration date. As long as the CR exists, the matching alerts stay silenced in Alertmanager.
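The reconciliation described above can be pictured as a set comparison between the CRs in the cluster and the silences in Alertmanager. The sketch below is a simplified, in-memory illustration and not the operator's actual code: `plan_reconcile` and its arguments are hypothetical names, and identifying a silence by its CR name is an assumption made for the example.

```python
def plan_reconcile(cr_names, silence_names):
    """Compare desired state (Silence CRs) with actual state (silences).

    Returns two sorted lists: silences to create and silences to expire.
    Hypothetical helper; keying silences by CR name is an assumption.
    """
    desired = set(cr_names)
    actual = set(silence_names)
    to_create = sorted(desired - actual)  # CR exists, silence is missing
    to_expire = sorted(actual - desired)  # silence exists, CR was deleted
    return to_create, to_expire


if __name__ == "__main__":
    create, expire = plan_reconcile(
        cr_names=["test-silence1", "test-silence"],
        silence_names=["test-silence", "old-silence"],
    )
    print("create:", create)
    print("expire:", expire)
```

In this model, deleting a CR moves its silence into the expire list on the next pass, which is why the CR itself does not need an expiration date.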
The targetTags field defines a list of tags, which the sync command uses to match CRs to a specific environment. For example, with the raw CR stored in /folder/cr.yaml, run:
silence-operator sync --tag installation=kind --tag provider=local --dir /folder
The matchers field corresponds to the Alertmanager alert matchers.
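To make the role of targetTags more concrete, here is a minimal sketch of how tags passed via --tag could be compared with a CR's targetTags. It assumes plain equality on every entry; the real sync command may match differently (for example, with regular expressions). The helper names `parse_tags` and `cr_targets_environment` are hypothetical.

```python
def parse_tags(tag_args):
    """Turn ["installation=kind", "provider=local"] into a dict."""
    tags = {}
    for arg in tag_args:
        name, _, value = arg.partition("=")
        tags[name] = value
    return tags


def cr_targets_environment(target_tags, env_tags):
    """Return True if every targetTags entry equals the environment's tag.

    Assumption: plain equality on every entry; the real operator may differ.
    Under this assumption an empty targetTags list matches every environment.
    """
    return all(env_tags.get(t["name"]) == t["value"] for t in target_tags)


if __name__ == "__main__":
    env = parse_tags(["installation=kind", "provider=local"])
    target_tags = [
        {"name": "installation", "value": "kind"},
        {"name": "provider", "value": "local"},
    ]
    print(cr_targets_environment(target_tags, env))
```

Under this assumption, a CR tagged for another installation or provider would simply be skipped when syncing this environment.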
As mentioned above, we have very specific needs around silencing different alerts and managing the silencing history. Even if you don’t need to sync a Git repo of silences into your Kubernetes clusters, you can use the operator with minimal CRs. See the example below:
apiVersion: monitoring.giantswarm.io/v1alpha1
kind: Silence
metadata:
  name: test-silence
spec:
  targetTags: []
  matchers:
  - name: cluster
    value: test
    isRegex: false
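With targetTags left empty, the matchers alone decide which alerts a silence covers. The sketch below illustrates how a matcher list such as the one above could be evaluated against an alert's labels. It assumes Alertmanager-style semantics (all matchers must match, a missing label counts as an empty string, and regex matchers must match the whole value). It uses Python's `re` module, whose syntax differs in places from the regex engine Alertmanager uses, and the alert names are made up for the example.

```python
import re


def matcher_matches(matcher, labels):
    """Check one matcher against an alert's labels.

    A missing label is treated as an empty string; regex matchers must
    match the whole label value, not just a part of it.
    """
    value = labels.get(matcher["name"], "")
    if matcher.get("isRegex", False):
        return re.fullmatch(matcher["value"], value) is not None
    return value == matcher["value"]


def silence_applies(matchers, labels):
    """A silence applies only if all of its matchers match the alert."""
    return all(matcher_matches(m, labels) for m in matchers)


# Matchers taken from the minimal CR; the alerts are hypothetical.
matchers = [{"name": "cluster", "value": "test", "isRegex": False}]
alert_a = {"alertname": "PodCrashLooping", "cluster": "test"}
alert_b = {"alertname": "PodCrashLooping", "cluster": "prod"}

print(silence_applies(matchers, alert_a))  # True
print(silence_applies(matchers, alert_b))  # False
```

Setting isRegex to true would let a single CR cover several clusters, for example with a value such as `test.*`.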
For more information about the operator, please visit the repo or contact us.