Part 5: Traces of Your Microservices

by Łukasz Piątkowski on Sep 23, 2020


Introduction


Welcome to the fifth part of our blog series on how to develop, deploy, and debug microservices running in the cloud-native stack. In part four, we showed how to use Linkerd, Grafana, and Loki to check the metrics and logs of our application. However, observability is built on three data sources: logs, metrics, and traces, so in this part we'll cover the missing element: tracing.


What is tracing?


Tracing is probably the least-known data source in the observability area.

The problem: debugging an application when it's composed of multiple different services interacting with each other to handle a request.

The solution: trace the passage of a request using a unique set of HTTP headers that are passed between all the interacting services.

The first service on the request's path is responsible for initiating the trace: it inserts an HTTP header carrying a unique trace ID assigned to that request. Every service or component that then forwards or processes the request extracts the header and reports information about the request to the trace-collecting system: the trace ID it belongs to, plus timestamps for when the request arrived at and left the service. Each service also passes the trace ID header on, so the next service in the execution path can extract it again and report its own timing to the trace collector.
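To make this concrete, here is a minimal Go sketch of what that propagation looks like with the B3 headers (X-B3-TraceId, X-B3-SpanId, and friends), using the OpenCensus b3 package that our services will also rely on later. The function and variable names here are just for illustration; in practice, the middleware shown later in this post does this for us automatically:

import (
    "fmt"
    "net/http"

    "go.opencensus.io/plugin/ochttp/propagation/b3"
)

// forwardTraceContext reads the B3 trace context from an incoming request and
// injects the same trace ID into the outgoing request to the next service,
// so both services report their timings under one trace.
func forwardTraceContext(in *http.Request, out *http.Request) {
    format := &b3.HTTPFormat{}
    if sc, ok := format.SpanContextFromRequest(in); ok {
        fmt.Printf("handling request for trace %s\n", sc.TraceID.String())
        format.SpanContextToRequest(sc, out)
    }
}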

As a result, you can investigate a single request passing through your system like the below:




There are two main open-source projects focused on providing tracing functionality: OpenTracing and OpenCensus.

The two projects had a lot of overlap and fortunately decided to merge into the OpenTelemetry project. Unfortunately, OpenTelemetry is still in beta, and although much of it already works and all the main components are in place, some additional libraries we're using here are still available only for OpenCensus, so we'll stick with OpenCensus for now. But remember, OpenTelemetry is the future. 😄



Short version


If you just want our setup to be up and running, execute the following commands (assuming you followed our series and have a configured cluster with the application running, as at the end of part 4):

# update to new branch in our repository
git checkout blog_v2

# upgrade linkerd to 2.7.1
## upgrade helm repo
helm repo update
## upgrade the release
helm upgrade --namespace linkerd -f ./helm/configs/linkerd.yaml --version 0.3.2 linkerd giantswarm-playground/linkerd2-app
## check release status and history
helm history linkerd

# install tracing tools
helm install --namespace tracing -n jaeger-operator --version 0.1.2 giantswarm-playground/jaeger-operator-app
helm install --namespace tracing -n opencensus-collector -f ./helm/configs/opencensus-collector.yaml --version 0.1.3 giantswarm-playground/opencensus-collector-app

# deploy nginx ingress controller
## install
minikube addons enable ingress
## check
kubectl -n kube-system get po -l app.kubernetes.io/name=nginx-ingress-controller
## configure
kubectl -n kube-system apply -f ./config/nginx-ic-configmap.yaml
kubectl -n kube-system rollout restart deployment nginx-ingress-controller

# update giantswarm-todo-app
helm upgrade gs-todo --namespace todo --timeout 90 --version 0.3.1 giantswarm-playground/giantswarm-todo-app

Now you can execute some test requests to create and fetch ToDos:

# find minikube's IP where ingress is running
export IC_IP=`minikube ip`
# create a ToDo
curl -H "Content-Type: application/json" -H "Host: todo.local" -X POST \
  -d '{"Text":"Check tracing"}' \
  http://${IC_IP}/v1/todo
# fetch all ToDos
curl -H "Host: todo.local" http://${IC_IP}/v1/todo

Long version


Update source code from git to the newest version


To switch to the branch prepared for this blog entry, just run the following in your checked out repository directory:

git pull
git checkout blog_v2

Installing tracing tools


Tracing gives us better visibility into our system. But does that mean only the software we've developed ourselves?

As you remember, we're using Linkerd as a service mesh, which injects a networking proxy into each pod included in the mesh. As you can imagine, each proxy adds some delay, so it isn't transparent to our tracing efforts; after all, one of the main points of tracing is to see how much time is spent processing our request across the different services.

Fortunately, service mesh developers are fully aware of this issue. Linkerd can handle the HTTP tracing headers and report its own view of when a request arrives and leaves; in particular, it supports the B3 propagation headers. To make use of that, we first need to upgrade Linkerd to version 2.7.1. In the process, you'll also see how Helm upgrades work and how we can use Helm to check the update history of our app.

helm repo update
# provides Linkerd v2.7.1
helm upgrade --namespace linkerd -f ../giantswarm-todo-app/helm/configs/linkerd.yaml --version 0.3.2 linkerd giantswarm-playground/linkerd2-app
helm history linkerd

Now, run kubectl -n linkerd get pods and wait until all of them are running.

Next, we need to install our trace-collecting tools. We'll use Jaeger, which, alongside Zipkin, is one of the most popular open-source tools in this category.

In our Giant Swarm Playground Catalog for managed apps, we've prepared a jaeger-operator-based managed app, which we can use to easily deploy a development-quality Jaeger setup. Let's run the following command:

helm install --namespace tracing -n jaeger-operator --version 0.1.2 giantswarm-playground/jaeger-operator-app

Besides Jaeger, we'll use one more component called the OpenCensus Collector. It works as a proxy between different trace providers and trace storage systems like Jaeger. We can use it to aggregate tracing traffic before pushing it to the actual trace storage system, but also to translate between different tracing protocols and standards. Again, we have prepared it as a managed app, so installation is as simple as:

helm install --namespace tracing -n opencensus-collector -f ./helm/configs/opencensus-collector.yaml --version 0.1.3 giantswarm-playground/opencensus-collector-app

Adding and configuring Ingress

In previous installments of this series, we were using port-forwarding to access our apiserver service. We did that for simplicity's sake, but obviously that's not how your customers would access the service in a production setup. For that, we need a dedicated component, and the most popular choice is an Ingress controller together with a Kubernetes object called Ingress. In short, the Ingress controller is a proxy web server that is exposed to the outside world as a service and routes external traffic to the services running inside the cluster. The rules describing how to route the traffic are created by cluster users by providing Ingress objects.

One of the most popular implementations of an Ingress controller for Kubernetes is the Nginx Ingress Controller. We will use it here to make our use case more realistic, but also because it can inject the necessary tracing headers into the requests that reach it. That way, we'll get traces starting at the very edge of our system: the Ingress controller.

To run the Nginx Ingress Controller, we can use an addon provided by Minikube. The only thing left is to configure the controller to enable tracing and to point it to our trace sink; we have prepared a ready config file for that.

So, let's enable the addon and configure the controller by running:

# install
minikube addons enable ingress
# check if it is up
kubectl -n kube-system get po -l app.kubernetes.io/name=nginx-ingress-controller
# after it's up, configure
kubectl -n kube-system apply -f ./config/nginx-ic-configmap.yaml
# restart to apply the configuration
kubectl -n kube-system rollout restart deployment nginx-ingress-controller

Enhancing our services


Our infrastructure is ready. This time, though, we can't avoid adding some code to use our new feature. We need to tell our app to extract the tracing headers and forward them in all handled calls. We can also use a concept called a “span” to mark more fine-grained periods of work within a single app. A single trace is composed of multiple spans.

A span can be the time it takes for the application to:

- Decrypt encrypted data

- Deserialize encoded data

- Get data from a database

Additionally, we'll add a debugging solution called zPages to our services. It's an OpenCensus toolkit that allows you to check tracing-related stats locally, even if your trace sink is not working.

The process of adding tracing capabilities to an application is called “instrumentation”. You can see all the changes we made to instrument our app by checking the complete diff. Here, we'll only discuss the most important snippets.


Instrumenting apiserver


The great thing about tracing projects like OpenCensus is that they provide not only the services required to run them, but also a lot of libraries and plugins to use in your code. For apiserver, we can use the ocagent library to initialize our tracing config and start the zPages server, as below:

func initTracing(config *todo.Config) {
    oce, err := ocagent.NewExporter(
        ocagent.WithInsecure(),
        ocagent.WithReconnectionPeriod(5*time.Second),
        ocagent.WithAddress(config.OcAgentHost),
        ocagent.WithServiceName("api-server"))
    if err != nil {
        log.Fatalf("Failed to create ocagent-exporter: %v", err)
    }
    trace.ApplyConfig(trace.Config{DefaultSampler: trace.AlwaysSample()})
    trace.RegisterExporter(oce)

    go func() {
        mux := http.NewServeMux()
        zpages.Handle(mux, "/debug")
        log.Fatal(http.ListenAndServe(":8081", mux))
    }()
}

The configuration part is pretty much self-explanatory: we point the exporter at our OpenCensus Collector service (loaded from the config.OcAgentHost field) and set DefaultSampler to trace.AlwaysSample(). This sampler makes sure that every request's trace is recorded and forwarded to Jaeger.

Under normal circumstances you probably don't want that, as it can overload your trace collector, but it's fine for our demo app. One more thing to note is that we're serving zPages on an additional HTTP endpoint so they don't interfere with the one serving our actual API.
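For reference, switching to probabilistic sampling is a one-line change in initTracing. The sketch below uses the same go.opencensus.io/trace package; the 1% rate is just an arbitrary example value:

// Instead of trace.AlwaysSample(), record only a fraction of requests.
trace.ApplyConfig(trace.Config{
    DefaultSampler: trace.ProbabilitySampler(0.01),
})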

Once the configuration is done, we just need to inject a ready-made HTTP middleware that automatically traces all the HTTP requests coming to our service. That's really nice: we don't have to instrument all the request-handling code ourselves. Thanks to Go's HTTP middleware pattern, all we need is this piece of code:

r.Use(func(handler http.Handler) http.Handler {
    return &ochttp.Handler{
        Handler:          handler,
        IsPublicEndpoint: false,
        Propagation:      &b3.HTTPFormat{},
        IsHealthEndpoint: func(r *http.Request) bool {
            if r.URL.Path == "/ping" {
                return true
            }
            return false
        },
    }
})

Instrumenting todomanager


The configuration and instrumentation of todomanager is not very different from that of apiserver, but here we have a gRPC service, so our HTTP middleware approach won't work. Fortunately, a similar concept exists for the gRPC stack in Go: we can use the ocgrpc package to achieve the same thing.

So, the configuration code is the same as in apiserver; the difference is in how we initialize the middleware that automatically handles the tracing headers for us:

server := grpcserver.NewGrpcServer(func(server *grpc.Server) {
    todomgrpb.RegisterTodoManagerServer(server, todoMgr)
}, &grpcserver.GrpcServerOptions{
    LoggerFields: log.Fields{
        "ver": version,
    },
    AdditionalOptions: []grpc.ServerOption{
        grpc.StatsHandler(&ocgrpc.ServerHandler{}),
    },
})
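That covers the server side. For the gRPC calls that apiserver makes to todomanager, the client end of the connection needs a matching handler. The snippet below is only a sketch and not taken verbatim from the diff; it assumes the connection is created with grpc.Dial, and todoManagerAddr is a hypothetical variable holding the todomanager address:

// Attach ocgrpc.ClientHandler so outgoing gRPC calls join the current trace.
// todoManagerAddr is a hypothetical variable, not part of the actual code.
conn, err := grpc.Dial(
    todoManagerAddr,
    grpc.WithInsecure(),
    grpc.WithStatsHandler(&ocgrpc.ClientHandler{}),
)
if err != nil {
    log.Fatalf("failed to dial todomanager: %v", err)
}
defer conn.Close()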

That's it: the above is enough to include all our gRPC calls in traces. Still, we don't have to rely exclusively on the automatic tracing libraries; we can always step in and declare our own spans.

For example, to mark in the trace how long it takes to make a call to our database, we can use a code like this:

_, span := trace.StartSpan(ctx, "db-create")
t.db.Create(dbTodo)
span.End()
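Note that trace.StartSpan also returns a derived context, which we discard with _ above. If we kept it, any span started from that context would become a child of db-create. A quick sketch, with a hypothetical extra step for illustration:

// Keeping the returned context makes "db-create" the parent of further spans.
ctx, span := trace.StartSpan(ctx, "db-create")
// A hypothetical child step; it nests under "db-create" because it starts
// from the context returned above.
_, childSpan := trace.StartSpan(ctx, "map-to-db-model")
childSpan.End()
t.db.Create(dbTodo)
span.End()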

Building and deploying


Building the application — short version


If you don’t want to build the images yourself, they are available on quay.io for both apiserver and todomanager.


Building the application — long version


If you have all the developer tools in place, you can easily build the Docker images yourself. Just remember to point your shell at Minikube's Docker daemon, so the images are built straight into the cluster's image store.

Run in the main directory:

eval $(minikube docker-env)
make docker-build

Deploying the application


If you built the images yourself, please run the following commands to upgrade our ToDo app using Helm:

cp helm/local_chart/* helm/giantswarm-todo-app
helm upgrade --namespace todo --timeout 90 gs-todo helm/giantswarm-todo-app

However, if you want to use pre-built images, just run this instead:

helm upgrade gs-todo --namespace todo --timeout 90 --version 0.3.1 giantswarm-playground/giantswarm-todo-app

Testing


It's time to test the new version of our application. First, we restart the trace-related components to make sure Nginx and the OpenCensus Collector connect to each other. Then, we check the IP used by Minikube, so that we can issue a request that goes through the Nginx Ingress Controller.

In order for the Nginx Ingress Controller to route our request to the correct service, we need to include the proper Host: header with curl:

# restart services
$ kubectl -n tracing rollout restart deployment opencensus-collector-opencensus-collector-app
deployment.apps/opencensus-collector-opencensus-collector-app restarted

$ kubectl -n kube-system rollout restart deployment nginx-ingress-controller
deployment.apps/nginx-ingress-controller restarted

# just display our minikube's IP
$ minikube ip
192.168.99.126

# export minikube's IP to variable
$ export IC_IP=`minikube ip`

# create 1 new ToDo
$ curl -H "Content-Type: application/json" -H "Host: todo.local" -X POST \
  -d '{"Text":"Check tracing"}' \
  http://${IC_IP}/v1/todo
{"id":"1","text":"Check tracing","done":false}

# list all the ToDos
$ curl -H "Host: todo.local" http://${IC_IP}/v1/todo
[{"id":"1","text":"Check tracing","done":false}]

If you got responses, everything is working. Our services are responding, but this time our calls should have recorded some actual traces. Let’s check them.


Checking the traces


Debug info from zPages


As we mentioned earlier, we used the zPages feature of the OpenCensus libraries to provide additional debugging info, which is accessible from the app's pods even when the rest of the tracing system isn't running. Let's now access the zPages of one of the apiserver pods.

Start port forwarding with:

kubectl -n todo port-forward $(kubectl -n todo get po | grep apiserver | head -n 1 | cut -f 1 -d" ") 8081

Now access http://localhost:8081/debug/tracez. You should see a page like the one below, where you can find that two traces, one for CreateTodo and one for ListTodos, were recorded locally:




If you see them as well, we're all good! Now we can move on to the main part: watching actual, full traces in Jaeger.


Watching traces in Jaeger


Let’s start by making Jaeger accessible using port forwarding:

kubectl -n tracing port-forward svc/jaeger-all-in-one-default-query 16686

Then, visit http://localhost:16686/. You should see a page like the one below. In the “Service” list, select our “api-server”, then click “Find Traces” below. Lastly, locate a trace that lists all of our services involved, like in this picture:




Note: if you’re missing some services that should be involved in the trace, restart that specific service — it should fix the connection problem.

Now, click on the trace to investigate it. You should see a screen like this:




What you see is a detailed analysis of the route taken by a single GET request and the time it spent in each of the services that make up our application. You can imagine how useful this is for debugging issues! We can also see how much overhead the additional components, like Nginx and the Linkerd mesh, cost us, and how long the service waits for the database to respond.


Summary


Observability is based on data coming from the main three sources: metrics, logs, and traces. In this part of our series, we included tracing capabilities in our application. Using open-source tools and making some rather minimal changes to our application, we've made it possible to see how a single request traverses our microservice mesh and how much time it spends on different services. Our observability story is now complete… Or is it?

So far, our series has ignored the fact that the metrics we collect need to be stored somewhere, and that we might want to introduce custom metrics besides the ones automatically provided by Linkerd. It turns out that the metrics storage service is already there: it's the Prometheus monitoring system, a hidden hero of our story. But that's for next time...

👋 ICYMI: Here's Part 1, Part 2, Part 3, and Part 4