As modern application environments are polyglot, distributed, and increasingly complex, observing your application to identify and react to failures has become challenging. In early 2019, two popular instrumentation projects, OpenTracing and OpenCensus, merged to create OpenTelemetry, a new standard for observability telemetry. [1]

OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior. [2]

OpenTelemetry is generally available across several languages and is suitable for use.

OpenTelemetry, also known as OTel for short, is a vendor-neutral, open-source observability framework for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs. As an industry standard, it is natively supported by a number of vendors. [3]

otel diagram

1. What is Observability?

Observability lets us understand a system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows us to easily troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?” [4]

To be able to ask those questions of a system, the application must be properly instrumented. That is, the application code must emit signals such as traces, metrics, and logs. An application is properly instrumented when developers don’t need to add more instrumentation to troubleshoot an issue, because they have all of the information they need.

OpenTelemetry is the mechanism by which application code is instrumented, to help make a system observable.

1.1. Reliability & Metrics

Telemetry refers to data emitted from a system about its behavior. The data can come in the form of traces, metrics, and logs.

Reliability answers the question: “Is the service doing what users expect it to be doing?” A system could be up 100% of the time, but if a user clicks “Add to Cart” to add a black pair of pants to their shopping cart and the system doesn’t always add black pants, then the system would be considered unreliable.

Metrics are aggregations over a period of time of numeric data about your infrastructure or application. Unlike request tracing, which is intended to capture request lifecycles and provide context to the individual pieces of a request, metrics are intended to provide statistical information in aggregate. Examples include: system error rate, CPU utilization, request rate for a given service. For more on metrics and how they pertain to OTel, see Metrics.

SLI, or Service Level Indicator, represents a measurement of a service’s behavior. A good SLI measures your service from the perspective of your users. An example SLI can be the speed at which a web page loads.

SLO, or Service Level Objective, is the means by which reliability is communicated to an organization/other teams. This is accomplished by attaching one or more SLIs to business value.
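
To make the relationship between an SLI and an SLO concrete, here is a minimal C# sketch. The page-load samples, the 300 ms threshold, and the 90% target are made-up illustration values, not figures from any real service.

```csharp
using System;
using System.Linq;

// Hypothetical page-load times in milliseconds (illustration only).
double[] pageLoadMs = { 120, 340, 95, 210, 1800, 160, 275, 130, 90, 220 };

// SLI: the measured fraction of page loads that finish within the threshold.
const double thresholdMs = 300;
double sli = pageLoadMs.Count(ms => ms <= thresholdMs) / (double)pageLoadMs.Length;

// SLO: the target the team commits to for that SLI.
const double sloTarget = 0.90;
bool sloMet = sli >= sloTarget;

Console.WriteLine($"SLI = {sli:0.00}, SLO target = {sloTarget:0.00}, met = {sloMet}");
```

With these sample values, 8 of the 10 page loads are within the threshold, so the SLI falls short of the 90% objective. In practice the SLI would be computed from collected metrics rather than from a hard-coded array.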

1.2. Understanding Distributed Tracing

To understand Distributed Tracing, let’s start with some basics.

1.2.1. Logs

A log is a timestamped message emitted by services or other components. Unlike traces, however, logs are not necessarily associated with any particular user request or transaction. They are found almost everywhere in software, and both developers and operators have relied on them heavily to understand system behavior.

Sample log:

I, [2021-02-23T13:26:23.505892 #22473]  INFO -- : [6459ffe1-ea53-4044-aaa3-bf902868f730] Started GET "/" for ::1 at 2021-02-23 13:26:23 -0800

Unfortunately, logs aren’t extremely useful for tracking code execution, as they typically lack contextual information, such as where they were called from.

They become far more useful when they are included as part of a span, or when they are correlated with a trace and a span.

For more on logs and how they pertain to OTel, see Logs.
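
As a rough illustration of correlating a log line with a trace and a span, the following C# sketch stamps each message with the IDs of the ambient activity (the .NET name for a span). The source name Demo.Logs and the FormatLog helper are hypothetical stand-ins for a real logging pipeline, and the log layout is an assumption.

```csharp
using System;
using System.Diagnostics;

// Hypothetical source name, used only for this sketch.
var source = new ActivitySource("Demo.Logs");

// StartActivity returns null unless a listener samples the source.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Logs",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

// Stand-in for a logger: adds trace and span IDs when an activity is current.
string FormatLog(string message)
{
    var current = Activity.Current;
    return current is null
        ? $"{DateTime.UtcNow:O} INFO {message}"
        : $"{DateTime.UtcNow:O} INFO trace_id={current.TraceId} span_id={current.SpanId} {message}";
}

// Logged outside any span: no context to correlate with.
string uncorrelated = FormatLog("Started GET /");

// Logged inside a span: the line carries the IDs needed to find its trace.
string correlated;
using (var activity = source.StartActivity("HandleRequest"))
{
    correlated = FormatLog("Started GET /");
}

Console.WriteLine(uncorrelated);
Console.WriteLine(correlated);
```

The second line can be joined to its trace in a backend by trace_id, which is what makes the log useful for following code execution.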

1.2.2. Spans

A span represents a unit of work or operation. It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed. Spans are the building blocks of Traces.

A span contains a name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks.

{
  "name": "hello-greetings",
  "context": { (1)
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "5fb397be34d26b51"
  },
  "parent_id": "051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114304Z",
  "end_time": "2022-04-29T22:52:58.114561Z",
  "attributes": { (2)
    "http.route": "some_route2"
  },
  "events": [ (3)
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    },
    {
      "name": "bye now!",
      "timestamp": "2022-04-29T18:52:58.114585Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
1 Span context is an immutable object on every span. It contains the Trace ID of the trace that the span is part of, the span’s Span ID, Trace Flags (a binary encoding of information about the trace), and Trace State (a list of key-value pairs that can carry vendor-specific trace information).
2 Attributes are key-value pairs that contain metadata that you can use to annotate a Span to carry information about the operation it is tracking.
3 A Span Event can be thought of as a structured log message (or annotation) on a Span, typically used to denote a meaningful, singular point in time during the Span’s duration.
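
The same three parts can be produced in .NET with System.Diagnostics.Activity, which plays the role of a span. This is a minimal sketch, assuming a hypothetical source name (Demo.Spans) and reusing the names from the JSON sample above; the numbered comments refer to the callouts.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Hypothetical source name, used only for this sketch.
var source = new ActivitySource("Demo.Spans");

// StartActivity returns null unless a listener samples the source.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Spans",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

var span = source.StartActivity("hello-greetings");
if (span is null) throw new InvalidOperationException("No listener sampled the activity.");

// (2) Attributes: key-value metadata about the operation.
span.SetTag("http.route", "some_route2");

// (3) Events: timestamped, structured messages within the span.
span.AddEvent(new ActivityEvent("hey there!",
    tags: new ActivityTagsCollection { { "event_attributes", 1 } }));
span.AddEvent(new ActivityEvent("bye now!"));

span.Stop();

// (1) Span context: trace ID, span ID, and trace flags.
ActivityContext context = span.Context;
Console.WriteLine($"trace_id={context.TraceId} span_id={context.SpanId} flags={context.TraceFlags}");
Console.WriteLine($"attributes={span.Tags.Count()} events={span.Events.Count()} duration={span.Duration}");
```

The IDs are generated at run time, so they differ on every run.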

The following table contains examples of span attributes:

Key                Value
net.transport      IP.TCP
net.peer.ip        10.244.0.1
net.peer.port      10243
net.host.name      localhost
http.method        GET
http.target        /cart
http.server_name   frontend
http.route         /cart
http.scheme        http
http.host          localhost
http.flavor        1.1
http.status_code   200
http.user_agent    Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/106.0.0.0 Safari/537.36

For more on spans and how they pertain to OTel, see Spans.

1.2.3. Distributed Traces

A distributed trace, more commonly known as a trace, records the paths taken by requests (made by an application or end-user) as they propagate through multi-service architectures, like microservice and serverless applications.

Without tracing, it is challenging to pinpoint the cause of performance problems in a distributed system.

Tracing improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. It is essential for distributed systems, which commonly have nondeterministic problems or are too complicated to reproduce locally.

Tracing makes debugging and understanding distributed systems less daunting by breaking down what happens within a request as it flows through a distributed system.

A trace is made of one or more spans. The first span is the root span, and each root span represents a request from start to finish. The spans underneath the root provide more in-depth context about what occurs during a request (or what steps make up a request).

Many Observability back-ends visualize traces as waterfall diagrams that may look something like this:

Waterfall

Waterfall diagrams show the parent-child relationship between a root span and its child spans. When a span encapsulates another span, this also represents a nested relationship.

For more on traces and how they pertain to OTel, see Traces.
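
The parent-child relationship can be sketched in C# with two nested activities. The source name (Demo.Traces) and the span names are hypothetical; the sketch only shows that a child shares the root's trace ID and points at the root through its parent span ID.

```csharp
using System;
using System.Diagnostics;

// Hypothetical source and span names, used only for this sketch.
var source = new ActivitySource("Demo.Traces");

// StartActivity returns null unless a listener samples the source.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Traces",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

// The root span covers the whole request.
var root = source.StartActivity("GET /cart");
// Started while root is Activity.Current, so root becomes its parent.
var child = source.StartActivity("load-cart-items");
if (root is null || child is null)
    throw new InvalidOperationException("No listener sampled the activities.");

child.Stop();   // Activity.Current returns to root
root.Stop();

bool sameTrace = child.TraceId == root.TraceId;
bool childPointsAtRoot = child.ParentSpanId == root.SpanId;
bool rootIsTopLevel = root.Parent is null;

// Indentation mimics the nesting a waterfall diagram would show.
Console.WriteLine($"{root.OperationName}  span={root.SpanId}");
Console.WriteLine($"  {child.OperationName}  span={child.SpanId}  parent={child.ParentSpanId}");
```

A tracing backend uses exactly these IDs, together with each span's start time and duration, to lay the spans out as a waterfall.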

2. .NET observability with OpenTelemetry

When you run an application, you want to know how well the app is performing and to detect potential problems before they become larger. Developers commonly accomplish this by making the app emit telemetry data such as logs or metrics, and then monitoring and analyzing that data. [5]

2.1. What is observability

Observability in the context of a distributed system is the ability to monitor and analyze telemetry about the state of each component, to be able to observe changes in performance, and to diagnose why those changes occur. Unlike debugging, which is invasive and can affect the operation of the application, observability is intended to be transparent to the primary operation and have a small enough performance impact that it can be used continuously.

Observability is commonly done using a combination of:

  • Logs, which record individual operations, such as an incoming request, a failure in a specific component, or an order being placed.

  • Metrics, which measure counters and gauges such as the number of completed requests, active requests, or widgets sold, or a histogram of the request latency.

  • Distributed tracing, which tracks requests and activities across components in a distributed system so that you can see where time is spent and track down specific failures.

Together, logs, metrics, and distributed tracing are known as the three pillars of observability.

Each pillar might include telemetry data from:

  • The .NET runtime, such as the garbage collector or JIT compiler.

  • Libraries, such as from Kestrel (the ASP.NET web server) and HttpClient.

  • Application-specific telemetry that’s emitted by your code.

2.2. Observability approaches in .NET

There are a few different ways to achieve observability in .NET applications:

  • Explicitly in code, by referencing and using a library such as OpenTelemetry.

    If you have access to the source code and can rebuild the app, then this is the most powerful and configurable mechanism.

  • Out-of-process using EventPipe.

    Tools such as dotnet-monitor can listen to logs and metrics and then process them without affecting any code.

  • Using a startup hook, assemblies can be injected into the process that can then collect instrumentation.

    An example of this approach is OpenTelemetry .NET Automatic Instrumentation.

2.3. What is OpenTelemetry

OpenTelemetry (OTel) is a cross-platform, open standard for collecting and emitting telemetry data, which includes:

  • APIs for libraries to use to record telemetry data as code is running.

  • APIs that app developers use to configure what portion of the recorded data will be sent across the network, where it will be sent to, and how it may be filtered, buffered, enriched, and transformed.

  • Semantic conventions, which provide guidance on the naming and content of telemetry data. It is important for the apps that produce telemetry data and the tools that receive the data to agree on what different kinds of data mean and what sorts of data are useful, so that the tools can provide effective analysis.

  • An interface for exporters. Exporters are plugins that allow telemetry data to be transmitted in specific formats to different telemetry backends.

  • The OTLP wire protocol, a vendor-neutral network protocol option for transmitting telemetry data. Some tools and vendors support this protocol in addition to pre-existing proprietary protocols they may have.

Using OTel enables the use of a wide variety of APM (Application Performance Monitoring) systems, including open-source systems such as Prometheus and Grafana, Azure Monitor (Microsoft’s APM product in Azure), and products from the many APM vendors that partner with OpenTelemetry.

2.4. .NET implementation of OpenTelemetry

The .NET OpenTelemetry implementation is a little different from other platforms, as .NET provides logging, metrics, and activity APIs in the framework. That means OTel doesn’t need to provide APIs for library authors to use. The .NET OTel implementation uses these platform APIs for instrumentation:

  • Microsoft.Extensions.Logging.ILogger<TCategoryName> for logging

  • System.Diagnostics.Metrics.Meter for metrics

  • System.Diagnostics.ActivitySource and System.Diagnostics.Activity for distributed tracing

.NET OTel architecture
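
To illustrate that these platform APIs work without any OpenTelemetry package, here is a minimal C# sketch that records measurements with a Meter and observes them in-process with a MeterListener. The meter and instrument names (Demo.Metrics, demo.requests) are hypothetical, and the listener merely stands in for what an SDK or tool would do with the measurements.

```csharp
using System;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, used only for this sketch.
using var meter = new Meter("Demo.Metrics", "1.0.0");
var requests = meter.CreateCounter<long>("demo.requests", description: "Counts handled requests");

long total = 0;

// MeterListener is the in-box way to observe measurements from a Meter.
using var meterListener = new MeterListener();
meterListener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments that belong to our meter.
    if (instrument.Meter.Name == "Demo.Metrics") l.EnableMeasurementEvents(instrument);
};
meterListener.SetMeasurementEventCallback<long>(
    (instrument, measurement, tags, state) => total += measurement);
meterListener.Start();

// Library or application code only talks to the instrument.
requests.Add(1);
requests.Add(2);

Console.WriteLine($"demo.requests total = {total}");
```

The code that records the measurements has no knowledge of who is listening, which is the separation that lets library authors instrument once and leave the export decisions to the app.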

2.5. OpenTelemetry packages

OpenTelemetry in .NET is implemented as a series of NuGet packages that form a couple of categories:

  • Core API

  • Instrumentation - these packages collect instrumentation from the runtime and common libraries.

  • Exporters - these interface with APM systems such as Prometheus, Jaeger, and OTLP.

The following table describes the main packages.

Package Name                                   Description
OpenTelemetry                                  Main library that provides the core OTel functionality
OpenTelemetry.Instrumentation.AspNetCore       Instrumentation for ASP.NET Core and Kestrel
OpenTelemetry.Instrumentation.GrpcNetClient    Instrumentation for gRPC Client for tracking outbound gRPC calls
OpenTelemetry.Instrumentation.Http             Instrumentation for HttpClient and HttpWebRequest to track outbound HTTP calls
OpenTelemetry.Instrumentation.SqlClient        Instrumentation for SqlClient used to trace database operations
OpenTelemetry.Exporter.Console                 Exporter for the console, commonly used to diagnose what telemetry is being exported
OpenTelemetry.Exporter.OpenTelemetryProtocol   Exporter using the OTLP protocol
OpenTelemetry.Exporter.Prometheus.AspNetCore   Exporter for Prometheus implemented using an ASP.NET Core endpoint
OpenTelemetry.Exporter.Zipkin                  Exporter for Zipkin tracing

2.6. Example: Use OpenTelemetry with Prometheus, Grafana, and Jaeger

This example uses Prometheus for metrics collection, Grafana for creating a dashboard, and Jaeger to show distributed tracing.

2.6.1. Create the project

Create a simple web API project by using the ASP.NET Core Empty template in Visual Studio or the following .NET CLI command:

dotnet new web

2.6.2. View metrics with dotnet-counters

dotnet-counters is a command-line tool that can view live metrics for .NET Core apps on demand.

  1. If the dotnet-counters tool isn’t installed, run the following command:

    dotnet tool update -g dotnet-counters
  2. Start the testing web app.

    dotnet run
    info: Microsoft.Hosting.Lifetime[14]
          Now listening on: http://localhost:5000
    info: Microsoft.Hosting.Lifetime[0]
          Application started. Press Ctrl+C to shut down.
  3. Open a new terminal, and send test HTTP requests with curl or a browser.

    watch curl -k http://localhost:5000
  4. Open a new terminal, and launch dotnet-counters to monitor all metrics from the Microsoft.AspNetCore.Hosting meter.

    First, list the dotnet processes that can be monitored:

    $ dotnet-counters ps
     3123  dotnet            /usr/share/dotnet/dotnet                             dotnet run
     3154  OtPrGrYa.Example  /OtPrGrYa.Example/bin/Debug/net9.0/OtPrGrYa.Example

    Then monitor the app by name:

    $ dotnet-counters monitor -n OtPrGrYa.Example --counters Microsoft.AspNetCore.Hosting
    Press p to pause, r to resume, q to quit.
        Status: Running
    
    Name                                                                                                                                      Current Value
    [Microsoft.AspNetCore.Hosting]
        http.server.active_requests ({request})
            http.request.method url.scheme
            GET                 http                                                                                                                  0
        http.server.request.duration (s)
            http.request.method http.response.status_code http.route network.protocol.version url.scheme Percentile
            GET                 200                       /          1.1                      http       50                                           0.001
            GET                 200                       /          1.1                      http       95                                           0.001
            GET                 200                       /          1.1                      http       99                                           0.001
    
    

2.6.3. Add metrics and activity definitions

The following code defines a new metric (greetings.count) for the number of times the API has been called, and a new activity source (OtPrGrJa.Example).

// using System.Diagnostics;
// using System.Diagnostics.Metrics;

// Custom metrics for the application
var greeterMeter = new Meter("OtPrGrYa.Example", "1.0.0");
var countGreetings = greeterMeter.CreateCounter<int>("greetings.count", description: "Counts the number of greetings");

// Custom ActivitySource for the application
var greeterActivitySource = new ActivitySource("OtPrGrJa.Example");

2.6.4. Create or update an API endpoint

app.MapGet("/", SendGreeting);
string SendGreeting(ILogger<Program> logger)
{
    // Create a new Activity scoped to the method
    using var activity = greeterActivitySource.StartActivity("GreeterActivity");

    // Log a message
    logger.LogInformation("Sending greeting");

    // Increment the custom counter
    countGreetings.Add(1);

    // Add a tag to the Activity
    activity?.SetTag("greeting", "Hello World!");

    return "Hello World!";
}
The API definition does not use anything specific to OpenTelemetry. It uses the .NET APIs for observability.

2.6.5. Reference the OpenTelemetry packages

Use the NuGet Package Manager or command line to add the following NuGet packages:

<ItemGroup>
   <PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.5.0" />
   <PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.5.0" />
   <PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.5.0-rc.1" />
   <PackageReference Include="OpenTelemetry.Exporter.Zipkin" Version="1.5.0" />
   <PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.5.0" />
   <PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.5.0-beta.1" />
   <PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.5.0-beta.1" />
</ItemGroup>
Use the latest versions, as the OTel APIs are constantly evolving.

2.6.6. Configure OpenTelemetry with the correct providers

// using OpenTelemetry.Metrics;
// using OpenTelemetry.Resources;
// using OpenTelemetry.Trace;

var tracingOtlpEndpoint = builder.Configuration["OTLP_ENDPOINT_URL"];
var otel = builder.Services.AddOpenTelemetry();

// Configure OpenTelemetry Resources with the application name
otel.ConfigureResource(resource => resource
    .AddService(serviceName: builder.Environment.ApplicationName));

// Add Metrics for ASP.NET Core and our custom metrics and export to Prometheus
otel.WithMetrics(metrics => metrics
    // Metrics provider from OpenTelemetry
    .AddAspNetCoreInstrumentation()
    .AddMeter(greeterMeter.Name)
    // Metrics provided by ASP.NET Core in .NET 8
    .AddMeter("Microsoft.AspNetCore.Hosting")
    .AddMeter("Microsoft.AspNetCore.Server.Kestrel")
    .AddPrometheusExporter());

// Add Tracing for ASP.NET Core and our custom ActivitySource and export to Jaeger
otel.WithTracing(tracing =>
{
    tracing.AddAspNetCoreInstrumentation();
    tracing.AddHttpClientInstrumentation();
    tracing.AddSource(greeterActivitySource.Name);
    if (tracingOtlpEndpoint != null)
    {
        tracing.AddOtlpExporter(otlpOptions =>
         {
             otlpOptions.Endpoint = new Uri(tracingOtlpEndpoint);
         });
    }
    else
    {
        tracing.AddConsoleExporter();
    }
});

This code uses ASP.NET Core instrumentation to get metrics and activities from ASP.NET Core. It also registers the custom Meter and ActivitySource, by name, for metrics and tracing respectively.

The code uses the Prometheus exporter for metrics, which uses ASP.NET Core to host the endpoint, so you also need to add:

// Configure the Prometheus scraping endpoint
app.MapPrometheusScrapingEndpoint();

2.6.7. Run the project

Run the project and then access the API with the browser or curl.

curl -k http://localhost:5000

Each time you request the page, it increments the count for the number of greetings that have been made. You can access the metrics endpoint using the same base URL, with the path /metrics.

2.6.7.1. Log output

The logging statements from the code are output using ILogger. By default, the Console Provider is enabled so that output is directed to the console.

There are several options for how logs can be egressed from .NET:

  • stdout and stderr output is redirected to log files by container systems such as Kubernetes.

  • Using logging libraries that integrate with ILogger, such as Serilog or NLog.

  • Using logging providers for OTel, such as OTLP or the Azure Monitor exporter.

2.6.7.2. Access the metrics

You can access the metrics using the /metrics endpoint.

$ curl -k http://localhost:5000/
Hello World!

$ curl -k http://localhost:5000/metrics
# TYPE greetings_count_total counter
# HELP greetings_count_total Counts the number of greetings
greetings_count_total{otel_scope_name="OtPrGrYa.Example",otel_scope_version="1.0.0"} 45 1735894061045
# TYPE kestrel_active_connections gauge
# HELP kestrel_active_connections Number of connections that are currently active on the server.
kestrel_active_connections{otel_scope_name="Microsoft.AspNetCore.Server.Kestrel",network_transport="tcp",network_type="ipv4",server_address="127.0.0.1",server_port="5000"} 1 1735894061045
# TYPE kestrel_connection_duration_seconds histogram
# UNIT kestrel_connection_duration_seconds seconds
# HELP kestrel_connection_duration_seconds The duration of connections on the server.
kestrel_connection_duration_seconds_bucket{otel_scope_name="Microsoft.AspNetCore.Server
. . .
2.6.7.3. Access the tracing

If you look at the console for the server, you’ll see the output from the console trace exporter, which outputs the information in a human readable format. This should show two activities, one from your custom ActivitySource, and the other from ASP.NET Core:

Activity.TraceId:            9ef749f2829d7837e6edd163b8b6bb81
Activity.SpanId:             45e86b6601f6b09d
Activity.TraceFlags:         Recorded
Activity.ParentSpanId:       d1af72ebe3cd5dba
Activity.ActivitySourceName: OtPrGrJa.Example
Activity.DisplayName:        GreeterActivity
Activity.Kind:               Internal
Activity.StartTime:          2023-07-19T00:44:43.2738232Z
Activity.Duration:           00:00:00.0027491
Activity.Tags:
    greeting: Hello World!
Resource associated with Activity:
    service.name: OtPrGrJa.Example
    service.instance.id: 11a771a5-d03b-4f66-baa0-2e968bd8b981
    telemetry.sdk.name: opentelemetry
    telemetry.sdk.language: dotnet
    telemetry.sdk.version: 1.5.0

Activity.TraceId:            9ef749f2829d7837e6edd163b8b6bb81
Activity.SpanId:             d1af72ebe3cd5dba
Activity.TraceFlags:         Recorded
Activity.ActivitySourceName: OpenTelemetry.Instrumentation.AspNetCore
Activity.DisplayName:        /
Activity.Kind:               Server
Activity.StartTime:          2023-07-19T00:44:43.2443183Z
Activity.Duration:           00:00:00.0446847
Activity.Tags:
    net.host.name: localhost
    net.host.port: 5138
    http.method: GET
    http.scheme: http
    http.target: /
    http.url: http://localhost:5138/
    http.flavor: 1.1
    http.user_agent: curl/7.88.1
    http.status_code: 200
Resource associated with Activity:
    service.name: OtPrGrJa.Example
    service.instance.id: 11a771a5-d03b-4f66-baa0-2e968bd8b981
    telemetry.sdk.name: opentelemetry
    telemetry.sdk.language: dotnet
    telemetry.sdk.version: 1.5.0

The first is the inner custom activity you created. The second is created by ASP.NET Core for the request and includes tags for the HTTP request properties.

You will see that both have the same TraceId, which identifies a single transaction and in a distributed system can be used to correlate the traces from each service involved in a transaction.

  • The IDs are transmitted as HTTP headers.

  • ASP.NET Core assigns a TraceId if none is present when it receives a request.

  • HttpClient includes the headers by default on outbound requests.

  • Each activity has a SpanId; the combination of TraceId and SpanId uniquely identifies each activity.

  • The Greeter activity is parented to the HTTP activity through its ParentSpanId, which maps to the SpanId of the HTTP activity.
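
The header-based propagation described above can be sketched in C# by parsing a W3C traceparent value and starting an activity under it. This is a simplified illustration of what ASP.NET Core does when a request arrives with trace headers, not its actual implementation. The source name Demo.Propagation is hypothetical, and the header value is assembled from the IDs in the sample console output above.

```csharp
using System;
using System.Diagnostics;

// A W3C traceparent header value: version-traceid-parentid-flags.
// The IDs come from the sample output above; the 01 flag means "sampled".
string traceparent = "00-9ef749f2829d7837e6edd163b8b6bb81-d1af72ebe3cd5dba-01";

if (!ActivityContext.TryParse(traceparent, null, out ActivityContext parent))
    throw new FormatException("Not a valid traceparent value.");

// Hypothetical source name, used only for this sketch.
var source = new ActivitySource("Demo.Propagation");

// StartActivity returns null unless a listener samples the source.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Propagation",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) => ActivitySamplingResult.AllDataAndRecorded
};
ActivitySource.AddActivityListener(listener);

// Starting an activity with the parsed context continues the caller's trace.
var serverSpan = source.StartActivity("GET /", ActivityKind.Server, parent);
if (serverSpan is null) throw new InvalidOperationException("No listener sampled the activity.");
serverSpan.Stop();

bool sameTrace = serverSpan.TraceId == parent.TraceId;
bool parentedToCaller = serverSpan.ParentSpanId == parent.SpanId;

// Activity.Id has the same traceparent shape and is what would be sent downstream.
Console.WriteLine($"incoming: {traceparent}");
Console.WriteLine($"outgoing: {serverSpan.Id}");
```

The outgoing value keeps the incoming trace ID but carries the new span's ID, so the next service in the chain is parented to this span in turn.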

2.6.8. Collect metrics with Prometheus

Prometheus is a metrics collection, aggregation, and time-series database system.

2.6.9. Use Grafana to create a metrics dashboard

Grafana is a dashboarding product that can create dashboards and alerts based on Prometheus or other data sources.

2.6.10. Distributed tracing with Jaeger

Jaeger (pronounced "Yay-ger") is an open-source, end-to-end distributed tracing system.