What is OpenTelemetry
As modern application environments are polyglot, distributed, and increasingly complex, observing your application to identify and react to failures has become challenging. In early 2019, two popular instrumentation projects, OpenTracing and OpenCensus, merged to create OpenTelemetry, a new standard for observability telemetry. [1]
OpenTelemetry is a collection of APIs, SDKs, and tools. Use it to instrument, generate, collect, and export telemetry data (metrics, logs, and traces) to help you analyze your software’s performance and behavior. [2]
OpenTelemetry is generally available across several languages and is suitable for use.
OpenTelemetry, also known as OTel for short, is a vendor-neutral, open-source observability framework for instrumenting, generating, collecting, and exporting telemetry data such as traces, metrics, and logs. As an industry standard, it is natively supported by a number of vendors. [3]
- 1. What is Observability?
- 2. .NET observability with OpenTelemetry
- 2.1. What is observability
- 2.2. Observability approaches in .NET
- 2.3. What is OpenTelemetry
- 2.4. .NET implementation of OpenTelemetry
- 2.5. OpenTelemetry packages
- 2.6. Example: Use OpenTelemetry with Prometheus, Grafana, and Jaeger
- 2.6.1. Create the project
- 2.6.2. View metrics with dotnet-counters
- 2.6.3. Add metrics and activity definitions
- 2.6.4. Create or update an API endpoint
- 2.6.5. Reference the OpenTelemetry packages
- 2.6.6. Configure OpenTelemetry with the correct providers
- 2.6.7. Run the project
- 2.6.8. Collect metrics with Prometheus
- 2.6.9. Use Grafana to create a metrics dashboard
- 2.6.10. Distributed tracing with Jaeger
- References
1. What is Observability?
Observability lets us understand a system from the outside, by letting us ask questions about that system without knowing its inner workings. Furthermore, it allows us to easily troubleshoot and handle novel problems (i.e. “unknown unknowns”), and helps us answer the question, “Why is this happening?” [4]
To ask those questions of a system, the application must be properly instrumented. That is, the application code must emit signals such as traces, metrics, and logs. An application is properly instrumented when developers don’t need to add more instrumentation to troubleshoot an issue, because they have all of the information they need.
OpenTelemetry is the mechanism by which application code is instrumented, to help make a system observable.
1.1. Reliability & Metrics
Telemetry refers to data emitted from a system about its behavior. The data can come in the form of traces, metrics, and logs.
Reliability answers the question: “Is the service doing what users expect it to be doing?” A system could be up 100% of the time, but if a user clicks “Add to Cart” to add a black pair of pants to their shopping cart and the system doesn’t always add black pants, then the system would be said to be unreliable.
Metrics are aggregations over a period of time of numeric data about your infrastructure or application. Unlike request tracing, which is intended to capture request lifecycles and provide context to the individual pieces of a request, metrics are intended to provide statistical information in aggregate. Examples include: system error rate, CPU utilization, request rate for a given service. For more on metrics and how they pertain to OTel, see Metrics.
SLI, or Service Level Indicator, represents a measurement of a service’s behavior. A good SLI measures your service from the perspective of your users. An example SLI can be the speed at which a web page loads.
SLO, or Service Level Objective, is the means by which reliability is communicated to an organization/other teams. This is accomplished by attaching one or more SLIs to business value.
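The relationship between an SLI and an SLO can be sketched with a little arithmetic. The following is a minimal illustration, not a prescribed method: the page-load samples, the 2000 ms threshold, and the 95% target are all made-up values chosen for this example.

```csharp
using System;
using System.Linq;

// Hypothetical page-load times in milliseconds, one per request.
double[] pageLoadMs = { 120, 340, 95, 2100, 180, 410, 150, 3050, 220, 130 };

// SLI: the fraction of page loads that finished within the threshold.
const double thresholdMs = 2000;
double sli = pageLoadMs.Count(ms => ms <= thresholdMs) / (double)pageLoadMs.Length;

// SLO: the target the team commits to for that SLI (assumed value).
const double sloTarget = 0.95;
bool sloMet = sli >= sloTarget;

Console.WriteLine($"SLI = {sli:P0}, SLO target = {sloTarget:P0}, met = {sloMet}");
```

Here 8 of the 10 samples are within the threshold, so the SLI falls short of the assumed target and the objective is not met.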
1.2. Understanding Distributed Tracing
To understand Distributed Tracing, let’s start with some basics.
1.2.1. Logs
A log is a timestamped message emitted by services or other components. Unlike traces, however, logs are not necessarily associated with any particular user request or transaction. They are found almost everywhere in software, and both developers and operators have relied on them heavily to understand system behavior.
Sample log:
I, [2021-02-23T13:26:23.505892 #22473] INFO -- : [6459ffe1-ea53-4044-aaa3-bf902868f730] Started GET "/" for ::1 at 2021-02-23 13:26:23 -0800
Unfortunately, logs aren’t extremely useful for tracking code execution, as they typically lack contextual information, such as where they were called from.
They become far more useful when they are included as part of a span, or when they are correlated with a trace and a span.
For more on logs and how they pertain to OTel, see Logs.
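As a rough sketch of what correlating a log with a trace and a span can look like in .NET, the snippet below prefixes each message with the IDs of the current System.Diagnostics.Activity. The FormatLog helper and the log layout are hypothetical and exist only for this illustration; a real application would normally let its logging provider add these fields.

```csharp
using System;
using System.Diagnostics;

// Use W3C-style IDs (the default on current .NET versions).
Activity.DefaultIdFormat = ActivityIdFormat.W3C;
Activity.ForceDefaultIdFormat = true;

// Hypothetical helper: prefixes a message with the current trace and span IDs.
string FormatLog(string level, string message)
{
    Activity? current = Activity.Current;
    string traceId = current?.TraceId.ToString() ?? "-";
    string spanId = current?.SpanId.ToString() ?? "-";
    return $"{DateTimeOffset.UtcNow:O} {level} trace_id={traceId} span_id={spanId} {message}";
}

// Outside any activity there is nothing to correlate with.
Console.WriteLine(FormatLog("INFO", "Service starting"));

// Inside an activity, every log line carries the IDs of the span it belongs to.
using (Activity request = new Activity("HandleRequest").Start())
{
    Console.WriteLine(FormatLog("INFO", "Started GET \"/\""));
    Console.WriteLine(FormatLog("INFO", "Completed 200 OK"));
}
```

The two lines written inside the activity share one trace_id and span_id, which is what lets a backend group them with the rest of the request.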
1.2.2. Spans
A span represents a unit of work or operation. It tracks specific operations that a request makes, painting a picture of what happened during the time in which that operation was executed. Spans are the building blocks of Traces.
A span contains a name, time-related data, structured log messages, and other metadata (that is, Attributes) to provide information about the operation it tracks.
{
  "name": "hello-greetings",
  "context": { (1)
    "trace_id": "5b8aa5a2d2c872e8321cf37308d69df2",
    "span_id": "5fb397be34d26b51"
  },
  "parent_id": "051581bf3cb55c13",
  "start_time": "2022-04-29T18:52:58.114304Z",
  "end_time": "2022-04-29T22:52:58.114561Z",
  "attributes": { (2)
    "http.route": "some_route2"
  },
  "events": [ (3)
    {
      "name": "hey there!",
      "timestamp": "2022-04-29T18:52:58.114561Z",
      "attributes": {
        "event_attributes": 1
      }
    },
    {
      "name": "bye now!",
      "timestamp": "2022-04-29T18:52:58.114585Z",
      "attributes": {
        "event_attributes": 1
      }
    }
  ]
}
1 | Span context is an immutable object on every span that contains the Trace ID representing the trace that the span is a part of, the span’s Span ID, Trace Flags (a binary encoding containing information about the trace), and Trace State (a list of key-value pairs that can carry vendor-specific trace information). |
2 | Attributes are key-value pairs that contain metadata that you can use to annotate a Span to carry information about the operation it is tracking. |
3 | A Span Event can be thought of as a structured log message (or annotation) on a Span, typically used to denote a meaningful, singular point in time during the Span’s duration. |
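In .NET, a span is represented by System.Diagnostics.Activity. The following sketch, which assumes a hypothetical source name "Demo.Spans", builds a span similar to the JSON sample: it sets one attribute, adds two events, and prints the context IDs. The IDs are generated at run time, so they will differ from the sample.

```csharp
using System;
using System.Diagnostics;
using System.Linq;

// Without a listener, StartActivity returns null. This one records everything
// from the hypothetical "Demo.Spans" source.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Spans",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);

using var spanSource = new ActivitySource("Demo.Spans");

Activity? span = spanSource.StartActivity("hello-greetings");

// Attributes: key-value metadata about the operation.
span?.SetTag("http.route", "some_route2");

// Events: timestamped annotations inside the span's lifetime.
span?.AddEvent(new ActivityEvent("hey there!"));
span?.AddEvent(new ActivityEvent("bye now!"));

// Stopping the span fixes its duration (end time minus start time).
span?.Stop();

Console.WriteLine($"name:      {span?.DisplayName}");
Console.WriteLine($"trace_id:  {span?.TraceId}");
Console.WriteLine($"span_id:   {span?.SpanId}");
Console.WriteLine($"parent_id: {span?.ParentSpanId}");
Console.WriteLine($"duration:  {span?.Duration}");
Console.WriteLine($"events:    {span?.Events.Count()}");
```

Because this span has no parent, its parent ID prints as all zeros, which marks it as a root span.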
The following table contains examples of span attributes:
Key | Value |
---|---|
For more on spans and how they pertain to OTel, see Spans.
1.2.3. Distributed Traces
A distributed trace, more commonly known as a trace, records the paths taken by requests (made by an application or end-user) as they propagate through multi-service architectures, like microservice and serverless applications.
Without tracing, it is challenging to pinpoint the cause of performance problems in a distributed system.
It improves the visibility of our application or system’s health and lets us debug behavior that is difficult to reproduce locally. Tracing is essential for distributed systems, which commonly have nondeterministic problems or are too complicated to reproduce locally.
Tracing makes debugging and understanding distributed systems less daunting by breaking down what happens within a request as it flows through a distributed system.
A trace is made of one or more spans. The first span represents the root span. Each root span represents a request from start to finish. The spans underneath the parent provide a more in-depth context of what occurs during a request (or what steps make up a request).
Many Observability back-ends visualize traces as waterfall diagrams.
Waterfall diagrams show the parent-child relationship between a root span and its child spans. When a span encapsulates another span, this also represents a nested relationship.
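To make the parent-child structure concrete, here is a small sketch that creates nested activities and prints them as a text waterfall. The source name "Demo.Waterfall", the span names, and the sleep durations are invented for the illustration, and the timings printed will vary from run to run.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics;
using System.Linq;
using System.Threading;

var finished = new List<Activity>();

// Collect every activity from the hypothetical "Demo.Waterfall" source as it stops.
using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Waterfall",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
    ActivityStopped = finished.Add,
};
ActivitySource.AddActivityListener(listener);

using var traceSource = new ActivitySource("Demo.Waterfall");

using (traceSource.StartActivity("GET /checkout"))              // root span
{
    Thread.Sleep(10);
    using (traceSource.StartActivity("load cart"))              // child of the root
    {
        Thread.Sleep(20);
    }
    using (traceSource.StartActivity("charge card"))            // child of the root
    {
        Thread.Sleep(15);
        using (traceSource.StartActivity("call payment API"))   // grandchild
        {
            Thread.Sleep(25);
        }
    }
}

// Depth is the number of ancestors, found by following Parent references.
int Depth(Activity activity) => activity.Parent is null ? 0 : 1 + Depth(activity.Parent);

DateTime traceStart = finished.Min(x => x.StartTimeUtc);
foreach (Activity item in finished.OrderBy(x => x.StartTimeUtc))
{
    double offsetMs = (item.StartTimeUtc - traceStart).TotalMilliseconds;
    string label = new string(' ', 2 * Depth(item)) + item.DisplayName;
    Console.WriteLine($"{label,-24} start +{offsetMs,6:F1} ms   took {item.Duration.TotalMilliseconds,6:F1} ms");
}
```

All four activities share one TraceId, and the indentation mirrors the nesting that a waterfall view shows graphically.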
For more on traces and how they pertain to OTel, see Traces.
2. .NET observability with OpenTelemetry
When you run an application, you want to know how well the app is performing and to detect potential problems before they become larger. Commonly developers accomplish this by making the app emit telemetry data such as logs or metrics, then monitor and analyze that data. [5]
2.1. What is observability
Observability in the context of a distributed system is the ability to monitor and analyze telemetry about the state of each component, to be able to observe changes in performance, and to diagnose why those changes occur. Unlike debugging, which is invasive and can affect the operation of the application, observability is intended to be transparent to the primary operation and have a small enough performance impact that it can be used continuously.
Observability is commonly done using a combination of:
-
Logs, which record individual operations, such as an incoming request, a failure in a specific component, or an order being placed.
-
Metrics, which are measuring counters and gauges such as number of completed requests, active requests, widgets that have been sold; or a histogram of the request latency.
-
Distributed tracing, which tracks requests and activities across components in a distributed system so that you can see where time is spent and track down specific failures.
Together, logs, metrics, and distributed tracing are known as the 3 pillars of observability.
Each pillar might include telemetry data from:
-
The .NET runtime, such as the garbage collector or JIT compiler.
-
Libraries, such as Kestrel (the ASP.NET web server) and HttpClient.
-
Application-specific telemetry that’s emitted by your code.
2.2. Observability approaches in .NET
There are a few different ways to achieve observability in .NET applications:
-
Explicitly in code, by referencing and using a library such as OpenTelemetry.
If you have access to the source code and can rebuild the app, then this is the most powerful and configurable mechanism.
-
Out-of-process using EventPipe.
Tools such as dotnet-monitor can listen to logs and metrics and then process them without affecting any code.
-
Using a startup hook, assemblies can be injected into the process that can then collect instrumentation.
An example of this approach is OpenTelemetry .NET Automatic Instrumentation.
2.3. What is OpenTelemetry
OpenTelemetry (OTel) is a cross-platform, open standard for collecting and emitting telemetry data, which includes:
-
APIs for libraries to use to record telemetry data as code is running.
-
APIs that app developers use to configure what portion of the recorded data will be sent across the network, where it will be sent to, and how it may be filtered, buffered, enriched, and transformed.
-
Semantic conventions that provide guidance on the naming and content of telemetry data. It is important for the apps that produce telemetry data and the tools that receive the data to agree on what different kinds of data mean and what sorts of data are useful so that the tools can provide effective analysis.
-
An interface for exporters. Exporters are plugins that allow telemetry data to be transmitted in specific formats to different telemetry backends.
-
The OTLP wire protocol, a vendor-neutral network protocol option for transmitting telemetry data. Some tools and vendors support this protocol in addition to pre-existing proprietary protocols they may have.
OTel works with a wide variety of APM (Application Performance Monitoring) systems, including open-source systems such as Prometheus and Grafana, Azure Monitor (Microsoft’s APM product in Azure), and products from the many APM vendors that partner with OpenTelemetry.
2.4. .NET implementation of OpenTelemetry
The .NET OpenTelemetry implementation is a little different from other platforms, as .NET provides logging, metrics, and activity APIs in the framework. That means OTel doesn’t need to provide APIs for library authors to use. The .NET OTel implementation uses these platform APIs for instrumentation:
-
Microsoft.Extensions.Logging.ILogger<TCategoryName> for logging
-
System.Diagnostics.Metrics.Meter for metrics
-
System.Diagnostics.ActivitySource and System.Diagnostics.Activity for distributed tracing
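Because these APIs live in the platform, they can be exercised without any OpenTelemetry package. The sketch below records measurements with System.Diagnostics.Metrics and reads them back in-process with a MeterListener, which is roughly the role a metrics SDK plays. The meter name "Demo.Metrics", the instrument names, and the simulated values are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// "Demo.Metrics" and the instrument names are hypothetical.
using var meter = new Meter("Demo.Metrics", "1.0.0");
Counter<long> completed = meter.CreateCounter<long>("requests.completed");
Histogram<double> latency = meter.CreateHistogram<double>("request.duration", unit: "s");

long totalCompleted = 0;
var latencies = new List<double>();

using var meterListener = new MeterListener();

// Subscribe only to instruments that belong to our meter.
meterListener.InstrumentPublished = (instrument, l) =>
{
    if (instrument.Meter.Name == "Demo.Metrics")
    {
        l.EnableMeasurementEvents(instrument);
    }
};

// One callback per measurement type; each runs synchronously when a value is recorded.
meterListener.SetMeasurementEventCallback<long>(
    (instrument, value, tags, state) => totalCompleted += value);
meterListener.SetMeasurementEventCallback<double>(
    (instrument, value, tags, state) => latencies.Add(value));
meterListener.Start();

// Simulate three requests.
foreach (double seconds in new[] { 0.012, 0.030, 0.250 })
{
    completed.Add(1);
    latency.Record(seconds);
}

Console.WriteLine($"completed = {totalCompleted}, recorded latencies = {latencies.Count}");
```

The instrumented code only knows about the Meter; what happens to the measurements is decided entirely by whoever is listening.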
2.5. OpenTelemetry packages
OpenTelemetry in .NET is implemented as a series of NuGet packages that fall into three categories:
-
Core API
-
Instrumentation - these packages collect instrumentation from the runtime and common libraries.
-
Exporters - these interface with APM systems such as Prometheus, Jaeger, and OTLP.
The following table describes the main packages.
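Package Name | Description |
---|---|
OpenTelemetry | Main library that provides the core OTel functionality |
OpenTelemetry.Instrumentation.AspNetCore | Instrumentation for ASP.NET Core and Kestrel |
OpenTelemetry.Instrumentation.GrpcNetClient | Instrumentation for gRPC Client for tracking outbound gRPC calls |
OpenTelemetry.Instrumentation.Http | Instrumentation for HttpClient and HttpWebRequest to track outbound HTTP calls |
OpenTelemetry.Instrumentation.SqlClient | Instrumentation for SqlClient used to trace database operations |
OpenTelemetry.Exporter.Console | Exporter for the console, commonly used to diagnose what telemetry is being exported |
OpenTelemetry.Exporter.OpenTelemetryProtocol | Exporter using the OTLP protocol |
OpenTelemetry.Exporter.Prometheus.AspNetCore | Exporter for Prometheus implemented using an ASP.NET Core endpoint |
OpenTelemetry.Exporter.Zipkin | Exporter for Zipkin tracing |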
2.6. Example: Use OpenTelemetry with Prometheus, Grafana, and Jaeger
This example uses Prometheus for metrics collection, Grafana for creating a dashboard, and Jaeger to show distributed tracing.
2.6.1. Create the project
Create a simple web API project by using the ASP.NET Core Empty template in Visual Studio or the following .NET CLI command:
dotnet new web
2.6.2. View metrics with dotnet-counters
dotnet-counters is a command-line tool that can view live metrics for .NET Core apps on demand.
-
If the dotnet-counters tool isn’t installed, run the following command:
dotnet tool update -g dotnet-counters
-
Start the testing web app.
dotnet run
info: Microsoft.Hosting.Lifetime[14]
      Now listening on: http://localhost:5000
info: Microsoft.Hosting.Lifetime[0]
      Application started. Press Ctrl+C to shut down.
-
Open a new terminal, and send test HTTP requests with curl or a browser.
watch curl -k http://localhost:5000
-
Open a new terminal, and launch dotnet-counters to monitor all metrics from the Microsoft.AspNetCore.Hosting meter.
List the dotnet processes that can be monitored:
$ dotnet-counters ps
3123 dotnet /usr/share/dotnet/dotnet dotnet run
3154 OtPrGrYa.Example /OtPrGrYa.Example/bin/Debug/net9.0/OtPrGrYa.Example
dotnet-counters monitor -n OtPrGrYa.Example --counters Microsoft.AspNetCore.Hosting
Press p to pause, r to resume, q to quit.
    Status: Running

Name                                          Current Value
[Microsoft.AspNetCore.Hosting]
    http.server.active_requests ({request})
        http.request.method url.scheme
        GET                 http              0
    http.server.request.duration (s)
        http.request.method http.response.status_code http.route network.protocol.version url.scheme Percentile
        GET                 200                       /          1.1                      http       50         0.001
        GET                 200                       /          1.1                      http       95         0.001
        GET                 200                       /          1.1                      http       99         0.001
2.6.3. Add metrics and activity definitions
The following code defines a new metric (greetings.count) for the number of times the API has been called, and a new activity source (OtPrGrJa.Example).
// using System.Diagnostics;
// using System.Diagnostics.Metrics;
// Custom metrics for the application
var greeterMeter = new Meter("OtPrGrYa.Example", "1.0.0");
var countGreetings = greeterMeter.CreateCounter<int>("greetings.count", description: "Counts the number of greetings");
// Custom ActivitySource for the application
var greeterActivitySource = new ActivitySource("OtPrGrJa.Example");
2.6.4. Create or update an API endpoint
app.MapGet("/", SendGreeting);

async Task<String> SendGreeting(ILogger<Program> logger)
{
    // Create a new Activity scoped to the method
    using var activity = greeterActivitySource.StartActivity("GreeterActivity");

    // Log a message
    logger.LogInformation("Sending greeting");

    // Increment the custom counter
    countGreetings.Add(1);

    // Add a tag to the Activity
    activity?.SetTag("greeting", "Hello World!");

    return "Hello World!";
}
The API definition does not use anything specific to OpenTelemetry. It uses the .NET APIs for observability.
2.6.5. Reference the OpenTelemetry packages
Use the NuGet Package Manager or command line to add the following NuGet packages:
<ItemGroup>
  <PackageReference Include="OpenTelemetry.Exporter.Console" Version="1.5.0" />
  <PackageReference Include="OpenTelemetry.Exporter.OpenTelemetryProtocol" Version="1.5.0" />
  <PackageReference Include="OpenTelemetry.Exporter.Prometheus.AspNetCore" Version="1.5.0-rc.1" />
  <PackageReference Include="OpenTelemetry.Exporter.Zipkin" Version="1.5.0" />
  <PackageReference Include="OpenTelemetry.Extensions.Hosting" Version="1.5.0" />
  <PackageReference Include="OpenTelemetry.Instrumentation.AspNetCore" Version="1.5.0-beta.1" />
  <PackageReference Include="OpenTelemetry.Instrumentation.Http" Version="1.5.0-beta.1" />
</ItemGroup>
Use the latest versions, as the OTel APIs are constantly evolving.
2.6.6. Configure OpenTelemetry with the correct providers
// using OpenTelemetry.Metrics;
// using OpenTelemetry.Resources;
// using OpenTelemetry.Trace;

var tracingOtlpEndpoint = builder.Configuration["OTLP_ENDPOINT_URL"];
var otel = builder.Services.AddOpenTelemetry();

// Configure OpenTelemetry Resources with the application name
otel.ConfigureResource(resource => resource
    .AddService(serviceName: builder.Environment.ApplicationName));

// Add Metrics for ASP.NET Core and our custom metrics and export to Prometheus
otel.WithMetrics(metrics => metrics
    // Metrics provider from OpenTelemetry
    .AddAspNetCoreInstrumentation()
    .AddMeter(greeterMeter.Name)
    // Metrics provided by ASP.NET Core in .NET 8
    .AddMeter("Microsoft.AspNetCore.Hosting")
    .AddMeter("Microsoft.AspNetCore.Server.Kestrel")
    .AddPrometheusExporter());

// Add Tracing for ASP.NET Core and our custom ActivitySource and export to Jaeger
otel.WithTracing(tracing =>
{
    tracing.AddAspNetCoreInstrumentation();
    tracing.AddHttpClientInstrumentation();
    tracing.AddSource(greeterActivitySource.Name);
    if (tracingOtlpEndpoint != null)
    {
        tracing.AddOtlpExporter(otlpOptions =>
        {
            otlpOptions.Endpoint = new Uri(tracingOtlpEndpoint);
        });
    }
    else
    {
        tracing.AddConsoleExporter();
    }
});
This code uses ASP.NET Core instrumentation to get metrics and activities from ASP.NET Core. It also registers the Metrics and ActivitySource providers for metrics and tracing respectively.
The code uses the Prometheus exporter for metrics, which uses ASP.NET Core to host the endpoint, so you also need to add:
// Configure the Prometheus scraping endpoint
app.MapPrometheusScrapingEndpoint();
2.6.7. Run the project
Run the project and then access the API with the browser or curl.
curl -k http://localhost:5000
Each time you request the page, it will increment the count for the number of greetings that have been made. You can access the metrics endpoint using the same base URL, with the path /metrics.
curl -k http://localhost:5000/metrics
# TYPE greetings_count_total counter
# HELP greetings_count_total Counts the number of greetings
greetings_count_total{otel_scope_name="OtPrGrYa.Example",otel_scope_version="1.0.0"} 45 1735894061045
# TYPE kestrel_active_connections gauge
# HELP kestrel_active_connections Number of connections that are currently active on the server.
kestrel_active_connections{otel_scope_name="Microsoft.AspNetCore.Server.Kestrel",network_transport="tcp",network_type="ipv4",server_address="127.0.0.1",server_port="5000"} 1 1735894061045
# TYPE kestrel_connection_duration_seconds histogram
# UNIT kestrel_connection_duration_seconds seconds
# HELP kestrel_connection_duration_seconds The duration of connections on the server.
kestrel_connection_duration_seconds_bucket{otel_scope_name="Microsoft.AspNetCore.Server
. . .
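Each sample line in this output has the shape name{labels} value timestamp, where the timestamp is in milliseconds since the Unix epoch. As a rough illustration of how to read one, the sketch below splits the greetings_count_total line into its parts. It is not a full parser: it assumes label values contain no "}" or spaces, which holds for this sample but not for the format in general.

```csharp
using System;
using System.Globalization;

// The counter sample line from the output above.
string line = "greetings_count_total{otel_scope_name=\"OtPrGrYa.Example\",otel_scope_version=\"1.0.0\"} 45 1735894061045";

// Simplifying assumption: label values contain no '}' or spaces.
int open = line.IndexOf('{');
int close = line.IndexOf('}');
string name = line[..open];
string labels = line[(open + 1)..close];
string[] rest = line[(close + 1)..].Split(' ', StringSplitOptions.RemoveEmptyEntries);

// The first field after the labels is the value, the second the timestamp in ms.
double value = double.Parse(rest[0], CultureInfo.InvariantCulture);
DateTimeOffset timestamp = DateTimeOffset.FromUnixTimeMilliseconds(
    long.Parse(rest[1], CultureInfo.InvariantCulture));

Console.WriteLine($"metric:    {name}");
Console.WriteLine($"labels:    {labels}");
Console.WriteLine($"value:     {value}");
Console.WriteLine($"timestamp: {timestamp:u}");
```

The "_total" suffix and the underscores come from the exporter translating the greetings.count counter into Prometheus naming.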
2.6.7.1. Log output
The logging statements from the code are output using ILogger. By default, the Console Provider is enabled so that output is directed to the console.
There are a couple of options for how logs can be egressed from .NET:
-
stdout and stderr output is redirected to log files by container systems such as Kubernetes.
-
Using logging libraries that integrate with ILogger, such as Serilog or NLog.
-
Using logging providers for OTel, such as OTLP or the Azure Monitor exporter.
2.6.7.2. Access the metrics
You can access the metrics using the /metrics endpoint.
$ curl -k http://localhost:5000/
Hello World!
$ curl -k http://localhost:5000/metrics
2.6.7.3. Access the tracing
If you look at the console for the server, you’ll see the output from the console trace exporter, which outputs the information in a human-readable format. This should show two activities, one from your custom ActivitySource, and the other from ASP.NET Core:
Activity.TraceId: 9ef749f2829d7837e6edd163b8b6bb81
Activity.SpanId: 45e86b6601f6b09d
Activity.TraceFlags: Recorded
Activity.ParentSpanId: d1af72ebe3cd5dba
Activity.ActivitySourceName: OtPrGrJa.Example
Activity.DisplayName: GreeterActivity
Activity.Kind: Internal
Activity.StartTime: 2023-07-19T00:44:43.2738232Z
Activity.Duration: 00:00:00.0027491
Activity.Tags:
greeting: Hello World!
Resource associated with Activity:
service.name: OtPrGrJa.Example
service.instance.id: 11a771a5-d03b-4f66-baa0-2e968bd8b981
telemetry.sdk.name: opentelemetry
telemetry.sdk.language: dotnet
telemetry.sdk.version: 1.5.0
Activity.TraceId: 9ef749f2829d7837e6edd163b8b6bb81
Activity.SpanId: d1af72ebe3cd5dba
Activity.TraceFlags: Recorded
Activity.ActivitySourceName: OpenTelemetry.Instrumentation.AspNetCore
Activity.DisplayName: /
Activity.Kind: Server
Activity.StartTime: 2023-07-19T00:44:43.2443183Z
Activity.Duration: 00:00:00.0446847
Activity.Tags:
net.host.name: localhost
net.host.port: 5138
http.method: GET
http.scheme: http
http.target: /
http.url: http://localhost:5138/
http.flavor: 1.1
http.user_agent: curl/7.88.1
http.status_code: 200
Resource associated with Activity:
service.name: OtPrGrJa.Example
service.instance.id: 11a771a5-d03b-4f66-baa0-2e968bd8b981
telemetry.sdk.name: opentelemetry
telemetry.sdk.language: dotnet
telemetry.sdk.version: 1.5.0
The first is the inner custom activity you created. The second is created by ASP.NET for the request and includes tags for the HTTP request properties.
You will see that both have the same TraceId, which identifies a single transaction and, in a distributed system, can be used to correlate the traces from each service involved in a transaction.
-
The IDs are transmitted as HTTP headers.
-
ASP.NET Core assigns a TraceId if none is present when it receives a request.
-
HttpClient includes the headers by default on outbound requests.
-
Each activity has a SpanId. The combination of TraceId and SpanId uniquely identifies each activity.
-
The Greeter activity is parented to the HTTP activity through its ParentSpanId, which maps to the SpanId of the HTTP activity.
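The way IDs travel between services can be sketched without any network calls. The example below assumes the W3C Trace Context format (the default ID format on current .NET versions), in which the header value has the layout version-traceid-spanid-flags. The source name "Demo.Propagation" and the two "services" are hypothetical; in a real app, HttpClient and ASP.NET Core do this serialization and parsing for you.

```csharp
using System;
using System.Diagnostics;

// W3C is the default ID format on current .NET versions; set it explicitly for clarity.
Activity.DefaultIdFormat = ActivityIdFormat.W3C;

using var listener = new ActivityListener
{
    ShouldListenTo = s => s.Name == "Demo.Propagation",
    Sample = (ref ActivityCreationOptions<ActivityContext> options) =>
        ActivitySamplingResult.AllDataAndRecorded,
};
ActivitySource.AddActivityListener(listener);
using var source = new ActivitySource("Demo.Propagation");

// "Service A": start a client span and capture what would go in the header.
Activity client = source.StartActivity("call service B", ActivityKind.Client)
    ?? throw new InvalidOperationException("No listener sampled the activity.");
string traceparent = client.Id!;   // layout: version-traceid-spanid-flags
ActivitySpanId clientSpanId = client.SpanId;
client.Stop();

// "Service B": parse the incoming header value and continue the same trace.
if (!ActivityContext.TryParse(traceparent, null, out ActivityContext parentContext))
{
    throw new FormatException("Not a valid traceparent value.");
}

Activity server = source.StartActivity("handle request", ActivityKind.Server, parentContext)
    ?? throw new InvalidOperationException("No listener sampled the activity.");
server.Stop();

Console.WriteLine($"traceparent:      {traceparent}");
Console.WriteLine($"same TraceId:     {server.TraceId == client.TraceId}");
Console.WriteLine($"parent is client: {server.ParentSpanId == clientSpanId}");
```

The server activity gets a new SpanId of its own but keeps the caller's TraceId, and its ParentSpanId points back at the client span, which is the same relationship the Greeter and HTTP activities show above.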
2.6.8. Collect metrics with Prometheus
Prometheus is a metrics collection, aggregation, and time-series database system.
2.6.9. Use Grafana to create a metrics dashboard
Grafana is a dashboarding product that can create dashboards and alerts based on Prometheus or other data sources.
2.6.10. Distributed tracing with Jaeger
Jaeger (pronounced "Yay-ger") is an open-source, end-to-end distributed tracing system.