Understanding how your telemetry and observability tools behave in failure scenarios is essential for building a reliable telemetry pipeline. This post discusses what happens when a receiver endpoint of the OpenTelemetry Collector becomes temporarily unavailable, for example, due to a network or service outage, or a misconfiguration.
We’ve included the relevant configuration files to help you get hands-on with OTel Collector in your environment.

About this blog series

We are passionate about open source and about building solutions on top of a solid foundation. Both the Axoflow Platform and our recently published Telemetry Controller project use the OpenTelemetry Collector under the hood. We’ve been testing the OpenTelemetry Collector intensively in different scenarios. This blog series discusses some of our findings about the behaviors, edge cases, and not-so-well-known features of the OpenTelemetry Collector. The first post is about backpressure.

What’s backpressure?

Backpressure is a phenomenon that occurs when the OpenTelemetry Collector can’t export logs, metrics, or traces at the rate it ingests them. This happens, for example, when the exporter is slow, or when the destination is unavailable due to network issues.
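To build intuition, backpressure can be demonstrated with nothing more than a Unix pipe: a fast producer fills the pipe’s buffer, and once it’s full, the producer blocks until the slow consumer catches up. (This toy sketch is just an illustration; it doesn’t involve the collector.)

```shell
# A fast producer feeding a slow consumer through a pipe: once the pipe's
# buffer (typically 64 KB) is full, the producer blocks -- that's backpressure.
( for i in $(seq 1 100000); do echo "line$i"; done ) | ( sleep 2; wc -l )
```

The producer finishes only after the consumer starts draining the pipe; no lines are lost, they are just delayed. The collector’s sending queue plays a similar buffering role, but, as we’ll see below, it drops data once full.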

How OpenTelemetry Collector handles backpressure

To find out how the OpenTelemetry Collector handles backpressure, we’ll configure it to read logs from a file and send the logs to two separate endpoints (two other OpenTelemetry Collector instances). Then we’ll shut down one of the endpoints to simulate a network outage and check the results.

Note that although reading log files is a pretty common scenario, the filelog receiver is not part of the core distribution, so you need to install OpenTelemetry Collector Contrib or the Axoflow distribution.

OpenTelemetry Collector sending data to two endpoints in a single pipeline

Default behavior

First, let’s look at the default behavior of the collector. Generate some dummy input into the input_500k.log file with the following command. This file will be the input for the sender.

for i in $(seq -f "%06g" 1 500000); do echo line$i; done > input_500k.log

Configure an OpenTelemetry Collector (we’ll refer to it as sender) to read the input_500k.log file and send its contents to two OTLP endpoints.

# Sender configuration sender_config.yaml
receivers:
  filelog:
    include: [ input_500k.log ]
    start_at: beginning

exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
  otlp/2:
    endpoint: localhost:4318
    tls:
      insecure: true

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/1, otlp/2]

Two other OpenTelemetry Collectors (dubbed receiver_1 and receiver_2) will receive the data and write it to a log file (receiver_output_1.log and receiver_output_2.log, respectively):

# Receiver 1 configuration receiver_1_config.yaml
receivers:
  otlp/from1:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317

exporters:
  file/direct:
    path: ./receiver_output_1.log
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9998
  pipelines:
    logs/direct:
      receivers: [otlp/from1]
      exporters: [file/direct]

For the second receiver:

# Receiver 2 configuration receiver_2_config.yaml
receivers:
  otlp/from2:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4318

exporters:
  file/direct:
    path: ./receiver_output_2.log
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  pipelines:
    logs/direct:
      receivers: [otlp/from2]
      exporters: [file/direct]

While the sender collector is sending logs to the receivers, stop receiver_1 and check the logs of the sender to see what happens:

2024-02-20T14:42:25.627+0100    warn    zapgrpc/zapgrpc.go:195  [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"      {"grpc_log": true}
# repeating lines omitted
2024-02-20T14:42:26.212+0100    error   exporterhelper/common.go:201    Exporting failed. Rejecting data.       {"kind": "exporter", "data_type": "logs", "name": "otlp/1", "error": "sending queue is full", "rejected_items": 100}
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
        go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:201
# some stack trace
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
        github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/adapter/receiver.go:125
2024-02-20T14:42:26.212+0100    error   consumerretry/logs.go:39        ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping.     {"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending queue is full"}

The error message mentions the retry_on_failure option; let’s have a look at what it does.

Important takeaways

In the default use case, when an outage occurs:

  • The sender OpenTelemetry Collector will try to read some more logs into the sending queue, but once that’s full, it starts dropping logs.
  • It won’t attempt to resend any logs.
  • The outage effectively blocks the other receiver as well: it still gets some logs, but its throughput is drastically reduced.
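You can watch this happen through the sender’s self-telemetry, which the configuration above exposes on 0.0.0.0:8888 (for example, with curl -s localhost:8888/metrics). The sketch below greps a hypothetical excerpt of that output; the metric names otelcol_exporter_sent_log_records and otelcol_exporter_enqueue_failed_log_records are assumptions based on the collector’s self-telemetry naming conventions, so verify them against your version.

```shell
# Hypothetical excerpt of the sender's Prometheus self-telemetry; in a live
# setup you would fetch it with: curl -s localhost:8888/metrics
cat <<'EOF' > metrics_sample.txt
otelcol_exporter_sent_log_records{exporter="otlp/2"} 402000
otelcol_exporter_enqueue_failed_log_records{exporter="otlp/1"} 98000
EOF
# A rising enqueue_failed counter for otlp/1 means its queue is full
# and incoming data is being dropped
grep 'enqueue_failed_log_records' metrics_sample.txt
rm metrics_sample.txt
```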

Commands to run the collector instances

If you want to run the three collector instances with the above configuration snippets, run the following commands. Otherwise, you can skip to the next section.

mkdir -p working_dir

For receiver 1:

docker run -v $(pwd)/receiver_1_config.yaml:/etc/otelcol-contrib/config.yaml -v $(pwd)/working_dir:/working_dir/ --network=host --user $(id -u):$(id -g) ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.94.0

For receiver 2:

docker run -v $(pwd)/receiver_2_config.yaml:/etc/otelcol-contrib/config.yaml -v $(pwd)/working_dir:/working_dir/ --network=host --user $(id -u):$(id -g) ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.94.0

For the sender:

docker run -v $(pwd)/sender_config.yaml:/etc/otelcol-contrib/config.yaml -v $(pwd)/input_500k.log:/input_500k.log --network=host ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.94.0

Using retry_on_failure

The filelog receiver’s README.md contains a table listing each configuration field with its default value and a short description. Currently, we’re interested in these two:

  • retry_on_failure.enabled: If true, the receiver will pause reading a file and attempt to resend the current batch of logs if it encounters an error from downstream components. (Default: false)
  • retry_on_failure.max_elapsed_time: Maximum amount of time (including retries) spent trying to send a logs batch to a downstream consumer. Once this value is reached, the data is discarded. Retrying never stops if set to 0. (Default: 5m)
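Both retry_on_failure blocks implement an exponential backoff and expose more knobs than the two above. A sketch for the exporter side (the interval values are illustrative; check the exporterhelper documentation for the actual defaults):

```yaml
exporters:
  otlp/1:
    endpoint: localhost:4317
    retry_on_failure:
      enabled: true
      initial_interval: 5s   # wait before the first retry
      max_interval: 30s      # cap on the exponentially growing interval
      max_elapsed_time: 0    # 0 disables the overall time limit
```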

Let’s enable the retry_on_failure option on the sender and disable the time limit of the retries, because a network outage can last much longer than the default five minutes.

# Updated sender_config.yaml
receivers:
  filelog:
    include: [ input_500k.log ]
    start_at: beginning
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0

exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
  otlp/2:
    endpoint: localhost:4318
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/1, otlp/2]

With these modifications in place, restart all three OpenTelemetry Collectors, then stop one of the receivers. Check the logs of the sender:

2024-02-21T12:06:23.432+0100    info    fileconsumer/file.go:261        Started watching file   {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "input_500k.log"}
2024-02-21T12:06:26.470+0100    warn    zapgrpc/zapgrpc.go:195  [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused"        {"grpc_log": true}
2024-02-21T12:06:26.470+0100    info    exporterhelper/retry_sender.go:118      Exporting failed. Will retry the request after interval.        {"kind": "exporter", "data_type": "logs", "name": "otlp/1", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "5.039505969s"}
# repeating errors
2024-02-21T12:06:27.052+0100    error   exporterhelper/common.go:201    Exporting failed. Rejecting data.       {"kind": "exporter", "data_type": "logs", "name": "otlp/1", "error":"sending queue is full", "rejected_items": 100}
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
        go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:201
# rest of the stacktrace

As you can see, the sender now periodically tries to resend the logs to the endpoint. The outage still blocks the other endpoint; we will address this problem in the next section.

However, at the time of this writing (version 0.95.0) the collector shows some unusual behavior: when you restart the receiver_1 collector, it replaces the existing output file, so all data received before the outage is lost. Hopefully, this will be fixed or at least made configurable in a future release. (You can also check our suggested PR.)

But if you back up the output file before the restart and merge the backed-up file with the current output file, you can compare the result to the original input file:
cat receiver_output_1.log.bak receiver_output_1.log | grep -E 'line[0-9]+' -o -w | sort > received.txt

git diff --exit-code received.txt input_500k.log && echo $?
0

You can see that despite the downtime, no logs were lost. Now check the uninterrupted receiver’s output:

cat receiver_output_2.log | wc -l
501100
git diff input_500k.log receiver_output_2.log | head -n 35
diff --git a/input_500k.log b/receiver_output_2.log
index 3a8b57a..3c7d775 100644
--- a/input_500k.log
+++ b/receiver_output_2.log
@@ -402010,104 +402010,1204 @@ line402009
 line402010
 line402011
 line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
 line402013
+line402013
+line402014
+line402014
+line402014
+line402014
As you can see, there are unfortunately a number of duplicates. Whether this behavior is acceptable depends on your specific use case. The next section shows a solution for that.
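If you want to quantify the duplication, a short shell sketch helps; here it runs on a tiny hypothetical sample file instead of the real receiver output:

```shell
# Simulate an output file containing resent (duplicated) lines
printf 'line000001\nline000002\nline000002\nline000003\n' > dup_sample.log
# Count how many distinct lines appear more than once
sort dup_sample.log | uniq -d | wc -l
rm dup_sample.log
```

Running the same pipeline against receiver_output_2.log shows how many distinct log lines were delivered more than once.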

Important takeaways

When using the retry_on_failure option:

  • It’s not enough to enable resending on failure; you also have to increase or disable the time limit of the retries.
  • The output file is truncated if the collector is restarted, potentially losing logs.
  • Resending the logs can cause duplication.

Possible mitigation

One way to avoid duplicating messages is to create a separate receiver and a separate pipeline for each exporter in the sender configuration:

OpenTelemetry Collector sending data to two endpoints in separate pipelines
receivers:
  filelog/1:
    include: [ input_500k.log ]
    start_at: beginning
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
  filelog/2:
    include: [ input_500k.log ]
    start_at: beginning
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0

exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0

  otlp/2:
    endpoint: localhost:4318
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0

service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    logs/1:
      receivers: [filelog/1]
      exporters: [otlp/1]
    logs/2:
      receivers: [filelog/2]
      exporters: [otlp/2]

As this solution can be memory-intensive, it’s advisable to configure the memory_limiter processor.
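A minimal memory_limiter sketch for the sender configuration above (the limit values are illustrative; tune them to your environment). Note that the processor should come first in each pipeline’s processor list:

```yaml
processors:
  memory_limiter:
    check_interval: 1s    # how often memory usage is checked
    limit_mib: 400        # hard limit; above this, incoming data is refused
    spike_limit_mib: 100  # headroom reserved for sudden spikes

service:
  pipelines:
    logs/1:
      receivers: [filelog/1]
      processors: [memory_limiter]
      exporters: [otlp/1]
    logs/2:
      receivers: [filelog/2]
      processors: [memory_limiter]
      exporters: [otlp/2]
```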

Closing thoughts

The filelog receiver’s current default behavior might be surprising if you’re used to the behavior of other log aggregators. So if you are switching to the OpenTelemetry Collector, make sure to double-check your assumptions and expectations. Fortunately, the OpenTelemetry Collector’s documentation is exemplary and well organized: information is close at hand. Another surprising behavior is the file exporter’s truncation of its output file on restart, which at the moment isn’t configurable.

If you are interested in our future posts about OpenTelemetry Collector, sign up to our newsletter!
