About this blog series
We are passionate about open source and building solutions on top of a solid foundation. Both the Axoflow Platform and our recently published project, Telemetry Controller, use the OpenTelemetry Collector under the hood. We’ve been testing the OpenTelemetry Collector intensively in different scenarios. This blog series discusses some of our findings about the behaviors, edge cases, and not-so-well-known features of the OpenTelemetry Collector. The first post is about backpressure.
What’s backpressure?
Backpressure is a phenomenon that occurs when the OpenTelemetry Collector can’t export the logs, metrics, or traces at the rate it ingested them. This occurs, for example, when the exporter is slow, or the destination is unavailable due to network issues.
How OpenTelemetry Collector handles backpressure
To find out how the OpenTelemetry Collector handles backpressure, we’ll configure it to read logs from a file and send the logs to two separate endpoints (two other OpenTelemetry Collector instances). Then we’ll shut down one of the endpoints to simulate a network outage and check the results.
Note that although reading log files is a pretty common scenario, the filelog receiver is not part of the core distribution, so you need to install the OpenTelemetry Collector Contrib or the Axoflow distribution.
Default behavior
First, take a look at the default behavior of the collector. For that, generate some dummy input into the input_500k.log file with the following command. This file will be the input for the sender.
for i in $(seq -f "%06g" 1 500000); do echo line$i; done > input_500k.log
Configure an OpenTelemetry Collector (we’ll refer to it as sender) to read the input_500k.log file and send its contents to two OTLP endpoints.
# Sender configuration sender_config.yaml
receivers:
  filelog:
    include: [ input_500k.log ]
    start_at: beginning
exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
  otlp/2:
    endpoint: localhost:4318
    tls:
      insecure: true
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/1, otlp/2]
Two other OpenTelemetry Collectors (dubbed receiver_1 and receiver_2) will receive the data and write it to a log file (receiver_output_1.log and receiver_output_2.log, respectively):
# Receiver 1 configuration receiver_1_config.yaml
receivers:
  otlp/from1:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  file/direct:
    path: ./receiver_output_1.log
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9998
  pipelines:
    logs/direct:
      receivers: [otlp/from1]
      exporters: [file/direct]
For the second receiver:
# Receiver 2 configuration receiver_2_config.yaml
receivers:
  otlp/from2:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4318
exporters:
  file/direct:
    path: ./receiver_output_2.log
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:9999
  pipelines:
    logs/direct:
      receivers: [otlp/from2]
      exporters: [file/direct]
While the sender collector is sending logs to the receivers, stop receiver_1 and check the logs of the sender to see what happens:
2024-02-20T14:42:25.627+0100 warn zapgrpc/zapgrpc.go:195 [core] [Channel #3 SubChannel #4] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused" {"grpc_log": true}
# repeating lines omitted
2024-02-20T14:42:26.212+0100 error exporterhelper/common.go:201 Exporting failed. Rejecting data. {"kind": "exporter", "data_type": "logs", "name": "otlp/1", "error": "sending queue is full", "rejected_items": 100}
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:201
# some stack trace
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/stanza/adapter.(*receiver).consumerLoop
github.com/open-telemetry/opentelemetry-collector-contrib/pkg/[email protected]/adapter/receiver.go:125
2024-02-20T14:42:26.212+0100 error consumerretry/logs.go:39 ConsumeLogs() failed. Enable retry_on_failure to slow down reading logs and avoid dropping. {"kind": "receiver", "name": "filelog", "data_type": "logs", "error": "sending queue is full"}
The error message mentions the retry_on_failure option, so let’s take a look at what it does!
Important takeaways
In the default use case, when an outage occurs:
- The sender OpenTelemetry Collector will try to read some more logs into the sending queue, but once that’s full, it starts dropping logs (see the queue sketch after this list).
- It won’t attempt to resend any logs.
- Also, the outage effectively blocks the other receiver: it will get some logs, but its throughput is drastically reduced.
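The queue mentioned in the error is the exporter’s sending queue, which buffers batches before they are exported. Its size can be raised per exporter to ride out short outages; a minimal sketch on top of the sender configuration above (the numbers are illustrative, not values from our test) could look like this:
exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
    sending_queue:
      enabled: true       # enabled by default for the otlp exporter
      num_consumers: 10   # parallel senders draining the queue
      queue_size: 5000    # number of batches held before new data is rejected
A larger queue only delays the problem, though: once it fills up, the collector rejects data again, which is why the rest of this post focuses on retry_on_failure.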
Commands to run the collector instances
If you want to run the three collector instances with the above configuration snippets, run the following commands. Otherwise, you can skip to the next section.
mkdir -p working_dir
For receiver 1:
docker run -v $(pwd)/receiver_1_config.yaml:/etc/otelcol-contrib/config.yaml -v $(pwd)/working_dir:/working_dir/ --network=host --user $(id -u):$(id -g) ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.94.0
For receiver 2:
docker run -v $(pwd)/receiver_2_config.yaml:/etc/otelcol-contrib/config.yaml -v $(pwd)/working_dir:/working_dir/ --network=host --user $(id -u):$(id -g) ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.94.0
For the sender:
docker run -v $(pwd)/sender_config.yaml:/etc/otelcol-contrib/config.yaml -v $(pwd)/input_500k.log:/input_500k.log --network=host ghcr.io/open-telemetry/opentelemetry-collector-releases/opentelemetry-collector-contrib:0.94.0
Using retry_on_failure
The filelog receiver’s README.md has a table listing every configuration field, its default value, and a short description. Currently, we’re interested in these two:
- retry_on_failure.enabled: If true, the receiver will pause reading a file and attempt to resend the current batch of logs if it encounters an error from downstream components. (Default: false)
- retry_on_failure.max_elapsed_time: Maximum amount of time (including retries) spent trying to send a logs batch to a downstream consumer. Once this value is reached, the data is discarded. Retrying never stops if set to 0. (Default: 5m)
Let’s enable the retry_on_failure option on the sender, and disable the time limit of the retries, because a network outage can be way longer than the default five minutes.
# Updated sender_config.yaml
receivers:
  filelog:
    include: [ input_500k.log ]
    start_at: beginning
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
  otlp/2:
    endpoint: localhost:4318
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    logs:
      receivers: [filelog]
      exporters: [otlp/1, otlp/2]
With these modifications in place, restart all three OpenTelemetry Collectors, then stop one of the receivers. Check the logs of the sender:
2024-02-21T12:06:23.432+0100 info fileconsumer/file.go:261 Started watching file {"kind": "receiver", "name": "filelog", "data_type": "logs", "component": "fileconsumer", "path": "input_500k.log"}
2024-02-21T12:06:26.470+0100 warn zapgrpc/zapgrpc.go:195 [core] [Channel #1 SubChannel #2] grpc: addrConn.createTransport failed to connect to {Addr: "localhost:4317", ServerName: "localhost:4317", }. Err: connection error: desc = "transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused" {"grpc_log": true}
2024-02-21T12:06:26.470+0100 info exporterhelper/retry_sender.go:118 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlp/1", "error": "rpc error: code = Unavailable desc = connection error: desc = \"transport: Error while dialing: dial tcp 127.0.0.1:4317: connect: connection refused\"", "interval": "5.039505969s"}
# repeating errors
2024-02-21T12:06:27.052+0100 error exporterhelper/common.go:201 Exporting failed. Rejecting data. {"kind": "exporter", "data_type": "logs", "name": "otlp/1", "error":"sending queue is full", "rejected_items": 100}
go.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send
go.opentelemetry.io/collector/[email protected]/exporterhelper/common.go:201
# rest of the stacktrace
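At this point, restart receiver_1 to end the simulated outage. Before restarting it, save its current output; a plain copy like the following (a hypothetical step, shown only because the verification command below relies on a receiver_output_1.log.bak file) is enough:
cp receiver_output_1.log receiver_output_1.log.bak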
The backup is needed because when you restart the receiver_1 collector, it replaces the existing output file, so all data received before the outage gets lost. Hopefully, this will be fixed or at least configurable in the next release. (You can also check our suggested PR.) Once the sender has caught up, combine the two output files, extract the log lines, and sort them for comparison with the input:
cat receiver_output_1.log.bak receiver_output_1.log | grep -E 'line[0-9]+' -o -w | sort > received.txt
git diff --exit-code received.txt input_500k.log && echo $?
0
You can see that despite the downtime, no logs were lost. Now check the uninterrupted receiver’s output:
cat receiver_output_2.log | wc -l
501100
git diff input_500k.log receiver_output_2.log | head -n 35
diff --git a/input_500k.log b/receiver_output_2.log
index 3a8b57a..3c7d775 100644
--- a/input_500k.log
+++ b/receiver_output_2.log
@@ -402010,104 +402010,1204 @@ line402009
line402010
line402011
line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402012
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
+line402013
line402013
+line402013
+line402014
+line402014
+line402014
+line402014
Important takeaways
When using the retry_on_failure option:
- It’s not enough to enable resending on failure; you also have to increase or disable the time limit of the retries.
- The output file of the file exporter is truncated if the receiving collector is restarted, potentially losing logs.
- Resending the logs can cause duplication (see the quick check below).
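As a quick way to quantify the duplication in this test (assuming the same file names as above), count the log lines that appear more than once in the uninterrupted receiver’s output:
grep -E -o -w 'line[0-9]+' receiver_output_2.log | sort | uniq -d | wc -l
Every line counted here is a record that receiver_2 wrote at least twice.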
Possible mitigation
One way to avoid duplicating messages is to create a separate receiver and a separate pipeline for each exporter in the sender configuration:
receivers:
  filelog/1:
    include: [ input_500k.log ]
    start_at: beginning
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
  filelog/2:
    include: [ input_500k.log ]
    start_at: beginning
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
exporters:
  otlp/1:
    endpoint: localhost:4317
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
  otlp/2:
    endpoint: localhost:4318
    tls:
      insecure: true
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
service:
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    logs/1:
      receivers: [filelog/1]
      exporters: [otlp/1]
    logs/2:
      receivers: [filelog/2]
      exporters: [otlp/2]
As this solution can be expensive memory-wise, it’s advisable to configure the memory_limiter processor.
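A minimal sketch of that (the limits are illustrative, not tuned values from our test) adds the processor and references it from both pipelines of the configuration above:
processors:
  memory_limiter:
    check_interval: 1s    # how often memory usage is checked
    limit_mib: 400        # hard memory limit for the collector process
    spike_limit_mib: 100  # subtracted from limit_mib to get the soft limit where data starts being refused
service:
  pipelines:
    logs/1:
      receivers: [filelog/1]
      processors: [memory_limiter]
      exporters: [otlp/1]
    logs/2:
      receivers: [filelog/2]
      processors: [memory_limiter]
      exporters: [otlp/2]
The memory_limiter is typically placed first in the processor list, so it can apply backpressure toward the receivers before the rest of the pipeline allocates more memory.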
Closing thoughts
The current filelog receiver’s default behavior might be surprising if you’re used to the behavior of other aggregators. So if you are switching to the OpenTelemetry Collector, make sure to double-check your assumptions and expectations. Fortunately, the OpenTelemetry Collector’s documentation is exemplary and well organized: information is close at hand. Another surprising behavior is the file exporter’s truncation of its output file on restart, which at the moment isn’t configurable.
If you are interested in our future posts about OpenTelemetry Collector, sign up to our newsletter!
