In distributed systems, one of the most critical requirements is ensuring that no telemetry data is lost, even during system failures or network issues. The OpenTelemetry Collector’s persistence mechanism reliably handles events like collector restarts or downstream unavailability, preventing telemetry data loss. Understanding how these mechanisms work under the hood is crucial for administrators who need reliable telemetry data collection and transmission in their environments.
Since the v0.0.10 release, we’ve made awesome improvements to the Telemetry Controller (which uses our custom OpenTelemetry Collector distribution for log collection), focusing on its reliability and operational capabilities. Some of our key additions include:
- Added authentication mechanisms with secret injection support for OTLPHTTP and OTLPGRPC
- Introduced the Bridge CR, enabling advanced tenant routing functionality
- Enhanced configuration validation to prevent misconfigurations
- Stabilized operator behavior with various reliability improvements
- Introduced the batch processor for optimized data handling
In this article, we’ll dive deep into one of the most critical features we’ve exposed in the Telemetry Controller: the persistence and retry mechanisms.
Why Do You Need Persistence?
The persistence feature primarily relies on storage extensions. For the Telemetry Controller, we use the file storage extension, as it covers all our use cases at the moment. The OTEL Collector contrib distribution supports other storage extensions as well; a configuration sketch of the file storage extension follows the list below:
- File Storage: The most commonly used persistence method is to store the data on the local filesystem. It offers:
- Configurable storage directory with customizable permissions
- File locking mechanism with adjustable timeouts
- Advanced compaction strategies to optimize storage space
- Data integrity protection through fsync capabilities
- Automatic directory creation and cleanup options
- Database Storage: For environments requiring structured storage, supporting:
- Multiple SQL databases including SQLite and PostgreSQL
- Flexible driver configuration
- Transaction-safe operations
- Redis Storage: For distributed environments, offering:
- Cluster support for high availability
- In-memory performance
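To make these options more concrete, here is a minimal sketch of a file storage extension configuration. The field names follow the contrib file_storage extension, but the directory, timeout, and compaction values below are illustrative, not the ones the Telemetry Controller generates:

extensions:
  file_storage/tenant-web:
    directory: /var/lib/otelcol/file_storage/tenant-web  # where the persisted state is written
    timeout: 10s                                         # how long to wait for the file lock
    create_directory: true                               # create the directory if it is missing
    fsync: true                                          # fsync after writes for stronger integrity guarantees
    compaction:
      on_start: true                                     # compact the store when the collector starts
      directory: /var/lib/otelcol/file_storage/tenant-web
      max_transaction_size: 65536

service:
  extensions: [file_storage/tenant-web]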
Preventing Data Loss: Beyond Persistence
While persistence is crucial, the Exporter Helper functionality of the OpenTelemetry Collector provides additional mechanisms to prevent data loss. You can see these in the figure showing the key components of the Consumer-Producer Architecture used by the Collector (Ref: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md):
The key points of this architecture are the following:
- Multiple consumers can process the queued items concurrently
- Temporary and permanent failures are handled separately
- Smart batch dispatch mechanism
- Automatic queue management
The implementation offers the following important mechanisms (a configuration sketch follows the list):
- Queue Mechanism options:
- In-memory queue with configurable size
- Multiple concurrent consumers
- Queue sizing based on traffic patterns
- Retry Strategy:
- Options to control what happens to telemetry data that couldn’t be delivered
- Progressive backoff with customizable intervals
- Timeout controls for individual retry attempts
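For reference, here is a hedged sketch of what these exporter helper settings look like in a collector configuration. The endpoint and values are placeholders, not the settings used later in the demo:

exporters:
  otlp:
    endpoint: my-backend.example:4317   # placeholder endpoint
    timeout: 10s                        # timeout for a single send attempt
    retry_on_failure:
      enabled: true
      initial_interval: 5s              # first backoff interval
      max_interval: 30s                 # upper bound of the progressive backoff
      max_elapsed_time: 0               # 0 means retry forever, never drop data
    sending_queue:
      enabled: true
      num_consumers: 10                 # concurrent consumers draining the queue
      queue_size: 100                   # queue capacity, in batches
      storage: file_storage             # optional: back the queue with a storage extension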
Testing Persistence in the OpenTelemetry Collector
During the development of the Telemetry Controller, our primary goal was to create a comprehensive solution that eliminates the complexity typically associated with telemetry collection.
A critical aspect of this reliability is persistence. In production environments, losing telemetry data isn’t just an inconvenience; it can mean:
- missing critical insights about your system’s behavior,
- losing important audit trails, or
- failing to catch performance issues.
Let’s explore how the OpenTelemetry Collector handles various scenarios in practice.
- We’ll create two tenants (tenant-web and tenant-db).
- Both tenants will collect logs from the corresponding namespace they are named after (web and db).
- Each of them will have its own persistent file storage, which is used as storage for their receiver and exporters.
- Retry on failure is also set on every receiver and exporter with max_elapsed_time set to 0, which means that no logs should be thrown away at any point in time.
- Each exporter will have a sending queue with a default queue size of 100 batches.
The following figure shows the main configuration options of the components, as well as their effects and relationships.
We’ll test two cases:
- When a downstream output becomes unavailable
- When a collector failure occurs
Setup
NOTE: You can find the example manifests used in this demonstration in the Telemetry Controller’s repository.
To make it easy to follow along and experiment with different scenarios, we’ll show you how to run our test cases in a local Kubernetes environment running on Kind. Everything we’re discussing is fully applicable to OpenTelemetry Collector, but in the demos we’ll be using Telemetry Controller, as that’s easier to use and configure.
- First, create a cluster and install the Telemetry Controller.
kind create cluster

# This installs the Telemetry Controller, and opentelemetry-operator as a sub-chart.
helm upgrade --install --wait --create-namespace --namespace telemetry-controller-system telemetry-controller oci://ghcr.io/kube-logging/helm-charts/telemetry-controller --version=0.0.15
- Deploy the Telemetry Controller components.
- Namespaces
kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/ns.yaml
- Telemetry Controller resources
kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/tc-resources/collector.yaml,https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/tc-resources/tc.yaml
Upon applying these resources, the Telemetry Controller creates a collector with the configuration described above: per-tenant persistent file storage, retries with max_elapsed_time set to 0, and a sending queue on every exporter.
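The full generated configuration is fairly long, so here is a heavily simplified, illustrative excerpt for a single tenant. The resource names roughly follow what the demo uses, but the include path, the endpoint, the TLS setting, and the omitted processors are assumptions; consult the actual rendered collector configuration for the exact values:

extensions:
  file_storage/tenant-web:
    directory: /var/lib/otelcol/file_storage/tenant-web

receivers:
  filelog/tenant-web:
    include: ["/var/log/pods/web_*/*/*.log"]   # assumed pattern for logs of the "web" namespace
    storage: file_storage/tenant-web           # persist file offsets across restarts
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0                      # keep retrying, never drop logs

exporters:
  otlp/web_output-web:
    endpoint: collector-receiver-web-collector.web.svc.cluster.local:4317   # assumed receiver service
    tls:
      insecure: true                           # assumption for the local demo
    retry_on_failure:
      enabled: true
      max_elapsed_time: 0
    sending_queue:
      enabled: true
      queue_size: 100                          # 100 batches, as described above
      storage: file_storage/tenant-web         # persistent sending queue

service:
  extensions: [file_storage/tenant-web]
  pipelines:
    logs/tenant-web:
      receivers: [filelog/tenant-web]
      exporters: [otlp/web_output-web]         # processors omitted for brevity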
- To generate an exact amount of logs for the tests, deploy the Log Generator.
helm upgrade --install --wait --namespace web log-generator oci://ghcr.io/kube-logging/helm-charts/log-generator --set app.count=0 --set app.eventPerSec=50
helm upgrade --install --wait --namespace db log-generator oci://ghcr.io/kube-logging/helm-charts/log-generator --set app.count=0 --set app.eventPerSec=50
The app.count=0 option is needed so that the log-generator doesn’t start producing logs when it’s initialized.
- Deploy the receiver collectors. Two such collectors are deployed; the only difference is in the name and namespace fields (web and db). The receivers receive logs using OTLPGRPC, and export the received data into a file.
kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/receivers/receiver-db.yaml,https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/receivers/receiver-web.yaml
Now that everything is deployed and configured, let’s test the OTEL Collector and see how it handles different events that disrupt the telemetry pipeline.
Scenario 1: Unavailable Downstream Output
In this scenario, we’ll examine how the OpenTelemetry Collector behaves when a single output destination becomes unavailable. This is a common situation that can cause backpressure on upstream components, whether due to network issues, system maintenance, or service outages. We’ll see how the Collector’s persistence and retry mechanisms work together to ensure no data is lost during the downtime.
- Check that the previously deployed components are up and running (only showing relevant resources):
kubectl get pods -A
NAMESPACE                     NAME                                                           READY   STATUS
collector                     otelcollector-cluster-collector-8t9vn                          1/1     Running
db                            collector-receiver-db-collector-5764f9fcb8-gjlbs               1/1     Running
telemetry-controller-system   telemetry-controller-9bfc7bd5-ntx99                            1/1     Running
telemetry-controller-system   telemetry-controller-opentelemetry-operator-7f9f6d6c97-579p4   2/2     Running
web                           collector-receiver-web-collector-8d78dc74d-nq59n               1/1     Running
and
kubectl get telemetry-all -A
NAME                                           TENANTS                      STATE
collector.telemetry.kube-logging.dev/cluster   ["tenant-db","tenant-web"]   ready

NAMESPACE   NAME
db          output.telemetry.kube-logging.dev/output-db
web         output.telemetry.kube-logging.dev/output-web

NAMESPACE   NAME                                              TENANT       OUTPUTS                                     STATE
db          subscription.telemetry.kube-logging.dev/sub-db    tenant-db    [{"name":"output-db","namespace":"db"}]     ready
web         subscription.telemetry.kube-logging.dev/sub-web   tenant-web   [{"name":"output-web","namespace":"web"}]   ready

NAME                                           SUBSCRIPTIONS                            LOGSOURCE NAMESPACES   STATE
tenant.telemetry.kube-logging.dev/tenant-db    [{"name":"sub-db","namespace":"db"}]     ["db"]                 ready
tenant.telemetry.kube-logging.dev/tenant-web   [{"name":"sub-web","namespace":"web"}]   ["web"]                ready
- Hit the log-generator’s loggen endpoint and make it produce 10k Apache access logs:
kubectl port-forward --namespace "web" "deployments/log-generator" 11000:11000 &
kubectl port-forward --namespace "db" "deployments/log-generator" 11001:11000 &
curl --location --request POST '127.0.0.1:11000/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'
curl --location --request POST '127.0.0.1:11001/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'
- Verify that both receivers are working by checking the contents of the file they’re writing into:
docker exec kind-control-plane head /etc/otelcol-contrib/export-web.log | jq
docker exec kind-control-plane head /etc/otelcol-contrib/export-db.log | jq
Example output (truncated):
{ "resourceLogs": [ { "scopeLogs": [ { "logRecords": [ { "timeUnixNano": "1734689097170297463", "observedTimeUnixNano": "1734689097283413254", "body": { "stringValue": "135.93.33.38 - - [20/Dec/2024:10:04:57 +0000] - \"PUT /index.html HTTP/1.1\" 200 18175 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50\" \"-\"" }, "attributes": [ { "key": "log.file.path", "value": { "stringValue": "/var/log/pods/db_log-generator-77797dff76-q8v5x_a7331723-b9b2-4356-9cc9-f08064da1c8e/log-generator/0.log" } }, { "key": "time", "value": { "stringValue": "2024-12-20T10:04:57.170297463Z" } }, { "key": "log.iostream", "value": { "stringValue": "stdout" } }, { "key": "logtag", "value": { "stringValue": "F" } }, { "key": "tenant", "value": { "stringValue": "tenant-db" } }, { "key": "subscription", "value": { "stringValue": "sub-db" } }, { "key": "exporter", "value": { "stringValue": "otlp/db_output-db" } } ], } ] } ] } ] }
- Let’s cause an outage by shutting down the db receiver:
kubectl delete opentelemetrycollector collector-receiver-db --namespace db
Because of this, the cluster collector starts reporting the outage:
kubectl logs -n collector daemonsets/otelcollector-cluster-collector
2024-12-20T10:14:41.789Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlp/db_output-db", "error": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded", "interval": "4.22153556s"}
2024-12-20T10:14:41.988Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlp/db_output-db", "error": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded", "interval": "3.713328578s"}
2024-12-20T10:14:42.192Z info internal/retry_sender.go:126 Exporting failed. Will retry the request after interval. {"kind": "exporter", "data_type": "logs", "name": "otlp/db_output-db", "error": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded", "interval": "4.424550142s"}
Once the sending queue fills up entirely (it’s set to 100 batches), the exporter starts rejecting incoming data:
2024-12-20T10:16:52.639Z error internal/base_exporter.go:130 Exporting failed. Rejecting data. { "kind": "exporter", "data_type": "logs", "name": "otlp/db_output-db", "error": "sending queue is full", "rejected_items": 10 }
- Take a look at the file storage directories:
docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-web
docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-db
Expected output:
exporter_otlp_web_output-web_logs  receiver_filelog_tenant-web
exporter_otlp_db_output-db_logs    receiver_filelog_tenant-db
There are separate files for each tenant’s receiver and exporter. For this scenario, let’s check out the contents of the files created for the exporters:
docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-web/exporter_otlp_web_output-web_logs | tail -n 50 | strings
docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-db/exporter_otlp_db_output-db_logs | tail -n 50 | strings
Output (truncated):
tenant-db2 subscription sub-db2 exporter otlp/db_output-dbJ ßHV6*O 160.38.112.115 - - [10/Jan/2025:10:49:34 +0000] - "POST /blog HTTP/1.1" 503 16019 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36" "-"2 logtag time 2025-01-10T10:49:34.046881959Z2 log.iostream stdout2{ log.file.path h/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log2 tenant tenant-db2 subscription sub-db2 exporter otlp/db_output-dbJ 114.83.144.155 - - [10/Jan/2025:10:49:34 +0000] - "POST /blog HTTP/1.1" 200 13679 "-" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" "-"2{ log.file.path h/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log2 log.iostream stdout2( time 2025-01-10T10:49:34.067475959Z2 logtag tenant tenant-db2 subscription sub-db2 exporter otlp/db_output-dbJ ˆ—8*O 67.38.122.58 - - [10/Jan/2025:10:49:34 +0000] - "GET /index.html HTTP/1.1" 503 12376 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; MALNJS; rv:11.0) like Gecko" "-"2( time 2025-01-10T10:49:34.088541834Z2{ log.file.path h/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log2 log.iostream stdout2 logtag tenant tenant-db2 subscription sub-db2 exporter otlp/db_output-dbJ
Key points: The data in the persisted files appears as a mix of readable text and binary content because:
- The OpenTelemetry Collector pipeline uses Protocol Buffers (protobuf) serialization for efficient storage:
- After data is received through a receiver component, it is converted into pipeline data or pdata for short.
- All data processing within the pipeline happens using this pdata format.
- The internal representation uses OTLP ProtoBuf structs.
- Before transmission, the exporters convert the data to the appropriate format.
- Protobuf uses a compact binary encoding for all data, including metadata, control fields, and log content:
- Strings, such as log content, are stored as UTF-8 within the binary representation.
- This format ensures both storage efficiency and reliable state persistence.
NOTE: When viewed with strings, we see only the UTF-8 encoded portions, making the content appear fragmented.
- Restart the receiver. Once it’s up and running again, the collector stops logging export failures.
kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/receivers/receiver-db.yaml
- Verify that no logs were lost in either tenant. The following jq commands parse each batch and output the exact number of logs exported.
docker exec kind-control-plane cat /etc/otelcol-contrib/export-web.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l
docker exec kind-control-plane cat /etc/otelcol-contrib/export-db.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l
Expected output:
10000
10000
Results:
- The web tenant wasn’t affected by the outage, so it received 10k messages.
- The db tenant was also able to handle the outage, and transmitted all 10k messages.
Scenario 2: Collector Failure
Let’s see how the Collector handles a failure scenario when the Collector itself becomes unavailable. We’ll explore how the built-in resilience mechanisms ensure data integrity and recovery during such failures. (We assume that the previously deployed components are up and running.)
- Hit the log-generator’s loggen endpoint again and make it produce 10k Apache access logs.
kubectl port-forward --namespace "web" "deployments/log-generator" 11000:11000 &
kubectl port-forward --namespace "db" "deployments/log-generator" 11001:11000 &
curl --location --request POST '127.0.0.1:11000/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'
curl --location --request POST '127.0.0.1:11001/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'
- Verify that both receivers are working by checking the contents of the file they’re writing into.
docker exec kind-control-plane head /etc/otelcol-contrib/export-web.log | jq
docker exec kind-control-plane head /etc/otelcol-contrib/export-db.log | jq
Example output (truncated):
{ "resourceLogs": [ { "scopeLogs": [ { "logRecords": [ { "timeUnixNano": "1734689097170297463", "observedTimeUnixNano": "1734689097283413254", "body": { "stringValue": "135.93.33.38 - - [20/Dec/2024:10:04:57 +0000] - \"PUT /index.html HTTP/1.1\" 200 18175 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50\" \"-\"" }, "attributes": [ { "key": "log.file.path", "value": { "stringValue": "/var/log/pods/db_log-generator-77797dff76-q8v5x_a7331723-b9b2-4356-9cc9-f08064da1c8e/log-generator/0.log" } }, { "key": "time", "value": { "stringValue": "2024-12-20T10:04:57.170297463Z" } }, { "key": "log.iostream", "value": { "stringValue": "stdout" } }, { "key": "logtag", "value": { "stringValue": "F" } }, { "key": "tenant", "value": { "stringValue": "tenant-db" } }, { "key": "subscription", "value": { "stringValue": "sub-db" } }, { "key": "exporter", "value": { "stringValue": "otlp/db_output-db" } } ], } ] } ] } ] }
- Cause an outage by shutting down the collector.
kubectl delete collector cluster
- Take a look at the file storage directories.
docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-web
docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-db
Expected output:
exporter_otlp_web_output-web_logs  receiver_filelog_tenant-web
exporter_otlp_db_output-db_logs    receiver_filelog_tenant-db
For this scenario, let’s check out the contents of the file created for the receivers:
docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-web/receiver_filelog_tenant-web
docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-db/receiver_filelog_tenant-db
Output (truncated):
file_input.knownFiles2 [ { "Fingerprint": { "first_bytes": "MjAyNS0wMS0xMFQxMDo0ODo1OS42MTEzOTQyMjFaIHN0ZG91dCBGIFVzaW5nIGNvbmZpZzogL2NvbmYvY29uZmlnLnRvbWwKMjAyNS0wMS0xMFQxMDo0ODo1OS42MTE0NzMyNjNaIHN0ZG91dCBGIHRpbWU9IjIwMjUtMDEtMTBUMTA6NDg6NTlaIiBsZXZlbD1kZWJ1ZyBtc2c9ImFwaSBsaXN0ZW4gb246IDoxMTAwMCwgYmFzZVBhdGg6IC8iCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDc3NzYzWiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBbV0FSTklOR10gUnVubmluZyBpbiAiZGVidWciIG1vZGUuIFN3aXRjaCB0byAicmVsZWFzZSIgbW9kZSBpbiBwcm9kdWN0aW9uLgoyMDI1LTAxLTEwVDEwOjQ4OjU5LjYxMTQ3ODU5Nlogc3Rkb3V0IEYgIC0gdXNpbmcgZW52OglleHBvcnQgR0lOX01PREU9cmVsZWFzZQoyMDI1LTAxLTEwVDEwOjQ4OjU5LjYxMTQ3OTEzOFogc3Rkb3V0IEYgIC0gdXNpbmcgY29kZToJZ2luLlNldE1vZGUoZ2luLlJlbGVhc2VNb2RlKQoyMDI1LTAxLTEwVDEwOjQ4OjU5LjYxMTQ3OTU5Nlogc3Rkb3V0IEYgCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDgyMzA0WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgL21ldHJpY3MgICAgICAgICAgICAgICAgICAtLT4gZ2l0aHViLmNvbS9rdWJlLWxvZ2dpbmcvbG9nLWdlbmVyYXRvci9tZXRyaWNzLkhhbmRsZXIuZnVuYzEgKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDg0OTI5WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZUdldEhhbmRsZXItZm0gKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDg1Mzg4WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBQQVRDSCAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZVBhdGNoSGFuZGxlci1mbSAoMSBoYW5kbGVycykKMjAyNS0wMS0xMFQxMDo0ODo1OS42MTE0ODYxNzlaIHN0ZG91dCBGIFtHSQ==" }, "Offset": 2421397, "RecordNum": 0, "FileAttributes": { "log.file.path": "/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log" }, "HeaderFinalized": false, "FlushState": { "LastDataChange": "2025-01-10T10:52:45.299068006Z", "LastDataLength": 0 } }, { "Fingerprint": { "first_bytes": "MjAyNS0wMS0xMVQxMjoyMjowMS4yNjQ4NTI4NzdaIHN0ZG91dCBGIFVzaW5nIGNvbmZpZzogL2NvbmYvY29uZmlnLnRvbWwKMjAyNS0wMS0xMVQxMjoyMjowMS4yNjU0ODIxNjlaIHN0ZG91dCBGIHRpbWU9IjIwMjUtMDEtMTFUMTI6MjI6MDFaIiBsZXZlbD1kZWJ1ZyBtc2c9ImFwaSBsaXN0ZW4gb246IDoxMTAwMCwgYmFzZVBhdGg6IC8iCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDg3Mjk0WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBbV0FSTklOR10gUnVubmluZyBpbiAiZGVidWciIG1vZGUuIFN3aXRjaCB0byAicmVsZWFzZSIgbW9kZSBpbiBwcm9kdWN0aW9uLgoyMDI1LTAxLTExVDEyOjIyOjAxLjI2NTQ4ODI1Mlogc3Rkb3V0IEYgIC0gdXNpbmcgZW52OglleHBvcnQgR0lOX01PREU9cmVsZWFzZQoyMDI1LTAxLTExVDEyOjIyOjAxLjI2NTQ4ODgzNVogc3Rkb3V0IEYgIC0gdXNpbmcgY29kZToJZ2luLlNldE1vZGUoZ2luLlJlbGVhc2VNb2RlKQoyMDI1LTAxLTExVDEyOjIyOjAxLjI2NTQ4OTI5NFogc3Rkb3V0IEYgCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDg5ODc3WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgL21ldHJpY3MgICAgICAgICAgICAgICAgICAtLT4gZ2l0aHViLmNvbS9rdWJlLWxvZ2dpbmcvbG9nLWdlbmVyYXRvci9tZXRyaWNzLkhhbmRsZXIuZnVuYzEgKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDkxMzM1WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZUdldEhhbmRsZXItZm0gKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDkxODM1WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBQQVRDSCAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZVBhdGNoSGFuZGxlci1mbSAoMSBoYW5kbGVycykKMjAyNS0wMS0xMVQxMjoyMjowMS4yNjU0OTI3MVogc3Rkb3V0IEYgW0dJTg==" }, "Offset": 3404, "RecordNum": 0, "FileAttributes": { "log.file.path": "/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/1.log" }, "HeaderFinalized": false, "FlushState": { "LastDataChange": "2025-01-11T12:22:04.144002128Z", "LastDataLength": 0 } } ] default%
Key points: Because the storage option is set for the filelog receiver, it enables offset tracking for log-source files, which works as follows:
- Tracked Files: The total count of files being monitored (knownFiles).
- Metadata for Each File:
- Fingerprint: A unique identifier created from the file’s initial bytes (Fingerprint.first_bytes).
- Offset: The position (in bytes) where the log receiver resumes reading from the file (Offset).
- File Attributes: Information such as the file’s name and path (FileAttributes).
NOTE: The serialization of the data depends on the storage extension of choice.
- Wait a couple of seconds, then restart the collector.
kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/tc-resources/collector.yaml
- Once it’s up and running, check the logs of the collector:
kubectl logs -n collector daemonsets/otelcollector-cluster-collector
You’ll see that the collector detected that persistence is turned on, and resumed reading from the previously recorded offset instead of starting from the beginning.
2024-12-20T12:20:29.851Z info adapter/receiver.go:49 Starting stanza receiver {"kind": "receiver", "name": "filelog/tenant-db", "data_type": "logs"}
2024-12-20T12:20:29.851Z info fileconsumer/file.go:62 Resuming from previously known offset(s). 'start_at' setting is not applicable. {"kind": "receiver", "name": "filelog/tenant-db", "data_type": "logs", "component": "fileconsumer"}
2024-12-20T12:20:29.851Z info adapter/receiver.go:49 Starting stanza receiver {"kind": "receiver", "name": "filelog/tenant-web", "data_type": "logs"}
2024-12-20T12:20:29.851Z info fileconsumer/file.go:62 Resuming from previously known offset(s). 'start_at' setting is not applicable. {"kind": "receiver", "name": "filelog/tenant-web", "data_type": "logs", "component": "fileconsumer"}
2024-12-20T12:20:29.851Z info service@v0.112.0/service.go:261 Everything is ready. Begin running and processing data.
- Verify that no logs were lost in either tenant.
docker exec kind-control-plane cat /etc/otelcol-contrib/export-web.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l
docker exec kind-control-plane cat /etc/otelcol-contrib/export-db.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l
Expected output:
10000
10000
Results:
- Neither tenant suffered from the collector outage.
- No duplicate logs were produced.
Conclusion
The combination of queuing, retry mechanisms, and persistent storage in OpenTelemetry Collector creates a resilient telemetry pipeline that protects your data in critical scenarios, such as:
- Network interruptions won’t result in data loss – the persistence storage keeps your telemetry data safe until connectivity is restored
- When your downstream systems are overwhelmed, the queuing mechanism provides a buffer, preventing data loss due to backpressure
- If the collector crashes unexpectedly, the persisted data remains safe on disk and will be processed once the system recovers
Overall, the OpenTelemetry Collector’s persistent storage and retry mechanisms enable the robust handling of both expected and unexpected disruptions. However, you should monitor disk usage, and configure appropriate limits to prevent storage exhaustion under sustained heavy loads. The OpenTelemetry Collector exposes specific metrics that can help in such cases, like the metrics regarding the sending-queue and retry-on-failure options:
# HELP otelcol_exporter_queue_capacity Fixed capacity of the retry queue (in batches)
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="otlp/output-db",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 1000
otelcol_exporter_queue_capacity{exporter="otlp/output-web",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 1000
# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{data_type="logs",exporter="otlp/output-db",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
otelcol_exporter_queue_size{data_type="logs",exporter="otlp/output-web",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
# HELP otelcol_exporter_send_failed_log_records Number of log records in failed attempts to send to destination.
# TYPE otelcol_exporter_send_failed_log_records counter
otelcol_exporter_send_failed_log_records{exporter="otlp/output-db",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
otelcol_exporter_send_failed_log_records{exporter="otlp/output-web,service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
