OpenTelemetry Collector persistence and retry mechanisms under the hood

In distributed systems, one of the most critical requirements is ensuring that no telemetry data is lost, even during system failures or network issues. The OpenTelemetry Collector’s persistence mechanism reliably handles events like collector restarts or downstream unavailability, preventing telemetry data loss. Understanding how persistence and retries work under the hood is crucial for administrators who need reliable telemetry data collection and transmission in their environments. Everything we’re discussing is fully applicable to OpenTelemetry Collector, but in the demos we’ll be using Telemetry Controller, as that’s easier to use and configure.

Since the v0.0.10 release, we’ve made awesome improvements to the Telemetry Controller (which uses our custom OpenTelemetry Collector distribution for log collection), focusing on its reliability and operational capabilities. Some of our key additions include:

  • Authentication mechanisms with secret injection support for OTLPHTTP and OTLPGRPC
  • Advanced tenant routing, made possible by the new Bridge CR
  • Enhanced configuration validation to prevent misconfigurations
  • Stabilized operator behavior with various reliability improvements
  • Introduced the batch processor for optimized data handling

In this article, we’ll dive deep into one of the most critical features we’ve exposed in the Telemetry Controller: the persistence and retry mechanisms.

Why Do You Need Persistence?

The persistence feature primarily relies on storage extensions. For Telemetry Controller we use the file storage extension as that covers all our use cases at the moment. The OTEL Collector contrib distribution supports other storage extensions as well:

  • File Storage: The most commonly used persistence method is to store the data on the local filesystem. It offers:
    • Configurable storage directory with customizable permissions
    • File locking mechanism with adjustable timeouts
    • Advanced compaction strategies to optimize storage space
    • Data integrity protection through fsync capabilities
    • Automatic directory creation and cleanup options
  • Database Storage: For environments requiring structured storage, supporting:
    • Multiple SQL databases including SQLite and PostgreSQL
    • Flexible driver configuration
    • Transaction-safe operations
  • Redis Storage: For distributed environments, offering:
    • Cluster support for high availability
    • In-memory performance

Preventing Data Loss: Beyond Persistence

While persistence is crucial, the Exporter Helper functionality of the OpenTelemetry Collector provides additional mechanisms to prevent data loss. The following figure shows the key components of the consumer-producer architecture used by the Collector (Ref: https://github.com/open-telemetry/opentelemetry-collector/blob/main/exporter/exporterhelper/README.md):

Persistence and retry mechanisms of OpenTelemetry Collector in the consumer-producer architecture

The key points of this architecture are the following:

  • Multiple consumers can process the queued items concurrently
  • Temporary and permanent failures are handled separately
  • Smart batch dispatch mechanism
  • Automatic queue management

The implementation offers the following important mechanisms:

  • Queue Mechanism options:
    • In-memory queue with configurable size
    • Multiple concurrent consumers
    • Queue sizing based on traffic patterns
  • Retry Strategy:
    • Control over what happens to telemetry data that can’t be delivered
    • Progressive backoff with customizable intervals
    • Timeout controls for individual retry attempts
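To make the progressive backoff more tangible, here is a minimal Python sketch of an exponential backoff with jitter. It is not the Collector's code: the function name is hypothetical, the parameter names only mirror the retry options discussed in this article, and the default values, the multiplier, and the randomization factor are assumptions chosen for illustration.

```python
import random
from itertools import islice


def backoff_intervals(initial_interval=5.0, max_interval=30.0,
                      max_elapsed_time=300.0, multiplier=1.5,
                      randomization=0.5, rng=None):
    """Yield the wait time (in seconds) before each retry attempt.

    max_elapsed_time=0 means "never give up", so the generator is endless.
    All default values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    base = initial_interval
    elapsed = 0.0
    while True:
        # Jitter the base interval by +/- `randomization`.
        wait = rng.uniform(base * (1 - randomization), base * (1 + randomization))
        if max_elapsed_time > 0 and elapsed + wait > max_elapsed_time:
            return  # retry budget exhausted: the data would be dropped here
        yield wait
        elapsed += wait
        # Grow the base interval, capped at max_interval.
        base = min(base * multiplier, max_interval)


# Example: the first five waits when retries never give up.
for wait in islice(backoff_intervals(max_elapsed_time=0), 5):
    print(f"retry in {wait:.2f}s")
```

With a non-zero max_elapsed_time the generator eventually stops, which is the point where undelivered data would be discarded. Setting it to 0 removes that limit.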

Testing persistence in OpenTelemetry Collector

During the development of the Telemetry Controller, our primary goal was to create a comprehensive and reliable solution that eliminates the complexity typically associated with telemetry collection.
A critical aspect of this reliability is persistence. In production environments, losing telemetry data isn’t just an inconvenience. It can mean:

  • missing critical insights about your system’s behavior,
  • losing important audit trails, or
  • failing to catch performance issues.

Let’s explore how the OpenTelemetry Collector handles various scenarios in practice.

  • We’ll create two tenants (tenant-web and tenant-db).
  • Both tenants will collect logs from the corresponding namespace they are named after (web and db).
  • Each of them will have its own persistent file storage, which is used as storage for their receiver and exporters.
  • Retry on failure is also set on every receiver and exporter with max_elapsed_time set to 0, which means that no logs should be thrown away at any point in time.
  • Each exporter will have a sending queue with a default queue size of 100 batches.

The following figure shows the main configuration options of the components, as well as their effects and relationships.

OpenTelemetry Collector persistence and retry mechanisms: configuration options and their effects

We’ll test two cases:

  1. When a downstream output becomes unavailable
  2. When a collector failure occurs

Setup

NOTE: You can find the example manifests used in this demonstration in the Telemetry Controller’s repository

To make it easy to follow along and experiment with different scenarios, we’ll show you how to run our test cases in a local Kubernetes environment running on Kind.

1. First, create a cluster and install the Telemetry Controller.

kind create cluster

# This installs the Telemetry Controller, and opentelemetry-operator as a sub-chart.
helm upgrade --install --wait --create-namespace --namespace telemetry-controller-system telemetry-controller oci://ghcr.io/kube-logging/helm-charts/telemetry-controller --version=0.0.15

2. Deploy the Telemetry Controller components.

kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/ns.yaml
kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/tc-resources/collector.yaml,https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/tc-resources/tc.yaml

Upon applying these resources, Telemetry Controller creates a collector with the following configuration.

3. To generate an exact amount of logs for the tests, deploy the Log Generator.

helm upgrade --install --wait --namespace web log-generator oci://ghcr.io/kube-logging/helm-charts/log-generator --set app.count=0 --set app.eventPerSec=50
helm upgrade --install --wait --namespace db log-generator oci://ghcr.io/kube-logging/helm-charts/log-generator --set app.count=0 --set app.eventPerSec=50

The app.count=0 option is needed so log-generator doesn’t start producing logs when it’s initialized.

4. Deploy the receiver collectors. We deploy two of them, and they differ only in their name and namespace fields (web and db). The receivers get logs over OTLPGRPC and export the received data into a file.

kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/receivers/receiver-db.yaml,https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/receivers/receiver-web.yaml

Now that everything is deployed and configured, let’s test the OTEL Collector and see how it handles different events that disrupt the telemetry pipeline.

Scenario 1: Unavailable Downstream Output

In this scenario, we’ll examine how the OpenTelemetry Collector behaves when a single output destination becomes unavailable. This is a common situation that can cause backpressure on upstream components, whether due to network issues, system maintenance, or service outages. We’ll see how the Collector’s persistence and retry mechanisms work together to ensure no data is lost during the downtime.

1. Check that the previously deployed components are up and running (only showing relevant resources):

kubectl get pods -A

NAMESPACE                     NAME                                                           READY   STATUS 
collector                     otelcollector-cluster-collector-8t9vn                          1/1     Running
db                            collector-receiver-db-collector-5764f9fcb8-gjlbs               1/1     Running
telemetry-controller-system   telemetry-controller-9bfc7bd5-ntx99                            1/1     Running
telemetry-controller-system   telemetry-controller-opentelemetry-operator-7f9f6d6c97-579p4   2/2     Running
web                           collector-receiver-web-collector-8d78dc74d-nq59n               1/1     Running

and

kubectl get telemetry-all -A

NAME                                           TENANTS                      STATE
collector.telemetry.kube-logging.dev/cluster   ["tenant-db","tenant-web"]   ready

NAMESPACE   NAME                                        
db          output.telemetry.kube-logging.dev/output-db 
web         output.telemetry.kube-logging.dev/output-web

NAMESPACE   NAME                                              TENANT       OUTPUTS                                     STATE
db          subscription.telemetry.kube-logging.dev/sub-db    tenant-db    [{"name":"output-db","namespace":"db"}]     ready
web         subscription.telemetry.kube-logging.dev/sub-web   tenant-web   [{"name":"output-web","namespace":"web"}]   ready

NAME                                           SUBSCRIPTIONS                            LOGSOURCE NAMESPACES   STATE
tenant.telemetry.kube-logging.dev/tenant-db    [{"name":"sub-db","namespace":"db"}]     ["db"]                 ready
tenant.telemetry.kube-logging.dev/tenant-web   [{"name":"sub-web","namespace":"web"}]   ["web"]                ready

2. Hit the log-generator’s loggen endpoint and make it produce 10k Apache access logs:

kubectl port-forward --namespace "web" "deployments/log-generator" 11000:11000 &
kubectl port-forward --namespace "db" "deployments/log-generator" 11001:11000 &

curl --location --request POST '127.0.0.1:11000/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'
curl --location --request POST '127.0.0.1:11001/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'

3. Verify that both receivers are working by checking the contents of the file they’re writing into:

docker exec kind-control-plane head /etc/otelcol-contrib/export-web.log | jq
docker exec kind-control-plane head /etc/otelcol-contrib/export-db.log | jq

Example output:

{
  "resourceLogs": [
    {
      "scopeLogs": [
        {
          "logRecords": [
            {
              "timeUnixNano": "1734689097170297463",
              "observedTimeUnixNano": "1734689097283413254",
              "body": {
                "stringValue": "135.93.33.38 - - [20/Dec/2024:10:04:57 +0000] - \"PUT /index.html HTTP/1.1\" 200 18175 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50\" \"-\""
              },
              "attributes": [
                {
                  "key": "log.file.path",
                  "value": {
                    "stringValue": "/var/log/pods/db_log-generator-77797dff76-q8v5x_a7331723-b9b2-4356-9cc9-f08064da1c8e/log-generator/0.log"
                  }
                },
                {
                  "key": "time",
                  "value": {
                    "stringValue": "2024-12-20T10:04:57.170297463Z"
                  }
                },
                {
                  "key": "log.iostream",
                  "value": {
                    "stringValue": "stdout"
                  }
                },
                {
                  "key": "logtag",
                  "value": {
                    "stringValue": "F"
                  }
                },
                {
                  "key": "tenant",
                  "value": {
                    "stringValue": "tenant-db"
                  }
                },
                {
                  "key": "subscription",
                  "value": {
                    "stringValue": "sub-db"
                  }
                },
                {
                  "key": "exporter",
                  "value": {
                    "stringValue": "otlp/db_output-db"
                  }
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

4. Let’s cause an outage by shutting down the db receiver:

kubectl delete opentelemetrycollector collector-receiver-db --namespace db

Because of this, the cluster collector starts reporting the outage:

kubectl logs -n collector daemonsets/otelcollector-cluster-collector
2024-12-20T10:14:41.789Z    info    internal/retry_sender.go:126    Exporting failed. Will retry the request after interval.    
{
    "kind": "exporter",
    "data_type": "logs",
    "name": "otlp/db_output-db",
    "error": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded",
    "interval": "4.22153556s"
}

2024-12-20T10:14:41.988Z    info    internal/retry_sender.go:126    Exporting failed. Will retry the request after interval.    
{
    "kind": "exporter",
    "data_type": "logs",
    "name": "otlp/db_output-db",
    "error": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded",
    "interval": "3.713328578s"
}

2024-12-20T10:14:42.192Z    info    internal/retry_sender.go:126    Exporting failed. Will retry the request after interval.    
{
    "kind": "exporter",
    "data_type": "logs",
    "name": "otlp/db_output-db",
    "error": "rpc error: code = DeadlineExceeded desc = received context error while waiting for new LB policy update: context deadline exceeded",
    "interval": "4.424550142s"
}

Once the sending queue fills up entirely (it’s set to 100 batches), the exporter stops accepting more data and starts rejecting it:

2024-12-20T10:16:52.639Z    error    internal/base_exporter.go:130    Exporting failed. Rejecting data.
{
    "kind": "exporter",
    "data_type": "logs",
    "name": "otlp/db_output-db",
    "error": "sending queue is full",
    "rejected_items": 10
}
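The rejection seen in the log above is what you would expect from any bounded queue. The following Python sketch models that behavior with a toy in-memory queue. The class and method names are hypothetical, and unlike the demo setup, nothing here is persisted to disk.

```python
import queue


class SendingQueue:
    """Toy bounded queue: accepts batches until it is full, then rejects."""

    def __init__(self, queue_size=100):
        self._q = queue.Queue(maxsize=queue_size)
        self.rejected_items = 0

    def offer(self, batch):
        """Try to enqueue a batch (a list of log records)."""
        try:
            self._q.put_nowait(batch)
            return True
        except queue.Full:
            # Comparable to the "sending queue is full" rejection in the log.
            self.rejected_items += len(batch)
            return False

    def take(self):
        """Remove the oldest batch, for example after a successful export."""
        return self._q.get_nowait()


# Offer 101 batches of 10 records each to a queue that holds 100 batches.
q = SendingQueue(queue_size=100)
accepted = sum(q.offer([f"log-{i}-{j}" for j in range(10)]) for i in range(101))
print(accepted, q.rejected_items)
```

Once a consumer takes a batch off the queue, the next offer succeeds again, which is why the rejections stop as soon as the destination recovers and the queue starts draining.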

5. Take a look at the file storage directories:

docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-web
docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-db

Expected output:

exporter_otlp_web_output-web_logs
receiver_filelog_tenant-web
exporter_otlp_db_output-db_logs
receiver_filelog_tenant-db

Each tenant has separate files for its receiver and exporter. For this scenario, let’s check out the contents of the file created for the exporters:

docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-web/exporter_otlp_web_output-web_logs | tail -n 50 | strings
docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-db/exporter_otlp_db_output-db_logs | tail -n 50 | strings

Output:

tenant-db2
subscription
sub-db2
exporter
otlp/db_output-dbJ
ßHV6*O
160.38.112.115 - - [10/Jan/2025:10:49:34 +0000] - "POST /blog HTTP/1.1" 503 16019 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/54.0.2840.98 Safari/537.36" "-"2
logtag
time
2025-01-10T10:49:34.046881959Z2
log.iostream
stdout2{
log.file.path
h/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log2
tenant
tenant-db2
subscription
sub-db2
exporter
otlp/db_output-dbJ
114.83.144.155 - - [10/Jan/2025:10:49:34 +0000] - "POST /blog HTTP/1.1" 200 13679 "-" "Mozilla/5.0 (X11; Fedora; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/52.0.2743.116 Safari/537.36" "-"2{
log.file.path
h/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log2
log.iostream
stdout2(
time
2025-01-10T10:49:34.067475959Z2
logtag
tenant
tenant-db2
subscription
sub-db2
exporter
otlp/db_output-dbJ
ˆ—8*O
67.38.122.58 - - [10/Jan/2025:10:49:34 +0000] - "GET /index.html HTTP/1.1" 503 12376 "-" "Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; Touch; MALNJS; rv:11.0) like Gecko" "-"2(
time
2025-01-10T10:49:34.088541834Z2{
log.file.path
h/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log2
log.iostream
stdout2
logtag
tenant
tenant-db2
subscription
sub-db2
exporter
otlp/db_output-dbJ

Key points: The data in the persisted files appears as a mix of readable text and binary content because:

  1. The OpenTelemetry Collector pipeline uses Protocol Buffers (protobuf) serialization for efficient storage:
    • After data is received through a receiver component, it is converted into pipeline data (pdata for short).
    • All data processing within the pipeline happens using this pdata format.
    • The internal representation uses OTLP ProtoBuf structs.
    • Before transmission, the exporters convert the data to the appropriate format.
  2. Protobuf uses a compact binary encoding for all data, including metadata, control fields, and log content:
    • Strings, such as log content, are stored as UTF-8 within the binary representation.
    • This format ensures both storage efficiency and reliable state persistence.

NOTE: When viewed with strings, we see only the UTF-8 encoded portions, making the content appear fragmented.
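To see why the output looks fragmented, it helps to encode a couple of attributes by hand. The sketch below builds length-delimited protobuf fields manually and then extracts printable runs, roughly like the strings command does. The field numbers (attributes as field 6, key as field 1, value as field 2) follow our reading of the OTLP schema and should be treated as assumptions. The helper names are hypothetical.

```python
import re


def varint(n):
    """Encode a non-negative integer as a protobuf varint."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def length_delimited(field_number, payload):
    """Encode a wire type 2 field: tag, payload length, payload."""
    tag = (field_number << 3) | 2
    return varint(tag) + varint(len(payload)) + payload


def key_value(key, value):
    """Hand-encode a string attribute (assumed: key=1, value=2, string=1)."""
    any_value = length_delimited(1, value.encode("utf-8"))
    return length_delimited(1, key.encode("utf-8")) + length_delimited(2, any_value)


# Two attributes in a row, each wrapped as field 6 of the enclosing record
# (assumed field number).
record = b"".join(
    length_delimited(6, key_value(k, v))
    for k, v in [("tenant", "tenant-db"), ("subscription", "sub-db")]
)

# Rough equivalent of `strings`: printable ASCII runs of at least 4 characters.
runs = [m.decode("ascii") for m in re.findall(rb"[\x20-\x7e]{4,}", record)]
print(runs)
```

In this sketch the tag byte of field 6 is 0x32, which is the ASCII character 2. When that byte directly follows a string value, it gets glued to the end of the printable run, which would explain values like tenant-db2 in the output above.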

6. Restart the receiver. Once it’s up and running again, the collector stops logging export failures.

kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/receivers/receiver-db.yaml

7. Verify that no logs were lost in either tenant. The following jq command parses each batch, outputting the exact number of logs exported.

docker exec kind-control-plane cat /etc/otelcol-contrib/export-web.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l
docker exec kind-control-plane cat /etc/otelcol-contrib/export-db.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l

Expected output:

10000
10000

Results:

  • The web tenant wasn’t affected by the outage, so it received 10k messages.
  • The db tenant was also able to handle the outage, and transmitted all 10k messages.
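If jq isn’t available, the same count can be produced with a few lines of Python. This is a minimal sketch that assumes the export file contains one JSON object per line, as the head and jq commands above suggest. The function name is hypothetical.

```python
import json


def count_log_records(lines):
    """Count log records in an iterable of JSON lines (one batch per line)."""
    total = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        batch = json.loads(line)
        for resource_logs in batch.get("resourceLogs", []):
            for scope_logs in resource_logs.get("scopeLogs", []):
                total += len(scope_logs.get("logRecords", []))
    return total


# Two made-up batches with three log records in total.
sample = [
    '{"resourceLogs":[{"scopeLogs":[{"logRecords":['
    '{"body":{"stringValue":"a"}},{"body":{"stringValue":"b"}}]}]}]}',
    '{"resourceLogs":[{"scopeLogs":[{"logRecords":[{"body":{"stringValue":"c"}}]}]}]}',
]
print(count_log_records(sample))
```

To count a real export, pass an open file object or sys.stdin instead of the sample list. Unlike the jq and wc pipeline, this counts records rather than output lines, so a log body containing a newline wouldn’t skew the result.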

Scenario 2: Collector Failure

Let’s see how the Collector handles a failure scenario when the Collector itself becomes unavailable. We’ll explore how the built-in resilience mechanisms ensure data integrity and recovery during such failures. (We assume that the previously deployed components are up and running.)

1. Hit the log-generator’s loggen endpoint again and make it produce 10k Apache access logs.

kubectl port-forward --namespace "web" "deployments/log-generator" 11000:11000 &
kubectl port-forward --namespace "db" "deployments/log-generator" 11001:11000 &

curl --location --request POST '127.0.0.1:11000/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'
curl --location --request POST '127.0.0.1:11001/loggen' --header 'Content-Type: application/json' --data-raw '{ "type": "web", "format": "apache", "count": 10000 }'

2. Verify that both receivers are working by checking the contents of the file they’re writing into.

docker exec kind-control-plane head /etc/otelcol-contrib/export-web.log | jq
docker exec kind-control-plane head /etc/otelcol-contrib/export-db.log | jq

Example output:

{
  "resourceLogs": [
    {
      "scopeLogs": [
        {
          "logRecords": [
            {
              "timeUnixNano": "1734689097170297463",
              "observedTimeUnixNano": "1734689097283413254",
              "body": {
                "stringValue": "135.93.33.38 - - [20/Dec/2024:10:04:57 +0000] - \"PUT /index.html HTTP/1.1\" 200 18175 \"-\" \"Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/602.1.50 (KHTML, like Gecko) Version/10.0 Safari/602.1.50\" \"-\""
              },
              "attributes": [
                {
                  "key": "log.file.path",
                  "value": {
                    "stringValue": "/var/log/pods/db_log-generator-77797dff76-q8v5x_a7331723-b9b2-4356-9cc9-f08064da1c8e/log-generator/0.log"
                  }
                },
                {
                  "key": "time",
                  "value": {
                    "stringValue": "2024-12-20T10:04:57.170297463Z"
                  }
                },
                {
                  "key": "log.iostream",
                  "value": {
                    "stringValue": "stdout"
                  }
                },
                {
                  "key": "logtag",
                  "value": {
                    "stringValue": "F"
                  }
                },
                {
                  "key": "tenant",
                  "value": {
                    "stringValue": "tenant-db"
                  }
                },
                {
                  "key": "subscription",
                  "value": {
                    "stringValue": "sub-db"
                  }
                },
                {
                  "key": "exporter",
                  "value": {
                    "stringValue": "otlp/db_output-db"
                  }
                }
              ]
            }
          ]
        }
      ]
    }
  ]
}

3. Cause an outage by shutting down the collector.

kubectl delete collector cluster

4. Take a look at the file storage directories.

docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-web
docker exec kind-control-plane ls /var/lib/otelcol/file_storage/tenant-db

Expected output:

exporter_otlp_web_output-web_logs
receiver_filelog_tenant-web
exporter_otlp_db_output-db_logs
receiver_filelog_tenant-db

For this scenario, let’s check out the contents of the file created for the receivers:

docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-web/receiver_filelog_tenant-web
docker exec kind-control-plane cat /var/lib/otelcol/file_storage/tenant-db/receiver_filelog_tenant-db

Output:

file_input.knownFiles2
[
  {
    "Fingerprint": {
      "first_bytes": "MjAyNS0wMS0xMFQxMDo0ODo1OS42MTEzOTQyMjFaIHN0ZG91dCBGIFVzaW5nIGNvbmZpZzogL2NvbmYvY29uZmlnLnRvbWwKMjAyNS0wMS0xMFQxMDo0ODo1OS42MTE0NzMyNjNaIHN0ZG91dCBGIHRpbWU9IjIwMjUtMDEtMTBUMTA6NDg6NTlaIiBsZXZlbD1kZWJ1ZyBtc2c9ImFwaSBsaXN0ZW4gb246IDoxMTAwMCwgYmFzZVBhdGg6IC8iCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDc3NzYzWiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBbV0FSTklOR10gUnVubmluZyBpbiAiZGVidWciIG1vZGUuIFN3aXRjaCB0byAicmVsZWFzZSIgbW9kZSBpbiBwcm9kdWN0aW9uLgoyMDI1LTAxLTEwVDEwOjQ4OjU5LjYxMTQ3ODU5Nlogc3Rkb3V0IEYgIC0gdXNpbmcgZW52OglleHBvcnQgR0lOX01PREU9cmVsZWFzZQoyMDI1LTAxLTEwVDEwOjQ4OjU5LjYxMTQ3OTEzOFogc3Rkb3V0IEYgIC0gdXNpbmcgY29kZToJZ2luLlNldE1vZGUoZ2luLlJlbGVhc2VNb2RlKQoyMDI1LTAxLTEwVDEwOjQ4OjU5LjYxMTQ3OTU5Nlogc3Rkb3V0IEYgCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDgyMzA0WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgL21ldHJpY3MgICAgICAgICAgICAgICAgICAtLT4gZ2l0aHViLmNvbS9rdWJlLWxvZ2dpbmcvbG9nLWdlbmVyYXRvci9tZXRyaWNzLkhhbmRsZXIuZnVuYzEgKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDg0OTI5WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZUdldEhhbmRsZXItZm0gKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTBUMTA6NDg6NTkuNjExNDg1Mzg4WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBQQVRDSCAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZVBhdGNoSGFuZGxlci1mbSAoMSBoYW5kbGVycykKMjAyNS0wMS0xMFQxMDo0ODo1OS42MTE0ODYxNzlaIHN0ZG91dCBGIFtHSQ=="
    },
    "Offset": 2421397,
    "RecordNum": 0,
    "FileAttributes": {
      "log.file.path": "/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/0.log"
    },
    "HeaderFinalized": false,
    "FlushState": {
      "LastDataChange": "2025-01-10T10:52:45.299068006Z",
      "LastDataLength": 0
    }
  },
  {
    "Fingerprint": {
      "first_bytes": "MjAyNS0wMS0xMVQxMjoyMjowMS4yNjQ4NTI4NzdaIHN0ZG91dCBGIFVzaW5nIGNvbmZpZzogL2NvbmYvY29uZmlnLnRvbWwKMjAyNS0wMS0xMVQxMjoyMjowMS4yNjU0ODIxNjlaIHN0ZG91dCBGIHRpbWU9IjIwMjUtMDEtMTFUMTI6MjI6MDFaIiBsZXZlbD1kZWJ1ZyBtc2c9ImFwaSBsaXN0ZW4gb246IDoxMTAwMCwgYmFzZVBhdGg6IC8iCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDg3Mjk0WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBbV0FSTklOR10gUnVubmluZyBpbiAiZGVidWciIG1vZGUuIFN3aXRjaCB0byAicmVsZWFzZSIgbW9kZSBpbiBwcm9kdWN0aW9uLgoyMDI1LTAxLTExVDEyOjIyOjAxLjI2NTQ4ODI1Mlogc3Rkb3V0IEYgIC0gdXNpbmcgZW52OglleHBvcnQgR0lOX01PREU9cmVsZWFzZQoyMDI1LTAxLTExVDEyOjIyOjAxLjI2NTQ4ODgzNVogc3Rkb3V0IEYgIC0gdXNpbmcgY29kZToJZ2luLlNldE1vZGUoZ2luLlJlbGVhc2VNb2RlKQoyMDI1LTAxLTExVDEyOjIyOjAxLjI2NTQ4OTI5NFogc3Rkb3V0IEYgCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDg5ODc3WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgL21ldHJpY3MgICAgICAgICAgICAgICAgICAtLT4gZ2l0aHViLmNvbS9rdWJlLWxvZ2dpbmcvbG9nLWdlbmVyYXRvci9tZXRyaWNzLkhhbmRsZXIuZnVuYzEgKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDkxMzM1WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBHRVQgICAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZUdldEhhbmRsZXItZm0gKDEgaGFuZGxlcnMpCjIwMjUtMDEtMTFUMTI6MjI6MDEuMjY1NDkxODM1WiBzdGRvdXQgRiBbR0lOLWRlYnVnXSBQQVRDSCAgLyAgICAgICAgICAgICAgICAgICAgICAgICAtLT4gbWFpbi4oKlN0YXRlKS5zdGF0ZVBhdGNoSGFuZGxlci1mbSAoMSBoYW5kbGVycykKMjAyNS0wMS0xMVQxMjoyMjowMS4yNjU0OTI3MVogc3Rkb3V0IEYgW0dJTg=="
    },
    "Offset": 3404,
    "RecordNum": 0,
    "FileAttributes": {
      "log.file.path": "/var/log/pods/db_log-generator-6b48b4fc68-hqdxm_c7fd09ad-a3d3-45b3-86ff-1e4b052e709f/log-generator/1.log"
    },
    "HeaderFinalized": false,
    "FlushState": {
      "LastDataChange": "2025-01-11T12:22:04.144002128Z",
      "LastDataLength": 0
    }
  }
]
default%

Key points: Because the storage option is set for the filelog receivers, offset tracking is enabled for the log source files. The persisted state contains the following:

  • Tracked Files: The total count of files being monitored (knownFiles).
  • Metadata for Each File:
    • Fingerprint: A unique identifier created from the file’s initial bytes (Fingerprint.first_bytes).
    • Offset: The position (in bytes) where the log receiver resumes reading from the file (Offset).
    • File Attributes: Information such as the file’s name and path (FileAttributes).

NOTE: The serialization of the data depends on the storage extension of choice.
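The fingerprint and offset fields are enough to resume reading a file after a restart. The following Python sketch shows the idea under simplified assumptions: the state dictionary reuses the first_bytes and Offset names from the output above, the helper names are hypothetical, and the real receiver handles more cases (such as rotated or truncated files) than this does.

```python
import base64
import os
import tempfile


def same_file(path, state):
    """Compare the file's first bytes with the stored fingerprint."""
    known = base64.b64decode(state["first_bytes"])
    with open(path, "rb") as f:
        return f.read(len(known)) == known


def read_new_data(path, state):
    """Read what was appended since the stored offset, then advance it."""
    # A file that doesn't match the fingerprint is read from the start.
    start = state["Offset"] if same_file(path, state) else 0
    with open(path, "rb") as f:
        f.seek(start)
        data = f.read()
    state["Offset"] = start + len(data)
    return data


# The fingerprint in the output above is plain base64. Its first 40
# characters decode to the timestamp that starts the tracked log file.
prefix = base64.b64decode("MjAyNS0wMS0xMFQxMDo0ODo1OS42MTEzOTQyMjFa")
print(prefix.decode("ascii"))

# Simulate a restart: the saved state survives, so the second read returns
# only the line that was written while the reader was "down".
with tempfile.TemporaryDirectory() as tmp:
    log = os.path.join(tmp, "0.log")
    with open(log, "wb") as f:
        f.write(b"line 1\n")
    state = {"first_bytes": base64.b64encode(b"line 1\n").decode("ascii"), "Offset": 0}
    first = read_new_data(log, state)
    with open(log, "ab") as f:
        f.write(b"line 2\n")
    second = read_new_data(log, state)
print(first, second, state["Offset"])
```

Because the offset is stored outside the process, nothing is read twice and nothing is skipped.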

5. Wait a couple of seconds, then restart the collector.

kubectl apply -f https://raw.githubusercontent.com/kube-logging/telemetry-controller/refs/heads/main/docs/demos/persistence-and-retry-mechanisms/tc-resources/collector.yaml

6. Once it’s up and running, check the logs of the collector:

kubectl logs -n collector daemonsets/otelcollector-cluster-collector

You’ll see that the collector detected that persistence is turned on and resumed reading from the previously recorded offsets.

2024-12-20T12:20:29.851Z    info    adapter/receiver.go:49    Starting stanza receiver
{
    "kind": "receiver",
    "name": "filelog/tenant-db",
    "data_type": "logs"
}
2024-12-20T12:20:29.851Z    info    fileconsumer/file.go:62    Resuming from previously known offset(s). 'start_at' setting is not applicable.
{
    "kind": "receiver",
    "name": "filelog/tenant-db",
    "data_type": "logs",
    "component": "fileconsumer"
}
2024-12-20T12:20:29.851Z    info    adapter/receiver.go:49    Starting stanza receiver
{
    "kind": "receiver",
    "name": "filelog/tenant-web",
    "data_type": "logs"
}
2024-12-20T12:20:29.851Z    info    fileconsumer/file.go:62    Resuming from previously known offset(s). 'start_at' setting is not applicable.
{
    "kind": "receiver",
    "name": "filelog/tenant-web",
    "data_type": "logs",
    "component": "fileconsumer"
}
2024-12-20T12:20:29.851Z    info    [email protected]/service.go:261    Everything is ready. Begin running and processing data.

7. Verify that no logs were lost in either tenant.

docker exec kind-control-plane cat /etc/otelcol-contrib/export-web.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l
docker exec kind-control-plane cat /etc/otelcol-contrib/export-db.log | jq -r '.resourceLogs[].scopeLogs[].logRecords[] | .body.stringValue' | wc -l

Expected output:

10000
10000

Results:

  • Neither tenant suffered from the collector outage.
  • No duplicate logs were produced.

Conclusion

The combination of queuing, retry mechanisms, and persistent storage in OpenTelemetry Collector creates a resilient telemetry pipeline that protects your data in critical scenarios, such as:

  • Network interruptions won’t result in data loss: persistent storage keeps your telemetry data safe until connectivity is restored
  • When your downstream systems are overwhelmed, the queuing mechanism provides a buffer, preventing data loss due to backpressure
  • If the collector crashes unexpectedly, the persisted data remains safe on disk and will be processed once the system recovers

Overall, the OpenTelemetry Collector’s persistent storage and retry mechanisms enable the robust handling of both expected and unexpected disruptions. However, you should monitor disk usage, and configure appropriate limits to prevent storage exhaustion under sustained heavy loads. The OpenTelemetry Collector exposes specific metrics that can help in such cases, like the metrics regarding the sending-queue and retry-on-failure options:

# HELP otelcol_exporter_queue_capacity Fixed capacity of the retry queue (in batches)
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="otlp/output-db",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 1000
otelcol_exporter_queue_capacity{exporter="otlp/output-web",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 1000

# HELP otelcol_exporter_queue_size Current size of the retry queue (in batches)
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{data_type="logs",exporter="otlp/output-db",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
otelcol_exporter_queue_size{data_type="logs",exporter="otlp/output-web",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0

# HELP otelcol_exporter_send_failed_log_records Number of log records in failed attempts to send to destination.
# TYPE otelcol_exporter_send_failed_log_records counter
otelcol_exporter_send_failed_log_records{exporter="otlp/output-db",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
otelcol_exporter_send_failed_log_records{exporter="otlp/output-web",service_instance_id="a7d40d49-8a05-4039-9cb1-e6ef849fcb3b",service_name="axoflow-otel-collector",service_version="0.112.0"} 0
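
These metrics are easy to turn into a simple health signal. As a minimal sketch, the Python snippet below parses metrics text in the format shown above and computes how full each exporter’s queue is. The parser only handles the simple line shape used here, the function names are hypothetical, and the sample values are made up.

```python
import re

# Made-up sample in the same text format as the metrics above.
SAMPLE = """\
# TYPE otelcol_exporter_queue_capacity gauge
otelcol_exporter_queue_capacity{exporter="otlp/output-db"} 1000
# TYPE otelcol_exporter_queue_size gauge
otelcol_exporter_queue_size{data_type="logs",exporter="otlp/output-db"} 250
"""

LINE = re.compile(r'^(?P<name>\w+)\{(?P<labels>[^}]*)\}\s+(?P<value>\S+)$')
LABEL = re.compile(r'(\w+)="([^"]*)"')


def parse(text):
    """Return (metric name, labels dict, value) for each sample line."""
    samples = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        m = LINE.match(line)
        if m:
            labels = dict(LABEL.findall(m.group("labels")))
            samples.append((m.group("name"), labels, float(m.group("value"))))
    return samples


def queue_utilization(text):
    """Map each exporter to queue_size / queue_capacity."""
    capacity, size = {}, {}
    for name, labels, value in parse(text):
        exporter = labels.get("exporter")
        if name == "otelcol_exporter_queue_capacity":
            capacity[exporter] = value
        elif name == "otelcol_exporter_queue_size":
            size[exporter] = value
    return {e: size.get(e, 0.0) / c for e, c in capacity.items() if c}


print(queue_utilization(SAMPLE))
```

A utilization that stays close to 1.0 means the queue is about to reject data, so it is a reasonable value to alert on.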