
Beyond Cutting Cost: Why Data Quality Makes Security Pipelines Strategic
From Centralization to Access: The New Philosophy of Security Data
For years, the industry had one answer to the question of how to store security data: in a centralized repository. Centralized log management, from the early days of syslog-ng through modern SIEM systems, was built on that principle. Centralization made sense at the time: it simplified compliance, correlation, and detection.
As I pointed out in a recent podcast, the model belonged to a different era. Networks were slower, infrastructure was simpler, and it was easier to reason about a single central data store than many distributed ones. Today, the world looks different. Cloud-native architectures, elastic storage, and distributed data services have changed the equation.
Data still needs to move off endpoints, since most devices can’t store their logs for long. However, it doesn’t have to live in a single monolithic repository. It can remain distributed, as long as it is easily accessible and searchable. This shift requires strong federated access rather than forced aggregation. In this context, access means the ability to search, retrieve, and use data wherever it lives, without forcing it into a single storage system.
An increasingly popular idea takes this further: leave data on the original sources and rely on federated search or “AI agents” to query it in place. In practice, those sources are not designed for retention, and critical evidence will be lost. Data must still be moved off the source to be preserved, even if it never ends up in a single centralized repository.
That change, however, introduces a new challenge: making distributed data usable. And that starts with data quality.
Configuration: The Weak Link in Traditional Pipelines
Older data pipelines were fragile and hard to manage. The configuration process was manual, repetitive, and prone to errors. Operators had to open specific ports, configure each device, and ensure that logs were directed to the correct destination. One small mistake could send data into the wrong index or schema, breaking visibility.
Even now, many organizations still work this way. Users are often instructed to configure systems like Zscaler or Palo Alto manually. If anything is misconfigured, the data lands in the wrong place, later stages of SecOps miss it entirely, and blind spots appear.
We no longer manage modern infrastructure this way. Tools like Kubernetes and cloud orchestration rely on declarative methods that describe the desired state, allowing automation to handle the rest. Pipelines should work similarly, defining what the data should look like rather than managing every connection manually. Manual configuration not only creates delays; it also introduces inconsistencies that degrade detection reliability.
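To make the contrast concrete, here is a minimal sketch of what a declarative, desired-state description of a pipeline could look like. This is hypothetical Python for illustration only, not Axoflow’s actual configuration format; the source vendors, field names, and reconcile step are assumptions.

```python
from dataclasses import dataclass

# Hypothetical desired-state description of a pipeline: what data should
# flow where, not which ports to open or which device to reconfigure.
@dataclass
class SourceSpec:
    vendor: str       # e.g. "paloalto", "zscaler"
    transport: str    # e.g. "syslog", "https"

@dataclass
class DestinationSpec:
    name: str         # e.g. "siem", "object-storage"
    schema: str       # target schema, e.g. "ocsf"

@dataclass
class PipelineSpec:
    sources: list[SourceSpec]
    destinations: list[DestinationSpec]

desired = PipelineSpec(
    sources=[SourceSpec("paloalto", "syslog"), SourceSpec("zscaler", "https")],
    destinations=[DestinationSpec("siem", "ocsf")],
)

def reconcile(spec: PipelineSpec) -> None:
    """Compare the desired state with what is running and create, update,
    or remove routes so that reality matches the spec."""
    for src in spec.sources:
        for dst in spec.destinations:
            print(f"ensure route: {src.vendor}/{src.transport} -> {dst.name} as {dst.schema}")

reconcile(desired)
```

The point of the sketch is the shape of the workflow: the operator states the outcome once, and the automation owns every individual connection.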
The Real Problem: Data Quality and Classification
The more serious problem isn’t where the data resides, but rather its condition. Security data often lacks the oversight that business data normally receives.
Because managing data collection is so complex, many organizations simply collect as much data as possible. The result is an overload of unstructured data that is difficult to interpret or use.
Classification is the foundation of data quality. A pipeline should be intelligent enough to differentiate between data sources and apply specific processing automatically, thereby reducing the burden on its users. The pipeline should automatically adapt to schema changes and identify fields and formats. Without this automation, teams are left to manually sort through different data types, wasting time and missing valuable insights.
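As a simplified illustration of source classification, the Python sketch below inspects a raw message, guesses its format, and attaches that decision as metadata so parsing and routing can happen automatically. The detection rules are deliberately naive assumptions, not any product’s real logic.

```python
import re

# Minimal, illustrative classifier: guess the format of a raw log line and
# tag it so downstream parsing and routing can be applied automatically.
RULES = [
    ("cef",        re.compile(r"CEF:\d+\|")),    # ArcSight CEF header
    ("leef",       re.compile(r"LEEF:\d")),      # IBM LEEF header
    ("json",       re.compile(r"^\s*\{")),       # structured JSON event
    ("syslog-rfc", re.compile(r"^<\d{1,3}>")),   # syslog priority field
]

def classify(raw: str) -> dict:
    for fmt, pattern in RULES:
        if pattern.search(raw):
            return {"format": fmt, "raw": raw}
    return {"format": "unknown", "raw": raw}     # surfaced, never silently dropped

print(classify("CEF:0|Palo Alto Networks|PAN-OS|..."))
print(classify('{"event": "login", "user": "alice"}'))
```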
Poor classification and inconsistent data formats have held back security operations for years. Fixing this requires automation and pipelines that understand context.
When Bad Data Becomes a Security Risk
Data quality issues are not just technical annoyances. They directly undermine security outcomes.
Common examples include logs that don’t match their format specification or arrive truncated in transit. Palo Alto’s official CEF configuration, for instance, can generate malformed logs or duplicate field names. Some devices even cut logs short, producing messages that cannot be parsed at all once they reach the SIEM.
Rejecting bad data is not a solution, as it is then lost. Instead, pipelines must take responsibility for quality. They should detect, surface, and repair problems automatically wherever possible.
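As a rough sketch of “detect, surface, and repair,” the Python below parses a CEF-style extension string, keeps the event despite duplicate keys, and flags a dangling token as possible truncation. The heuristics are assumptions made for the example; real CEF escaping and vendor quirks are more involved.

```python
def repair_cef_extensions(extension: str) -> tuple[dict, list[str]]:
    """Parse a CEF-style extension string, keep the last value for any
    duplicate key, and report quality issues instead of dropping the event."""
    issues: list[str] = []
    fields: dict[str, str] = {}
    # Naive space-separated key=value parsing; illustrative only.
    for token in extension.split():
        key, sep, value = token.partition("=")
        if not sep:
            # A dangling token often means the message was cut short in transit.
            issues.append(f"possible truncation near {token!r}")
            continue
        if key in fields:
            issues.append(f"duplicate field {key!r} overwritten")
        fields[key] = value
    return fields, issues

fields, issues = repair_cef_extensions("src=10.0.0.1 dst=10.0.0.2 src=10.0.0.9 act")
print(fields)   # {'src': '10.0.0.9', 'dst': '10.0.0.2'}
print(issues)   # duplicate 'src' overwritten, possible truncation near 'act'
```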
One of the reasons this problem has persisted for so long is that SIEM platforms themselves are rarely incentivized to solve it. Their business models often reward higher data volumes, not cleaner data, which means poor quality or malformed telemetry is simply ingested as-is. The responsibility for improving data quality must therefore shift to the pipeline layer, which is the only component with both the visibility and the motivation to fix it.
This mindset represents a major shift. The industry has spent two decades moving broken data into centralized systems. The real opportunity is to fix the data before it ever gets there.
Beyond Cost Cutting: The Real Strategic Value
Many people still view pipelines primarily as a means to save money on SIEM ingestion. While that is one benefit, it’s not the most important one.
The real strategic value of a pipeline lies in its ability to clean up data, bringing it into the right format and schema so it can be used immediately. Normalization ensures that data is consistent and structured, regardless of its origins or where it is stored. Pipelines occupy a unique point in the architecture. They see the raw data before any truncation, transformation, or indexing occurs. They are also the only layer with both the context and the incentive to repair quality issues before they become permanent.
When normalization occurs within the pipeline, data becomes portable and interoperable across architectures, whether centralized or decentralized. Teams can enrich events as they arrive, map between schemas such as CEF, OCSF, ECS, and UDM, and start using the data right away without waiting for downstream corrections.
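A minimal sketch of that kind of mapping is shown below, assuming a handful of simplified field names: a parsed CEF-style event is renamed into OCSF-like and ECS-like keys so either backend can consume it. Real OCSF and ECS schemas are far richer than this.

```python
# Illustrative field mappings only; the target field names are simplified
# assumptions, not complete OCSF or ECS definitions.
CEF_TO_OCSF = {"src": "src_endpoint.ip", "dst": "dst_endpoint.ip", "act": "activity_name"}
CEF_TO_ECS  = {"src": "source.ip",       "dst": "destination.ip",  "act": "event.action"}

def normalize(event: dict, mapping: dict) -> dict:
    """Rename known fields according to the mapping; keep unknown fields
    as-is so no information is lost in transit."""
    return {mapping.get(key, key): value for key, value in event.items()}

parsed = {"src": "10.0.0.1", "dst": "192.0.2.7", "act": "allow"}
print(normalize(parsed, CEF_TO_OCSF))
print(normalize(parsed, CEF_TO_ECS))
```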
Cost reduction is a side effect. The real benefit is data that’s ready to use from the moment it is collected.
Enrichment at the Right Stage
Traditionally, enrichment occurs within the SIEM after ingestion. But moving enrichment into the pipeline offers major advantages.
When enrichment occurs early, in transit, the data can capture valuable context that might otherwise be lost – for example, device location, asset metadata, or whether an IP address belongs to a trusted range. Performing this work closer to the source makes enrichment more accurate and ensures that the data arrives complete and meaningful.
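A simplified sketch of in-transit enrichment follows, using a made-up asset inventory and trusted-range list; the lookup tables and field names are assumptions chosen for illustration.

```python
import ipaddress

# Hypothetical context available to the pipeline while the event is still
# in transit: an asset inventory and the organization's trusted ranges.
ASSET_INVENTORY = {"10.0.0.1": {"hostname": "fw-branch-01", "location": "Berlin"}}
TRUSTED_RANGES = [ipaddress.ip_network("10.0.0.0/8")]

def enrich(event: dict) -> dict:
    src = event.get("source.ip")
    if src:
        event.update(ASSET_INVENTORY.get(src, {}))          # asset metadata
        addr = ipaddress.ip_address(src)
        event["source.trusted"] = any(addr in net for net in TRUSTED_RANGES)
    return event

print(enrich({"source.ip": "10.0.0.1", "event.action": "allow"}))
```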
This approach turns the pipeline into more than a transport system. It becomes a layer of intelligence that improves every downstream use case.
The Eureka Moment
During the discussion, Anton Chuvakin described a moment of realization. As an analyst, he had long viewed pipelines as tools for cost control – something you buy to cut your SIEM bill.
When the conversation shifted to data quality, the perspective changed. He realized that pipelines are the missing infrastructure for fixing data problems before they reach the SIEM. He said:
"They're not cost-cutting technology; they're data quality technology."
This shift reframes the entire role of the pipeline. Deciding which logs to collect is not enough; you need confidence that the logs you collect are correct and usable. That is the true role of a modern data pipeline.
Building Data That Works
Organizations focused solely on cutting SIEM costs are looking at the wrong metric. The real question is whether the data they collect is reliable, well-structured, and usable.
As I said in the podcast:
“When I built syslog-ng in 1998, the goal was to move bits reliably. Today, the goal is to make those bits meaningful.”
The future of security data is not defined by where it’s stored, but by how easily it can be used. Data pipelines built for normalization and quality are the key to achieving this.
The industry is moving toward architectures where data quality is non-negotiable. Modern pipelines make this possible by ensuring that the data arriving in your security tools is complete, consistent, and ready for immediate use.
Follow Our Progress!
We are excited to be realizing this vision with the full Axoflow product suite.
Fighting Data Loss?

Book a free 30-min consultation with syslog-ng creator Balázs Scheidler
