Overwhelmed by security data? Learn why CISOs and SOC teams are rethinking SIEM pipelines. Discover how intelligent data pipelines cut costs, improve log quality, eliminate blind spots, and empower security operations with automation and AI.

Drowning in Security Data: Why SOCs and CISOs are Rethinking the Pipeline

CISOs and senior security engineers face an ever-evolving landscape of threats and operational complexities. While the focus often remains on sophisticated attacks and advanced defense mechanisms, a foundational and increasingly critical challenge lies in the very data that fuels our security operations. The sheer volume, inconsistent quality, and the subsequent cost implications of security data are creating significant pain points, impacting SOC team efficiency, SIEM effectiveness, and overall organizational resilience.

The Unrelenting Deluge: Data Overload and Escalating Costs

Security data volumes are growing at an astonishing rate, with estimates suggesting a tenfold increase over the last three to four years - a pace at which volumes roughly double every year. This exponential growth, driven by an expanding digital footprint across hybrid infrastructures - cloud, on-premises systems, and a multitude of devices - presents an alarming challenge. As Vikas Gopal, a Senior Security Consultant and expert in the data world, points out, "the volume that is being produced, it's enormous".

This data deluge directly translates into escalating costs. Many Managed Security Service Providers (MSSPs) and SIEM vendors have shifted their pricing models from enterprise size to data volume, meaning organizations are now paying directly for every byte ingested. Another security consultant highlights that "Most of the companies... are sending maybe 50% more than what is needed for managing SIEM," leading to system overload and increased expenses.

This raises a critical question for CISOs: how much of the data currently being ingested is truly valuable, and how much is merely inflating operational budgets? Ignoring this can lead to an unsustainable cost model, where the pursuit of comprehensive visibility clashes with financial realities.

Moving Beyond the Swamp: Lakehouse and Federated Storage

Cory Minton argues that the answer is not to double down on the idea that the SIEM must contain everything. That mindset creates what he calls a swamp of data, where logs accumulate without a clear relationship to outcomes. Instead, he advocates a Lakehouse-style architecture in which data is curated based on its utility. High-value telemetry that drives real-time detections and operations stays in the SIEM or other hot analytics tiers, while compliance-focused or low utility logs flow to lower-cost object storage and data warehouses.

This approach depends on being able to query data where it lives. Federated search and analytics allow SOC teams to stitch together context across SIEM, data lake, and warehouse systems without constantly rehydrating cold data back into the SIEM and paying to ingest it again. In practice, this means that volume and cost concerns can be addressed without sacrificing the breadth of evidence available for investigations, provided that the organization has a pipeline that can route, tag, and prepare data correctly for each tier.
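
To make this concrete, here is a minimal sketch of what utility-based routing could look like inside a pipeline stage. The source lists, tier names, and forwarding stubs are illustrative assumptions for this post, not Axoflow's actual routing API.

```python
# Minimal, hypothetical sketch of utility-based tier routing in a pipeline stage.
# The source lists, tier names, and forwarding stubs are illustrative placeholders.

from typing import Callable

# Sources whose telemetry drives real-time detections stay hot (SIEM).
HOT_SOURCES = {"edr", "identity", "firewall:threat"}
# High-volume, compliance-oriented sources go to low-cost object storage.
ARCHIVE_SOURCES = {"netflow", "dns:query", "os:audit"}


def forward(tier: str) -> Callable[[dict], None]:
    """Stand-in for the real transport to each tier (SIEM, warehouse, object store)."""
    return lambda event: print(f"[{tier}] {event.get('source_type')}: {event.get('message', '')[:60]}")


send_to_siem = forward("siem")
send_to_object_storage = forward("object-storage")
send_to_warehouse = forward("warehouse")


def route_event(event: dict) -> str:
    """Route a single event to the storage tier that matches its utility."""
    source = event.get("source_type", "unknown")
    if source in HOT_SOURCES:
        send_to_siem(event)
        return "hot"
    if source in ARCHIVE_SOURCES:
        send_to_object_storage(event)
        return "archive"
    send_to_warehouse(event)
    return "warm"


route_event({"source_type": "edr", "message": "process created: powershell.exe"})
route_event({"source_type": "netflow", "message": "10.0.0.5 -> 8.8.8.8 443/tcp"})
```

The point of the sketch is simply that the routing decision happens before any system charges for ingestion, which is what makes tiered storage compatible with broad visibility.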

The Parsing Predicament: Inconsistent Schemas and Blind Spots

Beyond volume, the quality and consistency of security data pose a persistent headache. Anton Chuvakin, a recognized expert in the SIEM space and Security Advisor at the Office of the CISO, Google Cloud, laments the "mess in our logs today". Despite decades of effort, including initiatives like Common Event Expression (CEE) and the rise of multiple "competing standards" like OCSF, CIM, ECS, and CEF, a universal log standard remains elusive.

Filip Stojkovski, Staff Security Engineer and founder of SecOps Unpacked, notes that each organization often grapples with custom log sources and internally developed platforms, leading to "a complete mess" of unstructured logs that are difficult to normalize.

This lack of standardization has several severe consequences:

  • Complex Parsing: Raw logs from diverse sources (network gear, old servers, mainframes) often arrive in no modern, structured format such as JSON and lack essential identifiers such as asset IDs or user identities. Translating this raw data into a form a SIEM can use is a labor-intensive, error-prone process that often requires "programming in the pipeline" (a minimal parsing sketch follows this list).
  • Vendor Lock-in and Migration Pain: Organizations that heavily invest in custom parsers for a specific SIEM face "ungodly pain" when attempting to migrate to a new platform. All the custom parsing logic, representing significant sunk costs, essentially "goes in the trash," forcing security teams to rebuild from scratch. This creates strong vendor lock-in that stifles innovation and agility.
  • Data Quality Issues: Even when data is parsed, critical information can be missing or corrupted. As Vikas Gopal observes, key fields like user, process, or file information are often "blank" or "lost somewhere," leading to false positives or hindering effective detection. Incorrect formats, such as missing time zones, further complicate analysis. These data quality issues can result in "blind spots" in the SIEM, leaving organizations vulnerable.
  • The "Shift Left" for Data Processing: The consensus emerging from industry experts is that addressing data problems in the pipeline ("shift left") is far more efficient and cost-effective than trying to fix them once they've landed in the SIEM. This early intervention can provide significant cost savings by reducing the volume of data that needs to be stored and processed by expensive SIEM resources.

Standardization as the Backbone of Federation and Correlation

Rob Gil notes that these parsing and quality problems become even more painful in a federated world. It is one thing to normalize logs for a single SIEM. It is another to correlate identity, hosts, and activity across a SIEM, a Snowflake warehouse, and an S3 archive, each with different conventions for the same entities. From his work on the Elastic Common Schema, he emphasizes that federation without standardization simply does not work. You must agree on core fields such as source, destination, and identity if you want correlation to be reliable.

Gil also observes that application teams rarely treat logging as a first-class design concern. As a result, security teams frequently find themselves trying to reconcile field names and values in the middle of an incident, which is the worst possible time to attempt data engineering. This is why he points to pipeline technologies such as Axoflow as essential. These tools provide a place to enforce common schemas, maintain mappings between models like ECS and OCSF, and standardize identity and context before data spreads across multiple systems.
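
As a rough illustration of what such a mapping layer does, the sketch below renames a handful of ECS-style fields into OCSF-style equivalents. The field pairs are simplified examples chosen for this post, not a complete or authoritative mapping between the two schemas.

```python
# Simplified, illustrative sketch of maintaining a mapping between schema dialects.
# The field pairs below are examples only, not a complete ECS-to-OCSF mapping.

ECS_TO_OCSF = {
    "source.ip": "src_endpoint.ip",
    "destination.ip": "dst_endpoint.ip",
    "user.name": "actor.user.name",
    "host.name": "device.hostname",
    "event.action": "activity_name",
}

def translate(record: dict, mapping: dict[str, str]) -> dict:
    """Rename known fields; keep unknown ones so no context is silently dropped."""
    out = {}
    for key, value in record.items():
        out[mapping.get(key, key)] = value
    return out

ecs_event = {
    "source.ip": "10.1.2.3",
    "user.name": "jdoe",
    "event.action": "logon",
    "custom.team_tag": "payments",   # unmapped fields pass through untouched
}
print(translate(ecs_event, ECS_TO_OCSF))
```

The value is not the dictionary itself but where it lives: maintained once in the pipeline, the mapping is applied consistently before data fans out to the SIEM, the warehouse, and the archive.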

Beyond SIEM: The Critical Role of Intelligent Data Pipelines

Given these challenges, the conversation increasingly points towards the necessity of an intelligent data pipeline layer positioned strategically before the SIEM and, in modern architectures, before data lakes and warehouses as well. Such a pipeline transforms raw, disparate logs into a normalized, enriched, and relevant dataset, optimized for downstream security analytics and for federated search.

Key capabilities of an intelligent data pipeline include:

  • Universal Data Ingestion and Classification: It acts as a universal intake for all security data, regardless of its source or format. Through sophisticated classification, it can identify the exact product - for example, a Palo Alto firewall - and even the specific log type, such as threat, traffic, or system, providing immediate context. This ensures that data is usable and accessible from the moment it enters the pipeline.
  • Real-time Normalization and Enrichment: It translates diverse log formats into a common, structured language, enriching data with crucial metadata and context while in transit. This means consistent parsing, regardless of the quirks of individual devices, and the ability to add missing information or context from other sources. The same normalization work also enables correlation across SIEM, lake, and warehouse environments.
  • Intelligent Volume Reduction: The pipeline can apply smart filtering, aggregation, and deduplication logic to reduce data volume without sacrificing critical forensic detail. For instance, it can summarize thousands of identical informational logs into a single, comprehensive record, while retaining the underlying raw data for audit or forensic needs in a cost-effective archive (see the sketch after this list). This addresses the core cost problem associated with data volume and supports tiered storage strategies where only the most valuable data is kept hot.
  • Ensuring Data Coverage and Integrity: A critical function of this layer is to monitor data flow and alert on missing sources or changes in expected data patterns. As Vikas Gopal notes, "the pipeline should be responsible for data sources and whether they go missing or not, because the pipeline has the information, whereas the SIEM doesn't". This proactive monitoring turns potential "blind spots" into actionable security incidents, a responsibility often overlooked by traditional SIEMs.
  • Facilitating SIEM Migrations: By centralizing parsing and normalization, an intelligent data pipeline decouples data processing from the SIEM itself. This dramatically simplifies migrations between SIEM platforms, as the pipeline can feed the new SIEM with data in its required format, eliminating the need to re-engineer custom parsers from scratch. The same pattern applies when adopting new data warehouses or analytics engines, which reduces vendor lock-in and improves organizational agility.
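
The hedged sketch below illustrates two of the ideas from the list above: collapsing repeated informational events into a single summary record, and flagging expected sources that go silent. The thresholds, field names, and in-memory state are placeholder assumptions, not how Axoflow implements these capabilities.

```python
# Simplified sketch of two pipeline responsibilities discussed above:
#   1. collapsing repeated informational events into a single summary record, and
#   2. alerting when an expected source goes quiet.
# Thresholds, field names, and the in-memory state are illustrative placeholders.

import time
from collections import defaultdict

class PipelineStage:
    def __init__(self, silence_threshold_s: float = 300.0):
        self.counts = defaultdict(int)        # (source, event_id) -> occurrences
        self.last_seen = {}                   # source -> last timestamp seen
        self.silence_threshold_s = silence_threshold_s

    def process(self, event: dict) -> dict | None:
        """Suppress repeated informational events; pass everything else through."""
        source, event_id = event["source"], event["event_id"]
        self.last_seen[source] = time.time()
        if event.get("severity") == "informational":
            self.counts[(source, event_id)] += 1
            if self.counts[(source, event_id)] > 1:
                return None               # suppressed; the raw copy goes to a cheap archive
        return event

    def flush_summaries(self) -> list[dict]:
        """Periodically emit one summary record per suppressed event type."""
        summaries = [
            {"source": src, "event_id": eid, "count": n, "type": "summary"}
            for (src, eid), n in self.counts.items() if n > 1
        ]
        self.counts.clear()
        return summaries

    def silent_sources(self) -> list[str]:
        """Sources that have not reported within the threshold become alerts."""
        now = time.time()
        return [s for s, ts in self.last_seen.items()
                if now - ts > self.silence_threshold_s]


stage = PipelineStage()
stage.process({"source": "fw01", "event_id": "CONN_OK", "severity": "informational"})
```

The important design point is that the pipeline, not the SIEM, holds the state about which sources exist and when they last reported, which is what makes the missing-source alerting possible at all.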

Agentic AI and the Federated SOC of the Future

Once data is consistently structured and available across tiers, organizations can start to realize the potential of agentic AI. Cory Minton describes a model in which AI agents use the Model Context Protocol to discover and retrieve the right data from multiple systems. In this pattern, an agent might pull telemetry from Splunk, business context from Snowflake, and historical evidence from Amazon S3, then present an analyst with a unified, context-rich view of an incident.
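
The sketch below is a deliberately simplified illustration of that pattern: an agent asks several data-source clients for context about the same host and merges the answers into one view. The client classes and their query methods are hypothetical stand-ins; a real agent would reach each platform through its own MCP server rather than these placeholder calls.

```python
# Deliberately simplified illustration of the federated-context pattern described
# above. The client classes and their query() methods are hypothetical stand-ins;
# a real agent would reach each platform through its own MCP server.

class SplunkClient:
    def query(self, host: str) -> dict:
        return {"recent_detections": ["credential_dumping"], "host": host}

class SnowflakeClient:
    def query(self, host: str) -> dict:
        return {"owner": "payments-team", "business_criticality": "high"}

class S3ArchiveClient:
    def query(self, host: str) -> dict:
        return {"historical_events_90d": 1843}

def build_incident_context(host: str) -> dict:
    """Stitch telemetry, business context, and history into one analyst view."""
    context = {"host": host}
    clients = {
        "siem": SplunkClient(),
        "warehouse": SnowflakeClient(),
        "archive": S3ArchiveClient(),
    }
    for name, client in clients.items():
        context[name] = client.query(host)
    return context

print(build_incident_context("web-prod-07"))
```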

Rob Gil cautions that this kind of AI-assisted investigation still lives or dies on data quality. If identity logs are fragmented or inconsistent, AI will struggle with the same herding cats problem that slows down human analysts. The promise of agentic AI is therefore tightly coupled to the quality of the pipeline. Clean schemas, reliable identity fields, and clear mappings across platforms determine whether AI accelerates investigations or simply adds another layer of confusion.

The Human Factor and Automation's Promise

The ultimate goal of such data pipeline optimization is to empower SOC teams. Security operations centers are plagued by challenges like analyst fatigue and poor talent retention, often due to "routine mind-numbing work". Automation, particularly when driven by Artificial Intelligence (AI) and Machine Learning (ML), holds immense promise in alleviating this burden.

SOAR (Security Orchestration, Automation, and Response) platforms already automate a significant portion - up to 50-60% - of alert responses. Anton Chuvakin highlights the vision for "SOAR 2.0 powered by AI when you don't write the playbook," where machines dynamically build and adjust playbooks based on threat context and environment specifics. AI/ML can also significantly assist in anomaly detection, summarizing incidents, and generating queries, drastically speeding up analysis.

However, the adoption of AI must be approached with caution. Boris Kogan, a CISO in financial services, advises a "go slow school" approach, emphasizing the critical need for human oversight and validation of AI-generated outputs. As he points out, "if an AI tool... starts building detection and activities based on such detections, it can be dangerous," potentially aiding threat actors if incorrect. The "human in the loop" remains essential to validate AI's suggestions, measure metrics, and provide feedback, ensuring accuracy and preventing the system from "learning how to SOC badly". By offloading the mundane tasks to intelligent automation and pipelines, human analysts can focus on higher-value activities like threat hunting and detection engineering, enriching their roles and improving retention.

Conclusion

In an era where security data is both an invaluable asset and an overwhelming burden, CISOs and senior security engineers must shift their focus from merely collecting logs to intelligently managing them across storage tiers, platforms, and AI-driven tools. An optimized and standardized data pipeline, capable of feeding SIEM, data lakes, and federated analytics, provides the critical foundation for effective security operations. It ensures data quality, reduces costs, eliminates blind spots, and prepares the organization for agentic AI, while freeing SOC teams to focus on real threats rather than data wrangling.

To learn more about Axoflow's autonomous data pipeline, check out our Axoflow Zero to Hero: Stream Security Data Anywhere blog post, or request a demo. If you are interested in our storage solutions, just read this blog post.

This post summarizes recent episodes of Balázs Scheidler's podcast series, in which he talks with security experts about the challenges facing SOCs.

Listen to the podcast here.

Follow Our Progress!

We are excited to be realizing our vision above with a full Axoflow product suite.

