Managing logs and data pipelines effectively is a growing challenge for IT and security teams. Traditional approaches that simply funnel all data into SIEMs compromise data quality, hampering visibility and actionable insights. Data pipeline management solutions like Axoflow streamline data processing, help security teams work more effectively, and reduce ingestion costs.
Recently, Allie Mellen (Principal Analyst at Forrester) highlighted these pain points in her blog post “If You’re Not Using Data Pipeline Management For Security And IT, You Need To”, which resonates strongly with what we’re doing at Axoflow as a security data pipeline management solution.
The security data quality problem
“Send everything into the SIEM and it’ll be good for you”, goes the decade-long premise of SIEM vendors. More data means more visibility, leading to more insights and better security, right? Unfortunately, this isn’t necessarily true. In recent years it has become apparent that SIEMs don’t deliver on their value-added promises because they receive low-quality data from their data sources.
More than 18% of SIEM rules are broken and will never fire an alert, due to common issues such as misconfigured data sources, missing fields, and parsing errors.
Or as Allie put it: “Visibility without actionability is an expensive waste of time.”
Data quality issues cause problems for the detection engineers in your SOC. They come in various forms, for example:
- Unclassified data sources: The SIEM doesn’t recognize the sender, and consequently doesn’t apply the usual processing steps to the data. For example, this can happen when the sourcetype is not set in Splunk.
- Unstructured data: Important information often comes in the unstructured part of the payload, for example, in the text part of syslog messages or event logs. These data points are difficult to work with unless you parse the unstructured part of the message.
- Noise: Often, a large volume of data sent to the SIEM has no security relevance at all. Such data is useless, gets in the way of your detection engineers, and also increases storage and processing requirements.
- Unnormalized data: Similar events (for example, logs of different firewall products) should be normalized so you can extract the same information from them (see the normalization sketch after this list).
- Parsing issues: Generic data parsers often fail to process the incoming data if it doesn’t exactly comply with the specified format. For example, syslog data sources often produce invalidly formatted messages.
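To make the normalization problem more concrete, here is a minimal Python sketch that maps two hypothetical firewall log formats (the field names and layouts are illustrative, not tied to any real vendor) onto a common schema, so the same detection rule can cover both products:

```python
import re

# Hypothetical raw events from two different firewall products.
# Field names and formats are illustrative, not tied to any real vendor.
EVENT_A = "2024-05-01T10:02:33Z action=deny src=10.0.0.5 dst=8.8.8.8 dport=443"
EVENT_B = "May  1 10:02:33 fw01 DROP TCP 10.0.0.5:51544 -> 8.8.8.8:443"

def normalize_product_a(raw: str) -> dict:
    """Parse key=value pairs into a common schema."""
    fields = dict(kv.split("=", 1) for kv in raw.split()[1:])
    return {
        "timestamp": raw.split()[0],
        "action": "drop" if fields["action"] == "deny" else fields["action"],
        "src_ip": fields["src"],
        "dst_ip": fields["dst"],
        "dst_port": int(fields["dport"]),
    }

def normalize_product_b(raw: str) -> dict:
    """Parse the positional format of the second product into the same schema."""
    m = re.match(
        r"(?P<ts>\w+\s+\d+ [\d:]+) \S+ (?P<action>\w+) \w+ "
        r"(?P<src>[\d.]+):\d+ -> (?P<dst>[\d.]+):(?P<dport>\d+)",
        raw,
    )
    if m is None:
        raise ValueError(f"unrecognized event: {raw!r}")
    return {
        "timestamp": m["ts"],
        "action": m["action"].lower(),
        "src_ip": m["src"],
        "dst_ip": m["dst"],
        "dst_port": int(m["dport"]),
    }

# Both events now expose the same fields, so one rule covers both products.
print(normalize_product_a(EVENT_A))
print(normalize_product_b(EVENT_B))
```

In a real pipeline, mappings like these are maintained per product and applied automatically; the point is that the consumer only ever sees the common field names.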
So, where does all this low-quality data come from? According to our experience, the average composition of data ingested into the SIEM looks like:
In this post I’m focusing on syslog, but that doesn’t mean that other data sources automatically provide better quality data. Data pipeline management can improve the quality of all data types.
The syslog data quality problem
About 50% of all data ingested into the SIEM is syslog data, and a surprisingly large number of syslog data sources (like applications and appliances) produce malformed or invalid syslog messages.
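As an illustration of what “malformed” means in practice, the following Python sketch parses a simplified RFC 3164-style syslog line and, when the header is missing or broken, repairs the event at the collection point instead of letting an unparsable blob reach the SIEM. The fallback values and the `collector01` host name are assumptions for the example, not a prescription:

```python
import re
from datetime import datetime, timezone

# Simplified RFC 3164 pattern: <PRI>TIMESTAMP HOSTNAME MESSAGE
SYSLOG_RE = re.compile(
    r"^<(?P<pri>\d{1,3})>"
    r"(?P<ts>\w{3}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) "
    r"(?P<msg>.*)$"
)

def parse_syslog(raw: str, receiver_host: str = "collector01") -> dict:
    """Parse a syslog line; repair common defects instead of dropping the event."""
    m = SYSLOG_RE.match(raw)
    if m:
        return {
            "severity": int(m["pri"]) % 8,
            "facility": int(m["pri"]) // 8,
            "timestamp": m["ts"],
            "host": m["host"],
            "message": m["msg"],
        }
    # Malformed message: no valid header. Keep the payload and add the missing
    # metadata (receive time, sending host) in the pipeline, instead of letting
    # the SIEM index an unparsed blob.
    return {
        "severity": 5,          # assume "notice" when the priority is missing
        "facility": 1,          # assume "user"
        "timestamp": datetime.now(timezone.utc).strftime("%b %d %H:%M:%S"),
        "host": receiver_host,  # fall back to the receiving collector
        "message": raw,
    }

print(parse_syslog("<134>May  1 10:02:33 fw01 kernel: DROP 10.0.0.5 -> 8.8.8.8"))
print(parse_syslog("firewall blocked connection from 10.0.0.5"))  # malformed
```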
You (and the SIEM vendors) could say that you can fix up your data in the SIEM. In practice, fixing such data problems in the SIEM is a problematic, ineffective, and mostly manual chore that constantly hampers security teams’ work. And even if you could solve the data quality issues in the SIEM, there are problems you can’t solve there:
- Log size reduction. The data is already in the SIEM and you’ve paid for it at ingestion time.
- Routing. A SIEM is most often a final destination for the data, and you can’t really forward the parts that you’d rather send somewhere else, for example, to a monitoring or analytics system, to low-cost storage, or nowhere at all (because it’s not security-relevant, for example, you don’t want to send debug-level logs into the SIEM).
- Vendor lock-in. Managing to solve all your data issues in the SIEM would also mean a huge dependency on that particular SIEM, because all those fixes, processing rules, and data tweaks work only in that SIEM. This leaves you at the mercy of that vendor’s pricing changes. And should a new vendor appear with a superior feature set, many enterprises would be reluctant to migrate because of the immense work of reimplementing everything in the new tool.
Another important aspect of the data quality problem is its cost: ingesting data into a security information and event management (SIEM) system has become far too expensive. With collected data volumes growing by over 25% year over year on average, organizations are overspending because of volume-based licensing.
The solution is in the pipeline
All these issues can be remedied most effectively by shifting data processing left, into the data pipeline. (I know, technically, the most effective place would be the data source itself, but we haven’t gotten any nearer to that in the last 20 years, so I’m beyond being optimistic in that regard. And realistically, pushing routing decisions from a central location to a wide range of data sources and appliances from different vendors sounds far-fetched even to me.)
These tools are categorized under different names, including data pipeline management (DPM) (Forrester), data fabric, and telemetry pipeline management (Gartner). Security data pipeline management tools (like our own Axoflow Platform) specifically address the use cases that solve data pipeline management for security teams.
So as I’ve said, shifting left and moving data processing into the data pipeline allows you to parse and process the data so that you can (a sketch of these steps follows the list):
- fix the incoming data and add missing information, like hostname or timestamp,
- classify the data to identify the source,
- redact sensitive information, like PII, before it gets saved to a SIEM or storage,
- enrich the data with contextual information, like adding labels based on the source or content of the data,
- use all the above to route the data to the appropriate destinations, and finally
- transform the data into an optimized format that the destination can reliably and effortlessly consume. This includes mapping your data to multiple different schemas if you use multiple analytic tools.
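Here is a minimal, self-contained Python sketch of these steps on a single event: classification, PII redaction, enrichment, and mapping to a destination schema. All field names, lookup tables, and the sample message are invented for illustration; a production pipeline drives these steps from maintained source and host databases:

```python
import re

# A parsed event as it might arrive from the collection layer (illustrative fields).
event = {
    "host": "vpn-gw-03",
    "app": "vpn",
    "message": "login failed for jane.doe@example.com from 203.0.113.7",
}

# 1. Classify: label the source so downstream policies can refer to it by type.
SOURCE_LABELS = {"vpn": {"sourcetype": "vpn_auth", "security_relevant": True}}
event.update(SOURCE_LABELS.get(event["app"], {"sourcetype": "unknown"}))

# 2. Redact: mask PII (here, e-mail addresses) before anything is stored.
event["message"] = re.sub(
    r"[\w.+-]+@[\w-]+\.[\w.]+", "<redacted-email>", event["message"]
)

# 3. Enrich: attach contextual metadata, e.g. the site the host belongs to.
HOST_INVENTORY = {"vpn-gw-03": {"site": "eu-west", "owner": "network-team"}}
event.update(HOST_INVENTORY.get(event["host"], {}))

# 4. Transform: map the event to the schema a given destination expects.
siem_record = {
    "sourcetype": event["sourcetype"],
    "host": event["host"],
    "site": event.get("site"),
    "event": event["message"],
}
print(siem_record)
```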
To sum up, you get to send far less but much better quality data to every consumer, including the SIEM (or SIEMs) and your monitoring tools. For details about the real-life benefits, see our case studies.
Other benefits of Axoflow:
- Data collection from applications, cloud services, cloud-native sources (OpenTelemetry, Kubernetes), traditional Linux/Unix (syslog-ng) sources, and Windows hosts.
- Host attribution.
- Manage and monitor your security data pipeline and get real-time metrics and alerts about system health, data volume, data dropouts, data bursts, and critically, transport costs.
What your teams get from data pipeline management
For security teams to work effectively, you need the right people, processes, and technology. Technology and tools must support and improve your processes and make your team members’ work more effective.
So, how do your security teams benefit from data pipeline management?
- The immediate value is reduced data volume (log size), which means decreased ingestion costs. Depending on your exact data mix, the volume reduction can exceed 50%.
- Data routing contributes to further cost reduction (see the routing sketch after this list):
- by not sending unneeded data into the SIEM, and
- by redirecting data that only needs to be retained to low-cost bulk storage solutions.
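The routing decision itself can be thought of as a small policy evaluated per event. The sketch below is illustrative Python with made-up labels and destination names; the idea is simply that events the SIEM doesn’t need are never ingested there:

```python
# Illustrative routing policy: decide where each event goes based on the
# labels and severity attached earlier in the pipeline.
def route(event: dict) -> str:
    if event.get("severity") == "debug":
        return "drop"            # no security value, don't pay to ingest it
    if event.get("security_relevant"):
        return "siem"            # alerts and investigations need it
    if event.get("retention_required"):
        return "object-storage"  # keep it cheaply for compliance
    return "analytics"           # operational data goes to the monitoring stack

events = [
    {"app": "vpn", "security_relevant": True, "severity": "warning"},
    {"app": "nginx", "severity": "debug"},
    {"app": "billing", "retention_required": True, "severity": "info"},
]
for e in events:
    print(e["app"], "->", route(e))
```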
But the main benefit for the people on your security teams is better data quality: data quality improvements in the pipeline lead to better data in the SIEM. Better quality data means that:
- your content management teams can create better, more accurate alerts and queries, and
- your security analysts have more contextual and metadata available when investigating incidents, leading to faster response and resolution.
Automation in the pipeline
Most data pipeline management tools provide ways to improve data quality. However, in most solutions these are only capabilities: your teams still have to configure these tools manually. This requires in-depth knowledge of the data sources (like application and appliance logs), and creates a continuous workload for your team.
The real added value that helps your teams is a security data pipeline management solution that automates the data quality improvement and enrichment steps. Axoflow is such an opinionated solution: it automatically identifies and classifies incoming log messages and provides the metadata and contextual data that allow you to formulate high-level routing policies, instead of having to set routing rules manually for each source. We maintain the database required to do that, so your teams can spend their time on other things.
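Conceptually, automatic classification works like a curated pattern database: the message itself identifies the product, the classifier attaches labels, and routing policies are written against those labels instead of against individual sources. The Python sketch below is a toy illustration of that idea; the two patterns and the policy table are examples, not Axoflow’s actual database:

```python
import re

# A miniature "classification database": patterns that identify a product from
# the message itself and attach labels. A real curated database covers far
# more products; these two entries are purely illustrative.
CLASSIFIERS = [
    (re.compile(r"%ASA-\d-\d+"), {"vendor": "cisco", "product": "asa", "class": "firewall"}),
    (re.compile(r"sshd\[\d+\]"), {"product": "sshd", "class": "auth"}),
]

def classify(message: str) -> dict:
    for pattern, labels in CLASSIFIERS:
        if pattern.search(message):
            return labels
    return {"class": "unclassified"}

# High-level policy: written once, in terms of labels, instead of one routing
# rule per source host or appliance.
POLICY = {"firewall": "siem", "auth": "siem", "unclassified": "review-queue"}

for msg in [
    "%ASA-6-302013: Built outbound TCP connection",
    "sshd[1423]: Failed password for root from 203.0.113.9",
    "some appliance said something unexpected",
]:
    labels = classify(msg)
    print(POLICY[labels["class"]], labels)
```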
Shifting data quality improvements and routing into the data pipeline also makes you more independent from your SIEM, so every consumer receives the same benefits and improved data. On the one hand, this means improved support for multi-tool scenarios (for example, when you feed your data both into a SIEM and a monitoring/analytics tool). On the other hand, it also simplifies multi-SIEM scenarios (which are on the rise), and is an immense help when you’re migrating to another SIEM.
Conclusion
Traditional SIEM-centric approaches are no longer sufficient to handle the growing volume, diversity, and complexity of data, nor do they address the rising costs and operational inefficiencies caused by data quality issues. By adopting a robust security data pipeline management solution like Axoflow, organizations can improve both data quality and the operational agility of their security analysts and related teams.
Axoflow not only reduces data volume and ingestion costs but also empowers teams with cleaner, richer, and more actionable data. It supports multi-tool and multi-SIEM environments, enhances incident response times, and ensures that your security architecture remains flexible and future-proof. See our case studies for a showcase of its real-life benefits.
Follow Our Progress!
We are excited to be realizing this vision with the full Axoflow product suite.