Long-Term Log Storage Without SIEM Costs: The Axoflow Storage Layer

If you work for an enterprise in a highly regulated industry, you know that logs and security data must be retained for years — typically 3–5, and often 7–10 or more — to satisfy compliance requirements. At the same time, data volumes keep climbing, and storing all of it long-term in a SIEM becomes extremely expensive: you end up paying premium, analytics-grade prices to warehouse data that mostly just sits there. Plenty of storage solutions exist, each with plenty of features, but at this scale two problems dominate: runaway storage costs and vendor lock-in.

What you actually need for long-term retention is different from what you need for analytics. Heavy analytics use cases are well served by a SIEM with short retention, but long-term storage calls for something built around efficient writes, cost-effective capacity, and the ability to query and rehydrate logs when you need them — years down the line.

Long-term storage of sensitive data like this is a hard problem. Let's look at what the current landscape offers, and why we decided to build a solution tailored to this use case.

Why long-term log storage is hard: SIEMs, ClickHouse, and cost problems

For better or worse, SIEMs are the natural habitat of security events. They provide crucial insight through threat analysis, detection features, alerts, metrics, and dashboards. Their main use case is analytics, and they have to store logs to support it. However, all those features make SIEMs quite expensive for long-term storage, especially considering that ~90% of the data never gets queried.

ClickHouse is a natural choice for handling large-scale data because it was built for real-time analytics. It is open source and self-hostable, so your only cost is the infrastructure bill. ClickHouse can store massive amounts of data efficiently while keeping it quickly queryable.

The tradeoff is that ClickHouse stores data on object stores in its own engine-specific format. That data can only be accessed through ClickHouse, using indexes that ClickHouse keeps locally on the machine where it runs. For the data to remain accessible after 3–5 years, you need a comprehensive backup strategy for both the data and those local indexes.

SIEMs and analytical database solutions are great at what they were designed to do: ingesting and storing data to provide real-time analytics, optimised for frequent queries. That allows for features like dashboards and alerts. But these features are overkill for simple long-term storage use cases.

As Axoflow already provides real-time analytics in the data pipeline, we needed a solution for the long-term storage of data that:

comes without vendor lock-in,
works on-prem, in the cloud, and also in air-gapped environments, and
has good query performance.

We decided to create our own storage layer to best meet these needs.

Axoflow Storage Layer: Parquet files on object storage under your control

Our philosophy is simple: your logs belong to you. Axoflow writes them as open-format Parquet files directly onto your own object storage, and from that point they're entirely yours to keep, move, or query however you want. Those files are completely self-contained. There's no SaaS to subscribe to, no database to run beside them, and no local index to maintain. The files are all you need, and they'll still open 10 or 20 years from now, long after today's tools have come and gone.

Why we chose Apache Parquet for log storage

We’ve designed our storage layer around storing Apache Parquet files in object stores. This has been standard practice in data engineering for years and has spawned a vibrant ecosystem of tooling; today, most data analysis tools can read Parquet from object stores. Parquet is an open format, and each file contains everything required to read it: there’s no external index or metadata database. You will be able to read these files even 10+ years from now.

Columnar storage: read only what you query

Parquet is a columnar format, so a query can read a single column from a file without loading unrelated ones. This is especially helpful when queries filter on low-cardinality columns: after reading the filter column and identifying the matching rows, the engine only needs to read those rows from the other columns. Compression and encryption also work column by column, so only the relevant columns need to be decompressed and decrypted. Columnar compression is also very efficient, resulting in a ~90% reduction in storage space.

Built-in statistics that skip files you don't need

Parquet files contain statistics in their metadata, which are crucial for query efficiency. These column statistics record the minimum and maximum values of each column in the file. A query engine reads this tiny metadata first and can tell, for example, whether the file might contain the timestamp or host label you are looking for. If it cannot, the engine skips the rest of the file. Plus, you don’t have to implement any of this: there are awesome query engines that have this and a bunch of other optimisations built in.

Separating storage from the query engine

The big data ecosystem has shifted in recent years: instead of relying on traditional dedicated databases, it has separated the storage layer from the query engine. For example, with Apache Spark you can provision hundreds of machines to run a distributed query on a Parquet dataset stored in S3. Even more recently, embeddable query engines such as DuckDB and Apache DataFusion have appeared.

Embeddable query engines: no database to run

Embeddable query engines are like SQLite: each is a library that you include in your application, and querying a dataset is just a function call. You don’t need to run and maintain a dedicated database to query the data. You can even limit their resource usage: by spilling intermediate results to disk, they can complete queries with a few hundred MB of RAM. This is revolutionary: in some cases even small servers can execute queries, with no need to provision huge Spark clusters.

The secret sauce: how to stream logs into Parquet

Parquet is a great format for log storage, but it wasn’t designed for continuous streaming. A Parquet file can only be queried after the metadata and statistics have been written into the footer and the file has been closed. This also means that you cannot append new records to an existing Parquet file. One solution is to close Parquet files after a few seconds, which makes them available for querying, but having a lot of small Parquet files doesn't take advantage of the good compression the columnar format provides, and it makes the queries spread out over a lot of files.

We solved this problem by using a two-step process:

First, we write the incoming data to local disk in a columnar intermediary format.
When enough records have accumulated, we combine them into a Parquet file and save it to the object store.

This approach gives us the best of both worlds: we create large Parquet files that take advantage of good compression, while incoming records are available for querying instantly. Instant querying works because the intermediary file format is one that our embedded query engines can query directly. Parquet is the long-term destination of the data; we use the intermediary files only to build the most optimal Parquet files we can, which can then sit in storage for years.

High write performance with Apache Arrow Flight

Our other secret is great write performance with minimal overhead. As we outlined at the beginning, log storage is write-heavy, so write performance is absolutely crucial.

We’ve created an Apache Arrow Flight destination for AxoSyslog (the foundation of our AxoRouter), which uses the columnar Arrow IPC data format. AxoSyslog already holds the message in memory after processing, so we can easily assemble the data in a columnar format and send it to our storage service without any costly serialisation. Our storage service is Arrow native: it takes those record batches and writes them to our intermediary files very efficiently.

Open data formats for the future: query your logs with anything

Once your Parquet files land in storage, they're yours to use however you like. The obvious path is the Axoflow Console, where you can query and rehydrate logs directly, but nothing locks you into it. You can point DuckDB at the dataset and run SQL against it, open it in a Python notebook with Pandas or Polars for your next audit, or hand it to an AI agent to run analytics on. Because the format is open and self-describing, the tooling is your choice — not ours.

That's also what makes this approach future-proof. Seven years from now there will be data analysis tools we can't even imagine today, and the safe bet is that they'll read Parquet — just as today's tools read formats defined a decade ago. Your archive doesn't age into a proprietary dead end, but stays readable as the ecosystem moves forward.

And it runs wherever you do. On-prem or in the cloud, on local disk, your own S3-compatible hardware, a cloud object store, or even NFS — the storage layer doesn't care. There's no proprietary index to back up, no database to keep alive, nothing to migrate when a vendor changes direction.

That's the whole point: no vendor lock-in, low and predictable storage costs, and query performance that holds up for years. You're in control of your storage, and you're in control of your logs.